This section covers performance tuning and cost efficiency.
- Bronze Layer: Native formats (JSON, CSV, PDF, JPG). Optimized for write speed and fidelity to source.
- Silver/Gold Layers: Columnar formats (Parquet / ORC).
- Why: Columnar storage improves I/O efficiency and compression for analytical queries. Parquet lets a query read only the columns it needs (e.g., "grade") instead of scanning the entire file.
We lay out files in date-based partition directories so that query engines can apply "Partition Pruning".
- Structure:
bucket/domain/year=YYYY/month=MM/day=DD/file.parquet
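- Benefit: Queries that filter by date read only the matching partitions and skip the rest. For example, a query for a single day against one year of data touches roughly 1/365 of the files, which reduces I/O and speeds up the query.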
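To make the pruning idea concrete, the sketch below builds the partition prefixes that a date-range query would need to read under the layout shown above. The helper names `partition_prefix` and `prefixes_for_range` are hypothetical, and `bucket` and `domain` are the placeholders from the path template, not real names. Real query engines perform this step internally; the code only illustrates the principle.

```python
from datetime import date, timedelta


def partition_prefix(bucket: str, domain: str, day: date) -> str:
    """Return the directory prefix for one day, following the layout above."""
    return f"{bucket}/{domain}/year={day:%Y}/month={day:%m}/day={day:%d}/"


def prefixes_for_range(bucket: str, domain: str, start: date, end: date) -> list[str]:
    """List the prefixes covering start..end inclusive; all others can be skipped."""
    n_days = (end - start).days + 1
    return [
        partition_prefix(bucket, domain, start + timedelta(days=i))
        for i in range(n_days)
    ]


# A three-day query only needs three prefixes, however much history exists.
prefixes = prefixes_for_range("bucket", "domain", date(2024, 2, 28), date(2024, 3, 1))
for p in prefixes:
    print(p)
```

Because the filter values are encoded in the directory names, the set of files to read can be decided from paths alone, without opening any file.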
- Hot: Recent data (last 30 days) on SSD/Standard S3.
- Cold: Archived data (older than 1 year) moved to cheaper storage (Glacier-equivalent) via lifecycle policies.
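The tiering rules above can be sketched as a small classifier plus a lifecycle rule. Several things here are assumptions: the text only says "Glacier-equivalent", so the rule is written in the shape of an Amazon S3 lifecycle configuration with the `GLACIER` storage class as one possible target; the rule ID and the `domain/` prefix are hypothetical; and because the text does not assign a tier to data between 30 days and 1 year old, the function returns `None` for that range rather than guessing.

```python
from datetime import date
from typing import Optional

HOT_MAX_AGE_DAYS = 30    # "last 30 days" from the text
COLD_MIN_AGE_DAYS = 365  # "older than 1 year", approximated as 365 days


def storage_tier(partition_day: date, today: date) -> Optional[str]:
    """Classify a partition by age; None where the text defines no tier."""
    age_days = (today - partition_day).days
    if age_days <= HOT_MAX_AGE_DAYS:
        return "hot"
    if age_days > COLD_MIN_AGE_DAYS:
        return "cold"
    return None


# Lifecycle rule in the S3 configuration shape (assumed target platform).
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-after-one-year",   # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": "domain/"},  # hypothetical prefix
            "Transitions": [
                {"Days": COLD_MIN_AGE_DAYS, "StorageClass": "GLACIER"}
            ],
        }
    ]
}

print(storage_tier(date(2024, 6, 1), today=date(2024, 6, 15)))
print(storage_tier(date(2022, 1, 1), today=date(2024, 6, 15)))
```

Expressing the transition as a lifecycle policy means the storage service moves objects on its own schedule, so no pipeline job has to copy or delete files.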