Folder Structure

Data is organized in a four-level hierarchy: dataset → year → month → daily files. A per-dataset manifest, {dataset_id}_manifest.json, sits at the root of each dataset folder.

s3://gs-catalog-csv/
├── aeso_daily_average_pool_price/
│   ├── aeso_daily_average_pool_price_manifest.json
│   ├── year=2000/
│   │   ├── month=01/
│   │   │   ├── 2000-01-01.csv.gz
│   │   │   ├── 2000-01-02.csv.gz
│   │   │   └── ...
│   │   ├── month=02/
│   │   └── ...
│   ├── year=2001/
│   └── ...
├── caiso_lmp_day_ahead_hourly/
│   ├── caiso_lmp_day_ahead_hourly_manifest.json
│   ├── year=2019/
│   └── ...
└── ... (hundreds of datasets)

  • Each dataset is a top-level folder.

  • {dataset_id}_manifest.json describes the dataset: column names and types, primary key, time columns, and an export_watermark block recording the last successful export. See Manifest example below.

  • Data is partitioned by year=YYYY/month=MM/.

  • Individual files are gzipped CSVs named by date: YYYY-MM-DD.csv.gz.

    • For datasets with a publish_time_column, the date in the filename is the publish date of the data.

    • For datasets without a publish_time_column, the date in the filename is the date of the rows' time_index_column values.

  • File sizes vary widely by dataset. Across the catalog, gzipped CSVs average around 1 MB; the largest individual files (typically high-resolution LMP datasets) approach 200 MB, and the smallest (header-only or sparse days) are a few hundred bytes.

  • The bucket is refreshed daily by an incremental export. Any daily partition containing rows that were inserted or updated upstream since the previous run is rewritten in full, so corrections and late-arriving data flow through, not just newly published rows.
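
As a rough illustration of the layout rules above, the following Python sketch builds and parses object keys. The function names are hypothetical helpers, not part of any published client; only the bucket name and the path pattern are taken from the tree above.

```python
import re
from datetime import date

# Bucket name as shown in the folder tree above.
BUCKET = "gs-catalog-csv"

# Pattern for a daily file key: {dataset}/year=YYYY/month=MM/YYYY-MM-DD.csv.gz
_DAILY_KEY_RE = re.compile(
    r"^(?P<dataset>[^/]+)/year=(?P<year>\d{4})/month=(?P<month>\d{2})/"
    r"(?P<day>\d{4}-\d{2}-\d{2})\.csv\.gz$"
)


def manifest_key(dataset_id: str) -> str:
    """Key of the per-dataset manifest at the root of the dataset folder."""
    return f"{dataset_id}/{dataset_id}_manifest.json"


def daily_file_key(dataset_id: str, day: date) -> str:
    """Key of the gzipped CSV for one dataset and one calendar date."""
    return (
        f"{dataset_id}/year={day.year:04d}/month={day.month:02d}/"
        f"{day.isoformat()}.csv.gz"
    )


def parse_daily_file_key(key: str) -> tuple[str, date]:
    """Split a daily file key into (dataset_id, date); raise on a mismatch."""
    match = _DAILY_KEY_RE.match(key)
    if match is None:
        raise ValueError(f"not a daily file key: {key!r}")
    day = date.fromisoformat(match["day"])
    # The partition folders should agree with the date in the filename.
    if (day.year, day.month) != (int(match["year"]), int(match["month"])):
        raise ValueError(f"partition does not match filename date: {key!r}")
    return match["dataset"], day
```

Because a changed partition is rewritten in full, a consumer that mirrors the bucket can treat each key as a unit and replace its local copy of that day wholesale rather than merging rows.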

Manifest example

Each dataset folder ships a {dataset_id}_manifest.json describing the schema and the most recent export. Real manifest from aeso_daily_average_pool_price (with the verbose per-day runs history elided for brevity):

Key fields:

  • columns — full schema with name and type for every column in the CSV. Names are lower-case and match the CSV headers exactly.

  • primary_key_columns — uniquely identify a row.

  • time_index_column / publish_time_column — the column that drives the per-file date partitioning: publish_time_column when the dataset has one, otherwise time_index_column.

  • export_watermark.sync_cursor_at — every upstream change with a sync timestamp at or before this value is reflected in the exported files. Compare two days' manifests to detect what changed.

  • runs — append-only log of every export run that wrote to the dataset. Each entry records the date list (succeeded and failed), the watermark window used, and the rollup totals.
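
The fields above could be consumed along these lines. This is a minimal sketch under several assumptions that the page does not confirm: that the listed fields are top-level keys of the manifest JSON, that a dataset without a publish time either omits publish_time_column or sets it to null, and that sync_cursor_at is an ISO 8601 timestamp string. The helper names are hypothetical.

```python
from datetime import datetime
from typing import Any, Mapping


def partition_date_column(manifest: Mapping[str, Any]) -> str:
    """Return the column whose date names the daily files.

    Assumption: publish_time_column is absent or null when the dataset
    has no publish time, in which case time_index_column is used.
    """
    return manifest.get("publish_time_column") or manifest["time_index_column"]


def sync_cursor(manifest: Mapping[str, Any]) -> datetime:
    """Parse export_watermark.sync_cursor_at.

    Assumption: the value is an ISO 8601 string; a trailing "Z" is mapped
    to "+00:00" so datetime.fromisoformat accepts it on older Pythons.
    """
    raw = manifest["export_watermark"]["sync_cursor_at"]
    return datetime.fromisoformat(raw.replace("Z", "+00:00"))


def cursor_advanced(older: Mapping[str, Any], newer: Mapping[str, Any]) -> bool:
    """True if the newer manifest reflects upstream changes past the older one."""
    return sync_cursor(newer) > sync_cursor(older)
```

An advanced cursor only indicates that a later export ran; the runs log is the place to look for which dates that export actually rewrote.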
