# Folder Structure

{% hint style="warning" %}
Bulk CSV Downloads is in **early beta**. We'd love your feedback — please reach out with any questions or issues.
{% endhint %}

Data is organized in a four-level hierarchy: **dataset → year → month → daily files**. A per-dataset manifest, `{dataset_id}_manifest.json`, sits at the root of each dataset folder.

```
s3://gs-catalog-csv/
├── aeso_daily_average_pool_price/
│   ├── aeso_daily_average_pool_price_manifest.json
│   ├── year=2000/
│   │   ├── month=01/
│   │   │   ├── 2000-01-01.csv.gz
│   │   │   ├── 2000-01-02.csv.gz
│   │   │   └── ...
│   │   ├── month=02/
│   │   └── ...
│   ├── year=2001/
│   └── ...
├── caiso_lmp_day_ahead_hourly/
│   ├── caiso_lmp_day_ahead_hourly_manifest.json
│   ├── year=2019/
│   └── ...
└── ... (hundreds of datasets)
```

* Each dataset is a top-level folder.
* `{dataset_id}_manifest.json` describes the dataset: column names and types, primary key, time columns, and an `export_watermark` block recording the last successful export. See [Manifest example](#manifest-example) below.
* Data is partitioned by `year=YYYY/month=MM/`.
* Individual files are gzipped CSVs named by date: `YYYY-MM-DD.csv.gz`.
  * For datasets with a `publish_time_column`, the date in the filename is the publish date of the data.
  * For datasets without a `publish_time_column`, the date in the filename comes from the `time_index_column` instead.
* File sizes vary widely by dataset. Across the catalog, gzipped CSVs average around 1 MB; the largest individual files (typically high-resolution LMP datasets) approach 200 MB, and the smallest (header-only or sparse days) are a few hundred bytes.
* The bucket is refreshed daily by an incremental export. Any daily partition containing rows that were inserted **or updated** upstream since the previous run is rewritten in full — corrections and late-arriving data flow through, not just newly published rows.
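The layout rules above map a dataset ID and a date to an object key mechanically. The following is a minimal sketch of that mapping. The helper names `manifest_key` and `daily_file_key` are hypothetical and not part of any Grid Status library. The caller is assumed to pass the publish date or the time-index date, whichever applies to the dataset.

```python
from datetime import date

# Bucket name as shown in the tree above.
BUCKET = "gs-catalog-csv"


def manifest_key(dataset_id: str) -> str:
    """Key of the per-dataset manifest at the root of the dataset folder."""
    return f"{dataset_id}/{dataset_id}_manifest.json"


def daily_file_key(dataset_id: str, day: date) -> str:
    """Key of the gzipped CSV for one day, following year=YYYY/month=MM/."""
    return (
        f"{dataset_id}/year={day.year:04d}/month={day.month:02d}/"
        f"{day.isoformat()}.csv.gz"
    )


def s3_uri(key: str) -> str:
    """Full s3:// URI for a key in the bucket."""
    return f"s3://{BUCKET}/{key}"
```

For example, `s3_uri(daily_file_key("aeso_daily_average_pool_price", date(2000, 1, 2)))` yields the path of the second file shown in the tree.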

## Manifest example

Each dataset folder ships a `{dataset_id}_manifest.json` describing the schema and the most recent export. The example below is the real manifest for `aeso_daily_average_pool_price`, with the verbose per-run `runs` history elided for brevity:

```json
{
  "dataset_id": "aeso_daily_average_pool_price",
  "name": "AESO Daily Average Pool Price",
  "source": "aeso",
  "primary_key_columns": ["interval_start_utc"],
  "publish_time_column": null,
  "time_index_column": "interval_start_utc",
  "columns": [
    {"name": "interval_start_local", "type": "TIMESTAMP"},
    {"name": "interval_start_utc", "type": "TIMESTAMP"},
    {"name": "interval_end_local", "type": "TIMESTAMP"},
    {"name": "interval_end_utc", "type": "TIMESTAMP"},
    {"name": "daily_average", "type": "DOUBLE PRECISION"},
    {"name": "daily_on_peak_average", "type": "DOUBLE PRECISION"},
    {"name": "daily_off_peak_average", "type": "DOUBLE PRECISION"},
    {"name": "30_day_average", "type": "DOUBLE PRECISION"}
  ],
  "manifest_written_at": "2026-04-30T17:13:30.341687+00:00",
  "export_watermark": {
    "sync_cursor_at": "2026-04-30T17:13:11.942Z",
    "run_completed_at": "2026-04-30T17:13:30.341661+00:00",
    "rows_exported": 1
  },
  "runs": ["... (per-run export history elided)"]
}
```
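The `columns` array can drive type conversion when loading a daily file. The sketch below is one possible approach, not an official loader. The function name `read_daily_file` and the `CONVERTERS` mapping are hypothetical. The sketch also assumes that timestamps in the CSV are ISO 8601 strings that `datetime.fromisoformat` accepts, and that empty cells mean missing values. Neither assumption is confirmed by this page.

```python
import csv
import gzip
import io
from datetime import datetime

# Assumed mapping from manifest type names to Python converters. Only the two
# types that appear in the example manifest are covered; others stay strings.
CONVERTERS = {
    "TIMESTAMP": datetime.fromisoformat,  # assumes ISO 8601 text in the CSV
    "DOUBLE PRECISION": float,
}


def read_daily_file(gz_bytes: bytes, manifest: dict) -> list:
    """Decompress one daily .csv.gz and convert cells using the manifest schema."""
    types = {col["name"]: col["type"] for col in manifest["columns"]}
    text = gzip.decompress(gz_bytes).decode("utf-8")
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {}
        for name, value in raw.items():
            convert = CONVERTERS.get(types.get(name), str)
            # Assumption: an empty cell represents a missing value.
            row[name] = convert(value) if value != "" else None
        rows.append(row)
    return rows
```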

Key fields:

* `columns`: the full schema, with a name and type for every column in the CSV. Names are lower-case and match the CSV headers exactly.
* `primary_key_columns`: the columns that together uniquely identify a row.
* `time_index_column` / `publish_time_column`: the columns that drive the per-file date partitioning. When `publish_time_column` is non-null, its date determines which file a row lands in; otherwise the `time_index_column` date does.
* `export_watermark.sync_cursor_at`: every upstream change with a sync timestamp at or before this value is reflected in the exported files. Compare this value across two days' manifests to detect whether new changes were exported.
* `runs`: an append-only log of every export run that wrote to the dataset. Each entry records the date list (succeeded and failed), the watermark window used, and the rollup totals.
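The key fields above can be read programmatically from a parsed manifest. The following is a minimal sketch, assuming the manifest has already been loaded into a `dict` (for example with `json.load`). The function names are hypothetical. The rule that `publish_time_column` takes precedence follows the filename rules described under Folder Structure.

```python
from datetime import datetime


def partition_date_column(manifest: dict) -> str:
    """Return the column whose date decides which daily file a row lands in."""
    # publish_time_column wins when it is set; otherwise use time_index_column.
    return manifest.get("publish_time_column") or manifest["time_index_column"]


def parse_cursor(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def export_advanced(previous: dict, current: dict) -> bool:
    """True when the current manifest's sync cursor is later than the previous one's."""
    before = parse_cursor(previous["export_watermark"]["sync_cursor_at"])
    after = parse_cursor(current["export_watermark"]["sync_cursor_at"])
    return after > before
```

A consumer that keeps yesterday's manifest can call `export_advanced(yesterday, today)` and skip re-downloading the dataset when it returns `False`.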

