Getting Started

Bulk CSV Downloads provides our entire data catalog as compressed CSV flat files, delivered through AWS S3. The full export is approximately 1 TB compressed, and S3 gives you high-throughput parallel downloads to efficiently backfill your systems with our data.

We will share access to the export using AWS Security Token Service (STS), which grants you a temporary, limited-privilege credential to access the files in S3.

Prerequisites

Please send us:

  • AWS Account ID

  • Confirmation you can use sts:AssumeRole

You must use an IAM user or role when downloading files. A root account will not work (this is a limitation of AWS).

We will then send you the following values, which you will use to access the data:

  • RoleArn

  • ExternalId

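First, install the AWS CLI.

In ~/.aws/config, add a gridstatus profile. A named profile lets the CLI refresh the temporary credentials automatically. source_profile names the profile that holds your own IAM user or role credentials; change default if yours live under a different profile.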

[profile gridstatus]
role_arn = <RoleArn>
external_id = <ExternalId>
source_profile = default

s3 =
  max_concurrent_requests = 64

max_concurrent_requests = 64 is a good starting point for bulk downloads. The AWS CLI default of 10 leaves throughput on the table.
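
If you provision machines automatically, the profile above can also be written from a script. The following is a minimal sketch, not an official tool: the helper names (render_profile, has_profile, append_profile) are hypothetical, and it assumes your config file follows the layout shown above.

```python
from pathlib import Path

PROFILE_NAME = "gridstatus"  # profile name used throughout this page


def render_profile(role_arn: str, external_id: str,
                   source_profile: str = "default",
                   max_concurrent_requests: int = 64) -> str:
    """Return the text of the gridstatus profile section."""
    return (
        f"[profile {PROFILE_NAME}]\n"
        f"role_arn = {role_arn}\n"
        f"external_id = {external_id}\n"
        f"source_profile = {source_profile}\n"
        "s3 =\n"
        f"  max_concurrent_requests = {max_concurrent_requests}\n"
    )


def has_profile(config_text: str) -> bool:
    """True if the config text already contains a gridstatus profile header."""
    header = f"[profile {PROFILE_NAME}]"
    return any(line.strip() == header for line in config_text.splitlines())


def append_profile(config_path: Path, role_arn: str, external_id: str) -> bool:
    """Append the profile unless it already exists. Returns True if written."""
    existing = config_path.read_text() if config_path.exists() else ""
    if has_profile(existing):
        return False  # never duplicate or overwrite an existing profile
    config_path.parent.mkdir(parents=True, exist_ok=True)
    # Separate the new section from existing content with one blank line.
    if not existing:
        prefix = ""
    elif existing.endswith("\n"):
        prefix = "\n"
    else:
        prefix = "\n\n"
    config_path.write_text(existing + prefix + render_profile(role_arn, external_id))
    return True
```

A call such as append_profile(Path.home() / ".aws" / "config", "<RoleArn>", "<ExternalId>") adds the section once and leaves the file untouched on later runs.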

Verify it works by listing the available datasets:

aws s3 ls s3://gs-catalog-csv/ --profile gridstatus

If you see dataset folders listed, your credentials are working. See Example Usage for more commands.
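
The same check can be sketched in Python with boto3, which reads the gridstatus profile from ~/.aws/config and assumes the role for you. This is an illustrative sketch: it assumes boto3 is installed and that each dataset is a top-level prefix in the gs-catalog-csv bucket, and the function names are hypothetical.

```python
from typing import Iterable

BUCKET = "gs-catalog-csv"   # bucket name referenced on this page
PROFILE = "gridstatus"      # named profile from ~/.aws/config


def top_level_folders(pages: Iterable[dict]) -> list[str]:
    """Collect top-level prefixes from list_objects_v2 response pages."""
    folders = []
    for page in pages:
        # With Delimiter="/", S3 reports "folders" under CommonPrefixes.
        for entry in page.get("CommonPrefixes", []):
            folders.append(entry["Prefix"].rstrip("/"))
    return folders


def list_datasets(bucket: str = BUCKET, profile: str = PROFILE) -> list[str]:
    """List dataset folders using the role configured in the named profile."""
    import boto3  # third-party; imported lazily so the helper above stays stdlib-only

    session = boto3.Session(profile_name=profile)
    paginator = session.client("s3").get_paginator("list_objects_v2")
    return top_level_folders(paginator.paginate(Bucket=bucket, Delimiter="/"))
```

Calling list_datasets() should return the same folder names that the CLI listing shows. An access-denied error usually points at the profile values or at the identity behind source_profile.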

Transfer to Google Cloud Storage

If your destination is Google Cloud Storage, Storage Transfer Service can sync the gs-catalog-csv bucket directly into a GCS bucket. It authenticates to AWS by assuming an IAM role through federated identity, so let us know you want to use this path and send us the Subject ID of your Google-managed service account. We'll add it to the role's trust policy, after which you can point a transfer job at s3://gs-catalog-csv using the RoleArn we provide.

If federated identity isn't a fit, rclone supports sts:AssumeRole with ExternalId (role_arn + role_external_id) and can sync S3 → GCS from a Compute Engine VM using the credentials we already provide.

Refresh schedule

The bucket is refreshed once per day around 06:00 UTC by an incremental export, typically finishing within an hour. Any daily partition with rows that were inserted or updated upstream since the previous run is rewritten in full, so historical files can change on any given day when corrections or late-arriving data flow through. To avoid pulling files mid-rewrite, schedule large aws s3 sync jobs outside the 06:00–07:00 UTC window.
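
A scheduled job can guard against the refresh window before it starts a sync. The sketch below assumes the 06:00 to 07:00 UTC window described above. The function names and the optional safety margin are illustrative, not part of any published tooling.

```python
from datetime import datetime, timedelta, timezone


def refresh_window(day_utc: datetime, margin: timedelta = timedelta(0)):
    """Return (start, end) of the refresh window on the given UTC day.

    The margin widens the window on both sides; keep it well under six
    hours so the window stays within a single UTC day.
    """
    start = day_utc.replace(hour=6, minute=0, second=0, microsecond=0) - margin
    end = day_utc.replace(hour=7, minute=0, second=0, microsecond=0) + margin
    return start, end


def in_refresh_window(now: datetime, margin: timedelta = timedelta(0)) -> bool:
    """True if `now` falls inside the daily refresh window."""
    if now.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    now_utc = now.astimezone(timezone.utc)
    start, end = refresh_window(now_utc, margin)
    return start <= now_utc < end


def seconds_until_safe(now: datetime, margin: timedelta = timedelta(0)) -> float:
    """Seconds to wait before starting a sync; 0.0 when outside the window."""
    if not in_refresh_window(now, margin):
        return 0.0
    now_utc = now.astimezone(timezone.utc)
    _, end = refresh_window(now_utc, margin)
    return (end - now_utc).total_seconds()
```

A wrapper script could sleep for seconds_until_safe(datetime.now(timezone.utc)) before invoking aws s3 sync. Because the export only typically finishes within the hour, a margin of a few minutes is a reasonable precaution.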

Other Options

  • Python with s3fs - Use S3FileSystem with assume_role_arn and assume_role_kwargs to download files.

  • Python with boto3 - Use RefreshableCredentials via STS AssumeRole to list and download objects with concurrent transfers.
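
As a rough illustration of the boto3 option, the sketch below wires STS AssumeRole into botocore's RefreshableCredentials so that long downloads survive credential expiry. It is one possible arrangement under stated assumptions, not a reference implementation. It assumes boto3 is installed, the helper names and the session name are hypothetical, and it sets botocore's private _credentials attribute, a common workaround rather than a public API.

```python
from datetime import datetime


def to_refreshable_metadata(assume_role_response: dict) -> dict:
    """Convert an STS AssumeRole response into RefreshableCredentials metadata."""
    creds = assume_role_response["Credentials"]
    expiration: datetime = creds["Expiration"]  # boto3 returns an aware datetime
    return {
        "access_key": creds["AccessKeyId"],
        "secret_key": creds["SecretAccessKey"],
        "token": creds["SessionToken"],
        "expiry_time": expiration.isoformat(),
    }


def make_session(role_arn: str, external_id: str,
                 session_name: str = "gridstatus-bulk"):
    """Build a boto3 session whose credentials re-assume the role on expiry."""
    import boto3  # third-party, imported lazily
    from botocore.credentials import RefreshableCredentials
    from botocore.session import get_session

    sts = boto3.client("sts")  # uses your own IAM user or role credentials

    def refresh() -> dict:
        response = sts.assume_role(
            RoleArn=role_arn,
            RoleSessionName=session_name,
            ExternalId=external_id,
        )
        return to_refreshable_metadata(response)

    credentials = RefreshableCredentials.create_from_metadata(
        metadata=refresh(),
        refresh_using=refresh,
        method="sts-assume-role",
    )
    botocore_session = get_session()
    botocore_session._credentials = credentials  # private attribute (assumption)
    return boto3.Session(botocore_session=botocore_session)


def download_key(session, bucket: str, key: str, dest: str,
                 max_concurrency: int = 16) -> None:
    """Download one object using concurrent ranged requests."""
    from boto3.s3.transfer import TransferConfig

    config = TransferConfig(max_concurrency=max_concurrency)
    session.client("s3").download_file(bucket, key, dest, Config=config)
```

With the values we send you, make_session("<RoleArn>", "<ExternalId>") returns a session that can be passed to download_key for each object you want. The concurrency value of 16 is only a starting point to tune.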
