# Best Practices

## Query Optimization

{% stepper %}
{% step %}

### Always Use Time Filters

Unbounded queries can be slow and expensive. Always specify `start_time` and `end_time` when querying time-series data.

{% tabs %}
{% tab title="Bad" %}

```shell
# Queries entire dataset - slow and expensive
curl "https://api.gridstatus.io/v1/datasets/ercot_lmp_by_settlement_point/query?api_key=YOUR_API_KEY"
```

{% endtab %}

{% tab title="Good" %}

```shell
# Bounded time range with location filter - fast and efficient
curl "https://api.gridstatus.io/v1/datasets/ercot_lmp_by_settlement_point/query?\
start_time=2026-01-01&\
end_time=2026-01-02&\
filter_column=location&\
filter_value=HB_HOUSTON&\
api_key=YOUR_API_KEY"
```

{% endtab %}
{% endtabs %}
{% endstep %}

{% step %}

### Filter by Subseries

Most LMP and load datasets contain data for many locations. Filter to the specific locations you need.

Why this matters: LMP datasets like `pjm_lmp_real_time_5_min` contain 10,000+ pricing nodes. Querying all nodes for a single day can return millions of rows. If you only need hub prices for trading decisions, filter to those specific hubs.
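
As a rough back-of-the-envelope check, the row count of an unfiltered query scales with the number of locations times the number of intervals. The sketch below uses the figures from the paragraph above (10,000 nodes, 5-minute intervals). `estimate_rows` is a hypothetical helper for illustration, not part of the client.

```python
def estimate_rows(n_locations: int, interval_minutes: int, days: float = 1.0) -> int:
    """Estimate rows returned, assuming one row per location per interval."""
    intervals_per_day = (24 * 60) // interval_minutes
    return int(n_locations * intervals_per_day * days)

all_nodes = estimate_rows(n_locations=10_000, interval_minutes=5)
one_hub = estimate_rows(n_locations=1, interval_minutes=5)

print(f"All nodes, one day: {all_nodes:,} rows")  # 2,880,000
print(f"One hub, one day:   {one_hub:,} rows")    # 288
```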

{% tabs %}
{% tab title="Bad" %}

```shell
# Returns data for all 10,000+ nodes
curl "https://api.gridstatus.io/v1/datasets/pjm_lmp_real_time_5_min/query?api_key=YOUR_API_KEY"
```

{% endtab %}

{% tab title="Good" %}

```shell
# Returns data for one specific hub
curl "https://api.gridstatus.io/v1/datasets/pjm_lmp_real_time_5_min/query?\
start_time=2026-01-01&\
end_time=2026-01-02&\
filter_column=location&\
filter_value=WESTERN%20HUB&\
api_key=YOUR_API_KEY"
```

{% endtab %}
{% endtabs %}
{% endstep %}

{% step %}

### Select Only Needed Columns

Reduce response size by requesting only the columns you need.

{% tabs %}
{% tab title="Shell" %}

```shell
# Only get timestamp, location, and LMP
curl "https://api.gridstatus.io/v1/datasets/caiso_lmp_real_time_5_min/query?\
start_time=2026-01-01&\
end_time=2026-01-02&\
columns=interval_start_utc,location,lmp&\
filter_column=location&\
filter_value=TH_SP15_GEN-APND&\
api_key=YOUR_API_KEY"
```

{% endtab %}

{% tab title="Python" %}

```python
# Only get timestamp, location, and LMP - skip energy, congestion, loss components
df = client.get_dataset(
    "caiso_lmp_real_time_5_min",
    start="2026-01-01",
    end="2026-01-02",
    columns=["interval_start_utc", "location", "lmp"],
    filter_column="location",
    filter_value="TH_SP15_GEN-APND"
)

# Response is ~60% smaller than without column selection
```

{% endtab %}
{% endtabs %}
{% endstep %}

{% step %}

### Use Appropriate Limits

Set explicit limits to control data volume and costs.

{% tabs %}
{% tab title="Python" %}

```python
# Always set a limit during development
data = client.get_dataset(
    "ercot_fuel_mix",
    start="2026-01-01",
    end="2026-01-02",
    limit=1000  # Prevent accidental large queries
)
```

{% endtab %}
{% endtabs %}
{% endstep %}

{% step %}

### Resample Locally

If a query involving resampling is timing out on the API, download the raw data and perform resampling locally.
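
A minimal sketch of local resampling with pandas, using synthetic 5-minute data in place of an API response. The column names (`interval_start_utc`, `lmp`) follow the examples above; the values are made up for illustration, and the data is assumed to cover a single location.

```python
import pandas as pd

# Synthetic stand-in for raw 5-minute data returned by the API
raw = pd.DataFrame({
    "interval_start_utc": pd.date_range("2026-01-01", periods=24, freq="5min", tz="UTC"),
    "lmp": [float(i) for i in range(24)],
})

# Resample to hourly averages locally instead of asking the API to do it
hourly = (
    raw.set_index("interval_start_utc")["lmp"]
    .resample("60min")
    .mean()
    .reset_index()
)

print(hourly)
```

For data covering several locations, group by the location column before resampling so that values from different locations are not averaged together.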
{% endstep %}
{% endstepper %}

## Pagination Strategies

### Use Cursor Pagination for Large Datasets

Cursor-based pagination is more efficient than offset-based pagination for large result sets, and the client handles it automatically. For more, see the [Pagination documentation](https://docs.gridstatus.io/developers/concepts/pagination).

{% tabs %}
{% tab title="Python" %}

```python
# The client handles pagination automatically - no manual cursor management needed
df = client.get_dataset(
    "ercot_fuel_mix",
    start="2026-01-01",
    end="2026-01-02"
)

# All pages are fetched and combined into a single DataFrame
print(f"Retrieved {len(df)} total rows")
```

{% endtab %}
{% endtabs %}

### Batch Large Date Ranges

For queries spanning months or years, batch into smaller chunks.

{% tabs %}
{% tab title="Python" %}

```python
import pandas as pd
from datetime import datetime, timedelta

def fetch_in_batches(
    client,
    dataset: str,
    start: datetime,
    end: datetime,
    batch_days: int = 7,
    **kwargs
) -> pd.DataFrame:
    """Fetch data in batches to avoid timeouts."""
    all_data = []
    current = start

    while current < end:
        batch_end = min(current + timedelta(days=batch_days), end)

        print(f"Fetching {current.date()} to {batch_end.date()}...")

        data = client.get_dataset(
            dataset,
            start=current.isoformat(),
            end=batch_end.isoformat(),
            **kwargs
        )

        if len(data) > 0:
            all_data.append(data)

        current = batch_end

    return pd.concat(all_data, ignore_index=True) if all_data else pd.DataFrame()

# Usage
df = fetch_in_batches(
    client,
    "ercot_fuel_mix",
    datetime(2026, 1, 1),
    datetime(2026, 2, 1),
    batch_days=7
)
```

{% endtab %}
{% endtabs %}

## Caching Strategies

Caching is essential for energy market applications where you repeatedly query the same historical data for backtesting, model training, or report generation.

### Cache Static Data

Dataset metadata and column values change infrequently. Cache them to reduce API calls.

Use case: When building a location selector dropdown, cache the list of available pricing nodes rather than fetching on every page load.

{% tabs %}
{% tab title="Python" %}

```python
import json
from pathlib import Path
from datetime import datetime, timedelta, timezone

class MetadataCache:
    def __init__(self, cache_dir: str = ".gs_cache"):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)
        self.ttl = timedelta(hours=24)

    def _cache_path(self, key: str) -> Path:
        return self.cache_dir / f"{key}.json"

    def get(self, key: str):
        path = self._cache_path(key)
        if not path.exists():
            return None

        with open(path) as f:
            cached = json.load(f)

        cached_at = datetime.fromisoformat(cached['cached_at'])
        if datetime.now(tz=timezone.utc) - cached_at > self.ttl:
            return None

        return cached['data']

    def set(self, key: str, data):
        path = self._cache_path(key)
        with open(path, 'w') as f:
            json.dump({
                'cached_at': datetime.now(tz=timezone.utc).isoformat(),
                'data': data
            }, f)

# Usage
cache = MetadataCache()

def get_dataset_metadata_cached(dataset_id: str):
    cached = cache.get(f"metadata_{dataset_id}")
    # Compare against None so an empty (falsy) cached value still counts as a hit
    if cached is not None:
        return cached

    metadata = client.get(
        f"{client.host}/datasets/{dataset_id}",
        return_raw_response_json=True
    )
    cache.set(f"metadata_{dataset_id}", metadata)
    return metadata
```

{% endtab %}
{% endtabs %}

### Cache Frequently Accessed Data

For data that's queried repeatedly, implement a local cache.

Use case: When backtesting a trading strategy across multiple parameter combinations, cache the underlying price data locally so you're not re-fetching the same historical data for each backtest run.

{% tabs %}
{% tab title="Python" %}

```python
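import hashlib
import json
import pandas as pd
from pathlib import Path

def get_or_fetch(
    client,
    dataset: str,
    start: str,
    end: str,
    cache_dir: str = ".data_cache",
    **kwargs
) -> pd.DataFrame:
    """Fetch data, caching locally to avoid repeated API calls."""
    cache_path = Path(cache_dir)
    cache_path.mkdir(parents=True, exist_ok=True)

    # Build a cache key that is stable across runs. The built-in hash() is
    # randomized per process for strings and fails on list values such as
    # columns=[...], so hash a sorted JSON dump of the parameters instead.
    params = json.dumps(
        {"dataset": dataset, "start": start, "end": end, **kwargs},
        sort_keys=True,
        default=str,
    )
    cache_key = hashlib.sha256(params.encode()).hexdigest()[:16]
    cache_file = cache_path / f"{dataset}_{cache_key}.csv"

    if cache_file.exists():
        print(f"Loading from cache: {cache_file}")
        # Note: CSV does not preserve dtypes, so timestamp columns load as strings
        return pd.read_csv(cache_file)

    print("Fetching from API...")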
    data = client.get_dataset(dataset, start=start, end=end, **kwargs)

    if len(data) > 0:
        data.to_csv(cache_file, index=False)

    return data

# Usage - first call fetches from API
df = get_or_fetch(client, "ercot_fuel_mix", "2026-01-01", "2026-01-02")

# Second call loads from cache
df = get_or_fetch(client, "ercot_fuel_mix", "2026-01-01", "2026-01-02")
```

{% endtab %}
{% endtabs %}

## Error Handling

### Implement Retry Logic

Handle transient errors gracefully with exponential backoff.

{% tabs %}
{% tab title="Python" %}

```python
import gridstatusio as gs

# The client handles retries automatically with exponential backoff

# Configure retry behavior when creating the client:
retry_client = gs.GridStatusClient(
    api_key="YOUR_API_KEY",
    max_retries=5,      # Retry up to 5 times
    base_delay=2.0,     # Start with 2 second delay
    exponential_base=2  # Double delay each retry
)

# All queries will automatically retry on transient errors (429, 500, 502, 503, 504)
df = retry_client.get_dataset("ercot_fuel_mix", start="2026-01-01", end="2026-01-02")
```

{% endtab %}
{% endtabs %}
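
To illustrate what exponential backoff does, here is a minimal, client-independent sketch. `retry_with_backoff` and `TransientError` are hypothetical names, and the delay formula (`base_delay * exponential_base ** attempt`) is an assumption about a typical schedule, not a statement of how the client computes its delays.

```python
import time

class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure (e.g. HTTP 429 or 503)."""

def retry_with_backoff(func, max_retries=5, base_delay=2.0, exponential_base=2, sleep=time.sleep):
    """Call func(), retrying on TransientError with exponentially growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_retries:
                raise  # Out of retries: surface the error to the caller
            delay = base_delay * exponential_base ** attempt
            sleep(delay)

# Demo: fail twice, then succeed. Delays are recorded instead of slept.
calls = {"n": 0}
delays = []

def flaky():
    calls["n"] += 1
    if calls["n"] <= 2:
        raise TransientError("simulated 503")
    return "ok"

result = retry_with_backoff(flaky, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the demo instant and makes the delay schedule easy to inspect. In real use the default `time.sleep` applies.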

### Monitor API Usage

Check usage before large operations to avoid hitting limits.

{% tabs %}
{% tab title="Python" %}

```python
def safe_query(client, dataset: str, expected_rows: int, **kwargs):
    """Check quota before querying."""
    usage = client.get_api_usage()

    rows_used = usage['current_period_usage']['total_api_rows_returned']
    rows_limit = usage['limits']['api_rows_returned_limit']
    rows_remaining = rows_limit - rows_used

    if rows_remaining < expected_rows:
        raise RuntimeError(
            f"Insufficient quota: need ~{expected_rows:,}, "
            f"have {rows_remaining:,}"
        )

    return client.get_dataset(dataset, **kwargs)

# Usage
try:
    # Estimate: 288 5-min intervals/day * 1 location = ~288 rows
    df = safe_query(
        client,
        "ercot_fuel_mix",
        expected_rows=500,
        start="2026-01-01",
        end="2026-01-02"
    )
except Exception as e:
    print(f"Cannot proceed: {e}")
```

{% endtab %}
{% endtabs %}

## Data Quality

Data quality checks are critical for energy trading and operational applications where decisions are time-sensitive.

### Verify Data Freshness

Check that data is current before using it in production.

Use case: Before executing trades based on current prices, verify the data is within an acceptable age threshold. Stale data could lead to trading on outdated price signals.

{% tabs %}
{% tab title="Python" %}

```python
from datetime import datetime, timezone, timedelta

def check_data_freshness(dataset_id: str, max_age_minutes: int = 30):
    """Check if dataset data is fresh enough."""
    metadata = client.get(
        f"{client.host}/datasets/{dataset_id}",
        return_raw_response_json=True
    )

    latest = datetime.fromisoformat(
        metadata['latest_available_time_utc'].replace('Z', '+00:00')
    )
    now = datetime.now(timezone.utc)
    age = now - latest

    if age > timedelta(minutes=max_age_minutes):
        print(f"Warning: Data is {age.total_seconds()/60:.0f} minutes old")
        return False

    return True

# Usage
if check_data_freshness("ercot_fuel_mix"):
    print("Data is fresh - proceeding with analysis")
else:
    print("Data may be stale - check for issues")
```

{% endtab %}
{% endtabs %}

### Validate Query Results

Verify that returned data meets expectations.

{% tabs %}
{% tab title="Python" %}

```python
import pandas as pd

def validate_results(df: pd.DataFrame, expected_cols: list, time_col: str = None):
    """Basic validation of query results."""
    errors = []

    # Check for expected columns
    missing_cols = set(expected_cols) - set(df.columns)
    if missing_cols:
        errors.append(f"Missing columns: {missing_cols}")

    # Check for empty result
    if len(df) == 0:
        errors.append("No data returned")

    # Check for time continuity (if time column specified)
    if time_col and time_col in df.columns:
        # Work on a sorted copy so the caller's DataFrame is not modified.
        # For multi-location data, run this check per location.
        times = pd.to_datetime(df[time_col]).sort_values()
        gaps = times.diff().dropna()

        # Check for unexpected gaps (>2x median interval)
        median_gap = gaps.median()
        large_gaps = gaps[gaps > median_gap * 2]

        if len(large_gaps) > 0:
            errors.append(f"Found {len(large_gaps)} time gaps")

    if errors:
        print("Validation errors:", errors)
        return False

    return True

# Usage
df = client.get_dataset("ercot_fuel_mix", start="2026-01-01", end="2026-01-02")

if validate_results(df, ['interval_start_utc', 'solar', 'wind'], 'interval_start_utc'):
    print("Data validated successfully")
```

{% endtab %}
{% endtabs %}
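
To see the gap heuristic in isolation, the sketch below applies the same "2x median interval" rule to synthetic timestamps with one missing stretch. The data is made up for illustration.

```python
import pandas as pd

# Synthetic 5-minute timestamps with 3 intervals removed to create a gap
times = pd.Series(pd.date_range("2026-01-01", periods=12, freq="5min", tz="UTC"))
times = times.drop(index=[5, 6, 7]).reset_index(drop=True)

gaps = times.diff().dropna()
median_gap = gaps.median()
large_gaps = gaps[gaps > median_gap * 2]

print(f"Median interval: {median_gap}")
print(f"Gaps larger than 2x median: {len(large_gaps)}")
```

Because the threshold is relative to the median, the same check works for 5-minute, 15-minute, or hourly datasets without a hard-coded interval.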

## Performance Summary

| Practice             | Impact | Implementation                                                |
| -------------------- | ------ | ------------------------------------------------------------- |
| Use time filters     | High   | Always set `start_time` and `end_time`                        |
| Filter by location   | High   | Use `filter_column`/`filter_value`                            |
| Resample locally     | High   | Fetch raw data and resample locally if the API times out      |
| Batch large requests | High   | Split into smaller chunks                                     |
| Retry on errors      | High   | Implement exponential backoff (included in the client)        |
| Select columns       | Medium | Use `columns` parameter                                       |
| Cursor pagination    | Medium | Use `cursor` instead of `page`                                |
| Cache data           | Medium | Store frequently accessed data locally                        |
| Monitor usage        | Medium | Check quota before large queries                              |

## Related Documentation

* [Advanced Query Features](file:///docs/api/advanced-query-features) - Filtering, resampling, timezone
* [Error Handling](file:///docs/api/error-handling) - Handle errors gracefully
* [Utility Endpoints](file:///docs/api/utility-endpoints) - API usage, metadata, column values


