"Torture the data, and it will confess to anything." - Ronald Coase
One file format, one query engine, and a benchmark that will change how you work locally. Here's what Parquet and DuckDB do and why they matter for anyone doing serious work on large energy datasets.
The Problem with CSV at Scale
For my energy analytics work, CSVs are the default for a reason: they're human-readable, easy to generate, and just about every tool can read them. The problem arrives at scale, when you need to analyze that data locally, without cloud infrastructure.
Parquet: 34.9× smaller, faster to query
Parquet is a columnar file format. A CSV stores data row by row - every column for record 1, then every column for record 2, across the full width of the table. When your query touches three columns out of 193, the entire file gets read regardless.
Parquet stores data column by column. Output Schedule together. Resource Type together. Telemetered Net Output together. A query that needs three columns reads those three and skips the rest entirely.
Here's what that looks like on a real file - a single day's SCED Gen Resource disclosure from ERCOT. 319,000 rows. 193 columns. One day.
| Format | Size | Queryable Without Full Load? |
|---|---|---|
| CSV | 262.6 MB | No |
| CSV.GZ | ~18 MB | No - decompress first |
| Parquet | 7.5 MB | Yes |
34.9× smaller than raw CSV - and queryable as well! CSV.GZ gets you compression but pandas still decompresses it entirely into memory before doing anything with it. You traded storage for nothing analytically. Parquet gives you both simultaneously.
Converting to Parquet takes only two lines:
import pandas as pd
df = pd.read_csv("sced_genresource.csv")
df.to_parquet("sced_genresource.parquet", index=False, engine="pyarrow")
You only need to do this step once. The Parquet becomes your working dataset, and the original CSV becomes the archive.
DuckDB: SQL directly on your files
DuckDB is an in-process analytical database - runs inside your Python environment, queries Parquet files directly from disk, and needs nothing else to get started.
pip install duckdb pyarrow
import duckdb
result = duckdb.query("""
SELECT
"Resource Type",
ROUND(AVG("Output Schedule"), 2) AS avg_dispatched_mw,
ROUND(AVG("HSL"), 2) AS avg_hsl_mw,
COUNT(*) AS intervals
FROM 'sced_genresource.parquet'
GROUP BY "Resource Type"
ORDER BY avg_dispatched_mw DESC
""").df()
Notice there is no read_parquet() call. DuckDB reaches directly into the file, reads the columns it needs, executes the logic, and returns a result. The file itself is never loaded into memory.
The Benchmark
Real energy market data from ERCOT - SCED Gen Resource file from a day in 2025. The ERCOT API provides this file in CSV format. 262.6 MB CSV, 319,000 rows, 193 columns. One day.
import time, pandas as pd, duckdb
# CSV - full file load
start = time.time()
df_csv = pd.read_csv("sced_genresource.csv")
result = df_csv.groupby("Resource Type")["Output Schedule"].mean()
csv_time = time.time() - start
print(f"CSV (pandas): {csv_time:.2f}s")
# CSV.GZ - decompress then load
start = time.time()
df_gz = pd.read_csv("sced_genresource.csv.gz", compression="gzip")
result = df_gz.groupby("Resource Type")["Output Schedule"].mean()
gz_time = time.time() - start
print(f"CSV.GZ (pandas): {gz_time:.2f}s")
# Parquet - DuckDB, no load step
start = time.time()
result = duckdb.query("""
SELECT "Resource Type",
ROUND(AVG("Output Schedule"), 2) AS avg_output_mw
FROM 'sced_genresource.parquet'
GROUP BY "Resource Type"
ORDER BY avg_output_mw DESC
""").df()
parquet_time = time.time() - start
print(f"Parquet (DuckDB): {parquet_time:.2f}s")
CSV (pandas): 3.76s
CSV.GZ (pandas): 3.81s
Parquet (DuckDB): 0.82s
DuckDB vs CSV: 4.6x faster
DuckDB vs CSV.GZ: 4.7x faster
CSV.GZ is nearly identical to raw CSV. Decompression plus full memory load still happen before a single aggregation runs. You paid the storage cost and got nothing back at query time.
This is one day. Scale to a month of SCED data and the gap becomes the difference between an overnight process and a query that returns results within seconds.
Possibilities with Parquet and DuckDB
The benchmarking above just scratches the surface of what you can do with Parquet and DuckDB. What matters more is the class of questions that opens up for you as an energy analyst when you stop fighting the tools and start using them.
Take a simple one: what did the ERCOT grid actually look like, interval by interval, on a given day - broken out by fuel type across every 5-minute SCED dispatch?
That question requires joining two separate SCED disclosure files - gen resources and energy storage - aggregating telemetered output by fuel category across 290 intervals, and plotting the result. In the old workflow that's a multi-step pandas process involving two full CSV loads, a merge, and enough memory to hold both in RAM simultaneously. With DuckDB querying Parquet it's a single CTE that aggregates each file first and joins the 290-row summaries - neither file loaded into memory.
I ran both approaches 10 times to get an honest average:
| Approach | Average Time | Std Dev | Speedup | Notes |
|---|---|---|---|---|
| Pandas (CSV) | 4.48s | 0.13s | 1× | Full file load for every run |
| DuckDB (Parquet) | 0.073s | 0.098s | 61.6× | Loads only required columns |
The DuckDB variance is worth noting - run 5 hit 0.335s against a consistent ~0.043s for the other nine, likely an OS scheduling event.
Here's what that query returns when you plot it - December 10, 2025, every 5-minute SCED interval across the full day:
This was a normal day in Texas: wind dominated overnight at 26,000 MW; solar ramped hard in the early morning, typical of a winter day; nuclear didn't move - 5,100 MW all day. All normal, and all easy to query and plot.
There are a ton of small queries you can run locally with this method - bid curve analysis, ancillary awards, and so on - without needing cloud infrastructure. A lot of data analytics tooling comes down to personal preference, of course, but this is a great way to get started with big data analytics.
When to use what
Pandas - small datasets, quick transformations, data that fits comfortably in memory.
DuckDB - multi-year price history, ERCOT disclosure datasets, any workflow where you are currently waiting on a CSV load before asking an interesting question. For energy analysts, that middle ground covers most of the real work.
A relational database - production systems, concurrent users, transactional writes at scale.
The signal to switch: if you have ever found yourself waiting on pandas, you are past the threshold where this pays for itself.