jason blahovec
open source · v0.4.0

statcast-bigquery

statcast-bigquery idempotently ingests MLB Statcast pitch-by-pitch data into BigQuery and ships the documentation a SQL or LLM agent needs to query it. Its headline feature is a round-trip integrity proof: it reconstructs each team's season win-loss record and run differential from the raw pitch rows and reconciles them against MLB's official standings, so you can prove no games are missing.

Python · BigQuery · pybaseball · CLI · repo ↗ · PyPI ↗
architecture
statcast_pitches: one row per pitch (the primary fact table)
_statcast_ingest_runs: append-only log of every sync chunk (status, rows, window), powering --resume
umpire crews · schedule · team-season standings: added across v0.2–v0.4 for verification + context
pybaseball pull → normalize → per-chunk DELETE-then-INSERT into BigQuery → verify reconstructed standings against MLB statsapi (±1 game / ±5 runs).
install & usage
pip install statcast-bigquery

gcloud auth application-default login
statcast-bigquery sync \
    --start 2024-04-01 --end 2024-10-31 \
    --table myproject.mydataset.statcast_pitches

# resumable multi-season backfill
statcast-bigquery sync --start 2015-04-01 --end 2026-05-11 \
    --chunk-by year --resume \
    --table myproject.mydataset.statcast_pitches
design decisions
Round-trip validation against MLB standings
The verifier rebuilds season W-L-run-diff purely from ingested pitch rows and reconciles against MLB statsapi standings to within ±1 game / ±5 runs — an end-to-end 'no games missing' proof rather than a row-count check.
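The reconstruction described above can be sketched in plain Python. This is a minimal illustration, not the project's verifier: it assumes each pitch row carries `game_pk`, `home_team`, `away_team`, `post_home_score`, and `post_away_score` (names modeled on Statcast columns), takes a game's final score to be the maximum post-pitch score seen, and applies the ±1 game tolerance to wins and losses separately. The function names are hypothetical, and filtering to regular-season games is omitted.

```python
from collections import defaultdict


def final_scores(pitch_rows):
    """Collapse pitch rows to one final score per game.

    Assumption: scores never decrease within a game, so the maximum
    post-pitch score seen for each side is the final score.
    """
    games = {}
    for row in pitch_rows:
        game = games.setdefault(
            row["game_pk"],
            {"home": row["home_team"], "away": row["away_team"],
             "home_runs": 0, "away_runs": 0},
        )
        game["home_runs"] = max(game["home_runs"], row["post_home_score"])
        game["away_runs"] = max(game["away_runs"], row["post_away_score"])
    return games


def reconstruct_standings(pitch_rows):
    """Rebuild wins, losses, and run differential per team from pitch rows."""
    standings = defaultdict(lambda: {"w": 0, "l": 0, "run_diff": 0})
    for game in final_scores(pitch_rows).values():
        margin = game["home_runs"] - game["away_runs"]
        standings[game["home"]]["run_diff"] += margin
        standings[game["away"]]["run_diff"] -= margin
        if margin == 0:
            continue  # incomplete or suspended game: no decision recorded
        winner, loser = (
            (game["home"], game["away"]) if margin > 0
            else (game["away"], game["home"])
        )
        standings[winner]["w"] += 1
        standings[loser]["l"] += 1
    return dict(standings)


def reconcile(rebuilt, official, game_tol=1, run_tol=5):
    """Return the teams whose rebuilt totals fall outside the tolerances."""
    empty = {"w": 0, "l": 0, "run_diff": 0}
    mismatches = []
    for team, expected in official.items():
        got = rebuilt.get(team, empty)
        if (abs(got["w"] - expected["w"]) > game_tol
                or abs(got["l"] - expected["l"]) > game_tol
                or abs(got["run_diff"] - expected["run_diff"]) > run_tol):
            mismatches.append(team)
    return sorted(mismatches)
```

An empty list from `reconcile` means every team's rebuilt season agrees with the official standings within tolerance. A missing game shows up as a win or loss shortfall for both teams involved, which is what makes this stronger than a row count.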
Resumable, idempotent chunked sync
Backfills run in year/month chunks tracked in a `_statcast_ingest_runs` log; `--resume` skips chunks already recorded as success. Each chunk is DELETE-then-INSERT, so re-running the same window is safe.
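The chunked, resumable pattern can be sketched as follows. To keep the example runnable without cloud credentials, it uses the standard-library `sqlite3` module as a stand-in for BigQuery. The table layouts, the `fetch` callback, and the function names are assumptions made for illustration, not the project's actual schema or API.

```python
import sqlite3
from datetime import date


def year_chunks(start, end):
    """Split [start, end] into one (chunk_start, chunk_end) window per calendar year."""
    return [
        (max(start, date(year, 1, 1)), min(end, date(year, 12, 31)))
        for year in range(start.year, end.year + 1)
    ]


def sync(conn, fetch, start, end, resume=False):
    """Load each chunk with DELETE-then-INSERT and log it; return the windows synced."""
    conn.execute("CREATE TABLE IF NOT EXISTS pitches (game_date TEXT, pitch_id TEXT)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ingest_runs "
        "(chunk_start TEXT, chunk_end TEXT, status TEXT, row_count INTEGER)"
    )
    synced = []
    for lo, hi in year_chunks(start, end):
        window = (lo.isoformat(), hi.isoformat())
        if resume:
            done = conn.execute(
                "SELECT 1 FROM ingest_runs "
                "WHERE chunk_start = ? AND chunk_end = ? AND status = 'success'",
                window,
            ).fetchone()
            if done:
                continue  # chunk already recorded as success: skip it
        rows = list(fetch(lo, hi))  # hypothetical source, e.g. a pybaseball pull
        with conn:  # one transaction per chunk: commit on success, roll back on error
            # Clearing the window first is what makes a re-run idempotent.
            conn.execute("DELETE FROM pitches WHERE game_date BETWEEN ? AND ?", window)
            conn.executemany("INSERT INTO pitches VALUES (?, ?)", rows)
            conn.execute(
                "INSERT INTO ingest_runs VALUES (?, ?, 'success', ?)",
                (*window, len(rows)),
            )
        synced.append(window)
    return synced
```

Re-running `sync` over the same window without `resume` replaces the rows rather than duplicating them, while `resume=True` skips chunks that already have a success entry in the log. The log itself only ever grows, matching the append-only design described above.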
ColumnSpec → multi-format docs
A single ColumnSpec source of truth renders to a BigQuery schema, an LLM doc bundle, and a markdown dictionary, and can be written directly into a shared `data_dictionary` table with `--apply` (atomically per dataset/table).
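One way to picture the single-source-of-truth idea is a small dataclass with one renderer per output format. The fields and function names below are hypothetical and may not match the project's real `ColumnSpec`. The schema dictionaries follow the `name` / `type` / `mode` / `description` keys of BigQuery's JSON schema format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnSpec:
    """Hypothetical column definition; the real class may carry more fields."""
    name: str
    bq_type: str
    description: str
    mode: str = "NULLABLE"


def to_bigquery_schema(specs):
    """Render the specs as BigQuery JSON schema field dictionaries."""
    return [
        {"name": s.name, "type": s.bq_type, "mode": s.mode, "description": s.description}
        for s in specs
    ]


def to_markdown(specs):
    """Render the same specs as a markdown data dictionary table."""
    lines = ["| column | type | description |", "| --- | --- | --- |"]
    lines += [f"| `{s.name}` | {s.bq_type} | {s.description} |" for s in specs]
    return "\n".join(lines)


SPECS = [
    ColumnSpec("game_pk", "INTEGER", "MLB game identifier", mode="REQUIRED"),
    ColumnSpec("release_speed", "FLOAT", "Pitch velocity at release, in mph"),
]
```

Because both renderers read the same `SPECS` list, a description edited once stays consistent across the schema and the human-readable dictionary.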
interview talking points