jason blahovec
open source · v0.1.0

yfinance-bigquery

yfinance-bigquery ingests Yahoo Finance OHLCV bars into BigQuery across five intervals, manages the S&P 500 symbol universe, and verifies its own data entirely inside BigQuery — no external source required to confirm consistency.

Python · BigQuery · yfinance · CLI · repo ↗ · PyPI ↗
architecture
5 per-interval OHLCV tables: one table each for 1d / 60m / 15m / 5m / 1m bars, with different retention windows
dim_symbols: S&P 500 universe with membership tracking (date_removed for constituents that leave)
ingest-runs log: per-chunk run log enabling resumable syncs
resolve active universe → pull yfinance bars per interval → idempotent load into the interval's table → SQL-only verify (OHLC monotonicity, non-negative volume, no future bars, trading-day alignment, no duplicates).
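The idempotent load and the per-chunk run log can be sketched together as follows. This is a minimal in-memory illustration, not the project's implementation: `FakeWarehouse`, `load_chunk`, and the key columns `(interval, symbol, bar_start)` are all hypothetical names chosen for the example. In BigQuery itself, a `MERGE` keyed on the same columns would be one way to get the same overwrite-not-duplicate behavior.

```python
from dataclasses import dataclass, field


@dataclass
class FakeWarehouse:
    """In-memory stand-in for the interval tables plus the ingest-runs log."""
    bars: dict = field(default_factory=dict)        # (interval, symbol, bar_start) -> row
    done_chunks: set = field(default_factory=set)   # (interval, symbol, chunk_start)


def load_chunk(wh, interval, symbol, chunk_start, rows):
    """Load one chunk of bars; safe to call again after a crash or retry."""
    chunk_key = (interval, symbol, chunk_start)
    if chunk_key in wh.done_chunks:
        return 0  # already logged as complete, so a resumed sync skips it
    for row in rows:
        # Keyed write: re-loading the same bar overwrites instead of duplicating.
        wh.bars[(interval, symbol, row["bar_start"])] = row
    wh.done_chunks.add(chunk_key)  # log the chunk only after its rows are written
    return len(rows)
```

Because the chunk is logged only after its rows are written, a sync that dies mid-chunk repeats that chunk on the next run, and the keyed write makes the repeat harmless.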
install & usage
pip install yfinance-bigquery

gcloud auth application-default login
# 1. seed the S&P 500 universe from Wikipedia
yfinance-bigquery universe init \
    --dim-symbols myproject.mydataset.dim_symbols --create-if-missing
# 2. sync daily bars for every active ticker
yfinance-bigquery sync --interval 1d \
    --dataset myproject.mydataset.yfinance_v2_analytics \
    --dim-symbols myproject.mydataset.dim_symbols
# 3. verify internal consistency
yfinance-bigquery verify --source internal --interval 1d
design decisions
Five intervals, per-interval retention
1d / 60m / 15m / 5m / 1m bars each land in their own table, with retention matched to how far back Yahoo serves that granularity. This keeps the intraday tables from growing without bound.
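As a rough sketch of how per-interval retention could be expressed, the snippet below maps each interval to a retention window and derives a cutoff date. The day counts are placeholder values for illustration only; they are not taken from the project, and Yahoo's actual lookback limits should be checked before relying on them. `RETENTION_DAYS` and `retention_cutoff` are hypothetical names.

```python
from datetime import date, timedelta

# Hypothetical retention windows in days; None means keep all history.
# Placeholder values for illustration, not the project's configuration.
RETENTION_DAYS = {"1d": None, "60m": 730, "15m": 60, "5m": 60, "1m": 7}


def retention_cutoff(interval, today):
    """Return the oldest date to keep for an interval, or None for unlimited."""
    days = RETENTION_DAYS[interval]
    return None if days is None else today - timedelta(days=days)
```

Bars older than the cutoff for their interval would be candidates for expiry, while the daily table keeps everything.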
Universe management from a moving source
`universe init/refresh` tracks S&P 500 constituents scraped from Wikipedia and marks departures with `date_removed`, so historical bars for delisted/removed names stay queryable without polluting the active set.
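The membership logic described above might look something like this in plain Python. It is a sketch under assumptions: `refresh_universe` is a hypothetical function, `date_added` is an assumed column (only `date_removed` is named in the source), and reopening a row when a symbol rejoins the index is a simplification that discards the earlier removal date.

```python
from datetime import date


def refresh_universe(dim_symbols, scraped, today):
    """Reconcile dim_symbols with a freshly scraped constituent set.

    dim_symbols maps symbol -> {"date_added": date, "date_removed": date or None}.
    Returns the sorted list of currently active symbols.
    """
    for symbol, row in dim_symbols.items():
        if row["date_removed"] is None and symbol not in scraped:
            row["date_removed"] = today  # departure: keep the row, close it out
    for symbol in scraped:
        row = dim_symbols.get(symbol)
        if row is None:
            dim_symbols[symbol] = {"date_added": today, "date_removed": None}
        elif row["date_removed"] is not None:
            row["date_removed"] = None  # assumption: a rejoining symbol is reopened
    return sorted(s for s, r in dim_symbols.items() if r["date_removed"] is None)
```

Removed symbols stay in `dim_symbols`, so their historical bars can still be joined, but they drop out of the active list that drives the next sync.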
Verification done entirely inside BigQuery
Five consistency metrics (OHLC monotonicity, non-negative volume, no future-dated bars, trading-day alignment, no duplicate bars) run as SQL against the warehouse — no second data source needed to assert correctness.
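To make the five checks concrete, here is a small sketch that runs comparable SQL against an in-memory SQLite table as a stand-in for BigQuery. The table name `bars`, its columns, and the exact predicates are assumptions for illustration; the project's real queries and BigQuery dialect will differ. In particular, trading-day alignment is reduced to a weekend test here and ignores market holidays, and "OHLC monotonicity" is read as low ≤ open, close ≤ high.

```python
import sqlite3

# Each query counts violations of one rule; zero means the check passes.
CHECKS = {
    "ohlc_monotonicity": (
        "SELECT COUNT(*) FROM bars WHERE high < low OR high < open "
        "OR high < close OR low > open OR low > close"
    ),
    "non_negative_volume": "SELECT COUNT(*) FROM bars WHERE volume < 0",
    "no_future_bars": "SELECT COUNT(*) FROM bars WHERE bar_date > :today",
    # Simplification: weekend test only ('0' = Sunday, '6' = Saturday).
    "trading_day_alignment": (
        "SELECT COUNT(*) FROM bars WHERE strftime('%w', bar_date) IN ('0', '6')"
    ),
    "no_duplicates": (
        "SELECT COUNT(*) FROM (SELECT symbol, bar_date FROM bars "
        "GROUP BY symbol, bar_date HAVING COUNT(*) > 1)"
    ),
}


def run_checks(conn, today):
    """Return {check_name: violation_count}; `today` is an ISO date string."""
    results = {}
    for name, sql in CHECKS.items():
        params = {"today": today} if ":today" in sql else {}
        results[name] = conn.execute(sql, params).fetchone()[0]
    return results
```

Every check needs only the bars table itself, which is what lets verification run without a second data source.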
interview talking points