← All projects
Case Study · Data Infrastructure

QuantFinance-Databases

A reproducible data warehouse for quantitative research — fundamentals, prices, and macro indicators in one Dockerized MySQL store, populated by idempotent ETL jobs.

Status · Active Year · 2025–present Stack · Python · MySQL · Docker · pandas · SQLAlchemy Sources · FMP · SEC EDGAR · Polygon · yfinance · FRED
01 Problem

Quant research starts with clean, joinable data — and most public sources don't give you that

Equity fundamentals live in SEC EDGAR. Daily prices live with one vendor. Macro indicators live with another. Each source ships its own ticker schemes, currencies, frequencies, and revision histories. Before any research can happen, all of it has to be normalized and made joinable on a primary key you can trust.

This project is the layer that lives between raw vendor APIs and downstream research: a single store I can query for any (ticker, date, field) tuple without re-thinking ingestion every time.

02 Architecture

One MySQL store, many idempotent ETL jobs, all containerized

The system is intentionally boring: a relational core, pure-function loaders, and Docker for reproducibility.

┌──────────────────────────────────────────────────────────────┐
│  Sources                                                     │
│  FMP · SEC EDGAR · Polygon · yfinance · FRED                 │
└────────────┬──────────────┬───────────────┬──────────────────┘
             │              │               │
       ┌─────▼──────┐ ┌─────▼──────┐ ┌──────▼────────┐
       │ extractors │ │ extractors │ │  extractors   │  (Python, requests / SDKs)
       └─────┬──────┘ └─────┬──────┘ └──────┬────────┘
             │              │               │
             └──────┬───────┴───────────────┘
                    │
              ┌─────▼──────┐
              │ transforms │  pandas — typing, dedup,
              │            │  currency & period normalization
              └─────┬──────┘
                    │
              ┌─────▼──────┐
              │   loader   │  SQLAlchemy upsert
              └─────┬──────┘
                    │
        ┌───────────▼────────────┐
        │  MySQL  (Docker)       │
        │   ─ securities         │   primary key registry
        │   ─ prices_daily       │   OHLCV by (symbol, date)
        │   ─ fundamentals       │   period-end financials
        │   ─ macro_series       │   FRED time series
        │   ─ corporate_actions  │   splits / dividends
        └────────────────────────┘

Loaders are idempotent — re-running a job for the same window updates existing rows instead of duplicating them. That makes recovery from a bad day of vendor data a one-command operation.

03 Data sources

Each source has a single, well-defined responsibility

Picking one source per concern keeps semantics clean — no ambiguity about which adjusted-close I'm reading or which FY end is canonical.

SourceOwnsWhy
FMPFundamentals, ratios, earningsWide coverage of US equities with consistent field naming.
SEC EDGARRaw filings, period endsGround truth for fiscal calendar and as-reported numbers.
PolygonIntraday & corporate actionsHigh-resolution prints and reliable split/dividend feed.
yfinanceBackfill EOD pricesFree, well-known, useful for sanity-checking Polygon EOD bars.
FREDMacro time seriesAuthoritative, versioned, and updated on a predictable schedule.
04 Design choices

Trade-offs I'd defend in a review

MySQL over a time-series DB

Why Most of the workload is point-in-time joins between fundamentals and prices, not high-cardinality tick storage. Relational beats columnar here. Cost Worse for sub-second bar storage — but that's not what this layer is for.

Docker Compose for the whole stack

Why Anyone can clone the repo and run docker compose up to reproduce my schema exactly. No "works on my machine." Cost Slight overhead for new dev setup; worth it for reproducibility.

Idempotent upserts, not append-only

Why Vendor data gets revised. Append-only means I'd be carrying around stale prints. Upsert on (symbol, date) keeps the latest authoritative row. Cost No write-time audit trail by default — added a revisions log to recover that.

Centralized symbol registry

Why Every other table foreign-keys into securities. One place to handle CUSIP/CIK/ticker changes; downstream joins stay clean across corporate actions.

Secrets via .env + python-dotenv

Why No credentials in source. TODO: confirm .env.example is committed and config/config.py reads from os.environ only

05 Tests

What gets verified before a load is trusted

  • Schema contracts — every loader asserts dtype, non-null, and primary-key shape on its DataFrame before writing.
  • Cross-source reconciliation — Polygon vs. yfinance EOD prices on a sampled tickers/dates set; flagged if any close diverges by >0.5%.
  • Point-in-time joins — a unit test loads a known earnings event and asserts the as-of join against prices_daily returns the expected fundamentals snapshot for the previous filing window (no look-ahead).
  • Idempotency — running the same backfill twice produces identical row counts and identical row hashes.
  • TODO: add coverage % once pytest-cov is wired in CI
06 Results

What this unlocks downstream

The point of this project isn't the database itself — it's the research velocity that comes from never having to wrangle vendors again.

5
Data sources unified
TODO
Tickers under coverage
TODO
Years of history
TODO
Rows in prices_daily
TODO
Avg backfill runtime
1
Command to bootstrap

Once the warehouse exists, a research notebook becomes a SQL query plus a model — not three days of ingestion plumbing.

07 Screenshots

Schema, dashboards, and ETL output

TODO: drop PNG/JPG into ../assets/img/quant-databases/ and replace the four placeholders below.

DataGrip ERD
(placeholder)
ETL CLI run output
(placeholder)
Sample notebook query
(placeholder)
docker compose ps
(placeholder)
08 Limitations

What this is — and what it isn't

  • US equities only. No FX, futures, or international listings yet. Adding them means a serializer per asset class, not a schema rewrite.
  • EOD-first. Intraday is stored but not the primary use case; sub-minute bars would need a different storage layout.
  • Single node. Fine for a research workstation; not designed for multi-tenant production use.
  • Vendor coupling. Schema field names track FMP's naming where convenient — a vendor swap means rewriting the relevant loader, not the warehouse.
  • No streaming. Loads run on a cron, not on push. Live trading would require a separate path.