Case Study · Data Infrastructure

QuantFinance-Databases

A reproducible data warehouse for quantitative research — fundamentals, prices, and macro indicators in one Dockerized MySQL store, populated by idempotent ETL jobs.

Status · Active Year · 2025–present Stack · Python · MySQL · Docker · pandas · SQLAlchemy Sources · FMP · SEC EDGAR · Polygon · yfinance · FRED

View on GitHub All projects

01 Problem

Quant research starts with clean, joinable data — and most public sources don't give you that

Equity fundamentals live in SEC EDGAR. Daily prices live with one vendor. Macro indicators live with another. Each source ships its own ticker schemes, currencies, frequencies, and revision histories. Before any research can happen, all of it has to be normalized and made joinable on a primary key you can trust.

This project is the layer that lives between raw vendor APIs and downstream research: a single store I can query for any (ticker, date, field) tuple without re-thinking ingestion every time.

02 Architecture

One MySQL store, many idempotent ETL jobs, all containerized

The system is intentionally boring: a relational core, pure-function loaders, and Docker for reproducibility.

┌──────────────────────────────────────────────────────────────┐
│  Sources                                                     │
│  FMP · SEC EDGAR · Polygon · yfinance · FRED                 │
└────────────┬──────────────┬───────────────┬──────────────────┘
             │              │               │
       ┌─────▼──────┐ ┌─────▼──────┐ ┌──────▼────────┐
       │ extractors │ │ extractors │ │  extractors   │  (Python, requests / SDKs)
       └─────┬──────┘ └─────┬──────┘ └──────┬────────┘
             │              │               │
             └──────┬───────┴───────────────┘
                    │
              ┌─────▼──────┐
              │ transforms │  pandas — typing, dedup,
              │            │  currency & period normalization
              └─────┬──────┘
                    │
              ┌─────▼──────┐
              │   loader   │  SQLAlchemy upsert
              └─────┬──────┘
                    │
        ┌───────────▼────────────┐
        │  MySQL  (Docker)       │
        │   ─ securities         │   primary key registry
        │   ─ prices_daily       │   OHLCV by (symbol, date)
        │   ─ fundamentals       │   period-end financials
        │   ─ macro_series       │   FRED time series
        │   ─ corporate_actions  │   splits / dividends
        └────────────────────────┘

Loaders are idempotent — re-running a job for the same window updates existing rows instead of duplicating them. That makes recovery from a bad day of vendor data a one-command operation.

03 Data sources

Each source has a single, well-defined responsibility

Picking one source per concern keeps semantics clean — no ambiguity about which adjusted-close I'm reading or which FY end is canonical.

Source	Owns	Why
FMP	Fundamentals, ratios, earnings	Wide coverage of US equities with consistent field naming.
SEC EDGAR	Raw filings, period ends	Ground truth for fiscal calendar and as-reported numbers.
Polygon	Intraday & corporate actions	High-resolution prints and reliable split/dividend feed.
yfinance	Backfill EOD prices	Free, well-known, useful for sanity-checking Polygon EOD bars.
FRED	Macro time series	Authoritative, versioned, and updated on a predictable schedule.

04 Design choices

Trade-offs I'd defend in a review

MySQL over a time-series DB

Why Most of the workload is point-in-time joins between fundamentals and prices, not high-cardinality tick storage. Relational beats columnar here. Cost Worse for sub-second bar storage — but that's not what this layer is for.

Docker Compose for the whole stack

Why Anyone can clone the repo and run docker compose up to reproduce my schema exactly. No "works on my machine." Cost Slight overhead for new dev setup; worth it for reproducibility.

Idempotent upserts, not append-only

Why Vendor data gets revised. Append-only means I'd be carrying around stale prints. Upsert on (symbol, date) keeps the latest authoritative row. Cost No write-time audit trail by default — added a revisions log to recover that.

Centralized symbol registry

Why Every other table foreign-keys into securities. One place to handle CUSIP/CIK/ticker changes; downstream joins stay clean across corporate actions.

Secrets via `.env` + `python-dotenv`

Why No credentials in source. TODO: confirm .env.example is committed and config/config.py reads from os.environ only

05 Tests

What gets verified before a load is trusted

Schema contracts — every loader asserts dtype, non-null, and primary-key shape on its DataFrame before writing.
Cross-source reconciliation — Polygon vs. yfinance EOD prices on a sampled tickers/dates set; flagged if any close diverges by >0.5%.
Point-in-time joins — a unit test loads a known earnings event and asserts the as-of join against prices_daily returns the expected fundamentals snapshot for the previous filing window (no look-ahead).
Idempotency — running the same backfill twice produces identical row counts and identical row hashes.
TODO: add coverage % once pytest-cov is wired in CI

06 Results

What this unlocks downstream

The point of this project isn't the database itself — it's the research velocity that comes from never having to wrangle vendors again.

Data sources unified

TODO

Tickers under coverage

TODO

Years of history

TODO

Rows in prices_daily

TODO

Avg backfill runtime

Command to bootstrap

Once the warehouse exists, a research notebook becomes a SQL query plus a model — not three days of ingestion plumbing.

07 Screenshots

Schema, dashboards, and ETL output

TODO: drop PNG/JPG into ../assets/img/quant-databases/ and replace the four placeholders below.

DataGrip ERD
(placeholder)

ETL CLI run output
(placeholder)

Sample notebook query
(placeholder)

docker compose ps
(placeholder)

08 Limitations

What this is — and what it isn't

US equities only. No FX, futures, or international listings yet. Adding them means a serializer per asset class, not a schema rewrite.
EOD-first. Intraday is stored but not the primary use case; sub-minute bars would need a different storage layout.
Single node. Fine for a research workstation; not designed for multi-tenant production use.
Vendor coupling. Schema field names track FMP's naming where convenient — a vendor swap means rewriting the relevant loader, not the warehouse.
No streaming. Loads run on a cron, not on push. Live trading would require a separate path.

Quant research starts with clean, joinable data — and most public sources don't give you that

One MySQL store, many idempotent ETL jobs, all containerized

Each source has a single, well-defined responsibility

Trade-offs I'd defend in a review

MySQL over a time-series DB

Docker Compose for the whole stack

Idempotent upserts, not append-only

Centralized symbol registry

Secrets via .env + python-dotenv

What gets verified before a load is trusted

What this unlocks downstream

Schema, dashboards, and ETL output

What this is — and what it isn't

Secrets via `.env` + `python-dotenv`