QuantFinance-Databases
A reproducible data warehouse for quantitative research — fundamentals, prices, and macro indicators in one Dockerized MySQL store, populated by idempotent ETL jobs.
Quant research starts with clean, joinable data — and most public sources don't give you that
Equity fundamentals live in SEC EDGAR. Daily prices live with one vendor. Macro indicators live with another. Each source ships its own ticker schemes, currencies, frequencies, and revision histories. Before any research can happen, all of it has to be normalized and made joinable on a primary key you can trust.
This project is the layer that lives between raw vendor APIs and downstream research: a single store I can query for any (ticker, date, field) tuple without re-thinking ingestion every time.
One MySQL store, many idempotent ETL jobs, all containerized
The system is intentionally boring: a relational core, pure-function loaders, and Docker for reproducibility.
┌──────────────────────────────────────────────────────────────┐
│ Sources │
│ FMP · SEC EDGAR · Polygon · yfinance · FRED │
└────────────┬──────────────┬───────────────┬──────────────────┘
│ │ │
┌─────▼──────┐ ┌─────▼──────┐ ┌──────▼────────┐
│ extractors │ │ extractors │ │ extractors │ (Python, requests / SDKs)
└─────┬──────┘ └─────┬──────┘ └──────┬────────┘
│ │ │
└──────┬───────┴───────────────┘
│
┌─────▼──────┐
│ transforms │ pandas — typing, dedup,
│ │ currency & period normalization
└─────┬──────┘
│
┌─────▼──────┐
│ loader │ SQLAlchemy upsert
└─────┬──────┘
│
┌───────────▼────────────┐
│ MySQL (Docker) │
│ ─ securities │ primary key registry
│ ─ prices_daily │ OHLCV by (symbol, date)
│ ─ fundamentals │ period-end financials
│ ─ macro_series │ FRED time series
│ ─ corporate_actions │ splits / dividends
└────────────────────────┘
Loaders are idempotent — re-running a job for the same window updates existing rows instead of duplicating them. That makes recovery from a bad day of vendor data a one-command operation.
Each source has a single, well-defined responsibility
Picking one source per concern keeps semantics clean — no ambiguity about which adjusted-close I'm reading or which FY end is canonical.
| Source | Owns | Why |
|---|---|---|
| FMP | Fundamentals, ratios, earnings | Wide coverage of US equities with consistent field naming. |
| SEC EDGAR | Raw filings, period ends | Ground truth for fiscal calendar and as-reported numbers. |
| Polygon | Intraday & corporate actions | High-resolution prints and reliable split/dividend feed. |
| yfinance | Backfill EOD prices | Free, well-known, useful for sanity-checking Polygon EOD bars. |
| FRED | Macro time series | Authoritative, versioned, and updated on a predictable schedule. |
Trade-offs I'd defend in a review
MySQL over a time-series DB
Why Most of the workload is point-in-time joins between fundamentals and prices, not high-cardinality tick storage. Relational beats columnar here. Cost Worse for sub-second bar storage — but that's not what this layer is for.
Docker Compose for the whole stack
Why Anyone can clone the repo and run docker compose up to reproduce my schema exactly. No "works on my machine." Cost Slight overhead for new dev setup; worth it for reproducibility.
Idempotent upserts, not append-only
Why Vendor data gets revised. Append-only means I'd be carrying around stale prints. Upsert on (symbol, date) keeps the latest authoritative row. Cost No write-time audit trail by default — added a revisions log to recover that.
Centralized symbol registry
Why Every other table foreign-keys into securities. One place to handle CUSIP/CIK/ticker changes; downstream joins stay clean across corporate actions.
Secrets via .env + python-dotenv
Why No credentials in source. TODO: confirm .env.example is committed and config/config.py reads from os.environ only
What gets verified before a load is trusted
- Schema contracts — every loader asserts dtype, non-null, and primary-key shape on its DataFrame before writing.
- Cross-source reconciliation — Polygon vs. yfinance EOD prices on a sampled tickers/dates set; flagged if any close diverges by >0.5%.
- Point-in-time joins — a unit test loads a known earnings event and asserts the as-of join against prices_daily returns the expected fundamentals snapshot for the previous filing window (no look-ahead).
- Idempotency — running the same backfill twice produces identical row counts and identical row hashes.
- TODO: add coverage % once pytest-cov is wired in CI
What this unlocks downstream
The point of this project isn't the database itself — it's the research velocity that comes from never having to wrangle vendors again.
Once the warehouse exists, a research notebook becomes a SQL query plus a model — not three days of ingestion plumbing.
Schema, dashboards, and ETL output
TODO: drop PNG/JPG into ../assets/img/quant-databases/ and replace the four placeholders below.
(placeholder)
(placeholder)
(placeholder)
(placeholder)
What this is — and what it isn't
- US equities only. No FX, futures, or international listings yet. Adding them means a serializer per asset class, not a schema rewrite.
- EOD-first. Intraday is stored but not the primary use case; sub-minute bars would need a different storage layout.
- Single node. Fine for a research workstation; not designed for multi-tenant production use.
- Vendor coupling. Schema field names track FMP's naming where convenient — a vendor swap means rewriting the relevant loader, not the warehouse.
- No streaming. Loads run on a cron, not on push. Live trading would require a separate path.