Cross-Sectional Signals Engine:
A buy-side research toolkit that combines SEC EDGAR XBRL financial-statement data with daily price history to build, rank, and evaluate cross-sectional equity factor signals. Fundamental signals — earnings yield, asset growth, accruals, quality (ROA), leverage, and sales growth — are extracted from annual 10-K filings via the free EDGAR API and normalised across a user-defined universe. Momentum (12-1 month, Jegadeesh & Titman 1993) and a low-volatility filter (Ang et al. 2006) extend the signal set. A cross-sectional ranking engine assigns decile buckets at each monthly rebalance, and a rolling backtest engine computes the Information Coefficient (IC), IC Information Ratio (ICIR), long-short spread returns, and turnover for each signal over multi-year windows. A composite signal builder z-scores and linearly combines any subset of signals. Output is exposed through a typer CLI and a three-tab Streamlit dashboard. Built on Python 3.11+ using pandas, numpy, scipy, statsmodels, scikit-learn, plotly, streamlit, duckdb, and pydantic v2; packaged with hatchling and tested with pytest against a deterministic 756-day synthetic fixture.
%==========%
I. Streamlit Dashboard (app.py):
The dashboard exposes the full module via a browser UI. The sidebar controls the ticker universe, signal date, forward return horizon, quantile count, and backtest date range. Three tabs present the output: Universe & Data shows fundamentals coverage per ticker (period count, field completeness, date range) and a normalised price history chart; Factor Explorer renders the signal value table with a diverging colour gradient, the factor exposure heatmap, cross-sectional ranking tables, and composite signal scores; Backtester runs the selected signals against the historical return series and produces a summary table, overlaid cumulative spread chart, per-signal IC bar chart, IC heatmap, and quantile return bar chart.
Sidebar controls:
| Control | Default | Effect |
|---|---|---|
| Universe | 20 large-cap US equities | Comma-separated list of any yfinance-valid tickers |
| Signal date (as-of) | Today | Date at which fundamentals and price signals are evaluated |
| Forward return horizon | 21 days | Horizon for IC computation and quantile return analysis: 21, 63, or 126 days |
| Quantiles | 5 | Number of ranking buckets (3–5) for long-short spread and ranking display |
| Backtest start / end | 2020 / today | Date range for the rolling IC backtest |
Tabs:
| Tab | Content |
|---|---|
| 📊 Universe & Data | Fundamentals coverage table (periods, earliest/latest filing, field completeness) and normalised price history line chart. |
| 🔍 Factor Explorer | Signal values with diverging colour gradient, z-scored factor exposure heatmap, cross-sectional ranking tables, composite signal scores (Quality-Value, Fundamentals, Momentum-Quality). |
| 📈 Backtester | Summary statistics table (IC, ICIR, hit rate, spread, turnover), cumulative spread return overlay, per-signal IC bar chart, IC heatmap, and quantile return bar chart. |
cd assets/projects/signals_engine
python -m venv .venv && .venv\Scripts\Activate.ps1 # Windows
pip install -e ".[dev]"
python scripts/download_data.py # ~5–15 min for 20 tickers via SEC EDGAR + yfinance
streamlit run src/signals_engine/app.py
The dashboard below runs entirely in the browser via stlite (Streamlit on WebAssembly — no server required). It uses a synthetic 20-stock universe with deterministic seed-42 prices and fundamentals; all signal and backtest logic runs client-side. First load downloads Pyodide and may take 20–40 seconds; subsequent loads are cached.
%==========%
II. Project Layout:
signals-engine/
├── pyproject.toml # Build config, deps, ruff + pytest settings
├── .env.example # AV_API_KEY, DB_PATH, DATA_SOURCE, EDGAR_SLEEP
├── data/ # Populated by scripts/download_data.py
│ ├── signals.duckdb # DuckDB: fundamentals + prices tables
│ ├── prices.csv # Wide (date × ticker) adjusted close CSV
│ └── cik_map.json # ticker → 10-digit EDGAR CIK
├── notebooks/
│ └── factor_research.ipynb # Research memo: data → signals → backtest
├── scripts/
│ └── download_data.py # SEC EDGAR + yfinance → DuckDB
├── src/signals_engine/
│ ├── data/
│ │ ├── schemas.py # Pydantic v2: Universe, FundamentalsRecord, PriceHistory, SignalFrame
│ │ ├── edgar.py # SEC EDGAR XBRL fetchers (no API key required)
│ │ ├── prices.py # yfinance + Alpha Vantage price fetchers
│ │ └── store.py # DuckDB read/write for fundamentals and prices
│ ├── signals/
│ │ ├── fundamentals.py # earnings_yield, asset_growth, accruals, ROA, leverage, sales_growth
│ │ ├── momentum.py # 12-1 month momentum, 1-month reversal
│ │ ├── volatility.py # realized_vol, idiosyncratic_vol, low_vol_filter
│ │ └── composite.py # CompositeSignal: z-score + linear combiner
│ ├── rank/
│ │ ├── crosssection.py # rank_cross_section, assign_quantiles, long_short_portfolio
│ │ └── backtest.py # information_coefficient, run_backtest, BacktestResult
│ ├── report/
│ │ └── plots.py # Plotly: spread returns, IC bar, quantile bars, IC heatmap, exposure map
│ ├── cli.py # Typer CLI: fetch | build | rank | backtest
│ └── app.py # Streamlit: 3 tabs (Universe & Data, Factor Explorer, Backtester)
└── tests/
    ├── conftest.py # Deterministic 756-day price + 5-year fundamentals fixtures
    ├── test_signals.py # Fundamental, momentum, and composite signal invariants
    ├── test_rank.py # Ranking and long-short portfolio mechanics
    └── test_backtest.py # IC, backtest engine, rebalance date generation
%==========%
III. Data Sources — SEC EDGAR & Price Ingestion:
Financial statement data comes from the SEC EDGAR XBRL APIs, which serve structured JSON for every 10-K filing without requiring an API key. The companyfacts endpoint returns the full history of every XBRL tag for a given CIK, so a multi-year annual panel needs only one HTTP request per company. The module parses the us-gaap taxonomy for seven fields (revenues, net income, EPS, total assets, long-term debt, stockholders’ equity, and operating cash flow), filtering to FY (full-year) periods from 10-K filings. Revenue has a fallback chain across tag name variants to accommodate different filer conventions. The SEC asks clients to stay at or below 10 requests per second; the module sleeps 0.12 seconds between EDGAR calls (EDGAR_SLEEP), which keeps it to roughly 8 requests per second.
Price data is fetched via yfinance (default) or Alpha Vantage (optional). Adjusted closes are stored in DuckDB in both long (date, ticker, close) and wide (date × ticker) formats. Forward returns are computed as \(r_{t,h} = (P_{t+h} - P_t) / P_t\): the price series is shifted by \(-h\) rows so that each return sits on its signal date \(t\), and the last \(h\) rows are NaN because their future prices are not yet available.
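The revenue fallback chain described above is not shown in the excerpt that follows, so here is a minimal sketch of how such a chain might look. The three tag names are real us-gaap elements, but their ordering and the helper name `first_available_tag` are assumptions for illustration, not the module's actual code.

```python
from __future__ import annotations

from typing import Any

# Candidate us-gaap revenue tags in priority order.
# The ordering is an assumption; the module's real chain may differ.
REVENUE_TAG_CANDIDATES = (
    "Revenues",
    "RevenueFromContractWithCustomerExcludingAssessedTax",
    "SalesRevenueNet",
)


def first_available_tag(
    facts: dict[str, Any],
    candidates: tuple[str, ...] = REVENUE_TAG_CANDIDATES,
    taxonomy: str = "us-gaap",
) -> str | None:
    """Return the first candidate tag that reports at least one unit, else None."""
    tags = facts.get("facts", {}).get(taxonomy, {})
    for tag in candidates:
        if tags.get(tag, {}).get("units"):
            return tag
    return None


# Toy payload shaped like a companyfacts response (values are made up).
toy_facts = {
    "facts": {
        "us-gaap": {
            "SalesRevenueNet": {
                "units": {
                    "USD": [
                        {"end": "2016-12-31", "val": 100.0, "form": "10-K", "fp": "FY"}
                    ]
                }
            }
        }
    }
}

chosen = first_available_tag(toy_facts)
print(chosen)
```

Returning the tag name rather than the data keeps the lookup separate from the extraction step, so the same `_extract_tag`-style filter can be applied to whichever tag is found.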
# edgar.py: XBRL extraction
import pandas as pd

_XBRL_TAGS = {
    "revenue": "Revenues",
    "net_income": "NetIncomeLoss",
    "eps_basic": "EarningsPerShareBasic",
    "total_assets": "Assets",
    "total_debt": "LongTermDebt",
    "stockholders_equity": "StockholdersEquity",
    "operating_cash_flow": "NetCashProvidedByUsedInOperatingActivities",
}


def _extract_tag(facts, tag, taxonomy="us-gaap",
                 form_filter=("10-K",)) -> pd.DataFrame:
    """Filter companyfacts JSON to annual 10-K entries for one XBRL tag."""
    units = facts.get("facts", {}).get(taxonomy, {}).get(tag, {}).get("units", {})
    # Monetary tags report under "USD"; per-share tags such as EPS use "USD/shares".
    entries = units.get("USD") or units.get("USD/shares") or []
    rows = [
        {"period_end": e["end"], "value": float(e["val"])}
        for e in entries
        if e.get("form") in form_filter and e.get("fp") == "FY"
    ]
    # Explicit columns keep drop_duplicates valid when no entry matches.
    df = pd.DataFrame(rows, columns=["period_end", "value"])
    return df.drop_duplicates("period_end", keep="last")


def build_fundamentals_df(cik, ticker) -> pd.DataFrame:
    """One HTTP call per company → tidy annual fundamentals DataFrame."""
    # fetch_company_facts is defined elsewhere in edgar.py (not shown here).
    facts = fetch_company_facts(cik)
    series = {field: _extract_tag(facts, tag).set_index("period_end")["value"]
              for field, tag in _XBRL_TAGS.items()}
    combined = pd.DataFrame(series)
    combined["ticker"] = ticker.upper()
    combined["cik"] = str(cik)
    return combined.sort_index()


# prices.py: forward return alignment
def compute_forward_returns(prices: pd.DataFrame, horizon: int = 21) -> pd.DataFrame:
    """h-day forward return aligned to signal date.

    fwd_ret[t] = (price[t+h] / price[t]) - 1, shifted so the result sits at t.
    """
    return prices.shift(-horizon) / prices - 1.0
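To make the alignment concrete, here is a small worked example on a toy price table (the tickers and prices are made up). It applies the same shift-and-divide expression with a 2-day horizon and shows that the trailing rows are NaN.

```python
import pandas as pd

# Toy wide price table: 5 business days, 2 hypothetical tickers.
prices = pd.DataFrame(
    {
        "AAA": [100.0, 102.0, 101.0, 105.0, 110.0],
        "BBB": [50.0, 50.0, 55.0, 44.0, 44.0],
    },
    index=pd.bdate_range("2024-01-01", periods=5),
)

horizon = 2
# The row at t holds price[t + h] / price[t] - 1.
fwd = prices.shift(-horizon) / prices - 1.0

print(fwd)
# First row: AAA = 101/100 - 1 = 0.01, BBB = 55/50 - 1 = 0.10.
# The last `horizon` rows are NaN because no future price exists yet.
```

Because the return already sits on the signal date, a signal row and a forward-return row with the same index label can be correlated directly without any further shifting.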
%==========%
IV. Data Schemas (schemas.py):
Universe holds a list of tickers and normalises them on construction. FundamentalsRecord represents one annual period for one company, with optional fields for every financial statement line. PriceHistory wraps a dates × tickers adjusted-close DataFrame with Pydantic v2 validation (non-empty, no all-NaN columns). SignalFrame wraps a MultiIndex DataFrame of cross-sectional signal values.
# schemas.py
from datetime import date

import pandas as pd
from pydantic import BaseModel, field_validator, model_validator

# FilingPeriod is defined elsewhere in schemas.py and is not shown in this excerpt.


class Universe(BaseModel):
    tickers: list[str]
    name: str = "custom"

    @field_validator("tickers")
    @classmethod
    def normalise(cls, v: list[str]) -> list[str]:
        # Strip whitespace, upper-case, and drop empty entries.
        return [t.strip().upper() for t in v if t.strip()]


class FundamentalsRecord(BaseModel):
    ticker: str
    cik: str
    period_end: date
    period: FilingPeriod = "annual"
    revenue: float | None = None
    net_income: float | None = None
    eps_basic: float | None = None
    total_assets: float | None = None
    total_debt: float | None = None
    stockholders_equity: float | None = None
    operating_cash_flow: float | None = None


class PriceHistory(BaseModel):
    prices: pd.DataFrame  # dates × tickers, adjusted close
    start: date
    end: date
    model_config = {"arbitrary_types_allowed": True}

    @model_validator(mode="after")
    def validate_prices(self) -> "PriceHistory":
        if self.prices.empty:
            raise ValueError("price history is empty")
        if self.prices.isnull().all().any():
            bad = self.prices.columns[self.prices.isnull().all()].tolist()
            raise ValueError(f"all-NaN columns: {bad}")
        return self

    def returns(self, fill: bool = True) -> pd.DataFrame:
        df = self.prices.ffill() if fill else self.prices.copy()
        return df.pct_change().dropna(how="all")
%==========%
V. Fundamental Signals (signals/fundamentals.py):
Six fundamental signals are computed as of an as_of date by looking up each company’s most recent annual filing with period_end ≤ as_of. Accruals and leverage are negated so that higher values are more desirable (low-accrual, low-debt firms rank high); asset growth is left as a raw growth rate and is penalised through a negative weight in the composites instead. Missing fields produce NaN, which propagates cleanly through the ranking and backtest layers.
| Signal | Formula | Economic rationale |
|---|---|---|
| Earnings Yield | \(EPS / P\) | Value signal: higher yield indicates cheaper valuation relative to earnings (Basu 1977). |
| Asset Growth | \((A_t - A_{t-1}) / A_{t-1}\) | Aggressive balance-sheet expansion predicts lower future returns (Cooper et al. 2008). |
| Accruals | \(-(NI - OCF) / A\) | Low accruals signal high earnings quality: cash-backed income predicts higher future returns (Sloan 1996). |
| Return on Assets | \(NI / A\) | Profitability factor: high-ROA firms earn a persistent premium (Fama & French 2015). |
| Leverage | \(-D / A\) | Negated debt/assets ratio; low-leverage firms have historically outperformed (Penman et al. 2007). |
| Sales Growth | \((R_t - R_{t-1}) / R_{t-1}\) | Sustained revenue growth is a proxy for competitive advantage and future profitability. |
# fundamentals.py
from datetime import date

import numpy as np
import pandas as pd


def _latest_as_of(fund: pd.DataFrame, as_of: date) -> pd.DataFrame:
    """Per ticker, keep only the most recent row with period_end ≤ as_of."""
    sub = fund[fund["period_end"] <= as_of]
    idx = sub.groupby("ticker")["period_end"].idxmax()
    return sub.loc[idx].set_index("ticker")


def accruals(fund: pd.DataFrame, as_of: date) -> pd.Series:
    """(Net income − Operating cash flow) / Assets, sign-flipped.

    Low accruals → high earnings quality → high signal value.
    """
    latest = _latest_as_of(fund, as_of)
    ni = latest["net_income"]
    ocf = latest["operating_cash_flow"]
    ta = latest["total_assets"].replace(0, np.nan)  # avoid division by zero
    raw = (ni - ocf) / ta
    return (-raw).rename("accruals")


def return_on_assets(fund: pd.DataFrame, as_of: date) -> pd.Series:
    latest = _latest_as_of(fund, as_of)
    return (latest["net_income"] / latest["total_assets"].replace(0, np.nan)
            ).rename("return_on_assets")


def asset_growth(fund: pd.DataFrame, as_of: date) -> pd.Series:
    sub = (fund[fund["period_end"] <= as_of]
           .dropna(subset=["total_assets"])
           .sort_values("period_end"))
    last2 = sub.groupby("ticker").tail(2)  # two most recent annual periods
    result = {}
    for ticker, grp in last2.groupby("ticker"):
        if len(grp) < 2:
            continue
        a0, a1 = grp["total_assets"].iloc[0], grp["total_assets"].iloc[1]
        if a0 > 0:
            result[ticker] = (a1 - a0) / a0
    return pd.Series(result, name="asset_growth")
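As a worked illustration of the as-of lookup and the accruals sign flip, the sketch below runs the same steps on a four-row toy table. The tickers, figures, and the helper name `accruals_as_of` are made up for the example, and period_end is assumed to be a datetime column.

```python
import numpy as np
import pandas as pd

# Toy fundamentals: two tickers, two fiscal years each (numbers are made up).
fund = pd.DataFrame(
    {
        "ticker": ["AAA", "AAA", "BBB", "BBB"],
        "period_end": pd.to_datetime(
            ["2022-12-31", "2023-12-31", "2022-12-31", "2023-12-31"]
        ),
        "net_income": [10.0, 12.0, 8.0, 9.0],
        "operating_cash_flow": [14.0, 9.0, 8.0, 12.0],
        "total_assets": [100.0, 120.0, 80.0, 0.0],
    }
)


def accruals_as_of(fund: pd.DataFrame, as_of: pd.Timestamp) -> pd.Series:
    """Sign-flipped accruals using each ticker's latest period_end <= as_of."""
    sub = fund[fund["period_end"] <= as_of]
    latest = sub.loc[sub.groupby("ticker")["period_end"].idxmax()].set_index("ticker")
    assets = latest["total_assets"].replace(0, np.nan)  # zero assets -> NaN
    return -(latest["net_income"] - latest["operating_cash_flow"]) / assets


mid_2023 = accruals_as_of(fund, pd.Timestamp("2023-06-30"))  # sees FY2022 only
early_2024 = accruals_as_of(fund, pd.Timestamp("2024-01-31"))  # sees FY2023

print(mid_2023)
print(early_2024)
```

In mid-2023 only the FY2022 rows qualify, so AAA scores -(10 - 14) / 100 = 0.04. By early 2024 the FY2023 rows take over: AAA drops to -(12 - 9) / 120 = -0.025, and BBB becomes NaN because its total assets are zero.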
%==========%
VI. Momentum & Volatility Filters (signals/momentum.py, volatility.py):
The 12-1 month momentum signal measures the cumulative return over a 12-month formation window that ends one month before the signal date, i.e. from about \(t-13\) to \(t-1\) months, with a month approximated as 21 trading days. Skipping the most recent month avoids short-term reversal contamination. The 1-month reversal signal is the negated recent return: recent losers are expected to bounce (Jegadeesh 1990). Both signals operate entirely on the price matrix and require no fundamentals data, making them applicable to any liquid universe with sufficient price history.
The realized volatility filter uses a 63-day rolling window of log returns, annualised by \(\sqrt{252}\). An idiosyncratic volatility filter regresses each stock against a market proxy and computes the annualised residual standard deviation. The low_vol_filter utility NaN-masks any stock above a given volatility percentile, allowing practitioners to screen high-vol names out of any signal before ranking.
# momentum.py
from datetime import date

import numpy as np
import pandas as pd

# last_row_on_or_before(prices, as_of) is defined elsewhere in the package; it is
# assumed to return the integer position of the last row dated on or before as_of.


def momentum_12_1(prices: pd.DataFrame, as_of: date,
                  skip_months: int = 1, formation_months: int = 12) -> pd.Series:
    """12-1 month momentum: cumulative return from about t-13m to t-1m."""
    end_idx = last_row_on_or_before(prices, as_of)
    near_end = max(end_idx - skip_months * 21, 0)           # skip the latest month
    near_start = max(near_end - formation_months * 21, 0)   # 12-month formation window
    return (prices.iloc[near_end] / prices.iloc[near_start].replace(0, np.nan) - 1
            ).rename("momentum_12_1")


# volatility.py
def realized_vol(prices: pd.DataFrame, as_of: date, window: int = 63) -> pd.Series:
    """Trailing annualised realized vol over a [window]-day window."""
    end_i = last_row_on_or_before(prices, as_of)
    start_i = max(end_i - window, 0)
    log_rets = np.log(prices.iloc[start_i:end_i + 1] /
                      prices.iloc[start_i:end_i + 1].shift(1)).dropna()
    return (log_rets.std() * np.sqrt(252)).rename("realized_vol")


def low_vol_filter(signal: pd.Series, rvol: pd.Series,
                   vol_percentile_cap: float = 0.75) -> pd.Series:
    """NaN-mask tickers above the vol cap percentile."""
    common = signal.index.intersection(rvol.index)
    sig = signal[common].copy()
    sig[rvol[common] > rvol[common].quantile(vol_percentile_cap)] = np.nan
    return sig
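The effect of the skipped month is easiest to see on synthetic prices. The sketch below uses a hypothetical row-based helper, `momentum_by_rows`, that takes an integer end position instead of a date so it does not depend on the package's date lookup. Two made-up series are identical except for a price spike in the final week.

```python
import numpy as np
import pandas as pd


def momentum_by_rows(prices: pd.DataFrame, end_idx: int,
                     skip_rows: int = 21, formation_rows: int = 252) -> pd.Series:
    """Return over the formation window that ends skip_rows before end_idx."""
    near_end = max(end_idx - skip_rows, 0)
    near_start = max(near_end - formation_rows, 0)
    return prices.iloc[near_end] / prices.iloc[near_start] - 1.0


n = 300
steady = 100.0 * 1.001 ** np.arange(n)  # constant 0.1% daily growth
jumpy = steady.copy()
jumpy[-5:] *= 1.5                        # spike inside the skipped month

prices = pd.DataFrame({"STEADY": steady, "JUMPY": jumpy})
mom = momentum_by_rows(prices, end_idx=n - 1)

print(mom)
# Both tickers score 1.001**252 - 1: the late spike falls in the skipped window.
```

With end_idx = 299 the window runs from row 26 to row 278, so the spike in rows 295 to 299 never enters the calculation.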
%==========%
VII. Composite Signal Construction (signals/composite.py):
Multiple signals are combined into a single composite score in three steps: (1) each signal is independently winsorised at ±3σ to limit the influence of extreme outliers; (2) it is z-scored cross-sectionally to equalise scale across signals with very different natural units; (3) the z-scores are summed with user-supplied weights. Missing values are dropped per-signal before z-scoring, so a NaN in one signal does not cancel out the contribution from another.
Three pre-built composites are provided. Fundamentals Composite equally weights earnings yield, ROA, accruals, and leverage. Quality-Value tilts toward profitability and earnings quality while penalising aggressive asset growers. Momentum-Quality combines 12-1 momentum with ROA and accruals to filter momentum winners with a quality screen.
# composite.py
from dataclasses import dataclass

import pandas as pd


@dataclass
class CompositeSignal:
    name: str
    weights: dict[str, float]
    winsor_sigma: float = 3.0

    def build(self, signals_df: pd.DataFrame) -> pd.Series:
        zscores = []
        for col, w in self.weights.items():
            s = signals_df[col].dropna()
            # Winsorise at ±winsor_sigma standard deviations, then z-score.
            s = s.clip(s.mean() - self.winsor_sigma * s.std(),
                       s.mean() + self.winsor_sigma * s.std())
            zscores.append((s - s.mean()) / s.std() * w)
        # min_count=1: a ticker needs at least one non-NaN signal to get a score.
        return pd.concat(zscores, axis=1).sum(axis=1, min_count=1).rename(self.name)


QUALITY_VALUE = CompositeSignal(
    name="quality_value",
    weights={
        "earnings_yield": 1.0,
        "return_on_assets": 1.0,
        "accruals": 1.0,
        "leverage": 0.5,
        "asset_growth": -0.5,  # penalise aggressive balance-sheet expanders
    },
)
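A small worked example may help show how missing values behave in the three-step recipe. The sketch below applies the same winsorise, z-score, and weighted-sum steps to a made-up four-stock table with hypothetical column names `value` and `quality`.

```python
import pandas as pd

# Toy cross-section; CCC has no quality reading.
signals = pd.DataFrame(
    {
        "value": [0.02, 0.05, 0.08, 0.11],
        "quality": [0.30, 0.10, float("nan"), 0.20],
    },
    index=["AAA", "BBB", "CCC", "DDD"],
)
weights = {"value": 1.0, "quality": 0.5}
winsor_sigma = 3.0

parts = []
for col, w in weights.items():
    s = signals[col].dropna()                       # drop NaNs per signal
    mu, sd = s.mean(), s.std()
    s = s.clip(mu - winsor_sigma * sd, mu + winsor_sigma * sd)  # winsorise
    parts.append(w * (s - s.mean()) / s.std())      # weighted z-score

# Rows are aligned on ticker; NaNs are skipped in the sum.
composite = pd.concat(parts, axis=1).sum(axis=1, min_count=1)
print(composite.sort_values(ascending=False))
```

CCC still receives a score (its value z-score of about 0.387) even though its quality reading is missing. The flip side is that its composite rests on one signal rather than two, so scores for sparsely covered tickers are not on quite the same footing as the rest.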
%==========%
VIII. Cross-Sectional Ranking (rank/crosssection.py):
At each rebalance date, stocks are ranked by signal value within the universe. rank_cross_section() converts raw signal values to percentile ranks (higher = better, with a maximum of 1) using pd.Series.rank(pct=True), preserving NaNs. assign_quantiles() bins the non-NaN values into \(n\) buckets of roughly equal size, labelled 1 (bottom) through \(n\) (top), using pd.qcut; because duplicate bin edges are dropped, heavily tied signals can yield fewer than \(n\) buckets. The rank_universe() convenience function runs both transformations for every signal column in a DataFrame. long_short_portfolio() computes the top-bucket minus bottom-bucket mean forward return for a quick spread estimate at a single cross-section.
# crosssection.py
import pandas as pd


def rank_cross_section(signal: pd.Series, ascending: bool = True) -> pd.Series:
    """Percentile rank in (0, 1]; NaN inputs remain NaN."""
    return signal.rank(pct=True, na_option="keep", ascending=ascending)


def assign_quantiles(signal: pd.Series, n: int = 10) -> pd.Series:
    """Bin valid observations into quantile buckets 1 (bottom) … n (top)."""
    valid = signal.dropna()
    bucket = pd.qcut(valid, n, labels=False, duplicates="drop") + 1
    return bucket.reindex(signal.index)


def long_short_portfolio(signal, fwd_returns, n_quantiles=5) -> dict[str, float]:
    # Use only tickers with both a signal value and a forward return.
    common = signal.dropna().index.intersection(fwd_returns.dropna().index)
    q = pd.qcut(signal[common], n_quantiles, labels=False, duplicates="drop") + 1
    long_ret = float(fwd_returns[common][q == q.max()].mean())
    short_ret = float(fwd_returns[common][q == q.min()].mean())
    return {"long_return": long_ret, "short_return": short_ret,
            "spread_return": long_ret - short_ret}
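For a concrete picture of the ranking step, the sketch below runs the same rank, qcut, and top-minus-bottom logic on six made-up tickers, one of which has no signal value. It inlines the calls rather than importing the module.

```python
import numpy as np
import pandas as pd

tickers = list("ABCDEF")
signal = pd.Series([0.9, 0.1, np.nan, 0.5, 0.3, 0.7], index=tickers)
fwd = pd.Series([0.04, -0.02, 0.01, 0.00, 0.01, 0.03], index=tickers)

# Percentile ranks: NaN stays NaN, best valid value gets 1.0.
pct = signal.rank(pct=True, na_option="keep")

# Quintile buckets on the valid values, re-expanded to the full index.
valid = signal.dropna()
buckets = (pd.qcut(valid, 5, labels=False, duplicates="drop") + 1).reindex(signal.index)

# Long-short spread: mean forward return of top bucket minus bottom bucket.
long_ret = fwd[buckets == buckets.max()].mean()
short_ret = fwd[buckets == buckets.min()].mean()
spread = long_ret - short_ret

print(pd.DataFrame({"signal": signal, "pct_rank": pct, "bucket": buckets}))
print("spread:", spread)
```

With five valid values and five buckets, each bucket holds one stock: A lands in bucket 5, B in bucket 1, and C stays NaN. The spread is 0.04 - (-0.02) = 0.06.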
%==========%
IX. Backtest Engine (rank/backtest.py):
The backtest engine evaluates whether a signal has historically explained cross-sectional return variation. At each monthly rebalance date it computes:
| Metric | Formula | Interpretation |
|---|---|---|
| IC (Information Coefficient) | \(\rho_S(\text{signal}_t,\, r_{t,h})\) | Spearman rank correlation. Measures the ordinal predictive power of the signal for \(h\)-day forward returns. IC > 0 = signal is directionally correct. |
| ICIR | \(\mu_{IC} / \sigma_{IC}\) | IC Information Ratio. Analogous to a Sharpe ratio for the IC time series; measures consistency. As a rough rule of thumb, an ICIR above 0.5 suggests a usefully consistent signal. |
| Hit Rate | \(\Pr(IC_t > 0)\) | Fraction of rebalance periods with positive IC. >55% suggests reliable directional consistency. |
| Spread Return | \(\bar r_{\text{long}} - \bar r_{\text{short}}\) | Mean return of top quantile minus mean return of bottom quantile at each period. |
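| Turnover | \(\lvert L_{t-1} \,\triangle\, L_t \rvert / \lvert L_{t-1} \cup L_t \rvert\) | Size of the symmetric difference between consecutive long portfolios \(L_{t-1}\) and \(L_t\), normalised by the size of their union (0 = unchanged, 1 = fully replaced). High-turnover signals are costly to implement. |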
# backtest.py
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

# BacktestResult is a result container defined elsewhere in backtest.py.


def information_coefficient(signal: pd.Series, fwd_returns: pd.Series) -> float:
    """Spearman rank correlation on the intersection of non-NaN observations."""
    common = signal.dropna().index.intersection(fwd_returns.dropna().index)
    if len(common) < 5:
        return np.nan
    corr, _ = spearmanr(signal[common], fwd_returns[common])
    return float(corr)


def run_backtest(signals_panel: pd.DataFrame, returns_panel: pd.DataFrame,
                 signal_col: str, n_quantiles: int = 5) -> BacktestResult:
    """Rolling IC backtest: iterate over rebalance dates, compute IC + spread each period."""
    ic_vals, long_rets, short_rets, spread_rets = {}, {}, {}, {}
    prev_portfolio: set[str] = set()
    turnovers: list[float] = []
    for d in sorted(signals_panel.index.intersection(returns_panel.index)):
        sig_row = signals_panel.loc[d].dropna()
        ret_row = returns_panel.loc[d].dropna()
        common = sig_row.index.intersection(ret_row.index)
        if len(common) < n_quantiles * 2:
            continue  # too few names to form meaningful buckets
        ic_vals[d] = information_coefficient(sig_row[common], ret_row[common])
        q = pd.qcut(sig_row[common], n_quantiles, labels=False, duplicates="drop") + 1
        long_rets[d] = ret_row[common][q == q.max()].mean()
        short_rets[d] = ret_row[common][q == q.min()].mean()
        spread_rets[d] = long_rets[d] - short_rets[d]
        curr_long = set(sig_row[common][q == q.max()].index)
        if prev_portfolio:
            # Turnover: symmetric difference of long books over their union.
            union = prev_portfolio | curr_long
            if union:
                turnovers.append(len(prev_portfolio ^ curr_long) / len(union))
        prev_portfolio = curr_long
    ic_s = pd.Series(ic_vals, name="ic")
    return BacktestResult(
        signal_name=signal_col,
        ic_series=ic_s,
        spread_returns=pd.Series(spread_rets, name="spread_return"),
        long_returns=pd.Series(long_rets),
        short_returns=pd.Series(short_rets),
        mean_ic=ic_s.mean(),
        icir=ic_s.mean() / ic_s.std() if ic_s.std() > 0 else np.nan,
        hit_rate=(ic_s > 0).mean(),
        cum_spread_return=(1 + pd.Series(spread_rets).fillna(0)).prod() - 1,
        turnover_mean=np.mean(turnovers) if turnovers else np.nan,
        rebalance_dates=list(ic_s.index),  # dates actually evaluated
    )
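The summary metrics in the table above are simple to verify by hand. The sketch below computes them on tiny made-up inputs. To stay dependency-light it obtains the Spearman correlation as a Pearson correlation of ranks, which is mathematically equivalent to the scipy call but is a substitution, not the module's code; `spearman_ic` is a hypothetical name.

```python
import pandas as pd


def spearman_ic(signal: pd.Series, fwd: pd.Series) -> float:
    """Spearman rank correlation computed as Pearson correlation of ranks."""
    common = signal.dropna().index.intersection(fwd.dropna().index)
    return float(signal[common].rank().corr(fwd[common].rank()))


# A perfectly monotone signal has IC = 1 regardless of the return magnitudes.
sig = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
ret = pd.Series([0.01, 0.02, 0.05, 0.07, 0.20])
perfect_ic = spearman_ic(sig, ret)

# ICIR and hit rate from a made-up series of four monthly ICs.
ic = pd.Series([0.10, -0.05, 0.20, 0.15])
icir = ic.mean() / ic.std()      # 0.10 / 0.108 ≈ 0.926 (sample std, not annualised)
hit_rate = (ic > 0).mean()       # 3 of 4 periods positive

# Turnover between two consecutive long books.
prev_long = {"A", "B", "C"}
curr_long = {"B", "C", "D"}
turnover = len(prev_long ^ curr_long) / len(prev_long | curr_long)  # 2 / 4

print(perfect_ic, round(icir, 3), hit_rate, turnover)
```

Note that with the union in the denominator, replacing one name out of three gives a turnover of 0.5 rather than 1/3, so these figures read higher than a one-sided "fraction of names replaced" measure.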
%==========%
X. Visualization (report/plots.py):
Five Plotly figures are produced by the report module and embedded in the Streamlit dashboard. Cumulative spread returns: overlaid lines for each backtested signal, making it easy to compare compounded long-short performance. IC bar chart: monthly IC bars coloured green/red with a 6-month rolling mean overlay — the standard factor research diagnostic. Quantile bar chart: mean forward return per signal quantile, confirming monotone relationship between signal and returns. IC heatmap: calendar heatmap of monthly IC coloured by diverging red-yellow-green scale, surfacing seasonality. Factor exposure heatmap: z-scored signal values across stocks (rows) and signals (columns), produced via a RdBu diverging colour scale.
# plots.py: IC bar chart
import plotly.graph_objects as go


def plot_ic_bar(result, title=None) -> go.Figure:
    ic = result.ic_series.dropna()
    rolling = ic.rolling(6).mean()
    # Green bars for non-negative IC, red for negative.
    colors = ["#00CC96" if v >= 0 else "#EF553B" for v in ic.values]
    fig = go.Figure()
    fig.add_trace(go.Bar(x=ic.index, y=ic.values, name="Monthly IC",
                         marker_color=colors, opacity=0.7))
    fig.add_trace(go.Scatter(x=rolling.index, y=rolling.values,
                             name="6-month rolling IC",
                             line=dict(color="#636EFA", width=2)))
    fig.add_hline(y=0, line_dash="dot", line_color="grey")
    fig.update_layout(title=title or f"IC: {result.signal_name}",
                      template="plotly_white")
    return fig
%==========%
XI. CLI — cli.py:
Four subcommands share a common --tickers / --db interface. All subcommands work fully offline once signals fetch has populated the DuckDB database.
# Install
pip install -e ".[dev]"
# Fetch SEC fundamentals + yfinance prices for 20 large-caps
signals fetch --tickers "AAPL,MSFT,GOOGL,AMZN,META,JPM,BAC,WFC,JNJ,UNH,XOM,CVX,PG,KO,PEP,HD,LOW,CAT,GE,MMM"
# Show all signal values as of a specific date
signals build --as-of 2024-06-30
# Rank universe by earnings yield, show quintiles
signals rank --signal earnings_yield --quantiles 5
# Run rolling IC backtest for ROA over 5 years, 21-day forward returns
signals backtest --signal return_on_assets --start 2019-01-01 --horizon 21
# Launch Streamlit dashboard
streamlit run src/signals_engine/app.py
| Command | Key options | Output |
|---|---|---|
| signals fetch | --tickers, --lookback, --cik-map | Populates DuckDB with fundamentals + prices |
| signals build | --as-of | Rich table of all signal values for the universe |
| signals rank | --signal, --quantiles | Quantile breakdown with tickers and mean signal values |
| signals backtest | --signal, --start, --end, --horizon | Mean IC, ICIR, hit rate, cumulative spread, turnover |
%==========%
XII. Test Suite:
All tests are fully offline. The shared fixtures in conftest.py generate a deterministic 756-day price matrix (geometric random walk, seed 42, 10 tickers) and a 5-year annual fundamentals table with synthetic but internally consistent field values. Tests verify mathematical invariants: earnings yield positivity for profitable firms, accruals sign-flip identity, leverage sign-flip identity, asset growth bounds, IC bounds (−1 to +1), IC = 1 for perfectly correlated signal, spread = long − short identity, hit rate in [0, 1], and that a predictive signal produces positive mean IC over 36 periods.
# conftest.py
from datetime import date

import numpy as np
import pandas as pd
import pytest

# _TICKERS (10 symbols) and the trade_dates fixture are defined elsewhere in conftest.py.


@pytest.fixture(scope="session")
def prices_df(trade_dates) -> pd.DataFrame:
    """Deterministic 756-day price matrix: geometric random walk, seed 42."""
    rng = np.random.default_rng(42)
    log_returns = rng.normal(0.0003, 0.015, size=(756, 10))
    start_prices = rng.uniform(50, 500, size=10)
    prices = start_prices * np.exp(np.cumsum(log_returns, axis=0))
    return pd.DataFrame(prices, index=trade_dates, columns=_TICKERS)


@pytest.fixture(scope="session")
def fund_df() -> pd.DataFrame:
    """Synthetic 5-year annual fundamentals: 10 tickers, all fields populated."""
    rng = np.random.default_rng(42)
    rows = []
    for ticker in _TICKERS:
        rev_base = rng.uniform(1e9, 5e10)
        for year in range(2019, 2024):
            rev = rev_base * (1 + rng.uniform(-0.05, 0.20))
            ni = rev * rng.uniform(0.05, 0.25)
            ta = rev_base * rng.uniform(2, 5)
            rows.append({
                "ticker": ticker, "period_end": date(year, 12, 31),
                "revenue": rev, "net_income": ni,
                "eps_basic": ni / rng.integers(int(5e8), int(2e9)),  # integer share count
                "total_assets": ta,
                "total_debt": ta * rng.uniform(0.1, 0.5),
                "operating_cash_flow": ni * rng.uniform(0.8, 1.4),
                # ... remaining fields elided in this excerpt
            })
    return pd.DataFrame(rows)


# test_backtest.py: key invariants (methods excerpted from a test class)
def test_spread_return_identity(self):
    sig, ret = self._make_panels()
    res = run_backtest(sig, ret, "test_signal")
    diff = (res.long_returns - res.short_returns - res.spread_returns).abs()
    assert (diff < 1e-10).all()


def test_predictive_signal_positive_mean_ic(self):
    """A signal with genuine predictive power produces positive mean IC."""
    sig, ret = self._make_panels(n_dates=36, n_stocks=50)
    res = run_backtest(sig, ret, "test_signal")
    assert res.mean_ic > 0.0
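The tests above call a `_make_panels` helper that is not shown. One plausible way to build such panels, offered purely as a hypothetical sketch, is to generate a random signal and define returns as that signal plus noise, so the predictive link is known by construction. The function name `make_panels`, the noise scale, and the seed are all assumptions.

```python
import numpy as np
import pandas as pd


def make_panels(n_dates: int = 36, n_stocks: int = 50,
                noise: float = 1.0, seed: int = 0):
    """Return (signal, returns) panels, dates × tickers, with a built-in link."""
    rng = np.random.default_rng(seed)
    dates = pd.date_range("2020-01-01", periods=n_dates, freq="MS")
    tickers = [f"S{i:02d}" for i in range(n_stocks)]
    signal = pd.DataFrame(rng.normal(size=(n_dates, n_stocks)),
                          index=dates, columns=tickers)
    # Returns load on the signal, plus independent noise of comparable size.
    returns = 0.02 * signal + 0.02 * noise * rng.normal(size=signal.shape)
    return signal, returns


sig, ret = make_panels()

# Per-date Spearman IC, computed as Pearson correlation of cross-sectional ranks.
ics = pd.Series({d: sig.loc[d].rank().corr(ret.loc[d].rank()) for d in sig.index})
mean_ic = ics.mean()
print(round(mean_ic, 3), ics.between(-1, 1).all())
```

With signal and noise on equal footing, the per-date IC sits well above zero, so a positive mean IC over 36 dates is a safe assertion rather than a statistically fragile one.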
%==========%
XIII. Configuration & Data Sources:
Environment variables:
| Variable | Default | Description |
|---|---|---|
| DATA_SOURCE | yfinance | Primary price source: yfinance or alpha_vantage |
| AV_API_KEY | (none) | Required only when DATA_SOURCE=alpha_vantage |
| DB_PATH | data/signals.duckdb | DuckDB database path for fundamentals and prices |
| EDGAR_SLEEP | 0.12 | Seconds between EDGAR API requests (SEC fair-use: ≤10 req/s) |
Data sources:
| Data | Source | Notes |
|---|---|---|
| Annual financial statements | SEC EDGAR XBRL (data.sec.gov) | Free, no API key. CIK lookup via sec.gov/files/company_tickers.json. |
| Adjusted daily close prices | Yahoo Finance via yfinance | Free, no API key. Any exchange-listed symbol supported. |
| Adjusted daily close prices (alt) | Alpha Vantage | Free tier: 25 req/day. Requires AV_API_KEY. |
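As an illustration of how these variables and their defaults might be read, here is a minimal standard-library sketch. The `Settings` class and `load_settings` function are hypothetical names, not the project's actual configuration code; only the variable names and defaults come from the table above.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    data_source: str
    av_api_key: str | None
    db_path: str
    edgar_sleep: float


def load_settings(env: Mapping[str, str] | None = None) -> Settings:
    """Read the four variables from env (defaults to os.environ)."""
    env = os.environ if env is None else env
    source = env.get("DATA_SOURCE", "yfinance")
    if source not in ("yfinance", "alpha_vantage"):
        raise ValueError(f"unknown DATA_SOURCE: {source!r}")
    api_key = env.get("AV_API_KEY") or None  # empty string counts as unset
    if source == "alpha_vantage" and api_key is None:
        raise ValueError("AV_API_KEY is required when DATA_SOURCE=alpha_vantage")
    return Settings(
        data_source=source,
        av_api_key=api_key,
        db_path=env.get("DB_PATH", "data/signals.duckdb"),
        edgar_sleep=float(env.get("EDGAR_SLEEP", "0.12")),
    )


defaults = load_settings({})
custom = load_settings({"EDGAR_SLEEP": "0.5", "DB_PATH": "/tmp/test.duckdb"})
print(defaults)
print(custom)
```

Passing the environment as a mapping keeps the loader testable without touching the real process environment.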
Team:
Theodosios Dimitrasopoulos, personal project.
Tools & methods:
Python 3.11, pandas, NumPy, SciPy, statsmodels, scikit-learn, Pydantic v2 (schema validation), DuckDB (local OLAP storage), Typer (CLI), rich (console output), Plotly (interactive figures), Streamlit (dashboard), SEC EDGAR XBRL APIs (fundamentals), yfinance / Alpha Vantage (prices), pytest, ruff, hatchling. Factor methodology: earnings yield (Basu 1977), asset growth anomaly (Cooper et al. 2008), accruals quality (Sloan 1996), ROA profitability (Fama & French 2015), leverage (Penman et al. 2007), sales growth, 12-1 momentum (Jegadeesh & Titman 1993), 1-month reversal (Jegadeesh 1990), low-volatility anomaly (Ang et al. 2006); cross-sectional z-score composite signals; IC/ICIR rolling backtest; quantile long-short spread performance attribution.