Projects — Theo Dimitrasopoulos

Gradient Boosting Alpha:

A cross-sectional return forecasting system that trains XGBoost and LightGBM ensembles on a panel of fundamental, technical, and momentum features to predict next-month stock returns, ranked across a user-defined equity universe. Unlike neural networks, gradient boosted trees handle mixed tabular data natively, are robust to irrelevant inputs, and support SHAP values that decompose each prediction into factor-level contributions, which gives the model the interpretability practitioners require before trusting an alpha signal in production. The feature set spans 18 factors across value (earnings yield, book yield), quality (ROE, gross profitability, accruals, asset growth, leverage), momentum (1-, 3-, 6-, 12-month cumulative returns, 12-1 cross-sectional momentum, 52-week high ratio, short-term reversal), risk (realised and idiosyncratic volatility), and liquidity (average dollar volume). Cross-sectional rank normalisation is applied independently at each month-end rebalance to eliminate scale differences and suppress outliers. A walk-forward expanding-window protocol ensures that no future distributional information contaminates normalisation or model fitting at any point in the history. Built on Python 3.11+ using pandas, numpy, scipy, scikit-learn, xgboost, lightgbm, shap, plotly, streamlit, duckdb, and pydantic v2; packaged with hatchling and tested with pytest against deterministic seed-42 fixtures.

%==========%

I. Interactive Dashboard:

The dashboard below runs entirely in the browser via stlite (Streamlit on WebAssembly — no server required). It uses a synthetic universe generated with a seed-42 random state; all factor, SHAP, and portfolio logic runs client-side in Pyodide. Sidebar controls let you vary the random seed, history length, model type, and universe size to explore how the charts respond to different simulation regimes. First load downloads Pyodide and may take 20–40 seconds; subsequent loads are cached.

%==========%

II. Project Layout:


gradient-boosting/
├── pyproject.toml                              # Build config, deps, ruff + pytest settings
├── .env.example                                # DB_PATH, FRED_API_KEY, EDGAR_SLEEP
├── data/                                       # Populated by scripts/download_data.py
│   ├── gbm.duckdb                              # DuckDB: fundamentals + features tables
│   └── prices.csv                              # Wide (date × ticker) adjusted-close CSV
├── scripts/
│   └── download_data.py                        # EDGAR + yfinance → DuckDB
├── src/gradient_boosting/
│   ├── data/
│   │   ├── schemas.py                          # Pydantic v2: FeatureRecord, FundamentalsRecord, PortfolioStats
│   │   ├── fetchers.py                         # yfinance prices, EDGAR fundamentals, FRED macro
│   │   └── store.py                            # DuckDB init, upsert, read for fundamentals + features
│   ├── features/
│   │   ├── engineering.py                      # 18 factor builders (value, quality, momentum, risk, liquidity)
│   │   └── normalise.py                        # winsorise, rank_normalise, normalise_panel
│   ├── model/
│   │   ├── trainer.py                          # PredictionRow, WalkForwardResult, walk_forward_train
│   │   └── shap_analysis.py                    # compute_shap, aggregate_shap_importance, rolling_shap_importance
│   ├── backtest/
│   │   └── portfolio.py                        # evaluate_portfolio, compute_decile_returns, signal_decay
│   ├── report/
│   │   └── plots.py                            # Plotly: SHAP bar, rolling SHAP, decile returns, IC heatmap, equity curve
│   ├── cli.py                                  # Typer CLI: fetch | build | train | decay
│   └── app.py                                  # Streamlit: 5 tabs (SHAP, Deciles, IC, Equity, Decay)
└── tests/
    ├── conftest.py                             # Seed-42 price matrix + synthetic fundamentals fixtures
    ├── test_features.py                        # Factor invariants + normalisation unit tests
    ├── test_model.py                           # Walk-forward + portfolio evaluation invariants
    └── test_portfolio.py                       # Decile returns, equity curve, signal decay tests

%==========%

III. Data Sources:

Fundamental features come from SEC EDGAR XBRL APIs (data.sec.gov/api/xbrl/companyfacts/), which serve structured JSON for every 10-K filing without requiring an API key. The module requests six financial statement fields per company — revenue, net income, total assets, stockholders’ equity, operating cash flow, and gross profit — using the us-gaap taxonomy filtered to FY annual periods from 10-K filings. Multiple tag-name variants are tried in order to accommodate different XBRL reporting conventions across filers. The SEC enforces a fair-use rate limit of ten requests per second; the module sleeps 0.12 seconds between calls.

Price-based features are computed from adjusted daily closes fetched via yfinance. Dollar volume (price × volume) feeds the average dollar volume liquidity factor. FRED macro conditioning variables (the 10Y−2Y Treasury spread, the CBOE VIX, and the Moody’s Baa minus 10-year Treasury credit spread) are fetched via fredapi with a free API key and joined to the feature panel as cross-sectional constants, that is, regime overlays rather than firm-level features. Forward returns are computed as \(r_{t,h} = P_{t+h}/P_t - 1\) and aligned to the signal date \(t\) by shifting the future price back by \(h\) trading days, so each row pairs the features known at \(t\) with the return earned after \(t\).


# fetchers.py: EDGAR extraction with multi-variant tag fallback
import pandas as pd

# Candidate us-gaap tags per field, tried in order until one yields rows
_XBRL_TAGS = {
    "revenue":             ["Revenues", "RevenueFromContractWithCustomerExcludingAssessedTax"],
    "net_income":          ["NetIncomeLoss"],
    "total_assets":        ["Assets"],
    "stockholders_equity": ["StockholdersEquity",
                            "StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest"],
    "operating_cash_flow": ["NetCashProvidedByUsedInOperatingActivities"],
    "gross_profit":        ["GrossProfit"],
}

def _extract_tag(facts: dict, tag_variants: list[str]) -> pd.DataFrame:
    """Return annual 10-K values for the first tag variant that has data."""
    for tag in tag_variants:
        try:
            entries = facts["facts"]["us-gaap"][tag]["units"]["USD"]
        except KeyError:
            continue  # this filer does not report the tag; try the next variant
        rows = [
            {"period_end": e["end"], "value": float(e["val"])}
            for e in entries
            if e.get("form") == "10-K" and e.get("fp") == "FY"
        ]
        if rows:
            # A period can appear in several filings; keep the last listed entry
            return pd.DataFrame(rows).drop_duplicates("period_end", keep="last")
    # No variant matched: return an empty frame with the expected columns
    return pd.DataFrame(columns=["period_end", "value"])
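The forward-return alignment described in this section can be sketched as follows. This is a minimal illustration rather than the package's code: it assumes a wide date × ticker price frame like prices.csv, and the function name `forward_returns` is hypothetical.

```python
import pandas as pd

def forward_returns(prices: pd.DataFrame, h: int) -> pd.DataFrame:
    """r_{t,h} = P_{t+h} / P_t - 1, stored on the row of signal date t."""
    # shift(-h) moves the price h rows ahead back onto row t;
    # the last h rows have no future price and become NaN
    return prices.shift(-h) / prices - 1.0

# Hypothetical toy prices for one ticker
dates = pd.bdate_range("2024-01-01", periods=5)
prices = pd.DataFrame({"AAA": [100.0, 102.0, 101.0, 105.0, 110.0]}, index=dates)
fwd = forward_returns(prices, h=2)
```

Rows near the end of the history carry NaN targets, which is why a training step has to drop observations whose forward return is not yet known.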

%==========%

IV. Feature Matrix — 18 Alpha Factors:

The feature matrix is engineered at each month-end rebalance date using only information available at or before that date. Fundamental factors use each company’s most recently filed 10-K with period_end ≤ as_of — the same look-ahead-free construction used in the signals engine. Price factors are computed from the trailing window ending on the last trading day at or before as_of.
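The as-of selection can be sketched as below. This is an illustration under assumptions, not the package's implementation: the helper `latest_as_of` and the toy frame are hypothetical. Because a 10-K is filed some weeks after its fiscal period ends, the sketch also shows a stricter variant keyed on a filing date. That variant assumes a `filed` column is stored, which the extraction shown in the Data Sources section (keeping only `end` and `val`) does not do.

```python
import pandas as pd

def latest_as_of(fund: pd.DataFrame, as_of: str, date_col: str) -> pd.DataFrame:
    """Most recent row per ticker whose `date_col` is on or before `as_of`."""
    eligible = fund[fund[date_col] <= pd.Timestamp(as_of)]
    # Sort so that tail(1) picks the latest eligible row for each ticker
    return eligible.sort_values(date_col).groupby("ticker").tail(1).reset_index(drop=True)

# Hypothetical toy data: one ticker, two fiscal years
fund = pd.DataFrame({
    "ticker": ["AAA", "AAA"],
    "period_end": pd.to_datetime(["2022-12-31", "2023-12-31"]),
    "filed": pd.to_datetime(["2023-02-20", "2024-02-25"]),  # assumed column
    "net_income": [10.0, 12.0],
})

by_period = latest_as_of(fund, "2024-01-31", date_col="period_end")
by_filing = latest_as_of(fund, "2024-01-31", date_col="filed")
```

On 2024-01-31 the period-end rule already selects fiscal 2023, while the filing-date rule still selects fiscal 2022 because the newer report has not been filed yet. Which rule is appropriate depends on how strictly point-in-time the backtest needs to be.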

Category	Factor	Formula	Rationale
Value	Earnings Yield	\(NI / TA\)	Asset-level earnings yield proxy; higher = cheaper relative to asset base (Basu 1977).
Value	Book Yield	\(Equity / TA\)	Book value density; penalises high-leverage balance sheets.
Quality	ROE	\(NI / Equity\)	Profitability relative to shareholders’ capital (Fama & French 2015).
Quality	Gross Profitability	\(GP / TA\)	Operating efficiency signal robust to accounting accruals (Novy-Marx 2013).
Quality	Accruals	\(-(NI - OCF) / TA\)	Earnings quality: cash-backed income predicts higher future returns (Sloan 1996).
Quality	Asset Growth	\(-(TA_t - TA_{t-1})/TA_{t-1}\)	Negated: aggressive balance-sheet expansion predicts lower returns (Cooper et al. 2008).
Quality	Leverage	\(-Debt/TA\)	Negated debt ratio; low-leverage firms earn a persistent premium (Penman et al. 2007).
Momentum	1-month Return	\(P_t/P_{t-21}-1\)	Short-term price change; its negation is the short-term reversal factor below.
Momentum	3-month Return	\(P_t/P_{t-63}-1\)	Medium-term momentum.
Momentum	6-month Return	\(P_t/P_{t-126}-1\)	Semi-annual momentum.
Momentum	12-month Return	\(P_t/P_{t-252}-1\)	Annual cumulative return.
Momentum	12−1 Momentum	\(P_{t-21}/P_{t-252}-1\)	Jegadeesh & Titman (1993): 12-month formation skipping the 1-month reversal period.
Momentum	52-week High Ratio	\(P_t / \max(P_{t-252\ldots t})\)	George & Hwang (2004): nearness to the 52-week high predicts continuation.
Momentum	Short-term Reversal	\(-(P_t/P_{t-21}-1)\)	Negated 1-month return; recent losers tend to bounce (Jegadeesh 1990).
Risk	Realised Volatility	\(\sigma_{\log r} \times \sqrt{252}\)	Annualised trailing 63-day vol; low-vol anomaly (Ang et al. 2006).
Risk	Idiosyncratic Volatility	\(\sigma_{\epsilon} \times \sqrt{252}\)	Residual vol after removing equal-weight market return; Ang et al. (2006).
Liquidity	Avg. Dollar Volume	\(\overline{P \cdot V}_{21d}\)	21-day mean daily dollar volume; small/illiquid stocks earn a liquidity premium.
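A few of the price factors in the table can be sketched directly from their formulas. This is a minimal illustration, assuming a wide date × ticker price frame with at least 253 rows; the function `momentum_sketch` and the column names `mom_12m` and `mom_12_1` are hypothetical, while `mom_1m`, `reversal_1m`, and `high_52w` follow the names used in the test suite.

```python
import numpy as np
import pandas as pd

def momentum_sketch(prices: pd.DataFrame) -> pd.DataFrame:
    """Selected price factors evaluated on the last row of `prices`."""
    p_t = prices.iloc[-1]
    p_21 = prices.iloc[-22]     # 21 trading days before the last row
    p_252 = prices.iloc[-253]   # 252 trading days before the last row
    out = pd.DataFrame({
        "mom_1m": p_t / p_21 - 1.0,
        "mom_12m": p_t / p_252 - 1.0,
        "mom_12_1": p_21 / p_252 - 1.0,              # skips the most recent month
        "high_52w": p_t / prices.iloc[-253:].max(),  # window includes day t
    })
    out["reversal_1m"] = -out["mom_1m"]
    return out

# Synthetic geometric random walk, similar in spirit to the test fixtures
rng = np.random.default_rng(42)
log_rets = rng.normal(0.0003, 0.015, (300, 3))
prices = pd.DataFrame(100.0 * np.exp(np.cumsum(log_rets, axis=0)),
                      columns=["AAA", "BBB", "CCC"])
factors = momentum_sketch(prices)
```

The three return windows are linked by \((1 + r_{12m}) = (1 + r_{12-1})(1 + r_{1m})\), which makes a convenient sanity check on the window offsets.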

%==========%

V. Cross-Sectional Rank Normalisation (`features/normalise.py`):

Raw factor values span very different units and scales: dollar volume is measured in billions, accruals in fractions of assets, and momentum as simple returns. Before training the gradient boosted model, each factor is normalised in two steps, independently within each month-end cross-section:

Winsorisation at ±3σ: the factor values across the universe are clipped to mean ± 3 standard deviations, suppressing the influence of extreme outliers without dropping stocks from the universe.
Percentile ranking: winsorised values are converted to cross-sectional percentile ranks using pd.Series.rank(pct=True). With \(n\) stocks and no ties, this maps the highest value to 1.0 and the lowest to \(1/n\), so ranks lie in (0, 1] regardless of the factor’s natural scale.

Normalisation is applied cross-sectionally at each date and never uses information from future dates, ensuring that the model sees the same relative ordering signal regardless of market regime or raw factor level. The same normalisation pipeline applied to the training set is applied identically to the test-month features in the walk-forward loop, using each month’s own cross-sectional distribution — so the rank of a stock in month T is computed using only the other stocks observed in month T.


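# normalise.py
import numpy as np
import pandas as pd

# _FEATURE_COLS (the list of factor column names) is defined elsewhere in the package.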
def winsorise(s: pd.Series, n_sigma: float = 3.0) -> pd.Series:
    """Clip to mean ± n_sigma * std; preserves NaN."""
    valid = s.dropna()
    if len(valid) < 3:
        return s
    mu, sigma = valid.mean(), valid.std()
    if sigma == 0 or np.isnan(sigma):
        return s
    return s.clip(mu - n_sigma * sigma, mu + n_sigma * sigma)

def rank_normalise(df: pd.DataFrame, feature_cols: list[str] | None = None) -> pd.DataFrame:
    """Winsorise + percentile-rank each feature column cross-sectionally."""
    cols = feature_cols or _FEATURE_COLS
    out = df.copy()
    for col in cols:
        if col not in out.columns:
            continue
        s = winsorise(out[col].copy())
        out[col] = s.rank(pct=True, na_option="keep")
    return out

def normalise_panel(panel: pd.DataFrame, date_col: str = "month_end") -> pd.DataFrame:
    """Apply rank_normalise independently at each date in the panel."""
    parts = [rank_normalise(grp) for _, grp in panel.groupby(date_col)]
    return pd.concat(parts, ignore_index=True) if parts else panel.iloc[0:0].copy()

Why rank normalisation rather than z-scoring? Z-scoring preserves the relative distances between stocks, which leaves the inputs sensitive to outliers even after winsorisation. Rank normalisation discards everything about the cross-sectional distribution except ordering, which makes the model robust to fat-tailed factor distributions and consistent across regimes with different dispersion levels. Tree splits depend only on the ordering of each feature, so within a cross-section the percentile rank is a monotone transformation that leaves the set of possible splits unchanged. What is given up is the raw level of a factor across dates, which is exactly the standardisation intended here.
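The contrast between ranks and z-scores can be checked on a tiny cross-section. The values below are hypothetical and chosen only to include one extreme outlier.

```python
import numpy as np
import pandas as pd

raw = pd.Series([0.5, 2.0, 10.0, 1e6, 3.0])   # one extreme outlier

pct_raw = raw.rank(pct=True)           # 0.2, 0.4, 0.8, 1.0, 0.6
pct_log = np.log(raw).rank(pct=True)   # a strictly increasing transform of raw

# z-scores keep distances, so the outlier compresses everything else
zscore = (raw - raw.mean()) / raw.std()
spread_of_rest = zscore.drop(3).max() - zscore.drop(3).min()
```

The ranks are identical before and after the log transform, and the lowest rank is 1/n = 0.2 rather than 0. Under z-scoring, the four ordinary stocks end up almost indistinguishable from each other because the outlier dominates both the mean and the standard deviation.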

%==========%

VI. Walk-Forward Validation (`model/trainer.py`):

The walk-forward expanding-window protocol suits cross-sectional return prediction because it mirrors the deployment setting: a model trained on all past data makes predictions for the next month, and is then re-trained with that month included before predicting the month after.

For prediction month \(T\) (starting after min_train_months of history, default 12):

Training set: all stock-month observations with month_end < T that have non-NaN target \(r_{t+1}\).
Test set: all stocks observed at month \(T\) (target unknown at prediction time).
A fresh model is fit on the training set and used to predict scores for the test set.
Predicted scores are stored as PredictionRow objects alongside the realised return for later evaluation.

No cross-sectional normalisation is recomputed across time; each month’s rank-normalised features are independent, so the training set correctly contains the pre-normalised values from each of the historical cross-sections.


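# trainer.py
import numpy as np
import pandas as pd

# PredictionRow, WalkForwardResult, _build_model and _FEATURE_COLS are defined elsewhere in the package.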
def walk_forward_train(
    panel: pd.DataFrame,
    date_col: str = "month_end",
    target_col: str = "fwd_ret_1m",
    model_type: str = "xgboost",
    min_train_months: int = 12,
) -> WalkForwardResult:
    feat_cols = [c for c in _FEATURE_COLS if c in panel.columns]
    dates = sorted(panel[date_col].unique())
    result = WalkForwardResult(model_type=model_type)
    importances: list[np.ndarray] = []

    for idx, pred_date in enumerate(dates):
        if idx < min_train_months:
            continue
        train = panel[panel[date_col] < pred_date].dropna(subset=feat_cols + [target_col])
        test  = panel[panel[date_col] == pred_date].dropna(subset=feat_cols)
        if len(train) < 30 or test.empty:
            continue

        model = _build_model(model_type)
        model.fit(train[feat_cols].values, train[target_col].values)
        preds = model.predict(test[feat_cols].values)

        if hasattr(model, "feature_importances_"):
            importances.append(model.feature_importances_)

        for i, (_, row) in enumerate(test.iterrows()):
            result.predictions.append(PredictionRow(
                month_end=pred_date, ticker=str(row["ticker"]),
                predicted_score=float(preds[i]),
                actual_return=row.get(target_col),
                model_type=model_type,
            ))

    if importances:
        imp_arr = np.array(importances)
        for j, col in enumerate(feat_cols):
            result.feature_importance[col] = imp_arr[:, j].tolist()

    return result

XGBoost hyperparameters: 300 trees, max depth 4, learning rate 0.05, subsample 0.8, column subsample 0.7, L2 regularisation 1.0. LightGBM uses the same settings with verbosity=-1. The shallow depth (4) and column subsampling limit how closely individual trees can fit noise in the training set. Ridge regression with α = 1 serves as a linear baseline on the same rank-normalised features.
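These settings might map onto the scikit-learn style constructors roughly as follows. This is a sketch under assumptions: the helper name `build_model_sketch` is hypothetical, "column subsample" is assumed to mean `colsample_bytree`, and the actual `_build_model` in trainer.py may differ. Library imports are deferred so that only the requested backend needs to be installed.

```python
# Shared tree settings taken from the text; parameter names follow the
# scikit-learn wrappers of XGBoost and LightGBM.
TREE_PARAMS = {
    "n_estimators": 300,
    "max_depth": 4,
    "learning_rate": 0.05,
    "subsample": 0.8,
    "colsample_bytree": 0.7,  # assumption: per-tree column subsampling
    "reg_lambda": 1.0,        # L2 regularisation
}

def build_model_sketch(model_type: str, seed: int = 42):
    """Return an unfitted regressor for 'xgboost', 'lightgbm', or 'ridge'."""
    if model_type == "xgboost":
        from xgboost import XGBRegressor
        return XGBRegressor(**TREE_PARAMS, random_state=seed)
    if model_type == "lightgbm":
        from lightgbm import LGBMRegressor
        return LGBMRegressor(**TREE_PARAMS, verbosity=-1, random_state=seed)
    if model_type == "ridge":
        from sklearn.linear_model import Ridge
        return Ridge(alpha=1.0)
    raise ValueError(f"unknown model_type: {model_type!r}")
```

One detail worth checking when sharing settings across libraries: LightGBM only applies row subsampling when `subsample_freq` is set to a positive value, so identical dictionaries do not guarantee identical behaviour.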

%==========%

VII. SHAP Value Attribution (`model/shap_analysis.py`):

SHAP (SHapley Additive exPlanations, Lundberg & Lee 2017) decomposes each model prediction into additive feature contributions grounded in cooperative game theory. For a prediction \(\hat{y}_i\), the base value \(\phi_0\) equals the mean model output over the reference data, and each feature \(j\) contributes \(\phi_{j,i}\) such that:

\[\hat{y}_i = \phi_0 + \sum_{j=1}^{p} \phi_{j,i}\]

For tree ensembles, the TreeSHAP algorithm (Lundberg et al. 2018) computes exact Shapley values in polynomial time by exploiting the tree structure. The key properties that make SHAP more useful than raw gain- or impurity-based importance for financial attribution are:

Signed values: positive \(\phi_{j,i}\) means feature \(j\) pushed stock \(i\)’s predicted score up. A portfolio manager can therefore say “stock X was scored high primarily because of its momentum rank”.
Consistency: if the model changes so that feature \(j\)’s marginal contribution increases or stays the same for every combination of the other features, its attribution \(\phi_{j,i}\) does not decrease. Gain-based importance has no such guarantee.
Local accuracy: for each stock, the SHAP values sum to the prediction minus the base value \(\phi_0\), so there is no unexplained residual.
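The local-accuracy property can be verified by brute force on a toy model. The sketch below is conceptual only: it enumerates feature coalitions directly instead of using TreeSHAP, it measures contributions against a single baseline point instead of the mean model output, and the model `toy_model` with its two inputs is hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x relative to one baseline point."""
    p = len(x)

    def value(subset):
        # Features in `subset` take their value from x, the rest from baseline
        z = [x[j] if j in subset else baseline[j] for j in range(p)]
        return f(z)

    phi = []
    for j in range(p):
        others = [k for k in range(p) if k != j]
        total = 0.0
        for size in range(p):
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for s in combinations(others, size):
                total += weight * (value(set(s) | {j}) - value(set(s)))
        phi.append(total)
    return phi

def toy_model(z):
    momentum, low_vol = z
    # Momentum matters more when the low-volatility rank is high
    return 0.02 * momentum + 0.04 * momentum * low_vol

x = [0.9, 0.8]          # stock being explained (rank-normalised inputs)
baseline = [0.5, 0.5]   # reference point
phi0 = toy_model(baseline)
phi = shapley_values(toy_model, x, baseline)
residual = phi0 + sum(phi) - toy_model(x)
```

Here the base value plus the two contributions reproduces the prediction up to floating-point error, and the interaction term is split between the two features rather than assigned to one of them.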

In this module, SHAP values are computed for every out-of-sample prediction using shap.TreeExplainer. Rolling mean absolute SHAP importance (averaged over a 6-month window) tracks factor regime shifts — for example, whether momentum importance rose during 2020–2021 trend environments and fell during the 2022 factor rotation.


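# shap_analysis.py
import numpy as np
import pandas as pd
import shap

# ShapResult and _FEATURE_COLS are defined elsewhere in the package.

def compute_shap(model, X: pd.DataFrame) -> ShapResult:
    """TreeSHAP for a fitted XGBoost or LightGBM model."""
    explainer = shap.TreeExplainer(model)
    sv = explainer.shap_values(X)
    return ShapResult(
        shap_values=pd.DataFrame(sv, columns=X.columns, index=X.index),
        feature_names=list(X.columns),
        # expected_value may be a scalar or a length-1 array depending on the model
        base_value=float(np.ravel(explainer.expected_value)[0]),
    )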

def aggregate_shap_importance(shap_df: pd.DataFrame) -> pd.Series:
    """Mean |SHAP| per feature across all observations; sorted descending."""
    feat_cols = [c for c in shap_df.columns if c in _FEATURE_COLS]
    return shap_df[feat_cols].abs().mean().sort_values(ascending=False)

def rolling_shap_importance(shap_by_date: dict, window: int = 6) -> pd.DataFrame:
    """Sliding-window mean-|SHAP| importance: dates as index, features as columns."""
    dates = sorted(shap_by_date.keys())
    rows = {}
    for i, d in enumerate(dates):
        start = max(0, i - window + 1)
        frames = [shap_by_date[dates[j]].shap_values for j in range(start, i + 1)]
        rows[d] = aggregate_shap_importance(pd.concat(frames, ignore_index=True))
    return pd.DataFrame(rows).T

%==========%

VIII. Portfolio Construction & Evaluation (`backtest/portfolio.py`):

At each monthly rebalance date, stocks are ranked into deciles by their predicted return score. The long-short portfolio goes long the top decile (D10) and short the bottom decile (D1), equal-weighted within each leg. The spread return for month \(t\) is:

\[\text{Spread}_t = \bar{r}_{t,\,D10} - \bar{r}_{t,\,D1}\]

where \(\bar{r}_{t,\,D10}\) is the equal-weighted mean realised return of stocks in the top decile at month \(t\). Four metrics summarise performance:

Metric	Formula	Interpretation
IC	\(\rho_S(\hat{y}_t, r_{t+1})\)	Spearman rank correlation of predicted score with realised forward return. IC > 0 means directionally correct; a mean IC around 0.05 is commonly treated as economically meaningful for a diversified universe.
ICIR	\(\mu_{IC}/\sigma_{IC}\)	IC Information Ratio: consistency of the IC signal. ICIR > 0.5 is a common practitioner threshold for a reliable signal.
Sharpe	\(\mu_{\text{spread}} / \sigma_{\text{spread}} \times \sqrt{12}\)	Annualised Sharpe ratio of the monthly long-short spread returns, before transaction costs.
Hit Rate	\(\Pr(IC_t > 0)\)	Fraction of months with positive IC. Above 55% suggests robust directional consistency.

Nonlinear interaction effects captured by the tree model — for example, whether momentum only predicts returns in low-volatility stocks — are not directly reported in a linear IC but are reflected in the SHAP values: momentum’s SHAP contribution will be larger for low-volatility stocks and near-zero for high-volatility stocks if such an interaction is learned.


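# portfolio.py
import numpy as np
import pandas as pd

# PortfolioEvaluation, _ic_at_date and compute_decile_returns are defined elsewhere in the module.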
def evaluate_portfolio(panel: pd.DataFrame, rf: float = 0.0) -> PortfolioEvaluation:
    ic_by_date = panel.groupby("month_end").apply(_ic_at_date).dropna()
    spread_by_date = {}
    for dt, grp in panel.groupby("month_end"):
        valid = grp.dropna(subset=["score", "actual_return"])
        if len(valid) < 10:
            continue
        q = pd.qcut(valid["score"], 10, labels=False, duplicates="drop") + 1
        long_ret  = valid.loc[q == q.max(), "actual_return"].mean()
        short_ret = valid.loc[q == q.min(), "actual_return"].mean()
        spread_by_date[dt] = long_ret - short_ret

    spread = pd.Series(spread_by_date).dropna()
    equity = (1 + spread.fillna(0)).cumprod()
    ann_ret = spread.mean() * 12
    ann_vol = spread.std() * np.sqrt(12)

    return PortfolioEvaluation(
        ic_series=ic_by_date,
        mean_ic=float(ic_by_date.mean()),
        icir=float(ic_by_date.mean() / ic_by_date.std()) if ic_by_date.std() > 0 else np.nan,
        sharpe=float((ann_ret - rf) / ann_vol) if ann_vol > 0 else np.nan,
        cum_return=float((1 + spread.fillna(0)).prod() - 1),
        hit_rate=float((ic_by_date > 0).mean()),
        decile_returns=compute_decile_returns(panel),
        equity_curve=equity,
    )
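evaluate_portfolio relies on a per-date IC helper that is not shown here. One possible version is sketched below. It is an assumption about what such a helper could look like, not the module's `_ic_at_date`; the name `ic_at_date`, the five-stock minimum, and the synthetic panel are all hypothetical. Spearman correlation is computed as the Pearson correlation of ranks, which needs only pandas.

```python
import numpy as np
import pandas as pd

def ic_at_date(grp: pd.DataFrame) -> float:
    """Spearman rank correlation between score and realised return."""
    valid = grp.dropna(subset=["score", "actual_return"])
    if len(valid) < 5:
        return np.nan  # too few names for a meaningful rank correlation
    return valid["score"].rank().corr(valid["actual_return"].rank())

# Synthetic panel: 24 months x 50 stocks with a weakly predictive score
rng = np.random.default_rng(42)
parts = []
for m in range(24):
    score = rng.normal(size=50)
    ret = 0.3 * score + rng.normal(size=50)
    parts.append(pd.DataFrame({"month_end": f"m{m:02d}", "score": score,
                               "actual_return": ret}))
panel = pd.concat(parts, ignore_index=True)

ic = pd.Series({m: ic_at_date(g) for m, g in panel.groupby("month_end")}).dropna()
mean_ic = ic.mean()
icir = ic.mean() / ic.std()
hit_rate = (ic > 0).mean()
```

The three summary statistics correspond to the IC, ICIR, and Hit Rate rows of the metrics table; on real predictions the IC would be far smaller than in this deliberately strong synthetic example.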

%==========%

IX. Signal Decay Analysis (`backtest/portfolio.py`):

Signal decay measures how quickly the predictive power of the month-0 scores diminishes over longer hold periods. For each hold horizon \(h \in \{1, 3, 6, 12\}\) months, the IC is computed between the scores formed at month 0 and the realised returns in month \(h\) (without rebalancing). A signal with a long half-life will have high IC at \(h = 6\) or \(h = 12\); a signal driven primarily by short-term momentum or reversal will collapse quickly to zero.

\[IC_h = \rho_S\bigl(\hat{y}_{t=0},\, r_{t=h}\bigr)\]

In practice, cross-sectional return signals often have IC half-lives of roughly 1–3 months. Momentum signals typically decay faster than value and quality factors because momentum is driven by transient price trends rather than persistent firm-level characteristics. This analysis is valuable for calibrating rebalancing frequency and estimating the maximum transaction-cost budget that can be absorbed while remaining profitable.


# portfolio.py: signal_decay
from collections.abc import Sequence

import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def signal_decay(panel: pd.DataFrame, hold_months: Sequence[int] = (1, 3, 6, 12)) -> pd.DataFrame:
    """Mean IC and ICIR between scores at one date and returns observed h dates later."""
    dates = sorted(panel["month_end"].unique())
    rows = []
    for h in hold_months:
        ics = []
        for i, d0 in enumerate(dates):
            j = i + h
            if j >= len(dates):
                break  # no observation h months ahead
            d_h = dates[j]
            scores = panel[panel["month_end"] == d0][["ticker", "score"]].set_index("ticker")
            rets = panel[panel["month_end"] == d_h][["ticker", "actual_return"]].set_index("ticker")
            # Keep only tickers present at both dates
            merged = scores.join(rets, how="inner").dropna()
            if len(merged) < 5:
                continue  # too few names for a stable rank correlation
            corr, _ = spearmanr(merged["score"], merged["actual_return"])
            ics.append(float(corr))
        ic_s = pd.Series(ics, dtype=float)
        rows.append({"hold_months": h, "mean_ic": ic_s.mean(),
                     "icir": ic_s.mean() / ic_s.std() if ic_s.std() > 0 else np.nan})
    return pd.DataFrame(rows).set_index("hold_months")
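A decay table of this shape can be turned into a rough half-life estimate. The sketch below is one possible approach rather than part of the package: it assumes the IC decays approximately exponentially with the horizon, fits a straight line to log IC, and ignores horizons with non-positive IC. The function name `ic_half_life` and the example numbers are hypothetical.

```python
import numpy as np
import pandas as pd

def ic_half_life(decay: pd.DataFrame) -> float:
    """Half-life in months from a log-linear fit of mean_ic on hold_months."""
    positive = decay[decay["mean_ic"] > 0]  # the log is undefined for IC <= 0
    if len(positive) < 2:
        return float("nan")
    h = positive.index.to_numpy(dtype=float)
    slope, _ = np.polyfit(h, np.log(positive["mean_ic"].to_numpy()), 1)
    # A non-negative slope means no measurable decay over these horizons
    return float(np.log(2) / -slope) if slope < 0 else float("inf")

# Hypothetical decay table built to halve every 3 months
decay = pd.DataFrame(
    {"mean_ic": [0.06 * 0.5 ** (h / 3) for h in (1, 3, 6, 12)]},
    index=pd.Index([1, 3, 6, 12], name="hold_months"),
)
half_life = ic_half_life(decay)
```

With only four horizons the estimate is coarse, so it is better read as an order of magnitude for the rebalancing frequency than as a precise figure.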

%==========%

X. Nonlinear Interaction Effects:

The principal advantage of gradient boosted trees over ridge regression on the same features is their ability to discover and exploit interaction effects that the linear model systematically ignores. In cross-sectional equity factor research, the most practically important interactions involve conditioning signals on a regime variable or a risk characteristic:

Momentum × Volatility: the strength of the momentum signal (12-1 month return) can depend on a stock’s volatility. High-volatility stocks have noisier price paths, so momentum in those stocks is more likely to reflect noise than trend (the related low-volatility anomaly is documented by Ang et al. 2006). Where the data support it, a tree model learns to assign high momentum SHAP only when the realized_vol feature is in a low rank; a linear model applies the same momentum coefficient to all volatility levels.
Value × Quality: the combination of cheap valuation (high earnings yield) and high quality (strong ROE, low accruals) is more predictive than either alone. The Piotroski F-score exploits a similar interaction. Tree splits discover these conjunctions naturally; in linear models they must be manually engineered as interaction terms.
Momentum × Reversal: very high 1-month returns predict a reversal, but very high 12-1 momentum predicts continuation. The relative magnitudes of the two signals determine which effect dominates, and the relationship is non-linear near the boundary.

These effects are invisible in a linear IC comparison but visible in SHAP values: stocks where an interaction boosted a prediction will have large positive SHAP from both interacting features simultaneously; stocks where only one condition is met will show large SHAP from one feature and near-zero from the other. Aggregating conditional SHAP values (e.g. “mean momentum SHAP for stocks in the bottom volatility decile vs top decile”) directly quantifies the interaction magnitude.
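The conditional aggregation described above can be sketched with a group-by on deciles of the conditioning feature. Everything here is illustrative: the helper `conditional_mean_abs_shap` is hypothetical, the column name `mom_12_1` is assumed, and the SHAP values are synthetic numbers constructed to contain a momentum × volatility interaction rather than output from a fitted model. The sketch averages absolute SHAP values, since signed values can cancel within a bucket.

```python
import numpy as np
import pandas as pd

def conditional_mean_abs_shap(shap_df: pd.DataFrame, features: pd.DataFrame,
                              shap_col: str, cond_col: str,
                              n_bins: int = 10) -> pd.Series:
    """Mean |SHAP| of `shap_col` within quantile bins of `cond_col`."""
    # Bin 1 holds the lowest values of the conditioning feature
    bins = pd.qcut(features[cond_col], n_bins, labels=False, duplicates="drop") + 1
    return shap_df[shap_col].abs().groupby(bins).mean()

# Synthetic rank-normalised features and SHAP values with a built-in interaction
rng = np.random.default_rng(0)
features = pd.DataFrame({
    "mom_12_1": rng.uniform(size=1000),
    "realized_vol": rng.uniform(size=1000),
})
shap_df = pd.DataFrame({
    "mom_12_1": (features["mom_12_1"] - 0.5) * (1.0 - features["realized_vol"]),
})

by_vol = conditional_mean_abs_shap(shap_df, features, "mom_12_1", "realized_vol")
ratio = by_vol.loc[1] / by_vol.loc[10]
```

A ratio well above 1 between the bottom and top volatility deciles is the signature of the interaction; a ratio near 1 would suggest momentum contributes similarly at all volatility levels.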

%==========%

XI. CLI — `cli.py`:

Four subcommands share a common --db interface pointing to the DuckDB database. All commands work fully offline once gbm fetch has populated the database.


# Install
pip install -e ".[dev]"

# Fetch EDGAR fundamentals + yfinance prices for 30 large-caps (~15 min)
gbm fetch --tickers "AAPL,MSFT,GOOGL,AMZN,META,JPM,BAC,WFC,JNJ,UNH,XOM,CVX,PG,KO,PEP,HD,LOW,CAT,GE,MMM,NVDA,AMD,QCOM,DIS,NFLX,GS,MS,BLK,PFE,LLY" --start 2015-01-01

# Display rank-normalised feature matrix for a specific date
gbm build --as-of 2024-06-30

# Walk-forward train XGBoost and print portfolio statistics
gbm train --model xgboost --start 2018-01-01

# Analyse signal decay over 1, 3, 6, and 12-month hold periods
gbm decay --model xgboost

# Launch Streamlit server-side dashboard
streamlit run src/gradient_boosting/app.py

Command	Key options	Output
`gbm fetch`	`--tickers`, `--start`, `--db`	Populates DuckDB with fundamentals + prices; saves `prices.csv`
`gbm build`	`--as-of`, `--db`	Rich table of rank-normalised factors for the universe on a given date
`gbm train`	`--model`, `--start`, `--end`, `--min-train`	Walk-forward portfolio stats: IC, ICIR, Sharpe, hit rate, cumulative return
`gbm decay`	`--model`, `--db`	IC decay table for hold periods of 1, 3, 6, and 12 months

%==========%

XII. Test Suite:

All tests are fully offline. The shared fixtures in conftest.py generate a deterministic 756-day price matrix (geometric random walk, seed 42, 10 tickers) and a 5-year annual fundamentals table with synthetic but internally consistent field values. Factor tests verify mathematical invariants: ROE is positive for profitable firms; accruals equals the negation of \((NI - OCF)/TA\); leverage is non-positive; book yield lies strictly in (0, 1) for solvent firms; the 52-week high ratio never exceeds 1.0; short-term reversal is exactly the negation of the 1-month momentum. Normalisation tests verify that rank output lies in [0, 1] and that the panel row count is preserved. Walk-forward tests verify that there is no data leakage (the first prediction date follows the min-train window), that output scores are finite, and that a strongly predictive synthetic signal produces positive mean IC over 36 periods. Portfolio tests verify that the final value of the equity curve equals 1 + cumulative return, that the hit rate lies in [0, 1], and that the 1-month IC is at least as high as the 12-month IC for a predictive signal.


# conftest.py — fixtures
@pytest.fixture(scope="session")
def prices_df(trade_dates) -> pd.DataFrame:
    rng = np.random.default_rng(42)
    log_rets = rng.normal(0.0003, 0.015, (_N_DAYS, len(_TICKERS)))
    starts   = rng.uniform(50, 500, len(_TICKERS))
    prices   = starts * np.exp(np.cumsum(log_rets, axis=0))
    return pd.DataFrame(prices, index=trade_dates, columns=_TICKERS)

# test_features.py — selected invariants
def test_reversal_is_negation_of_1m_mom(self, prices_df):
    moms = momentum_factors(prices_df, date(2023, 12, 31))
    pd.testing.assert_series_equal(
        moms["reversal_1m"].round(8),
        (-moms["mom_1m"]).rename("reversal_1m").round(8),
    )

def test_high_52w_bounded(self, prices_df):
    moms = momentum_factors(prices_df, date(2023, 12, 31))
    h52 = moms["high_52w"].dropna()
    assert (h52 <= 1.0).all(), "Price / 52w high must never exceed 1"

# test_model.py — no data leakage check
def test_no_data_leakage(self):
    panel = _make_panel()
    result = walk_forward_train(panel, model_type="ridge", min_train_months=6)
    dates  = sorted(panel["month_end"].unique())
    first_pred = result.predictions[0].month_end
    assert first_pred > dates[5], "First prediction must follow min_train_months"

# test_portfolio.py — signal decay monotonicity
def test_month1_ic_highest(self):
    panel = _make_pred_panel(n_dates=60, n_stocks=200, seed=42)
    df = signal_decay(panel, hold_months=[1, 3, 6, 12])
    assert df.loc[1, "mean_ic"] >= df.loc[12, "mean_ic"]

%==========%

XIII. Configuration & Setup:

Setup and launch (local):


cd assets/projects/gradient_boosting
python -m venv .venv && .venv\Scripts\Activate.ps1        # Windows (macOS/Linux: source .venv/bin/activate)
pip install -e ".[dev]"
cp .env.example .env                                       # add FRED_API_KEY if using macro features
python scripts/download_data.py                            # ~15–30 min for 30 tickers
gbm train --model xgboost
streamlit run src/gradient_boosting/app.py

Variable	Default	Description
`DB_PATH`	`data/gbm.duckdb`	DuckDB database path for fundamentals and features
`FRED_API_KEY`	(none)	Free FRED API key for macro conditioning variables (yield slope, VIX, credit spread)
`AV_API_KEY`	(none)	Optional Alpha Vantage key; `yfinance` is used by default for prices
`EDGAR_SLEEP`	`0.12`	Seconds between EDGAR requests (SEC fair-use: ≤10 req/s)

Data	Source	Notes
Annual financial statements	SEC EDGAR XBRL (`data.sec.gov`)	Free, no API key. CIK lookup via `sec.gov/files/company_tickers.json`.
Adjusted daily closes + volume	Yahoo Finance via `yfinance`	Free, no API key.
Macro conditioning (yield slope, VIX, credit spread)	FRED via `fredapi`	Free API key at `fred.stlouisfed.org`.

Team:

Theodosios Dimitrasopoulos, personal project.

Tools & methods:

Python 3.11, pandas, NumPy, SciPy, scikit-learn, XGBoost 2.x, LightGBM 4.x, SHAP (TreeExplainer), Pydantic v2, DuckDB, Typer, rich, Plotly, Streamlit, SEC EDGAR XBRL APIs, yfinance, FRED / fredapi, pytest, ruff, hatchling. Factor methodology: Basu (1977) earnings yield; Novy-Marx (2013) gross profitability; Sloan (1996) accruals; Cooper et al. (2008) asset growth; Penman et al. (2007) leverage; Fama & French (2015) profitability; Jegadeesh & Titman (1993) 12-1 momentum; George & Hwang (2004) 52-week high; Jegadeesh (1990) short-term reversal; Ang et al. (2006) low-volatility anomaly. Walk-forward expanding-window validation; cross-sectional rank normalisation; SHAP TreeExplainer attribution; IC/ICIR/Sharpe long-short evaluation; signal decay analysis.
