End-to-End Differentiable Portfolio Optimisation:

The standard quantitative pipeline is two decoupled stages: a model predicts expected returns, then a separate optimiser turns those forecasts into portfolio weights. The predictor is trained for forecast accuracy (mean-squared error) with no knowledge of the optimiser that consumes its output — so it spends capacity on names the optimiser barely uses, and the optimiser amplifies the forecast’s estimation error into corner solutions. This project replaces the seam with a differentiable convex optimisation layer: the mean-variance quadratic program is embedded as a layer in a neural network (via cvxpylayers), and the gradient of a decision-quality loss — the negative out-of-sample Sharpe ratio — flows back through the optimiser into the return-and-risk prediction head by implicitly differentiating the QP’s Karush-Kuhn-Tucker conditions. The predictor is then trained for decision quality rather than predictive accuracy. Crucially, the project does not merely assert the fashionable claim that this “systematically” wins; it tests it on controlled experiments and reports an honest, more interesting answer: the edge of end-to-end training lives in the risk model, not the return forecast. Built on Python 3.11+ with torch, cvxpy and cvxpylayers (differentiable QP), scikit-learn/shap for attribution, numpy, pandas, plotly, streamlit, duckdb, pydantic v2, and typer; packaged with hatchling and tested with pytest against deterministic seed-42 fixtures.


%==========%


I. Interactive Dashboard:

The dashboard below runs entirely in the browser via stlite (Streamlit on WebAssembly — no server). Because a differentiable QP needs cvxpylayers and PyTorch, the in-browser demo uses a transparent stand-in: a linear return head and the closed-form mean-variance solution (which is differentiable directly in NumPy), so the contrast between MSE training and Sharpe (decision-quality) training runs live. Drag the seed slider to see that the decision-aware model improves out-of-sample Sharpe in roughly four of five simulated universes by learning to shrink its covariance. First load downloads Pyodide and may take 20–40 seconds.


%==========%


II. Project Layout:

e2e-portfolio/
├── pyproject.toml                              # Build config, deps, ruff + pytest settings
├── .env.example                                # DB_PATH, SIGNALS_DB_PATH
├── dashboard.html                              # Self-contained stlite browser demo
├── scripts/
│   └── download_data.py                        # yfinance prices → factor panel (optional)
├── src/e2e_portfolio/
│   ├── data/
│   │   ├── synthetic.py                        # Factor → return DGP (priced + diversifying factors)
│   │   ├── schemas.py                          # Pydantic v2: FactorExposure, StrategyStats
│   │   ├── store.py                            # DuckDB factor-panel persistence
│   │   └── fetchers.py                         # yfinance prices, price-based factor builder
│   ├── models/
│   │   ├── qp_layer.py                         # MeanVarianceQP — DPP cvxpylayers QP + cvxpy fallback
│   │   ├── predictor.py                        # ReturnPredictor MLP → μ, low-rank Σ̂
│   │   ├── losses.py                           # MSE, negative-Sharpe, MV-utility losses
│   │   ├── train.py                            # train_decoupled, train_e2e, Ledoit-Wolf
│   │   └── decision_shrinkage.py               # Decision-optimised shrinkage experiment
│   ├── backtest/
│   │   └── walkforward.py                       # Out-of-sample backtest + run_comparison
│   ├── attribution/
│   │   └── saliency.py                          # Gradient + SHAP factor importance
│   ├── report/
│   │   └── plots.py                             # Plotly: loss curves, Sharpe bars, equity, importance
│   ├── cli.py                                  # Typer CLI: compare | shrink | dashboard
│   └── app.py                                  # Streamlit server-side dashboard
└── tests/                                      # Seed-42 fixtures; QP, training, backtest, shrinkage
  

%==========%


III. The Decoupled Pipeline and “Error Maximisation”:

The classical two-stage estimator solves two unrelated problems. Stage one fits a predictor \(\hat{\mu}_\theta(x)\) by minimising forecast error \(\frac{1}{N}\sum_i(\hat{\mu}_i - r_i)^2\). Stage two plugs \(\hat{\mu}\) and a covariance estimate \(\hat{\Sigma}\) into a mean-variance optimiser to get weights \(w^\star(\hat{\mu}, \hat{\Sigma})\). The flaw is that MSE weights every asset equally, but the optimiser does not: a small forecast error on a low-variance, low-correlation asset moves the portfolio far more than a large error on a name the optimiser would never hold. Worse, the mean-variance map is extraordinarily sensitive to its inputs — Michaud’s “error maximisation”: because \(w^\star \propto \Sigma^{-1}\mu\), estimation noise in \(\mu\) and (especially) in \(\Sigma\) is inverted and amplified into extreme, unstable, often corner-solution weights. Training the forecast to be accurate does nothing to protect against this; it is the wrong objective.
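
A two-asset toy case makes the amplification concrete. The sketch below is illustrative only: the function name and all numbers (20% volatility, 0.95 correlation, a half-point forecast error) are assumptions, and it uses the unconstrained rule \(w^\star \propto \Sigma^{-1}\mu\) rather than the project's constrained QP.

```python
import numpy as np

def unconstrained_mv_weights(mu, sigma):
    """Weights proportional to inv(sigma) @ mu, rescaled to sum to one."""
    raw = np.linalg.solve(sigma, mu)
    return raw / raw.sum()

# Two assets with equal volatility (20%) and correlation 0.95 (assumed values).
vol, rho = 0.20, 0.95
sigma = vol**2 * np.array([[1.0, rho], [rho, 1.0]])

mu_true = np.array([0.050, 0.050])    # identical expected returns
mu_noisy = np.array([0.055, 0.045])   # a 0.5-point forecast error on each name

w_true = unconstrained_mv_weights(mu_true, sigma)
w_noisy = unconstrained_mv_weights(mu_noisy, sigma)

print("weights from true forecast :", np.round(w_true, 2))
print("weights from noisy forecast:", np.round(w_noisy, 2))
```

With the true forecast the book is split 50/50; the half-point error turns it into roughly 245% long and 145% short, which is the inversion-and-amplification effect described above.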

The end-to-end remedy is to make the optimiser part of the model and train the whole pipeline on the objective we actually care about — realised risk-adjusted return. This is the Smart “Predict, then Optimise” (SPO / SPO+) programme of Elmachtoub & Grigas (2022) and the differentiable-optimisation work of Agrawal, Amos, Barratt, Boyd et al. (2019).


%==========%


IV. The Mean-Variance QP as a Differentiable Layer (models/qp_layer.py):

The optimiser is a convex quadratic program. Writing the risk model as a factor matrix \(R\) with \(R^\top R = \hat{\Sigma}\) (so \(w^\top\hat{\Sigma}w = \lVert Rw\rVert^2\)), the layer solves

\[w^\star(\mu, R) = \arg\max_{w}\; \mu^\top w - \gamma\,\lVert Rw\rVert^2 - \kappa\,\lVert w - w_{\text{prev}}\rVert_1 \quad \text{s.t.}\quad \mathbf{1}^\top w = 1,\; 0 \le w \le w_{\max}\]

where \(\gamma\) is risk aversion, \(\kappa\) a turnover (L1) penalty against last period’s book, and \(w_{\max}\) an optional per-name cap. The parameters fed by the network are \(\mu\) (expected returns) and \(R\) (the risk factors); \(\gamma\), \(\kappa\) and \(w_{\max}\) are structural constants baked into the problem. This matters: cvxpylayers requires the problem to be disciplined parametrised programming (DPP) compliant, and a parameter multiplying a non-affine term (e.g. a learnable \(\gamma\) times \(\lVert Rw\rVert^2\)) breaks DPP. Every parameter here enters affinely, so the layer is differentiable. When cvxpylayers is unavailable the same object still solves the forward QP with raw cvxpy (CLARABEL), so inference and the decoupled baseline keep working.

import cvxpy as cp

def build_problem(n, k, gamma, w_max, kappa):
    w  = cp.Variable(n, name="w")
    mu = cp.Parameter(n, name="mu")
    R  = cp.Parameter((k, n), name="R")
    objective   = -mu @ w + gamma * cp.sum_squares(R @ w)      # DPP: params enter affinely
    constraints = [cp.sum(w) == 1, w >= 0]
    if w_max is not None:
        constraints.append(w <= w_max)
    if kappa > 0:
        w_prev = cp.Parameter(n, name="w_prev")
        objective = objective + kappa * cp.norm1(w - w_prev)
    return cp.Problem(cp.Minimize(objective), constraints)

# cvxpylayers turns the problem into an autograd-differentiable torch layer:
self._layer = CvxpyLayer(prob, parameters=[mu, R, ...], variables=[w])

%==========%


V. Implicit Differentiation through the KKT Conditions:

How does a gradient pass through an argmax? The optimal \(w^\star\) is characterised by the stationarity, primal- and dual-feasibility, and complementary-slackness conditions of the QP. At the solution these reduce to a system of equations \(g(w^\star, \nu^\star, \lambda^\star;\, \mu, R) = 0\) in the primal and dual variables. By the implicit function theorem, differentiating this system gives the Jacobian of the full primal-dual solution with respect to the problem parameters without ever needing a closed form for \(w^\star\); the rows belonging to \(w^\star\) are the sensitivities the layer needs:

\[\frac{\partial (w^\star, \nu^\star, \lambda^\star)}{\partial (\mu, R)} = -\left(\frac{\partial g}{\partial (w,\nu,\lambda)}\right)^{-1}\frac{\partial g}{\partial (\mu, R)}\]

cvxpylayers (via diffcp) evaluates the conic-form equivalent of this: it solves the conic program in the forward pass and differentiates the residual map of the homogeneous self-dual embedding in the backward pass. The upshot is a torch layer whose backward() returns \(\partial \mathcal{L}/\partial\mu\) and \(\partial \mathcal{L}/\partial R\) for any downstream loss \(\mathcal{L}\), which the autograd graph then propagates into the MLP weights. SPO+ (Elmachtoub & Grigas) pursues the same goal by a different route: it is a convex surrogate, a tractable upper bound on decision regret for linear objectives, whose subgradient needs only additional optimiser solves and no differentiation through the argmax.
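
The implicit-function-theorem step can be checked by hand in a reduced case. The sketch below is not the cvxpylayers code path: it assumes only the budget constraint is active (no bounds, no turnover term), so the KKT conditions become a linear system, and it compares the resulting \(\partial w^\star/\partial\mu\) against finite differences. The function names and the toy covariance are hypothetical.

```python
import numpy as np

def solve_equality_qp(mu, sigma, gamma):
    """Solve max mu@w - gamma*w@sigma@w s.t. sum(w) = 1 via its KKT linear system."""
    n = mu.size
    ones = np.ones(n)
    # Stationarity rows on top (2*gamma*sigma@w + nu*1 = mu), budget row below.
    kkt = np.block([[2.0 * gamma * sigma, ones[:, None]],
                    [ones[None, :], np.zeros((1, 1))]])
    rhs = np.concatenate([mu, [1.0]])
    sol = np.linalg.solve(kkt, rhs)
    return sol[:n], kkt

def dw_dmu(kkt, n):
    """w-rows of -(dg/d(w,nu))^{-1} dg/dmu, where g = kkt @ [w, nu] - [mu, 1]."""
    dg_dmu = np.vstack([-np.eye(n), np.zeros((1, n))])
    return (-np.linalg.solve(kkt, dg_dmu))[:n]

rng = np.random.default_rng(42)
n, gamma = 4, 5.0
a = rng.normal(size=(n, n))
sigma = a @ a.T / n + 0.05 * np.eye(n)      # positive-definite toy covariance
mu = rng.normal(scale=0.02, size=n)

w_star, kkt = solve_equality_qp(mu, sigma, gamma)
jac = dw_dmu(kkt, n)

# Central finite differences of the solver itself, one column per component of mu.
eps = 1e-6
fd = np.column_stack([
    (solve_equality_qp(mu + eps * e, sigma, gamma)[0]
     - solve_equality_qp(mu - eps * e, sigma, gamma)[0]) / (2 * eps)
    for e in np.eye(n)
])
print("max |implicit - finite difference| =", np.max(np.abs(jac - fd)))
```

Each column of the Jacobian sums to zero because the weights must keep summing to one however \(\mu\) moves. With active inequality constraints the same formula applies to the larger system that includes the multipliers \(\lambda\).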


%==========%


VI. Return-and-Risk Prediction Head (models/predictor.py):

A single MLP maps each asset’s factor exposures to both moments the QP needs: the expected return \(\mu\) and a low-rank-plus-diagonal risk model \(\hat{\Sigma} = FF^\top + \operatorname{diag}(d)\), parametrised by factor loadings \(F \in \mathbb{R}^{n\times k_f}\) and a positive idiosyncratic diagonal \(d\). Stacking \(R = [\,F^\top;\, \operatorname{diag}(\sqrt{d})\,]\) gives \(R^\top R = \hat{\Sigma}\) directly, so no Cholesky factorisation (which is awkward to differentiate) is needed. The same architecture serves both regimes; only the loss and whether gradients flow through the QP differ.

class ReturnPredictor(nn.Module):
    def forward(self, x):                       # x: (n_assets, n_factors)
        h = self.body(x)
        mu = self.mu_head(h).squeeze(-1)        # expected returns        (n,)
        F  = self.load_head(h)                  # factor loadings         (n, k_f)
        d  = nn.functional.softplus(self.idio_head(h)) + 1e-4   # idiosyncratic var (n,); F is the loadings tensor here, not torch.nn.functional
        return mu, F, d

    def risk_matrix(self, F, d):                # R with RᵀR = FFᵀ + diag(d)
        return torch.cat([F.t(), torch.diag(torch.sqrt(d))], dim=0)
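
The stacking identity is easy to confirm numerically. The following NumPy check mirrors risk_matrix with random inputs; the sizes and the random draws are placeholders, not values from the project.

```python
import numpy as np

rng = np.random.default_rng(42)
n, k_f = 5, 2                                  # assets, latent risk factors (assumed sizes)
F = rng.normal(size=(n, k_f))                  # factor loadings
d = rng.uniform(0.01, 0.05, size=n)            # positive idiosyncratic variances

R = np.vstack([F.T, np.diag(np.sqrt(d))])      # shape (k_f + n, n)
sigma_hat = F @ F.T + np.diag(d)

w = np.full(n, 1.0 / n)                        # any portfolio will do
print(R.shape)
print(np.allclose(R.T @ R, sigma_hat))
print(np.isclose(w @ sigma_hat @ w, np.sum((R @ w) ** 2)))
```

Note that \(R\) has \(k_f + n\) rows, so that is the row count the QP's \(R\) parameter must be declared with.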

%==========%


VII. The Decision-Quality Loss (models/losses.py, models/train.py):

The end-to-end model is trained on the negative annualised Sharpe ratio of the realised portfolio return series. Each training epoch is a full pass over the training months: the QP is solved month by month, the realised net return \(r^p_t = w^\star_t{}^\top r_t - \mathrm{tc}\,\lVert w^\star_t - w^\star_{t-1}\rVert_1\) is assembled, and the loss

\[\mathcal{L}_{\text{Sharpe}} = -\sqrt{12}\;\frac{\operatorname{mean}_t(r^p_t)}{\operatorname{std}_t(r^p_t)}\]

is back-propagated through every QP solve into the predictor. Using the whole training period (rather than one random rolling window) keeps the decision loss low-variance and the comparison against the decoupled baseline reproducible. The decoupled baseline, by contrast, is trained on MSE and hands a Ledoit-Wolf shrunk covariance estimate to the optimiser: the textbook two-stage estimator.
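
As a plain-NumPy illustration of the loss arithmetic (not the torch implementation in losses.py), the sketch below assembles net returns and the negative annualised Sharpe. The function names, the w_init argument, and the use of the sample standard deviation (ddof=1) are assumptions.

```python
import numpy as np

def net_returns(weights, returns, tc=0.0, w_init=None):
    """r^p_t = w_t . r_t - tc * ||w_t - w_{t-1}||_1 for each month t."""
    weights = np.asarray(weights, dtype=float)
    returns = np.asarray(returns, dtype=float)
    if w_init is None:
        w_init = weights[0]                     # assumption: no cost in the first month
    prev = np.vstack([w_init, weights[:-1]])    # last period's book for each month
    turnover = np.abs(weights - prev).sum(axis=1)
    return (weights * returns).sum(axis=1) - tc * turnover

def neg_annualised_sharpe(net, periods_per_year=12):
    """-sqrt(12) * mean / std of a monthly net-return series."""
    net = np.asarray(net, dtype=float)
    return -np.sqrt(periods_per_year) * net.mean() / net.std(ddof=1)

# Four months of a constant 50/50 book: portfolio returns are 2%, 0%, 2%, 0%.
weights = np.tile([0.5, 0.5], (4, 1))
returns = np.array([[0.03, 0.01], [0.01, -0.01], [0.03, 0.01], [0.01, -0.01]])
loss = neg_annualised_sharpe(net_returns(weights, returns))
print(round(loss, 6))
```

The toy series has mean 0.01 and sample standard deviation \(0.02/\sqrt{3}\), so the annualised Sharpe is 3 and the loss is approximately -3.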


%==========%


VIII. Empirical Findings — Testing the Claim Honestly:

The literature often claims end-to-end training “systematically” produces higher Sharpe. On controlled experiments with a synthetic factor universe (priced value/quality/momentum/low-vol factors, a quality×value interaction, block-correlated assets), the honest answer is more nuanced and, I think, more interesting:

In the realistic large-universe regime (\(N \approx T\), where the sample covariance is near-singular), the mean out-of-sample Sharpe over ten simulated universes, by risk model:

Risk model | Mean OOS Sharpe | Note
Plug-in two-stage MV (\(\delta = 0\)) | 0.56 | Naive sample covariance; the “error maximisation” regime.
Ledoit-Wolf (decision-blind shrinkage) | 1.85 | Standard statistical shrinkage toward the diagonal.
Decision-aware shrinkage (end-to-end) | 2.18 | Beats Ledoit-Wolf in ~7–8 of 10 seeds.

The decision-aware model discovers it should shrink hard (\(\delta^\star \to 1\), a decision-aware analogue of Ledoit-Wolf) and tunes the level for realised net Sharpe rather than for statistical covariance accuracy — which is why it edges past Ledoit-Wolf. The takeaway: differentiating through the optimiser pays off precisely where the hard, ill-conditioned estimation problem is — the covariance — not in squeezing the return forecast.


%==========%


IX. Where the Edge Lives: Decision-Optimised Shrinkage (models/decision_shrinkage.py):

To isolate the effect cleanly, this module uses a linear return head \(\theta\) (so the only difference from the baseline is the decision-awareness itself) and the closed-form mean-variance optimiser (differentiable in \(\theta\)), with one extra decision variable: the shrinkage \(\delta\) selected to maximise realised net Sharpe. The decoupled model uses the plug-in sample covariance (\(\delta = 0\)); the end-to-end model chooses \(\delta\) for decision quality. This is the mechanism the browser dashboard reproduces live.

# Out-of-sample Sharpe edge across simulated universes (decision-aware vs plug-in)
from e2e_portfolio.models.decision_shrinkage import edge_across_seeds
agg = edge_across_seeds(seeds=range(10), n_assets=40, n_months=96)
agg["mean_edge"], agg["win_rate"]            # ≈ +1.6 Sharpe, ~100% of seeds
agg["mean_edge_vs_lw"], agg["win_rate_vs_lw"]  # ≈ +0.3 Sharpe vs Ledoit-Wolf, ~7–8 of 10 seeds

This is not a tautology — that decision-aware shrinkage beats the naive plug-in is expected, but it also beats a strong, standard Ledoit-Wolf baseline, because statistical shrinkage minimises covariance error while decision-aware shrinkage minimises the quantity that actually matters at the end of the pipeline.
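
The idea of picking \(\delta\) for decision quality can be sketched in a few lines of NumPy. This is a simplified stand-in, not the module's code: it assumes a diagonal shrinkage target, uses a grid search on a held-out window instead of gradient-based selection, and draws returns from a hypothetical one-factor universe.

```python
import numpy as np

def shrink(sample_cov, delta):
    """Convex combination of the sample covariance and its own diagonal."""
    return (1.0 - delta) * sample_cov + delta * np.diag(np.diag(sample_cov))

def mv_weights(mu, cov):
    """Closed-form direction inv(cov) @ mu, scaled to unit gross exposure."""
    raw = np.linalg.solve(cov, mu)
    return raw / np.abs(raw).sum()

def sharpe(rets, periods_per_year=12):
    return np.sqrt(periods_per_year) * rets.mean() / rets.std(ddof=1)

rng = np.random.default_rng(42)
n_assets, n_fit, n_val = 40, 48, 48             # N close to T: ill-conditioned sample covariance
beta = rng.uniform(0.5, 1.5, size=n_assets)     # hypothetical market betas
market = rng.normal(0.0, 0.04, size=(n_fit + n_val, 1))
noise = rng.normal(0.0, 0.06, size=(n_fit + n_val, n_assets))
rets = 0.005 * beta + market * beta + noise     # monthly returns, shape (T, N)
fit, val = rets[:n_fit], rets[n_fit:]

mu_hat = fit.mean(axis=0)
sample_cov = np.cov(fit, rowvar=False)

# Score each shrinkage level by the realised Sharpe of the portfolio it produces.
grid = np.linspace(0.0, 1.0, 11)
val_sharpe = np.array([sharpe(val @ mv_weights(mu_hat, shrink(sample_cov, dl)))
                       for dl in grid])
best_delta = grid[np.argmax(val_sharpe)]
print("chosen delta:", best_delta)
print("condition number at delta=0 vs 0.5:",
      np.linalg.cond(sample_cov), np.linalg.cond(shrink(sample_cov, 0.5)))
```

The selection criterion is the portfolio's realised Sharpe, not any distance between the shrunk and true covariance, which is the distinction the paragraph above draws against Ledoit-Wolf.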


%==========%


X. Factor Attribution (attribution/saliency.py):

To ask what the optimiser-aware predictor learns differently, the module compares factor importance between the decoupled and end-to-end models via mean absolute gradient saliency \(\frac{1}{n}\sum\lvert\partial\hat{\mu}/\partial\text{factor}\rvert\) (and SHAP values via shap.GradientExplainer when available). The hypothesis — that the decision-aware model up-weights low-correlation diversifying factors — holds in some universes but, like the raw Sharpe edge of the full \(\hat\Sigma\) model, is not robust across all seeds; again the durable, reproducible effect is the covariance-shrinkage one. The attribution tooling is included so the comparison can be run and judged directly rather than asserted.
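
For reference, the saliency statistic itself is simple to compute. The sketch below uses a hand-rolled one-hidden-layer tanh network in NumPy as a stand-in for the torch predictor and autograd; the architecture, sizes, and random weights are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)
n_assets, n_factors, hidden = 30, 4, 8
x = rng.normal(size=(n_assets, n_factors))            # factor exposures per asset
W1 = rng.normal(scale=0.5, size=(n_factors, hidden))
b1 = rng.normal(scale=0.1, size=hidden)
w2 = rng.normal(scale=0.5, size=hidden)

def predict(x):
    """mu_hat for each asset from its own row of exposures."""
    return np.tanh(x @ W1 + b1) @ w2

def gradient_saliency(x):
    """Mean over assets of |d mu_hat_i / d x_ij|, one value per factor j."""
    h = np.tanh(x @ W1 + b1)                          # (n_assets, hidden)
    grad = ((1.0 - h**2) * w2) @ W1.T                 # (n_assets, n_factors)
    return np.abs(grad).mean(axis=0)

sal = gradient_saliency(x)

# Finite-difference check: each asset's forecast depends only on its own row of x.
eps = 1e-6
fd = np.column_stack([
    (predict(x + eps * e) - predict(x - eps * e)) / (2 * eps)
    for e in np.eye(n_factors)
])
print(np.round(sal, 4), np.allclose(sal, np.abs(fd).mean(axis=0), atol=1e-6))
```

Running the same statistic on the decoupled and end-to-end predictors and comparing the two vectors factor by factor is the comparison described above.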


%==========%


XI. CLI (cli.py):

# Install
pip install -e ".[dev]"

# Train decoupled and end-to-end pipelines; print out-of-sample stats
e2e compare --gamma 8 --tc 0.005

# Headline result: decision-optimised shrinkage vs plug-in & Ledoit-Wolf
e2e shrink --n-seeds 10 --n-assets 40

# Launch the server-side Streamlit dashboard
streamlit run src/e2e_portfolio/app.py
  
Command | Key options | Output
e2e compare | --gamma, --w-max, --kappa, --tc | OOS Sharpe / return / vol / turnover for equal-weight, decoupled, end-to-end
e2e shrink | --n-seeds, --n-assets, --n-months | Mean OOS Sharpe and win rate vs plug-in and Ledoit-Wolf

%==========%


XII. Test Suite:

Seventeen tests, fully offline, seed-42. The QP tests verify that the forward solve returns valid simplex weights, respects the max-weight cap, that the low-rank construction satisfies \(R^\top R = FF^\top + \operatorname{diag}(d)\), and — the key one — that gradients flow through the cvxpylayers layer (non-zero \(\partial w^\star/\partial\mu\) by implicit differentiation). Training tests check that MSE decreases, Ledoit-Wolf output is PSD, the end-to-end backtest produces valid weights and respects constraints, and the decision-shrinkage experiment beats the plug-in baseline on average and stays competitive with Ledoit-Wolf.

def test_gradient_flows_through_layer():
    """Implicit differentiation: ∂w*/∂mu must be non-trivial."""
    qp = MeanVarianceQP(n=6, n_factors=6, gamma=5.0)
    mu = (torch.randn(6) * 0.02).requires_grad_()   # scale first, then mark as a leaf so mu.grad is populated
    w  = qp.forward(mu, torch.eye(6) * 0.2)
    (w * torch.linspace(0, 1, 6)).sum().backward()
    assert mu.grad.abs().sum() > 0

def test_decision_aware_beats_plugin_on_average():
    r = edge_across_seeds(seeds=range(6), n_assets=40, n_months=96)
    assert r["mean_edge"] > 0.0 and r["win_rate"] >= 0.8

%==========%


XIII. Configuration & Setup:

cd assets/projects/e2e_portfolio
python -m venv .venv; .venv\Scripts\Activate.ps1          # Windows PowerShell
pip install -e ".[dev]"
e2e shrink                                                 # reproduce the headline result
pytest -q                                                  # 17 tests, offline
streamlit run src/e2e_portfolio/app.py
  

No data download is required: the models, tests and dashboard all run on the synthetic factor generator with no API keys. The optional scripts/download_data.py pulls yfinance prices to build the same price-based factor panel used by the cross-sectional signals engine.


Team:

Theodosios Dimitrasopoulos, personal project.

Tools & methods:

Python 3.11, PyTorch, CVXPY, cvxpylayers / diffcp (differentiable convex optimisation), scikit-learn, SHAP, NumPy, SciPy, Pydantic v2, DuckDB, Typer, rich, Plotly, Streamlit, yfinance, pytest, ruff, hatchling. Methods: differentiable convex optimisation layers (Agrawal, Amos, Barratt, Boyd, Diamond, Kolter 2019); Smart “Predict, then Optimise” / SPO+ (Elmachtoub & Grigas 2022); disciplined parametrised programming (DPP); implicit differentiation of the KKT system; mean-variance optimisation and Michaud (1989) error-maximisation; Ledoit-Wolf (2004) covariance shrinkage; decision-focused learning.