Variational Autoencoder Factor Model:
Classical factor models — Fama-French, principal component analysis — impose linear structure on the return covariance matrix, so they cannot capture the interaction effects and regime-dependent factor loadings visible in real equity returns. This project trains a variational autoencoder on the daily cross-section of returns: the encoder maps each day’s vector of \(N\) stock returns to a Gaussian distribution in a \(k\)-dimensional latent space, and the decoder reconstructs the returns from a sampled latent code. The latent dimensions are therefore data-driven daily factors — the sequence of posterior means is the factor time series, and the decoder columns are the loadings. The model is probed in four ways: its latent factors are correlated against known priced risks (market, a volatility index, a macro slope) to ask whether it rediscovers familiar structure; the return covariance reconstructed from the latent space is pitted against Ledoit-Wolf shrinkage and PCA in an out-of-sample minimum-variance horse race; stocks with anomalously high reconstruction error are tested as a drawdown signal; and a conditional VAE bridges to regime detection by conditioning the latent space on a macro label. Built on Python 3.11+ with torch, numpy, pandas, scikit-learn (PCA, Ledoit-Wolf), umap-learn (embedding), scipy, plotly, streamlit, duckdb, pydantic v2 and typer; packaged with hatchling and tested with pytest against deterministic seed-42 fixtures.
%==========%
I. Interactive Dashboard:
The dashboard below runs entirely in the browser via stlite (Streamlit on WebAssembly — no server). Training a nonlinear VAE needs PyTorch, so the in-browser demo uses the VAE’s linear-Gaussian limit — probabilistic PCA, whose exact solution is the top-\(k\) principal components — to illustrate the same ideas instantly: data-driven latent factors, the fidelity–complexity trade-off, the covariance min-variance test, reconstruction-error anomalies, and the sector embedding. The full project trains the nonlinear torch VAE and conditional VAE. First load downloads Pyodide and may take 20–40 seconds.
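The probabilistic-PCA limit mentioned above has a closed-form solution (Tipping and Bishop), which can be sketched in a few lines of NumPy. This is an illustrative sketch, not the dashboard's actual code: the function names `ppca_fit` and `ppca_latent_means` and the toy panel are assumptions made for the example.

```python
import numpy as np

def ppca_fit(X, k):
    """Closed-form probabilistic PCA on a T x N panel (hypothetical helper)."""
    Xc = X - X.mean(axis=0)                      # centre each column
    S = Xc.T @ Xc / Xc.shape[0]                  # N x N sample covariance
    evals, evecs = np.linalg.eigh(S)             # eigh returns ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]   # flip to descending
    sigma2 = evals[k:].mean()                    # noise variance: mean discarded eigenvalue
    # Loadings: top-k eigenvectors scaled by sqrt(eigenvalue - noise variance)
    W = evecs[:, :k] * np.sqrt(np.maximum(evals[:k] - sigma2, 0.0))
    return W, sigma2, evals

def ppca_latent_means(X, W, sigma2):
    """Posterior means E[z | x] = M^{-1} W^T (x - mean), with M = W^T W + sigma2 * I."""
    Xc = X - X.mean(axis=0)
    M = W.T @ W + sigma2 * np.eye(W.shape[1])
    return Xc @ W @ np.linalg.inv(M)             # M is symmetric, so row form is valid

# Toy panel: 3 linear factors plus isotropic noise (assumed sizes)
rng = np.random.default_rng(42)
T, N, k = 500, 20, 3
B_true = rng.normal(size=(N, k))
X = rng.normal(size=(T, k)) @ B_true.T + 0.5 * rng.normal(size=(T, N))

W, sigma2, evals = ppca_fit(X, k)
Z = ppca_latent_means(X, W, sigma2)              # T x k factor time series
implied_cov = W @ W.T + sigma2 * np.eye(N)       # low-rank plus isotropic noise
```

The implied covariance keeps the top-\(k\) sample eigenvalues exactly and replaces the rest with a single noise level, which is why this limit can stand in for the VAE in a browser without PyTorch.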
%==========%
II. Project Layout:
vae-factors/
├── pyproject.toml # Build config, deps, ruff + pytest settings
├── .env.example # DB_PATH
├── dashboard.html # Self-contained stlite browser demo (pPCA limit)
├── scripts/
│ └── download_data.py # yfinance returns + Fama-French 5 (optional)
├── src/vae_factors/
│ ├── data/
│ │ ├── synthetic.py # Latent-factor return panel: sectors, macro, anomalies
│ │ ├── schemas.py # Pydantic v2: CovarianceScore, LatentInterpretation
│ │ ├── store.py # DuckDB returns persistence
│ │ └── fetchers.py # yfinance returns, Kenneth French 5-factor
│ ├── model/
│ │ ├── vae.py # VAE + ConditionalVAE, ELBO (recon + β·KL)
│ │ └── train.py # train_vae, beta_sweep, standardiser
│ ├── analysis/
│ │ ├── benchmarks.py # Sample / Ledoit-Wolf / PCA / VAE covariance + min-var
│ │ ├── interpret.py # Latent↔observable corr, true-factor recovery R²
│ │ ├── anomaly.py # Reconstruction error → idiosyncratic drawdown
│ │ └── embedding.py # UMAP / PCA stock embedding in loading space
│ ├── report/
│ │ └── plots.py # Plotly: loss, β trade-off, trajectories, heatmap, embedding
│ ├── cli.py # Typer CLI: train | dashboard
│ └── app.py # Streamlit server-side dashboard
└── tests/ # Seed-42 fixtures; model, covariance, interpret, anomaly
%==========%
III. Why Linear Factor Models Fall Short:
A linear factor model writes returns as \(x_t = B f_t + \varepsilon_t\) with constant loadings \(B\), so the covariance \(\Sigma = B\operatorname{Cov}(f)B^\top + D\) is a fixed low-rank-plus-diagonal object. PCA finds the \(B\) that maximises explained variance; Fama-French fixes \(f\) to be observable long-short portfolios. Both are linear and stationary by construction. Real equity returns violate this: factor loadings drift and flip across regimes (a stock’s market beta in a crisis differs from its calm-market beta), and there are interaction effects (momentum pays off mainly in low-volatility names). A VAE relaxes both assumptions — the decoder \(g_\phi(z)\) is a nonlinear map, so the implied covariance is state-dependent, and conditioning the latent space on a regime label lets the loadings shift explicitly.
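The low-rank-plus-diagonal structure can be checked numerically. The sketch below simulates a linear factor model with made-up dimensions and parameters (all values are assumptions for illustration) and compares the sample covariance of the simulated returns with \(B\operatorname{Cov}(f)B^\top + D\).

```python
import numpy as np

rng = np.random.default_rng(42)
N, k, T = 8, 2, 200_000                           # assumed toy dimensions

B = rng.normal(size=(N, k))                       # constant loadings
cov_f = np.array([[1.0, 0.3],
                  [0.3, 0.5]])                    # factor covariance
D = np.diag(rng.uniform(0.1, 0.3, size=N))        # idiosyncratic variances

# Model-implied covariance: low-rank systematic part plus diagonal
systematic = B @ cov_f @ B.T
sigma_model = systematic + D

# Simulate x_t = B f_t + eps_t and estimate the covariance from the data
f = rng.multivariate_normal(np.zeros(k), cov_f, size=T)
eps = rng.normal(size=(T, N)) * np.sqrt(np.diag(D))
x = f @ B.T + eps
sigma_sample = np.cov(x, rowvar=False)

max_gap = np.abs(sigma_sample - sigma_model).max()
systematic_rank = np.linalg.matrix_rank(systematic)
```

Because \(B\) never changes, the same \(\Sigma\) applies on every day of the sample; this is the stationarity that a nonlinear or conditioned decoder relaxes.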
%==========%
IV. The VAE Objective (model/vae.py):
The VAE maximises a lower bound (the ELBO) on the marginal log-likelihood of the data. For a daily cross-section \(x\), encoder \(q_\theta(z\mid x) = \mathcal{N}(\mu_\theta(x), \operatorname{diag}\sigma_\theta^2(x))\) and Gaussian decoder \(g_\phi\), the training loss (the negative ELBO, up to additive constants and the scaling of the reconstruction term) is
\[\mathcal{L} = \underbrace{\mathbb{E}_{q_\theta(z\mid x)}\big[\lVert x - g_\phi(z)\rVert^2\big]}_{\text{reconstruction}} + \beta\;\underbrace{D_{\mathrm{KL}}\!\big(q_\theta(z\mid x)\,\|\,\mathcal{N}(0, I)\big)}_{\text{regularisation toward the prior}}\]
with the closed-form KL between the diagonal Gaussian posterior and the standard normal prior, \(D_{\mathrm{KL}} = -\tfrac{1}{2}\sum_j\big(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\big)\). The expectation is estimated by the reparameterisation trick \(z = \mu + \sigma\odot\epsilon,\;\epsilon\sim\mathcal{N}(0,I)\), which makes the sampling differentiable. The weight \(\beta\) (the β-VAE of Higgins et al. 2017) trades reconstruction fidelity against latent disentanglement: large \(\beta\) pushes unused latent dimensions toward the prior, pruning the code; small \(\beta\) reconstructs better but entangles factors.
import torch
import torch.nn.functional as F

def vae_loss(x_hat, x, mu, logvar, beta=1.0):
    # Squared error summed over assets, averaged over the batch
    recon = F.mse_loss(x_hat, x, reduction="sum") / x.shape[0]
    # Closed-form KL to the N(0, I) prior, averaged over the batch
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.shape[0]
    return recon + beta * kl, recon, kl

# Method of the VAE module
def forward(self, x, c=None):  # c: optional regime one-hot (CVAE)
    mu, logvar = self.encode(x, c)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(logvar)  # reparameterise
    return self.decode(z, c), mu, logvar, z
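As a sanity check on the closed-form KL term, the following NumPy sketch compares it with a Monte-Carlo estimate built from reparameterised samples. It is a standalone illustration with assumed values for \(\mu\) and \(\log\sigma^2\); the helper names are hypothetical and not part of the project.

```python
import numpy as np

def kl_closed_form(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for one sample."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))

def kl_monte_carlo(mu, logvar, n_samples, rng):
    """Estimate the same KL as E_q[log q(z) - log p(z)] with z = mu + sigma * eps."""
    sigma = np.exp(0.5 * logvar)
    eps = rng.standard_normal((n_samples, mu.size))
    z = mu + sigma * eps                              # reparameterised draws from q
    # The log(2*pi) constants are identical in log q and log p, so they cancel
    log_q = -0.5 * np.sum(logvar + eps**2, axis=1)
    log_p = -0.5 * np.sum(z**2, axis=1)
    return np.mean(log_q - log_p)

rng = np.random.default_rng(42)
mu = np.array([0.5, -1.0, 0.0])                       # assumed posterior means
logvar = np.array([-0.5, 0.2, 0.0])                   # assumed posterior log-variances

kl_exact = kl_closed_form(mu, logvar)
kl_sampled = kl_monte_carlo(mu, logvar, 200_000, rng)
```

The third dimension has \(\mu = 0\) and \(\sigma^2 = 1\), so it contributes nothing to the KL: this is what a latent dimension pruned toward the prior looks like.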
%==========%
V. Interpreting the Latent Dimensions (analysis/interpret.py):
Are the data-driven factors economically meaningful? Two diagnostics. First, the absolute correlation of each latent posterior-mean series with observable variables (market return, a VIX-like realised-vol index, a yield-curve-slope macro series). Second, true-factor recovery: regress each (synthetic) ground-truth factor on the latent means and report the \(R^2\) — the share of that factor’s variation the latent space spans, a canonical-correlation-style alignment. On the synthetic panel (60 names, 4 true factors, \(k = 4\)), the VAE recovers the factor space well and rediscovers the market without supervision:
| Diagnostic | Result |
|---|---|
| True-factor recovery \(R^2\) (market / 3 style factors) | 0.83 / 0.64 / 0.68 / 0.68 |
| Strongest latent–market \(\lvert\text{corr}\rvert\) | 0.91 — one latent dimension is the market factor |
One latent dimension correlates strongly with the observable market return (\(\lvert\text{corr}\rvert = 0.91\)) even though the network was never told what the market is: it rediscovers the dominant priced risk from the cross-section alone.
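Both diagnostics reduce to a few lines of linear algebra. The sketch below is a minimal NumPy version under stated assumptions: `recovery_r2` is a hypothetical helper (not the project's `factor_recovery`), and the "latent means" are simulated as a rotated, noisy copy of two true factors rather than taken from a trained model.

```python
import numpy as np

def recovery_r2(true_factor, latent_means):
    """R^2 from an OLS regression of one true factor on all latent means."""
    design = np.column_stack([np.ones(len(latent_means)), latent_means])
    coef, *_ = np.linalg.lstsq(design, true_factor, rcond=None)
    resid = true_factor - design @ coef
    return 1.0 - resid.var() / true_factor.var()

rng = np.random.default_rng(42)
T = 2000
true_factors = rng.standard_normal((T, 2))            # column 0 plays the market

# Assumed stand-in for encoder output: a rotation of the true factors plus noise
theta = 0.7
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
latent_means = true_factors @ rotation + 0.3 * rng.standard_normal((T, 2))

r2_market = recovery_r2(true_factors[:, 0], latent_means)
r2_noise = recovery_r2(rng.standard_normal(T), latent_means)   # unrelated series

# Absolute correlation of each latent dimension with the market factor
corr = np.corrcoef(latent_means.T, true_factors[:, 0])
abs_corr = np.abs(corr[-1, :-1])
```

Because a VAE's latent axes are only identified up to rotation, the regression \(R^2\) can be high even when no single latent dimension correlates strongly with the factor, which is why the two diagnostics are reported side by side.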
%==========%
VI. Covariance Reconstruction & the Economic Test (analysis/benchmarks.py):
The decisive question for a factor model is whether its covariance generalises. The VAE-implied covariance is a proper factor-model object: a Monte-Carlo systematic part (sample \(z\sim\mathcal{N}(0,I)\), decode, de-standardise, take the sample covariance of the generated returns) plus a diagonal idiosyncratic part (the per-stock variance of the training reconstruction residual). Omitting that idiosyncratic term is a classic mistake — it under-estimates the diagonal and the minimum-variance optimiser then over-levers into noise. Each estimator is built on the training window, used to form the global minimum-variance portfolio \(w \propto \Sigma^{-1}\mathbf{1}\), and scored on realised out-of-sample volatility (lower is better):
| Covariance estimator | OOS min-variance vol |
|---|---|
| Sample covariance | 10.65% |
| Ledoit-Wolf shrinkage | 10.64% |
| PCA (\(k = 4\)) | 10.38% |
| VAE (\(k = 4\), systematic + idiosyncratic) | 10.44% |
On a linear data-generating process, linear PCA is essentially the ideal model, so the VAE landing within 0.06 percentage points of it (10.44% vs 10.38%) while beating the raw sample and Ledoit-Wolf covariances is the expected, correct result, not a disappointment. The VAE’s distinctive value over linear methods does not come from covariance accuracy on stationary linear data; it comes from the regime-conditioning and anomaly-detection capabilities below, which linear factor models structurally cannot provide.
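The construction of the covariance and the minimum-variance weights can be sketched as follows. This is an illustration under explicit assumptions: a linear map stands in for the trained decoder, simulated noise stands in for the training residuals, and the helper names are hypothetical rather than the ones in `analysis/benchmarks.py`.

```python
import numpy as np

def factor_covariance(decoded_returns, residuals):
    """Systematic covariance of decoded samples plus diagonal idiosyncratic variance."""
    systematic = np.cov(decoded_returns, rowvar=False)
    idiosyncratic = np.diag(residuals.var(axis=0))
    return systematic + idiosyncratic

def min_variance_weights(cov):
    """Global minimum-variance weights, w proportional to inverse(cov) @ 1."""
    raw = np.linalg.solve(cov, np.ones(cov.shape[0]))
    return raw / raw.sum()

rng = np.random.default_rng(42)
N, k, T = 10, 3, 5000                                  # assumed toy dimensions
B = rng.normal(scale=0.01, size=(N, k))
idio_vol = rng.uniform(0.005, 0.015, size=N)

decoded = rng.standard_normal((T, k)) @ B.T            # stand-in for decode(z), z ~ N(0, I)
residuals = rng.standard_normal((T, N)) * idio_vol     # stand-in for training residuals

sys_cov = np.cov(decoded, rowvar=False)
full_cov = factor_covariance(decoded, residuals)

# Count eigenvalues that are non-negligible relative to total variance
sys_rank = int(np.sum(np.linalg.eigvalsh(sys_cov) > 1e-8 * np.trace(sys_cov)))

w = min_variance_weights(full_cov)
w_equal = np.full(N, 1.0 / N)
var_minvar = w @ full_cov @ w
var_equal = w_equal @ full_cov @ w_equal
```

In this toy setup the systematic part alone has rank \(k\), so it cannot be inverted for \(N > k\) assets; the diagonal term is what makes \(\Sigma^{-1}\mathbf{1}\) well defined.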
%==========%
VII. Reconstruction Error as an Anomaly Signal (analysis/anomaly.py):
Stocks the encoder/decoder cannot explain — high reconstruction error \(\lVert x_i - \hat{x}_i\rVert^2\) — are carrying variation outside the learned factor structure. The synthetic panel plants episodic idiosyncratic shocks followed by a negative drift in a subset of names; the test is whether high reconstruction error flags them and whether those names subsequently suffer worse drawdowns. Because total drawdown is dominated by market beta, the drawdown is measured on market-residual (idiosyncratic) return paths, isolating the channel the anomalies live in:
| Group (top-20% reconstruction error vs rest) | Mean worst idiosyncratic drawdown |
|---|---|
| High reconstruction error | −39.7% |
| Low reconstruction error | −34.1% |
The rank correlation between reconstruction error and the planted-anomaly flag is positive and robust across seeds (\(\rho \approx 0.2\!-\!0.4\)), and the high-error group reliably draws down more. Reconstruction error is thus a label-free idiosyncratic-risk flag: a by-product of the factor model that a covariance matrix alone does not provide.
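The grouping and drawdown measurement can be sketched in NumPy. The example below is a simplified stand-in, not the project's `anomaly_drawdown_test`: it assumes the factor model has already removed the systematic part, so the idiosyncratic return path doubles as the reconstruction residual, and the planted shock sizes are made up for illustration.

```python
import numpy as np

def max_drawdown(returns):
    """Worst peak-to-trough loss of the compounded path of one return series."""
    wealth = np.cumprod(1.0 + returns)
    # Include the starting wealth of 1.0 when tracking the running peak
    peak = np.maximum.accumulate(np.concatenate([[1.0], wealth]))[1:]
    return (wealth / peak - 1.0).min()

def split_by_error(recon_error, top_frac=0.2):
    """Boolean mask for the stocks in the top `top_frac` of reconstruction error."""
    cutoff = np.quantile(recon_error, 1.0 - top_frac)
    return recon_error >= cutoff

rng = np.random.default_rng(42)
T, N = 500, 30
idio = 0.005 * rng.standard_normal((T, N))             # market-residual returns

# Plant an episodic shock with negative drift in the first six names (assumed sizes)
planted = np.zeros(N, dtype=bool)
planted[:6] = True
idio[200:300, :6] = -0.004 + 0.015 * rng.standard_normal((100, 6))

recon_error = (idio ** 2).mean(axis=0)                 # per-stock mean squared residual
high = split_by_error(recon_error, top_frac=0.2)

drawdowns = np.array([max_drawdown(idio[:, i]) for i in range(N)])
high_dd = drawdowns[high].mean()
low_dd = drawdowns[~high].mean()
```

Drawdowns are negative numbers, so "draws down more" means `high_dd` is the smaller (more negative) of the two group means.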
%==========%
VIII. Conditional VAE & Regime-Dependent Loadings (model/vae.py):
The conditional VAE appends a one-hot macro-regime label (expansion / contraction, derived from the trailing market trend) to both the encoder input and the decoder input. The decoder \(g_\phi(z, c)\) then learns regime-specific loadings: the same latent factor maps to different cross-sectional return patterns depending on \(c\). This is the direct bridge to the regime-detection work elsewhere in the portfolio — instead of fitting one stationary covariance, the model carries a distribution over factor structures indexed by macro state, and the reconstructed covariance shifts as the conditioning label changes.
# Conditioning is a one-hot regime appended to encoder and decoder inputs
model = VAE(n_assets, latent_dim=3, cond_dim=2) # cond_dim=2 → CVAE
x_hat, mu, logvar, z = model(x, regime_onehot) # loadings depend on regime
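A minimal sketch of how such a regime label could be built and attached to the network inputs, in NumPy. The 60-day window, the sign rule, and the helper name `regime_one_hot` are assumptions for illustration; the source only states that the label is derived from the trailing market trend.

```python
import numpy as np

def regime_one_hot(market_returns, window=60):
    """One-hot [expansion, contraction] label from the trailing market return sum.

    Uses only returns strictly before day t, so the label has no look-ahead.
    """
    T = len(market_returns)
    csum = np.concatenate([[0.0], np.cumsum(market_returns)])
    labels = np.zeros(T, dtype=int)
    for t in range(T):
        start = max(0, t - window)
        trailing = csum[t] - csum[start]               # sum over days start..t-1
        labels[t] = 0 if trailing >= 0 else 1          # 0 = expansion, 1 = contraction
    return np.eye(2)[labels]

# Tiny deterministic example: five up days followed by five down days
market = np.array([0.01] * 5 + [-0.05] * 5)
c = regime_one_hot(market, window=3)

# Conditioning is plain concatenation on both sides of the bottleneck
rng = np.random.default_rng(42)
n_assets, latent_dim = 6, 3
x = rng.standard_normal((len(market), n_assets))       # daily cross-sections
z = rng.standard_normal((len(market), latent_dim))     # stand-in latent codes

encoder_input = np.concatenate([x, c], axis=1)         # shape (T, n_assets + 2)
decoder_input = np.concatenate([z, c], axis=1)         # shape (T, latent_dim + 2)
```

Because the decoder sees \(c\) alongside \(z\), the same latent code can decode to different cross-sectional patterns in each regime.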
%==========%
IX. Stock Embedding (analysis/embedding.py):
Each stock is summarised by its sensitivity to the latent factors (a regression of its returns on the latent means), and those loading vectors are projected to two dimensions with UMAP (falling back to PCA when umap-learn is unavailable). Colouring by sector shows that names which respond alike to the data-driven factors cluster together — the embedding recovers the sector blocks the model was never told about. This is the t-SNE/UMAP visualisation of the learned representation called for in the work programme.
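The loading regression and the PCA fallback can be sketched with NumPy alone. The two-sector toy panel and the helper names `latent_loadings` and `pca_2d` are assumptions for this example; the UMAP path is omitted so that the sketch has no optional dependency.

```python
import numpy as np

def latent_loadings(returns, latent_means):
    """Regress every stock's returns on the latent means; return N x k slopes."""
    design = np.column_stack([np.ones(len(latent_means)), latent_means])
    coef, *_ = np.linalg.lstsq(design, returns, rcond=None)
    return coef[1:].T                                  # drop the intercept row

def pca_2d(loadings):
    """Project loading vectors onto their first two principal components."""
    centred = loadings - loadings.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:2].T

rng = np.random.default_rng(42)
T, k, per_sector = 1000, 3, 10

# Two assumed sectors that share the first factor but differ on the other two
base_a, base_b = np.array([1.0, 1.0, 0.0]), np.array([1.0, 0.0, 1.0])
true_loadings = np.vstack([
    base_a + 0.1 * rng.standard_normal((per_sector, k)),
    base_b + 0.1 * rng.standard_normal((per_sector, k)),
])
sector = np.repeat([0, 1], per_sector)

latent_means = rng.standard_normal((T, k))             # stand-in for posterior means
returns = latent_means @ true_loadings.T + 0.5 * rng.standard_normal((T, 2 * per_sector))

loadings = latent_loadings(returns, latent_means)      # (20, 3)
coords = pca_2d(loadings)                              # (20, 2) embedding
```

Colouring `coords` by `sector` in a scatter plot should show the two groups on opposite sides of the first axis, since the sector difference is the dominant source of variation in the loadings.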
%==========%
X. CLI — cli.py:
# Install
pip install -e ".[dev]"
# Train the VAE and print covariance, interpretation and anomaly diagnostics
vae train --latent 4 --beta 1.0 --epochs 300
# Conditional VAE on the macro regime label
vae train --conditional
# Launch the server-side Streamlit dashboard
streamlit run src/vae_factors/app.py
| Command | Key options | Output |
|---|---|---|
| vae train | --latent, --beta, --epochs, --conditional | Min-variance OOS table, true-factor recovery \(R^2\), latent–observable correlations, anomaly drawdown test |
%==========%
XI. Test Suite:
Eighteen tests, fully offline, seed-42. Model tests verify encoder/decoder shapes, the conditional path, non-negative loss components, and that training reduces the loss. Synthetic-data tests confirm the market factor drives the cross-section. Analysis tests verify that every covariance estimator (including the VAE’s) is positive semi-definite, min-variance weights sum to one, the latent space recovers the market factor (\(R^2 > 0.5\), \(\lvert\text{corr}\rvert > 0.6\)), the stock embedding is 2-D, and — the substantive one — that reconstruction error tracks the planted anomalies and the high-error group draws down more.
def test_factor_recovery_market(trained, panel):
    rec = factor_recovery(trained, panel)
    assert rec["true_factor_0"] > 0.5  # market factor recovered

def test_anomaly_error_tracks_planted_flag(trained, panel):
    an = anomaly_drawdown_test(trained, panel)
    assert an["error_vs_anomaly_rho"] > 0.0
    # Drawdowns are negative, so the high-error group should be at least as negative
    assert an["high_error_mean_dd"] <= an["low_error_mean_dd"]
%==========%
XII. Configuration & Setup:
cd assets/projects/vae_factors
python -m venv .venv && .venv\Scripts\Activate.ps1 # Windows
pip install -e ".[dev]"
vae train # reproduce the diagnostics
pytest -q # 18 tests, offline
streamlit run src/vae_factors/app.py
No data download is required: the model, tests and dashboard all run on the synthetic latent-factor generator with no API keys. The optional scripts/download_data.py pulls yfinance returns and the Kenneth French 5-factor daily series for a live-data study.
Team:
Theodosios Dimitrasopoulos, personal project.
Tools & methods:
Python 3.11, PyTorch, scikit-learn (PCA, Ledoit-Wolf), umap-learn, NumPy, SciPy, pandas, pandas-datareader (Fama-French), Pydantic v2, DuckDB, Typer, rich, Plotly, Streamlit, yfinance, pytest, ruff, hatchling. Methods: variational autoencoders and the ELBO / reparameterisation trick (Kingma & Welling 2014); β-VAE disentanglement (Higgins et al. 2017); conditional VAEs (Sohn et al. 2015); factor-model covariance estimation; Ledoit-Wolf (2004) shrinkage; PCA factor models; minimum-variance portfolio construction; UMAP (McInnes et al. 2018) representation embedding.