Deep Learning for Financial Time-Series Forecasting:

This project trains a multi-layer LSTM — and a Transformer with multi-head self-attention — to forecast next-period realised volatility and return direction, and then asks the only question that matters for deep learning on markets: does it actually beat a simple benchmark out of sample? The emphasis is as much on methodology as on the models. Every result comes from strict walk-forward validation with the model re-fit at each step and feature scaling fit on the training window only, so no future information — not even a normalisation constant — leaks backward. Volatility forecasts are scored against GARCH(1,1) and EWMA using MSE, the robust QLIKE loss and a Mincer-Zarnowitz predictive regression; direction forecasts are scored against a naive momentum rule. Failure modes are diagnosed systematically — training-vs-validation loss for overfitting, rolling performance for regime sensitivity, and a permutation test to confirm the model uses signal rather than noise. The honest headline: a well-trained LSTM carries genuine volatility signal (permutation \(p < 0.001\)) yet still does not beat GARCH(1,1), and return direction is essentially a coin flip. Built on Python 3.11+ with torch, numpy, pandas, scipy (GARCH QMLE), scikit-learn, plotly, streamlit, duckdb, pydantic v2 and typer; packaged with hatchling and tested with pytest against deterministic seed-42 fixtures.


%==========%


I. Interactive Dashboard:

The dashboard below runs entirely in the browser via stlite (Streamlit on WebAssembly — no server). PyTorch cannot run in Pyodide, so the in-browser “ML” forecaster is a walk-forward linear feature model (HAR-style) trained by gradient descent — it demonstrates the identical methodology: strict walk-forward, no-leak scaling, GARCH/EWMA baselines, a train-vs-validation loss curve, a permutation test, and a self-attention map. The full project trains the real torch LSTM and Transformer, and the punchline is the same in both: a GARCH(1,1) benchmark is very hard to beat. First load downloads Pyodide and may take 20–40 seconds.


%==========%


II. Project Layout:

deep-forecasting/
├── pyproject.toml                              # Build config, deps, ruff + pytest settings
├── .env.example                                # FRED_API_KEY (optional)
├── dashboard.html                              # Self-contained stlite browser demo (linear stand-in)
├── scripts/
│   └── download_data.py                        # yfinance prices + ^VIX + FRED slope → DuckDB (optional)
├── src/deep_forecasting/
│   ├── data/
│   │   ├── synthetic.py                        # GARCH(1,1) index + volume / VIX / macro-slope channels
│   │   ├── features.py                         # No-leak feature matrix, windowing, train-only scaler
│   │   ├── schemas.py                          # Pydantic v2: PriceRecord, MacroRecord
│   │   └── fetchers.py                         # yfinance / FRED (optional live data)
│   ├── models/
│   │   ├── lstm.py                             # Multi-layer LSTM + dropout + feed-forward head
│   │   ├── transformer.py                      # Positional encoding + multi-head self-attention
│   │   └── baselines.py                        # GARCH(1,1) QMLE, EWMA, momentum
│   ├── eval/
│   │   ├── walkforward.py                      # Re-fit-per-fold walk-forward driver (expanding/rolling)
│   │   ├── metrics.py                          # MSE, QLIKE, Mincer-Zarnowitz, direction accuracy
│   │   └── diagnostics.py                      # Permutation test, attention extraction
│   ├── report/
│   │   └── plots.py                            # Plotly: forecast, loss curves, rolling, attention, perm
│   ├── cli.py                                  # Typer CLI: forecast | dashboard
│   └── app.py                                  # Streamlit server-side dashboard
└── tests/                                      # Seed-42 fixtures; features, models, baselines, walk-forward
  

%==========%


III. The Look-Ahead Trap & Walk-Forward Validation (eval/walkforward.py):

The single most common way deep-learning-on-markets results are overstated is data leakage: a model is trained and tested on a random shuffle, or features are standardised using the full-sample mean and variance, so the test period silently informs training. This project forbids both. Validation is walk-forward: starting from an initial training window, the model is re-fit from scratch, used to predict the next out-of-sample block, then the window advances and the process repeats — predictions are accumulated across the whole timeline and every test point is genuinely out-of-sample. Two window modes are supported: expanding (training grows from a fixed origin) and rolling (fixed-length window that adapts to regime). Crucially, the feature standardiser is fit inside the loop on the training slice only:

while train_end + step <= T:
    train_start = origin if mode == "expanding" else max(origin, train_end - window)
    scaler = Standardizer.fit(fs.X[train_start:train_end])   # TRAIN ONLY — no leak
    Xz = scaler.transform(fs.X)

    Xtr, ytr, _    = make_windows(Xz, fs.y, lookback, np.arange(train_start, train_end))
    Xte, yte, ends = make_windows(Xz, fs.y, lookback, np.arange(train_end, train_end + step))

    model = build_model(kind, fs.n_features)                 # re-fit from scratch each fold
    train_curve, val_curve = _train_torch(model, Xtr, ytr, ...)
    preds = _predict(model, Xte, fs.task)                    # accumulate OOS predictions
    train_end += step
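
The fold arithmetic in the loop above can be isolated from the model code. The following is a minimal sketch, not the project's driver: `walk_forward_folds` and `fit_standardizer` are hypothetical helper names, and the default rolling length (equal to the initial window) is an assumption made only for illustration.

```python
from typing import Iterator, Optional, Tuple

import numpy as np


def walk_forward_folds(
    n_obs: int,
    initial_train: int,
    step: int,
    mode: str = "expanding",
    window: Optional[int] = None,
) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) index ranges in chronological order."""
    if mode not in ("expanding", "rolling"):
        raise ValueError(f"unknown mode: {mode!r}")
    if window is None:
        window = initial_train  # assumption: rolling length defaults to the initial window
    origin, train_end = 0, initial_train
    while train_end + step <= n_obs:
        train_start = origin if mode == "expanding" else max(origin, train_end - window)
        yield range(train_start, train_end), range(train_end, train_end + step)
        train_end += step


def fit_standardizer(X_train: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """Column means and standard deviations computed from the training rows only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    return mean, np.where(std > 0, std, 1.0)  # guard against constant columns


rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
folds = list(walk_forward_folds(len(X), initial_train=500, step=100))
for train, test in folds:
    mean, std = fit_standardizer(X[train.start:train.stop])
    Xz = (X - mean) / std  # test rows are scaled with training statistics
    print(f"train [{train.start}, {train.stop})  test [{test.start}, {test.stop})")
```

Each test block starts exactly where its training range ends, so the union of test blocks covers the out-of-sample timeline once and only once.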

%==========%


IV. Feature Engineering Without Leakage (data/features.py):

Two rules are enforced everywhere. First, every feature at time \(t\) uses only information available at the close of \(t\) — all rolling statistics are trailing. The per-timestep feature vector is the day’s return and its absolute value, trailing realised vol over 5 and 21 days, the log-volume change, RSI, a VIX-like level and a yield-curve-slope macro series. Second, the target refers strictly to the future and is never an input. For the volatility task the target is the next-\(h\) realised variance, modelled in logs for stability; for direction it is the next day’s sign:

\[ y^{\text{vol}}_t = \log\!\Big(\tfrac{1}{h}\textstyle\sum_{k=1}^{h} r_{t+k}^2\Big), \qquad y^{\text{dir}}_t = \mathbf{1}\{r_{t+1} > 0\} \]

The sequence models consume a sliding lookback window \(X_{t-L+1:t}\) to predict \(y_t\); windows that run off the start of the series or whose target is undefined are dropped, and the surviving end-indices are tracked so predictions can be scattered back onto the original timeline. The synthetic data itself is drawn from a genuine GARCH(1,1) process, so conditional variance really does cluster and persist — which is exactly what makes the GARCH benchmark a fair, hard target rather than a straw man.
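
The target definition and the windowing rule can be sketched in a few lines of NumPy. This is an illustrative re-implementation under stated assumptions, not the code in data/features.py: `forward_log_variance` is a hypothetical name, and `make_windows` here only mirrors the call signature shown in the walk-forward loop.

```python
import numpy as np


def forward_log_variance(returns, horizon):
    """y[t] = log(mean(r[t+1..t+h]^2)); NaN where the forward window is incomplete."""
    r2 = np.asarray(returns, dtype=float) ** 2
    y = np.full(len(r2), np.nan)
    for t in range(len(r2) - horizon):
        y[t] = np.log(np.mean(r2[t + 1 : t + 1 + horizon]))
    return y


def make_windows(X, y, lookback, ends):
    """Stack X[t-L+1 : t+1] for every end index t with a full window and a defined target."""
    keep = [int(t) for t in ends if t - lookback + 1 >= 0 and np.isfinite(y[t])]
    if not keep:
        return np.empty((0, lookback, X.shape[1])), np.empty(0), np.empty(0, dtype=int)
    windows = np.stack([X[t - lookback + 1 : t + 1] for t in keep])
    return windows, y[keep], np.asarray(keep)


rng = np.random.default_rng(42)
returns = rng.normal(scale=0.01, size=300)
X = np.column_stack([returns, np.abs(returns)])  # two toy features known at the close of t
y = forward_log_variance(returns, horizon=5)
Xw, yw, ends = make_windows(X, y, lookback=20, ends=np.arange(len(returns)))
print(Xw.shape, ends[0], ends[-1])
```

The first 19 rows are dropped because their lookback window would run off the start of the series, and the last 5 are dropped because their forward target does not exist.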


%==========%


V. The Models — LSTM & Transformer (models/):

The LSTM is a stacked, dropout-regularised recurrent network reading the lookback window and emitting one scalar from the final hidden state — a predicted log-variance, or a logit for direction. The Transformer projects each timestep to a model dimension, adds sinusoidal positional encoding, applies a single multi-head self-attention block with a feed-forward sublayer and residual/layer-norm connections, and reads out from the last position. A single attention layer is used (rather than a deep stack) precisely so the per-head attention map is easy to expose for visualisation.

from torch import nn


class LSTMForecaster(nn.Module):
    def __init__(self, n_features, hidden=32, num_layers=2, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers, batch_first=True,
                            dropout=dropout if num_layers > 1 else 0.0)
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Dropout(dropout), nn.Linear(hidden, 1))

    def forward(self, x):
        out, _ = self.lstm(x)            # (B, L, H)
        return self.head(out[:, -1, :]).squeeze(-1)   # read from the final timestep
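
For the Transformer side, the sinusoidal positional encoding mentioned above can be written out explicitly. The sketch below uses the standard sine/cosine formulation and is written in NumPy so it runs without torch; whether models/transformer.py uses exactly this base (10000) and layout is an assumption.

```python
import numpy as np


def sinusoidal_positional_encoding(length: int, d_model: int) -> np.ndarray:
    """Return a (length, d_model) matrix: sines in even columns, cosines in odd columns."""
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    positions = np.arange(length)[:, None]                                  # (L, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, d_model, 2) / d_model)   # (d_model / 2,)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(positions * freqs)
    pe[:, 1::2] = np.cos(positions * freqs)
    return pe


pe = sinusoidal_positional_encoding(length=40, d_model=32)
print(pe.shape)
```

The encoding is added to the projected inputs so that self-attention, which is otherwise order-blind, can tell the most recent lag from the oldest one in the lookback window.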

%==========%


VI. Volatility Results vs GARCH & EWMA (cli.py forecast):

The horse-race uses 1,500 days of the synthetic GARCH index under expanding-window walk-forward validation with a 40-day lookback, producing 693 out-of-sample predictions; the LSTM is re-fit at every fold. Lower QLIKE and MSE are better; an unbiased forecast has a Mincer-Zarnowitz slope of \(\beta = 1\) (with zero intercept):

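| Model | MSE | QLIKE | MZ \(\beta\) | MZ \(R^2\) |
|---|---|---|---|---|
| LSTM | \(6.1\times10^{-9}\) | 0.34 | 0.88 | 0.06 |
| Transformer | \(9.1\times10^{-9}\) | 1.33 | 0.22 | 0.02 |
| GARCH(1,1) | \(5.6\times10^{-9}\) | 0.30 | 0.70 | 0.09 |
| EWMA | \(5.7\times10^{-9}\) | 0.30 | 0.60 | 0.11 |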

Reported honestly: GARCH(1,1) and EWMA win on both MSE and QLIKE. The LSTM is close and, importantly, it is learning: its Mincer-Zarnowitz \(\beta\) of 0.88 is the closest to 1 of the four models and its permutation test is overwhelmingly significant. But “close and genuinely skilful” is not “better.” The Transformer fares worse still: with only ~700 training windows it is data-starved, and a model designed for large corpora cannot calibrate on a few hundred noisy financial sequences. This is the central, deliberately un-oversold lesson: on data whose conditional variance is itself GARCH-like, a three-parameter GARCH(1,1) is near-optimal and added neural capacity mostly adds variance.
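
To make the benchmarks concrete, here is a minimal sketch of the two variance recursions. It is not the code in models/baselines.py: the function names are hypothetical, the EWMA decay of 0.94 is the common RiskMetrics convention rather than a value taken from the project, the GARCH parameters are illustrative instead of QMLE estimates, and averaging the expected variance over the horizon is an assumption about how a one-step model is matched to an \(h\)-day target.

```python
import numpy as np


def ewma_variance_sketch(returns, lam=0.94):
    """sigma2[t] uses r[0..t] only and serves as the forecast for t+1."""
    r2 = np.asarray(returns, dtype=float) ** 2
    sigma2 = np.empty_like(r2)
    sigma2[0] = r2[0]  # assumption: seed with the first squared return
    for t in range(1, len(r2)):
        sigma2[t] = lam * sigma2[t - 1] + (1.0 - lam) * r2[t]
    return sigma2


def garch11_filter(returns, omega, alpha, beta):
    """h[t] is the conditional variance of r[t]; h[-1] is the next-day forecast."""
    if not (omega > 0 and alpha >= 0 and beta >= 0 and alpha + beta < 1):
        raise ValueError("need omega > 0, alpha >= 0, beta >= 0 and alpha + beta < 1")
    r2 = np.asarray(returns, dtype=float) ** 2
    h = np.empty(len(r2) + 1)
    h[0] = omega / (1.0 - alpha - beta)  # start at the unconditional variance
    for t in range(len(r2)):
        h[t + 1] = omega + alpha * r2[t] + beta * h[t]
    return h


def garch11_mean_forecast(h_next, omega, alpha, beta, horizon):
    """Average expected variance over the next `horizon` days."""
    long_run = omega / (1.0 - alpha - beta)
    k = np.arange(horizon)  # 0 .. horizon-1 steps beyond the one-step forecast
    return float(np.mean(long_run + (alpha + beta) ** k * (h_next - long_run)))


rng = np.random.default_rng(42)
r = rng.normal(scale=0.01, size=500)
omega, alpha, beta = 2e-6, 0.08, 0.90  # illustrative values, not fitted
h = garch11_filter(r, omega, alpha, beta)
print(ewma_variance_sketch(r)[-1], h[-1], garch11_mean_forecast(h[-1], omega, alpha, beta, 5))
```

Both recursions need only the previous variance and the latest squared return, which is why they are so hard to beat when the data-generating process has the same form.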


%==========%


VII. Evaluation Metrics (eval/metrics.py):

Volatility forecasts are scored three ways. MSE on variance is intuitive but dominated by the noise in the realised-variance proxy. QLIKE (Patton 2011) is robust to that proxy noise and is the preferred volatility loss; it is minimised only when the forecast equals the truth:

\[ \text{QLIKE} = \frac{1}{n}\sum_t\Big(\frac{\sigma^2_t}{\hat\sigma^2_t} - \log\frac{\sigma^2_t}{\hat\sigma^2_t} - 1\Big), \qquad \text{MZ:}\;\; \sigma^2_t = \alpha + \beta\,\hat\sigma^2_t + e_t \]

The Mincer-Zarnowitz regression of realised on predicted variance diagnoses bias: a perfect forecast gives \(\alpha = 0,\ \beta = 1\), and the slope is the MZ \(\beta\) reported in the results table. For direction the metric is simply the fraction of days whose up/down sign is predicted correctly. Every model’s MZ \(R^2\) is low (0.02–0.11): a 5-day realised variance is one noisy draw, so even the optimal forecaster explains little of its variation. Low \(R^2\) on financial targets is the rule, not a bug.
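
Both volatility metrics are short enough to write out. The sketch below follows the formulas above directly; the function names and the returned dictionary keys are illustrative and may differ from eval/metrics.py. Both inputs must be strictly positive variances, since the loss takes a logarithm of their ratio.

```python
import numpy as np


def qlike(pred_var, realized_var):
    """Mean of ratio - log(ratio) - 1 with ratio = realised / predicted variance."""
    ratio = np.asarray(realized_var, dtype=float) / np.asarray(pred_var, dtype=float)
    return float(np.mean(ratio - np.log(ratio) - 1.0))


def mincer_zarnowitz(pred_var, realized_var):
    """OLS of realised on predicted variance; returns intercept, slope and R^2."""
    pred = np.asarray(pred_var, dtype=float)
    real = np.asarray(realized_var, dtype=float)
    design = np.column_stack([np.ones_like(pred), pred])
    coef, *_ = np.linalg.lstsq(design, real, rcond=None)
    resid = real - design @ coef
    r2 = 1.0 - resid @ resid / np.sum((real - real.mean()) ** 2)
    return {"intercept": float(coef[0]), "slope": float(coef[1]), "r2": float(r2)}


pred = np.array([1.0, 2.0, 3.0, 4.0])
print(qlike(pred, pred))                        # zero loss when forecast equals truth
print(qlike(2.0 * pred, pred))                  # positive loss for a forecast twice too high
print(mincer_zarnowitz(pred, 0.5 + 2.0 * pred))
```

QLIKE is asymmetric: under-predicting variance is penalised more heavily than over-predicting it by the same factor, which suits risk applications.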


%==========%


VIII. Return-Direction Forecasting vs Momentum:

The harder task is predicting the sign of tomorrow’s return. Here the LSTM is pitted against a trailing-sign momentum rule:

| Model | Directional accuracy | Permutation \(p\) |
|---|---|---|
| LSTM | 0.515 | 0.14 |
| Momentum | 0.522 | |

Both hover at the 50% coin-flip line, and the LSTM’s permutation test (\(p \approx 0.14\)) fails to reject the null of no signal. This is the expected and correct result: daily index direction is close to a martingale, and no amount of architecture conjures predictability that is not in the data. Showing this clearly — rather than cherry-picking a lucky seed — is the point.
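
A momentum baseline of this kind takes only a few lines. The sketch below assumes that "trailing-sign" means the sign of the trailing \(k\)-day cumulative return; the lookback length and both function names are hypothetical, and the project's rule in models/baselines.py may be defined differently.

```python
import numpy as np


def momentum_direction(returns, lookback=5):
    """pred[t] = 1.0 if the sum of r[t-k+1..t] is positive, else 0.0; NaN before a full window."""
    r = np.asarray(returns, dtype=float)
    csum = np.concatenate([[0.0], np.cumsum(r)])
    trailing = csum[lookback:] - csum[:-lookback]  # window sums ending at t = lookback-1 .. n-1
    pred = np.full(len(r), np.nan)
    pred[lookback - 1 :] = (trailing > 0).astype(float)
    return pred


def directional_accuracy(pred, returns):
    """Compare pred[t] with the sign of returns[t+1], skipping undefined forecasts."""
    r = np.asarray(returns, dtype=float)
    p, up = pred[:-1], r[1:] > 0
    mask = np.isfinite(p)
    return float(((p[mask] >= 0.5) == up[mask]).mean())


toy = np.array([1.0, 2.0, -1.0, 3.0, -5.0, 1.0])
pred = momentum_direction(toy, lookback=2)
print(pred, directional_accuracy(pred, toy))
```

The forecast at \(t\) is scored against the return at \(t+1\), so the rule is held to the same no-look-ahead standard as the neural models.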


%==========%


IX. Diagnosing Failure Modes (eval/diagnostics.py):

Three diagnostics guard against fooling oneself. (1) Overfitting — the training-vs-validation loss curve (the validation split is the chronological tail of the training window, never shuffled in) shows whether the model memorises. (2) Regime sensitivity — per-fold rolling performance reveals that both model and benchmark errors rise together in turbulent stretches and that the win/loss gap opens and closes over time, so a single full-sample average hides large period-to-period swings. (3) Genuine signal — a permutation test holds the out-of-sample predictions fixed and shuffles the realised outcomes thousands of times to build a null distribution of a skill statistic (negative QLIKE for vol, accuracy for direction):

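import numpy as np

def permutation_test(pred, realized, task, n_perm=2000, seed=42):
    # qlike is the loss from eval/metrics.py; for both tasks a higher skill value is better
    rng = np.random.default_rng(seed)   # seeded generator so the null distribution is reproducible
    skill = lambda p, r: -qlike(p, r) if task == "vol" else ((p >= .5) == (r > 0)).mean()
    observed = skill(pred, realized)
    null = [skill(pred, rng.permutation(realized)) for _ in range(n_perm)]
    return {"observed": observed, "p_value": (np.array(null) >= observed).mean()}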

The volatility LSTM’s observed skill sits far in the right tail of the null (\(p < 0.001\)): it is using real structure, not noise — even though it still loses to GARCH. The direction model, by contrast, sits inside the null. The permutation test cleanly separates “has signal but loses” from “has no signal,” which point estimates alone cannot.
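
The behaviour of the test can be checked on toy data where the answer is known. The sketch below is self-contained and makes two choices that are assumptions rather than the project's code: the helper is named `permutation_p_value`, and it applies the common add-one correction so the reported p-value is never exactly zero. The "realised" series is a noisy chi-square proxy around a known true variance.

```python
import numpy as np


def qlike(pred_var, realized_var):
    ratio = np.asarray(realized_var, dtype=float) / np.asarray(pred_var, dtype=float)
    return float(np.mean(ratio - np.log(ratio) - 1.0))


def permutation_p_value(pred, realized, skill, n_perm=2000, seed=42):
    """Share of shuffled outcomes whose skill is at least the observed skill."""
    rng = np.random.default_rng(seed)
    observed = skill(pred, realized)
    null = np.array([skill(pred, rng.permutation(realized)) for _ in range(n_perm)])
    return (1 + np.sum(null >= observed)) / (1 + n_perm)  # add-one correction


def neg_qlike(pred, realized):
    return -qlike(pred, realized)


rng = np.random.default_rng(0)
true_var = np.exp(rng.normal(size=500))                   # time-varying variance level
realized = true_var * rng.chisquare(df=5, size=500) / 5   # noisy proxy with mean true_var
noise_forecast = rng.permutation(true_var)                # same values, no alignment with truth

p_signal = permutation_p_value(true_var, realized, neg_qlike, n_perm=500)
p_noise = permutation_p_value(noise_forecast, realized, neg_qlike, n_perm=500)
print(p_signal, p_noise)
```

A forecast aligned with the true variance lands in the extreme right tail of the null, while the shuffled forecast is itself just one more draw from that null.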


%==========%


X. Attention Visualisation (models/transformer.py):

The Transformer keeps the (head-averaged) attention weights from its self-attention block, so we can ask which past timesteps the model looks at before predicting. Averaging the \(L\times L\) attention map over many windows shows the prediction query concentrating weight on the most recent, most informative lags — consistent with the short memory of conditional variance. The map is exposed directly from the forward pass:

attn_out, attn_w = self.attn(h, h, h, need_weights=True, average_attn_weights=True)
self._last_attn = attn_w.detach()      # (B, L, L) — query × key attention map
...
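def average_attention(model, X_windows):
    model.eval()                                    # disable dropout so the map is deterministic
    with torch.no_grad():                           # forward pass only; no gradients needed
        model(torch.tensor(X_windows, dtype=torch.float32))
    return model.last_attention().mean(0).numpy()   # (L, L) for the heatmap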
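
What the \((B, L, L)\) map contains can be illustrated without torch. The sketch below computes single-head scaled dot-product attention weights in NumPy with random, untrained projection matrices, so it shows only the shape and the row-normalisation of the map, not the recency pattern a trained model would display. All names here are hypothetical.

```python
import numpy as np


def attention_weights(h, w_query, w_key):
    """h: (B, L, D) hidden states -> (B, L, L) softmax(Q K^T / sqrt(d)) weights."""
    q = h @ w_query
    k = h @ w_key
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # stabilise the softmax
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(42)
h = rng.normal(size=(8, 40, 16))          # 8 windows, 40-step lookback, 16-dim states
w_query = rng.normal(size=(16, 16)) / 4.0
w_key = rng.normal(size=(16, 16)) / 4.0

attn = attention_weights(h, w_query, w_key)
avg_map = attn.mean(axis=0)               # (L, L) heatmap averaged over windows
last_query = avg_map[-1]                  # weights the final position puts on each lag
print(avg_map.shape, last_query.sum())
```

Because the forecast is read out from the last position, the final row of the averaged map is the one that answers "which lags does the prediction look at".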

%==========%


XI. CLI (cli.py):

# Install
pip install -e ".[dev]"

# Walk-forward volatility forecast (LSTM) vs GARCH(1,1) / EWMA
deepcast forecast --task vol --model lstm --epochs 100

# Transformer variant, rolling training window
deepcast forecast --task vol --model transformer --mode rolling

# Return-direction forecast vs the momentum baseline
deepcast forecast --task direction --model lstm

# Launch the server-side Streamlit dashboard
streamlit run src/deep_forecasting/app.py
  
| Command | Key options | Output |
|---|---|---|
| deepcast forecast | --task (vol/direction), --model (lstm/transformer), --mode (expanding/rolling), --lookback, --horizon, --epochs | OOS metric table vs baselines and a permutation-test verdict |

%==========%


XII. Test Suite:

Twenty-six tests, fully offline, seed-42. Feature tests verify the anti-leak guarantees directly — that the volatility target equals realised variance over \(t+1\ldots t+h\), that the final rows have no forward target, that window/target alignment is exact, and that the standardiser is fit on the training slice only. Model tests check LSTM/Transformer output shapes and that attention weights form convex rows summing to one. Baseline tests confirm GARCH QMLE recovers the high persistence of the DGP and that EWMA is strictly causal. Walk-forward tests confirm every predicted index is out-of-sample, appears exactly once, is chronological, and that variance forecasts stay positive.

def test_target_is_strictly_forward_vol(series):
    fs = build_features(series, task="vol", horizon=5)
    expected = np.mean(series.ret[101:106] ** 2)        # uses only future returns
    assert np.isclose(np.exp(fs.y[100]), expected, rtol=1e-6)

def test_walk_forward_predictions_are_oos(fs_vol):
    res = walk_forward(fs_vol, kind="lstm", lookback=20, step=120, epochs=5)
    assert res.ends.min() >= int(len(fs_vol.y) * 0.5)   # nothing inside the train window

def test_ewma_is_causal(series):
    full = ewma_variance(series.ret)
    truncated = ewma_variance(series.ret[:200])
    assert np.allclose(full[:200], truncated)           # no peeking forward

%==========%


XIII. Configuration & Setup:

cd assets/projects/deep_forecasting
python -m venv .venv
.venv\Scripts\Activate.ps1                                # Windows PowerShell
# source .venv/bin/activate                               # macOS / Linux
pip install -e ".[dev]"
deepcast forecast --task vol --model lstm                 # reproduce the vol horse-race
pytest -q                                                 # 26 tests, offline
streamlit run src/deep_forecasting/app.py
  

No data download is required: the models, tests and dashboard all run on the synthetic GARCH(1,1) generator with no API keys. The optional scripts/download_data.py pulls yfinance prices, the ^VIX index and a FRED yield-curve-slope series into DuckDB for a live-data study.


Team:

Theodosios Dimitrasopoulos, personal project.

Tools & methods:

Python 3.11, PyTorch (LSTM, Transformer / multi-head attention), NumPy, SciPy (GARCH QMLE), pandas, scikit-learn, Pydantic v2, DuckDB, Typer, rich, Plotly, Streamlit, yfinance, pandas-datareader (FRED), pytest, ruff, hatchling. Methods: recurrent and attention sequence models for time series; walk-forward (rolling-origin) out-of-sample validation; leakage-free feature engineering; GARCH(1,1) and EWMA volatility benchmarks; QLIKE loss (Patton 2011) and Mincer-Zarnowitz (1969) forecast evaluation; permutation testing for signal significance; overfitting and regime diagnostics.