Deep Learning for Limit-Order-Book Mid-Price Prediction:
The limit order book encodes the full supply–demand schedule at every instant, so a network trained on multi-level bid–ask queue snapshots can, in principle, predict short-horizon mid-price direction. This project implements and compares four classifiers — a logistic-regression baseline, a 1D-CNN, an LSTM, and the DeepLOB architecture (Zhang, Zohren & Roberts 2019: convolution blocks that collapse the 40-feature axis, an inception module, an LSTM, and attention pooling over time) — on a realistic synthetic order book where the mid genuinely moves with prior queue imbalance plus heavy noise. Everything is evaluated with a strict temporal train/test split (no shuffle, scaler fit on train only), a three-class up/down/stationary label, and per-class precision/recall so the models must prove themselves on the directional moves rather than hiding behind the easy “stationary” majority. The decisive test is economic, not statistical: a zero-latency signal-follower backtest that subtracts the round-trip spread. The honest result is that the directional edge is real (positive gross P&L, 60% hit rate) but does not survive the bid–ask spread — net P&L is negative for every architecture. Built on Python 3.11+ with torch, numpy, pandas, scikit-learn, scipy, plotly, streamlit, duckdb, pydantic v2 and typer; packaged with hatchling and tested with pytest against deterministic seed-42 fixtures.
%==========%
I. Interactive Dashboard:
The dashboard below runs entirely in the browser via stlite (Streamlit on WebAssembly — no server). PyTorch cannot run in Pyodide, so the in-browser demo trains a pure-NumPy multinomial-logistic stand-in on the same 40-feature LOB matrix — it shows the identical pipeline (queue-imbalance features, three-class prediction, and the cost-aware backtest) instantly. The full project trains the CNN, LSTM and DeepLOB networks in torch. First load downloads Pyodide and may take 20–40 seconds.
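A minimal sketch of what a pure-NumPy multinomial-logistic stand-in of this kind could look like follows. It is not the dashboard's actual code: the function names, learning rate, epoch count and toy data are assumptions made for illustration only.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_softmax_regression(X, y, n_classes=3, lr=0.1, epochs=200):
    """Full-batch gradient descent on the mean cross-entropy loss."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]              # one-hot targets, shape (n, n_classes)
    for _ in range(epochs):
        P = softmax(X @ W + b)            # class probabilities
        G = (P - Y) / n                   # gradient of the loss w.r.t. the logits
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b

def predict(X, W, b):
    return softmax(X @ W + b).argmax(axis=1)

# Toy data (assumption): 40 standardised features, labels from a random linear rule.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(600, 40))
y_demo = (X_demo @ rng.normal(size=(40, 3))).argmax(axis=1)
W_demo, b_demo = fit_softmax_regression(X_demo, y_demo)
train_acc = (predict(X_demo, W_demo, b_demo) == y_demo).mean()
```

Because it needs only matrix products and an exponential, a model like this runs comfortably inside Pyodide, which is the point of the stand-in.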
%==========%
II. Project Layout:
deeplob/
├── pyproject.toml # Build config, deps, ruff + pytest settings
├── .env.example # ALPACA / LOBSTER keys (optional)
├── dashboard.html # Self-contained stlite browser demo (numpy logit stand-in)
├── scripts/
│ └── make_thumbnail.py # Real matplotlib thumbnail (gross/cost/net P&L)
├── src/deeplob/
│ ├── data/
│ │ ├── synthetic.py # Realistic 10-level LOB simulator (imbalance→mid)
│ │ ├── features.py # 40-feature matrix, labels, no-leak temporal split
│ │ ├── schemas.py # Pydantic v2 LOB snapshot record
│ │ └── fetchers.py # LOBSTER / Alpaca Level-2 (optional)
│ ├── models/
│ │ ├── architectures.py # LogReg, CNN1D, LSTM, DeepLOB (inception + attention)
│ │ └── baselines.py # sklearn logistic + queue-imbalance rule
│ ├── eval/
│ │ ├── train.py # Mini-batch training, temporal split
│ │ └── metrics.py # 3-class precision/recall/F1, confusion matrix
│ ├── backtest/
│ │ └── signal_follower.py # Zero-latency follower; gross vs net of spread
│ ├── analysis/
│ │ └── saliency.py # Attention / level / bid-vs-ask attribution
│ ├── report/plots.py # Plotly: LOB depth, prob series, P&L curves
│ ├── cli.py # Typer CLI: train | compare | backtest | saliency | stats
│ └── app.py # Streamlit server-side dashboard
└── tests/ # Seed-42 fixtures; features, models, backtest, analysis
%==========%
III. The Order Book & the 40-Feature Matrix (data/features.py):
Each LOB snapshot is summarised by the canonical DeepLOB input: a 40-dimensional vector stacking the price and size at ten levels per side, \([\,p^{\text{ask}}_i, v^{\text{ask}}_i, p^{\text{bid}}_i, v^{\text{bid}}_i\,]_{i=1}^{10}\). Two supplementary, strictly-causal features are added: top-of-book queue imbalance \(I = (v^{\text{bid}}_1 - v^{\text{ask}}_1)/(v^{\text{bid}}_1 + v^{\text{ask}}_1)\) and mid-price velocity. The synthetic simulator is calibrated so the mid genuinely responds to prior imbalance — the realised correlation between top-of-book imbalance and the next mid-move is \(\rho \approx 0.27\), a weak-but-real signal that is exactly the regime real LOB predictors operate in. Sequence models consume a sliding window of snapshots ending at \(t\) (rows \(t-T_{\text{win}}+1 \ldots t\), all in the past) to predict the label at \(t\).
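The feature layout, the imbalance formula and the causal windowing described above can be sketched in NumPy as follows. The function names and the toy book are hypothetical, and the column order (ask price, ask size, bid price, bid size per level) is taken from the vector definition above rather than from the project's `features.py`.

```python
import numpy as np

def build_features(ask_p, ask_v, bid_p, bid_v):
    """Stack four (N, 10) arrays into an (N, 40) matrix.
    Per level the column order is: ask price, ask size, bid price, bid size."""
    levels = np.stack([ask_p, ask_v, bid_p, bid_v], axis=2)   # (N, 10, 4)
    return levels.reshape(len(ask_p), -1)                      # (N, 40)

def queue_imbalance(bid_v, ask_v):
    """Top-of-book imbalance I = (v_bid_1 - v_ask_1) / (v_bid_1 + v_ask_1), in [-1, 1]."""
    b1, a1 = bid_v[:, 0], ask_v[:, 0]
    return (b1 - a1) / (b1 + a1)

def make_windows(X, t_win):
    """Window i holds rows i .. i + t_win - 1, so it ends at t = i + t_win - 1
    and never looks past t. Output shape: (N - t_win + 1, t_win, F)."""
    view = np.lib.stride_tricks.sliding_window_view(X, t_win, axis=0)  # (N - t_win + 1, F, t_win)
    return view.transpose(0, 2, 1)

# Toy book (assumption): fixed price grid, random positive queue sizes.
rng = np.random.default_rng(42)
N = 50
ask_p = 100.0 + 0.01 * np.arange(1, 11) + np.zeros((N, 1))
bid_p = 100.0 - 0.01 * np.arange(1, 11) + np.zeros((N, 1))
ask_v = rng.integers(1, 500, size=(N, 10)).astype(float)
bid_v = rng.integers(1, 500, size=(N, 10)).astype(float)

X40 = build_features(ask_p, ask_v, bid_p, bid_v)
imb = queue_imbalance(bid_v, ask_v)
windows = make_windows(X40, 20)
```

The label for a window is then the one attached to its last row, which keeps every input strictly in the past of the prediction target.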
%==========%
IV. Labelling & the Leak-Free Temporal Split:
The three-class label follows the DeepLOB paper: compare the mean of the next \(k\) mids to the current mid; if the smoothed forward return exceeds \(+\alpha\) the label is up, below \(-\alpha\) it is down, else stationary. The threshold \(\alpha\) is a fixed quantile of the \(\lvert\text{return}\rvert\) distribution chosen for a sensible class balance — a data-driven constant, never tuned on test outcomes. Two anti-leak rules are enforced: the train/test boundary is contiguous with no shuffle (test is strictly later in time than train), and the z-score scaler is fit on the training rows only:
import numpy as np

def temporal_split(n, train_frac=0.7):
    split = int(n * train_frac)
    return np.arange(split), np.arange(split, n)  # NO shuffle: test is strictly after train

train_idx, test_idx = temporal_split(len(X))
scaler = Scaler().fit(X[train_idx])  # project z-score scaler, fit on TRAIN only, so no look-ahead
X_tr, X_te = scaler.transform(X[train_idx]), scaler.transform(X[test_idx])
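The labelling rule described at the start of this section can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the function names are hypothetical, classes are encoded as 0 = down, 1 = stationary, 2 = up, and the quantile `q` used to pick \(\alpha\) is a placeholder value.

```python
import numpy as np

def smoothed_forward_return(mid, k):
    """r_t = mean(mid[t+1 .. t+k]) / mid[t] - 1 for t = 0 .. len(mid) - k - 1.
    The last k rows have no complete forward window and get no label."""
    csum = np.concatenate([[0.0], np.cumsum(mid)])
    fwd_mean = (csum[k + 1:] - csum[1:-k]) / k   # mean of the next k mids
    return fwd_mean / mid[:-k] - 1.0

def three_class_labels(ret, alpha):
    """0 = down, 1 = stationary, 2 = up."""
    return np.where(ret > alpha, 2, np.where(ret < -alpha, 0, 1))

# Toy random-walk mid (assumption) to exercise the functions.
rng = np.random.default_rng(42)
mid = 100.0 + np.cumsum(rng.normal(scale=0.01, size=1000))
ret = smoothed_forward_return(mid, k=10)

# alpha as a fixed quantile of |return|, computed on the training part only.
split = int(len(ret) * 0.7)
q = 0.4                                          # hypothetical quantile
alpha = np.quantile(np.abs(ret[:split]), q)
labels = three_class_labels(ret, alpha)
```

Computing `alpha` from the training rows alone mirrors the scaler rule: no statistic of the test period enters the label definition.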
%==========%
V. The Architectures (models/architectures.py):
Four models of increasing structure are compared. Logistic regression on the flattened window is the simple baseline. 1D-CNN applies temporal convolutions with features as channels. LSTM reads the window and classifies from its final hidden state. DeepLOB treats the window as an image \((B,1,T,F)\): convolution blocks with stride-2 kernels progressively collapse the 40-feature axis (price/size pairs → level features → across levels), an inception module runs parallel temporal filters, an LSTM models the resulting sequence, and a learned attention pooling over time produces the classification vector — and exposes which timesteps mattered.
import torch.nn as nn
import torch.nn.functional as F

class DeepLOB(nn.Module):
    # __init__ (layer definitions) omitted in this excerpt
    def forward(self, x):  # x: (B, T, 40)
        z = self.conv1(x.unsqueeze(1))  # (B, 1, T, 40) in; collapse px/sz pairs (stride-2 over features)
        z = self.conv2(z)
        z = self.conv3(z)  # collapse the feature axis fully
        z = self.inception(z)  # parallel temporal conv paths → (B, 96, T', 1)
        out, _ = self.lstm(z.squeeze(-1).transpose(1, 2))  # (B, T', 64)
        w = F.softmax(self.attn(out).squeeze(-1), dim=1)  # attention weights over time, (B, T')
        self._last_attn = w.detach()  # stored so the weights can be inspected afterwards
        return self.head((out * w.unsqueeze(-1)).sum(dim=1))  # 3-class logits
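To make the attention-pooling step concrete, here is a NumPy sketch of the same computation on its own: score each timestep, softmax over time, and take the weighted sum of hidden states. The scoring vector `v` is a hypothetical stand-in for the learned `self.attn` layer (assumed here to be a single linear map without bias), and the shapes are illustrative.

```python
import numpy as np

def attention_pool(h, v):
    """h: (B, T, H) hidden states, v: (H,) scoring vector.
    Returns the pooled vector (B, H) and the attention weights (B, T)."""
    scores = h @ v                                         # one scalar score per timestep, (B, T)
    scores = scores - scores.max(axis=1, keepdims=True)    # stabilise the softmax
    w = np.exp(scores)
    w = w / w.sum(axis=1, keepdims=True)                   # softmax over the time axis
    pooled = (h * w[..., None]).sum(axis=1)                # weighted sum over time, (B, H)
    return pooled, w

rng = np.random.default_rng(42)
h_demo = rng.normal(size=(4, 20, 64))
v_demo = rng.normal(size=64)
pooled_demo, w_demo = attention_pool(h_demo, v_demo)
```

The weights `w` are what make the pooling interpretable: each row sums to one, so a large entry marks a timestep that dominated the classification vector.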
%==========%
VI. Classification Results (cli.py compare):
Out-of-sample on the temporal test split (\(n = 20{,}000\) snapshots, window 20, horizon \(k = 10\), 30 epochs). Because the “stationary” class is the majority, raw accuracy is misleading — a model can score well by never predicting a move. Macro-F1 (averaged over all three classes) is the honest summary, and per-class F1 shows where each model earns it:
| Model | Accuracy | Macro-F1 | F1 down | F1 stationary | F1 up |
|---|---|---|---|---|---|
| Queue imbalance (naive) | 0.361 | 0.368 | 0.411 | 0.232 | 0.463 |
| LogReg (sklearn) | 0.611 | 0.422 | 0.225 | 0.736 | 0.304 |
| LogReg (torch) | 0.425 | 0.430 | 0.435 | 0.399 | 0.455 |
| 1D-CNN | 0.421 | 0.420 | 0.398 | 0.411 | 0.450 |
| LSTM | 0.468 | 0.436 | 0.360 | 0.537 | 0.410 |
| DeepLOB | 0.408 | 0.397 | 0.332 | 0.438 | 0.420 |
Honestly reported: the LSTM wins on macro-F1 (0.436), narrowly ahead of the torch logistic regression (0.430); DeepLOB’s extra machinery does not dominate on this synthetic book and in fact posts the lowest macro-F1 of the learned models (0.397). The sklearn logistic regression’s eye-catching 0.611 accuracy is a trap: it gets there by leaning heavily on “stationary” (F1 up/down collapse to 0.30/0.23), which is useless for trading. The torch models give up headline accuracy (0.41 to 0.47, against 0.611) in exchange for far better balance across the directional classes, which is what a signal needs.
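The accuracy trap is easy to reproduce on a toy example. The sketch below computes per-class F1 and macro-F1 from scratch; the 20/60/20 class mix and the "always stationary" predictor are invented for illustration and are not the project's data.

```python
import numpy as np

def per_class_f1(y_true, y_pred, n_classes=3):
    """F1 per class, with F1 = 2*TP / (2*TP + FP + FN), and 0 when the class is never hit."""
    f1 = np.zeros(n_classes)
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1[c] = 2 * tp / denom if denom > 0 else 0.0
    return f1

# Toy labels: 0 = down, 1 = stationary, 2 = up, with a 60% stationary majority.
y_true = np.array([0] * 20 + [1] * 60 + [2] * 20)
y_lazy = np.ones_like(y_true)            # a "model" that always predicts stationary

acc_lazy = (y_lazy == y_true).mean()     # 0.60
f1_lazy = per_class_f1(y_true, y_lazy)   # [0.0, 0.75, 0.0]
macro_lazy = f1_lazy.mean()              # 0.25
```

The lazy predictor scores 60% accuracy while carrying no directional information at all, and macro-F1 exposes that immediately.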
%==========%
VII. Queue Imbalance — the Naive Predictor (models/baselines.py):
The single most informative LOB feature is top-of-book queue imbalance: when bids vastly outnumber asks, the mid tends to tick up. As a standalone rule it scores well on the directional classes (F1 up/down \(\approx 0.46/0.41\), the best of any model on the up class) but poorly overall (accuracy 0.361, macro-F1 0.368) because it handles the “stationary” class badly (F1 0.232). This is the benchmark every neural model must justify itself against, and the learned models beat it on macro-F1 by \(+0.03\) (DeepLOB) to \(+0.07\) (LSTM): a real but modest improvement that comes from learning when not to trade, not from finding a stronger directional signal than imbalance already provides.
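A rule of this kind fits in a few lines. The sketch below is an assumed form, not the code in `baselines.py`: it predicts from the sign of the imbalance, and the optional dead band `band` is a hypothetical parameter added to show how a “stationary” zone could be introduced.

```python
import numpy as np

def imbalance_rule(imb, band=0.0):
    """Predict 2 (up) when imbalance > band, 0 (down) when < -band, else 1 (stationary).
    With band = 0, stationary is emitted only when the two queues are exactly equal."""
    return np.where(imb > band, 2, np.where(imb < -band, 0, 1))

imb_demo = np.array([0.5, -0.2, 0.0])
pred_sign = imbalance_rule(imb_demo)             # sign-only rule
pred_band = imbalance_rule(imb_demo, band=0.3)   # rule with a dead band
```

Widening the band trades directional coverage for fewer, more confident calls, which is the same trade-off the learned models make implicitly.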
%==========%
VIII. The Backtest: Does the Edge Survive the Spread? (backtest/signal_follower.py):
Statistical accuracy is not money. A zero-latency signal follower enters on each up/down prediction and exits after \(k\) events; the gross P&L captures the directional edge, and the realised round-trip bid–ask spread is subtracted to get net. The verdict, on 4,024 DeepLOB trades:
| Component | Value |
|---|---|
| Hit rate (directional) | ~60% |
| Gross P&L | +3,850 bps |
| Spread cost | −8,091 bps |
| Net P&L | −4,242 bps |
The directional edge is unmistakably real — positive gross P&L and a 60% hit rate — yet the spread is roughly twice the gross edge, so the strategy loses money net, and net P&L is negative for every architecture. This is the central, deliberately un-oversold lesson of LOB prediction: sub-minute mid-price predictability exists and is statistically significant, but a naive signal-taker pays the spread on every round trip and the edge evaporates. Profiting requires being a liquidity provider (earning the spread) or having genuine latency/queue-position advantage — not simply a good classifier.
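The mechanics of such a backtest can be sketched as follows. This is an illustration under assumptions, not the project's `signal_follower.py`: the function name and return fields are hypothetical, entry and exit are both taken at the mid, and the quoted spread at entry is charged once as the round-trip cost.

```python
import numpy as np

def signal_follower(mid, spread, pred, k):
    """pred: 0 = down, 1 = stationary, 2 = up, aligned with mid and spread.
    Trades every non-stationary prediction that still has an exit k events later.
    All P&L figures are in basis points of the entry mid."""
    t = np.flatnonzero(pred[:-k] != 1)                  # entry indices
    direction = pred[t] - 1                             # -1 = short, +1 = long
    gross = direction * (mid[t + k] - mid[t]) / mid[t] * 1e4
    cost = spread[t] / mid[t] * 1e4                     # assumed round-trip cost: one full spread
    return {
        "trades": len(t),
        "hit_rate": float((gross > 0).mean()) if len(t) else float("nan"),
        "gross_bps": float(gross.sum()),
        "cost_bps": float(cost.sum()),
        "net_bps": float(gross.sum() - cost.sum()),
    }

# Toy path (assumption): the mid drifts up one cent per event, spread is two cents.
mid_demo = np.array([100.00, 100.01, 100.02, 100.03, 100.04])
spread_demo = np.full(5, 0.02)
pred_demo = np.array([2, 1, 2, 1, 1])
res_demo = signal_follower(mid_demo, spread_demo, pred_demo, k=1)
```

Even with both toy trades correct, the one-cent gain per trade cannot cover the two-cent spread. The table above shows the same pattern per trade: 3,850 / 4,024 is about 0.96 bps of gross edge against 8,091 / 4,024, about 2.01 bps of spread cost.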
%==========%
IX. Attention & Level Attribution (analysis/saliency.py):
Gradient saliency over the input attributes the DeepLOB prediction back to price levels and to the bid vs ask side. The model weights size over price by roughly 0.81 to 0.19 and splits attention almost evenly across the two sides (0.49 bid / 0.51 ask) — exactly what one expects if queue volume (imbalance), not the price grid, is what drives the next move. The most influential price level shifts among the near-touch levels across runs, consistent with the action living close to the top of book.
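How a saliency map turns into the shares quoted above can be sketched as a simple aggregation. The code assumes a precomputed matrix of absolute input gradients of shape (T, 40) with at least one non-zero entry, and the per-level column order ask price, ask size, bid price, bid size; the function name and return keys are hypothetical, and the gradient computation itself is not shown.

```python
import numpy as np

def attribution_shares(sal):
    """sal: (T, 40) saliency magnitudes. Returns the share of total saliency
    by field type (price vs size), by side (ask vs bid), and the top level (1-based)."""
    s = np.abs(sal).sum(axis=0).reshape(10, 4)   # rows = levels, cols = [ask p, ask v, bid p, bid v]
    total = s.sum()
    return {
        "price": float((s[:, 0].sum() + s[:, 2].sum()) / total),
        "size": float((s[:, 1].sum() + s[:, 3].sum()) / total),
        "ask": float(s[:, :2].sum() / total),
        "bid": float(s[:, 2:].sum() / total),
        "top_level": int(s.sum(axis=1).argmax()) + 1,
    }

# Toy map (assumption): all saliency sits on the bid size at level 3.
sal_demo = np.zeros((20, 40))
sal_demo[:, 4 * 2 + 3] = 1.0
shares_demo = attribution_shares(sal_demo)
```

Normalising by the total makes the price/size and bid/ask pairs each sum to one, which is the form in which the 0.81 / 0.19 and 0.49 / 0.51 splits are reported.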
%==========%
X. CLI — cli.py:
# Install
pip install -e ".[dev]"
# Train one model and print per-class precision/recall
deeplob train --model deeplob
# Compare all four models + the queue-imbalance baseline
deeplob compare
# Backtest the signal follower: gross vs net of spread
deeplob backtest --model deeplob
# Attribute predictions to LOB levels / bid-vs-ask side
deeplob saliency --model deeplob
# Launch the server-side Streamlit dashboard
streamlit run src/deeplob/app.py
| Command | Key options | Output |
|---|---|---|
| deeplob compare | --n, --window, --k, --epochs, --drift | Per-model accuracy / macro-F1 / per-class F1 leaderboard |
| deeplob backtest | --model (incl. imbalance), --k | Trades, gross, spread cost, net P&L |
%==========%
XI. Test Suite:
Nineteen tests, fully offline, seed-42. Feature tests verify the matrix is exactly 40-dimensional, that the scaler is fit on the training rows only (no leak), and that the three-class label is constructed correctly and reasonably balanced. Model tests confirm the forward-pass shape of all four architectures. Split tests confirm the temporal train/test indices never overlap and that test is strictly later than train. A learning test confirms a model beats random on the synthetic data, and backtest tests confirm gross/net P&L are finite and that the spread cost is material.
import numpy as np

def test_scaler_fit_on_train_only(dataset):
    tr, te = temporal_split(len(dataset.mid))
    assert tr.max() < te.min()  # test strictly after train, no overlap
    sc = Scaler().fit(dataset.X40[tr])
    assert np.allclose(sc.transform(dataset.X40[tr]).mean(0), 0, atol=1e-6)  # train columns are centred

def test_backtest_spread_eats_edge(trained):
    res = run_backtest(trained, ...)
    assert res.gross_bps > 0 and res.cost_bps > 0  # real edge, real cost
    assert res.net_bps == res.gross_bps - res.cost_bps
%==========%
XII. Configuration & Setup:
cd assets/projects/deeplob
python -m venv .venv
.venv\Scripts\Activate.ps1 # Windows (PowerShell)
pip install -e ".[dev]"
deeplob compare # reproduce the leaderboard
deeplob backtest --model deeplob # gross vs net of spread
pytest -q # 19 tests, offline
streamlit run src/deeplob/app.py
No data download is required: the models, tests and dashboard all run on the synthetic LOB simulator with no API keys. The optional fetchers in data/fetchers.py support the free LOBSTER AAPL sample and Alpaca Level-2 data for a real-data study.
Team:
Theodosios Dimitrasopoulos, personal project.
Tools & methods:
Python 3.11, PyTorch (1D-CNN, LSTM, DeepLOB with inception + attention), scikit-learn, NumPy, SciPy, pandas, Pydantic v2, DuckDB, Typer, rich, Plotly, Streamlit, pytest, ruff, hatchling. Methods: limit-order-book microstructure; the 40-feature LOB representation and the DeepLOB architecture (Zhang, Zohren & Roberts 2019); queue imbalance as a microstructure predictor; leak-free temporal (rolling-origin) validation; multi-class precision/recall evaluation; gradient saliency / attention attribution; transaction-cost-aware signal-following backtest.