Projects — Theo Dimitrasopoulos

Deep Learning Volatility — Neural Surrogate Calibration of the Heston Surface

Fitting a stochastic volatility model to an option surface means running a numerical pricer inside an optimisation loop: every candidate parameter set re-prices the whole quote grid, every fit costs hundreds of Fourier transforms, and a desk that wants per-tick recalibration, scenario grids, or per-path Greeks quickly finds the pricer is the bottleneck. This project takes the classical calibration pipeline built in the Heston project and replaces its inner loop with a learned one: a feedforward network is trained offline on ~30,000 Latin-hypercube draws from the Heston parameter box, each labelled with the exact implied-vol surface computed by the Carr-Madan FFT, until the network is the pricing map — 7 bp mean absolute error, evaluated in ~50 microseconds. Calibration becomes gradient-based inversion of a frozen network with an exact autograd Jacobian (5.9 ms against 1.05 s for direct FFT calibration, a 177× speed-up), Greeks come from differentiating the surrogate, and the training loss carries soft static no-arbitrage penalties so the network's surfaces stay financially sane. Downstream, the fitted model feeds a minimum-variance hedging study and a Dupire local-vol extraction. Real quotes come from Yahoo Finance option chains with FRED T-bill discounting; training data is fully synthetic. Built on Python 3.11+ with torch (CPU), numpy, scipy, pandas, yfinance, plotly, streamlit, duckdb, and pydantic v2; packaged with hatchling and tested with pytest against an independent quadrature pricer.

%==========%

I. Interactive Dashboard:

The dashboard below runs entirely in the browser via stlite (Streamlit on WebAssembly — no server required). PyTorch does not exist in the browser, so the trained network ships as 17,328 weights in compact JSON and the forward pass is re-implemented in pure numpy — the same trick a production system uses to deploy a surrogate into a constrained runtime. Move the five Heston sliders and the surrogate surface updates in microseconds while the analytic Carr-Madan FFT pricer recomputes the ground truth beside it: the four tabs show the 3D surface (surrogate, analytic, or their difference), the residual heatmap and smile overlays, a 30-surface speed/accuracy study scored live in the tab, and a two-regime synthetic market recalibrated daily through the surrogate with \(\kappa\) and \(\xi\) tracking the regime break. First load downloads Pyodide and may take 20–40 seconds; subsequent loads are cached.

%==========%

II. Project Layout:


vol-surrogate/
├── pyproject.toml                              # Build config, deps, ruff + pytest settings
├── .env.example                                # DB_PATH, DATASET_PATH, MODELS_DIR
├── data/                                       # Git-ignored: DuckDB + generated datasets
├── models/                                     # Committed demo: checkpoint + JSON weight export
├── scripts/
│   └── download_data.py                        # yfinance chain + spot + FRED → DuckDB
├── src/vol_surrogate/
│   ├── data/
│   │   ├── schemas.py                          # Pydantic v2: OptionQuote, SpotRecord, RateRecord, CalibrationRecord
│   │   ├── fetchers.py                         # yfinance option chains + spot, FRED 3m T-bill
│   │   └── store.py                            # DuckDB init/upsert/read for all four tables
│   ├── pricing/
│   │   ├── heston.py                           # Little-trap CF, Carr-Madan FFT, Gil-Pelaez quad, IV grid
│   │   ├── black_scholes.py                    # BS price/vega/delta, OTM-side implied vol
│   │   └── sabr.py                             # Hagan (2002) lognormal SABR expansion
│   ├── surrogate/
│   │   ├── lhs.py                              # Latin-hypercube sampler over the parameter box
│   │   ├── dataset.py                          # (params → IV grid) training pairs, npz store
│   │   ├── mlp.py                              # SiLU MLP, checkpoints, JSON export + numpy forward
│   │   ├── transformer.py                      # Grid-token attention variant
│   │   ├── losses.py                           # Relative MSE + static no-arbitrage penalties
│   │   ├── train.py                            # Training loop, early stopping, val metrics
│   │   ├── compare.py                          # MLP vs transformer study (same data/budget)
│   │   └── calibrate.py                        # Surrogate + direct FFT calibration, benchmark
│   ├── analysis/
│   │   ├── hedging.py                          # Min-variance Heston deltas, delta-hedged P&L backtest
│   │   ├── dupire.py                           # Dupire local vol on total implied variance
│   │   └── params_ts.py                        # Daily recalibration → parameter regime time series
│   ├── report/plots.py                         # Plotly figures (plotly_dark)
│   ├── cli.py                                  # Typer CLI: fetch | generate | train | export-weights | …
│   └── app.py                                  # Server-side Streamlit dashboard
└── tests/                                      # 49 tests, offline, deterministic (seed 42)

%==========%

III. Data Sources:

Two regimes of data feed the project. Real quotes arrive exactly as in the classical pipeline: live SPY option chains (the liquid SPX proxy) from Yahoo Finance's .options API with zero-bid quotes filtered out, spot history for a realised-vol reference, and the 3-month T-bill from FRED so the calibration is not absorbing a rate error into \(\theta\) — all persisted to DuckDB with explicit column lists and date normalisation. Training data is entirely synthetic and that is the point: the supervised pairs \((\Theta_i, \sigma_{imp}(\Theta_i))\) are manufactured by the analytic pricer at any density the budget allows, with no market data, no labels to clean, and ground truth exact to FFT precision. The committed demo uses ~30k samples (about three minutes of CPU); the generator is a single --n flag away from the 500k+ scale used by Horvath, Muguruza & Tomas, since samples are embarrassingly parallel.


# dataset.py — one training pair: one FFT per maturity + vectorised IV inversion
GRID_MONEYNESS = np.linspace(0.80, 1.20, 8)
GRID_MATURITIES = np.array([0.1, 0.25, 0.5, 1.0, 1.5, 2.0])

for i, row in enumerate(X):                       # X ~ Latin hypercube (n, 5)
    grid = iv_grid(HestonParams.from_array(row), GRID_MONEYNESS, GRID_MATURITIES)
    if np.isnan(grid).any():                      # unrecoverable far wings
        keep[i] = False
        continue
    Y[i] = grid.ravel()                           # (48,) implied vols

%==========%

IV. The Calibration Bottleneck — Why Learn a Pricer You Already Have:

The FFT pricer is not slow — one transform prices a whole strike slice in a millisecond. The problem is multiplicative: a calibration needs \(\sim\)10–10² residual evaluations, each evaluation prices six maturity slices and inverts 48 implied vols, finite-difference Jacobians multiply that by another factor of six, and a desk re-fits per underlying, per snapshot, per stress scenario. The classical pipeline lands at a second per fit — fine for a daily mark, hopeless for real-time risk. A surrogate restructures the cost: the expensive part (dataset generation plus training, minutes of CPU here) is paid once, offline, and the online cost collapses to a forward pass. Three properties follow that no amount of FFT optimisation delivers. Amortisation: the per-fit marginal cost is microseconds, so calibration frequency stops being a budget decision. Batching: ten thousand scenario surfaces evaluate as one matrix multiply — on a GPU, effectively free. Differentiability: the surrogate is smooth by construction, so calibration Jacobians and Greeks come from autograd, exact to machine precision of the learned map, instead of noisy finite differences through a discretised transform. The surrogate never replaces the analytic pricer — it is distilled from it, and the analytic engine remains the arbiter in tests and the generator of training labels.

Cost	Direct FFT pipeline	Neural surrogate
One surface evaluation	6 FFTs + 48 inversions (∼150 ms)	One forward pass (∼50 μs)
One calibration	∼1 s (measured: 1.05 s)	∼6 ms (measured: 5.9 ms)
Jacobian	Finite differences (6× cost, noisy)	Autograd (exact, one backward pass)
10k scenario surfaces	∼25 min, serial	One batched matrix multiply
Offline cost	—	Dataset + training (minutes, once)

%==========%

V. Ground Truth — the Heston Characteristic Function and Carr-Madan FFT (`pricing/heston.py`):

The teacher must be beyond suspicion, so the analytic engine reuses every hard-won numerical lesson from the classical project. The Heston model couples the risk-neutral spot to a CIR variance process, \(dv_t = \kappa(\theta - v_t)\,dt + \xi\sqrt{v_t}\,dW_t^v\) with \(d\langle W^S, W^v\rangle_t = \rho\,dt\), and its tractability rests on the affine structure of \((\ln S_t, v_t)\): the characteristic function \(\varphi(u) = \mathbb{E}[e^{iu \ln S_T}]\) is exponentially affine with Riccati coefficients in closed form. The implementation uses the Albrecher et al. "little Heston trap" branching — Heston's original formulation crosses the negative real axis of the complex logarithm at long maturities and produces silently discontinuous prices; the little-trap convention keeps the log argument away from the cut for all maturities. Two identities pin it in tests: \(\varphi(0) = 1\) and \(\varphi(-i) = S_0 e^{rT}\). Prices come from the Carr-Madan method: damp the call with \(e^{\alpha k}\) in log-strike \(k\) so it is square-integrable, transform analytically, and invert with a single FFT that prices the entire strike slice:

\[ C(k) = \frac{e^{-\alpha k}}{\pi} \int_0^{\infty} e^{-ivk}\, \frac{e^{-rT}\,\varphi\bigl(v - (\alpha + 1)i\bigr)}{\alpha^2 + \alpha - v^2 + i(2\alpha + 1)v}\,dv \]

Three numerical traps documented in the sibling project are baked in as regression tests here. The Simpson weight pattern must run \(\tfrac{1}{3}, \tfrac{4}{3}, \tfrac{2}{3}, \tfrac{4}{3}, \dots\) — i.e. \(w_j = (3 - (-1)^j)/3\) with \(w_0 = \tfrac{1}{3}\); transposing the 4/3 and 2/3 weights biases every price by \(O(10^{-2} S_0)\), which would poison every training label. Off-grid strikes are interpolated with a cubic spline, not linearly (linear leaves \(O(10^{-3})\) errors at the grid spacing used). And far wings where the option value falls below FFT resolution (\(\sim 10^{-9} S_0\)) are dropped rather than inverted into garbage vols. The whole path is cross-validated against an independent Gil-Pelaez quadrature pricer, \(C = S_0 P_1 - Ke^{-rT}P_2\), which shares no code with the FFT. The Hagan et al. SABR lognormal expansion is included as a second analytic ground truth — an instructive mirror, since SABR's asymptotic formula is itself a kind of hand-derived surrogate: an algebraic approximation to a map that otherwise needs numerics, with known breakdown regions the neural version does not share.


# heston.py — Carr-Madan with the validated numerics
simpson = (3.0 - (-1.0) ** np.arange(N)) / 3.0     # 1/3, 4/3, 2/3, 4/3, ...
simpson[0] = 1.0 / 3.0
integrand = np.exp(1j * b * v) * psi * eta * simpson
calls = np.exp(-alpha * k) / np.pi * np.fft.fft(integrand).real

# off-grid strikes: cubic spline on the log-strike segment, never linear interp
spline = CubicSpline(log_k_grid[lo:hi], calls[lo:hi])

# independent cross-check pricer (Gil-Pelaez quadrature, no shared code)
def heston_call_quad(S0, K, r, T, params):
    P1 = 0.5 + quad(integrand_p1, 1e-8, 200.0, limit=200)[0] / np.pi
    P2 = 0.5 + quad(integrand_p2, 1e-8, 200.0, limit=200)[0] / np.pi
    return S0 * P1 - K * np.exp(-r * T) * P2

%==========%

VI. The Training Set — Latin Hypercube Sampling and Surface Parameterisation (`surrogate/lhs.py`, `dataset.py`):

A surrogate is only trustworthy inside the convex hull of its training set, so how the parameter box is filled matters as much as how big it is. Independent uniform draws leave clumps and voids; a Latin hypercube stratifies every one-dimensional marginal — each of \(n\) samples occupies its own \(1/n\) slice of every parameter's range — giving far more even coverage of the 5-dimensional box \((\kappa, \theta, \xi, \rho, v_0) \in [0.5, 8] \times [0.02, 0.25] \times [0.1, 1.2] \times [-0.95, 0] \times [0.02, 0.25]\). The sample is filtered to the Feller-respecting region \(2\kappa\theta \geq \xi^2\), where the variance stays strictly positive and the parameter-to-surface map is smoothest. The surface itself is parameterised the way Horvath, Muguruza & Tomas do: a fixed lattice of 8 forward-moneyness levels (0.80–1.20) by 6 maturities (0.1–2.0y) with \(S_0 = 1\) and rates stripped, so the network's 48 outputs always mean the same thing. The lattice's short end starts at 0.1y deliberately: at the box's minimum vol (\(\sqrt{0.02} \approx 14\%\)) the 0.8/1.2 wings sit near five standard deviations — still invertible — whereas at 0.05y they fall below FFT resolution and the NaN-filter would silently bias the sample toward high-vol draws. Implied vols are inverted from the OTM side (parity put below spot — deep-ITM calls sit on intrinsic and the inversion is numerically hopeless) with a vectorised Newton iteration over each strike slice at once; stragglers fall back to Brent. Generation runs at ~150 surfaces/second on one core: ~30k samples in three minutes, with 77 of 30,000 dropped for unrecoverable wings.

%==========%

VII. Two Architectures — MLP vs Grid Transformer (`surrogate/mlp.py`, `transformer.py`):

The map \(\Theta \mapsto \sigma_{imp}\) is smooth, low-dimensional in and modest-dimensional out — close to the best case for a plain MLP. The baseline is a SiLU feedforward network (smooth activations matter: the calibrator differentiates through the net, and ReLU kinks would hand the optimiser a piecewise-constant Jacobian). The committed demo is deliberately small — three hidden layers of 80 units, 17,328 parameters — so its weights ship to the browser as 157 KB of JSON. The alternative treats the 48 lattice nodes as tokens: each node gets a learned positional embedding (which \((k, T)\) am I?) plus a projection of the five parameters (what world am I in?), two pre-norm self-attention blocks let nodes exchange information across the surface, and a per-token head reads out implied vol. Attention is a natural prior — smile shapes are strongly coupled across strikes and maturities — but the study, both architectures trained on identical data, budget, and seed, says the prior is unnecessary at this scale: the structure attention would discover is already cheap for dense layers to learn on a smooth 48-node target, and the MLP wins on error and is 8× faster per inference. The honest conclusion (consistent with the deep-pricing literature): transformers earn their overhead on rough-volatility models with path-dependent features or very large unstructured grids, not on a fixed small lattice.

Same data, budget, seed	MLP (256×4)	Grid transformer (64-d, 2 blocks)
Validation MAE	6.6 bp	8.6 bp
Validation relative MSE	1.13×10^-5	1.69×10^-5
Parameters	211k	70k
Single-surface latency (CPU)	0.05 ms	0.43 ms
Training wall time (25 epochs)	7 s	112 s

%==========%

VIII. The Loss — Relative MSE plus Static No-Arbitrage Penalties (`surrogate/losses.py`):

Plain MSE on implied vol would let the network spend its capacity where vols are large — a 50 bp error at a 40-vol node costs the same as at a 12-vol node, though the second is four times worse in relative terms. The data term is therefore relative MSE, \(\mathbb{E}[((\hat\sigma - \sigma)/\sigma)^2]\). The second term addresses a subtler failure: a network fitted to arbitrage-free surfaces is not itself arbitrage-free, and between training points it can emit surfaces that a downstream pricer would happily monetise. Arbitrage lives in prices, not vols, so the predicted grid is re-priced through a differentiable Black-Scholes layer (normal CDF via \(\operatorname{erf}\), autograd-friendly) and three squared-hinge penalties enforce the classical static conditions: call-spread monotonicity \(\partial C/\partial K \leq 0\), butterfly convexity \(\partial^2 C/\partial K^2 \geq 0\), and the calendar condition that total implied variance \(w = \sigma^2 T\) is non-decreasing in maturity at fixed moneyness. Each penalty is exactly zero on a clean surface and positive precisely where a violation appears — pinned by tests on a flat Black-Scholes surface (zero) and a deliberately corrupted one (positive, with the right component firing). The penalties are soft, in the spirit of a regulariser: they do not guarantee an arbitrage-free output, but they push violations below quote-noise scale at negligible cost to fit.


# losses.py — arbitrage is checked in price space, differentiably
C = torch_bs_call(iv, moneyness, maturities)        # (B, n_mat, n_mon), S0=1, r=0

call_spread = torch.mean(torch.relu(C[..., 1:] - C[..., :-1]) ** 2)   # C must fall in K
fly = C[..., :-2] - 2.0 * C[..., 1:-1] + C[..., 2:]
butterfly = torch.mean(torch.relu(-fly) ** 2)                          # convex in K

w = iv**2 * maturities.view(1, -1, 1)                                  # total variance
calendar = torch.mean(torch.relu(w[:, :-1, :] - w[:, 1:, :]) ** 2)     # w up in T

loss = relative_mse(iv_pred, iv_true) + lambda_arb * (call_spread + butterfly + calendar)

%==========%

IX. Calibration as Inversion (`surrogate/calibrate.py`):

With the pricing map learned, calibration is inversion: find \(\hat\Theta = \arg\min_\Theta \lVert \mathrm{NN}(\Theta) - \sigma^{mkt} \rVert^2\) over the bounded box. The structure of the problem is unchanged from classical calibration — same residuals, same trust-region least squares, same parameter degeneracies (\(\kappa\) and \(\xi\) still trade off against each other; \(v_0\), \(\rho\), \(\theta\) are still the well-identified trio, and tests assert tight recovery on exactly those) — but every residual evaluation is now a 50-microsecond forward pass and the Jacobian comes from torch autograd, exact for the surrogate, instead of finite differences. One implementation detail is worth recording: the network runs in float32, and scipy's default finite-difference step (\(\sim 10^{-8}\)) sits below its noise floor — FD calibration through a float32 network silently zeroes out the weakly-identified \(\kappa\) direction unless the step is widened to \(\sim 10^{-3}\). Autograd has no such failure mode, which is an argument for surrogate calibration that is independent of speed. The benchmark fits the same FFT-generated synthetic surface (the surrogate cannot cheat: the target comes from the analytic pricer) from the same start with the same optimiser settings:

Engine	Wall time	Residual evals	Fit RMSE
Surrogate + autograd Jacobian	5.9 ms	6	4.3 bp
Direct Carr-Madan FFT	1.05 s	7	0.0 bp (vs itself)
Speed-up	177×	—	—

The surrogate's 4.3 bp residual floor is its own approximation error — the irreducible price of the speed-up, and well inside typical index bid-ask vol spreads. Against quotes generated by the surrogate itself the optimiser recovers parameters near-exactly, isolating optimiser plumbing from model error; both recovery modes are pinned in tests through the committed demo checkpoint.

%==========%

X. Greeks and the Delta-Hedged P&L Study (`analysis/hedging.py`):

The point of a stochastic-vol model on a desk is not the fit — it is what the fitted dynamics say about hedging. The backtest simulates spot/variance paths under true Heston dynamics (full-truncation Euler, the smallest-bias scheme of the Euler family), sells an ATM call on each path, and rebalances daily with two hedgers. The Black-Scholes hedger holds \(N(d_1)\) at the frozen initial implied vol. The model hedger holds the minimum-variance delta

\[ \Delta_{mv} = \frac{\partial C}{\partial S}\Big|_{v} + \frac{\rho\,\xi}{S}\,\frac{\partial C}{\partial v} \]

— the spot derivative at fixed variance (Heston is homogeneous of degree one in \((S, K)\), so at fixed \(v\) the smile genuinely rides moneyness \(K/S\)) plus the projection of the correlated variance move onto the spot move, \(\operatorname{cov}(dv, dS)/\operatorname{var}(dS) = \rho\xi/S\). With \(\rho < 0\) the correction cuts the delta below Black-Scholes: when spot falls, vol rises and lifts the short call's value, so less stock is needed. Both Greeks are bump-and-reprice through an implied-vol cube over \((v_0, T, K/S)\) built once before the run — from the analytic pricer by default, or from the surrogate as one batched forward pass over the \(v_0\) grid, which is exactly how per-path Greeks become affordable in production. A cautionary result discovered en route: the naive "smile-aware" delta (spot-bump through the surface without the \(\rho\xi\) correction) hedges worse than frozen-vol Black-Scholes under these dynamics — the two smile effects have opposite signs and the correction term is where the model's hedging value actually lives. With it, on 2,000 paths: P&L standard deviation 1.55 vs 1.72 per option — a 10% reduction over the BS hedger, deterministic at seed 42.

Hedger (short ATM call, daily rebalance)	P&L std	P&L mean
Unhedged	6.77	—
Black-Scholes delta, frozen IV	1.72	+0.04
Minimum-variance Heston delta	1.55 (−10%)	+0.03

%==========%

XI. Dupire Local Volatility from the Fitted Smile (`analysis/dupire.py`):

Dupire's theorem says any arbitrage-free surface is consistent with exactly one local volatility function \(\sigma_{loc}(K, T)\); extracting it from the fitted Heston surface shows what the two model families disagree about. Differencing raw call prices is numerically vicious (the butterfly density in the denominator amplifies every wiggle), so the module uses Gatheral's formulation in total implied variance \(w(y, T) = \sigma_{imp}^2 T\) at log-moneyness \(y\):

\[ \sigma_{loc}^2(y, T) = \frac{\partial_T w}{1 - \frac{y}{w}\,\partial_y w + \frac{1}{4}\bigl(-\frac{1}{4} - \frac{1}{w} + \frac{y^2}{w^2}\bigr)(\partial_y w)^2 + \frac{1}{2}\,\partial_{yy} w} \]

computed by central finite differences on a dense FFT-priced grid, with non-positive denominators (numerical noise or genuine butterfly violations) returned as NaN rather than laundered. On a flat Black-Scholes surface every smile term vanishes and the formula returns the constant vol to \(10^{-10}\) — the validation test. On the Heston smile the classic structure appears: short-dated ATM local vol pins to the instantaneous \(\sqrt{v_0}\), and the local-vol skew is roughly twice the implied skew (the "two" rule: implied vol is approximately the average of local vol along the path from spot to strike). The deeper point is dynamics: local vol freezes today's smile into the spot dimension, so its smile moves opposite to spot, while stochastic vol moves the smile with spot — same vanilla prices today, opposite forward-skew and hedging behaviour, which is precisely why the hedging study in section X is run under stochastic rather than local dynamics.

%==========%

XII. The Parameter Path as a Regime Indicator (`analysis/params_ts.py`):

Once a fit costs milliseconds, refitting every day — or every quote update — is free, and the fitted parameter path itself becomes a market indicator. The module builds a two-regime synthetic market (calm, then stressed: \(\kappa\) and \(\xi\) jump, \(v_0\) spikes above \(\theta\), \(\rho\) deepens — the standard crisis signature) with AR(1) noise, generates each day's quote surface plus bid-ask-scale jitter, and recalibrates daily through the surrogate, warm-starting each fit at the previous day's parameters — the production pattern. The fitted path tracks the truth through the break: \(\xi\) re-rates as the wings bid up and \(\kappa\) spikes as the term structure inverts, both visible the day the regime turns (dashboard tab 4 reruns the experiment in the browser). On real data the same loop runs off the DuckDB store — each day's chain is fetched, fitted by both engines, and appended to the calibrations table, so the regime series accumulates as a by-product of daily marking.

%==========%

XIII. CLI, Tests, and References:

Ten subcommands cover the pipeline from chain ingestion through training to the analysis studies.


# Install
pip install -e ".[dev]"

# Live chain + spot from Yahoo, 3m T-bill from FRED → DuckDB
volsur fetch --ticker SPY

# LHS-sample the Heston box and price ~30k training surfaces (scales to 500k+)
volsur generate --n 30000

# Train the MLP surrogate; committed demo: 80x3, 17,328 params, ~3 min CPU
volsur train --hidden 80 --depth 3 --max-epochs 100

# Export weights to compact JSON for the browser dashboard (157 KB)
volsur export-weights

# Recover known parameters from an FFT-generated surface, in milliseconds
volsur calibrate --xi 0.45 --rho -0.65

# Surrogate vs direct FFT wall-clock benchmark
volsur benchmark

# MLP vs transformer on the same budget; min-variance hedging; Dupire local vol
volsur compare
volsur hedge --n-paths 2000
volsur localvol

# Server-side Streamlit dashboard
volsur dashboard

All 49 tests run offline and deterministic (seed 42), via py -3.12 -m pytest. The pricing layer is pinned by the characteristic-function identities, FFT-vs-quadrature agreement across a strike/maturity grid, put-call parity, the vanishing vol-of-vol Black-Scholes limit, Black-Scholes round-trip inversion to \(10^{-8}\), and the SABR ATM/lognormal limits. The surrogate layer checks LHS bounds/Feller/determinism, dataset shape and sanity, normaliser and checkpoint round-trips, penalty-zero on clean surfaces and penalty-positive on corrupted ones, the overfit-50 sanity check, and that the pure-numpy JSON forward pass matches torch through the 5-significant-digit export. Calibration tests recover known parameters through the committed checkpoint (near-exactly against surrogate-generated quotes, within model error against analytic quotes) and assert the autograd and finite-difference routes agree. Analysis tests pin the discounted-spot martingale, CIR mean reversion, hedge determinism, the minimum-variance-beats-BS ordering, Dupire flat-surface recovery to \(10^{-10}\), and the local-vs-implied skew ordering. Key references: Heston (1993); Albrecher, Mayer, Schoutens & Tistaert, The Little Heston Trap (2007); Carr & Madan (1999); Hagan, Kumar, Lesniewski & Woodward, Managing Smile Risk (2002); Horvath, Muguruza & Tomas, Deep Learning Volatility (2021); Dupire, Pricing with a Smile (1994); Gatheral, The Volatility Surface (2006); Lord, Koekkoek & van Dijk (2010) for the full-truncation scheme.

Deep Learning Volatility — Neural Surrogate Calibration of the Heston Surface

I. Interactive Dashboard:

II. Project Layout:

III. Data Sources:

IV. The Calibration Bottleneck — Why Learn a Pricer You Already Have:

V. Ground Truth — the Heston Characteristic Function and Carr-Madan FFT (pricing/heston.py):

VI. The Training Set — Latin Hypercube Sampling and Surface Parameterisation (surrogate/lhs.py, dataset.py):

VII. Two Architectures — MLP vs Grid Transformer (surrogate/mlp.py, transformer.py):

VIII. The Loss — Relative MSE plus Static No-Arbitrage Penalties (surrogate/losses.py):

IX. Calibration as Inversion (surrogate/calibrate.py):

X. Greeks and the Delta-Hedged P&L Study (analysis/hedging.py):

XI. Dupire Local Volatility from the Fitted Smile (analysis/dupire.py):

XII. The Parameter Path as a Regime Indicator (analysis/params_ts.py):

XIII. CLI, Tests, and References:

V. Ground Truth — the Heston Characteristic Function and Carr-Madan FFT (`pricing/heston.py`):

VI. The Training Set — Latin Hypercube Sampling and Surface Parameterisation (`surrogate/lhs.py`, `dataset.py`):

VII. Two Architectures — MLP vs Grid Transformer (`surrogate/mlp.py`, `transformer.py`):

VIII. The Loss — Relative MSE plus Static No-Arbitrage Penalties (`surrogate/losses.py`):

IX. Calibration as Inversion (`surrogate/calibrate.py`):

X. Greeks and the Delta-Hedged P&L Study (`analysis/hedging.py`):

XI. Dupire Local Volatility from the Fitted Smile (`analysis/dupire.py`):

XII. The Parameter Path as a Regime Indicator (`analysis/params_ts.py`):