Optimal Trade Execution — Almgren-Chriss and Reinforcement Learning

Executing a large order is itself an optimisation problem. Trade too fast and your own order moves the price against you — market impact. Trade too slowly and price volatility accumulates while you wait — timing risk. The Almgren-Chriss model (2000) resolves the tension with a closed-form trajectory that minimises the expected execution cost plus a risk-aversion-weighted variance of implementation shortfall, tracing an efficient frontier from gradual TWAP-like liquidation to aggressive front-loading. But its parameters are constant by assumption, while real intraday liquidity, volume, and momentum are anything but. This module derives and implements the Almgren-Chriss solution and its frontier, calibrates temporary and permanent impact from intraday data, builds a vectorised gymnasium execution environment with liquidity-dependent impact, and trains an in-house PyTorch PPO agent (actor-critic, GAE, clipped objective — no stable-baselines3) that learns to deviate from the static schedule in response to volume and momentum, beating Almgren-Chriss, TWAP, and VWAP on a risk-adjusted basis. Built on Python 3.11+ with numpy, scipy, gymnasium, torch, pandas, duckdb, pydantic v2, plotly, streamlit, and typer; packaged with hatchling and tested with pytest (16 offline, seed-42 tests).


%==========%


I. Interactive Dashboard:

The dashboard below runs entirely in the browser via stlite (Streamlit on WebAssembly — no server). Order size and risk-aversion sliders drive four tabs: the Almgren-Chriss trajectory morphing from TWAP to front-loaded as risk aversion rises, the cost-vs-risk efficient frontier with the selected point marked, a side-by-side distribution of implementation shortfall for AC, TWAP, VWAP, and an adaptive policy across thousands of simulated non-stationary days, and a readout of how the learned policy's participation responds to time-of-day and momentum. First load downloads Pyodide and may take 20–40 seconds; subsequent loads are cached.


%==========%


II. Project Layout:

optimal-execution/
├── pyproject.toml                              # hatchling build, deps, ruff + pytest
├── .env.example                                # DB_PATH
├── scripts/download_data.py                    # yfinance 5-minute bars → DuckDB
├── src/optimal_execution/
│   ├── almgren_chriss/
│   │   └── model.py                            # Closed-form trajectory, cost moments, frontier
│   ├── impact/
│   │   └── calibrate.py                        # Participation-rate impact regression
│   ├── env/
│   │   ├── execution_env.py                    # gymnasium.Env, liquidity-dependent impact
│   │   └── schedulers.py                       # TWAP / VWAP / AC schedules + runner
│   ├── rl/
│   │   ├── ppo.py                              # In-house actor-critic PPO (GAE + clip)
│   │   └── evaluate.py                         # Compare vs baselines + policy readout
│   ├── data/                                   # schemas.py, fetchers.py, store.py (DuckDB)
│   ├── report/plots.py                         # Plotly figures
│   ├── cli.py                                  # Typer CLI: fetch | frontier | calibrate | train | evaluate | dashboard
│   └── app.py                                  # Streamlit server-side dashboard
└── tests/                                      # test_almgren_chriss, test_impact, test_env, test_rl
  

%==========%


III. Data Sources:

Intraday 5-minute OHLCV bars for a liquid name (e.g. SPY) come from Yahoo Finance and serve two purposes: estimating the market-impact parameters and building the execution simulator's volume profile. Average daily trading volume calibrates the participation rate, and the high-low range proxies the bid-ask spread that enters temporary impact separately from permanent impact. Every fetch degrades gracefully to a documented synthetic intraday panel — a GBM with a U-shaped intraday volume profile and day-to-day volatility regimes — so calibration, the environment, and the dashboard all run offline without credentials.


# data/fetchers.py — synthetic intraday panel (non-stationary testbed)
for d in range(n_days):
    day_sigma = float(np.clip(0.012 * (1 + 0.5 * rng.standard_normal()), 0.004, 0.03))
    bar_sigma = day_sigma / np.sqrt(bars_per_day)
    day_volume_mult = float(np.clip(1 + 0.4 * rng.standard_normal(), 0.4, 2.5))
    for b in range(bars_per_day):
        ret = bar_sigma * rng.standard_normal()
        vol = base_vol_per_bar * u_shape[b] * day_volume_mult   # U-shaped profile
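
The excerpt above leaves out the setup around the loop. A minimal self-contained sketch of such a generator follows; the function name `synthetic_intraday_panel`, the parabolic form of `u_shape`, the starting price, and the 78-bar day (6.5 hours of 5-minute bars) are assumptions for illustration, not the package's actual code.

```python
import numpy as np

def synthetic_intraday_panel(n_days=5, bars_per_day=78, s0=100.0,
                             base_vol_per_bar=1e5, seed=42):
    """Hypothetical helper: GBM closes with a U-shaped intraday volume profile."""
    rng = np.random.default_rng(seed)
    # Assumed U-shape: a parabola in bar position, normalised to mean 1.
    pos = np.linspace(-1.0, 1.0, bars_per_day)
    u_shape = 0.5 + 1.5 * pos ** 2
    u_shape /= u_shape.mean()
    price = s0
    rows = []
    for d in range(n_days):
        # Day-level volatility regime and volume multiplier, as in the excerpt.
        day_sigma = float(np.clip(0.012 * (1 + 0.5 * rng.standard_normal()), 0.004, 0.03))
        bar_sigma = day_sigma / np.sqrt(bars_per_day)
        day_volume_mult = float(np.clip(1 + 0.4 * rng.standard_normal(), 0.4, 2.5))
        for b in range(bars_per_day):
            ret = bar_sigma * rng.standard_normal()
            price *= np.exp(ret)                      # log-return price update
            vol = base_vol_per_bar * u_shape[b] * day_volume_mult
            rows.append((d, b, price, vol))
    return np.array(rows)                             # columns: day, bar, close, volume

panel = synthetic_intraday_panel()
```

Within a day the volume multiplier is constant, so the first and last bars always carry more volume than the midday bars; across days both the volatility and the volume level change, which is what makes the panel non-stationary.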
  

%==========%


IV. Execution as Optimisation — Impact versus Timing Risk:

Liquidating \(X\) shares over a horizon \(T\) split into \(N\) intervals, the trader chooses a schedule \(x_0 = X, x_1, \dots, x_N = 0\) of remaining holdings. Two costs pull in opposite directions. Trading quickly concentrates volume, and market impact — the price concession your own flow demands — grows with the trading rate. Trading slowly leaves a large position exposed for longer to volatility, so the variance of the final execution price (the implementation shortfall versus the arrival price) grows with the time on the book. The expected cost falls and the cost variance rises as execution slows; there is no schedule that minimises both. Almgren-Chriss makes the trade-off explicit by minimising a mean-variance objective, and the choice of where to sit on the resulting frontier is the trader's risk appetite.

| Trade faster | Trade slower |
| --- | --- |
| Higher market impact (more cost) | Lower market impact (less cost) |
| Less time exposed ⇒ lower variance | More time exposed ⇒ higher variance |
| Suits high risk aversion \(\lambda\) | Suits low risk aversion \(\lambda\) |

%==========%


V. The Almgren-Chriss Model (almgren_chriss/model.py):

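With linear permanent impact \(g(v) = \gamma v\) and temporary impact \(h(v) = \epsilon\,\mathrm{sign}(v) + \eta v\) at trading rate \(v = n_j/\tau\), where \(n_j = x_{j-1} - x_j\) is the number of shares sold in interval \(j\) and \(\tau = T/N\) is the interval length, the arrival-price implementation shortfall of liquidating \(X\) shares has expected cost \(\mathbb{E}[X]\) and variance \(\mathbb{V}[X]\)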

\[ \mathbb{E}[X] = \tfrac12\gamma X^2 + \epsilon X + \frac{\tilde\eta}{\tau}\sum_j n_j^2, \qquad \mathbb{V}[X] = \sigma^2\tau\sum_j x_j^2, \]

where \(\tilde\eta = \eta - \gamma\tau/2\). Minimising \(\mathbb{E} + \lambda\mathbb{V}\) yields a hyperbolic-sine holdings trajectory

\[ x_j = X\,\frac{\sinh\!\big(\kappa(T - t_j)\big)}{\sinh(\kappa T)}, \qquad \cosh(\kappa\tau) = 1 + \frac{\lambda\sigma^2\tau^2}{2\tilde\eta}, \]

with the small-\(\tau\) limit \(\kappa^2 \approx \lambda\sigma^2/\tilde\eta\). The decay rate \(\kappa\) is the whole story: \(\lambda \to 0\) gives \(\kappa \to 0\) and the straight TWAP line (pure cost minimisation, maximal timing risk), while larger \(\lambda\) bows the curve into front-loading, trading fast early to shrink the variance of the shortfall. The implementation solves the \(\cosh\) equation exactly with \(\mathrm{arccosh}\) and the tests pin that identity.


# model.py
def kappa(self, lam, approx=False):
    sig2 = self.p.sigma ** 2
    kappa2_tilde = lam * sig2 / self.eta_tilde            # small-tau kappa^2
    if approx:
        return float(np.sqrt(kappa2_tilde))
    rhs = 1.0 + 0.5 * kappa2_tilde * self.tau ** 2        # = cosh(kappa*tau)
    return float(np.arccosh(rhs) / self.tau)

def trajectory(self, lam, approx=False):
    times = np.linspace(0.0, self.T, self.N + 1)
    k = self.kappa(lam, approx=approx)
    holdings = (self.X * (1.0 - times / self.T) if k <= 0
                else self.X * np.sinh(k * (self.T - times)) / np.sinh(k * self.T))
    trades = -np.diff(holdings)                           # n_j = x_{j-1} - x_j
    return ACSolution(times, holdings, trades, k,
                      self._expected_cost(trades), self._variance(holdings), lam)
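
The class excerpt above depends on attributes that are not shown. As a standalone sketch of the same formulas, the functions below (hypothetical names `ac_kappa`, `ac_holdings`, `cost_moments`) compute the exact \(\kappa\), the holdings path, and the two cost moments. The parameter values are illustrative assumptions, with time measured in days.

```python
import numpy as np

def ac_kappa(lam, sigma, eta_tilde, tau):
    """Exact decay rate from cosh(kappa*tau) = 1 + lam*sigma^2*tau^2 / (2*eta_tilde)."""
    rhs = 1.0 + 0.5 * lam * sigma ** 2 * tau ** 2 / eta_tilde
    return float(np.arccosh(rhs) / tau)

def ac_holdings(X, T, N, lam, sigma, eta, gamma):
    """Holdings x_0..x_N; falls back to the straight TWAP line when kappa is zero."""
    tau = T / N
    eta_tilde = eta - 0.5 * gamma * tau
    k = ac_kappa(lam, sigma, eta_tilde, tau)
    t = np.linspace(0.0, T, N + 1)
    if k <= 0.0:
        return X * (1.0 - t / T), k
    return X * np.sinh(k * (T - t)) / np.sinh(k * T), k

def cost_moments(x, tau, sigma, eta, gamma, epsilon):
    """Expected cost and variance of shortfall for a holdings path x."""
    n = -np.diff(x)                                   # n_j = x_{j-1} - x_j
    eta_tilde = eta - 0.5 * gamma * tau
    e = 0.5 * gamma * x[0] ** 2 + epsilon * x[0] + eta_tilde / tau * np.sum(n ** 2)
    v = sigma ** 2 * tau * np.sum(x[1:] ** 2)         # sum over j = 1..N
    return float(e), float(v)

# Illustrative (assumed) parameters: 1M shares over 5 daily intervals.
X, T, N = 1e6, 5.0, 5
params = dict(sigma=0.95, eta=2.5e-6, gamma=2.5e-7)
x_twap, k0 = ac_holdings(X, T, N, lam=0.0, **params)
x_ac, k = ac_holdings(X, T, N, lam=2e-6, **params)
e_twap, v_twap = cost_moments(x_twap, T / N, epsilon=0.0625, **params)
e_ac, v_ac = cost_moments(x_ac, T / N, epsilon=0.0625, **params)
```

With these inputs the risk-averse path sells more in the first interval than TWAP, pays a higher expected cost, and carries a lower shortfall variance, which is the trade-off the section describes.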
  

%==========%


VI. The Efficient Frontier of Execution:

Sweeping the risk aversion \(\lambda\) and recording \((\,\mathbb{V}[X], \mathbb{E}[X])\) for each optimal trajectory traces the efficient frontier — the locus of execution schedules for which no lower expected cost is achievable at a given shortfall variance. Low-\(\lambda\) schedules sit at the low-cost, high-risk end (TWAP-like); high-\(\lambda\) schedules front-load to the low-risk, high-cost end. The frontier is monotone: as variance falls, expected cost rises, with no schedule below the curve. It is the execution analogue of the mean-variance portfolio frontier, and choosing a point on it is identical to choosing a risk appetite. The module returns the full frontier (cost, variance, std, and \(\kappa\)) and the tests assert the trade-off is monotone in both coordinates.
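
The sweep and the monotonicity property can be checked in a few lines. The sketch below is a standalone approximation of that check, not the package's function: `frontier_points` is a hypothetical name, and the default parameters are illustrative assumptions.

```python
import numpy as np

def frontier_points(lambdas, X=1e6, T=5.0, N=5, sigma=0.95,
                    eta=2.5e-6, gamma=2.5e-7, epsilon=0.0625):
    """Return rows of (lambda, expected cost, variance) for each optimal schedule."""
    tau = T / N
    eta_tilde = eta - 0.5 * gamma * tau
    t = np.linspace(0.0, T, N + 1)
    out = []
    for lam in lambdas:
        # Exact kappa from the cosh identity; kappa = 0 reduces to TWAP.
        k = np.arccosh(1.0 + 0.5 * lam * sigma ** 2 * tau ** 2 / eta_tilde) / tau
        x = X * (1 - t / T) if k <= 0 else X * np.sinh(k * (T - t)) / np.sinh(k * T)
        n = -np.diff(x)
        cost = 0.5 * gamma * X ** 2 + epsilon * X + eta_tilde / tau * np.sum(n ** 2)
        var = sigma ** 2 * tau * np.sum(x[1:] ** 2)
        out.append((lam, cost, var))
    return np.array(out)

pts = frontier_points(np.concatenate([[0.0], np.logspace(-8, -4, 40)]))
cost_rises = np.all(np.diff(pts[:, 1]) >= -1e-6 * pts[0, 1])   # small float tolerance
var_falls = np.all(np.diff(pts[:, 2]) <= 1e-6 * pts[0, 2])
```

Both flags come out true: as \(\lambda\) increases along the grid, expected cost never falls and variance never rises, up to floating-point tolerance.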


# model.py
def efficient_frontier(ac, lambdas=None):
    """Sweep risk aversion and record cost, variance, std, and kappa."""
    if lambdas is None:
        lambdas = np.concatenate([[0.0], np.logspace(-8, -4, 40)])
    e = np.empty(len(lambdas))
    v = np.empty(len(lambdas))
    k = np.empty(len(lambdas))
    for i, lam in enumerate(lambdas):
        sol = ac.trajectory(float(lam))
        e[i], v[i] = sol.expected_cost, sol.variance
        k[i] = ac.kappa(float(lam))                       # decay rate at this lambda
    return {"lambda": lambdas, "expected_cost": e, "variance": v,
            "std": np.sqrt(np.maximum(v, 0.0)), "kappa": k}
  

%==========%


VII. Calibrating Market Impact from Intraday Data (impact/calibrate.py):

The impact coefficients are estimated by regressing observed per-bar price impact on the participation rate \(\rho = |Q|/V\) (unsigned order size over bar volume; the trade direction enters through \(\mathrm{sign}(Q)\)). Two functional forms are supported: the linear law \(\Delta p/p = \eta\rho\,\mathrm{sign}(Q)\), which maps directly onto the Almgren-Chriss \(h(v) = \eta v\) temporary impact, and the concave square-root law \(\Delta p/p = \eta\sqrt{\rho}\,\mathrm{sign}(Q)\) widely reported for metaorders (Almgren et al. 2005 estimate a comparable concave exponent of about 3/5 for temporary impact). The temporary-impact slope comes from an OLS-through-the-origin fit; the permanent component is the through-origin slope of cumulative return on cumulative signed participation; \(\sigma\) is the realised per-bar return volatility; and a half-spread proxy comes from the high-low range. The regression coefficients (dimensionless, return per unit of participation) are then mapped to the Almgren-Chriss parameters, which are expressed in price per unit share rate, by scaling with price and average bar volume.


# calibrate.py
regressor = sign * (np.sqrt(part) if model == "sqrt" else part)
eta, r2 = _ols_through_origin(regressor, ret)             # temporary impact slope
cum_flow, cum_ret = np.cumsum(sign * part), np.cumsum(ret)
gamma, _ = _ols_through_origin(cum_flow, cum_ret)         # permanent impact drift
sigma = float(np.std(ret, ddof=1))                        # per-bar return vol
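
The helper `_ols_through_origin` is not shown in the excerpt. One plausible form, sketched here under the assumption that it returns the no-intercept slope together with an uncentred \(R^2\) (which always lies in \([0,1]\) for a through-origin fit), is checked below on synthetic data with a known square-root coefficient. The data-generating values are illustrative.

```python
import numpy as np

def ols_through_origin(x, y):
    """Slope of y = b*x with no intercept, plus the uncentred R^2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sxx = float(x @ x)
    if sxx == 0.0:
        return 0.0, 0.0                               # degenerate regressor
    slope = float(x @ y) / sxx
    resid = y - slope * x
    syy = float(y @ y)
    r2 = 1.0 - float(resid @ resid) / syy if syy > 0 else 0.0
    return slope, r2

# Synthetic check: returns generated from a square-root law plus noise.
rng = np.random.default_rng(42)
n = 5000
part = rng.uniform(0.001, 0.05, n)                    # participation rate rho
sign = rng.choice([-1.0, 1.0], n)                     # trade direction sign(Q)
true_eta = 0.02
ret = true_eta * sign * np.sqrt(part) + 5e-4 * rng.standard_normal(n)
eta_hat, r2 = ols_through_origin(sign * np.sqrt(part), ret)
```

The fit recovers the planted coefficient closely because the regressor is signed: without the sign, buys and sells would cancel and the slope would collapse toward zero.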
  

%==========%


VIII. The RL Execution Environment (env/execution_env.py):

The environment is a gymnasium.Env liquidating \(X\) shares over \(N\) slices. The state is \((\text{shares remaining}, \text{time remaining}, \text{recent momentum}, \text{volume ratio})\); the action is a participation fraction of the remaining inventory; the reward is the negative of the sum of the per-slice implementation-shortfall cost and an inventory-risk penalty, so maximising return minimises risk-adjusted shortfall. The crucial modelling choice is that temporary impact scales inversely with available liquidity: trading at the same rate into a thin slice moves the price more. This is the exploitable non-stationarity: a smart policy concentrates trading in deep-liquidity slices (the open and close of the U-shaped volume profile) and slows into thin midday liquidity, which a static schedule blind to realised volume cannot do. Each episode randomises the volatility regime, a volume multiplier, and a persistent intraday drift, so the agent faces genuinely non-stationary conditions.


# execution_env.py — liquidity-dependent temporary impact
vol_ratio = self._vol_profile[min(self.t, c.N - 1)] * c.N * self._vol_mult
perm = p.gamma * shares                                   # permanent: shifts the mid
temp = p.epsilon + p.eta * rate / max(vol_ratio, 0.25)    # temporary: thin => costlier
exec_price = self.price - temp                            # selling: receive less than mid
self.shortfall_cost += (self.arrival - exec_price) * shares
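
To make the liquidity effect concrete, the same arithmetic can be pulled into a standalone function. This is a sketch only: `slice_shortfall` is a hypothetical name, the parameter values are illustrative, and applying the permanent impact to the mid after the fill is an assumption about ordering that the excerpt does not show.

```python
def slice_shortfall(mid, arrival, shares, tau, vol_ratio,
                    eta=2.5e-6, epsilon=0.0625, gamma=2.5e-7, floor=0.25):
    """Cost of selling `shares` in one slice, and the mid price after the trade."""
    rate = shares / tau                               # trading rate v = n / tau
    # Temporary impact grows as relative liquidity shrinks; the floor caps the blow-up.
    temp = epsilon + eta * rate / max(vol_ratio, floor)
    exec_price = mid - temp                           # seller receives less than mid
    cost = (arrival - exec_price) * shares            # shortfall versus arrival price
    new_mid = mid - gamma * shares                    # assumed: permanent impact applied after the fill
    return cost, new_mid

# Same order, same price, different liquidity.
deep_cost, _ = slice_shortfall(100.0, 100.0, 50_000, 1.0, vol_ratio=2.0)
thin_cost, _ = slice_shortfall(100.0, 100.0, 50_000, 1.0, vol_ratio=0.5)
```

With these inputs the thin slice costs two and a half times as much as the deep one for an identical order, which is the gap an adaptive policy can exploit and a static schedule cannot.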
  

%==========%


IX. The PPO Agent (rl/ppo.py):

The agent is a compact Proximal Policy Optimisation implementation written from scratch in PyTorch — an actor-critic with a shared trunk, a Gaussian policy over the participation fraction (sigmoid-squashed mean), generalised advantage estimation, and the clipped surrogate objective. No stable-baselines3 dependency: the loop collects rollouts, computes GAE advantages, and runs several clipped epochs per update, sized so training finishes in a few seconds on CPU and is deterministic under a fixed seed. The clipped objective is what keeps PPO stable — it bounds how far the policy can move in a single update, preventing the destructive large steps that plague vanilla policy gradients.

\[ L^{\text{clip}}(\theta) = \mathbb{E}\Big[\min\big(r_t(\theta)\,\hat A_t,\; \mathrm{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon)\,\hat A_t\big)\Big], \qquad r_t(\theta) = \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)}. \]

# ppo.py — GAE advantages and the clipped update
adv = np.zeros_like(R); last = 0.0
for t in reversed(range(len(R))):
    next_v = 0.0 if t == len(R) - 1 else V[t + 1]
    delta = R[t] + gamma * next_v * (1 - D[t]) - V[t]
    last = delta + gamma * gae_lambda * (1 - D[t]) * last
    adv[t] = last
...
ratio = torch.exp(lp - LP[b])
surr = torch.min(ratio * adv_b, torch.clamp(ratio, 1 - clip, 1 + clip) * adv_b)
loss = -surr.mean() + value_coef * ((val - ret_b) ** 2).mean()
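
The two pieces of the excerpt can be exercised in isolation on tiny inputs. The NumPy sketch below mirrors the backward GAE recursion and the per-sample clipped objective; the function names and the `last_value` bootstrap argument are assumptions added for illustration, and NumPy stands in for PyTorch so the example runs without a network.

```python
import numpy as np

def gae(rewards, values, dones, gamma=0.99, lam=0.95, last_value=0.0):
    """Generalised advantage estimates for one rollout (backward recursion)."""
    adv = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_v = last_value if t == len(rewards) - 1 else values[t + 1]
        nonterminal = 1.0 - dones[t]                  # cut the recursion at episode ends
        delta = rewards[t] + gamma * next_v * nonterminal - values[t]
        running = delta + gamma * lam * nonterminal * running
        adv[t] = running
    return adv

def clipped_surrogate(ratio, adv, clip=0.2):
    """Per-sample PPO objective: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    return np.minimum(ratio * adv, np.clip(ratio, 1 - clip, 1 + clip) * adv)

# Two 2-step episodes back to back, zero value estimates, no discounting.
adv = gae(np.ones(4), np.zeros(4), np.array([0.0, 1.0, 0.0, 1.0]), gamma=1.0, lam=1.0)

# Ratios outside [0.8, 1.2] are clipped only when clipping lowers the objective.
obj = clipped_surrogate(np.array([1.5, 0.5, 1.5, 0.5]),
                        np.array([1.0, -1.0, -1.0, 1.0]))
```

With zero values and no discounting the advantages reduce to the reward-to-go within each episode, and the `dones` flags stop credit from leaking across the episode boundary. In the surrogate, the `min` keeps the unclipped term whenever it is the more pessimistic one, so the policy gains nothing by moving the ratio further in a direction that already looks good.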
  

%==========%


X. Results — PPO vs AC vs TWAP vs VWAP:

Rolling the trained policy and the three static schedules through the same non-stationary environment on a held-out set of seeds, the PPO agent attains materially lower shortfall standard deviation and tail risk (95% CVaR) than Almgren-Chriss, TWAP, and VWAP, at comparable mean cost — a superior point on the cost-risk frontier. The gain does not come from forecasting the market better; it comes from execution: the agent concentrates trading in the deep-liquidity open and close slices, slows into thin midday liquidity, and manages inventory risk dynamically, exactly the state-dependent behaviour a constant-parameter schedule cannot express. The untrained network, by contrast, posts roughly three times the shortfall — the gap is learned.

| Method | Mean shortfall | Std (risk) | 95% CVaR (tail) |
| --- | --- | --- | --- |
| PPO (learned) | Comparable | Lowest | Lowest |
| Almgren-Chriss | Comparable | Higher | Higher |
| TWAP | Comparable | Higher | Higher |
| VWAP | Comparable | Higher | Higher |

The result is reproducible from the package: execution evaluate --updates 80 trains the agent and prints the mean / std / CVaR table; the deterministic test test_ppo_learns_and_beats_baselines_on_risk asserts the trained policy improves over the untrained net and beats TWAP on both shortfall std and CVaR.
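
For reference, the three reported statistics can be computed from a vector of per-episode shortfalls as sketched below. The function name `shortfall_summary` is hypothetical, and the CVaR convention used here (mean of the outcomes at or above the 95th percentile, with shortfall measured as a cost so larger is worse) is an assumption; the package may define the tail differently.

```python
import numpy as np

def shortfall_summary(shortfall):
    """Mean, sample std, and 95% CVaR of per-episode shortfall costs."""
    s = np.asarray(shortfall, dtype=float)
    var95 = np.quantile(s, 0.95)                      # 95% value-at-risk
    tail = s[s >= var95]                              # worst 5% of outcomes
    return {"mean": float(s.mean()),
            "std": float(s.std(ddof=1)),
            "cvar95": float(tail.mean())}

# Toy input: costs 1..100, so the worst 5% are 96..100.
stats = shortfall_summary(np.arange(1, 101))
```

CVaR averages over the whole tail rather than reading a single quantile, so it separates two policies with the same 95th percentile but different worst-case days.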


%==========%


XI. Interpreting the Learned Policy (rl/evaluate.py):

A learned policy is only trustworthy if its behaviour is legible. Mapping the agent's participation fraction over the state space — at moderate inventory, across time-remaining and momentum — reveals the economics it discovered. Participation is highest in the deep-liquidity slices at the open and close and rises further when momentum is adverse (the price falling during a sale), so the agent accelerates to avoid drifting further offside; it slows into thin midday liquidity where the same rate would move the price more. These are precisely the reactions to non-stationary volume and momentum that the constant-parameter Almgren-Chriss schedule is structurally unable to make, and they are the source of the risk reduction in section X.


# evaluate.py — read the policy over (time remaining, momentum)
for i, mom in enumerate(momentum):
    for j, tr in enumerate(time_frac):
        obs = np.array([0.5, tr, mom, 1.0], dtype=np.float32)   # 50% inventory
        with torch.no_grad():
            mean, _, _ = model(torch.as_tensor(obs))
        grid[i, j] = float(torch.clamp(mean, 0, 1).item())      # participation
  

%==========%


XII. CLI — cli.py:

Six subcommands cover the pipeline from intraday ingestion through calibration to training and evaluation, all sharing the DuckDB store.


# Install
pip install -e ".[dev]"

# Intraday 5-minute bars from Yahoo (synthetic fallback) into DuckDB
execution fetch --ticker SPY

# Almgren-Chriss efficient frontier across risk aversions
execution frontier --shares 100000

# Calibrate temporary/permanent impact from stored bars (sqrt or linear law)
execution calibrate --ticker SPY --model sqrt

# Train the in-house PPO execution agent and save the policy
execution train --updates 80

# Train and evaluate PPO vs AC, TWAP, VWAP on held-out days
execution evaluate --updates 80

# Launch the Streamlit server-side dashboard
execution dashboard
  
| Command | Key options | Output |
| --- | --- | --- |
| execution fetch | --ticker, --period, --db | Intraday OHLCV bars to DuckDB |
| execution frontier | --shares, --n-slices, --sigma | Cost, std, \(\kappa\) by risk aversion |
| execution calibrate | --ticker, --model | \(\eta, \gamma, \epsilon, \sigma\), \(R^2\); stored to DuckDB |
| execution train | --updates, --out | Trained PPO policy saved to disk |
| execution evaluate | --updates, --episodes | Mean / std / CVaR: PPO vs AC vs TWAP vs VWAP |
| execution dashboard | | Launches streamlit run app.py |

%==========%


XIII. Test Suite:

All 16 tests are offline and deterministic (seed 42). The Almgren-Chriss tests pin the closed form — the \(\lambda \to 0\) TWAP limit, that higher risk aversion front-loads (faster holdings decay, larger first trade), the boundary conditions \(x_0 = X, x_N = 0\), the \(\cosh(\kappa\tau)\) identity, and a monotone cost-vs-variance frontier. The impact tests check that calibration returns positive, sensible coefficients with \(R^2 \in [0,1]\) under both laws and that the mapping to Almgren-Chriss parameters stays positive. The environment tests verify the gym API and observation bounds, full liquidation with a reported implementation shortfall, schedule determinism under a fixed seed, that the static schedules sum to one, and that a risk-averse AC schedule lowers shortfall variance versus TWAP. The RL test trains the PPO agent and asserts it improves dramatically over an untrained network and beats TWAP on both shortfall standard deviation and 95% CVaR.


# test_rl.py — the agent learns and cuts tail risk
def test_ppo_learns_and_beats_baselines_on_risk():
    cfg = EnvConfig()
    pre = np.nanmean(rollout_policy(_build(4, 1, 64), cfg, n_episodes=200, seed=5000))
    model = train_ppo(cfg, PPOConfig(total_updates=80, steps_per_update=1024, seed=42))
    res = compare_methods(model, cfg, n_episodes=300)
    assert res["PPO"]["mean"] < 0.6 * pre                 # learned, not random
    assert res["PPO"]["std"] < res["TWAP"]["std"]         # lower risk
    assert res["PPO"]["cvar95"] < res["TWAP"]["cvar95"]   # lower tail risk
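
One of the environment checks listed above, that the static schedules sum to one, can be sketched independently of the package. The weight functions below are hypothetical stand-ins for whatever schedulers.py exposes, and the volume profile is an assumed parabola.

```python
import numpy as np

def twap_weights(n):
    """Equal fraction of the order in every slice."""
    return np.full(n, 1.0 / n)

def vwap_weights(volume_profile):
    """Fractions proportional to expected volume in each slice."""
    v = np.asarray(volume_profile, dtype=float)
    return v / v.sum()

def ac_weights(n, kappa, T=1.0):
    """Fractions from the sinh holdings curve; kappa <= 0 falls back to TWAP."""
    if kappa <= 0:
        return twap_weights(n)
    t = np.linspace(0.0, T, n + 1)
    x = np.sinh(kappa * (T - t)) / np.sinh(kappa * T)  # holdings fraction, 1 down to 0
    return -np.diff(x)

n = 10
profile = 0.5 + 1.5 * np.linspace(-1.0, 1.0, n) ** 2   # assumed U-shape
schedules = {"TWAP": twap_weights(n),
             "VWAP": vwap_weights(profile),
             "AC": ac_weights(n, kappa=3.0)}
totals = {name: float(w.sum()) for name, w in schedules.items()}
```

Every total equals one up to rounding, and the AC weights decrease from the first slice to the last, which is the front-loading the closed form predicts.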