Multi-Agent Reinforcement Learning Market Simulation:
Real markets are populated by heterogeneous agents — market makers, momentum traders, fundamental value investors, and noise traders — whose strategic interaction generates price dynamics far richer than any single-agent model. This project builds a continuous double-auction order book in pure Python and lets four agent types trade into it simultaneously, each trained with Independent PPO against its own reward (the market maker earns the spread and hates inventory; the momentum trader chases trend; the value investor mean-reverts to a drifting fair value; the noise trader is random). The point is emergence: the stylised facts of real markets — fat-tailed returns, volatility clustering, autocorrelation in absolute returns — are not hand-coded but arise endogenously from agent interaction. The simulator then doubles as a microstructure laboratory: pulling every market maker mid-run produces a flash crash whose recovery can be timed, and a short-selling ban or Tobin tax can be switched on to measure the effect on liquidity and price-discovery speed — all without a single byte of real market data. Built on Python 3.11+ with an in-house PyTorch IPPO (no stable-baselines3), gymnasium, numpy, pandas, scipy, plotly, streamlit, duckdb, pydantic v2 and typer; packaged with hatchling and tested with pytest against deterministic seed-42 fixtures.
%==========%
I. Interactive Dashboard:
The dashboard below runs entirely in the browser via stlite (Streamlit on WebAssembly — no server). PyTorch and gymnasium cannot run in Pyodide, so the in-browser demo runs a pure-NumPy version of the same order book with fixed-rule agents (no training) — it still produces the emergent stylised facts and runs the flash-crash and regulatory toggles live. The full project trains the agents with Independent PPO offline. First load downloads Pyodide and may take 20–40 seconds.
%==========%
II. Project Layout:
marl-market/
├── pyproject.toml # Build config, deps, ruff + pytest settings
├── .env.example # Optional FRED / Yahoo keys for fair-value calibration
├── dashboard.html # Self-contained stlite demo (numpy fixed-rule stand-in)
├── scripts/
│ ├── make_thumbnail.py # Real matplotlib thumbnail (emergent price + fat tails)
│ └── run_demo.py # One-shot end-to-end demo
├── src/marl_market/
│ ├── data/
│ │ ├── synthetic.py # Fair-value process; seed-42 config
│ │ └── schemas.py # Pydantic v2 simulation records
│ ├── env/
│ │ ├── order_book.py # Continuous double auction (price-time priority, partial fills)
│ │ └── market_env.py # Multi-agent env + scenario hooks
│ ├── agents/
│ │ ├── base.py # Four agent types (fixed-rule + policy behaviours)
│ │ └── policies.py # Tiny actor-critic networks
│ ├── train/
│ │ ├── ippo.py # In-house mini-batch Independent PPO
│ │ └── evaluate.py # Held-out-seed evaluation
│ ├── analysis/
│ │ ├── stylized_facts.py # Kurtosis, ACF of |returns|, spread vs population
│ │ └── scenarios.py # Flash crash, short-ban, Tobin tax
│ ├── report/plots.py # Plotly: LOB depth, return dist, scenario panels
│ ├── cli.py # Typer CLI: simulate | train | scenario | dashboard
│ └── app.py # Streamlit server-side dashboard
└── tests/ # Seed-42 fixtures; order book, stylised facts, scenarios, IPPO
%==========%
III. The Continuous Double-Auction Order Book (env/order_book.py):
The shared environment is a pure-Python continuous double auction with strict price-time priority and partial fills. Prices are integer ticks so equality is exact; a marketable order walks the opposite side from the best price outward and, within a level, fills oldest-to-newest (FIFO). The book carries no randomness of its own — all stochasticity lives in the agents — so it is fully deterministic given a fixed order sequence, which is what makes the emergent statistics reproducible.
def _match(self, incoming, opp, take_if):
    trades = []
    buy_side = incoming.side == "buy"
    while incoming.qty > 0:
        prices = [p for p, q in opp.items() if q]  # skip emptied levels
        if not prices:
            break
        # price-time priority: best price first
        best = min(prices) if buy_side else max(prices)
        if not take_if(best):  # best price no longer acceptable to the incoming order
            break
        level = opp[best]
        while incoming.qty > 0 and level:
            resting = level[0]  # FIFO within a price level
            fill = min(incoming.qty, resting.qty)  # partial fills
            incoming.qty -= fill
            resting.qty -= fill
            trades.append(Trade(best, fill, ...))
            if resting.qty == 0:
                level.popleft()
    return trades
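To run the matching logic in isolation, the following is a minimal, self-contained sketch of the same idea. The `Order`, `Trade` and `MiniBook` names, their fields, and the inlined limit-price check are assumptions made for illustration; they are not taken from the project's `order_book.py`.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Order:
    side: str      # "buy" or "sell"
    qty: int
    price: int     # integer ticks, so price equality is exact
    agent_id: int


@dataclass
class Trade:
    price: int
    qty: int
    buy_agent: int
    sell_agent: int


class MiniBook:
    """Illustrative continuous double auction with price-time priority."""

    def __init__(self):
        self.bids = {}  # price -> deque of resting buy orders, oldest first
        self.asks = {}  # price -> deque of resting sell orders, oldest first

    def submit(self, order):
        buying = order.side == "buy"
        opp = self.asks if buying else self.bids
        own = self.bids if buying else self.asks
        trades = []
        while order.qty > 0:
            prices = [p for p, level in opp.items() if level]
            if not prices:
                break
            best = min(prices) if buying else max(prices)
            # Stop once the best opposite price no longer crosses the limit.
            if (best > order.price) if buying else (best < order.price):
                break
            level = opp[best]
            while order.qty > 0 and level:
                resting = level[0]                  # oldest order at this price
                fill = min(order.qty, resting.qty)  # partial fill if sizes differ
                order.qty -= fill
                resting.qty -= fill
                buyer, seller = (order, resting) if buying else (resting, order)
                trades.append(Trade(best, fill, buyer.agent_id, seller.agent_id))
                if resting.qty == 0:
                    level.popleft()
        if order.qty > 0:
            # Any unfilled remainder rests at the back of its own price level.
            own.setdefault(order.price, deque()).append(order)
        return trades


book = MiniBook()
book.submit(Order("sell", 5, 100, agent_id=1))  # rests first
book.submit(Order("sell", 5, 100, agent_id=2))  # rests behind
trades = book.submit(Order("buy", 7, 100, agent_id=3))
print([(t.sell_agent, t.qty) for t in trades])
```

Because the sketch draws no random numbers, replaying the same three orders always yields the same trades, which is the determinism property described above.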
%==========%
IV. Four Heterogeneous Agent Types (agents/base.py):
Each agent type observes the book and acts under a distinct reward, so their strategies genuinely conflict:
| Agent | Behaviour | Reward |
|---|---|---|
| Market maker | Quotes both sides around the mid | Earns the spread; penalised on inventory (adverse selection) |
| Momentum | Buys rising / sells falling | Profits from trend continuation |
| Fundamental / value | Mean-reverts toward a drifting fair value | Profits when price returns to fair value |
| Noise | Random marketable / limit orders | Provides exogenous order flow |
The interaction is what matters: momentum traders amplify moves, value traders dampen them, market makers supply liquidity and absorb imbalance, and noise traders inject the order flow that keeps the book alive. No single agent is told to produce fat tails or clustering — those are joint properties of the population.
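The conflicting behaviours in the table can be sketched as four small decision rules plus a market-maker reward. Every function name, parameter and default below (`band`, `half_spread`, `skew`, `penalty`, and the quadratic form of the inventory penalty) is a hypothetical choice for illustration, not the contents of `agents/base.py`.

```python
import random


def momentum_side(mid_now, mid_prev):
    """Chase the trend: buy after a rise, sell after a fall, otherwise wait."""
    if mid_now > mid_prev:
        return "buy"
    if mid_now < mid_prev:
        return "sell"
    return None


def value_side(mid_now, fair_value, band=1):
    """Mean-revert: buy below fair value, sell above it, wait inside the band."""
    if mid_now < fair_value - band:
        return "buy"
    if mid_now > fair_value + band:
        return "sell"
    return None


def noise_side(rng):
    """Pick a side at random; this is the exogenous order flow."""
    return rng.choice(["buy", "sell"])


def maker_quotes(mid, inventory=0, half_spread=2, skew=0.1):
    """Quote both sides around the mid, shifted down when long and up when short."""
    shift = round(skew * inventory)
    return mid - half_spread - shift, mid + half_spread - shift


def maker_reward(spread_pnl, inventory, penalty=0.01):
    """Spread earnings minus a quadratic inventory penalty (assumed form)."""
    return spread_pnl - penalty * inventory ** 2


rng = random.Random(42)
print(momentum_side(101, 100), value_side(105, 100), noise_side(rng))
print(maker_quotes(100), maker_quotes(100, inventory=20))
```

With the price at 105 after a rise from 104 and fair value at 100, the momentum rule buys while the value rule sells, which is the amplify-versus-dampen conflict in miniature.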
%==========%
V. Independent PPO (train/ippo.py):
Each agent type carries its own tiny actor-critic and is trained with Independent PPO (IPPO): from one agent’s perspective the other agents are part of a non-stationary environment, and each policy is optimised against its own clipped surrogate objective with mini-batch updates. Convergence is monitored at the market level — spread, volatility, kurtosis — not just individual reward, because in a multi-agent system rising individual rewards can coincide with a degrading market. No stable-baselines3 is used; the PPO loop is written from scratch in PyTorch.
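The clipped surrogate objective mentioned above is the standard PPO quantity \(L^{\text{CLIP}} = \mathbb{E}\big[\min\big(r_t A_t,\ \operatorname{clip}(r_t, 1-\epsilon, 1+\epsilon)\,A_t\big)\big]\), where \(r_t\) is the ratio of new to old action probabilities and \(A_t\) is the advantage. A minimal sketch of that calculation follows, written in plain Python rather than PyTorch so it runs without dependencies. The function name and the default `clip_eps=0.2` are assumptions; the project's actual hyperparameters are not stated in this write-up.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean PPO clipped objective over one mini-batch (to be maximised)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the pessimistic (smaller) of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# "Independent" means each agent type evaluates this on its own batch only.
batches = {
    "market_maker": ([0.5, -0.1], [0.0, 0.0], [1.0, -0.5]),
    "momentum": ([0.0, 0.2], [0.0, 0.0], [0.3, 0.3]),
}
objectives = {name: clipped_surrogate(*batch) for name, batch in batches.items()}
print(objectives)
```

In the sketch, nothing is shared between the two entries of `batches`: each policy sees the other agents only through the data it collected, which is why they appear as a non-stationary environment.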
%==========%
VI. Emergent Stylised Facts (analysis/stylized_facts.py):
With the default population (2 market makers, 4 momentum, 3 value, 4 noise), the market-level statistics reproduce the canonical stylised facts of real returns — none of which were programmed in:
| Stylised fact | Measured | Interpretation |
|---|---|---|
| Fat tails (excess kurtosis of mid returns) | ≈ 20–30 | Gaussian = 0; heavy tails emerge |
| Volatility clustering (ACF of \(\lvert r\rvert\), lags 1→5) | ≈ 0.14 → 0.07 | Positive, slowly decaying |
| Spread vs liquidity (1 → 4 market makers) | ≈ 4.0 → 2.6 ticks | More makers ⇒ tighter spread |
A genuine modelling tension is reported rather than hidden: the same momentum-cascade mechanism that produces the fat tails also lifts the raw-return autocorrelation to \(\approx 0.3\), higher than real markets (which are close to zero) — an artefact of the thin, discrete-tick book. Naming that limitation is part of the honest assessment: the model reproduces the hard-to-fake stylised facts but is not a perfect microstructure replica.
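The statistics in the table can be computed from a return series with a few lines of arithmetic. The sketch below uses only the standard library, and its function names are illustrative; the project lists SciPy among its dependencies and may compute these differently in `stylized_facts.py`. The toy series is hand-made to show the calculation and is not simulator output.

```python
def excess_kurtosis(returns):
    """Population excess kurtosis: fourth moment over variance squared, minus 3."""
    n = len(returns)
    mean = sum(returns) / n
    m2 = sum((r - mean) ** 2 for r in returns) / n
    m4 = sum((r - mean) ** 4 for r in returns) / n
    return m4 / (m2 ** 2) - 3.0


def autocorr(series, lag):
    """Sample autocorrelation at a given lag (0.0 for a constant series)."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        return 0.0
    cov = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return cov / var


def abs_return_acf(returns, lags=(1, 2, 3, 4, 5)):
    """Autocorrelation of |r|: positive values indicate volatility clustering."""
    abs_r = [abs(r) for r in returns]
    return [autocorr(abs_r, k) for k in lags]


# Toy series: mostly quiet, with two large moves back to back.
toy = [0.0] * 98 + [10.0, -10.0]
print(excess_kurtosis(toy))        # 47.0
print(abs_return_acf(toy)[0] > 0)  # True: the large moves are adjacent
print(autocorr(toy, 1) < 0)        # True: the raw returns reverse sign
```

The toy series also shows why raw-return and absolute-return autocorrelation are reported separately: the same data can have clustered magnitudes while the signed returns are negatively correlated.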
%==========%
VII. Flash Crash (analysis/scenarios.py):
The stress test removes all market-maker agents mid-simulation: the liquidity providers vanish and the momentum traders are left to feed on each other. The result is a textbook flash crash, with the bid–ask spread blowing up roughly 17× (from \(\approx 3\) to \(\approx 52\) ticks) as the book empties. Notably, the value traders arrest the price slide quickly by stepping in at cheap prices, so the price drawdown stays contained (on the order of \(0.2\)–\(1\%\)) even as liquidity collapses. This is a clean illustration that a flash crash is a liquidity event, not necessarily a fundamental one.
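One way to quantify the spread blow-up and time the recovery is sketched below. The function name, the `tolerance` threshold and the definition of "recovered" (the spread falling back to within `tolerance` times its pre-shock mean) are all assumptions, and the spread series is a toy example rather than simulator output.

```python
def flash_crash_metrics(spreads, shock_step, tolerance=1.5):
    """Summarise a spread series around a liquidity shock at `shock_step`."""
    pre = spreads[:shock_step]
    post = spreads[shock_step:]
    baseline = sum(pre) / len(pre)
    crash_mean = sum(post) / len(post)
    peak = max(post)
    # Recovery: first step at or after the peak where the spread is back
    # within `tolerance` times the pre-shock baseline (None if it never is).
    recovery_steps = None
    for i in range(post.index(peak), len(post)):
        if post[i] <= tolerance * baseline:
            recovery_steps = i
            break
    return {
        "baseline": baseline,
        "peak": peak,
        "blow_up": crash_mean / baseline,
        "recovery_steps": recovery_steps,
    }


toy_spreads = [3, 3, 3, 3, 20, 52, 30, 10, 4, 3]  # shock at step 4
metrics = flash_crash_metrics(toy_spreads, shock_step=4)
print(metrics)
```

Here `recovery_steps` counts steps elapsed since the shock, so it can be compared across runs with different shock times.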
%==========%
VIII. Regulatory Interventions (analysis/scenarios.py):
The simulator measures the market-quality effect of two classic interventions:
| Intervention | Liquidity | Price discovery (tracking error to fair value) |
|---|---|---|
| Short-selling ban | Volume −98% | Worse: 40 → 65 |
| Tobin (transaction) tax | Volume −7% | Mild drag |
The short-selling ban is a clean unintended-consequence story: by forbidding value traders from selling short when the asset is overvalued, it nearly halts trading and worsens price discovery rather than stabilising it. The Tobin tax is a gentler drag on volume. Reporting that a well-intentioned rule degrades the very efficiency it targets is exactly the kind of honest, counter-intuitive finding an agent-based laboratory is built to surface.
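The two interventions and the price-discovery metric can be sketched as three small functions. The write-up does not define tracking error precisely, so the mean absolute deviation used here is an assumption, as are the function names and the `rate` default.

```python
def tracking_error(mids, fair_values):
    """Mean absolute gap between the traded mid and fair value (assumed metric)."""
    return sum(abs(m - f) for m, f in zip(mids, fair_values)) / len(mids)


def allowed_under_short_ban(side, qty, position):
    """Buys always pass; a sell passes only if it keeps the position at or above zero."""
    return side == "buy" or qty <= position


def tobin_tax(price, qty, rate=0.001):
    """Transaction tax charged on the traded notional."""
    return rate * price * qty


print(tracking_error([100, 102, 98], [100, 100, 100]))
print(allowed_under_short_ban("sell", qty=5, position=3))  # False: would go short
print(tobin_tax(price=100, qty=10))
```

Under the ban, a value trader holding no inventory cannot sell into an overpriced market at all, which is the mechanism behind the collapse in volume reported in the table.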
%==========%
IX. MARL vs Fixed Rules — What Training Actually Buys:
Does learning help? Training only the market maker with IPPO against the fixed ecology cuts its adverse-selection losses by 30–40% (mark-to-market \(\approx -91\text{k} \to -63\text{k}\)) while the emergent fat tails persist — the learned maker quotes more defensively and manages inventory better. But the candid headline is the opposite experiment: letting all four types learn selfishly degrades price discovery dramatically (tracking error \(\approx 30 \to 1200\)), because price discovery is a positive externality that no self-interested learner is paid to provide. This is the deepest lesson of the project and it is surfaced, not buried: MARL reproduces market microstructure faithfully, but optimising every agent’s private reward does not optimise market quality — a direct, simulated illustration of why well-functioning markets depend on agents (or rules) that internalise the public good of liquidity and discovery.
%==========%
X. CLI — cli.py:
# Install
pip install -e ".[dev]"
# Simulate the fixed-rule ecology and print the emergent stylised facts
marlsim simulate
# Train the market maker with Independent PPO against the fixed ecology (or --everyone to let all four types learn)
marlsim train
marlsim train --everyone
# Run a scenario: flash crash, short-selling ban, Tobin tax, or a side-by-side compare
marlsim scenario --kind flash_crash
marlsim scenario --kind compare
# Launch the server-side Streamlit dashboard
streamlit run src/marl_market/app.py
| Command | Key options | Output |
|---|---|---|
| marlsim simulate | population mix, --steps | Kurtosis, ACF of \(\lvert r\rvert\), mean spread |
| marlsim scenario | --kind flash_crash \| short_ban \| tobin_tax \| compare | Spread blow-up, liquidity & discovery effects |
| marlsim train | --everyone | Learned-MM PnL vs fixed; market-level convergence |
%==========%
XI. Test Suite:
Twenty-four tests, fully offline, seed-42. Order-book tests verify price-time priority, partial fills, crossing-order matching, FIFO within a level, and order cancellation. Stylised-fact tests confirm the metrics are finite and in sensible ranges, that returns are genuinely fat-tailed and clustered, and that the spread tightens as market makers are added. Scenario tests confirm the flash crash blows the spread up and that the interventions move liquidity and discovery in the documented directions. IPPO smoke tests confirm that training runs, that policy entropy falls as the policies sharpen, and that the learned market maker beats the fixed one on held-out seeds.
def test_price_time_priority_and_partial_fill():
    book = OrderBook()
    book.submit(Order("sell", 5, price=100, agent_id=1))  # rests first
    book.submit(Order("sell", 5, price=100, agent_id=2))  # rests behind (FIFO)
    trades = book.submit(Order("buy", 7, price=100, agent_id=3))
    assert trades[0].sell_agent == 1 and trades[0].qty == 5  # oldest filled first
    assert trades[1].sell_agent == 2 and trades[1].qty == 2  # partial fill of the next


def test_flash_crash_blows_out_spread(sim):
    base, crash = run_flash_crash(sim)
    assert crash.mean_spread > 5 * base.mean_spread  # liquidity collapses
%==========%
XII. Configuration & Setup:
cd assets/projects/marl_market
python -m venv .venv
.venv\Scripts\Activate.ps1 # Windows PowerShell
pip install -e ".[dev]"
marlsim simulate # emergent stylised facts
marlsim scenario --kind compare # flash crash + interventions
pytest -q # 24 tests, offline
streamlit run src/marl_market/app.py
No data download is required: the environment is fully synthetic and offline. The optional fair-value calibration can be driven from FRED / Yahoo Finance series; LOBSTER data can be used to validate that the simulated depth and spread dynamics match real market statistics.
Team:
Theodosios Dimitrasopoulos, personal project.
Tools & methods:
Python 3.11+, PyTorch (in-house Independent PPO actor-critic), gymnasium, NumPy, SciPy, pandas, Pydantic v2, DuckDB, Typer, rich, Plotly, Streamlit, pytest, ruff, hatchling. Methods: agent-based / heterogeneous-agent market models; the continuous double auction with price-time priority; multi-agent reinforcement learning with Independent PPO (independent learners, each running PPO; Schulman et al. 2017); emergent stylised facts of financial returns (fat tails, volatility clustering, absolute-return autocorrelation); flash-crash and liquidity-shock analysis; regulatory policy experiments (short-selling bans, Tobin taxes) and price-discovery measurement.