Optimal Execution with Reinforcement Learning (DQN)

Execution-policy learning for large buy orders using a custom limit-order-book simulator, discrete sizing choices, and DQN optimization against baselines.

Overview

This project addresses a key trading system task: executing a large parent order with minimal cost and controlled market impact. We model the problem as a sequential decision process and train a DQN agent to pick the execution quantity at each step.

The implementation uses an in-house limit-order-book simulator and includes benchmarks against TWAP, passive, aggressive, and random execution strategies. Success is measured as lower implementation shortfall, stable order completion, and reduced cost variance.

Why Optimal Execution Is Hard (Even Before Alpha)

Execution is not the same as prediction. Even with zero alpha, poor execution can destroy realized PnL via slippage, impact, and adverse queue interaction. The challenge is path-dependent: when you trade now, you alter future state (book depth, spread, and available liquidity), which in turn changes future costs.

Classical schedules (TWAP/VWAP variants) are robust and interpretable, but static. RL makes sense here because the environment is naturally sequential, cost signals are delayed, and decisions depend jointly on inventory remaining, time remaining, and local microstructure state.

Environment Design (What the Agent Actually Sees)

The custom `OptimalExecEnv` follows Gymnasium conventions and models a buy program with `(total_shares=20000, total_time_steps=1800, q_min=20)` by default. The action space is discrete with five actions: `{0, 1*q_min, 2*q_min, 3*q_min, 4*q_min}`.

State representation is a compact 12-dimensional vector: percentage inventory remaining, percentage time elapsed, 5-level cumulative volume imbalances, normalized best bid, normalized best ask, normalized spread, normalized mid-price, and normalized total displayed liquidity. This is intentionally small enough for stable DQN training while still encoding time-pressure and microstructure pressure.
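As a concrete sketch of how such a 12-dimensional observation could be assembled (function name, argument layout, and the `ref_depth` normalizer are illustrative assumptions, not the repo's actual code):

```python
import numpy as np

def build_observation(inv_remaining, total_shares, t, horizon,
                      bids, asks, bid_sizes, ask_sizes,
                      arrival_mid, ref_depth=1000.0):
    """Hypothetical 12-dim state: inventory %, time %, 5 cumulative
    imbalances, best bid/ask, spread, mid, and total displayed liquidity."""
    cum_bid = np.cumsum(bid_sizes)                      # cumulative depth, levels 1..5
    cum_ask = np.cumsum(ask_sizes)
    imbalance = (cum_bid - cum_ask) / (cum_bid + cum_ask + 1e-9)  # 5 features
    return np.concatenate([
        [inv_remaining / total_shares],                 # % inventory remaining
        [t / horizon],                                  # % time elapsed
        imbalance,                                      # 5-level cumulative imbalance
        [bids[0] / arrival_mid],                        # normalized best bid
        [asks[0] / arrival_mid],                        # normalized best ask
        [(asks[0] - bids[0]) / arrival_mid],            # normalized spread
        [0.5 * (asks[0] + bids[0]) / arrival_mid],      # normalized mid-price
        [(cum_bid[-1] + cum_ask[-1]) / ref_depth],      # normalized displayed liquidity
    ])
```

Normalizing prices by the arrival mid keeps the feature scale stable across episodes, which matters for the fixed-architecture Q-network.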

A key implementation detail is inventory clipping in `_get_quantity_from_action`: the chosen size is bounded by remaining shares, which avoids impossible actions near episode end and makes terminal behavior numerically stable.
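The clipping logic is small enough to show in full; a minimal sketch mirroring the behavior described above (the function and constant names are illustrative):

```python
Q_MIN = 20  # minimum clip size from the environment config

def get_quantity_from_action(action: int, remaining_shares: int) -> int:
    """Map a discrete action in {0..4} to a share quantity, clipped to
    remaining inventory so terminal steps cannot over-execute."""
    desired = action * Q_MIN          # 0, 20, 40, 60, or 80 shares
    return min(desired, remaining_shares)
```

Without the `min`, the final steps of an episode could demand more shares than remain, forcing ad-hoc error handling in the fill logic.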

Market Simulator and Impact Mechanics

The simulator maintains a 5-level bid/ask book around a noisy mean-reverting latent value (Ornstein-Uhlenbeck style update). Depth decays by level, spread is variable, and random background market activity occasionally consumes top-level liquidity.
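The latent-value update can be sketched as a single Euler step of an Ornstein-Uhlenbeck process (the parameter values here are illustrative, not the simulator's actual calibration):

```python
import numpy as np

rng = np.random.default_rng(0)

def ou_step(x, mu=100.0, theta=0.05, sigma=0.02, dt=1.0):
    """One Euler-Maruyama step of dX = theta * (mu - X) dt + sigma dW:
    the value mean-reverts toward mu at rate theta with Gaussian noise."""
    return x + theta * (mu - x) * dt + sigma * np.sqrt(dt) * rng.normal()
```

Mean reversion keeps simulated prices from drifting arbitrarily far over an 1800-step episode, so the agent's learned behavior reflects microstructure rather than trend-chasing.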

Agent execution consumes ask-side depth for buys, level by level, and returns average execution price plus a simple depth-consumption score. Temporary price impact is modeled by shifting current price proportional to executed size (capped), ensuring large clips are penalized not only by immediate fills but also by the altered next-state price context.
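A minimal sketch of this fill-and-impact mechanic, assuming a level-by-level walk of the ask book (the impact coefficients and return signature are illustrative, not the repo's values):

```python
def execute_buy(quantity, ask_prices, ask_sizes,
                impact_coef=1e-4, max_impact=0.05):
    """Consume ask-side depth level by level for a buy of `quantity` shares.
    Returns (avg_fill_price, levels_touched, temporary_price_shift)."""
    remaining = quantity
    cost = 0.0
    levels_touched = 0
    for price, size in zip(ask_prices, ask_sizes):
        if remaining <= 0:
            break
        take = min(remaining, size)   # partial fill at this level
        cost += take * price
        remaining -= take
        levels_touched += 1
    filled = quantity - remaining
    avg_price = cost / filled if filled else ask_prices[0]
    # temporary impact: shift proportional to executed size, capped
    shift = min(impact_coef * filled, max_impact)
    return avg_price, levels_touched, shift
```

Because the shift feeds into the next state's prices, a large clip is penalized twice: worse average fill now and a worse price context next step.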

This setup is intentionally tractable rather than fully realistic. It does not model queue priority, hidden liquidity, exchange-specific matching rules, or cross-venue routing. That is acceptable for a research page so long as those assumptions are explicit.

Reward Engineering (Where Most Execution Projects Win or Fail)

Per-step reward combines implementation shortfall and depth penalty: `reward = q_t * (arrival_price - execution_price) - alpha * depth_consumed`. In this implementation, `alpha = 2.0`.

An end-of-episode inventory penalty enforces completion discipline: if the horizon ends with unexecuted shares, reward is reduced by `end_of_time_penalty * remaining_inventory` (configured as `5.0`). This is critical because otherwise the agent can game the objective by under-trading to avoid impact.
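Putting the two reward terms side by side (the constants come from the text above; the function names are illustrative):

```python
ALPHA = 2.0          # depth-penalty weight from the reward definition
END_PENALTY = 5.0    # per-share penalty for unexecuted terminal inventory

def step_reward(q_t, arrival_price, execution_price, depth_consumed):
    """Per-step reward: implementation shortfall minus depth penalty."""
    return q_t * (arrival_price - execution_price) - ALPHA * depth_consumed

def terminal_penalty(remaining_inventory):
    """Applied once at the horizon; zero if the order completed."""
    return -END_PENALTY * remaining_inventory
```

The terminal term is what closes the loophole: any shortfall saved by under-trading is reclaimed at 5.0 per unexecuted share.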

The resulting objective is a practical compromise: it rewards price improvement while explicitly pricing liquidity-taking aggressiveness and schedule risk. For a first RL execution implementation, this is a strong and correct design direction.

Agent and Training Loop

The training stack uses Stable-Baselines3 DQN with epsilon-greedy exploration, replay-buffer learning, and target-network stabilization. The repository documents training over roughly 200 episodes, with rewards stabilizing around episode ~100 in the shown run.
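Stable-Baselines3 wires these components together internally; for intuition, here is a standalone sketch of two of the stabilizers named above, epsilon-greedy decay and a replay buffer, with illustrative hyperparameters:

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO buffer of transitions; sampling breaks temporal correlation."""
    def __init__(self, capacity=10_000):
        self.buf = deque(maxlen=capacity)

    def add(self, transition):
        self.buf.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)

def epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=50_000):
    """Linearly anneal exploration from eps_start to eps_end."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```

The third stabilizer, a target network, is simply a lagged copy of the Q-network used to compute bootstrap targets; SB3 refreshes it on a fixed interval.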

For execution tasks, this setup is suitable because actions are naturally discrete and low-cardinality. If the action set were continuous (fractional participation rates or continuous clip sizing), policy-gradient or actor-critic families would usually be a better fit than vanilla DQN.

Baseline Strategy Benchmarking

Baselines implemented: TWAP (fixed pacing), Aggressive (fixed larger clip), Passive (high probability of no trade), and Random (uniform random sizing action). These baselines are exactly what an execution RL benchmark should include because they expose different points on the urgency-impact frontier.
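The four baselines can each be expressed as a one-line policy over (time, remaining inventory); a minimal sketch under the environment's stated parameters (the `p_trade` probability for the passive strategy is an assumption):

```python
import random

Q_MIN, HORIZON, TOTAL_SHARES = 20, 1800, 20000

def twap(t, remaining):
    """Fixed pacing: spread remaining shares evenly over remaining steps."""
    steps_left = max(HORIZON - t, 1)
    return min(remaining, -(-remaining // steps_left))  # ceiling division

def aggressive(t, remaining):
    """Largest available clip every step."""
    return min(remaining, 4 * Q_MIN)

def passive(t, remaining, p_trade=0.1):
    """Trades a minimum clip only occasionally; mostly does nothing."""
    return min(remaining, Q_MIN) if random.random() < p_trade else 0

def random_policy(t, remaining):
    """Uniform choice over the five discrete sizing actions."""
    return min(remaining, random.choice([0, 1, 2, 3, 4]) * Q_MIN)
```

Each policy sits at a different point on the urgency-impact frontier: aggressive minimizes schedule risk at maximum impact, passive does the reverse, and TWAP splits the difference deterministically.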

Reported mean rewards over multiple runs are tightly clustered, with RL slightly better than passive/random and close to TWAP in this environment. The practical interpretation is not that RL is universally superior, but that the policy learns sensible schedule behavior and does not collapse to pathological aggressiveness or inactivity.