waremax: Deterministic Warehouse-Robotics Simulation for RL Research

The problem

Warehouse-robotics research needs a simulator. The legacy simulators (RAWSim-O, ARENA-Sim) are written in Java and Python, target the same RMFS use case, and are not reliably deterministic: two seeded runs of the same scenario can produce different results.

This means every paper that uses RAWSim-O to compare two dispatching policies carries an unmeasured confound: the simulation noise can be comparable in magnitude to the policy difference. The result is a literature where “X beats Y” often doesn’t reproduce.

waremax is our attempt to fix this.

What waremax is

waremax is a deterministic, high-fidelity discrete-event simulator for Robotic Mobile Fulfillment Systems (RMFS), written in pure Rust, with a Gymnasium-style RL interface (PyO3) and instrumented causal delay attribution.

The three properties it guarantees:

Determinism. Same seed + same action sequence → byte-identical trajectory. Verified by tests, not a marketing claim. We have caught and fixed real non-determinism bugs in the core.
First-class RL interface. Gymnasium + PyO3. Train MaskablePPO against the same scenarios you would benchmark heuristics on, with the same seeds, so learned and heuristic policies face identical seeded conditions.
Instrumented delay attribution. Per-task decomposition of cycle time into causal categories (assignment wait, travel, station queue, congestion, service). Usable as a reward signal.

The thesis: a research-grade robotics benchmark must be deterministic, RL-ready, and reward-instrumentable. No current simulator is all three.

The competitive landscape

Simulator	Language	Determinism	RL interface	Delay attribution
RAWSim-O	Java / Kotlin	No (HashMap-iteration-dependent)	No	Limited
ARENA-Sim	Python	No	Limited	No
gym-dispatch	Python	Limited	Yes (Gym)	No
AnyLogic	Java	No (commercial)	No	Yes
FlexSim	C++	No (commercial)	Limited	Yes
waremax	Rust	Yes (byte-identical)	Yes (Gymnasium + PyO3)	Yes (per-task causal)

Among the simulators above, waremax is the only one that provides determinism, an RL interface, and delay attribution together. It is also open source.

Why determinism matters

A non-deterministic simulator undermines research in four ways:

Reproducibility. A reviewer cannot rerun your experiment. A reader cannot build on your result.
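Statistical power. To detect a small effect, you need many runs. If each run also carries simulator noise, you need hundreds. With a deterministic simulator the only variance is across seeds, and two policies can be compared on the same seeds, so fewer runs are needed.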
Debugging. When a heuristic fails, you need to reproduce the exact scenario. Non-determinism makes this intractable.
Reward shaping. If your reward signal is noisy, the RL agent learns the noise, not the signal.
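
The statistical-power point can be illustrated with a toy model. The sketch below is not waremax code: `toy_cycle_time` is a made-up stand-in for one seeded run, in which the scenario difficulty (fixed by the seed) dominates a small policy effect. Because the function is deterministic and both policies are evaluated on the same seeds, the per-seed differences cancel the scenario variance.

```python
import random
import statistics

def toy_cycle_time(policy_bonus: float, seed: int) -> float:
    """Hypothetical stand-in for one seeded simulation run (not the waremax API).

    The seed fixes the scenario difficulty; the policy shifts the result by a
    small constant. The same inputs always give the same output.
    """
    rng = random.Random(seed)
    difficulty = rng.gauss(100.0, 15.0)  # scenario-to-scenario spread
    return difficulty - policy_bonus

seeds = range(30)
policy_a = [toy_cycle_time(0.0, s) for s in seeds]
policy_b = [toy_cycle_time(1.5, s) for s in seeds]

# Unpaired view: the spread across scenarios swamps the 1.5 s effect.
spread_across_seeds = statistics.stdev(policy_a)

# Paired view: same seed, same scenario, so the difficulty term cancels.
diffs = [a - b for a, b in zip(policy_a, policy_b)]
mean_diff = statistics.mean(diffs)
spread_of_diffs = statistics.stdev(diffs)

print(f"stdev across seeds:          {spread_across_seeds:.2f}")
print(f"mean paired difference:      {mean_diff:.2f}")
print(f"stdev of paired differences: {spread_of_diffs:.2e}")
```

In a simulator with run-to-run noise, the paired differences would pick up that noise as well, and the effect would need more seeds to resolve.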

RAWSim-O’s HashMap-iteration bug is the canonical example. Prior to the fix, “seeded” results on RAWSim-O were not exactly reproducible: repeated runs usually agreed closely, but with small, real noise. The published “X beats Y” results from RAWSim-O papers therefore carry a variance component that was never measured, because it came from the simulator, not the policy.

waremax’s determinism is a correctness property. We test it. The CI run on every commit verifies that seeded trajectories are byte-identical.
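
A byte-identical check of this kind can be sketched as follows. This is a minimal illustration, not the waremax test suite: `run_toy_episode` is a hypothetical stand-in for a seeded run, and the fingerprint is simply a SHA-256 hash over a canonical JSON serialisation of the trajectory.

```python
import hashlib
import json
import random

def run_toy_episode(seed: int, actions: list[int]) -> list[dict]:
    """Hypothetical stand-in for a seeded run (not the waremax API)."""
    rng = random.Random(seed)  # all randomness flows from the seed
    trajectory = []
    position = 0
    for step, action in enumerate(actions):
        position += action + rng.randint(0, 3)
        trajectory.append({"step": step, "action": action, "position": position})
    return trajectory

def fingerprint(trajectory: list[dict]) -> str:
    """SHA-256 over a canonical serialisation of the trajectory."""
    payload = json.dumps(trajectory, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

actions = [1, 0, 2, 1, 1, 0]
first = fingerprint(run_toy_episode(seed=42, actions=actions))
second = fingerprint(run_toy_episode(seed=42, actions=actions))

# Same seed + same action sequence must give the same bytes.
assert first == second, "trajectory differs between identical seeded runs"
```

Hashing the serialised trajectory rather than comparing summary metrics is the point of the design: two runs can agree on mean cycle time and still diverge event by event.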

Why RL matters

Most warehouse-robotics research questions are learning questions: which policy generalises? how much data do we need? when does a learned policy beat a heuristic? These questions need an RL interface, not just a simulator.

waremax exposes a Gymnasium environment:

from waremax_gym import WaremaxAllocEnv

env = WaremaxAllocEnv(
    preset="standard",
    duration_minutes=15,
    due_time_minutes=2,
    reward_mode="routed",  # or "sparse", "dense", "attribution"
)
obs, _ = env.reset(seed=42)
# obs is a Dict({"robots": (64, 8), "task": (6,), "action_mask": (64,)})
# action: index into masked candidates
# pair with sb3-contrib MaskablePPO
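
To show how the action mask is meant to be consumed, here is a minimal heuristic baseline that only ever picks a masked-in candidate. It is a sketch under assumptions: the feature layout of `obs["robots"]` is not specified above, so treating column 0 as a distance-to-pickup feature is purely hypothetical, and the observation is synthetic rather than produced by `WaremaxAllocEnv`.

```python
def nearest_valid_robot(obs: dict, distance_col: int = 0) -> int:
    """Pick the masked-in robot with the smallest value in `distance_col`.

    ASSUMPTION: column `distance_col` holds a distance-to-pickup feature.
    The real feature layout of obs["robots"] is not documented here.
    """
    candidates = [i for i, allowed in enumerate(obs["action_mask"]) if allowed]
    if not candidates:
        raise ValueError("action mask allows no robot")
    return min(candidates, key=lambda i: obs["robots"][i][distance_col])

# Synthetic observation with the documented shapes: (64, 8), (6,), (64,).
obs = {
    "robots": [[float((i * 7) % 64)] + [0.0] * 7 for i in range(64)],
    "task": [0.0] * 6,
    "action_mask": [i % 2 == 1 for i in range(64)],  # only odd indices valid
}

action = nearest_valid_robot(obs)
print(action, obs["action_mask"][action])
```

The same indexing works if the observation values are NumPy arrays. A baseline like this gives a learned policy something to be compared against on identical seeds.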

The reward modes are designed around the controllability principle: the reward targets the delay the decision controls (assignment wait, travel-to-pickup), not the uncontrollable delay (congestion, station queue). The attribution mode exposes the full causal decomposition.
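
The controllability principle can be made concrete with a small sketch. The code below does not reproduce waremax's reward modes; it assumes the five attribution categories sum to the cycle time, treats `travel` as the travel-to-pickup component, and uses made-up delay values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskDelays:
    """Per-task delay components in seconds (the five categories named above)."""
    assignment_wait: float
    travel: float
    station_queue: float
    congestion: float
    service: float

    @property
    def cycle_time(self) -> float:
        # ASSUMPTION: the categories are an exact decomposition of cycle time.
        return (self.assignment_wait + self.travel + self.station_queue
                + self.congestion + self.service)

def controllable_reward(d: TaskDelays) -> float:
    """Penalise only the delay the dispatch decision influences."""
    return -(d.assignment_wait + d.travel)

def full_cycle_reward(d: TaskDelays) -> float:
    """Penalise the whole cycle time, including delay the agent cannot affect."""
    return -d.cycle_time

# Two made-up tasks: identical dispatch quality, different station load.
quiet = TaskDelays(4.0, 20.0, 2.0, 1.0, 10.0)
busy = TaskDelays(4.0, 20.0, 45.0, 12.0, 10.0)

print(controllable_reward(quiet), controllable_reward(busy))
print(full_cycle_reward(quiet), full_cycle_reward(busy))
```

Under the controllable reward both tasks score the same, because the dispatch decision was equally good in each. Under the full-cycle reward the busy-station task is penalised far more heavily for delay the agent did not cause.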

Why delay attribution matters

A typical research question: “Why does learned dispatching underperform round-robin in this scenario?”

Without delay attribution, the answer is hand-wavy. With delay attribution:

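// Load the per-task attribution report for a finished run.
// The `?` assumes an enclosing function that returns a Result.
let report = waremax.analyze(run_dir)?;
println!("Per-task cycle time decomposition:");
// One line per task: total cycle time, then its five causal components.
for task in report.tasks {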
    println!(
        "  task={} cycle={:.2}s wait={:.2}s travel={:.2}s queue={:.2}s cong={:.2}s svc={:.2}s",
        task.id, task.cycle_time,
        task.assignment_wait, task.travel,
        task.station_queue, task.congestion, task.service,
    );
}

You can see exactly where the time is going. If the learned policy loses to round-robin because of station-queue time, you know the issue is station assignment, not dispatching. The diagnosis is data-driven, not narrative.
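
The same diagnosis can be automated across two runs. The sketch below assumes the per-task rows have already been exported as dictionaries keyed by the five category names; that export format and every number in it are invented for illustration.

```python
from statistics import mean

CATEGORIES = ("assignment_wait", "travel", "station_queue", "congestion", "service")

def mean_by_category(tasks: list[dict]) -> dict:
    """Average each delay category over all tasks in a run."""
    return {c: mean(t[c] for t in tasks) for c in CATEGORIES}

def largest_gap(learned: list[dict], baseline: list[dict]) -> tuple:
    """Return (category, gap in seconds) where the learned policy loses most."""
    m_learned, m_base = mean_by_category(learned), mean_by_category(baseline)
    gaps = {c: m_learned[c] - m_base[c] for c in CATEGORIES}
    worst = max(gaps, key=gaps.get)
    return worst, gaps[worst]

# Made-up rows, for illustration only.
round_robin = [
    {"assignment_wait": 5.0, "travel": 22.0, "station_queue": 6.0, "congestion": 3.0, "service": 10.0},
    {"assignment_wait": 7.0, "travel": 26.0, "station_queue": 8.0, "congestion": 3.0, "service": 10.0},
]
learned = [
    {"assignment_wait": 4.0, "travel": 20.0, "station_queue": 18.0, "congestion": 4.0, "service": 10.0},
    {"assignment_wait": 4.0, "travel": 22.0, "station_queue": 22.0, "congestion": 4.0, "service": 10.0},
]

category, gap = largest_gap(learned, round_robin)
print(f"largest regression: {category} (+{gap:.1f}s per task)")
```

In this invented example the learned policy is faster on assignment wait and travel but loses more than it gains in the station queue, which is the pattern described above.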

When waremax is the right answer

You are doing research on warehouse dispatching and need reproducible results.
You are training RL policies and need a Gymnasium interface with realistic reward modes.
You are studying the controllability principle (which delays should a reward target) and need per-task attribution.
You are benchmarking policies and need a confidence interval that doesn’t include simulator noise.

When waremax is the wrong answer

You need a UI to visualise the simulation. waremax has a CLI, not a GUI. Use AnyLogic or FlexSim.
You need a different domain (e.g. manufacturing, AGV in hospitals). Use a domain-specific simulator.
You need a closed-form analytical model, not a simulation. waremax is empirical.
You are prototyping and the determinism guarantee isn’t worth the setup cost. Use a Python prototype.

The published findings

waremax is not just a simulator; it backs a research program. The published findings (from the accompanying paper, “When Does Learning to Dispatch Help? A Deterministic Benchmark and a Controllability Principle for Reward Design in Warehouse Robotics”):

Representation × reward interaction. A permutation-equivariant candidate-scoring policy paired with an attribution-shaped reward reaches ~97% on-time attainment. A flattened MLP with the same reward plateaus at ~82%. Neither ingredient suffices alone.
Controllability principle. Restricting the reward to the delay the agent can control is directionally better than penalising uncontrollable delay. Reported with a Welch’s t-test.
Bounded leverage. Across four control levers (allocation, congestion-aware routing, reward design, pickup-bin choice), learned dispatching matches but does not beat simple heuristics in the current scenarios. The system is capacity- and destination-contention-bound.

The third finding is itself a contribution. Most “learned-vs-heuristic” comparisons in the literature claim a win; waremax’s determinism allows the more honest “they tie” finding to be defended.
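
For readers unfamiliar with the Welch's t-test mentioned in the controllability finding, the statistic can be computed as below. This is a generic textbook sketch, not the paper's analysis code, and the two samples are arbitrary placeholder numbers, not results from the paper.

```python
import math
import statistics

def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    n_a, n_b = len(a), len(b)
    se2 = var_a / n_a + var_b / n_b  # squared standard error of the difference
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / (
        (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
    )
    return t, df

# Arbitrary placeholder samples, NOT data from the paper.
sample_a = [10.0, 12.0, 11.0, 13.0, 14.0]
sample_b = [8.0, 9.0, 10.0, 9.0, 9.0]

t_stat, dof = welch_t(sample_a, sample_b)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

Unlike Student's t-test, Welch's version does not assume the two groups have equal variance, which suits comparisons between reward designs whose run-to-run spread may differ. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` computes the same statistic along with a p-value.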