waremax: Deterministic Warehouse-Robotics Simulation for RL Research
How waremax gives warehouse-robotics research a deterministic simulator, a Gymnasium RL interface, and instrumented delay attribution. Why deterministic beats plausible for reproducible research.
The problem
Warehouse-robotics research needs a simulator. The legacy simulators (RAWSim-O, ARENA-Sim) are written in Java and Python, target the same RMFS use case, and are not deterministic: a researcher who runs the same seeded scenario twice can get two different results.
This means every paper that uses RAWSim-O to compare two dispatching policies carries an unmeasured confound: simulator noise that can be comparable in magnitude to the policy difference. The result is a literature where “X beats Y” often doesn’t reproduce.
waremax is our attempt to fix this.
What waremax is
waremax is a deterministic, high-fidelity discrete-event simulator for Robotic Mobile Fulfillment Systems (RMFS), written in pure Rust, with a Gymnasium-style RL interface (PyO3) and instrumented causal delay attribution.
The three properties it guarantees:
- Determinism. Same seed + same action sequence → byte-identical trajectory. Verified by tests, not a marketing claim. We have caught and fixed real non-determinism bugs in the core.
- First-class RL interface. Gymnasium + PyO3. Train MaskablePPO against the same scenarios, with the same seeded trajectories, that you would use to benchmark heuristics.
- Instrumented delay attribution. Per-task decomposition of cycle time into causal categories (assignment wait, travel, station queue, congestion, service). Usable as a reward signal.
The thesis: a research-grade robotics benchmark must be deterministic, RL-ready, and reward-instrumentable. No current simulator is all three.
The competitive landscape
| Simulator | Language | Determinism | RL interface | Delay attribution |
|---|---|---|---|---|
| RAWSim-O | Java / Kotlin | No (HashMap-iteration-dependent) | No | Limited |
| ARENA-Sim | Python | No | Limited | No |
| gym-dispatch | Python | Limited | Yes (Gym) | No |
| AnyLogic | Java | No (commercial) | No | Yes |
| FlexSim | C++ | No (commercial) | Limited | Yes |
| waremax | Rust | Yes (byte-identical) | Yes (Gymnasium + PyO3) | Yes (per-task causal) |
waremax is the deterministic, RL-ready, open-source option. It is the only simulator in the table that provides determinism, an RL interface, and delay attribution together.
Why determinism matters
A non-deterministic simulator undermines research in four ways:
- Reproducibility. A reviewer cannot rerun your experiment. A reader cannot build on your result.
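- Statistical power. To detect a small effect, you need many runs. If each run also carries simulator noise, you need hundreds. With a deterministic simulator the only variance is across seeds, so paired comparisons on identical seeds need fewer.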
- Debugging. When a heuristic fails, you need to reproduce the exact scenario. Non-determinism makes this intractable.
- Reward shaping. If your reward signal is noisy, the RL agent learns the noise, not the signal.
RAWSim-O’s HashMap-iteration bug is the canonical example. Prior to the fix, “seeded” results on RAWSim-O were not exactly reproducible: repeated runs typically agreed closely, but with small, real noise. The published “X beats Y” results from RAWSim-O papers therefore carry confidence intervals that leave out one source of variance, the simulator itself rather than the policy.
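To make the failure mode concrete, here is a minimal sketch of how iteration order can leak into a result. It is a toy, not RAWSim-O or waremax code, and the names `pick_first_seen` and `pick_stable` are hypothetical. When two robots are equally close, a selector that keeps the first minimum it sees returns whichever robot the container happens to yield first; an explicit tie-break on the robot ID removes that dependence.

```python
from itertools import permutations

def pick_first_seen(distances):
    """Order-dependent: keeps the first robot that attains the minimum distance."""
    best_id, best_dist = None, float("inf")
    for robot_id, dist in distances:
        if dist < best_dist:
            best_id, best_dist = robot_id, dist
    return best_id

def pick_stable(distances):
    """Order-independent: ties on distance are broken by robot ID."""
    return min(distances, key=lambda item: (item[1], item[0]))[0]

# Two robots tie at distance 3; only the iteration order differs between "runs".
candidates = [("r2", 3), ("r7", 3), ("r5", 9)]
first_seen = {pick_first_seen(list(p)) for p in permutations(candidates)}
stable = {pick_stable(list(p)) for p in permutations(candidates)}
print(sorted(first_seen))  # ['r2', 'r7']
print(sorted(stable))      # ['r2']
```

The unstable selector returns either tied robot depending on order, and every later event in the simulation inherits that choice. Sorting keys or adding a total order on ties is the usual remedy.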
waremax’s determinism is a correctness property. We test it. The CI run on every commit verifies that seeded trajectories are byte-identical.
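A byte-identity check of this kind can be sketched in a few lines. The example below is an assumption-laden illustration, not waremax's actual test suite: `run_toy_sim` is a hypothetical stand-in for a simulator, and the trajectory is serialised with a fixed byte layout so that two runs can be compared by hash.

```python
import hashlib
import random
import struct

def run_toy_sim(seed, actions):
    """Hypothetical stand-in for a simulator: returns the trajectory as raw bytes."""
    rng = random.Random(seed)  # all randomness flows from one seeded generator
    position = 0.0
    out = bytearray()
    for action in actions:
        position += action + rng.random()
        out += struct.pack("<d", position)  # fixed-width little-endian doubles
    return bytes(out)

def trajectory_digest(seed, actions):
    """SHA-256 of the serialised trajectory, used as a regression fingerprint."""
    return hashlib.sha256(run_toy_sim(seed, actions)).hexdigest()

actions = [1, 0, 2, 1]
# Same seed and same actions must give the same bytes.
assert trajectory_digest(42, actions) == trajectory_digest(42, actions)
# A different seed should give a different trajectory.
assert trajectory_digest(42, actions) != trajectory_digest(43, actions)
```

Comparing digests rather than summary statistics is what makes the check strict: a single differing byte anywhere in the trajectory fails the test.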
Why RL matters
Most warehouse-robotics research questions are learning questions: which policy generalises? how much data do we need? when does a learned policy beat a heuristic? These questions need an RL interface, not just a simulator.
waremax exposes a Gymnasium environment:
```python
from waremax_gym import WaremaxAllocEnv

env = WaremaxAllocEnv(
    preset="standard",
    duration_minutes=15,
    due_time_minutes=2,
    reward_mode="routed",  # or "sparse", "dense", "attribution"
)
obs, _ = env.reset(seed=42)
# obs is a Dict({"robots": (64, 8), "task": (6,), "action_mask": (64,)})
# action: index into masked candidates
# pair with sb3-contrib MaskablePPO
```
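What "index into masked candidates" means in practice can be illustrated without the library. The sketch below assumes a dict observation with the same three keys but far smaller shapes, and a made-up feature layout (feature 0 as distance, feature 1 as a busy flag); `score_candidate` and `select_action` are hypothetical helpers, not part of `waremax_gym`.

```python
def score_candidate(robot_features, task_features):
    """Hypothetical scorer: the same function is applied to every candidate row."""
    # Assumption: feature 0 is distance to pickup, feature 1 is a busy flag.
    distance, busy = robot_features[0], robot_features[1]
    urgency = task_features[0]
    return -distance * urgency - 100.0 * busy

def select_action(obs):
    """Return the index of the best-scoring candidate whose mask bit is set."""
    best_index, best_score = None, float("-inf")
    for index, (row, allowed) in enumerate(zip(obs["robots"], obs["action_mask"])):
        if not allowed:
            continue  # masked candidates are never chosen
        score = score_candidate(row, obs["task"])
        if score > best_score:  # strict '>' keeps the lowest index on ties
            best_index, best_score = index, score
    return best_index

obs = {
    "robots": [[2.0, 0.0], [1.0, 0.0], [5.0, 0.0]],
    "task": [1.0],
    "action_mask": [1, 0, 1],
}
print(select_action(obs))  # 0
```

Robot 1 is the closest but is masked out, so the selector falls back to robot 0. A learned policy replaces the hand-written scorer while the mask handling stays the same.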
The reward modes are designed around the controllability principle: the reward targets the delay the decision controls (assignment wait, travel-to-pickup), not the uncontrollable delay (congestion, station queue). The attribution mode exposes the full causal decomposition.
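As a rough illustration of the principle, the sketch below contrasts a reward that penalises only the controllable components with one that penalises the whole cycle time. It is not waremax's reward implementation: the `TaskDelays` type and both reward functions are hypothetical, and the `travel` field is assumed to stand in for travel-to-pickup.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskDelays:
    """Per-task delay components in seconds (hypothetical container)."""
    assignment_wait: float
    travel: float
    station_queue: float
    congestion: float
    service: float

    @property
    def cycle_time(self):
        return (self.assignment_wait + self.travel + self.station_queue
                + self.congestion + self.service)

def controllable_reward(delays):
    """Penalise only the components the dispatch decision influences."""
    return -(delays.assignment_wait + delays.travel)

def full_cycle_reward(delays):
    """Penalise everything, including delay the decision cannot influence."""
    return -delays.cycle_time

task = TaskDelays(assignment_wait=4.0, travel=10.0, station_queue=20.0,
                  congestion=6.0, service=8.0)
print(controllable_reward(task), full_cycle_reward(task))  # -14.0 -48.0
```

In this toy task, 34 of the 48 seconds are outside the decision's control, so the full-cycle reward is dominated by terms the agent cannot change.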
Why delay attribution matters
A typical research question: “Why does learned dispatching underperform round-robin in this scenario?”
Without delay attribution, the answer is hand-wavy. With delay attribution:
```rust
let report = waremax.analyze(run_dir)?;
println!("Per-task cycle time decomposition:");
for task in report.tasks {
    println!(
        "  task={} cycle={:.2}s wait={:.2}s travel={:.2}s queue={:.2}s cong={:.2}s svc={:.2}s",
        task.id, task.cycle_time,
        task.assignment_wait, task.travel,
        task.station_queue, task.congestion, task.service,
    );
}
```
You can see exactly where the time is going. If the learned policy loses to round-robin because of station-queue time, you know the issue is station assignment, not dispatching. The diagnosis is data-driven, not narrative.
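The comparison itself is easy to automate once the decomposition is available as data. The sketch below assumes the per-task breakdown has been exported as a list of dicts keyed by category, which is a hypothetical format rather than a documented waremax output, and the numbers are illustrative only.

```python
from statistics import mean

CATEGORIES = ("assignment_wait", "travel", "station_queue", "congestion", "service")

def mean_by_category(tasks):
    """Average each delay category over a list of per-task dicts."""
    return {name: mean(task[name] for task in tasks) for name in CATEGORIES}

def largest_gap(learned_tasks, baseline_tasks):
    """Return (category, gap) where the learned policy loses the most time."""
    learned = mean_by_category(learned_tasks)
    baseline = mean_by_category(baseline_tasks)
    gaps = {name: learned[name] - baseline[name] for name in CATEGORIES}
    worst = max(CATEGORIES, key=lambda name: gaps[name])
    return worst, gaps[worst]

# Illustrative numbers only, not measured results.
learned = [
    {"assignment_wait": 2.0, "travel": 10.0, "station_queue": 30.0, "congestion": 4.0, "service": 8.0},
    {"assignment_wait": 4.0, "travel": 12.0, "station_queue": 26.0, "congestion": 6.0, "service": 8.0},
]
round_robin = [
    {"assignment_wait": 3.0, "travel": 11.0, "station_queue": 18.0, "congestion": 5.0, "service": 8.0},
    {"assignment_wait": 5.0, "travel": 13.0, "station_queue": 14.0, "congestion": 5.0, "service": 8.0},
]
print(largest_gap(learned, round_robin))  # ('station_queue', 12.0)
```

Here the learned policy is slightly ahead on assignment wait and travel but gives back 12 seconds per task in the station queue, which points at station assignment.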
When waremax is the right answer
- You are doing research on warehouse dispatching and need reproducible results.
- You are training RL policies and need a Gymnasium interface with realistic reward modes.
- You are studying the controllability principle (which delays should a reward target) and need per-task attribution.
- You are benchmarking policies and need a confidence interval that doesn’t include simulator noise.
When waremax is the wrong answer
- You need a UI to visualise the simulation. waremax has a CLI, not a GUI. Use AnyLogic or FlexSim.
- You need a different domain (e.g. manufacturing, AGVs in hospitals). Use a domain-specific simulator.
- You need a closed-form analytical model, not a simulation. waremax is empirical.
- You are prototyping and the determinism guarantee isn’t worth the setup cost. Use a Python prototype.
The published findings
waremax is not just a simulator; it backs a research program. The published findings (from the accompanying paper, “When Does Learning to Dispatch Help? A Deterministic Benchmark and a Controllability Principle for Reward Design in Warehouse Robotics”):
- Representation × reward interaction. A permutation-equivariant candidate-scoring policy paired with an attribution-shaped reward reaches ~97% on-time attainment. A flattened MLP with the same reward plateaus at ~82%. Neither ingredient suffices alone.
- Controllability principle. Restricting the reward to the delay the agent can control is directionally better than penalising uncontrollable delay. Reported with a Welch’s t-test.
- Bounded leverage. Across four control levers (allocation, congestion-aware routing, reward design, pickup-bin choice), learned dispatching matches but does not beat simple heuristics in the current scenarios. The system is capacity- and destination-contention-bound.
The third finding is itself a contribution. Most “learned-vs-heuristic” comparisons in the literature claim a win; waremax’s determinism allows the more honest “they tie” finding to be defended.
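For readers unfamiliar with the test mentioned under the controllability principle, Welch's t-test compares two sample means without assuming equal variances. The sketch below computes the statistic and the Welch-Satterthwaite degrees of freedom from the standard formulas; the sample values are illustrative placeholders, not results from the paper, and `welch_t` is a hypothetical helper.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each group mean, using the unbiased variance.
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df

# Illustrative per-seed on-time rates, not results from the paper.
controllable = [0.96, 0.97, 0.95, 0.98, 0.97]
penalise_all = [0.93, 0.96, 0.92, 0.97, 0.94]
t, df = welch_t(controllable, penalise_all)
print(f"t={t:.2f}, df={df:.1f}")
```

With a deterministic simulator, each entry in these lists is exactly reproducible from its seed, so the spread reflects scenario variation rather than simulator noise.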
What to read next
- waremax repository
- waremax documentation
- RAWSim-O — the legacy simulator waremax improves on
- Gymnasium — the RL interface
- sb3-contrib MaskablePPO — the RL algorithm waremax is designed for