2 changes: 2 additions & 0 deletions README.md
@@ -1,5 +1,7 @@
# PufferDrive

[![Unit Tests](https://github.com/Emerge-Lab/PufferDrive/actions/workflows/utest.yml/badge.svg)](https://github.com/Emerge-Lab/PufferDrive/actions/workflows/utest.yml)

<img align="left" style="width:260px" src="https://github.com/Emerge-Lab/PufferDrive/blob/main/pufferlib/resources/drive/pufferdrive_20fps_long.gif" width="288px">

**PufferDrive is a fast and friendly driving simulator to train and test RL-based models.**
24 changes: 24 additions & 0 deletions docs/src/interact-with-agents.md
@@ -18,6 +18,30 @@ then launch:

This will run `demo()` with an existing model checkpoint.

## Arguments & Configuration

The `drive` tool supports CLI arguments similar to the visualizer's for controlling the environment and rendering. It also reads `pufferlib/config/ocean/drive.ini` for default environment settings.

### Command Line Arguments

| Argument | Description | Default |
| :--- | :--- | :--- |
| `--map-name <path>` | Path to the map binary file (e.g., `resources/drive/binaries/training/map_000.bin`). If omitted, picks a random map out of `num_maps` from `map_dir` in `drive.ini`. | Random |
| `--policy-name <path>` | Path to the policy weights file (`.bin`). | `resources/drive/puffer_drive_weights.bin` |
| `--view <mode>` | Selects which views to render: `agent`, `topdown`, or `both`. | `both` |
| `--frame-skip <n>` | Renders every Nth frame to speed up simulation (framerate remains 30fps). | `1` |
| `--num-maps <n>` | Overrides the number of maps to sample from if `--map-name` is not set. | `drive.ini` value |

### Visualization Flags

| Flag | Description |
| :--- | :--- |
| `--show-grid` | Draws the underlying nav-graph/grid on the map. |
| `--obs-only` | Hides objects not currently visible to the agent's sensors (fog of war). |
| `--lasers` | Visualizes the raycast sensor lines from the agent. |
| `--log-trajectories` | Draws the ground-truth "human" expert trajectories as green lines. |
| `--zoom-in` | Zooms the camera mainly on the active region rather than the full map bounds. |
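Putting a few of these together, a typical invocation might look like the following (the `./drive` binary name and the map path are assumptions; adjust to your build):

```bash
./drive --map-name resources/drive/binaries/training/map_000.bin \
    --view agent --frame-skip 2 --lasers --log-trajectories
```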

### Controls

**General:**
21 changes: 19 additions & 2 deletions docs/src/simulator.md
@@ -22,11 +22,28 @@ A high-performance autonomous driving simulator in C with Python bindings.

- `control_vehicles`: Only vehicles
- `control_agents`: All agent types (vehicles, cyclists, pedestrians)
- `control_wosac`: WOSAC evaluation mode (controls all valid agents ignoring expert flag and start to goal distance)
- `control_sdc_only`: Self-driving car only

> [!NOTE]
> `control_vehicles` filters out agents marked as "expert" and those too close to their goal (<2m). For full WOMD evaluation, use `control_wosac`.

> [!IMPORTANT]
> **Agent Dynamics:** The simulator supports three types of agents:
> 1. **Policy-Controlled:** Stepped by your model's actions.
> 2. **Experts:** Stepped using ground-truth log trajectories.
> 3. **Static:** Remain frozen in place.
>
> In the simulator, agents not selected for policy control are treated as **Static** by default. To make them follow their **Expert trajectories**, set `mark_as_expert=true` for those agents in the JSON files. This is especially important for `control_sdc_only`, so that the environment behaves realistically around the policy-controlled agent.

### Init modes

- **`create_all_valid`** (Default): Initializes every valid agent present in the map file. This includes policy-controlled agents, experts (if marked), and static agents.

- **`create_only_controlled`**: Initializes **only** the agents that are directly controlled by the policy.

> [!NOTE]
> In `create_only_controlled` mode, the environment will contain **no static or expert agents**. Only the policy-controlled agents will exist.
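As a sketch, the two settings above might be combined in `pufferlib/config/ocean/drive.ini` like this (the exact key names, in particular `init_mode`, are assumptions; check the shipped config for the actual spelling):

```ini
[env]
; Control all valid agents, ignoring expert flags (full WOSAC evaluation)
control_mode = control_wosac
; Initialize every valid agent: policy-controlled, experts (if marked), and static
init_mode = create_all_valid
```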

### Goal behaviors

12 changes: 6 additions & 6 deletions docs/src/train.md
@@ -1,14 +1,14 @@
# Training

## Basic training

Launch a training run with Weights & Biases logging:

```bash
puffer train puffer_drive --wandb --wandb-project "pufferdrive"
```

## Environment configurations

**Default configuration (Waymo maps)**

@@ -33,11 +33,11 @@ resample_frequency = 100000 # No resampling needed (there are only a few Carla maps)
termination_mode = 0 # 0: terminate at episode_length, 1: terminate after all agents reset

# Map settings
map_dir = "resources/drive/binaries"
num_maps = 2 # Number of Carla maps you're training on
```

This should give a good starting point. With these settings, you'll need about 2-3 billion steps to get an agent that reaches most of its goals (>95%) and has a combined collision/off-road rate of about 3% per 300-step episode in Towns 1 and 2, which can be found [here](https://github.com/Emerge-Lab/PufferDrive/tree/2.0/data_utils/carla/carla_data). Before launching your experiment, run `drive.py` on the folder containing the Carla towns to process them into binaries, then make sure `map_dir` above points to these binaries.
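Assuming the Carla towns live in `data_utils/carla/carla_data` and `drive.py` accepts the folder as its argument (treat the exact invocation as illustrative), the workflow might look like:

```bash
# Convert the raw Carla towns to PufferDrive binaries
python drive.py data_utils/carla/carla_data

# With map_dir in drive.ini pointing at the generated binaries, launch training
puffer train puffer_drive --wandb --wandb-project "pufferdrive"
```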

> [!NOTE]
> The default training hyperparameters work well for both configurations and typically don't need adjustment.
38 changes: 36 additions & 2 deletions docs/src/visualizer.md
@@ -25,8 +25,8 @@ bash scripts/build_ocean.sh visualize local

If you need to force a rebuild, remove the cached binary first (`rm ./visualize`).

## Rendering a Video
Launch the visualizer with a virtual display and export an `.mp4` for a given scenario binary:

```bash
xvfb-run -s "-screen 0 1280x720x24" ./visualize
@@ -43,3 +43,37 @@ puffer render puffer_drive
```

This mode parallelizes rendering based on `vec.num_workers`.

## Arguments & Configuration

The `visualize` tool supports several CLI arguments to control the rendering output. It also reads `pufferlib/config/ocean/drive.ini` for default environment settings (for more details on these settings, see [Configuration](simulator.md#configuration)).

### Command Line Arguments

| Argument | Description | Default |
| :--- | :--- | :--- |
| `--map-name <path>` | Path to the map binary file (e.g., `resources/drive/binaries/training/map_000.bin`). If omitted, picks a random map out of `num_maps` from `map_dir` in `drive.ini`. | Random |
| `--policy-name <path>` | Path to the policy weights file (`.bin`). | `resources/drive/puffer_drive_weights.bin` |
| `--view <mode>` | Selects which views to render: `agent`, `topdown`, or `both`. | `both` |
| `--output-agent <path>` | Output filename for agent view video. | `<policy>_agent.mp4` |
| `--output-topdown <path>` | Output filename for top-down view video. | `<policy>_topdown.mp4` |
| `--frame-skip <n>` | Renders every Nth frame to speed up generation (framerate remains 30fps). | `1` |
| `--num-maps <n>` | Overrides the number of maps to sample from if `--map-name` is not set. | `drive.ini` value |

### Visualization Flags

| Flag | Description |
| :--- | :--- |
| `--show-grid` | Draws the underlying nav-graph/grid on the map. |
| `--obs-only` | Hides objects not currently visible to the agent's sensors (fog of war). |
| `--lasers` | Visualizes the raycast sensor lines from the agent. |
| `--log-trajectories` | Draws the ground-truth "human" expert trajectories as green lines. |
| `--zoom-in` | Zooms the camera mainly on the active region rather than the full map bounds. |
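Combining these with the headless setup above, a full render command might look like the following (the map path and output filename are illustrative):

```bash
xvfb-run -s "-screen 0 1280x720x24" ./visualize \
    --map-name resources/drive/binaries/training/map_000.bin \
    --view topdown --output-topdown map_000_topdown.mp4 \
    --lasers --zoom-in
```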

### Key `drive.ini` Settings
The visualizer initializes the environment using `pufferlib/config/ocean/drive.ini`. Important settings include:

- `[env] dynamics_model`: `classic` or `jerk`. Must match the trained policy.
- `[env] episode_length`: Duration of the playback; defaults to 91 if set to 0.
- `[env] control_mode`: Determines which agents are active (`control_vehicles` vs `control_sdc_only`).
- `[env] goal_behavior`: Defines agent behavior upon reaching goals (respawn vs stop).
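A minimal `[env]` block matching the settings above might look like this (the values shown are examples, not recommendations, and the `goal_behavior` value names are assumptions):

```ini
[env]
; Must match the dynamics model the policy was trained with
dynamics_model = classic
; 0 falls back to the default playback length of 91 steps
episode_length = 0
; Which agents the policy controls
control_mode = control_vehicles
; What agents do on reaching their goal (e.g., respawn or stop)
goal_behavior = respawn
```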
26 changes: 7 additions & 19 deletions docs/src/wosac.md
@@ -55,30 +55,18 @@ We provide baselines on a small curated dataset from the WOMD validation set wit

| Method | Realism meta-score | Kinematic metrics | Interactive metrics | Map-based metrics | minADE | ADE |
|--------|-------------------|-------------------|---------------------|-------------------|--------|------|
| Ground-truth (UB) | 0.8179 | 0.6070 | 0.9590 | 0.8722 | 0 | 0 |
| Self-play RL agent | 0.6750 | 0.2798 | 0.7966 | 0.7811 | 10.8057 | 11.4108 |
| [SMART-tiny-CLSFT](https://arxiv.org/abs/2412.05334) | 0.7818 | 0.5200 | 0.8914 | 0.8378 | 1.1236 | 3.1231 |
| Random | 0.4459 | 0.0506 | 0.7843 | 0.4704 | 23.5936 | 25.0097 |

*Table: WOSAC baselines in PufferDrive on 229 selected clean held-out validation scenarios.*




- **Random agent:** Following the [WOSAC 2023 paper](https://arxiv.org/abs/2305.12032), the random agent samples future trajectories by independently sampling (x, y, θ) at each timestep from a Gaussian distribution in the AV coordinate frame `(mu=1.0, sigma=0.1)`, producing uncorrelated random motion over the horizon of 80 steps.
- **Goal-conditioned self-play RL agent:** An agent trained with self-play RL to reach designated end points ("goals") without colliding or going off-road. The baseline can be reproduced using the default settings in the `drive.ini` file with the Waymo dataset. We also open-source the weights of this policy; see `pufferlib/resources/drive/puffer_drive_weights` (`.bin` and `.pt`).
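The random baseline is simple enough to sketch in a few lines of NumPy; the function name and array layout below are illustrative, not part of the PufferDrive API:

```python
import numpy as np

def random_agent_trajectory(horizon=80, mu=1.0, sigma=0.1, seed=0):
    """WOSAC-style random baseline: each timestep's (x, y, theta) is drawn
    independently from N(mu, sigma) in the AV coordinate frame, so successive
    states are uncorrelated over the 80-step horizon."""
    rng = np.random.default_rng(seed)
    return rng.normal(mu, sigma, size=(horizon, 3))

traj = random_agent_trajectory()
print(traj.shape)  # (80, 3): one (x, y, theta) sample per step
```

Because each state is sampled independently rather than integrated from actions, the resulting motion is physically implausible, which is exactly why this baseline scores poorly on the kinematic metrics above.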


> ✏️ Download the dataset from [Hugging Face](https://huggingface.co/datasets/daphne-cornelisse/pufferdrive_wosac_val_clean) to reproduce these results or benchmark your policy.

## Evaluating trajectories

5 changes: 5 additions & 0 deletions docs/theme/extra.css
@@ -463,3 +463,8 @@ blockquote {
margin: 1rem 0;
border-radius: 0 8px 8px 0;
}

/* Fix table visibility - remove alternating row colors */
table tr:nth-child(2n) {
background-color: transparent !important;
}
14 changes: 12 additions & 2 deletions pufferlib/config/ocean/drive.ini
@@ -177,8 +177,14 @@ render_map = none
eval_interval = 1000
; Path to dataset used for evaluation
map_dir = "resources/drive/binaries/training"
; Evaluation will run on the first num_maps maps in the map_dir directory
num_maps = 20
; Number of scenarios whose metrics are computed per batch
wosac_batch_size = 32
; Stop evaluation once this many distinct scenarios have been seen (needed because scenarios are sampled with replacement)
wosac_target_scenarios = 64
; Total pool of scenarios to sample from (the evaluation analogue of num_maps in training)
wosac_scenario_pool_size = 1000
; Max batches, used as a timeout to prevent an infinite loop
wosac_max_batches = 100
backend = PufferEnv
; WOSAC (Waymo Open Sim Agents Challenge) evaluation settings
; If True, enables evaluation on realism metrics each time we save a checkpoint
Expand All @@ -198,10 +204,14 @@ wosac_goal_radius = 2.0
wosac_sanity_check = False
; Only return aggregate results across all scenes
wosac_aggregate_results = True
; Evaluation mode: "policy", "ground_truth"
wosac_eval_mode = "policy"
; If True, enable human replay evaluation (pair policy-controlled agent with human replays)
human_replay_eval = False
; Control only the self-driving car
human_replay_control_mode = "control_sdc_only"
; Number of agents for human replay evaluation (the number of scenarios equals the number of agents)
human_replay_num_agents = 16

[render]
; Mode to render a bunch of maps with a given policy
86 changes: 31 additions & 55 deletions pufferlib/ocean/benchmark/evaluate_imported_trajectories.py
@@ -1,49 +1,30 @@
import sys
import pickle
import numpy as np
from scipy.spatial import cKDTree
import pufferlib.pufferl as pufferl
from pufferlib.ocean.benchmark.evaluator import WOSACEvaluator


def align_trajectories(simulated, ground_truth):
# Idea is to use the (scenario_id, id) pair to reindex simulated_trajectories in order to align it with GT
gt_scenario_ids = ground_truth["scenario_id"][:, 0]
sim_scenario_ids = simulated["scenario_id"][:, 0, 0]

gt_ids = ground_truth["id"][:, 0]
sim_ids = simulated["id"][:, 0, 0]

lookup = {(s_id, a_id): idx for idx, (s_id, a_id) in enumerate(zip(sim_scenario_ids, sim_ids))}

try:
indices = [lookup[(s, i)] for (s, i) in zip(gt_scenario_ids, gt_ids)]
indices = np.array(indices, dtype=int)
except KeyError:
print("An agent present in the GT is missing in your simulation")
raise

sim_traj = {k: v[indices] for k, v in simulated.items()}



return sim_traj


def check_alignment(simulated, ground_truth, tolerance=1e-4):
@@ -72,8 +72,7 @@ def evaluate_trajectories(simulated_trajectory_file, args):
"""
env_name = "puffer_drive"
args["env"]["map_dir"] = args["eval"]["map_dir"]
args["env"]["num_maps"] = args["eval"]["wosac_num_maps"]
dataset_name = args["env"]["map_dir"].split("/")[-1]

print(f"Running WOSAC realism evaluation with {dataset_name} dataset. \n")
@@ -97,30 +97,26 @@

print(f"Number of scenarios: {len(np.unique(gt_trajectories['scenario_id']))}")
print(f"Number of controlled agents: {num_agents_gt}")
print(f"Number of evaluated agents: {gt_trajectories['is_track_to_predict'].sum()}")

print(f"Loading simulated trajectories from {simulated_trajectory_file}...")
with open(simulated_trajectory_file, "rb") as f:
sim_trajectories = pickle.load(f)

num_agents_sim = sim_trajectories["x"].shape[0]
assert num_agents_sim >= num_agents_gt, (
"There are fewer agents in your simulation than in the GT, so the computation won't be valid"
)

if num_agents_sim > num_agents_gt:
print("If you are evaluating on a subset of your trajectories, this is fine.")
print("Otherwise, consider changing the value of MAX_AGENTS in drive.h and recompiling.")

sim_trajectories = align_trajectories(sim_trajectories, gt_trajectories)

assert check_alignment(sim_trajectories, gt_trajectories), (
"There might be an issue with the way you generated your data."
)

agent_state = vecenv.driver_env.get_global_agent_state()
road_edge_polylines = vecenv.driver_env.get_road_edge_polylines()
Expand Down