
ricky/merge conflicts 3.0 beta and 3.0#325

Merged
riccardosavorgnan merged 1 commit into 3.0_beta from ricky/merge_conflicts_3.0_beta_and_3.0
Mar 9, 2026

Conversation


@riccardosavorgnan riccardosavorgnan marked this pull request as ready for review March 3, 2026 04:11

greptile-apps bot commented Mar 3, 2026

Greptile Summary

This PR merges a large batch of improvements from 3.0 into 3.0_beta, covering WOSAC evaluation robustness, simulator correctness fixes, improved demo/visualizer CLI tooling, documentation updates, and CI branch alignment.

Key changes:

  • WOSAC evaluation overhaul: WOSACEvaluator gains a batched evaluate() loop that samples scenarios until a target count is reached, a ground_truth eval mode for upper-bound scoring, and a collect_wosac_random_baseline() method. The metric pipeline is corrected — log-likelihoods are now averaged per-scenario first, then exponentiated, which is the mathematically sound approach.
  • Metric weight alignment: wosac.ini is updated to the 2024 WOSAC challenge config (traffic_light_violation weight = 0.0, time_to_collision and distance_to_road_edge weights adjusted to 0.1 each). Weights still sum to 1.0, so the removed normalization division in _compute_metametric is correct.
  • Simulator correctness: SDC is now guaranteed to be initialized first (drive.h), collision/observation loops guard against zero static_agent_count, scenario_id is stored as a fixed char[16] buffer instead of a pointer, and env_config.h correctly parses string-valued init_mode/control_mode settings.
  • Trajectory alignment improvement: KD-tree spatial alignment in evaluate_imported_trajectories.py replaced with exact (scenario_id, id) pair lookup, removing the scipy dependency.
  • Per-scenario metric columns are global scalars (evaluator.py lines 601–608): np.mean([series1, series2, …]) produces a single scalar broadcast to all rows. Per-scenario breakdowns of kinematic_metrics, interactive_metrics, and map_based_metrics are therefore incorrect; only the final aggregate is unaffected.
  • Duplicate include in drive.c (lines 4 & 6): #include "../env_config.h" appears twice — harmless due to header guards but should be removed.
  • Redundant branch in utils.py: elif len(model_files) > 0 is always True after if not model_files: returns False.
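To make the first bullet concrete: averaging log-likelihoods per scenario and then exponentiating computes the geometric mean of the underlying likelihoods, which genuinely differs from exponentiating first and then averaging. A minimal sketch with made-up values:

```python
import numpy as np

# Hypothetical per-agent log-likelihoods for a single scenario.
log_liks = np.array([-0.5, -1.0, -2.0])

# Per the PR: average in log space first, then exponentiate.
# This is the geometric mean of the underlying likelihoods.
score = np.exp(log_liks.mean())

# Reversed order (exponentiate, then average) gives the arithmetic mean
# instead, which is always >= the geometric mean by Jensen's inequality,
# since exp is convex.
other = np.exp(log_liks).mean()

assert score < other  # the two orderings genuinely differ
```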

Confidence Score: 3/5

  • This PR is mostly safe but contains a logic issue in per-scenario metric aggregation that produces misleading per-scene data and a duplicate include in C source.
  • The core evaluation pipeline change (average log-likelihoods then exponentiate) is mathematically sound, simulator safety guards are improved, and the _compute_metametric normalization removal is valid. However, the per-scenario kinematic/interactive/map-based metric columns are computed as global scalars rather than per-scenario means due to incorrect use of np.mean([series1, series2, ...]). This doesn't break the aggregate WandB metrics but produces misleading per-scenario data in the returned DataFrame. The duplicate include in drive.c and the redundant branch in utils.py are minor cleanup items.
  • pufferlib/ocean/benchmark/evaluator.py (per-scenario metric columns), pufferlib/ocean/drive/drive.c (duplicate include)

Important Files Changed

Filename Overview
pufferlib/ocean/benchmark/evaluator.py Major overhaul of WOSAC evaluation: adds batched evaluate() loop, ground-truth eval mode, random baseline, and correct log-likelihood→likelihood pipeline (average log-likelihoods per scene, then exponentiate). The _compute_metametric normalization was correctly removed because weights now sum to 1.0. Key issue: per-scenario kinematic/interactive/map-based metric columns are assigned a global scalar instead of per-scenario means.
pufferlib/ocean/drive/drive.c Added CLI argument parsing for demo/visualizer tool and error handling when no active agents are found. Contains a duplicate #include "../env_config.h" (lines 4 and 6) introduced by the merge, which is harmless due to header guards but should be cleaned up.
pufferlib/ocean/drive/drive.h Changed scenario_id from char* to a fixed char[16] buffer, added SDC-first initialization logic, fixed collision_check/observation to guard static_agent_count > 0, renamed get_track_id_or_placeholder to is_in_track_to_predicts (now returns bool), and updated c_get_global_ground_truth_trajectories to expose is_vehicle and is_track_to_predict fields.
pufferlib/ocean/drive/drive.py Removed use_all_maps parameter, refactored resampling into a standalone resample_maps() method, updated ground-truth trajectory retrieval to include is_vehicle, is_track_to_predict, and string scenario_id (S16 dtype). The truncations reset was correctly removed from step() because the C layer uses memset to clear it.
pufferlib/pufferl.py Eval interval fix: changed epoch % eval_interval to (epoch - 1) % eval_interval to avoid running at epoch 0; updated eval flow to use new batched WOSACEvaluator.evaluate() and exposes new WandB metrics (kinematic, interactive, map-based scores, and std of realism score).
pufferlib/utils.py WOSAC subprocess now gracefully handles missing model files (runs with random policy instead of early-return). Contains a redundant elif len(model_files) > 0 condition that should be a plain else:. New eval hyperparameters (wosac_batch_size, wosac_target_scenarios, wosac_scenario_pool_size) are forwarded to the subprocess.
pufferlib/ocean/env_binding.h Updated ground-truth trajectory bindings to accept two new arrays (is_vehicle, is_track_to_predict) and changed scenario_id from int* to char* (16-byte fixed string). Argument count checks updated correctly (non-vec: 7→9, vec: 8→10 reflecting additional drive env handle + 2 new arrays).
pufferlib/ocean/env_config.h Fixed config parsing for init_mode and control_mode to accept human-readable string values (e.g., "create_all_valid", "control_vehicles") instead of raw integers, with fallback warnings for unknown values.
pufferlib/ocean/benchmark/evaluate_imported_trajectories.py Replaced KD-tree spatial alignment with an exact (scenario_id, id) pair lookup (align_trajectories), removing the scipy dependency. Now uses is_track_to_predict flag instead of id >= 0 to count evaluated agents. Logic is cleaner and more robust.
pufferlib/ocean/benchmark/wosac.ini Updated from 2025 to 2024 WOSAC challenge config: time_to_collision weight increased from 0.05 to 0.1, traffic_light_violation set to 0.0, distance_to_road_edge weight increased from 0.05 to 0.1. Weights still sum to 1.0, so removing the normalization in _compute_metametric is valid.
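The wosac.ini row's point about the removed normalization can be sketched quickly: when the metametric weights sum to 1.0, dividing the weighted sum by the total weight is a no-op. All names and values below are illustrative, except the three weights the PR calls out:

```python
# Illustrative metametric computation. Only the three weights the PR
# mentions (time_to_collision = 0.1, distance_to_road_edge = 0.1,
# traffic_light_violation = 0.0) come from the diff; "other_metrics"
# is a placeholder lumping the remaining components together.
weights = {
    "time_to_collision": 0.1,
    "distance_to_road_edge": 0.1,
    "traffic_light_violation": 0.0,
    "other_metrics": 0.8,
}
scores = {
    "time_to_collision": 0.9,
    "distance_to_road_edge": 0.8,
    "traffic_light_violation": 0.0,
    "other_metrics": 0.7,
}

weighted_sum = sum(weights[k] * scores[k] for k in weights)
# Since the weights sum to 1.0, dividing by their total changes nothing,
# which is why dropping the normalization in _compute_metametric is safe.
assert abs(weighted_sum - weighted_sum / sum(weights.values())) < 1e-12
```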

Sequence Diagram

sequenceDiagram
    participant PL as pufferl.py (eval)
    participant E as WOSACEvaluator
    participant VE as VecEnv (Drive)
    participant C as C Simulator

    PL->>E: evaluate(args, vecenv, policy)
    loop Until target scenarios covered OR max_batches
        E->>VE: resample_maps() [batch > 0]
        VE->>C: vec_close / binding.shared / env_init / vec_reset
        E->>VE: collect_ground_truth_trajectories()
        VE->>C: vec_get_global_ground_truth_trajectories()<br/>(x, y, z, heading, valid, id, is_vehicle, is_track_to_predict, scenario_id)
        C-->>VE: GT trajectory arrays
        alt eval_mode == "policy"
            E->>VE: collect_simulated_trajectories(policy)
            VE->>C: vec_step (num_rollouts × sim_steps)
            C-->>VE: simulated x/y/heading
        else eval_mode == "ground_truth"
            E->>E: repeat GT trajectories as simulated
        end
        E->>E: compute_metrics(gt, sim, agent_state, road_edges)
        Note over E: avg log-likelihoods per scenario → exp → realism_meta_score
        E->>E: track unique scenario_ids
    end
    E-->>PL: df_scene_level (per-scenario scores)
    PL->>PL: df.mean() → aggregate metrics dict
    PL->>PL: log to WandB

Last reviewed commit: b5adcd6


@greptile-apps greptile-apps bot left a comment

27 files reviewed, 3 comments


Comment on lines 4 to 6
#include "../env_config.h"
#include <string.h>
#include "../env_config.h"

Duplicate #include "../env_config.h"

env_config.h is included twice — once on line 4 and again on line 6. Although the header's #ifndef ENV_CONFIG_H guard prevents a double-definition compilation error, the second include is clearly an artifact from the merge and should be removed.

Suggested change
  #include "drivenet.h"
  #include "error.h"
  #include "libgen.h"
  #include "../env_config.h"
  #include <string.h>
- #include "../env_config.h"


This is worth fixing

Comment on lines +124 to +128
if not model_files:
    print("No model files found for WOSAC evaluation. Running WOSAC with random policy.")
elif len(model_files) > 0:
    latest_cpt = max(model_files, key=os.path.getctime)
    cmd.extend(["--load-model-path", latest_cpt])

Redundant elif len(model_files) > 0 condition

After the if not model_files: branch, model_files is guaranteed to be non-empty, so elif len(model_files) > 0 is always True at that point — it's dead conditional logic. Replacing it with a plain else: makes the intent clearer and avoids the misleading branch.
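A sketch of the suggested shape of the fix (the function wrapper is added here purely for illustration; model_files, cmd, and the flag name come from the snippet above):

```python
import os

def build_eval_cmd(model_files, cmd):
    # Fall back to a random policy when no checkpoint exists; otherwise
    # load the most recently created checkpoint.
    if not model_files:
        print("No model files found for WOSAC evaluation. "
              "Running WOSAC with random policy.")
    else:  # model_files is guaranteed non-empty here, so a bare else suffices
        latest_cpt = max(model_files, key=os.path.getctime)
        cmd.extend(["--load-model-path", latest_cpt])
    return cmd
```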


greptile-apps bot commented Mar 3, 2026

Additional Comments (1)

pufferlib/ocean/benchmark/evaluator.py
Per-scenario metric columns receive a global scalar, not per-scenario averages

np.mean([series1, series2, series3, series4]) where each series_i is a pandas Series of length N (one row per scenario) does not compute the element-wise (axis=1) mean. NumPy stacks the four series into a shape (4, N) array and then takes the mean over all 4 × N elements, producing a single scalar. That scalar is then broadcast to every row of df_scene_level.

As a result, df_scene_level["kinematic_metrics"], ["interactive_metrics"], and ["map_based_metrics"] are identical constants across all scenarios. While the final aggregate logged to WandB happens to be numerically correct (mean of a constant column = that constant), any per-scenario inspection of these columns will be misleading, and the intent is clearly to compute per-scenario means.

Fix by using DataFrame.mean(axis=1):

kinematic_cols = [
    "likelihood_linear_speed",
    "likelihood_linear_acceleration",
    "likelihood_angular_speed",
    "likelihood_angular_acceleration",
]
interactive_cols = [
    "likelihood_collision_indication",
    "likelihood_distance_to_nearest_object",
    "likelihood_time_to_collision",
]
map_cols = [
    "likelihood_distance_to_road_edge",
    "likelihood_offroad_indication",
]

df_scene_level["kinematic_metrics"] = df_scene_level[kinematic_cols].mean(axis=1)
df_scene_level["interactive_metrics"] = df_scene_level[interactive_cols].mean(axis=1)
df_scene_level["map_based_metrics"] = df_scene_level[map_cols].mean(axis=1)
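The difference between the buggy and intended patterns can be reproduced on toy data (Series values are made up):

```python
import numpy as np
import pandas as pd

# Two toy per-scenario metric Series (one row per scenario).
s1 = pd.Series([0.2, 0.4, 0.6])
s2 = pd.Series([0.4, 0.8, 0.6])

# Buggy pattern: NumPy stacks the Series into a (2, 3) array and
# averages over ALL elements, producing a single scalar.
buggy = np.mean([s1, s2])
assert np.ndim(buggy) == 0  # a scalar, broadcast to every row on assignment

# Intended pattern: element-wise mean across columns, one value per scenario.
fixed = pd.concat([s1, s2], axis=1).mean(axis=1)
assert np.allclose(fixed, [0.3, 0.6, 0.6])
```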

@eugenevinitsky

@daphne-cornelisse does this comment hold any water?
Per-scenario metric columns are global scalars (evaluator.py lines 601–608): np.mean([series1, series2, …]) produces a single scalar broadcast to all rows. Per-scenario breakdowns of kinematic_metrics, interactive_metrics, and map_based_metrics are therefore incorrect; only the final aggregate is unaffected.

@riccardosavorgnan riccardosavorgnan force-pushed the ricky/merge_conflicts_3.0_beta_and_3.0 branch from b5adcd6 to 3464357 Compare March 8, 2026 00:50
on:
  pull_request:
    branches: [ main ]
    branches: [ 2.0 ]


why these?

Collaborator Author

these must be from the auto-merge, will fix

"""Evaluate a policy."""

args = args or load_config(env_name)
args["env"]["termination_mode"] = 0


what's this do? Magic numbers make me nervous

Collaborator Author

termination mode 0 means we terminate the episode once all steps have been performed (I believe this is necessary for WOSAC). by contrast mode 1 means we terminate if all agents have been respawned at least once.

c_get_road_edge_counts(drive, &np, &tp);
c_get_road_edge_polylines(drive, &x_base[pt_offset], &y_base[pt_offset], &lengths_base[poly_offset],
-                          &scenario_ids_base[poly_offset]);
+                          &scenario_ids_base[poly_offset * 16]);


what's this magic number do?


It's because in WOMD the scenario IDs are strings of length 16
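On the NumPy side, this is what the S16 dtype switch buys: each element is exactly 16 bytes, matching the C char[16] buffer. A small sketch (the SCENARIO_ID_LEN constant name is hypothetical):

```python
import numpy as np

SCENARIO_ID_LEN = 16  # hypothetical constant mirroring the C char[16] buffer

# An "S16" array stores fixed-width 16-byte strings, so a flat C buffer
# of num_polylines * 16 chars maps onto it element-for-element.
ids = np.zeros(3, dtype="S16")
assert ids.itemsize == SCENARIO_ID_LEN
assert ids.nbytes == 3 * SCENARIO_ID_LEN

ids[0] = b"abc123def456ghi7"  # a full 16-byte WOMD-style scenario id
assert ids[0] == b"abc123def456ghi7"
```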

Comment on lines +183 to +186
wosac_target_scenarios = 64
; Total pool of scenarios to sample from
wosac_scenario_pool_size = 1000
; Max batches, used as a timeout to prevent an infinite loop


@WaelDLZ how are these two numbers different? I can see it in the code but it's unclear from the comment


I think that in the long run these should be replaced by the multiprocessed eval code.

From my understanding:

  • pool_size is the equivalent of the num_maps we would have in training.
  • target_scenarios means we stop the eval once we've seen 64 distinct scenarios (this is needed because we sample with replacement).
  • batch_size means we compute the metrics in batches of 32.

But I hope that we can replace this by the mechanism I proposed in the multiprocessed eval PR
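A rough sketch of the loop as described above — all names and the random sampling are illustrative, not the actual evaluator code:

```python
import random

def run_eval(pool_size=1000, target_scenarios=64, batch_size=32, max_batches=100):
    # Sketch only: sample scenario indices with replacement until enough
    # distinct scenarios have been evaluated, or until the batch cap
    # (the infinite-loop timeout) is hit.
    seen = set()
    batches = 0
    while len(seen) < target_scenarios and batches < max_batches:
        batch = [random.randrange(pool_size) for _ in range(batch_size)]
        # ... resample maps, roll out the policy, compute metrics ...
        seen.update(batch)  # track distinct scenario ids seen so far
        batches += 1
    return len(seen), batches
```

Because sampling is with replacement, the loop usually needs more than target_scenarios / batch_size batches to cover the target, which is why the max_batches timeout exists.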

human_replay_eval = False
; Control only the self-driving car
human_replay_control_mode = "control_sdc_only"
; Number of scenarios for human replay evaluation equals the number of agents


also unclear comment

[traffic_light_violation]
bernoulli = true
- metametric_weight = 0.05
+ metametric_weight = 0.0


@WaelDLZ why not set to correct value?


Because we don't actually have a traffic light violation metric in our codebase, so to replicate the 2024 WOSAC challenge we decided to do this

Comment on lines +1928 to +1931
// Skip if its the SDC
if (i == sdc_index) {
continue;
}


why?

Comment on lines +2202 to +2203
char *scenario_id_ptr = scenario_ids_out + poly_idx * 16;
memcpy(scenario_id_ptr, env->scenario_id, 16);


magic numbers are to be avoided

  "y": np.zeros(total_points, dtype=np.float32),
  "lengths": np.zeros(num_polylines, dtype=np.int32),
- "scenario_id": np.zeros(num_polylines, dtype=np.int32),
+ "scenario_id": np.zeros(num_polylines, dtype="S16"),


why this dtype switch?

c_get_global_ground_truth_trajectories(
drive, &x_base[traj_offset], &y_base[traj_offset], &z_base[traj_offset], &heading_base[traj_offset],
&valid_base[traj_offset], &id_base[agent_offset], &is_vehicle_base[agent_offset],
&is_track_to_predict_base[agent_offset], &scenario_id_base[agent_offset * 16]);


magic number

- resolve conflicts on merge
- adopt resample_maps() refactoring in drive.py
- tied the scenario_id string length to a constant
@riccardosavorgnan riccardosavorgnan force-pushed the ricky/merge_conflicts_3.0_beta_and_3.0 branch from aa91598 to bca5dca Compare March 9, 2026 16:59
@riccardosavorgnan riccardosavorgnan merged commit fea9871 into 3.0_beta Mar 9, 2026
10 checks passed