
ricky/merge conflicts 3.0 beta and 3.0#325

Merged
riccardosavorgnan merged 1 commit into 3.0_beta from ricky/merge_conflicts_3.0_beta_and_3.0
Mar 9, 2026

Conversation


@riccardosavorgnan riccardosavorgnan marked this pull request as ready for review March 3, 2026 04:11

greptile-apps bot commented Mar 3, 2026

Greptile Summary

This PR merges a large batch of improvements from 3.0 into 3.0_beta, covering WOSAC evaluation robustness, simulator correctness fixes, improved demo/visualizer CLI tooling, documentation updates, and CI branch alignment.

Key changes:

  • WOSAC evaluation overhaul: WOSACEvaluator gains a batched evaluate() loop that samples scenarios until a target count is reached, a ground_truth eval mode for upper-bound scoring, and a collect_wosac_random_baseline() method. The metric pipeline is corrected — log-likelihoods are now averaged per-scenario first, then exponentiated, which is the mathematically sound approach.
  • Metric weight alignment: wosac.ini is updated to the 2024 WOSAC challenge config (traffic_light_violation weight = 0.0, time_to_collision and distance_to_road_edge weights adjusted to 0.1 each). Weights still sum to 1.0, so the removed normalization division in _compute_metametric is correct.
  • Simulator correctness: SDC is now guaranteed to be initialized first (drive.h), collision/observation loops guard against zero static_agent_count, scenario_id is stored as a fixed char[16] buffer instead of a pointer, and env_config.h correctly parses string-valued init_mode/control_mode settings.
  • Trajectory alignment improvement: KD-tree spatial alignment in evaluate_imported_trajectories.py replaced with exact (scenario_id, id) pair lookup, removing the scipy dependency.
  • Per-scenario metric columns are global scalars (evaluator.py lines 601–608): np.mean([series1, series2, …]) produces a single scalar broadcast to all rows. Per-scenario breakdowns of kinematic_metrics, interactive_metrics, and map_based_metrics are therefore incorrect; only the final aggregate is unaffected.
  • Duplicate include in drive.c (lines 4 & 6): #include "../env_config.h" appears twice — harmless due to header guards but should be removed.
  • Redundant branch in utils.py: elif len(model_files) > 0 is always True after if not model_files: returns False.
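To make the first bullet concrete: averaging log-likelihoods per scenario and then exponentiating computes the geometric mean of the underlying likelihoods, which genuinely differs from exponentiating first and then averaging. A minimal sketch with made-up values:

```python
import numpy as np

# Hypothetical per-agent log-likelihoods for a single scenario.
log_liks = np.array([-0.5, -1.0, -2.0])

# Per the PR: average in log space first, then exponentiate.
# This is the geometric mean of the underlying likelihoods.
score = np.exp(log_liks.mean())

# Reversed order (exponentiate, then average) gives the arithmetic mean
# instead, which is always >= the geometric mean by Jensen's inequality,
# since exp is convex.
other = np.exp(log_liks).mean()

assert score < other  # the two orderings genuinely differ
```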

Confidence Score: 3/5

  • This PR is mostly safe but contains a logic issue in per-scenario metric aggregation that produces misleading per-scene data and a duplicate include in C source.
  • The core evaluation pipeline change (average log-likelihoods then exponentiate) is mathematically sound, simulator safety guards are improved, and the _compute_metametric normalization removal is valid. However, the per-scenario kinematic/interactive/map-based metric columns are computed as global scalars rather than per-scenario means due to incorrect use of np.mean([series1, series2, ...]). This doesn't break the aggregate WandB metrics but produces misleading per-scenario data in the returned DataFrame. The duplicate include in drive.c and the redundant branch in utils.py are minor cleanup items.
  • pufferlib/ocean/benchmark/evaluator.py (per-scenario metric columns), pufferlib/ocean/drive/drive.c (duplicate include)

Important Files Changed

Filename Overview
pufferlib/ocean/benchmark/evaluator.py Major overhaul of WOSAC evaluation: adds batched evaluate() loop, ground-truth eval mode, random baseline, and correct log-likelihood→likelihood pipeline (average log-likelihoods per scene, then exponentiate). The _compute_metametric normalization was correctly removed because weights now sum to 1.0. Key issue: per-scenario kinematic/interactive/map-based metric columns are assigned a global scalar instead of per-scenario means.
pufferlib/ocean/drive/drive.c Added CLI argument parsing for demo/visualizer tool and error handling when no active agents are found. Contains a duplicate #include "../env_config.h" (lines 4 and 6) introduced by the merge, which is harmless due to header guards but should be cleaned up.
pufferlib/ocean/drive/drive.h Changed scenario_id from char* to a fixed char[16] buffer, added SDC-first initialization logic, fixed collision_check/observation to guard static_agent_count > 0, renamed get_track_id_or_placeholder to is_in_track_to_predicts (now returns bool), and updated c_get_global_ground_truth_trajectories to expose is_vehicle and is_track_to_predict fields.
pufferlib/ocean/drive/drive.py Removed use_all_maps parameter, refactored resampling into a standalone resample_maps() method, updated ground-truth trajectory retrieval to include is_vehicle, is_track_to_predict, and string scenario_id (S16 dtype). The truncations reset was correctly removed from step() because the C layer uses memset to clear it.
pufferlib/pufferl.py Eval interval fix: changed epoch % eval_interval to (epoch - 1) % eval_interval to avoid running at epoch 0; updated eval flow to use new batched WOSACEvaluator.evaluate() and exposes new WandB metrics (kinematic, interactive, map-based scores, and std of realism score).
pufferlib/utils.py WOSAC subprocess now gracefully handles missing model files (runs with random policy instead of early-return). Contains a redundant elif len(model_files) > 0 condition that should be a plain else:. New eval hyperparameters (wosac_batch_size, wosac_target_scenarios, wosac_scenario_pool_size) are forwarded to the subprocess.
pufferlib/ocean/env_binding.h Updated ground-truth trajectory bindings to accept two new arrays (is_vehicle, is_track_to_predict) and changed scenario_id from int* to char* (16-byte fixed string). Argument count checks updated correctly (non-vec: 7→9, vec: 8→10 reflecting additional drive env handle + 2 new arrays).
pufferlib/ocean/env_config.h Fixed config parsing for init_mode and control_mode to accept human-readable string values (e.g., "create_all_valid", "control_vehicles") instead of raw integers, with fallback warnings for unknown values.
pufferlib/ocean/benchmark/evaluate_imported_trajectories.py Replaced KD-tree spatial alignment with an exact (scenario_id, id) pair lookup (align_trajectories), removing the scipy dependency. Now uses is_track_to_predict flag instead of id >= 0 to count evaluated agents. Logic is cleaner and more robust.
pufferlib/ocean/benchmark/wosac.ini Updated from 2025 to 2024 WOSAC challenge config: time_to_collision weight increased from 0.05 to 0.1, traffic_light_violation set to 0.0, distance_to_road_edge weight increased from 0.05 to 0.1. Weights still sum to 1.0, so removing the normalization in _compute_metametric is valid.
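The wosac.ini row's point about the removed normalization can be sketched quickly: when the metametric weights sum to 1.0, dividing the weighted sum by the total weight is a no-op. All names and values below are illustrative, except the three weights the PR calls out:

```python
# Illustrative metametric computation. Only the three weights the PR
# mentions (time_to_collision = 0.1, distance_to_road_edge = 0.1,
# traffic_light_violation = 0.0) come from the diff; "other_metrics"
# is a placeholder lumping the remaining components together.
weights = {
    "time_to_collision": 0.1,
    "distance_to_road_edge": 0.1,
    "traffic_light_violation": 0.0,
    "other_metrics": 0.8,
}
scores = {
    "time_to_collision": 0.9,
    "distance_to_road_edge": 0.8,
    "traffic_light_violation": 0.0,
    "other_metrics": 0.7,
}

weighted_sum = sum(weights[k] * scores[k] for k in weights)
# Since the weights sum to 1.0, dividing by their total changes nothing,
# which is why dropping the normalization in _compute_metametric is safe.
assert abs(weighted_sum - weighted_sum / sum(weights.values())) < 1e-12
```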

Sequence Diagram

sequenceDiagram
    participant PL as pufferl.py (eval)
    participant E as WOSACEvaluator
    participant VE as VecEnv (Drive)
    participant C as C Simulator

    PL->>E: evaluate(args, vecenv, policy)
    loop Until target scenarios covered OR max_batches
        E->>VE: resample_maps() [batch > 0]
        VE->>C: vec_close / binding.shared / env_init / vec_reset
        E->>VE: collect_ground_truth_trajectories()
        VE->>C: vec_get_global_ground_truth_trajectories()<br/>(x, y, z, heading, valid, id, is_vehicle, is_track_to_predict, scenario_id)
        C-->>VE: GT trajectory arrays
        alt eval_mode == "policy"
            E->>VE: collect_simulated_trajectories(policy)
            VE->>C: vec_step (num_rollouts × sim_steps)
            C-->>VE: simulated x/y/heading
        else eval_mode == "ground_truth"
            E->>E: repeat GT trajectories as simulated
        end
        E->>E: compute_metrics(gt, sim, agent_state, road_edges)
        Note over E: avg log-likelihoods per scenario → exp → realism_meta_score
        E->>E: track unique scenario_ids
    end
    E-->>PL: df_scene_level (per-scenario scores)
    PL->>PL: df.mean() → aggregate metrics dict
    PL->>PL: log to WandB

Last reviewed commit: b5adcd6


@greptile-apps greptile-apps bot left a comment

27 files reviewed, 3 comments


Comment on lines 4 to 6
#include "../env_config.h"
#include <string.h>
#include "../env_config.h"

Duplicate #include "../env_config.h"

env_config.h is included twice — once on line 4 and again on line 6. Although the header's #ifndef ENV_CONFIG_H guard prevents a double-definition compilation error, the second include is clearly an artifact from the merge and should be removed.

Suggested change
  #include "drivenet.h"
  #include "error.h"
  #include "libgen.h"
  #include "../env_config.h"
  #include <string.h>
- #include "../env_config.h"


This is worth fixing

Comment on lines +124 to +128
if not model_files:
    print("No model files found for WOSAC evaluation. Running WOSAC with random policy.")
elif len(model_files) > 0:
    latest_cpt = max(model_files, key=os.path.getctime)
    cmd.extend(["--load-model-path", latest_cpt])

Redundant elif len(model_files) > 0 condition

After the if not model_files: branch, model_files is guaranteed to be non-empty, so elif len(model_files) > 0 is always True at that point — it's dead conditional logic. Replacing it with a plain else: makes the intent clearer and avoids the misleading branch.
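A sketch of the suggested shape of the fix (the function wrapper is added here purely for illustration; model_files, cmd, and the flag name come from the snippet above):

```python
import os

def build_eval_cmd(model_files, cmd):
    # Fall back to a random policy when no checkpoint exists; otherwise
    # load the most recently created checkpoint.
    if not model_files:
        print("No model files found for WOSAC evaluation. "
              "Running WOSAC with random policy.")
    else:  # model_files is guaranteed non-empty here, so a bare else suffices
        latest_cpt = max(model_files, key=os.path.getctime)
        cmd.extend(["--load-model-path", latest_cpt])
    return cmd
```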


greptile-apps bot commented Mar 3, 2026

Additional Comments (1)

pufferlib/ocean/benchmark/evaluator.py
Per-scenario metric columns receive a global scalar, not per-scenario averages

np.mean([series1, series2, series3, series4]) where each series_i is a pandas Series of length N (one row per scenario) does not compute the element-wise (axis=1) mean. NumPy stacks the four series into a shape (4, N) array and then takes the mean over all 4 × N elements, producing a single scalar. That scalar is then broadcast to every row of df_scene_level.

As a result, df_scene_level["kinematic_metrics"], ["interactive_metrics"], and ["map_based_metrics"] are identical constants across all scenarios. While the final aggregate logged to WandB happens to be numerically correct (mean of a constant column = that constant), any per-scenario inspection of these columns will be misleading, and the intent is clearly to compute per-scenario means.

Fix by using DataFrame.mean(axis=1):

kinematic_cols = [
    "likelihood_linear_speed",
    "likelihood_linear_acceleration",
    "likelihood_angular_speed",
    "likelihood_angular_acceleration",
]
interactive_cols = [
    "likelihood_collision_indication",
    "likelihood_distance_to_nearest_object",
    "likelihood_time_to_collision",
]
map_cols = [
    "likelihood_distance_to_road_edge",
    "likelihood_offroad_indication",
]

df_scene_level["kinematic_metrics"] = df_scene_level[kinematic_cols].mean(axis=1)
df_scene_level["interactive_metrics"] = df_scene_level[interactive_cols].mean(axis=1)
df_scene_level["map_based_metrics"] = df_scene_level[map_cols].mean(axis=1)
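The difference between the buggy and intended patterns can be reproduced on toy data (Series values are made up):

```python
import numpy as np
import pandas as pd

# Two toy per-scenario metric Series (one row per scenario).
s1 = pd.Series([0.2, 0.4, 0.6])
s2 = pd.Series([0.4, 0.8, 0.6])

# Buggy pattern: NumPy stacks the Series into a (2, 3) array and
# averages over ALL elements, producing a single scalar.
buggy = np.mean([s1, s2])
assert np.ndim(buggy) == 0  # a scalar, broadcast to every row on assignment

# Intended pattern: element-wise mean across columns, one value per scenario.
fixed = pd.concat([s1, s2], axis=1).mean(axis=1)
assert np.allclose(fixed, [0.3, 0.6, 0.6])
```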

@eugenevinitsky

@daphne-cornelisse does this comment hold any water?
Per-scenario metric columns are global scalars (evaluator.py lines 601–608): np.mean([series1, series2, …]) produces a single scalar broadcast to all rows. Per-scenario breakdowns of kinematic_metrics, interactive_metrics, and map_based_metrics are therefore incorrect; only the final aggregate is unaffected.

@riccardosavorgnan riccardosavorgnan force-pushed the ricky/merge_conflicts_3.0_beta_and_3.0 branch from b5adcd6 to 3464357 Compare March 8, 2026 00:50
on:
  pull_request:
    branches: [ main ]
    branches: [ 2.0 ]


why these?

Collaborator Author

these must be from the auto-merge, will fix

"""Evaluate a policy."""

args = args or load_config(env_name)
args["env"]["termination_mode"] = 0


what's this do? Magic numbers make me nervous

Collaborator Author

termination mode 0 means we terminate the episode once all steps have been performed (I believe this is necessary for WOSAC). by contrast mode 1 means we terminate if all agents have been respawned at least once.

c_get_road_edge_counts(drive, &np, &tp);
c_get_road_edge_polylines(drive, &x_base[pt_offset], &y_base[pt_offset], &lengths_base[poly_offset],
-                          &scenario_ids_base[poly_offset]);
+                          &scenario_ids_base[poly_offset * 16]);


what's this magic number do?


It's because in WOMD the scenario IDs are strings of length 16
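On the NumPy side, this is what the S16 dtype switch buys: each element is exactly 16 bytes, matching the C char[16] buffer. A small sketch (the SCENARIO_ID_LEN constant name is hypothetical):

```python
import numpy as np

SCENARIO_ID_LEN = 16  # hypothetical constant mirroring the C char[16] buffer

# An "S16" array stores fixed-width 16-byte strings, so a flat C buffer
# of num_polylines * 16 chars maps onto it element-for-element.
ids = np.zeros(3, dtype="S16")
assert ids.itemsize == SCENARIO_ID_LEN
assert ids.nbytes == 3 * SCENARIO_ID_LEN

ids[0] = b"abc123def456ghi7"  # a full 16-byte WOMD-style scenario id
assert ids[0] == b"abc123def456ghi7"
```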

Comment on lines +183 to +186
wosac_target_scenarios = 64
; Total pool of scenarios to sample from
wosac_scenario_pool_size = 1000
; Max batches, used as a timeout to prevent an infinite loop


@WaelDLZ how are these two numbers different? I can see it in the code but it's unclear from the comment


I think that in the long run these should be replaced by the multiprocessed eval code.

From my understanding:

  • pool_size is the equivalent of the num_maps we would have in training.
  • target_scenarios means we stop the eval once we've seen 64 distinct scenarios (this is needed because we sample with replacement).
  • batch_size means we compute the metrics in batches of 32.

But I hope that we can replace this by the mechanism I proposed in the multiprocessed eval PR
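A rough sketch of the loop as described above — all names and the random sampling are illustrative, not the actual evaluator code:

```python
import random

def run_eval(pool_size=1000, target_scenarios=64, batch_size=32, max_batches=100):
    # Sketch only: sample scenario indices with replacement until enough
    # distinct scenarios have been evaluated, or until the batch cap
    # (the infinite-loop timeout) is hit.
    seen = set()
    batches = 0
    while len(seen) < target_scenarios and batches < max_batches:
        batch = [random.randrange(pool_size) for _ in range(batch_size)]
        # ... resample maps, roll out the policy, compute metrics ...
        seen.update(batch)  # track distinct scenario ids seen so far
        batches += 1
    return len(seen), batches
```

Because sampling is with replacement, the loop usually needs more than target_scenarios / batch_size batches to cover the target, which is why the max_batches timeout exists.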

human_replay_eval = False
; Control only the self-driving car
human_replay_control_mode = "control_sdc_only"
; Number of scenarios for human replay evaluation equals the number of agents


also unclear comment

[traffic_light_violation]
bernoulli = true
- metametric_weight = 0.05
+ metametric_weight = 0.0


@WaelDLZ why not set to correct value?


Because we don't actually have a traffic light violation metric in our codebase, so to replicate the 2024 WOSAC challenge we decided to do this

Comment on lines +1928 to +1931
// Skip if its the SDC
if (i == sdc_index) {
continue;
}


why?

Comment on lines +2202 to +2203
char *scenario_id_ptr = scenario_ids_out + poly_idx * 16;
memcpy(scenario_id_ptr, env->scenario_id, 16);


magic numbers are to be avoided

  "y": np.zeros(total_points, dtype=np.float32),
  "lengths": np.zeros(num_polylines, dtype=np.int32),
- "scenario_id": np.zeros(num_polylines, dtype=np.int32),
+ "scenario_id": np.zeros(num_polylines, dtype="S16"),


why this dtype switch?

c_get_global_ground_truth_trajectories(
drive, &x_base[traj_offset], &y_base[traj_offset], &z_base[traj_offset], &heading_base[traj_offset],
&valid_base[traj_offset], &id_base[agent_offset], &is_vehicle_base[agent_offset],
&is_track_to_predict_base[agent_offset], &scenario_id_base[agent_offset * 16]);


magic number

- resolve conflicts on merge
- adopt resample_maps() refactoring in drive.py
- tied the scenario_id string length to a constant
@riccardosavorgnan riccardosavorgnan force-pushed the ricky/merge_conflicts_3.0_beta_and_3.0 branch from aa91598 to bca5dca Compare March 9, 2026 16:59
@riccardosavorgnan riccardosavorgnan merged commit fea9871 into 3.0_beta Mar 9, 2026
10 checks passed