Add periodic safe eval during training #329
Open
eugenevinitsky wants to merge 22 commits into 3.0_beta from
Conversation
Periodically run the policy with safe/law-abiding reward conditioning values (high collision, offroad, and overspeed penalties), render videos, and collect metrics under the eval/ wandb tab.
- Add [safe_eval] config section with safe reward conditioning values
- Add --config flag to the C visualizer for alternative INI files
- Generate a temp INI with min=max bounds for deterministic safe conditioning
- Render safe eval videos and collect metrics via subprocess
- Refactor eval subprocess functions to share a common helper
- Fix visualize.c: guard against a rand()%0 crash, propagate the eval_gif return value
- Fix safe_eval(): call policy.eval(), pass actual LSTM done/reward signals
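The "temp INI with min=max bounds" step can be sketched with `configparser`. This is a hedged illustration, not the PR's actual code: the helper name `write_safe_eval_ini`, the `[env]` section, and the `reward_bound_*` key pattern are assumptions based on the commit messages.

```python
import configparser
import tempfile

# Illustrative sketch: clamp each reward bound to min == max so the
# conditioned policy sees a single deterministic "safe" value.
def write_safe_eval_ini(base_path, safe_values):
    config = configparser.ConfigParser()
    config.read(base_path)
    for name, value in safe_values.items():
        config["env"][f"reward_bound_{name}_min"] = str(value)
        config["env"][f"reward_bound_{name}_max"] = str(value)
    tmp = tempfile.NamedTemporaryFile("w", suffix=".ini", delete=False)
    config.write(tmp)
    tmp.close()
    return tmp.name  # caller passes this path to the visualizer via --config
```

The caller would hand the returned path to the C visualizer's new `--config` flag and delete the file afterwards.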
render_videos no longer deletes its input bin file as a side effect. Cleanup of bin_path and bin_path_epoch is now handled in a finally block in pufferl.py, eliminating the need for a defensive copy (bin_path_safe). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pass --train.device from parent config so eval subprocesses use the correct device (e.g. cpu). Use = syntax for reward bound args to prevent argparse from interpreting negative values like -5e-05 as flags. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
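The `=` syntax fix is easy to demonstrate with a minimal `argparse` sketch (the flag name below is illustrative). With the space-separated form, a value like `-5e-05` starts with `-` but is not recognized by argparse's negative-number heuristic, so on many Python versions it can be mistaken for an option flag; attaching the value with `=` avoids the ambiguity entirely.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--reward-bound-collision-min", type=float)

# Risky: ["--reward-bound-collision-min", "-5e-05"] may fail with
# "expected one argument" because "-5e-05" looks like a flag.

# Safe: '=' keeps the value attached to the option, so it always parses.
args = parser.parse_args(["--reward-bound-collision-min=-5e-05"])
print(args.reward_bound_collision_min)  # -5e-05
```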
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Render and safe eval now trigger on their own intervals independently of checkpoint_interval. They use the latest available .pt checkpoint, so they don't require a fresh one. This means safe_eval_interval no longer needs to be a multiple of checkpoint_interval. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rename safe_eval config keys to drop the safe_eval_ prefix since they're already in the [safe_eval] section. Replace SAFE_EVAL_BOUND_KEYS mapping with a simple SAFE_EVAL_REWARD_BOUNDS list since config keys now match bound names directly (e.g. collision instead of safe_eval_collision). Also rename safe_eval→enabled, safe_eval_interval→interval, safe_eval_num_agents→num_agents, safe_eval_num_episodes→num_episodes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of maintaining a hardcoded list of reward bound names, treat every key in [safe_eval] that isn't a control key (enabled, interval, num_agents, num_episodes) as a reward bound. Adding a new reward bound now only requires adding it to the config file. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace SAFE_EVAL_CONTROL_KEYS exclusion list with _get_env_reward_bound_names() that discovers valid bound names by pattern-matching reward_bound_*_min keys in the [env] config section. Safe eval now only passes keys that match known env bounds, so adding a new control key to [safe_eval] can't accidentally become a bound arg. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix --eval.wosac-num-agents (nonexistent) to --eval.wosac-num-rollouts (actual config key). Also increase stderr output from 500 to 1000 chars to show full tracebacks when eval subprocesses fail. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
WOSAC ground truth trajectories are 9.1 s at 10 Hz = 91 steps. Without setting episode_length, the eval subprocess used the default training episode_length, causing shape mismatches between simulated and reference trajectories. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Eval subprocesses (wosac, human replay, safe eval metrics) previously blocked the training loop. Now they optionally run in daemon threads, with cleanup on close(). Enabled by default via eval_async=True. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
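The daemon-thread pattern can be sketched as below. The function and variable names are illustrative assumptions, not the PR's actual API; the real code launches eval commands like wosac or safe_eval.

```python
import subprocess
import sys
import threading

# Sketch: run an eval subprocess off the training loop in a daemon thread.
def _run_eval_in_thread(cmd, results, key):
    proc = subprocess.run(cmd, capture_output=True, text=True)
    results[key] = proc.returncode

results = {}
thread = threading.Thread(
    target=_run_eval_in_thread,
    # a trivial stand-in for the real eval command
    args=([sys.executable, "-c", "print('eval done')"], results, "safe_eval"),
    daemon=True,  # daemon=True: a hung eval cannot block interpreter exit
)
thread.start()
thread.join()  # close() would join with a timeout instead of blocking forever
print(results)
```

With `eval_async=True`, training continues while the thread runs; `close()` is responsible for joining any threads still alive.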
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reduced the number of episodes from 300 to 100 and added episode_length, min_goal_distance, and max_goal_distance parameters.
- safe_eval: pass episode_length, min_goal_distance, max_goal_distance to both the safe_eval() function (env overrides) and the subprocess (CLI args). Also pass num_agents to the subprocess.
- wosac/human_replay: pass map_dir and num_maps to subprocesses so CLI overrides are respected.
- Add missing human_replay_num_agents to the drive.ini config.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- safe_eval() was using num_episodes as num_steps, running far too few steps to complete any episodes. Now runs until enough episodes are collected, counting by the 'n' field in each info dict.
- Fix wandb step conflict: async eval threads logged at stale steps, causing wandb to drop metrics. Use define_metric("eval/*", step_metric="eval_step") so eval metrics have their own step counter.
- Fix subprocess config passthrough: pass episode_length, min_goal_distance, max_goal_distance as --safe-eval.* args (not --env.*) so safe_eval() correctly applies them from safe_eval_config.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
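The episode-counting fix can be illustrated with a toy loop: step until enough episodes are collected, summing the 'n' field from each info dict, instead of treating num_episodes as a step budget. The toy env and function names here are assumptions for illustration only.

```python
# Sketch: a fake vectorized env where every 5th step finishes one episode
# and reports metrics through an info dict with an 'n' field.
class ToyVecEnv:
    def __init__(self):
        self.t = 0

    def step(self):
        self.t += 1
        if self.t % 5 == 0:
            return [{"n": 1, "score": self.t}]  # one episode finished
        return [{}]

def run_safe_eval(env, num_episodes):
    episodes_done, metrics = 0, []
    # run until enough episodes complete, not for a fixed step count
    while episodes_done < num_episodes:
        for info in env.step():
            if "n" in info:
                episodes_done += info["n"]
                metrics.append(info)
    return metrics

print(len(run_safe_eval(ToyVecEnv(), 3)))  # 3
```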
When render_async is enabled, the bin_path_epoch file was deleted in the finally block while the async render process was still reading it. Now the async render process passes bin_path through the render queue, and check_render_queue() deletes it after the process finishes. Also drains the queue in close() to clean up any remaining bin files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
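The queue-based cleanup can be sketched as follows; `render_queue` and `check_render_queue` mirror the names in the commit message, but the body is an assumed reconstruction.

```python
import os
import queue

render_queue = queue.Queue()

# Sketch: the async render worker puts its input bin path on the queue when
# it finishes; only then does the trainer delete the file, so the finally
# block no longer races a render process that is still reading it.
def check_render_queue():
    while not render_queue.empty():
        bin_path = render_queue.get_nowait()
        if os.path.exists(bin_path):
            os.remove(bin_path)

# worker side (illustrative): render(bin_path); render_queue.put(bin_path)
```

`close()` would call the same drain logic to clean up any bin files left on the queue at shutdown.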
Merge resolution: - Keep _run_eval_subprocess helper and eval_async infrastructure from our branch - Adopt 3.0_beta's new WOSAC params: wosac_batch_size, wosac_target_scenarios, wosac_scenario_pool_size, wosac_eval_mode (replaces wosac_num_rollouts) - Adopt 3.0_beta's expanded wandb keys: realism_meta_score_std, kinematic_metrics, interactive_metrics, map_based_metrics - Adopt 3.0_beta's evaluator.evaluate() batched API in pufferl.py - Keep our eval_step metric, map_dir passthrough, episode_length - Resolve duplicate human_replay_num_agents (use 3.0_beta's value: 16) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace os.path.getctime with lexicographic max() for finding latest checkpoint (filenames are zero-padded epoch numbers) - Update SLURM account to torch_pr_355_tandon_advanced - Fix container overlay path Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
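Because checkpoint filenames are zero-padded epoch numbers, the lexicographic `max()` is the latest checkpoint and, unlike `os.path.getctime`, is stable across file copies and restores. A minimal sketch, with an illustrative helper name and filename scheme:

```python
import os

# Sketch: zero-padded epoch filenames sort correctly as strings,
# so max() over the names finds the newest checkpoint.
def latest_checkpoint(ckpt_dir):
    pts = [f for f in os.listdir(ckpt_dir) if f.endswith(".pt")]
    return os.path.join(ckpt_dir, max(pts)) if pts else None
```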
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Enable render_async and eval_async by default in drive.ini - Fix human replay subprocess: --eval.num-maps → --env.num-maps - Fix eval_async string "False" being truthy (explicit string-to-bool) - Align WOSAC/human replay trigger to same interval as render/safe eval - Forward env config to eval subprocesses - Add eval documentation to docs/src/evaluation.md - Add TIMING prints for eval component profiling Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
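The `"False"`-is-truthy bug comes from INI values arriving as strings: `bool("False")` is `True` because any non-empty string is truthy. A sketch of the explicit string-to-bool conversion (the helper name is illustrative):

```python
# Sketch: convert config strings to booleans explicitly instead of
# relying on truthiness, since bool("False") == True.
def to_bool(value):
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in ("true", "1", "yes")

print(bool("False"))     # True  -- the bug
print(to_bool("False"))  # False -- the fix
```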
Summary
Periodically run the policy with safe reward conditioning values, rendering videos and collecting metrics under the `eval/` wandb tab.
- Add a `[safe_eval]` config section in `drive.ini` with configurable interval, episode count, and safe reward values
- Add a `--config` flag to the C visualizer so it can render with alternative INI files
- Refactor the eval subprocess functions (`wosac`, `human_replay`, `safe_eval`) to share a common `_run_eval_subprocess` helper, eliminating ~50 lines of duplicated boilerplate

How it works
- Every `safe_eval_interval` epochs, copies the exported `.bin` model
- Renders `eval/world_state` and `eval/agent_view` videos in wandb
- Runs a subprocess (`pufferlib.pufferl safe_eval`) to collect metrics → logged under `eval/*` in wandb

Test plan
- Run with `safe_eval = True` and verify videos appear in the wandb `eval` tab and metrics under the `eval/` prefix
- Exercise the `_run_eval_subprocess` refactor
- Run with `safe_eval = False` (default for non-drive envs) to confirm no side effects

NOTE: code heavily generated by claude, this is just a prototype