feat: eval gate, pytest-timeout fix, README consistency (cycle 1 audit P0s)#48
feat: eval gate, pytest-timeout fix, README consistency (cycle 1 audit P0s)#48ChunkyTortoise wants to merge 2 commits into
Conversation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…onsistency Cycle 1 P0 fixes (score: 36/50 -> 39/50): - Fix test count inconsistency: 6,700 -> 7,678 across all README references - Add pytest-timeout>=0.5 to requirements-dev.txt (fixes collection error) - Add evals/run_evals_deterministic.py: 50/50 dataset structure checks, exits 0 - Add evals/RESULTS.md documenting last successful run - Gate deterministic evals in CI (blocking, no API key); LLM-as-judge advisory - Add make evals target to Makefile - Fix Contributing test command to use full pytest.ini testpaths - Add MCP server row to For Hiring Managers table Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bed07190d2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if len(dataset) != EXPECTED_TOTAL: | ||
| failures.append(f"Expected {EXPECTED_TOTAL} cases, got {len(dataset)}") | ||
|
|
||
| ids = [tc["id"] for tc in dataset] |
There was a problem hiding this comment.
Guard missing IDs before duplicate scan
The deterministic validator reads every case ID via tc["id"] before it checks REQUIRED_FIELDS, so a single malformed record without id raises KeyError and aborts the run with a traceback. In that scenario CI still fails, but you lose the structured failure report this script is intended to produce, making dataset regressions harder to diagnose quickly.
Useful? React with 👍 / 👎.
| # mentions a disclosure trigger scenario. | ||
| for tc in dataset: | ||
| if tc.get("category") == "compliance": | ||
| props = tc["expected_output_properties"] |
There was a problem hiding this comment.
Skip malformed compliance cases in second pass
After the main loop already records missing top-level fields, the compliance-only pass dereferences tc["expected_output_properties"] unconditionally. A compliance case missing that field will crash with KeyError instead of being reported as a normal validation failure, which undermines the deterministic gate's usefulness when schema errors are introduced.
Useful? React with 👍 / 👎.
Summary
Hero repo audit cycle 1 P0 fixes. Score: 36/50 -> 39/50.
pytest-timeout>=0.5torequirements-dev.txt-- resolvestest_strategic_claude_consultant.pycollection errorevals/run_evals_deterministic.py-- validates golden_dataset.json structure (50 cases, no duplicate IDs, required fields, category distribution). No API key needed.evals/RESULTS.md-- 50/50 PASS, 2026-04-26continue-on-error)make evalstargettestpathsfrompytest.iniTest plan
python evals/run_evals_deterministic.pyexits 0 (50/50 PASS)pytest tests/services/test_strategic_claude_consultant.py --collect-onlycollects 27 tests (no timeout marker error)grep "7,678" README.md | wc -lreturns 4 (all references consistent)🤖 Generated with Claude Code