feat: eval gate, pytest-timeout fix, README consistency (cycle 1 audit P0s)#48

Open
ChunkyTortoise wants to merge 2 commits into main from feat/hiring-signal-enhancement

@ChunkyTortoise
Owner

Summary

Hero repo audit cycle 1 P0 fixes. Score: 36/50 -> 39/50.

  • Fix test count inconsistency across all README references (6,700 -> 7,678, consistent with badge and pytest output)
  • Add pytest-timeout>=0.5 to requirements-dev.txt -- resolves test_strategic_claude_consultant.py collection error
  • Add evals/run_evals_deterministic.py -- validates golden_dataset.json structure (50 cases, no duplicate IDs, required fields, category distribution). No API key needed.
  • Add evals/RESULTS.md -- 50/50 PASS, 2026-04-26
  • Gate deterministic evals in CI as blocking step; keep LLM-as-judge as advisory (continue-on-error)
  • Add make evals target
  • Fix Contributing section test command to use full testpaths from pytest.ini
  • Add MCP server row to For Hiring Managers table
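
As a rough illustration of the deterministic gate described above, the count and category-distribution checks might look like the following. This is a minimal sketch, not the contents of evals/run_evals_deterministic.py: `EXPECTED_TOTAL` and the `category` field appear in the reviewed snippets, while `EXPECTED_CATEGORIES`, its category names and counts, `check_shape`, and `exit_code` are hypothetical names chosen for the example.

```python
from collections import Counter

EXPECTED_TOTAL = 50  # case count stated in the PR description

# Hypothetical distribution: the real category names and counts
# are not shown in this PR.
EXPECTED_CATEGORIES = {"compliance": 10, "qualification": 40}


def check_shape(dataset):
    """Return a list of failure messages; an empty list means PASS."""
    failures = []
    if len(dataset) != EXPECTED_TOTAL:
        failures.append(f"Expected {EXPECTED_TOTAL} cases, got {len(dataset)}")

    # .get() keeps a record without "category" from raising KeyError.
    counts = Counter(tc.get("category", "<missing>") for tc in dataset)
    for category, expected in EXPECTED_CATEGORIES.items():
        if counts[category] != expected:
            failures.append(
                f"Category {category!r}: expected {expected}, got {counts[category]}"
            )
    return failures


def exit_code(failures):
    """Map the failure list to a process exit code (0 = pass, 1 = fail)."""
    return 1 if failures else 0
```

Collecting messages and converting them to an exit code only at the end lets a blocking CI step report every problem in one run, and none of it needs an API key.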

Test plan

  • python evals/run_evals_deterministic.py exits 0 (50/50 PASS)
  • pytest tests/services/test_strategic_claude_consultant.py --collect-only collects 27 tests (no timeout marker error)
  • grep "7,678" README.md | wc -l returns 4 (all references consistent)
  • CI eval job now runs deterministic checks without secrets

🤖 Generated with Claude Code

chunktort and others added 2 commits April 16, 2026 20:35
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…onsistency

Cycle 1 P0 fixes (score: 36/50 -> 39/50):
- Fix test count inconsistency: 6,700 -> 7,678 across all README references
- Add pytest-timeout>=0.5 to requirements-dev.txt (fixes collection error)
- Add evals/run_evals_deterministic.py: 50/50 dataset structure checks, exits 0
- Add evals/RESULTS.md documenting last successful run
- Gate deterministic evals in CI (blocking, no API key); LLM-as-judge advisory
- Add make evals target to Makefile
- Fix Contributing test command to use full pytest.ini testpaths
- Add MCP server row to For Hiring Managers table

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bed07190d2

if len(dataset) != EXPECTED_TOTAL:
    failures.append(f"Expected {EXPECTED_TOTAL} cases, got {len(dataset)}")

ids = [tc["id"] for tc in dataset]

P2: Guard missing IDs before duplicate scan

The deterministic validator reads every case ID via tc["id"] before it checks REQUIRED_FIELDS, so a single malformed record without id raises KeyError and aborts the run with a traceback. In that scenario CI still fails, but you lose the structured failure report this script is intended to produce, making dataset regressions harder to diagnose quickly.
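
One possible way to address this, sketched under assumptions: read IDs with `.get()` and record a failure for any record that lacks one, then scan only the IDs that were found. The function name `check_ids` and the message wording are hypothetical; only the `id` field and the `failures` list pattern come from the reviewed snippet.

```python
from collections import Counter


def check_ids(dataset):
    """Report missing and duplicate IDs without raising KeyError."""
    failures = []
    ids = []
    for index, tc in enumerate(dataset):
        case_id = tc.get("id")
        if case_id is None:
            # Malformed record: report it and keep going.
            failures.append(f"Case at index {index} is missing 'id'")
        else:
            ids.append(case_id)

    # Counter preserves first-seen order, so duplicates are listed
    # in the order they first appear in the dataset.
    duplicates = [cid for cid, n in Counter(ids).items() if n > 1]
    if duplicates:
        failures.append(f"Duplicate IDs: {duplicates}")
    return failures
```

For example, `check_ids([{"id": "a"}, {"id": "a"}, {"category": "x"}])` reports both the record with no ID and the duplicate `"a"` in a single pass, instead of aborting on the third record.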


# mentions a disclosure trigger scenario.
for tc in dataset:
    if tc.get("category") == "compliance":
        props = tc["expected_output_properties"]

P2: Skip malformed compliance cases in second pass

After the main loop already records missing top-level fields, the compliance-only pass dereferences tc["expected_output_properties"] unconditionally. A compliance case missing that field will crash with KeyError instead of being reported as a normal validation failure, which undermines the deterministic gate's usefulness when schema errors are introduced.
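
A sketch of how the second pass could skip such cases, assuming the main loop has already recorded the missing field. The actual compliance rule is not visible in this diff, so it is passed in as a hypothetical `check_props` callable that returns an error string or `None`; `check_compliance_cases` is likewise an illustrative name.

```python
def check_compliance_cases(dataset, check_props):
    """Run a compliance-only check, skipping records with no properties.

    check_props is a hypothetical callable standing in for the real rule:
    it takes the properties value and returns an error string or None.
    """
    failures = []
    for tc in dataset:
        if tc.get("category") != "compliance":
            continue
        props = tc.get("expected_output_properties")
        if props is None:
            # Assumed to be reported already by the required-fields pass,
            # so skip it here rather than crash or double-report.
            continue
        problem = check_props(props)
        if problem:
            failures.append(f"{tc.get('id', '<no id>')}: {problem}")
    return failures
```

Skipping instead of re-reporting keeps each schema error to one line in the failure report, on the assumption that the first pass covers `expected_output_properties` in `REQUIRED_FIELDS`.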
