Add Gherkin acceptance E2E harness example #2887

Merged
Hmbown merged 2 commits into Hmbown:codex/v0.9.0-stewardship from aboimpinto:feat/2791-acceptance-e2e-harness
Jun 7, 2026

Conversation

@aboimpinto

Contributor

Summary

Refs #2886 and #2791. Reference branch/PR: #2851.

This PR adds the first Gherkin-style acceptance E2E example for the command/tool lifecycle work. It is intentionally a small layer: it does not refactor command structure. It adds an executable acceptance harness that can describe owner-level behavior in Given / When / Then language and then verify the first slice through public process and mocked-provider borders.

Why this layer

The command-strategy work is being split into smaller PRs. Before moving command ownership/routing code, we need behavior-level tests that describe the full user-visible flow. These tests complement the existing unit and narrower integration tests:

  • Unit tests are still best for parser, registry, routing, and rendering helpers.
  • The offline eval harness is useful for deterministic tool-loop fixtures.
  • The new Gherkin tests make the acceptance flow readable from the issue/owner perspective and executable in CI.
  • The mocked LLM boundary proves that CodeWhale sends the tool result back into the next model request, which is not visible from isolated tool tests.

What changed

  • Added cucumber as a TUI dev-dependency.
  • Added directory_listing_acceptance.rs with a simple Gherkin feature for the current directory-listing happy path.
  • Added tool_lifecycle_acceptance.rs with a fuller public-border lifecycle scenario.
  • Added Gherkin feature files under crates/tui/tests/features/.
  • Added an eval-harness regression that records the simulated tool loop and validates the expected tool plan.

Step-by-step behavior asserted

The main lifecycle scenario is written as:

Feature: Tool call lifecycle
  Scenario: Happy path lists the current directory through a tool
    Given an offline CodeWhale workspace containing:
      | path      | kind   |
      | README.md | file   |
      | notes.txt | file   |
      | src       | folder |
    And the mocked LLM will request the "list_dir" tool with:
      | path |
      | .    |
    And the mocked LLM will answer after the tool result:
      | content                                               |
      | The directory contains README.md, notes.txt, and src/. |
    When the user asks "list the current directory"
    Then CodeWhale should send the user request to the mocked LLM
    And the public tool lifecycle should show a running tool:
      | status  | marker | tool     | input |
      | running | [~]    | list_dir | .     |
    And the public tool result should return directory entries:
      | entry     | kind   |
      | README.md | file   |
      | notes.txt | file   |
      | src       | folder |
    And CodeWhale should send the tool result back to the mocked LLM
    And the public tool lifecycle should show a completed tool:
      | status    | marker | tool     | input |
      | completed | ✓      | list_dir | .     |
    And the public output should include "The directory contains README.md, notes.txt, and src/."
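
The Given step's workspace table could be turned into real files roughly as follows. This is a minimal standard-library sketch, not the harness code from this PR: `Entry`, `parse_table`, `create_workspace`, and the scratch directory name are hypothetical, and the actual step definitions presumably receive the table from the cucumber crate rather than parsing text by hand.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// One row of the `| path | kind |` table from the Given step.
#[derive(Debug, PartialEq)]
struct Entry {
    path: String,
    kind: String,
}

/// Parse a pipe-delimited data table, skipping the header row.
fn parse_table(table: &str) -> Vec<Entry> {
    table
        .lines()
        .map(str::trim)
        .filter(|line| line.starts_with('|'))
        .skip(1) // header row: | path | kind |
        .filter_map(|line| {
            let cells: Vec<&str> = line.trim_matches('|').split('|').map(str::trim).collect();
            match cells.as_slice() {
                [path, kind] => Some(Entry {
                    path: path.to_string(),
                    kind: kind.to_string(),
                }),
                _ => None, // ignore malformed rows in this sketch
            }
        })
        .collect()
}

/// Create every entry under `root`, creating parent directories first
/// so that nested file paths do not fail on write.
fn create_workspace(root: &Path, entries: &[Entry]) -> io::Result<()> {
    for entry in entries {
        let target = root.join(&entry.path);
        if entry.kind == "folder" {
            fs::create_dir_all(&target)?;
        } else {
            if let Some(parent) = target.parent() {
                fs::create_dir_all(parent)?;
            }
            fs::write(&target, b"")?;
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let table = "\
        | path      | kind   |
        | README.md | file   |
        | notes.txt | file   |
        | src       | folder |";
    let entries = parse_table(table);

    // Hypothetical scratch location; a real harness would likely use a unique temp dir.
    let root: PathBuf =
        std::env::temp_dir().join(format!("acceptance-sketch-{}", std::process::id()));
    create_workspace(&root, &entries)?;

    assert!(root.join("README.md").is_file());
    assert!(root.join("src").is_dir());
    fs::remove_dir_all(&root)?;
    Ok(())
}
```

Keeping the table as the single source of the workspace layout means the same rows can later be compared against the tool result in the Then step.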

The executable step definitions assert this through these components:

  1. Workspace/filesystem: creates an offline temp workspace with files and a folder.
  2. LLM provider boundary: starts a local OpenAI-compatible mock server for /v1/models and /v1/chat/completions.
  3. CLI public border: runs the real codewhale-tui exec --auto --output-format stream-json binary.
  4. Initial model request: verifies the first chat request contains the user prompt and no tool result yet.
  5. Tool decision: the mocked LLM streams a list_dir tool call with {"path":"."}.
  6. Public lifecycle event: verifies CodeWhale emits the public tool_use event and enforces the running marker contract [~] from the scenario table.
  7. Tool execution: verifies the real list_dir result includes README.md, notes.txt, and src with file/folder metadata.
  8. Next prompt / model loop: verifies the second chat request sends the tool message back to the mocked LLM with the expected tool_call_id and directory entries.
  9. Final model answer: the mocked LLM streams the final formatted answer.
  10. Public output: verifies the final answer is emitted in the public stream.
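
Steps 4 and 8 above amount to a check over the requests recorded by the mock server. The sketch below models that check with plain structs instead of parsed JSON. `Message`, `check_round_trip`, and the id `call_1` are illustrative assumptions, not names taken from the PR.

```rust
/// Simplified view of one chat message as captured by a mock server.
#[derive(Debug, Clone)]
struct Message {
    role: String,
    content: String,
    tool_call_id: Option<String>,
}

impl Message {
    fn new(role: &str, content: &str, tool_call_id: Option<&str>) -> Self {
        Self {
            role: role.to_string(),
            content: content.to_string(),
            tool_call_id: tool_call_id.map(str::to_string),
        }
    }
}

/// Step 4: the first request has the prompt and no tool result.
/// Step 8: the second request carries the tool message with the
/// expected id and every directory entry.
fn check_round_trip(
    requests: &[Vec<Message>],
    prompt: &str,
    call_id: &str,
    entries: &[&str],
) -> Result<(), String> {
    if requests.len() < 2 {
        return Err(format!("expected 2 chat requests, got {}", requests.len()));
    }
    let (first, second) = (&requests[0], &requests[1]);

    if !first.iter().any(|m| m.role == "user" && m.content.contains(prompt)) {
        return Err("first request is missing the user prompt".into());
    }
    if first.iter().any(|m| m.role == "tool") {
        return Err("first request already contains a tool result".into());
    }

    let tool = second
        .iter()
        .find(|m| m.role == "tool" && m.tool_call_id.as_deref() == Some(call_id))
        .ok_or_else(|| format!("second request has no tool message for {call_id}"))?;
    for entry in entries {
        if !tool.content.contains(*entry) {
            return Err(format!("tool message is missing entry {entry}"));
        }
    }
    Ok(())
}

fn main() {
    let prompt = "list the current directory";
    let user = Message::new("user", prompt, None);
    // "call_1" is a made-up id for illustration.
    let tool = Message::new("tool", "README.md file\nnotes.txt file\nsrc folder", Some("call_1"));
    let requests = vec![vec![user.clone()], vec![user, tool]];

    let result = check_round_trip(&requests, prompt, "call_1", &["README.md", "notes.txt", "src"]);
    assert!(result.is_ok());
}
```

Returning a `Result` with a message per failed step keeps the failure readable when only one part of the loop regresses.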

Statusline and BlueWhale note

This PR does not claim to verify the interactive screen yet. The current executable slice uses the cross-platform exec stream, so it cannot honestly assert the rendered Statusline or the moving BlueWhale in the top-right UI.

The feature file includes a note for the next PTY/frame-capture layer. That layer should drive the real TUI and assert:

  • Task List running row: [~] list_dir .
  • Task List completed row: [✓] list_dir .
  • Statusline active/completed state if this workflow updates it.
  • BlueWhale activity while the tool or turn is running.
  • BlueWhale stopped/completed state after the turn finishes.
  • Final answer rendered in the transcript/screen.

Validation

cargo test -p codewhale-tui --test tool_lifecycle_acceptance happy_path_lists_current_directory_through_tool -- --exact
cargo test -p codewhale-tui --test tool_lifecycle_acceptance
cargo test -p codewhale-tui --test directory_listing_acceptance
cargo test -p codewhale-tui --test eval_harness
cargo fmt --check
git diff --check

I also temporarily mutated the scenario markers to prove the contract fails as expected:

  • Changing the running marker [~] to [x] failed with left: "[x]", right: "[~]"
  • Changing the completed marker ✓ to X failed with left: "X", right: "✓"
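
The left/right output above is the shape `assert_eq!` prints, with the first argument reported as left and the second as right. A marker contract that fails this way could look like the sketch below. `expected_marker` and `assert_marker_contract` are hypothetical names, and the assumption that the scenario-table value is passed as the first argument is inferred from the reported output rather than from the PR's code.

```rust
/// Assumed mapping from lifecycle status to its public marker.
fn expected_marker(status: &str) -> Option<&'static str> {
    match status {
        "running" => Some("[~]"),
        "completed" => Some("✓"),
        _ => None,
    }
}

/// Compare the marker written in the scenario table (left) with the
/// marker the contract expects for that status (right).
fn assert_marker_contract(status: &str, table_marker: &str) {
    let expected = expected_marker(status).expect("unknown lifecycle status");
    assert_eq!(table_marker, expected);
}

fn main() {
    // The unmodified scenario rows satisfy the contract.
    assert_marker_contract("running", "[~]");
    assert_marker_contract("completed", "✓");

    // A mutated row panics inside assert_eq!, which is what makes the test fail.
    let mutated = std::panic::catch_unwind(|| assert_marker_contract("running", "[x]"));
    assert!(mutated.is_err());
}
```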

Paulo Aboim Pinto

@greptile-apps greptile-apps Bot left a comment

Contributor

aboimpinto has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.

@github-actions

github-actions Bot commented Jun 7, 2026

Thanks @aboimpinto for taking the time to contribute.

This repository is currently observing a maintainer-managed contribution gate in dry-run mode, so this pull request is staying open. When enforcement is enabled, pull requests from contributors who are not listed in .github/APPROVED_CONTRIBUTORS will be closed automatically.

Please read CONTRIBUTING.md for the expected contribution shape. A maintainer can grant PR access by commenting /lgtm on a pull request.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces Cucumber acceptance tests for directory listing and the public LLM/tool lifecycle in crates/tui, adding feature files and corresponding test runners. Feedback on the implementation focuses on improving the robustness of the test harness: resolving a potential deadlock in run_with_timeout by reading process output concurrently, recursively creating parent directories for workspace files to prevent write failures, and preserving terminal and locale environment variables when clearing the host environment.
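
Two of the review points, the `run_with_timeout` deadlock and the cleared environment, can be illustrated together. The sketch below is one possible shape under stated assumptions, not the fix that was merged: the signature, the `PRESERVED_ENV` list, and the polling interval are invented for illustration, and the `main` example assumes a Unix-like host with `sh` on `PATH`.

```rust
use std::io::Read;
use std::process::{Command, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Variables kept after `env_clear` (an assumed list for illustration).
const PRESERVED_ENV: &[&str] = &["PATH", "TERM", "LANG", "LC_ALL"];

/// Drain a pipe on its own thread so the child never blocks on a full buffer.
fn spawn_reader<R: Read + Send + 'static>(mut pipe: R) -> thread::JoinHandle<String> {
    thread::spawn(move || {
        let mut buf = String::new();
        // Read errors are ignored in this sketch; partial output is still returned.
        let _ = pipe.read_to_string(&mut buf);
        buf
    })
}

/// Run `cmd` and return `Some((stdout, stderr))` if it exits before the
/// timeout, or `None` if it had to be killed.
fn run_with_timeout(
    mut cmd: Command,
    timeout: Duration,
) -> std::io::Result<Option<(String, String)>> {
    // Start from a clean environment, then restore terminal and locale settings.
    cmd.env_clear();
    for key in PRESERVED_ENV {
        if let Ok(value) = std::env::var(key) {
            cmd.env(key, value);
        }
    }

    let mut child = cmd
        .stdin(Stdio::null())
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()?;
    let out = spawn_reader(child.stdout.take().expect("stdout is piped"));
    let err = spawn_reader(child.stderr.take().expect("stderr is piped"));

    let deadline = Instant::now() + timeout;
    let finished = loop {
        if child.try_wait()?.is_some() {
            break true;
        }
        if Instant::now() >= deadline {
            let _ = child.kill();
            child.wait()?;
            break false;
        }
        thread::sleep(Duration::from_millis(20));
    };

    // The readers finish once the child's ends of the pipes are closed.
    let stdout = out.join().unwrap_or_default();
    let stderr = err.join().unwrap_or_default();
    Ok(finished.then_some((stdout, stderr)))
}

fn main() -> std::io::Result<()> {
    let mut cmd = Command::new("sh");
    cmd.args(["-c", "echo out; echo err >&2"]);
    let result = run_with_timeout(cmd, Duration::from_secs(5))?;
    assert_eq!(result, Some(("out\n".to_string(), "err\n".to_string())));
    Ok(())
}
```

Reading both pipes on separate threads is what avoids the deadlock: a child that writes more than the pipe buffer holds would otherwise block while the parent is still waiting for it to exit. One caveat of this sketch is that a grandchild process that inherits the pipes can keep the readers waiting after the direct child is killed.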

Comment thread crates/tui/tests/tool_lifecycle_acceptance.rs
Comment thread crates/tui/tests/tool_lifecycle_acceptance.rs
Comment thread crates/tui/tests/tool_lifecycle_acceptance.rs

@aboimpinto aboimpinto marked this pull request as ready for review June 7, 2026 14:31

@Hmbown Hmbown merged commit 2c56f77 into Hmbown:codex/v0.9.0-stewardship Jun 7, 2026
1 check passed