Add Gherkin acceptance E2E harness example by aboimpinto · Pull Request #2887 · Hmbown/CodeWhale

aboimpinto · 2026-06-07T14:13:26Z

Summary

Refs #2886 and #2791. Reference branch/PR: #2851.

This PR adds the first Gherkin-style acceptance E2E example for the command/tool lifecycle work. It is intentionally a small layer and does not refactor the command structure. It adds an executable acceptance harness that describes owner-level behavior in Given / When / Then language and verifies the first slice through the public process and mocked-provider borders.

Why this layer

The command-strategy work is being split into smaller PRs. Before moving command ownership/routing code, we need behavior-level tests that describe the full user-visible flow. These tests complement the existing unit and narrower integration tests:

Unit tests are still best for parser, registry, routing, and rendering helpers.
The offline eval harness is useful for deterministic tool-loop fixtures.
The new Gherkin tests make the acceptance flow readable from the issue/owner perspective and executable in CI.
The mocked LLM boundary proves that CodeWhale sends the tool result back into the next model request, which is not visible from isolated tool tests.

What changed

Added cucumber as a TUI dev-dependency.
Added directory_listing_acceptance.rs with a simple Gherkin feature for the current directory-listing happy path.
Added tool_lifecycle_acceptance.rs with a fuller public-border lifecycle scenario.
Added Gherkin feature files under crates/tui/tests/features/.
Added an eval-harness regression that records the simulated tool loop and validates the expected tool plan.

Step-by-step behavior asserted

The main lifecycle scenario is written as:

Feature: Tool call lifecycle
  Scenario: Happy path lists the current directory through a tool
    Given an offline CodeWhale workspace containing:
      | path      | kind   |
      | README.md | file   |
      | notes.txt | file   |
      | src       | folder |
    And the mocked LLM will request the "list_dir" tool with:
      | path |
      | .    |
    And the mocked LLM will answer after the tool result:
      | content                                               |
      | The directory contains README.md, notes.txt, and src/. |
    When the user asks "list the current directory"
    Then CodeWhale should send the user request to the mocked LLM
    And the public tool lifecycle should show a running tool:
      | status  | marker | tool     | input |
      | running | [~]    | list_dir | .     |
    And the public tool result should return directory entries:
      | entry     | kind   |
      | README.md | file   |
      | notes.txt | file   |
      | src       | folder |
    And CodeWhale should send the tool result back to the mocked LLM
    And the public tool lifecycle should show a completed tool:
      | status    | marker | tool     | input |
      | completed | ✓      | list_dir | .     |
    And the public output should include "The directory contains README.md, notes.txt, and src/."

The executable step definitions assert this through these components:

Workspace/filesystem: creates an offline temp workspace with files and a folder.
LLM provider boundary: starts a local OpenAI-compatible mock server for /v1/models and /v1/chat/completions.
CLI public border: runs the real codewhale-tui exec --auto --output-format stream-json binary.
Initial model request: verifies the first chat request contains the user prompt and no tool result yet.
Tool decision: the mocked LLM streams a list_dir tool call with {"path":"."}.
Public lifecycle event: verifies CodeWhale emits the public tool_use event and enforces the running marker contract [~] from the scenario table.
Tool execution: verifies the real list_dir result includes README.md, notes.txt, and src with file/folder metadata.
Next prompt / model loop: verifies the second chat request sends the tool message back to the mocked LLM with the expected tool_call_id and directory entries.
Final model answer: the mocked LLM streams the final formatted answer.
Public output: verifies the final answer is emitted in the public stream.
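
The two model-request checks in the list above (no tool result in the first request, a matching tool message in the second) can be sketched as follows. This is a minimal illustration, not the step definitions from this PR: `ChatMessage`, `ChatRequest`, and `check_tool_round_trip` are hypothetical names, and the sketch assumes the request bodies recorded by the mock server have already been parsed into simple structs.

```rust
/// Hypothetical, simplified view of one message in a recorded chat request.
#[derive(Debug, Clone, PartialEq)]
pub struct ChatMessage {
    pub role: String,
    pub content: String,
    /// Set only on `tool` messages; echoes the id of the tool call being answered.
    pub tool_call_id: Option<String>,
}

/// Hypothetical record of one request body captured by the mock server.
#[derive(Debug, Clone, Default)]
pub struct ChatRequest {
    pub messages: Vec<ChatMessage>,
}

/// Checks the request sequence: prompt first, tool result only in the second request.
pub fn check_tool_round_trip(
    requests: &[ChatRequest],
    user_prompt: &str,
    tool_call_id: &str,
    expected_entries: &[&str],
) -> Result<(), String> {
    if requests.len() < 2 {
        return Err(format!("expected at least 2 requests, got {}", requests.len()));
    }

    // First request: carries the user prompt and no tool result yet.
    let first = &requests[0];
    let has_prompt = first
        .messages
        .iter()
        .any(|m| m.role == "user" && m.content.contains(user_prompt));
    if !has_prompt {
        return Err("first request does not contain the user prompt".to_string());
    }
    if first.messages.iter().any(|m| m.role == "tool") {
        return Err("first request already contains a tool result".to_string());
    }

    // Second request: must include the tool message for the expected call id.
    let tool_message = requests[1]
        .messages
        .iter()
        .find(|m| m.role == "tool" && m.tool_call_id.as_deref() == Some(tool_call_id))
        .ok_or_else(|| format!("second request has no tool message for {}", tool_call_id))?;

    // The tool message must mention every expected directory entry.
    for entry in expected_entries {
        if !tool_message.content.contains(entry) {
            return Err(format!("tool message is missing entry {}", entry));
        }
    }
    Ok(())
}
```

Returning a `Result` rather than panicking keeps the check reusable from a step definition, which can turn the error string into a failed Then step.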

Statusline and BlueWhale note

This PR does not claim to verify the interactive screen yet. The current executable slice uses the cross-platform exec stream, so it cannot honestly assert the rendered Statusline or the moving BlueWhale in the top-right UI.

The feature file includes a note for the next PTY/frame-capture layer. That layer should drive the real TUI and assert:

Task List running row: [~] list_dir .
Task List completed row: [✓] list_dir .
Statusline active/completed state if this workflow updates it.
BlueWhale activity while the tool or turn is running.
BlueWhale stopped/completed state after the turn finishes.
Final answer rendered in the transcript/screen.

Validation

cargo test -p codewhale-tui --test tool_lifecycle_acceptance happy_path_lists_current_directory_through_tool -- --exact
cargo test -p codewhale-tui --test tool_lifecycle_acceptance
cargo test -p codewhale-tui --test directory_listing_acceptance
cargo test -p codewhale-tui --test eval_harness
cargo fmt --check
git diff --check

I also temporarily mutated the scenario markers to prove the contract fails as expected:

Running marker [~] changed to [x]: failed with left: "[x]", right: "[~]".
Completed marker ✓ changed to X: failed with left: "X", right: "✓".
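
The failure messages above have the shape of an `assert_eq!` panic, with the scenario value on the left and the enforced marker on the right. A minimal sketch of such a marker contract is shown below. `ToolStatus`, `expected_marker`, and `assert_marker_contract` are hypothetical names chosen for illustration; the marker values are the ones from the scenario tables, and the actual assertion in the step definitions may be structured differently.

```rust
/// Status of a tool call as it appears in the scenario table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToolStatus {
    Running,
    Completed,
}

/// Marker the harness enforces for each status (values taken from the feature file).
pub fn expected_marker(status: ToolStatus) -> &'static str {
    match status {
        ToolStatus::Running => "[~]",
        ToolStatus::Completed => "✓",
    }
}

/// Panics with an `assert_eq!` left/right report if the scenario marker drifts
/// from the enforced one.
pub fn assert_marker_contract(status: ToolStatus, scenario_marker: &str) {
    assert_eq!(scenario_marker, expected_marker(status));
}
```

With this shape, mutating the feature file marker to `[x]` makes the running check panic, which is the behavior the temporary mutation was meant to confirm.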

Paulo Aboim Pinto

greptile-apps

aboimpinto has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.

github-actions · 2026-06-07T14:13:35Z

Thanks @aboimpinto for taking the time to contribute.

This repository is currently observing a maintainer-managed contribution gate in dry-run mode, so this pull request is staying open. When enforcement is enabled, pull requests from contributors who are not listed in .github/APPROVED_CONTRIBUTORS will be closed automatically.

Please read CONTRIBUTING.md for the expected contribution shape. A maintainer can grant PR access by commenting /lgtm on a pull request.

gemini-code-assist

Code Review

This pull request introduces Cucumber acceptance tests for directory listing and the public LLM/tool lifecycle in crates/tui, adding feature files and corresponding test runners. Feedback on the implementation focuses on improving the robustness of the test harness: resolving a potential deadlock in run_with_timeout by reading process output concurrently, recursively creating parent directories for workspace files to prevent write failures, and preserving terminal and locale environment variables when clearing the host environment.
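
Two of the three suggestions in this review can be illustrated with the standard library alone. The sketch below is not the code merged in this PR: `run_with_timeout` reuses the name mentioned in the review, but its signature here is an assumption, and `drain_in_background`, `Finished`, and `write_workspace_file` are hypothetical helpers. It shows draining stdout and stderr on separate threads so a chatty child cannot block on a full pipe while the parent waits, and creating parent directories before writing a workspace file. The third suggestion (preserving terminal and locale environment variables) is not covered.

```rust
use std::fs;
use std::io::{self, Read};
use std::path::Path;
use std::process::{Command, ExitStatus, Stdio};
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

/// Reads a stream to the end on its own thread and returns the text.
pub fn drain_in_background<R: Read + Send + 'static>(mut reader: R) -> JoinHandle<io::Result<String>> {
    thread::spawn(move || {
        let mut buf = Vec::new();
        reader.read_to_end(&mut buf)?;
        Ok(String::from_utf8_lossy(&buf).into_owned())
    })
}

/// Outcome of a child process; `status` is `None` when the timeout fired.
pub struct Finished {
    pub status: Option<ExitStatus>,
    pub stdout: String,
    pub stderr: String,
}

/// Runs a command, draining both pipes concurrently while polling for exit.
pub fn run_with_timeout(mut command: Command, timeout: Duration) -> io::Result<Finished> {
    let mut child = command
        .stdin(Stdio::null())
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()?;

    // Start reading before waiting, so the child never stalls on a full pipe buffer.
    let stdout = drain_in_background(child.stdout.take().expect("stdout was piped"));
    let stderr = drain_in_background(child.stderr.take().expect("stderr was piped"));

    let deadline = Instant::now() + timeout;
    let status = loop {
        if let Some(status) = child.try_wait()? {
            break Some(status);
        }
        if Instant::now() >= deadline {
            // Ignore the kill error: the child may have exited in the meantime.
            let _ = child.kill();
            child.wait()?;
            break None;
        }
        thread::sleep(Duration::from_millis(10));
    };

    let join = |handle: JoinHandle<io::Result<String>>| -> io::Result<String> {
        handle
            .join()
            .map_err(|_| io::Error::new(io::ErrorKind::Other, "reader thread panicked"))?
    };
    Ok(Finished { status, stdout: join(stdout)?, stderr: join(stderr)? })
}

/// Writes a workspace file, creating any missing parent directories first.
pub fn write_workspace_file(root: &Path, relative: &str, contents: &str) -> io::Result<()> {
    let path = root.join(relative);
    if let Some(parent) = path.parent() {
        fs::create_dir_all(parent)?;
    }
    fs::write(path, contents)
}
```

One limitation of this sketch: if the killed child leaves grandchildren holding the pipes open, the reader threads will not finish until those processes exit.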

Add Gherkin acceptance E2E harness example

d90031f

greptile-apps Bot reviewed Jun 7, 2026

gemini-code-assist Bot reviewed Jun 7, 2026

Comment threads (3) on crates/tui/tests/tool_lifecycle_acceptance.rs

Address acceptance harness review feedback

c25f7af

greptile-apps Bot reviewed Jun 7, 2026

aboimpinto marked this pull request as ready for review June 7, 2026 14:31

greptile-apps Bot reviewed Jun 7, 2026

Hmbown merged commit 2c56f77 into Hmbown:codex/v0.9.0-stewardship Jun 7, 2026
1 check passed

aboimpinto mentioned this pull request Jun 7, 2026

EPIC: staged command-boundary refactor for #2791 #2870

Open

7 tasks

Hmbown mentioned this pull request Jun 8, 2026

v0.8.54 — Benchmark Runners, Community Harvests, Whaleflow Foundation #2902

Merged

Hmbown merged 2 commits into Hmbown:codex/v0.9.0-stewardship from aboimpinto:feat/2791-acceptance-e2e-harness
