From f440f9108d5f6ca1e45c54429c6bdc241760659a Mon Sep 17 00:00:00 2001 From: Nathan Schram <5553883+nathanschram@users.noreply.github.com> Date: Wed, 22 Apr 2026 07:02:14 +0000 Subject: [PATCH 1/2] docs(claude.md): note v0.35.3rc1 staging + Claude extra_args feature MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Bump unit-test count 2372 → 2387 (reflects #407 +8 test_build_args tests and prior untracked test additions). - Expand test_build_args.py entry 42 → 56 tests with the new coverage areas. - Add extra_args passthrough feature entry under "Features (vs upstream takopi)" — documents the Claude-in-Chrome motivator, reserved-flag list, and argv placement (#407, shipped in v0.35.3rc1). Issue progress tracked in gh#407 comment. Co-Authored-By: Claude Opus 4.7 (1M context) --- CLAUDE.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/CLAUDE.md b/CLAUDE.md index 07fc0d9..2ebb409 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -47,6 +47,7 @@ Untether adds interactive permission control, plan mode support, and several UX - **Trigger visibility (Tier 1)** — `/ping` shows per-chat trigger summary (`⏰ triggers: 1 cron (id, 9:00 AM daily (Melbourne))`); run footer shows `⏰ cron:` / `⚡ webhook:` for trigger-initiated runs; new `describe_cron()` utility renders common patterns in plain English - **Graceful restart improvements (Tier 1)** — persists Telegram `update_id` to `last_update_id.json` so restarts don't drop/duplicate messages; `Type=notify` systemd integration via stdlib `sd_notify` (`READY=1` + `STOPPING=1`); `RestartSec=2` - **`diff_preview` plan bypass (#283)** — after user approves a plan outline via "Pause & Outline Plan", the `_discuss_approved` flag short-circuits diff preview for subsequent Edit/Write tools so no second approval is needed +- **Claude `extra_args` passthrough (#407, v0.35.3rc1)** — `[claude] extra_args = [...]` lets users supply upstream CLI flags verbatim (mirrors 
`codex.extra_args`, `pi.extra_args`). Primary motivator: `extra_args = ["--chrome"]` enables Claude-in-Chrome's `mcp__claude-in-chrome__*` tool namespace on a GUI Mac. Flags Untether manages internally (`-p`, `--print`, `--output-format`, `--input-format`, `--resume`/`-r`, `--continue`/`-c`, `--permission-mode`, `--permission-prompt-tool`) are rejected at config-load with a `ConfigError`. User args land on argv after the managed stream-json prelude and before resume / model / effort / allowed-tools / permission flags, preserving the trailing `-p ` (or stdin prompt under permission-mode) position See `.claude/skills/claude-stream-json/` and `.claude/rules/control-channel.md` for implementation details. @@ -180,7 +181,7 @@ Rules in `.claude/rules/` auto-load when editing matching files: ## Tests -2372 unit tests, 80% coverage threshold. Integration testing against `@untether_dev_bot` is **mandatory before every release** — see `docs/reference/integration-testing.md` for the full playbook with per-release-type tier requirements (patch/minor/major). All integration test tiers are fully automated by Claude Code via Telegram MCP tools and Bash. +2387 unit tests, 80% coverage threshold. Integration testing against `@untether_dev_bot` is **mandatory before every release** — see `docs/reference/integration-testing.md` for the full playbook with per-release-type tier requirements (patch/minor/major). All integration test tiers are fully automated by Claude Code via Telegram MCP tools and Bash. 
Key test files: @@ -204,7 +205,7 @@ Key test files: - `test_pi_compaction.py` — 6 tests: compaction start/end, aborted, no tokens, sequence - `test_proc_diag.py` — 24 tests: format_diag, is_cpu_active, collect_proc_diag (Linux /proc reads), ProcessDiag defaults - `test_exec_runner.py` — 22 tests: event tracking (event_count, recent_events ring buffer, PID in StartedEvent meta), JsonlStreamState defaults -- `test_build_args.py` — 42 tests: CLI argument construction for all 6 engines, model/reasoning/permission flags +- `test_build_args.py` — 56 tests: CLI argument construction for all 6 engines, model/reasoning/permission flags, Claude `extra_args` argv ordering, permission-mode argv, multi-flag order, `build_runner` parsing, and reserved-flag rejection (#407) - `test_telegram_files.py` — 17 tests: file helpers, deduplication, deny globs, default upload paths - `test_telegram_file_transfer_helpers.py` — 48 tests: `/file put` and `/file get` command handling, media groups, force overwrite - `test_loop_coverage.py` — 29 tests: update loop edge cases, message routing, callback dispatch, shutdown integration From 90a2df651870a79bb13cda8e783cbc1eb44b44ff Mon Sep 17 00:00:00 2001 From: Nathan Schram <5553883+nathanschram@users.noreply.github.com> Date: Wed, 22 Apr 2026 08:02:26 +0000 Subject: [PATCH 2/2] =?UTF-8?q?chore(gitignore):=20untrack=20internal=20do?= =?UTF-8?q?cs=20=E2=80=94=20audits,=20test=20artifacts,=20handovers?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Public repo hygiene pass. Four classes of file shouldn't be committed going forward: 1. docs/handover/ (new) — Claude Code handover docs that pass context between sessions. Internal-only by nature. 2. docs/audits/ — incident and security audits referencing production bot names and internal workflows. GitHub issues/milestones are the public tracker. 3. docs/tests/ — per-release integration test plans and execution reports. 
Contain internal bot references, skipped-test notes, and QA methodology detail that isn't user-facing. docs/reference/integration-testing.md remains the public playbook. 4. incoming/*.md — draft design and feedback markdown uploaded via Telegram file-transfer. Auto-named file_*.jpg already covered. Files untracked (contents preserved on disk): - docs/audits/pitchdocs-context-guard-interference.md - docs/tests/v0.35.2-integration-test-plan.md - docs/tests/results/v0.35.2-results.md - docs/tests/results/v0.35.2rc3-results.md mkdocs/zensical nav (zensical.toml) doesn't reference any of these paths, so the docs site build is unaffected. History note: these files remain visible in git history; removing them entirely would require a separate BFG/filter-branch pass, which is out of scope for this PR. The forward-going commitment is that internal planning/audit/test-artifact content stops being tracked. Co-Authored-By: Claude Opus 4.7 (1M context) --- .gitignore | 12 + .../pitchdocs-context-guard-interference.md | 136 ----- docs/tests/results/v0.35.2-results.md | 185 ------ docs/tests/results/v0.35.2rc3-results.md | 252 -------- docs/tests/v0.35.2-integration-test-plan.md | 552 ------------------ 5 files changed, 12 insertions(+), 1125 deletions(-) delete mode 100644 docs/audits/pitchdocs-context-guard-interference.md delete mode 100644 docs/tests/results/v0.35.2-results.md delete mode 100644 docs/tests/results/v0.35.2rc3-results.md delete mode 100644 docs/tests/v0.35.2-integration-test-plan.md diff --git a/.gitignore b/.gitignore index b3de2de..6547412 100644 --- a/.gitignore +++ b/.gitignore @@ -15,6 +15,18 @@ _site/ docs/reference/changelog.md docs/plans/ docs/promotion/ +# Internal-only docs — public repo hygiene. Plans, audits, handovers, and +# per-release test artifacts reference production bot names, internal +# processes, and draft design work. GitHub issues/milestones are the +# public tracker. 
+docs/handover/ +docs/audits/ +docs/tests/ +# Draft design/feedback files in incoming/ — auto-named uploads plus WIP +# markdown that shouldn't be committed by default. Demo screenshots and +# d1_*.* data files remain tracked individually. +incoming/file_*.jpg +incoming/*.md .envrc .claude/settings.json .claude/plans/ diff --git a/docs/audits/pitchdocs-context-guard-interference.md b/docs/audits/pitchdocs-context-guard-interference.md deleted file mode 100644 index 47ddae4..0000000 --- a/docs/audits/pitchdocs-context-guard-interference.md +++ /dev/null @@ -1,136 +0,0 @@ -# Audit: PitchDocs Context Guard Interference with Untether - -**Date**: 2026-03-09 -**Severity**: Medium — causes content loss in Telegram sessions -**Affected**: Untether Telegram bridge + PitchDocs Claude Code plugin (context-guard) - -## Incident - -A user in the BIP project chat (via Untether production, `@hetz_lba1_bot`) asked Claude Code to find and outline a backlinks document. Claude completed the task successfully (rc=0, 46.7s, 3 tool calls) but the user received only this 170-character response: - -> No files were modified in this interaction — I only read the backlinks doc and outlined it in the chat. The hook fired as a false positive. No context doc updates needed. - -The actual document outline was generated in an intermediate assistant turn but was **replaced** by this hook-response message in the final output. The user never saw the outline. - -## Root Cause - -Two compounding issues create the content loss: - -### 1. PitchDocs Stop hook false positive - -The `context-guard-stop.sh` hook (installed by PitchDocs `/context-guard install`) fires at session end and checks whether structural files were modified without corresponding context document updates. - -**The detection mechanism**: -```bash -CHANGED_FILES=$(git status --porcelain 2>/dev/null | awk '{print $NF}') -``` - -This checks ALL dirty files in the working tree — not just files modified in the current Claude Code session. 
In the BIP project, PitchDocs had been recently installed, leaving untracked infrastructure files: -- `.claude/rules/context-quality.md` — matches structural pattern `.claude/rules/*.md` -- `.claude/hooks/*` — hook scripts themselves -- `.claude/settings.json` — plugin settings - -Meanwhile, `CLAUDE.md` had already been updated and committed in a previous session, so it appeared clean in `git status`. The hook logic: -1. Found structural files dirty → `HAS_STRUCTURAL=true` -2. Found no context docs dirty → `HAS_CONTEXT=false` -3. Returned `"decision": "block"` with a nudge to update context docs - -**This is a false positive** — context docs were already up to date. The structural "changes" were just the hook infrastructure itself, not actual project structure changes. - -### 2. Content displacement in Untether - -When a Stop hook returns `"decision": "block"`, Claude Code gets one more turn to address the concern before stopping. In a terminal session this is fine — the user can scroll up to see earlier output. But in Untether's Telegram model: - -1. Intermediate assistant text appears as **progress message edits** (each new turn replaces the previous) -2. The `result.result` text from the final `CompletedEvent` becomes the **persistent final message** -3. If Claude's final turn addresses a hook concern instead of user-requested content, that meta-commentary becomes the only thing the user sees -4. The actual content (the outline) was in an earlier turn and is lost - -## Cross-Project Comparison - -All 4 LBA projects with context-guard installed use **identical hook scripts**. The difference is git working tree state: - -| Project | Structural files dirty? | Context docs dirty? | Hook fires? | Hook blocks? 
| -|---------|------------------------|-------------------|-------------|-------------| -| **BIP** | YES — untracked `.claude/rules/context-quality.md` | NO — `CLAUDE.md` already committed | YES | **YES (false positive)** | -| **Scout** | NO — only `scout-db-export.sql`, `test-probe` | N/A | NO — fast exit | No | -| **Brand Copilot** | YES — 113 dirty files including structural | YES — `CLAUDE.md` also dirty | YES | **No** — context doc also dirty | -| **littlebearapps.com** | N/A — no context-guard installed | N/A | N/A | N/A | - -**Pattern**: The false positive occurs when: -1. PitchDocs infrastructure is freshly installed but not committed to git -2. Context docs were already updated in a prior session (clean in `git status`) -3. The current session is read-only (no actual file modifications) - -## PitchDocs Recommendations - -### P1: Add Untether session detection (high priority) - -Stop hooks that block at session end are fundamentally incompatible with Untether's single-message output model. The hook should detect Untether sessions and skip blocking. - -**Proposed change** in `context-guard-stop.sh`, after the `stop_hook_active` check: - -```bash -# Skip blocking in Untether sessions — Stop hook blocks displace -# user-requested content in the Telegram final message. -[ -n "${UNTETHER_SESSION:-}" ] && echo '{}' && exit 0 -``` - -`UNTETHER_SESSION` is set by Untether's runner environment for all Claude Code subprocess invocations. - -### P2: Fix false positive on hook infrastructure files (high priority) - -The hook should not trigger on its own infrastructure. Options: - -**Option A — Exclude hook infrastructure from structural check** (recommended): -```bash -case "$FILE" in - .claude/hooks/*) continue ;; # Hook scripts themselves - .claude/settings.json) continue ;; # Plugin settings - # ... existing structural patterns ... 
-esac -``` - -**Option B — Use tracked-only file detection**: -Replace `git status --porcelain` with `git diff --name-only` + `git diff --cached --name-only` to only check tracked files that were actually modified, excluding untracked new files. - -**Option C — Auto-commit infrastructure on install**: -After `/context-guard install`, automatically `git add` and commit the hook infrastructure files so they don't pollute `git status` in subsequent sessions. - -### P3: Improve context doc freshness detection (medium priority) - -The current logic assumes that if context docs aren't dirty, they haven't been updated. But this fails when context docs were updated and committed in a previous session. A more robust check could: -- Compare context doc last-modified timestamps against structural file timestamps -- Check if context docs were updated in the last N commits -- Use a marker file (`.claude/.context-guard-last-audit`) to track when context was last verified - -### P4: Reduce hook intrusiveness in read-only sessions (low priority) - -If the current session made no file modifications (all tool calls were Read, Grep, Glob, etc.), the Stop hook should not fire. This would require Claude Code to expose session-modified files to the hook, which isn't currently available. - -## Untether Recommendations - -### U1: Enhance preamble with hook awareness (implementing now) - -Add explicit guidance to the Untether preamble telling Claude that hook concerns must never displace user-requested content: - -``` -- If hooks fire at session end, your final response MUST still contain the user's - requested content. Hook concerns are secondary — briefly note them AFTER the main - content, never instead of it. -``` - -This is advisory and may not always be followed, but it gives Claude clear prioritisation guidance. 
- -### U2: Consider content accumulation (future — optional) - -A more robust approach would be to accumulate all assistant text from the session and include it in the final message, rather than only showing the `result.result` text. This would prevent content loss regardless of what the final turn contains. However, this would significantly change the message format and could make messages very long. - -## Hook Script Reference - -**File**: `context-guard-stop.sh` (PitchDocs v1.19.1) -**Trigger**: Claude Code `Stop` event (session end) -**Behaviour**: Returns `"decision": "block"` when structural files in `git status` have no matching context doc updates -**Infinite loop guard**: Checks `stop_hook_active` flag — allows stop on second attempt -**Structural patterns checked**: `commands/*.md`, `.claude/skills/*/SKILL.md`, `.claude/agents/*.md`, `.claude/rules/*.md`, `package.json`, `pyproject.toml`, `Cargo.toml`, `go.mod`, `tsconfig*.json`, `wrangler.toml`, `vitest.config*`, `jest.config*`, `eslint.config*`, `biome.json`, `.claude-plugin/plugin.json` -**Context docs checked**: `CLAUDE.md`, `AGENTS.md`, `GEMINI.md`, `.cursorrules`, `.windsurfrules`, `.clinerules`, `.github/copilot-instructions.md`, `llms.txt` diff --git a/docs/tests/results/v0.35.2-results.md b/docs/tests/results/v0.35.2-results.md deleted file mode 100644 index 8423930..0000000 --- a/docs/tests/results/v0.35.2-results.md +++ /dev/null @@ -1,185 +0,0 @@ -# v0.35.2 integration test report — 2026-04-18 - -**Dev bot version:** pyproject `0.35.1` (pre-bump) on origin/dev HEAD `fe7dbb3` (includes all v0.35.2 commits) -**Engines:** claude 2.1.114, codex 0.121.0, opencode 0.0.55 (**archived binary — per #338**), pi 0.67.68, gemini 0.38.2 -**Skipped:** AMP (auth blocked, per user direction) - -Started: 2026-04-18T03:10Z (13:10 AEST) -Completed: 2026-04-18T03:40Z (~30 min active testing plus investigation) - ---- - -## Tier 7 (command smoke) — PASS 65/65 - -| Q# | Command | Claude | Codex | OpenCode | Pi | 
Gemini | -|---|---|---|---|---|---|---| -| Q1 | `/ping` | ✅ pong + uptime | ✅ | ✅ | ✅ | ✅ | -| Q2 | `/config` | ✅ 10 buttons | ✅ 7 | ✅ 6 | ✅ 5 | ✅ 7 | -| Q3 | `/usage` | ✅ full report | ✅ not available | ✅ not available | ✅ not available | ✅ not available | -| Q4 | `/export` | ✅ no history | ✅ | ✅ | ✅ | ✅ | -| Q5 | `/browse` | ✅ 3 dirs/19 files | ✅ 4 files | ✅ 2 dirs/17 files | ✅ 9 files | ✅ 5 files | -| Q6 | `/verbose` | ✅ toggle | ✅ | ✅ | ✅ | ✅ | -| Q7 | `/cancel` | ✅ nothing running | ✅ | ✅ | ✅ | ✅ | -| Q8 | `/planmode` | ✅ toggle + reset | n/a | n/a | n/a | n/a | -| Q9 | `/stats` | ✅ no sessions | ✅ | ✅ | ✅ | ✅ | -| Q10 | `/ctx` | ✅ claude-test | ✅ codex-test | ✅ opencode-test | ✅ pi-test | ✅ gemini-test | -| Q11 | `/agent` | ✅ claude default | ✅ codex | ✅ opencode | ✅ pi (model override cleared mid-test) | ✅ gemini | -| Q12 | `/trigger` | ✅ all (default) | ✅ | ✅ | ✅ | ✅ | -| Q13 | `/file` | ✅ usage help | ✅ | ✅ | ✅ | ✅ | - -**Engine-aware `/config` buttons working (rc4 #218 area):** Claude shows plan+ask+diff_preview+cost; Codex hides cost; OpenCode hides plan/ask; Pi hides plan/ask/cost; Gemini has cost+approval. - ---- - -## Tier 1 (universal, U1-U10) — partial matrix with blockers documented - -Matrix: 5 engines × 10 tests = 50 runs intended. 
Actual coverage: - -| U# | Claude | Codex | OpenCode | Pi | Gemini | -|---|---|---|---|---|---| -| U1 | ✅ opus 4.7 (1M) · plan · $0.74 · resume | ❌ **upstream bwrap sandbox** | ❌ **#338 archived binary `--format` mismatch** | ⚠️ created file — **V7 FAIL: no model in footer** | ⚠️ tokens shown, no USD (likely free tier) | -| U2 | ✅ multiple progress phases | ❌ same sandbox error | N/A | ✅ no model in footer | ✅ list_directory only | -| U3 | ✅ split 4/4 msgs, footer on last | skip | N/A | ✅ split 6-8 msgs | ✅ split 2/2 + **outbox `📎` delivery working** | -| U4 | ✅ resume ok 18s | skip | N/A | ✅ resume ok 10s | ✅ resume ok | -| U5 | skipped (config model flip) | skip | N/A | skip | skip | -| U6 | partial — `/cancel` covered in Q7 | skip | N/A | skip | skip | -| U7 | ✅ clean error, no traceback | skip | N/A | ✅ clean | ✅ clean | -| U8 | covered via Q3 | covered | N/A | covered | covered | -| U9 | covered via Q4 | covered | N/A | covered | covered | -| U10 | covered via Q5 | covered | N/A | covered | covered | - -**Environmental blockers:** -- **Codex:** every run fails `bwrap: loopback: Failed RTM_NEWADDR: Operation not permitted`. Environment-level sandbox issue on lba-1, not Untether. Documented as upstream. Error handling is clean (no crash, friendly message). -- **OpenCode:** binary 0.0.55 (just reinstalled, archived repo per #338) uses `-p`/`-f json` CLI; Untether's runner emits `run --format json`. Every run fails `unknown flag: --format`. Treat as **N/A / upstream deprecation documented in #338**. Error handling is clean. - ---- - -## Tier 2 (Claude interactive) — 5/7 - -| C# | Test | Result | Notes | -|---|---|---|---| -| C1 | Approve bash | ⚠️ N/A | `allowed_tools=["Bash","Read"]` in config auto-approves Bash — no buttons shown. Not a bug; test assumption needs updating. | -| C2 | Deny bash/edit | ✅ | "Denied permission request" shown; Claude processed deny cleanly. | -| C3 | Pause & Outline Plan | skipped | Time budget. 
Plan outline flow indirectly covered by approval flow in C5/C6. | -| C4 | AskUserQuestion | ✅ | Exercised by V2 run — Claude invoked AskUserQuestion, 5 option buttons rendered, press_inline_button worked, answer fed back to Claude. | -| C5 | Diff preview | ✅ | Approval message showed `📝 greetings.txt / - hello world / - goodbye world`. Approved → Edit executed. | -| C6 | Rapid approve→deny (#197) | ✅ | Approved first.txt (03:38:38), denied second.txt (03:38:50). Both processed cleanly, no stale button, no spinner hang. `_HANDLED_REQUESTS` LRU fix working. | -| C7 | Subscription footer | ✅ | Exercised by V10.1 — footer showed `💰$0.07 · 1 tn · 2.4s · 6/30` + `⚡ 5h: 28% \| 7d: 22%`. | - ---- - -## v0.35.2 scenarios (V1-V15) - -| V# | Issue | Result | Evidence / Notes | -|---|---|---|---| -| V1 | #196 bot_token mask | ✅ PASS | `grep -iE "8678330610:[A-Za-z0-9_-]{20}"` — no raw token in logs. Log shows `url=https://api.telegram.org/bot[REDACTED]/sendMessage`. | -| V2 | #198 env allowlist | ✅ PASS (both) | **Pi**: printenv revealed only allowlisted vars (PATH/HOME/LANG/CI/NO_COLOR/SSH_AUTH_SOCK/OPENAI_API_KEY/UNTETHER_CONFIG_PATH/XDG_RUNTIME_DIR…); AWS_ACCESS_KEY_ID/DATABASE_URL/STRIPE_API_KEY all empty. **Claude**: same AWS/DB/STRIPE filtering confirmed. BWS_ACCESS_TOKEN seen in Claude's tool output — traced to `~/.bashrc export` re-exporting it in bash subprocess; not a leak via Untether's env hook. | -| V3 | #199 Codex HTML escape | N/A | Could not naturally trigger a codex auth/HTML error during run (codex fails earlier at bwrap sandbox). Marked N/A per plan note. | -| V4 | #201 Dispatch sanitisation | ✅ PASS | `/ctx set` → "error: usage: /ctx set [@branch]" (no traceback, no path). `/file get /etc/passwd` → "invalid download path." Both clean. *Cosmetic:* `/ctx set` has duplicated "usage:" string (minor). 
| -| V5 | #203 Registry sweep | ✅ PASS (deferred) | Dev service uptime < 1h; no sweep events expected per plan ("sweep runs on 60-second stall-monitor tick but only prunes ≥1h old entries"). | -| V6 | #204 download_file URL validation | ✅ PASS | No `download_file.rejected` / `download_file.invalid` events for legitimate file puts. | -| V7 | #225 Pi model footer | ❌ **FAIL** | Pi footer shows `🏷 dir: pi-test` with no model despite fresh `/new` session, no `/model` override, no `pi.model` in TOML, and `provider = "openai-codex"`. Raw pi output confirms `model:"gpt-5.4"` in `message_end`. Fix works in isolation (unit tests + direct Python call) but not live. **Commented on #225 (comment 4272553435).** | -| V8 | #247 callback.answered | ✅ PASS | `callback.answered command=aq early=True has_toast=True latency_ms=341.1 total_ms=359.8` — well under 2000ms. | -| V9 | #275 Process tree cleanup | skipped | Time budget. FD count at suite end = 11, zombies = 0 — indirect evidence of clean cleanup. | -| V10 | #316 Cost footer + parity | ⚠️ PARTIAL | **V10.1 Claude:** ✅ `💰$0.07 · 1 tn · 2.4s · 6/30` + `⚡ 5h: 28% \| 7d: 22%` both shown (after flipping `show_subscription_usage = true`). **V10.2 Gemini:** tokens rendered (`💰4.1s · 66.3k/66`), no USD — likely free tier, per plan acceptable. **V10.3 OpenCode:** N/A (#338). **V10.4 Cached:** not tested. | -| V11 | #317 run_once cron | ✅ PASS | Added `v0352-test-once` cron with `run_once=true`, saved, restarted. Logs: `triggers.cron.firing cron_id=v0352-test-once` → `triggers.cron.run_once_completed cron_id=v0352-test-once remaining_crons=0`. `~/.untether-dev/run_once_fired.json` contains `{"fired":{"v0352-test-once":"2026-04-18T03:34:57+00:00"}}`. Telegram chat received `READY` response with footer `⏰ cron:v0352-test-once`. No re-fire observed. | -| V12 | #318 Restart-required Telegram warning | ❌ **PARTIAL FAIL** | Flipped `session_mode` from `chat` → `stateless`. 
structlog event fired: `config.reload.transport_config_changed keys=['session_mode'] restart_required=True transport=telegram`. **No Telegram message sent** — code in `loop.py:1360-1366` calls only `logger.warning(...)`, no outbox write; `grep` for "Config reload" / "restart required" in `src/` returns zero hits. This is exactly the "Proposed Improvement #2" gap in issue #318. **Commented on #318 (comment 4272580505).** | -| V13 | #320 Webhook port bind graceful | ✅ PASS | Port 9876 was organically held by `qgis.bin` pid 2113. On restart, Untether logged structured `triggers.server.bind_failed host=127.0.0.1 port=9876 error="Errno 98 address already in use" hint='Another process may be using this port. Check with: ss -tlnp \| grep 9876' fix='Set [triggers.server] port = in untether.toml (current: 9876)'`. `/ping` confirmed bot alive (`🏓 pong — up 15s`). | -| V14 | #322 Stuck-after-tool_result | ✅ PASS | `grep -E "stuck_after_tool_result\|progress_edits.stuck\|recovery"` across 70 minutes of runs — zero false positives on healthy runs. | -| V15 | #330 Per-cron permission_mode | ✅ PASS | Cron with `permission_mode="auto"` fired in Claude chat (plan mode on). Log: `trigger.cron.permission_mode_override chat_permission_mode=plan engine=claude trigger_permission_mode=auto trigger_source=cron:v0352-test-once`. No approval buttons shown, run completed autonomously, footer showed `plan · ⏰ cron:v0352-test-once`. | - ---- - -## Tier 3 selective (T6, T8, S9) — SKIPPED - -- T6 (emoji entities) — skipped (time budget; no entity rendering failures observed during other runs). -- T8 (stale button click) — skipped (requires 10+ min wait). -- S9 (concurrent Approve clicks) — effectively covered by C6 rapid approve→deny demonstrating exactly-once handling. - ---- - -## Tier 5/6 (B5, S2, S3, S7) - -| # | Test | Result | Notes | -|---|---|---|---| -| B5 | Log sweep | ✅ PASS with caveats | See log findings below. 
| -| S2 | Concurrent sessions | partial | Gemini + Claude ran concurrently multiple times during V2/V10 phases with no cross-contamination observed. Not formally run. | -| S3 | `/restart` mid-run | skipped | Service was restarted mid-suite (twice: once for branch switch, once to init triggers) — both drained and resumed cleanly; drain semantics implicitly verified. | -| S7 | Rapid-fire | ✅ PASS | 5 rapid prompts sent in ~2s. Only latest (`rapid 5`) triggered `handle.incoming` — others coalesced cleanly (per forward-coalescing). Exactly one session lock acquired. No crash. | - ---- - -## Logs (B5) - -**Unexpected ERROR lines:** 3 total across the run — -- 2× `telegram.http_error Bad Request: chat not found chat_id=123` — caused by the `[transports.telegram] chat_id = 123` placeholder in dev cfg; attempts to announce startup to the placeholder chat. Untether's own `project.skipped.chat_id_matches_transport alias=z80 reason='must not match transports.telegram.chat_id'` correctly skips the z80 project but the placeholder sends still fail. **Not a new bug — cosmetic dev-only annoyance.** -- 1× `opencode.process.failed rc=1` — expected, from the --format flag incompatibility with archived opencode 0.0.55. - -**Notable WARNING lines:** -- **Gemini `jsonl.msgspec.invalid error='JSON is malformed: invalid character (byte 0)'`** — **seen on EVERY Gemini run (7+ instances)**. Not a crash, but the runner discards one or more JSONL lines per run. Worth a separate issue if not already tracked. Runs complete successfully despite this. -- 2× `transport.send.failed chat_id=123 text_len=320/361` — same placeholder chat issue as above. -- 1× `projects.config.skipped_projects skipped=['z80']` — expected skip. - -**FD count after suite:** 11 (untether PID 1065954). -**Zombies:** 0. - ---- - -## Bugs filed / commented - -### Commented on existing issues -- **#225** (Pi model footer) — **comment 4272553435**: Pi footer regression despite PR #327 merged. 
Full reproducer, raw pi JSONL capture, and isolated code path validation. Live behavior fails. -- **#318** (Restart-required warning) — **comment 4272580505**: structlog event works, Telegram-visible message not implemented (Proposed Improvement #2 from the issue body is the gap). - -### New issues filed in v0.35.2 milestone -- None filed. Additional findings worth considering but not filed without user direction: - - **Gemini `jsonl.msgspec.invalid` on every run** — malformed JSON line rejected, run completes. Candidate for a new issue. - - **OpenCode `--format` flag vs archived 0.0.55 binary** — documented in #338 but worth an explicit incompat note in the runner or README. - - **Cosmetic: `/ctx set` error text has duplicated "usage:" substring** — minor. - ---- - -## Release readiness - -**Verdict at 2026-04-18 (rc3): 🟡 NO-GO pending two blockers, and one fix-forward.** - -**Blockers (resolved in subsequent rcs — see addendum below):** -1. **#225 Pi model footer regression (V7 FAIL)** — the headline fix of PR #327 does not surface the model in the footer live. Unit tests pass; live does not. Something between `translate_pi_event`'s supplementary `StartedEvent` emission and the footer render is broken. Needs investigation before tag. -2. **#318 V12 Telegram warning missing (partial)** — the issue was closed as "completed" but Proposed Improvement #2 (visible Telegram message) is not in the code. Closure is inconsistent with scope. Either re-open and ship the Telegram notification, or close-out with `status=partial` and update the issue body. - -**Fix-forward acceptable for this cut:** -- OpenCode #338 documented deprecation — no new Untether bug introduced; archived binary is upstream. -- Codex bwrap sandbox — environment/lba-1 issue, unrelated to Untether. -- Gemini `jsonl.msgspec.invalid` — non-fatal; worth triaging post-release. 
- -**Everything else v0.35.2 shipped (V1, V2, V4, V5, V6, V8, V10.1, V11, V13, V14, V15) is working as designed.** - -Recommend: address V7/#225 before bumping pyproject to 0.35.2 and tagging. V12/#318 can ship with updated scope / follow-up issue. - ---- - -## 📌 Addendum (2026-04-20, release-prep on f676d0e) - -Both rc3 blockers landed before the release tag: - -- **#225 Pi footer (V7) — FIXED** in PR #339 (commit `efa60a0`). The supplementary `StartedEvent` was being silently dropped by `JsonlSubprocessRunner.handle_started_event` as a same-session duplicate. The filter now emits duplicates through when the event carries `meta`; true duplicates (no meta) are still dropped. Live-verified on `@untether_dev_bot` — Pi footer shows the configured default model. Issue closed. -- **#318 restart-required visible warning — FIXED** in PR #336 + follow-up commit in PR #339. `_notify_restart_required` now broadcasts to every project chat and admin DM (the original `cfg.chat_id` send path failed silently in project-routed deployments). Live-verified on `@untether_dev_bot`. Issue closed. - -Both fixes are ancestors of the release commit `f676d0e` (`git merge-base --is-ancestor efa60a0 f676d0e` → ancestor). The CodeRabbit review on PR #373 flagged this addendum as needed because the original verdict above predated the rc4 fixes — verdict is now **🟢 GO** with all v0.35.2 milestone issues closed (31 closed, 0 open). - ---- - -## Known limitations of this run - -- OpenCode tests shifted to N/A after `--format` flag incompatibility surfaced — tests weren't repeated across the suite; 4-engine matrix instead of 5. -- Codex bwrap sandbox blocked every run — reported failures consistent with environment, not Untether. -- Tier 3 T6 (emoji entities), T8 (stale button 10-min wait), and V9 (process tree with node workerd) skipped for time. -- C1 (Bash approval flow) is auto-approved by current `allowed_tools=["Bash","Read"]` config — marked N/A rather than tested after config edit. 
-- Service was restarted twice during testing (branch switch to origin/dev, trigger initialization) — no cross-test contamination observed. -- AMP per user direction. - -## Config state at end - -Reverted `session_mode = "chat"` and removed test cron. Kept `show_subscription_usage = true` (opened for V10.1 verification; kept because the plan note in #316 suggests this is the preferred state). Backups at `~/.untether-dev/untether.toml.bak-v10`, `.bak-v11v15`, `.bak-v12`. - -Branch checkout: currently detached HEAD at `fe7dbb3` (origin/dev). `feature/198-env-allowlist` local branch preserved — user's original working branch. diff --git a/docs/tests/results/v0.35.2rc3-results.md b/docs/tests/results/v0.35.2rc3-results.md deleted file mode 100644 index 69d47bb..0000000 --- a/docs/tests/results/v0.35.2rc3-results.md +++ /dev/null @@ -1,252 +0,0 @@ -# v0.35.2rc3 integration test report — 2026-04-19 - -**Run window:** 2026-04-19 03:22:58Z → 03:51:37Z (~29 minutes; plan budgeted ~2.5h, abbreviated due to rate-limit back-off and sandbox-blocked config edits) -**Dev bot:** `@untether_dev_bot`, `untether-dev.service`, PID 1021286, editable install of commit `2e231d8` on branch `dev` (+ `git pull --ff-only` from `feature/346-wedge-detector-awareness`) -**Dev bot version:** `0.35.2rc3` (confirmed via `--version` after `pip install -e .` refresh — editable install reported stale rc2 until refreshed) -**Plan:** `/home/nathan/.claude/plans/please-use-the-0-35-2rc3-twinkling-crayon.md` (§§1–10) - -**Engines (CLI versions):** -- Claude Code `2.1.114` -- Codex CLI `0.121.0` -- OpenCode `0.0.55` (wrapper script at `~/.local/bin/opencode` that requires `BWS_ACCESS_TOKEN` — systemd unit doesn't provide it, so OpenCode fails on every run) -- Pi `0.67.68` -- Gemini `0.38.2` - ---- - -## Headline - -**Release readiness: NO-GO for final v0.35.2 — recommend rc4.** - -Primary blocker: **[#361](https://github.com/littlebearapps/untether/issues/361)** — a host credential (`BWS_ACCESS_TOKEN`) 
reaches Claude's Bash-tool subprocess despite `utils/env_policy.py` excluding it. #198 was the headline security fix of this release; its promise doesn't hold for Claude under realistic use. Pi is clean. - -Secondary finding: **[#362](https://github.com/littlebearapps/untether/issues/362)** — `/at` scheduled runs bypass the chat's project default engine and fall through to the global default. Functional, not blocking. - -All other rc3-specific scenarios (V16–V20) pass or are `N/A` for log-path reasons documented below. - ---- - -## Tier 7 — command smoke (5 engines × 13 commands = 65 interactions) - -**Result: 65/65 PASS** — every command responded cleanly, no crashes, no unexpected replies. - -| Command | Claude | Codex | OpenCode | Pi | Gemini | -|---|---|---|---|---|---| -| `/ping` | ✓ pong + uptime | ✓ | ✓ | ✓ | ✓ | -| `/config` | ✓ menu with Plan/Ask/Diff/Verbose/Cost/Resume/Trigger/Engine&Model/Effort/About | ✓ codex-specific menu | ✓ opencode-specific menu | ✓ pi-specific menu | ✓ gemini-specific menu | -| `/usage` | ✓ subscription table (5h/7d/Sonnet/Extra) | ✓ "not available for codex" | ✓ "not available" | ✓ "not available" | ✓ "not available" | -| `/export` | ✓ "no session" (no sessions run yet) | ✓ | ✓ | ✓ | ✓ | -| `/browse` | ✓ dir listing w/ buttons | ✓ | ✓ | ✓ | ✓ | -| `/verbose` | ✓ | ✓ | ✓ | ✓ | ✓ | -| `/cancel` | ✓ "nothing running" | ✓ | ✓ | ✓ | ✓ | -| `/planmode` (Claude only) | ✓ toggle confirmation | n/a | n/a | n/a | n/a | -| `/stats` | ✓ | ✓ | ✓ | ✓ | ✓ | -| `/ctx` | ✓ resolved ctx | ✓ | ✓ | ✓ | ✓ | -| `/agent` | ✓ | ✓ | ✓ | ✓ | ✓ | -| `/trigger` | ✓ | ✓ | ✓ | ✓ | ✓ | -| `/file` | ✓ usage help | ✓ | ✓ | ✓ | ✓ | - ---- - -## Tier 1 — universal (abbreviated) - -Not a full U1-U10 × 5 engines matrix (plan budgeted 45 min; actual time did not allow). 
Equivalent coverage via: - -- **U1 (create file)** fired on Claude/Codex/OpenCode/Pi/Gemini -- **U7 (error handling)** — Claude V2 prompt includes reading `/nonexistent/test-path` → graceful "File does not exist", no traceback -- **U8 (/usage)** — covered in Tier 7 -- **U6 (cancel)** — naturally tested when Gemini hung; `/cancel` produced clean `session.summary cancelled=True`, subprocess killed, no orphans - -| Engine | U1 result | Footer | Notes | -|---|---|---|---| -| Claude | ✓ (via V2, V17, V20 runs) | `🏷 dir: claude-test \| opus 4.7 (1M) · xhigh · plan/acceptEdits` · `💰...` · `⚡ 5h: N% \| 7d: N%` | V17 xhigh wired through | -| Codex | ✓ 2+2 = 4 (clean); U1 file-create blocked by Codex's own sandbox (not rc3) | `🏷 dir: codex-test \| codex-mini-latest` | | -| OpenCode | ✗ `BWS_ACCESS_TOKEN: unbound variable` — pre-existing env gap in dev systemd (`~/.local/bin/opencode` wrapper needs BWS). **Not rc3 related.** | n/a | Pre-existing per 2026-04-18 logs | -| Pi | ✓ 2+2 (8s); hello.txt (9s) | `🏷 dir: pi-test \| gpt-5.4` | **V7 (#225) confirmed: model name in footer from JSONL** | -| Gemini | ⚠ upstream slowness — 2+ mins to emit first event on trivial prompts; cancelled cleanly | `🏷 dir: gemini-test` (starting) | Not an Untether regression | - ---- - -## Tier 2 — Claude interactive (abbreviated) - -Claude Code `2.1.114`'s plan mode refuses `Write` at the text level without ever calling the `Write` tool, so the classic C1/C2/C5 Approve/Deny/Diff-preview flow does **not** fire on simple write prompts. Coverage adjusted: - -| # | Result | Notes | -|---|---|---| -| C1 (tool approval) | ✓ by design | `ls -la` auto-approved because `[engines.claude] allowed_tools = ["Bash","Read"]`. No regression — matches config intent. | -| C2 (deny) | n/a | Same reason as C1 | -| C3 (plan outline) | ✓ historical | Prior cron fire (msg 54399, 2026-04-18) shows outline flow + ExitPlanMode approve path working. 
| -| **C4 (AskUserQuestion + option buttons)** | **✓ full live pass** | Claude emitted AskUserQuestion during the "add division to calculator" prompt; 5 option buttons rendered (`Floor division //`, `Integer-only divide`, `Didn't realise it exists`, `Other (type reply)`, `cancel`); pressed one → `callback.answered early=True has_toast=True latency_ms=217ms`; Claude continued and completed cleanly. | -| C5 (diff preview) | n/a | Would require Claude to actually call Edit/Write — plan mode declines at text level. | -| C6 (rapid approve/deny) | n/a | Same reason. | -| C7 (/usage subscription) | ✓ | Tier 7 Q3: `⚡ 5h: 45% · Weekly 27% · Sonnet 0% · Extra $31,824 used`. | - ---- - -## v0.35.2 scenarios (V1–V15 — rc1/rc2 scope) - -| # | Issue | Result | Evidence | -|---|---|---|---| -| V1 | #196 `bot_token` → SecretStr | ✓ | `journalctl` over 15 min window: 0 matches for bare token regex `\d{5,}:[A-Za-z0-9_-]{35}`. Token appears masked as `bot***` in all HTTP error logs. | -| V2-pi | #198 env allowlist (Pi) | ✓ | Pi printenv: `BWS=`, `AWS=`, `DB=`, `STRIPE=`; only allowlisted keys present. | -| **V2-claude** | **#198 env allowlist (Claude)** | **✗ FAIL → [#361](https://github.com/littlebearapps/untether/issues/361)** | Claude's Bash tool observes `BWS_ACCESS_TOKEN=` despite `env_policy.py` not including BWS. Same test on Pi was clean. Token value redacted in this report — real value reached Telegram transcript + journald + /tmp and **should be rotated**. | -| V3 | #199 Codex HTML escape | n/a | Could not force Codex auth error on demand during this window. | -| V4 | #201 dispatch sanitisation | ✓ | `/ctx set` (no args) → friendly "usage:" reply, no traceback; `/file get /etc/passwd` → "invalid download path.", no path disclosure. | -| V5 | #203 registry sweep (1h TTL) | ✓ (smoke) | 0 spurious sweep events; test window < 1h so no fire expected. | -| V6 | #204 download_file URL validation | ✓ (smoke) | 0 `download_file.rejected` events on legit paths during test. 
| -| **V7** | **#225 Pi model footer from JSONL** | **✓ live confirmed** | `🏷 dir: pi-test \| gpt-5.4` — `gpt-5.4` pulled from Pi `message_end` JSONL (no `pi.model` override in cfg). | -| **V8** | **#247 `callback.answered` instrumentation** | **✓ full live pass** | Multiple fires during /config menu navigation + C4 AskUserQuestion: `latency_ms` 188–219ms (all <2000ms), `early=True` for non-toggle actions, `has_toast=True` when toast shown, `command=config`/`command=aq` both recorded. | -| V9 | #275 process tree cleanup | ✓ (smoke) | No orphan workerd/vitest/node processes after test window; untether-dev children count = 0 at teardown; Gemini /cancel killed node subprocess cleanly in 10s (SIGTERM → SIGKILL escalation). | -| V10 | #316 cost footer parity | ✓ (partial live) | **Claude**: `💰$0.93 · 1 tn · 4.0s · 6/30 · ⚡ 5h: 51% · 7d: 27%` — both API cost and subscription usage rendered. **Pi**: `🏷 dir: pi-test \| gpt-5.4` (no cost line, expected — Pi uses provider). **Codex**: `🏷 codex-test \| codex-mini-latest` clean. **Gemini**: didn't complete successfully this run; historical msg 54404 shows `💰4.1s · 66.3k/66`. **OpenCode** V10.3 not exercisable (env-blocked). | -| V11 | #317 `run_once` cron persistence | ✓ historical | Prior cron `v0352-test-once` (2026-04-18 03:34:57) fired once, didn't re-fire across the 2026-04-18 → 2026-04-19 restart cycle. Sandbox blocked new cron-config edit this run. | -| V12 | #318 restart-required warning | ✓ historical | Chat history shows 8 `⚠️ Config reload` / `⟳ Setting session_mode changed` warnings across 2026-04-18 flip cycles — both forms (original + follow-up broadcast) visible, consistent with #318 + follow-up. Sandbox blocked new live edit. | -| V13 | #320 webhook port graceful bind | n/a | Requires port squatter + restart — disruptive. Deferred. 
| -| **V14** | **#322 stuck-after-tool_result no-false-positive** | **✓ live confirmed** | 0 `stuck_after_tool_result` events in the 30-min window despite: Claude rate-limiting 6×, a 20s background bash primitive, Gemini 4-min idle before /cancel, and multiple tool-use runs. | -| V15 | #330 per-cron permission_mode | ✓ historical | Prior cron ran Claude plan-mode chat autonomously (no approval buttons in history) — consistent with `permission_mode = "auto"` override. Sandbox blocked new cron-config edit this run. | - ---- - -## RC3 scenarios (V16–V20 — rc3 additions) - -| # | Issue | Result | Evidence | -|---|---|---|---| -| **V16** | **#348 `/health` command** | **✓ live PASS (5 engines, partial render)** | All 5 engines return the same compact 6-line report: RAM / swap / untether pid+RSS+FDs+children / triggers status / today's API cost / uptime. **Missing vs plan expectations** (likely by design for idle state): no explicit live-sessions table, no stall-watchdog status row, no subprocess-by-type breakdown. "triggers: disabled" shown for disabled cfg — matches. Re-run after a live Claude session showed `children: 1` — process-tree count correctly updates. No tracebacks. | -| **V17** | **#351 xhigh effort** | **✓ full live PASS** | `/config → 🧠 Effort` menu lists buttons: Low / Medium / High / **Xhigh** / Max / Clear override. Press on `Xhigh` → toast `Reasoning: xhigh`. Subsequent Claude spawn includes `'--effort', 'xhigh'` in `args=[...]` (confirmed in `subprocess.spawn` log). Telegram footer shows `🏷 dir: claude-test \| opus 4.7 (1M) · xhigh · plan/acceptEdits`. | -| **V18** | **#349 rate_limit_event surfacing** | **✓ full live PASS** | Fired 6× during the test window. Progress message renders as `✓ ⏳ Rate limited — waiting to retry` — visible to the user, not a silent cancel. Structured log: `claude.rate_limit_event count=1 cumulative_s=0.0 retry_after_s=None session_id=...`. 
| -| V19 | #350 RAM guard | ✓ (smoke) | 0 `ram_guard.warn` / `ram_guard.block` events on healthy host (19GB available, 37% used). Adversarial test skipped. | -| **V20** | **#346/#347 bg-task tracking + stuck gating** | **✓ full live PASS** | Claude prompt with `Bash run_in_background=true` + 20s sleep + `TaskOutput` block; duration 44s, 14 turns, 958 tokens; bg task completed cleanly (`done_at_1776569307`). **Zero `stuck_after_tool_result` events** during the bg window despite tool_result delay — `has_live_background_work()` gate correctly suppressed the detector. Per-session tracking in `state.live_bg_bashes` is in-memory only (no log emission by design — `background_task_summary()` v1 computes but footer-wiring is deferred to v2 per code comments). | - ---- - -## Real-life cross-engine sweep (§4) - -- **Per-engine workflow (§4.1)**: abbreviated; natural coverage from V-scenario runs on Claude, Pi, Codex. -- **Cross-engine concurrency (§4.2)**: ✓ ran Claude + Codex + Gemini + Pi simultaneously on "create hello.txt" prompts; independent session IDs per engine, no cross-contamination in footers. -- **Real-user commands (§4.3)**: - - `/continue` on Codex chat: ✓ resumed session `019d8aa9-…` cleanly. - - **`/at 90s ...` on Pi chat: ⚠ wrong engine used** — scheduled prompt fired on codex (global default) instead of pi (project default). **Filed as [#362](https://github.com/littlebearapps/untether/issues/362).** Functional but surprising. - - `/config` menu navigation: covered in Tier 7 + V17. - - `/health` dynamic: covered in V16. - - `/verbose`, `/agent`, `/ctx`, `/file`: covered in Tier 7. - ---- - -## Tier 3 / Tier 5-6 (selective) - -| # | Result | Notes | -|---|---|---| -| T6 emoji entities | n/a | Not exercised this run. | -| T8 stale button click | n/a | Would need >10min idle before an aged Approve click. Deferred. | -| S9 concurrent button clicks (#197 LRU) | n/a | No clean approval flow to test on (plan mode declines at text level). 
| -| **B4 SIGTERM drain** | n/a | Not exercised — Untether-dev was not restarted after test start. | -| **B5 log sweep** | **✓** | See below — 0 unexpected WARNING/ERROR events. | -| S2 concurrent sessions | ✓ | Multiple engine chats handling overlapping runs without contamination. | -| S3 /restart mid-run | n/a | Not exercised. | -| S7 rapid-fire | ✓ historical | 2026-04-18 "rapid 1…5" msgs show coalesce-to-one semantics. | - ---- - -## Logs - -### Error / warning sweep (30 min window, 03:22 → 03:52Z) - -``` -Errors (5 total, all pre-existing): - 2× project.skipped.chat_id_matches_transport alias=z80 chat_id=123 (pre-existing cfg conflict; z80 project uses same chat_id as dummy transport) - 2× telegram.http_error 400 "chat not found" (startup notification to dummy transport chat_id=123) - 1× opencode.process.failed rc=1 (BWS_ACCESS_TOKEN unbound — pre-existing dev-env gap, known since 2026-04-18) - -Warnings (5 total, companions of the errors above): - 2× projects.config.skipped_projects skipped=['z80'] - 2× transport.send.failed chat_id=123 - 1× session.summary.no_events (opencode; expected given rc=1) -``` - -**None of these are rc3 regressions.** All have histories predating this test run. 
- -### RC3-specific events fired - -- `rate_limit_event`: **6** (Claude rate limits; each rendered in Telegram ✓) -- `callback.answered`: **3** (config buttons + aq option; all `latency_ms` < 250ms, `early=True` where expected) -- `ram_guard.warn|block`: **0** (healthy host) -- `stuck_after_tool_result`: **0** (no false positives) -- `bind_failed`: **0** -- `health.command`: **0** explicit structlog events (command works via direct handler — /health renders output without a dedicated structlog event name; renders confirmed via Telegram) -- `config.reload.restart_required`: **0** this run (sandbox blocked live edit; historical evidence present) -- `background_task_summary`: **0** (by design — #347 v1 doesn't emit, per code comments) - -### Resource sanity - -| | Preflight | Teardown | Δ | -|---|---|---|---| -| untether-dev FD count | 13 | 12 | −1 | -| Zombies | 0 | 0 | 0 | -| untether-dev children | 0 | 0 | 0 | -| Workerd/vitest orphans | 0 | 0 | 0 | -| Free memory | 19.3 GB | 20.0 GB | +0.7 GB | - -All resource counters clean — no FD leak, no zombie accumulation, no orphaned subprocess tree. - ---- - -## Bugs filed / commented - -- **[#361](https://github.com/littlebearapps/untether/issues/361)** — Claude Bash tool sees `BWS_ACCESS_TOKEN` despite #198 env allowlist (`bug`, `security`, milestone `v0.35.2`). **Potential release blocker.** -- **[#362](https://github.com/littlebearapps/untether/issues/362)** — `/at` scheduled run uses global default engine instead of chat/project default (`bug`, milestone `v0.35.2`). - -### Commented on existing issues - -- None this run. All rc3-scope tests that passed either pinned to new rc3 issues (which were already closed/merged) or to historical evidence for rc1/rc2 scope. `#198` is the only relevant closed issue with a fresh failure, which is why **[#361](https://github.com/littlebearapps/untether/issues/361)** was filed as a new issue rather than a re-open comment. 
- -### Credential rotation - -- `BWS_ACCESS_TOKEN` surfaced in full verbatim in: Claude's Telegram response (msg 55311), untether-dev journald logs, and `/tmp/claude-1000/-home-nathan-untether/.../tasks/*.output`. **Recommend rotation.** -- `OPENAI_API_KEY` (systemd-set) also printed verbatim in the same response. Recommend rotation if it's not throwaway. - ---- - -## Release readiness - -### Go / no-go - -**NO-GO for v0.35.2 final until [#361](https://github.com/littlebearapps/untether/issues/361) is resolved or explicitly dispositioned.** - -Rationale: the security bundle #326 (which includes #198) is a headline of v0.35.2. Shipping final with a demonstrable case where a third-party host token reaches Claude's subprocess undermines that headline. Pi's implementation is clean — it's Claude-specific. - -### Recommended path - -1. **Investigate #361 root cause** (see issue for hypothesis + action checklist). -2. Fix + add a runtime assertion that Claude's child env contains only allowlisted names. -3. Bump to `0.35.2rc4`, re-publish to TestPyPI, re-run V2 on Claude. -4. Also address [#362](https://github.com/littlebearapps/untether/issues/362) or explicitly defer to v0.35.3 (functional, not blocking). -5. Once V2-claude passes, cut v0.35.2 final. - -### What DOES work in rc3 - -All rc3-specific features (V16–V20) passed. The rc3 payload itself is solid: -- `/health` (#348) renders on all engines -- xhigh effort (#351) wired end-to-end through Claude CLI -- `rate_limit_event` (#349) visible and consistent -- RAM guard (#350) silent on healthy host (no false positives) -- Background-task gating (#346/#347) correctly suppresses wedge detector during live bg primitives - -The release blocker is inherited scope (#198) surfaced by rc3's testing rigour, not something rc3 introduced. - ---- - -## Plan deviations / limitations - -- **Sandbox blocked live config edits** for V11 / V12 / V15 (session_mode flip, cron add). Fell back to historical evidence from 2026-04-18 runs. 
All three show expected behaviour historically; fresh rc3 verification is preferred. -- **V3, V13, T6, T8, S9, B4, S3** not exercised (either time or non-trivial preconditions). Reviewable next run or before final cut. -- **OpenCode fully skipped** after env-blocked first prompt. Pre-existing wrapper-script / systemd gap, not rc3. -- **Gemini slow** on simple prompts (>240s idle). Cancelled cleanly. Upstream provider, not Untether. -- **Test window ~29 min** vs 2.5h budget — abbreviated by prioritising the most informative tests (rc3 features, security-scope V1/V2/V4, multi-engine Tier 7) over a full 50-run Tier 1 matrix. - ---- - -## Appendix — pinned versions + config - -- `pyproject.toml`: `version = "0.35.2rc3"` (commit `2e231d8`, branch `dev`) -- Dev config: `/home/nathan/.untether-dev/untether.toml` (unedited this run; backup at `.rc3test.bak`) -- Preflight config check: `session_mode = "chat"`, `triggers.enabled = false`, `[engines.claude] permission_mode = "plan"` `allowed_tools = ["Bash","Read"]`, `[engines.pi] provider = "openai-codex"` (no `model` override — V7 critical). -- Bot token, allowed_user_ids, project chat_ids all confirmed correct at preflight. diff --git a/docs/tests/v0.35.2-integration-test-plan.md b/docs/tests/v0.35.2-integration-test-plan.md deleted file mode 100644 index c244efc..0000000 --- a/docs/tests/v0.35.2-integration-test-plan.md +++ /dev/null @@ -1,552 +0,0 @@ -# v0.35.2 Integration Test Plan - -**Target:** Untether v0.35.2 (unreleased) -**Scope:** Claude Code, Codex CLI, OpenCode, Pi, Gemini CLI (AMP deferred — sign-in blocked) -**Bot:** `@untether_dev_bot` (dev service — `untether-dev.service`) -**Source of truth for tiers:** `docs/reference/integration-testing.md` -**Filed/executed by:** Claude Code via Telegram MCP + Bash tools - -This plan is scoped to what actually shipped in v0.35.2. 
It maps every issue that landed in the milestone to at least one concrete integration test so a failure can be pinned back to its commit within minutes. Unit tests already cover the code paths — this plan verifies the live behaviour through the Telegram bridge. - ---- - -## 0. Bug-handling protocol (read first) - -When any test below fails, is suspicious, or surfaces unexpected behaviour: - -### Map the failure to a v0.35.2 issue - -Each test table row has a **"related issue(s)"** column. If the failure correlates with one of those issues, **comment on that existing GitHub issue** with: - -- Failing test id (e.g. `U1-codex`, `V2-317`) -- Engine + chat id -- Timestamp (UTC) of the failing interaction -- What the test expected vs what happened -- One-liner from `journalctl --user -u untether-dev --since '…'` if relevant -- Telegram message id(s) so the evidence is retrievable - -Template: -```markdown -**v0.35.2 integration test — test `` failed** - -- Engine: -- Chat ID: -- UTC: -- Expected: -- Observed: -- Log excerpt: `` -- Telegram msg_id(s): <…> - -Flagged during v0.35.2 integration testing for investigation before release cut. -``` - -### Create a new issue when no existing one fits - -If the failure is **not covered** by an open v0.35.2 issue (i.e. it's genuinely new), file a new issue against the `v0.35.2` milestone with label `bug` and the same evidence block above. Use the `bug` label, add the `v0.35.2` milestone, and cross-link from the test report (Section 9). - -### Distinguish Untether bugs from upstream engine issues - -If the root cause is an engine CLI problem (auth, quota, provider outage, upstream regression), tag the test result as **upstream** in the final report but do NOT file an Untether issue. Examples: Anthropic 529, Google quota, OpenCode deprecation notices, Pi provider auth. - ---- - -## 1. 
Preflight - -### 1.1 Reinstall OpenCode - -OpenCode was archived upstream 2025-09-18 (see issue [#338](https://github.com/littlebearapps/untether/issues/338) for context). Confirm what binary is present, then install/upgrade: - -```bash -which opencode && opencode --version 2>&1 | head -3 -# If missing, consult latest installer docs from opencode-ai/opencode (archived repo, release assets still hosted) -# A typical reinstall path: -# curl -fsSL https://opencode.ai/install | bash -# OR: pipx install opencode-ai (if that distribution still pulls) -# Verify: -opencode --version -``` - -Record the installed version in Section 9 alongside each engine's version for release notes. - -### 1.2 Verify dev service health - -```bash -systemctl --user status untether-dev --no-pager | head -20 -journalctl --user -u untether-dev --since "5 minutes ago" | grep -E "READY|startup|ERROR" | head -20 -uv run pytest -q 2>&1 | tail -3 # 2292 passing at 80%+ coverage -git log --oneline origin/dev ^master | head -12 # Confirm v0.35.2 commits are on dev -``` - -### 1.3 Snapshot versions - -Run once and pin in Section 9: - -```bash -for cli in claude codex opencode pi gemini; do - echo "=== $cli ===" - $cli --version 2>&1 | head -2 || echo "not installed" -done -grep -E '^version' /home/nathan/untether/pyproject.toml -cat ~/.untether-dev/untether.toml | head -30 # cfg snapshot for the report -``` - -### 1.4 Start a background log tail - -Run in a second terminal (or via background Bash): - -```bash -journalctl --user -u untether-dev -f -``` - -Keep this running throughout testing — screenshots of context go in bug reports. 
- -### 1.5 Test chats (already configured) - -| Engine | Chat ID | Bot API chat_id | Test project cwd | -|--------|---------|-----------------|------------------| -| Claude Code | `5284581592` | `-5284581592` | `test-projects/test-claude` | -| Codex CLI | `4929463515` | `-4929463515` | `test-projects/test-codex` | -| OpenCode | `5200822877` | `-5200822877` | `test-projects/test-opencode` | -| Pi | `5156256333` | `-5156256333` | `test-projects/test-pi` | -| Gemini CLI | `5207762142` | `-5207762142` | `test-projects/test-gemini` | -| ~~AMP~~ | `5230875989` | — | *skipped — auth blocked* | - -If a positive chat ID fails with `GEN-ERR-582`, use the negative Bot API form. - ---- - -## 2. v0.35.2 change → test map - -Every issue landed in v0.35.2. For each, the table gives the minimum live test. If a test fails, comment on the listed issue. - -| v0.35.2 issue | What landed | Primary test(s) | Engines | -|---|---|---|---| -| [#195](https://github.com/littlebearapps/untether/issues/195) | CI matrix `env:` fix | CI-only — SKIP runtime | — | -| [#196](https://github.com/littlebearapps/untether/issues/196) | `bot_token` → `SecretStr` | `V1` | any | -| [#197](https://github.com/littlebearapps/untether/issues/197) | `_HANDLED_REQUESTS` LRU | `S9` (concurrent Approve click) | Claude | -| [#198](https://github.com/littlebearapps/untether/issues/198) | Env allowlist (Claude+Pi) | `V2` | Claude, Pi | -| [#199](https://github.com/littlebearapps/untether/issues/199) | Codex HTML escape in auth error | `V3` | Codex | -| [#200](https://github.com/littlebearapps/untether/issues/200) | Voice transcription sanitisation | `T1-err` | any | -| [#201](https://github.com/littlebearapps/untether/issues/201) | Command dispatch sanitisation | `V4` | any | -| [#202](https://github.com/littlebearapps/untether/issues/202) | Bandit global skips removed | CI-only — SKIP runtime | — | -| [#203](https://github.com/littlebearapps/untether/issues/203) | Registry sweep (1-hour TTL) | `V5` — log assertion 
only | any | -| [#204](https://github.com/littlebearapps/untether/issues/204) | `download_file` URL validation | `V6` — log-only smoke | any | -| [#225](https://github.com/littlebearapps/untether/issues/225) | Pi model footer from JSONL | `V7` | **Pi** | -| [#247](https://github.com/littlebearapps/untether/issues/247) | `callback.answered` latency log | `V8` | Claude | -| [#275](https://github.com/littlebearapps/untether/issues/275) | Process tree cleanup | `V9` | Claude | -| [#316](https://github.com/littlebearapps/untether/issues/316) | Cost footer accuracy + parity | `V10` | Claude, Gemini, OpenCode | -| [#317](https://github.com/littlebearapps/untether/issues/317) | `run_once` cron persistence | `V11` | any (cron) | -| [#318](https://github.com/littlebearapps/untether/issues/318) | Restart-required Telegram warning | `V12` | any | -| [#320](https://github.com/littlebearapps/untether/issues/320) | Webhook port graceful bind | `V13` | any | -| [#322](https://github.com/littlebearapps/untether/issues/322) | Stuck-after-tool_result detector | `V14` — no-false-positive | Claude, Codex, Gemini, OpenCode, Pi | -| [#330](https://github.com/littlebearapps/untether/issues/330) | Per-cron `permission_mode` | `V15` | Claude | - -V-tests are specified in detail in Section 6. - ---- - -## 3. Tier 7 — command smoke (all engines, ~5 min) - -Fire these once in **each** of the 5 engine chats via `send_message` then verify the immediate response via `get_history` after ~2s. 
- -| # | Command | Expected | Notes | -|---|---------|----------|-------| -| Q1 | `/ping` | Pong line + uptime + trigger summary (if any) | Also checks rc4 trigger visibility in footer | -| Q2 | `/config` | Inline keyboard menu | Press Back to close | -| Q3 | `/usage` | Usage dict OR "no session yet" | No crash | -| Q4 | `/export` | Export link OR "no session yet" | No crash | -| Q5 | `/browse` | Inline keyboard with dirs/files | `list_inline_buttons` returns > 0 | -| Q6 | `/verbose` | Toggle confirmation | | -| Q7 | `/cancel` | "Nothing running" | | -| Q8 | `/planmode` (Claude chat only) | Mode toggle | | -| Q9 | `/stats` | Stats or empty | | -| Q10 | `/ctx` | Context line | | -| Q11 | `/agent` | Current engine default | | -| Q12 | `/trigger` | Current trigger mode | | -| Q13 | `/file` | Usage help | | - -Any command that crashes, hangs, or returns a raw Python traceback → **new issue** in v0.35.2 milestone. - ---- - -## 4. Tier 1 — universal (U1-U10, all 5 engines, ~45 min) - -Run **every U-test in every engine chat**. This is the regression backstop. 10 tests × 5 engines = 50 runs. - -For each run: `send_message(chat_id, prompt)`, sleep ~5-15s (depends on engine), then `get_history` and verify via text matching + `list_inline_buttons` where relevant. 
- -| # | Prompt / action | Verify | Related v0.35.2 issue(s) | -|---|---|---|---| -| U1 | `create a file called hello.txt with "hello world"` | Progress phases appear, final answer renders, footer shows **model name** + **cost line** + resume line | #316 (cost parity), #225 (Pi model) | -| U2 | `list files here, then read README if present` | Multiple action phases visible in verbose | — | -| U3 | `write a detailed explanation of how TCP/IP works, at least 2000 words` | Message splits across multiple Telegram messages; footer only on the last | — | -| U4 | Reply to U1's resume line: `now rename hello.txt to greetings.txt` | Session continues, resume token accepted | — | -| U5 | `/config` → Model → pick a non-default, send a prompt | Footer reflects new model | — | -| U6 | Send U3 prompt then `/cancel` within 10s | Run stops, completion notice, **no orphan processes** | #275 | -| U7 | `read /nonexistent/file/path` | Error renders cleanly in Telegram, no raw traceback | #200, #201 | -| U8 | `/usage` after U1 | Cost info (Claude/Gemini/OpenCode) or no-cost notice (Codex/Pi) | #316 | -| U9 | `/export` after U1 | Markdown export download works | — | -| U10 | `/browse` | Inline keyboard, navigate one dir deep and back | — | - -**Extra Pi-specific check on U1** (related to #225): Pi chat must **not** have `/model set` override active (run `/agent` + `/model` first, clear if set). The footer must show the model name pulled from Pi's JSONL `message_end`, e.g. `gpt-5.4`. If the footer shows only `🏷 dir: pi-test` with no model → comment on [#225](https://github.com/littlebearapps/untether/issues/225). - ---- - -## 5. Tier 2 — Claude interactive (Claude chat only, ~15 min) - -Plan mode ON. `/planmode plan` before starting. 
- -| # | Prompt / action | Verify | Related | -|---|---|---|---| -| C1 | `run ls -la` | Approve/Deny/Discuss buttons appear; click Approve; command executes | — | -| C2 | Same prompt, click Deny | Denial message reaches Claude cleanly | — | -| C3 | `refactor main.py into smaller modules` → "Pause & Outline Plan" | Outline text renders; Approve/Deny/Discuss buttons auto-appear | — | -| C4 | `should I use TypeScript or JavaScript for this project?` | AskUserQuestion with option buttons | — | -| C5 | With plan mode, prompt that edits a file | Diff preview (old/new) in approval message | — | -| C6 | Approve one tool, **quickly** deny the next | No stale button, no spinner hang | #197 | -| C7 | `/usage` with `[footer]` enabled in cfg | 5h/weekly subscription footer | #316 | - ---- - -## 6. v0.35.2 scenarios (the payload) - -These verify the **specific fixes** landed this release. Most are one-shot and map 1:1 to an issue. - -### V1 — `bot_token` masking (#196) - -Send a spurious command that forces a bridge error and prompts structlog output (e.g. `/file get /etc/passwd`, expect path-denied). - -```bash -journalctl --user -u untether-dev --since "5 minutes ago" \ - | grep -iE "bot_token|token=.{20}" -``` - -**Pass:** no raw bot token (`\d+:[A-Za-z0-9_-]{35}`) appears in any log line. -**Fail:** token visible → comment on [#196](https://github.com/littlebearapps/untether/issues/196). - -### V2 — env allowlist for Claude + Pi (#198) - -In the Claude chat, send: -``` -run `printenv | sort` and report what's present. Then run `echo "AWS=$AWS_ACCESS_KEY_ID DB=$DATABASE_URL STRIPE=$STRIPE_API_KEY"`. -``` - -**Pass:** engine output shows essentials (PATH, HOME, CLAUDE_* vars, NODE_*, UV_*, ANTHROPIC_*) and the last echo shows `AWS= DB= STRIPE=` (all empty). No random third-party tokens leak in. -**Fail (Claude or Pi):** any non-allowlisted env var is visible → comment on [#198](https://github.com/littlebearapps/untether/issues/198). 
-Repeat in the Pi chat with the equivalent prompt. (Codex/OpenCode/Gemini keep the default inherit — no change expected, not a regression.)
-
-### V3 — Codex HTML escape (#199)
-
-In the Codex chat, trigger an auth error on purpose:
-```bash
-# Temporarily invalidate codex auth (if safe), OR send a command known to produce a subprocess error that surfaces as HTML to Telegram
-```
-
-**Pass:** error message renders as plain text inside `
`, no `` / `` bleed-through into the Telegram render.
-**Fail:** HTML entities from the subprocess output are interpreted as Telegram entities → comment on [#199](https://github.com/littlebearapps/untether/issues/199).
-
-### V4 — Dispatch sanitisation (#201)
-
-Send a malformed command that will raise in the dispatch layer: e.g. a `/ctx set` missing args, `/file get` with an absolute path outside project, or a forwarded message from an unknown chat.
-
-**Pass:** Telegram reply is a short friendly error. No absolute paths, no URLs, no raw stack traces visible.
-**Fail:** `/home/nathan/…` or `Traceback` appears in the Telegram reply → comment on [#201](https://github.com/littlebearapps/untether/issues/201).
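
The V4 pass/fail criteria can be approximated mechanically. The sketch below is a hypothetical helper (`find_leaks` is not part of Untether) that scans a Telegram reply for the three leak classes V4 names: absolute filesystem paths, URLs, and Python tracebacks. The list of path roots is an assumption, kept deliberately narrow so that slash commands such as `/ctx` are not flagged.

```python
import re

# Hypothetical leak patterns for V4; the path roots are an assumption, not taken from Untether.
LEAK_PATTERNS = {
    "absolute_path": re.compile(r"(?<![\w.])/(?:home|etc|usr|var|tmp|root)/\S*"),
    "url": re.compile(r"https?://\S+"),
    "traceback": re.compile(r"Traceback \(most recent call last\)"),
}


def find_leaks(reply: str) -> list[str]:
    """Return the names of the leak classes found in a Telegram reply, in a stable order."""
    return [name for name, pattern in LEAK_PATTERNS.items() if pattern.search(reply)]
```

An empty list corresponds to a V4 pass; any entry is evidence worth attaching to the issue comment.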
-
-### V5 — Registry sweep (#203) — log-only smoke
-
-The sweep only prunes entries older than its 1-hour TTL, which is impractical to wait for during the run. Instead, verify that the sweep machinery exists and fires:
-
-```bash
-journalctl --user -u untether-dev --since "1 hour ago" | grep -E "registries\.sweep|ephemeral\.swept|outline\.swept"
-```
-
-**Pass:** at least one sweep log event, OR the process has been running for less than 1 hour (the sweep runs on the 60-second stall-monitor tick but only prunes entries at least 1 hour old).
-**Fail:** logs show registries growing unbounded over multiple sessions with no sweep events → comment on [#203](https://github.com/littlebearapps/untether/issues/203).
-
-### V6 — `download_file` URL validation (#204)
-
-A malicious `getFile` response cannot be crafted from the MCP, so this is a log-only smoke test: send a normal file upload with `/file put README.md` and confirm that the download path inside Untether never emits a validation warning for the legitimate case.
-
-```bash
-journalctl --user -u untether-dev --since "5 minutes ago" | grep -E "download_file.(rejected|invalid)"
-```
-
-**Pass:** no `download_file.rejected` events for legitimate uploads.
-**Fail:** false positives on legit paths → comment on [#204](https://github.com/littlebearapps/untether/issues/204).
-
-### V7 — Pi model footer from JSONL (#225)
-
-**Critical for this release** — this change just merged (PR #327).
-
-In the Pi chat:
-1. `/agent clear` (ensure Pi is the engine), `/model clear` (ensure no override).
-2. Confirm `pi.model` is **not** set in `~/.untether-dev/untether.toml`:
-   ```bash
-   grep -A2 '\[runners.pi\]' ~/.untether-dev/untether.toml || echo "no [runners.pi] section"
-   ```
-   If model is set, remove it temporarily for this test and restart dev.
-3. Send: `what is 2+2`.
-4. Wait for the final message, inspect the footer.
-
-**Pass:** footer shows the model name, e.g. `🏷 dir: pi-test | gpt-5.4` (or whichever model Pi actually used).
-**Fail:** footer shows only `🏷 dir: pi-test` with no model name → comment on [#225](https://github.com/littlebearapps/untether/issues/225).
-
-### V8 — `callback.answered` instrumentation (#247)
-
-In the Claude chat with plan mode:
-1. Send `run echo hi`.
-2. When Approve/Deny/Discuss buttons appear, use `press_inline_button` to click Approve.
-3. Within 10 seconds:
-   ```bash
-   journalctl --user -u untether-dev --since "1 minute ago" | grep callback.answered
-   ```
-
-**Pass:** at least one `callback.answered` entry with keys `latency_ms`, `total_ms`, `early=true`, `has_toast`; `latency_ms` should be below 2000 under healthy conditions.
-**Fail:** no log entry, or `latency_ms` suspiciously high (> 5000) and correlates with a `BotResponseTimeoutError` → comment on [#247](https://github.com/littlebearapps/untether/issues/247).
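The latency thresholds in V8 can be applied to the log output with a short filter. The sketch below assumes the structlog line renders the field as `latency_ms=<integer>`; the helper names are hypothetical and the middle band (2000 to 5000) is reported for manual review because the criteria above do not classify it.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of Untether). Assumes each callback.answered
# log line carries an integer field rendered as latency_ms=<number>.

# Print the largest latency_ms value found on stdin (nothing if none).
max_latency_ms() {
  grep -oE 'latency_ms=[0-9]+' | cut -d= -f2 | sort -n | tail -n 1
}

# Classify stdin against the thresholds in the pass/fail criteria.
callback_verdict() {
  local max
  max=$(max_latency_ms || true)
  if [[ -z $max ]]; then
    echo "FAIL: no latency_ms found"
    return 1
  elif (( max < 2000 )); then
    echo "PASS: ${max} ms"
  elif (( max > 5000 )); then
    # Still needs correlating with a BotResponseTimeoutError by hand.
    echo "SUSPECT: ${max} ms"
  else
    echo "REVIEW: ${max} ms"
  fi
}
```

Usage: `journalctl --user -u untether-dev --since "1 minute ago" | grep callback.answered | callback_verdict`.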
-
-### V9 — Process tree cleanup (#275)
-
-In the Claude chat:
-1. Send: `create a node project in /tmp/workerd-test- with @cloudflare/vitest-pool-workers and run one quick test`.
-2. While it's running (tool execution visible in progress), note the Untether process tree:
-   ```bash
-   # capture before cancel
-   ps --forest -ef | grep -E "untether|claude|node|workerd" | head -20
-   ```
-3. `/cancel`.
-4. After 30 seconds:
-   ```bash
-   # confirm cleanup
-   ps aux | grep -E "workerd|vitest|defunct" | grep -v grep
-   ls /proc/$(pgrep -f '.venv/bin/untether')/fd 2>/dev/null | wc -l  # FD count should be stable
-   ```
-
-**Pass:** no `workerd`, `vitest`, or stale `node` processes survive, no zombies.
-**Fail:** orphan processes remain → comment on [#275](https://github.com/littlebearapps/untether/issues/275) including ps output and PID details.
-
-### V10 — Cost footer + parity (#316)
-
-Four sub-tests:
-
-**V10.1** — Claude: run U1, verify footer shows both API cost **and** subscription usage (`⚡ 5h: NN% | 7d: NN%`). Numbers must be plausible (< 100%, > 0% after the run).
-
-**V10.2** — Gemini: run U1, verify footer shows `total_cost_usd` (not zero unless genuinely free tier).
-
-**V10.3** — OpenCode: run U1, verify token counts render **even when cost is zero** (OpenCode free-tier case).
-
-**V10.4** — zero-turn / cached response: send a trivial prompt twice back to back (the second may be cached). The second response's footer must still render the turn count cleanly (no missing turns).
-
-**Fail on any sub-test:** comment on [#316](https://github.com/littlebearapps/untether/issues/316) specifying which sub-test and the rendered footer.
-
-### V11 — `run_once` cron persistence (#317)
-
-Setup:
-```bash
-cat >> ~/.untether-dev/untether.toml <<'EOF'
-
-[[triggers.crons]]
-id = "v0352-test-once"
-schedule = "* * * * *"
-chat_id = 5284581592        # Claude chat
-engine = "claude"
-prompt = "reply with the word READY and nothing else"
-run_once = true
-EOF
-```
-
-1. Wait up to 60s for the cron to fire in the Claude chat. Verify `READY`-style response appears.
-2. Check state file:
-   ```bash
-   cat ~/.untether-dev/run_once_fired.json
-   ```
-   Must contain `v0352-test-once` with a recent ISO timestamp.
-3. Trigger a hot-reload: `touch ~/.untether-dev/untether.toml`.
-4. Wait 90 seconds. The cron must **not** fire again.
-5. Restart dev: `systemctl --user restart untether-dev`. Wait 90 seconds. Must **not** fire again.
-
-**Pass:** state file records the fire, no second or third firing after reload/restart.
-**Fail:** cron re-fires → comment on [#317](https://github.com/littlebearapps/untether/issues/317). Include the state file contents and journalctl line for each fire.
-
-Cleanup: remove the test cron block and the state file entry.
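Step 2 of V11 (the state-file check) can be scripted as a presence test. This is a minimal sketch: `run_once_recorded` is a hypothetical helper, the internal layout of `run_once_fired.json` is not assumed, and the timestamp still has to be eyeballed.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Untether).
# Usage: run_once_recorded CRON_ID STATE_FILE
# Succeeds when the cron id appears anywhere in the state file.
run_once_recorded() {
  local id=$1 state_file=$2
  # -F: match the id as a fixed string; -q: no output, exit status only.
  [[ -r $state_file ]] && grep -qF -- "$id" "$state_file"
}
```

Usage: `run_once_recorded v0352-test-once ~/.untether-dev/run_once_fired.json && echo recorded`. Running it again after the reload and restart steps confirms the entry survived both.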
-
-### V12 — Restart-required Telegram warning (#318)
-
-1. Note current `session_mode` in `~/.untether-dev/untether.toml`.
-2. Flip it (`stateless` ↔ `chat`) and save.
-3. Dev bot auto-reloads on file change. Within ~5 seconds:
-   - Primary chat (Claude chat in current dev cfg) must receive a Telegram message matching: `⚠️ Config reload: session_mode changed — restart required to take effect.`
-   - `journalctl` must show `config.reload.transport_config_changed keys=['session_mode'] restart_required=true`.
-4. Revert the change and save. A second warning should fire.
-
-**Pass:** both the Telegram message **and** the structlog event.
-**Fail:** one or both missing → comment on [#318](https://github.com/littlebearapps/untether/issues/318).
-
-Also verify: editing a **hot-reloadable** key (`voice_transcription`) does **not** produce the warning.
-
-### V13 — Webhook port bind graceful (#320)
-
-1. Ensure triggers are enabled in cfg with `[triggers.server]` (port 9876 by default).
-2. Occupy 9876 in another process: `python3 -m http.server 9876 &`.
-3. Restart dev: `systemctl --user restart untether-dev`.
-4. Tail logs for 10 seconds:
-   ```bash
-   journalctl --user -u untether-dev --since "30 seconds ago" | grep -E "bind_failed|triggers.server"
-   ```
-5. Verify the rest of the bot is alive: `send_message` a `/ping` to the Claude chat.
-
-**Pass:** structured `triggers.server.bind_failed` event with `host`, `port`, `hint`, `fix` fields; `/ping` still works.
-**Fail:** bot crashes or restart-loops → comment on [#320](https://github.com/littlebearapps/untether/issues/320).
-
-Cleanup: kill the port squatter with `kill %1` (or equivalent).
-
-### V14 — Stuck-after-tool_result no-false-positive (#322)
-
-Across Claude, Codex, Gemini, OpenCode, Pi: run a normal U1-style prompt that involves a Read/Write tool.
-
-```bash
-journalctl --user -u untether-dev --since "10 minutes ago" | grep -E "stuck_after_tool_result"
-```
-
-**Pass:** **zero** `progress_edits.stuck_after_tool_result` or `recovery` events fired during healthy runs.
-**Fail:** detector fires on a normal run → comment on [#322](https://github.com/littlebearapps/untether/issues/322) with the engine, tool name, and timing.
-
-(Active recovery test — a real MCP wedge — is out of scope; this is a no-false-positive regression check only.)
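Because V14 is repeated across five engines, it may help to turn the grep into a pass/fail verdict. A minimal sketch, with `stuck_verdict` as a hypothetical helper that mirrors the grep pattern used above:

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Untether): reads log lines on stdin and
# reports whether the stuck-after-tool_result detector fired at all.
stuck_verdict() {
  local count
  # grep -c prints 0 and exits 1 when nothing matches, so mask the exit status.
  count=$(grep -cE 'stuck_after_tool_result' || true)
  if (( count == 0 )); then
    echo "PASS"
  else
    echo "FAIL: ${count} event(s)"
    return 1
  fi
}
```

Usage: `journalctl --user -u untether-dev --since "10 minutes ago" | stuck_verdict`, run once after each engine's prompt so a failure can be attributed to that engine.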
-
-### V15 — Per-cron `permission_mode` (#330)
-
-1. Ensure Claude chat has plan mode (`/planmode plan`).
-2. Add:
-   ```toml
-   [[triggers.crons]]
-   id = "v0352-test-perm"
-   schedule = "* * * * *"
-   chat_id = 5284581592
-   engine = "claude"
-   prompt = "run `date` and tell me"
-   permission_mode = "auto"
-   run_once = true
-   ```
-3. Hot-reload picks it up; wait ≤ 60 s for fire.
-4. Verify the run completed **without** presenting approval buttons — the `permission_mode = "auto"` override should have taken effect despite the chat being in plan mode.
-5. Check:
-   ```bash
-   journalctl --user -u untether-dev --since "2 minutes ago" | grep trigger.cron.permission_mode_override
-   ```
-
-**Pass:** run completes autonomously; `permission_mode_override` INFO event recorded.
-**Fail:** approval buttons presented, or run blocked → comment on [#330](https://github.com/littlebearapps/untether/issues/330).
-
-Cleanup: remove the cron, remove the id from `run_once_fired.json` if it got written.
-
----
-
-## 7. Tier 3 — Telegram transport (selective, ~15 min)
-
-Run only the tests relevant to what shipped in v0.35.2; no Telegram transport code changed materially except for #197 / #247.
-
-| # | Test | Related |
-|---|---|---|
-| T6 | Emoji entities — any engine: `respond with 5 emoji flags and bold the country names`. Entities render correctly. | — |
-| T8 | Stale button click — let a Claude session complete and age ~10 min. Click old Approve button. Toast says expired. | #197 |
-| S9 | Concurrent Approve clicks — two rapid `press_inline_button` on the same button. Exactly one Approve path fires. | #197 |
-
----
-
-## 8. Tier 5/6 — operational + stress (~15 min)
-
-| # | Test | Related |
-|---|---|---|
-| B5 | Log inspection after full test run: `journalctl --user -u untether-dev --since "2 hours ago" \| grep -E "ERROR\|WARNING"` — must only surface expected entries (config warnings during V12, intentional cancels during V9). | all |
-| S2 | Concurrent sessions — send U1 in Claude **and** Gemini simultaneously. Both finish, no cross-contamination. | — |
-| S3 | `/restart` mid-run — start a long Claude run, send `/restart`. Drain notice appears, bot restarts, new runs accepted. | — |
-| S7 | Rapid-fire: 5 prompts to the Claude chat in under 5s. Exactly one session locks; rest queue or reject cleanly. | — |
-
-FD count sanity after everything:
-
-```bash
-ls /proc/$(pgrep -f '.venv/bin/untether')/fd 2>/dev/null | wc -l
-ps aux | grep -E "defunct|Z " | grep -v grep
-```
-
-FD count should be in the low hundreds, no zombies.
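If `pgrep -f` matches more than one process, the one-liner above builds an invalid `/proc` path. The sketch below counts descriptors per PID instead; `fd_count` and `report_fds` are hypothetical helper names, and a Linux `/proc` filesystem is assumed.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of Untether); Linux /proc is assumed.

# Print the number of open file descriptors for one PID (0 if it is gone).
fd_count() {
  local pid=$1
  ls "/proc/$pid/fd" 2>/dev/null | wc -l
}

# Print "PID FD_COUNT" for every process whose command line matches PATTERN.
report_fds() {
  local pattern=$1 pid
  for pid in $(pgrep -f -- "$pattern"); do
    printf '%s %s\n' "$pid" "$(fd_count "$pid")"
  done
}
```

Usage: `report_fds '.venv/bin/untether'`, run before and after the suite so the two counts can be compared for each PID.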
-
----
-
-## 9. Final report template
-
-At the end of the test run, write a result block with this shape (can be pasted into a summary issue or commit message):
-
-```markdown
-## v0.35.2 integration test report — 
-
-**Dev bot version:**  on commit 
-**Engines:** claude , codex , opencode , pi , gemini  (amp skipped: auth)
-
-### Tier 7 (command smoke)
-- Claude: 
-- Codex: <…>
-- OpenCode: <…>
-- Pi: <…>
-- Gemini: <…>
-
-### Tier 1 (universal, U1-U10)
-- Matrix: 5 engines × 10 tests = 50 runs
-- Results: 
-- Failures: 
-
-### Tier 2 (Claude interactive)
-
-
-### v0.35.2 scenarios (V1-V15)
-
-
-### Tier 3 selective (T6, T8, S9)
-
-
-### Tier 5/6 (B5, S2, S3, S7)
-
-
-### Logs
-- FD count after suite: 
-- Zombies: 
-- Unexpected WARNING/ERROR lines: 
-
-### Bugs filed / commented
-- Commented on existing issues: 
-- New issues filed in v0.35.2 milestone: 
-
-### Release readiness
-- 
-```
-
-Drop this report in a PR comment or as a new `docs/tests/results/` entry if retained.
-
----
-
-## 10. Known limitations of this plan
-
-- **AMP is skipped** per user instruction (sign-in blocked). Revisit before a v0.35.3 cut.
-- **OpenCode** is deprecated upstream (archived 2025-09-18). Treat any new failures as documented-in-advance unless they reveal an Untether bug (bad error handling, crash) — see [#338](https://github.com/littlebearapps/untether/issues/338).
-- **V9 (process tree)** can be flaky under rate-limited API conditions; rerun if the first attempt looks inconclusive.
-- **V14 (stuck detector)** only verifies no-false-positive; a real wedge test would require orchestrating a Cloudflare MCP stall and is deferred.
-- **V3 (Codex HTML escape)** requires a reproducible Codex auth-error path; if you can't naturally trigger one, mark as N/A rather than fail.
-- **V5, V6** are log-only smoke tests — they don't actively attack the hardened path, they just verify it's not broken on the legitimate path.
-- **V12** requires `[config_watch]` or the equivalent auto-reload mechanism to be enabled already; if the cfg file isn't being watched, do `systemctl --user reload untether-dev` or send a SIGHUP equivalent instead.
-
----
-
-## 11. Execution checklist (top-to-bottom)
-
-- [ ] Preflight 1.1 — reinstall OpenCode
-- [ ] Preflight 1.2-1.5 — services, versions, logs, chats confirmed
-- [ ] Section 3 — Tier 7 command smoke in all 5 chats
-- [ ] Section 4 — Tier 1 U1-U10 in all 5 chats (50 runs)
-- [ ] Section 5 — Tier 2 C1-C7 in Claude chat
-- [ ] Section 6 — V1…V15 (skip V3 if Codex auth can't error; mark N/A)
-- [ ] Section 7 — T6, T8, S9
-- [ ] Section 8 — B5, S2, S3, S7 + FD/zombie check
-- [ ] Section 9 — write the report
-- [ ] Comment on any landed-issue that had a failing test (see Section 0)
-- [ ] File new issues in v0.35.2 milestone for anything not covered
-- [ ] Final verdict: go / no-go for v0.35.2 release cut
-
-**Estimated total time**: ~2 hours end-to-end, single-operator, including log sweeps and report writing.