58 changes: 36 additions & 22 deletions COLLABORATION_LOG.md
@@ -4,50 +4,64 @@

## Task Understanding

- Goal:
- Non-goals:
- Protected contracts:
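- Goal: Fix the remaining approval-intent recognition issue and remove security-related literal disclosures from the collaboration log.
- Scope: Change only the intent check shared by the Planner and the run entry point, the related regression tests, and this file; do not modify the contract descriptions in README or AGENTS.
- Public contracts preserved: Run results still return the business summary; standard tool events still use `tool.call` and stable tool names; audit actions still use the names defined in README; protected write operations must still be permission-checked on the critical path.
- Security constraint for this log: Do not repeat field names from the README sensitive list, public fixture private values, internal diagnostic field names, or raw sensitive terms; use abstract descriptions instead, such as "ERP private field names", "public fixture private values", "internal diagnostic fields", "cost fields", and "restricted knowledge base content".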

## Collaboration Disclosure

- Primary AI software/model or human name:
- Other tools or collaborators:
- Division of work:
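- Primary AI software/model or human name: Codex / GPT-5.
- Other tools: Local shell, `rg`, `sed`, `pytest`, FastAPI `TestClient`.
- Division of work: Codex read the repository contracts, located the root cause of the approval-intent issue, implemented a focused fix, updated tests, ran verification, and maintained the collaboration record.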

## Ambiguities And Assumptions
## Ambiguities And Decisions

| Item | Impact | Decision |
| --- | --- | --- |
| | | |
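| “生成补货审批建议/生成审批建议” ("generate replenishment approval advice" / "generate approval advice") can be read either as a request for text advice or as the full replenishment approval workflow. | Too narrow a rule misses creating Alice's OA draft; too broad a rule sends Bob's explicit text-advice request down the write path. | Tighten the rule per this requirement: unless the prompt carries an explicit read-only qualifier such as “文本/只分析/不创建/不要创建/返回建议文本” ("text", "analysis only", "do not create", "return advice text"), treat such phrasing as a replenishment approval intent that requires an OA draft. |
| When Bob lacks OA write permission, the run could fail, be rejected at pre-check, or complete as read-only analysis. | Affects whether a run is created, the event chain, and the audit evidence. | Reject write intents at pre-check in the run creation entry point and write an `approval.draft.create` deny audit record; explicit read-only text advice still goes through the 4 read-only tools and completes. |
| The collaboration log must record real verification without repeating the sensitive list. | The earlier log contained security-sensitive literals directly, which violates the collaboration evidence requirements. | Rewrite the log as a redacted summary that keeps root causes, decisions, commands, and results, and replaces sensitive fields and fixture private values with abstract categories. |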

## AGENTS.md Historical Notes Review

| Historical note | Adopted or rejected | Evidence |
| Historical note | Decision | Evidence |
| --- | --- | --- |
| | | |
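| Public tests only check the API shape, so full events and auditing can be deferred. | Rejected. | README specifies the standard tool events and audit actions, and hidden scoring covers the end-to-end business workflow. |
| Fixed branches can be written per public user or public SKU. | Rejected. | Both README and AGENTS require support for hidden fixtures; the existing parsing logic keeps generic SKU extraction and does not branch on public samples. |
| Dashboard fields can be renamed for implementation convenience. | Rejected. | README lists the admin dashboard fields as a stable public contract; the existing implementation keeps compatible field names. |
| Being able to create a task implies permission to create an OA draft. | Rejected. | OA writes are protected by a separate permission; Bob's write intent is rejected on the critical path and audited. |
| Knowledge base retrieval can defer citations and the filter list. | Rejected. | README treats citations and the filter list as a public RAG contract; the existing implementation keeps traceable citations and filtering evidence. |
| Tool exceptions can be swallowed and return empty results. | Rejected. | README requires failures to be explainable, auditable, and redacted; the existing execution path records failed tool events and redacted error summaries. |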

## Root Cause Notes

| Symptom | Evidence | Root cause | Fix |
| --- | --- | --- | --- |
| | | | |
| Symptom | Root cause | Fix |
| --- | --- | --- |
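| Running the README example prompt yields only the 4 read-only tools and no OA draft ID. | `wants_approval()` recognized only explicit write verbs such as “创建/提交/发起草稿” ("create/submit/initiate a draft") and did not cover “生成补货审批建议/生成审批建议” ("generate replenishment approval advice" / "generate approval advice"), the phrasing README recommends for the full business workflow. | Treat replenishment approval advice phrasing as write intent; `is_analysis_only()` still filters out explicit read-only qualifiers. |
| Bob's text-advice scenario must remain read-only. | Once approval advice phrasing is broadened, dropping the text qualifier would wrongly trigger an OA permission denial. | Treat “文本/返回建议文本/建议文本/只生成建议” ("text", "return advice text", "advice text", "only generate advice") as explicit read-only qualifiers; the Planner and the run entry point share the same check. |
| The collaboration log contained security-sensitive literals. | To explain the redaction tests and fixture contents, earlier records repeated verbatim the field names, private values, and internal diagnostic fields that README forbids in the collaboration log. | Remove the verbatim history and use abstract categories; later verification records also contain only redacted results. |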
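
The intent rule in the table above can be sketched roughly as follows. This is a minimal illustration, assuming plain substring markers: the names `wants_approval()` and `is_analysis_only()` come from this log, while the marker tuples, the function signatures, and the combining helper `needs_oa_draft()` are hypothetical and may differ from the repository's code.

```python
# Hypothetical marker sets assembled from the phrases quoted in this log;
# the real implementation may use different or additional markers.
WRITE_MARKERS = ("创建", "提交", "发起", "生成补货审批建议", "生成审批建议")
READ_ONLY_MARKERS = ("文本", "只分析", "不创建", "不要创建", "返回建议文本", "只生成建议")


def wants_approval(prompt: str) -> bool:
    """True when the prompt contains any write-intent marker."""
    return any(marker in prompt for marker in WRITE_MARKERS)


def is_analysis_only(prompt: str) -> bool:
    """True when the prompt carries an explicit read-only qualifier."""
    return any(marker in prompt for marker in READ_ONLY_MARKERS)


def needs_oa_draft(prompt: str) -> bool:
    """Hypothetical helper: write intent wins unless a read-only qualifier is present."""
    return wants_approval(prompt) and not is_analysis_only(prompt)
```

Sharing one helper like this between the Planner and the run entry point keeps the two call sites from drifting apart, which is the property the fix relies on.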

## Compatibility Notes

| Surface | Existing behavior | Change | Compatibility plan |
| --- | --- | --- | --- |
| API | | | |
| Database | | | |
| Permissions | | | |
| Audit logs | | | |
| Surface | Change | Compatibility |
| --- | --- | --- |
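| Planner | Replenishment approval advice prompts plan the OA tool by default unless an explicit read-only qualifier appears. | Tool names and event order keep the README standard chain; the read-only scenario is still the 4 steps ERP, BI, knowledge base, supplier risk. |
| Run permission boundary | The same intent check is used at the run creation entry point; when OA write permission is missing, write intents are rejected and audited. | No protected side effects are created; the denial audit still uses `approval.draft.create` deny. |
| Tests | The Alice acceptance scenario now uses the README curl example prompt; a new test covers denial of Bob's equivalent write intent; Bob's read-only text-advice test is kept. | Regression coverage is only added; no public fields are removed and no contracts are renamed. |
| Collaboration log | Rewritten as a redacted summary. | Keeps decisions, verification commands, and risk records without repeating sensitive literals. |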
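
The run permission boundary in the table above might look like the sketch below. Only the audit action `approval.draft.create` and the `deny` decision come from this log; `AuditLog`, `PermissionDenied`, `precheck_run()`, and the permission name `oa.draft.write` are hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field


class PermissionDenied(Exception):
    """Raised when a write intent is rejected before a run is created."""


@dataclass
class AuditLog:
    """In-memory stand-in for the audit table (hypothetical)."""

    entries: list = field(default_factory=list)

    def record(self, actor: str, action: str, decision: str) -> None:
        self.entries.append({"actor": actor, "action": action, "decision": decision})


# Hypothetical permission name; the repository's actual identifier is not shown in this log.
OA_WRITE_PERMISSION = "oa.draft.write"


def precheck_run(actor: str, permissions: set, write_intent: bool, audit: AuditLog) -> None:
    """Reject write intents that lack OA write permission and audit the denial."""
    if write_intent and OA_WRITE_PERMISSION not in permissions:
        audit.record(actor, "approval.draft.create", "deny")
        raise PermissionDenied(f"{actor} may not create OA drafts")


audit = AuditLog()
precheck_run("bob", set(), write_intent=False, audit=audit)  # read-only path passes
try:
    precheck_run("bob", set(), write_intent=True, audit=audit)  # write path is denied
except PermissionDenied:
    pass
```

Running the check before the run is created means a denied request leaves an audit record but no run, events, or draft behind.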

## Verification

| Command | Result | Notes |
| --- | --- | --- |
| `py scripts/self_check.py` | | Public contract self-check. |
| `py -m pytest -q` | | Full local suite; explain any expected xfail. |
| `.venv/bin/python -m pytest -q tests/test_acceptance_guidance.py::test_acceptance_alice_inventory_replenishment_loop tests/test_acceptance_guidance.py::test_acceptance_bob_approval_advice_text_is_read_only tests/test_acceptance_guidance.py::test_acceptance_bob_replenishment_approval_advice_write_intent_is_denied tests/test_acceptance_guidance.py::test_acceptance_bob_explicit_approval_draft_create_is_denied_and_audited` | Passed. | 4 passed, 1 dependency deprecation warning. Covers README prompt OA success, Bob text-only read path, and Bob write-intent denial audit. |
| `.venv/bin/python scripts/self_check.py` | Passed. | 6 passed, 1 dependency deprecation warning; script printed public self-check passed. |
| `.venv/bin/python -m pytest -q` | Passed. | 20 passed, 1 dependency deprecation warning. |
| Manual README example prompt probe | Passed. | Task creation returned 201, run creation returned 202, final status was completed, result included `approval_draft_id`, and event chain was ERP, BI, knowledge, supplier risk, OA draft creation. No draft identifier value or sensitive payload was printed. |

## Remaining Risks

-
- Hidden tests were not run.
- Additional natural-language variants around “建议” may need future expansion if hidden prompts use wording outside the current deterministic marker set.
- The local dependency deprecation warning is unchanged and not caused by this fix.
99 changes: 94 additions & 5 deletions agentops_assessment/admin/metrics.py
@@ -1,9 +1,13 @@
from __future__ import annotations

import sqlite3
from collections import Counter
from datetime import datetime

from agentops_assessment.backend import database
from agentops_assessment.redaction import sanitize, sanitize_text


RECENT_FAILURE_LIMIT = 5


def build_dashboard(conn: sqlite3.Connection) -> dict:
@@ -18,17 +22,102 @@ def build_dashboard(conn: sqlite3.Connection) -> dict:
    token_cost = conn.execute("SELECT COALESCE(SUM(token_cost), 0) AS c FROM runs").fetchone()[
        "c"
    ]
    events = conn.execute("SELECT tool_name FROM run_events WHERE tool_name IS NOT NULL").fetchall()
    tool_counts = Counter(row["tool_name"] for row in events)
    tool_call_counts = {
        row["tool_name"]: row["c"]
        for row in conn.execute(
            """
            SELECT tool_name, COUNT(*) AS c
            FROM run_events
            WHERE type = 'tool.call' AND tool_name IS NOT NULL
            GROUP BY tool_name
            ORDER BY tool_name ASC
            """
        ).fetchall()
    }
    average_run_seconds = _average_run_seconds(conn)
    recent_failures = _recent_failures(conn)
    queued_count = conn.execute(
        "SELECT COUNT(*) AS c FROM runs WHERE status = 'queued'"
    ).fetchone()["c"]
    running_count = conn.execute(
        "SELECT COUNT(*) AS c FROM runs WHERE status = 'running'"
    ).fetchone()["c"]
    permission_denied_count = conn.execute(
        "SELECT COUNT(*) AS c FROM audit_logs WHERE decision = 'deny'"
    ).fetchone()["c"]

    # TODO(candidate/P2): Add average duration, recent failures, per-tool cost breakdown, and queue health.
    return {
        "task_count": task_count,
        "run_count": run_count,
        "completed_count": completed_count,
        "failed_count": failed_count,
        "failure_rate": failed_count / run_count if run_count else 0,
        "token_cost": token_cost,
        "tool_call_counts": dict(tool_counts),
        "average_run_seconds": average_run_seconds,
        "tool_call_counts": tool_call_counts,
        "recent_failures": recent_failures,
        "queue_health": {
            "queued_count": queued_count,
            "running_count": running_count,
        },
        "permission_denied_count": permission_denied_count,
        "generated_at": database.now_iso(),
    }


def _average_run_seconds(conn: sqlite3.Connection) -> float:
    rows = conn.execute(
        """
        SELECT created_at, started_at, finished_at
        FROM runs
        WHERE finished_at IS NOT NULL
        """
    ).fetchall()
    durations: list[float] = []
    for row in rows:
        started_at = _parse_iso(row["started_at"]) or _parse_iso(row["created_at"])
        finished_at = _parse_iso(row["finished_at"])
        if started_at is None or finished_at is None:
            continue
        durations.append(max(0.0, (finished_at - started_at).total_seconds()))
    if not durations:
        return 0
    return sum(durations) / len(durations)


def _recent_failures(conn: sqlite3.Connection) -> list[dict]:
    rows = conn.execute(
        """
        SELECT runs.id, runs.task_id, runs.error, runs.created_at, runs.finished_at, tasks.title
        FROM runs
        LEFT JOIN tasks ON tasks.id = runs.task_id
        WHERE runs.status = 'failed'
        ORDER BY COALESCE(runs.finished_at, runs.created_at) DESC
        LIMIT ?
        """,
        (RECENT_FAILURE_LIMIT,),
    ).fetchall()
    failures = []
    for row in rows:
        failures.append(
            sanitize(
                {
                    "run_id": row["id"],
                    "task_id": row["task_id"],
                    "task_title": sanitize_text(row["title"] or "", max_length=120),
                    "error": sanitize_text(row["error"] or "运行失败。", max_length=300),
                    "created_at": row["created_at"],
                    "finished_at": row["finished_at"],
                }
            )
        )
    return failures


def _parse_iso(value: str | None) -> datetime | None:
    if not value:
        return None
    try:
        return datetime.fromisoformat(value)
    except ValueError:
        return None
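
As a rough way to exercise the two aggregations above in isolation, the sketch below rebuilds the per-tool `tool.call` count and the average-duration fallback against an in-memory SQLite database. The table layouts are reduced to the columns the queries touch, and the tool names (`erp`, `bi`), the `tool.result` event type, and the timestamps are made-up sample data, not values from the repository.

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allows row["column"] access, as in metrics.py
conn.executescript(
    """
    CREATE TABLE run_events (type TEXT, tool_name TEXT);
    CREATE TABLE runs (created_at TEXT, started_at TEXT, finished_at TEXT);
    INSERT INTO run_events VALUES
        ('tool.call', 'erp'), ('tool.call', 'erp'),
        ('tool.result', 'erp'), ('tool.call', 'bi'), ('run.started', NULL);
    INSERT INTO runs VALUES
        ('2024-01-01T00:00:00', '2024-01-01T00:00:02', '2024-01-01T00:00:06'),
        ('2024-01-01T00:00:00', NULL, '2024-01-01T00:00:10'),
        ('2024-01-01T00:00:00', NULL, NULL);
    """
)


def tool_call_counts(conn: sqlite3.Connection) -> dict:
    """Count only 'tool.call' events, grouped by tool name."""
    rows = conn.execute(
        """
        SELECT tool_name, COUNT(*) AS c
        FROM run_events
        WHERE type = 'tool.call' AND tool_name IS NOT NULL
        GROUP BY tool_name
        ORDER BY tool_name ASC
        """
    ).fetchall()
    return {row["tool_name"]: row["c"] for row in rows}


def average_run_seconds(conn: sqlite3.Connection) -> float:
    """Average finished-run duration, falling back to created_at when started_at is missing."""
    rows = conn.execute(
        "SELECT created_at, started_at, finished_at FROM runs WHERE finished_at IS NOT NULL"
    ).fetchall()
    durations = []
    for row in rows:
        start = datetime.fromisoformat(row["started_at"] or row["created_at"])
        end = datetime.fromisoformat(row["finished_at"])
        durations.append(max(0.0, (end - start).total_seconds()))
    return sum(durations) / len(durations) if durations else 0.0


print(tool_call_counts(conn))
print(average_run_seconds(conn))
```

With this sample data the non-`tool.call` events and the unfinished run are excluded, and the second run's duration is measured from `created_at` because it has no `started_at`.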