Pipeline Resilience Phase 1: Add retry logic to daily critical LLM scripts #53

@madjin

Context

PR #52 already hardened extract-facts.py and extract_daily_facts.yml with:

  • Retry loop (2 attempts, 5s delay) in extract-facts.py
  • Completion token sanity check (<200 = truncated)
  • Debug sidecar file on failure
  • Quality-gated daily.json permalink in workflow
  • Workflow-level retry (30s delay)
  • Discord alert on error stubs
  • Aggregation size cap in aggregate-sources.py (200KB user_summaries limit)
  • Backfilled all 9 error stubs successfully

This issue covers the remaining daily-critical scripts that still have zero retry logic.

Remaining Scope

1A. scripts/etl/generate-council-context.py -- Add retry loop

The script already has solid schema validation (lines 354-387); it only needs a retry wrapper around lines 334-434.

  • Add import time, constants: MAX_ATTEMPTS = 2, RETRY_DELAY_SECONDS = 5, MIN_COMPLETION_TOKENS = 150
  • Wrap lines 334-434 in for attempt in range(MAX_ATTEMPTS): loop
  • Add completion token sanity check after token usage logging
  • On success (line 399), break; on failure, log the attempt, call time.sleep(RETRY_DELAY_SECONDS), and continue
  • Save raw response to the-council/council_briefing/.debug/{date}.raw.txt on final failure
  • Add council_metadata["attempts"] field
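A minimal sketch of the retry wrapper described above, not the actual patch. The constants come from this issue; the function name generate_with_retry and the injected call_model / validate callables are hypothetical stand-ins for the existing code at lines 334-434, and the debug directory is passed in rather than hard-coded so the sketch can run anywhere.

```python
import time
from pathlib import Path

MAX_ATTEMPTS = 2
RETRY_DELAY_SECONDS = 5
MIN_COMPLETION_TOKENS = 150


def generate_with_retry(call_model, validate, date, debug_dir, sleep=time.sleep):
    """Try the LLM call up to MAX_ATTEMPTS times; return (briefing, attempts).

    Hypothetical signatures, assumed for illustration only:
      call_model() -> (raw_text, completion_tokens)
      validate(raw_text) -> dict, raising ValueError on a schema failure
    """
    raw_text = ""  # last raw response received, kept for the debug sidecar
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            raw_text, completion_tokens = call_model()
            # Sanity check: a very short completion usually means truncation.
            if completion_tokens < MIN_COMPLETION_TOKENS:
                raise ValueError(
                    f"only {completion_tokens} completion tokens, likely truncated"
                )
            return validate(raw_text), attempt
        except Exception as exc:
            print(f"attempt {attempt}/{MAX_ATTEMPTS} failed: {exc}")
            if attempt < MAX_ATTEMPTS:
                sleep(RETRY_DELAY_SECONDS)

    # Final failure: save the raw response next to the output for debugging.
    debug_path = Path(debug_dir) / f"{date}.raw.txt"
    debug_path.parent.mkdir(parents=True, exist_ok=True)
    debug_path.write_text(raw_text, encoding="utf-8")
    return None, MAX_ATTEMPTS
```

The returned attempt count is what would be stored in council_metadata["attempts"]; in the real script debug_dir would be the-council/council_briefing/.debug.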

1B. scripts/etl/generate-daily-highlights.py -- Add retry to call_llm()

The call_llm() function (lines 115-187) is a clean, isolated API call.

  • Add import time, constants: MAX_ATTEMPTS = 2, RETRY_DELAY_SECONDS = 5, MIN_COMPLETION_TOKENS = 100
  • Wrap lines 134-187 in retry loop inside call_llm()
  • Add token usage logging (currently none)
  • After loop exhausted, return None (existing callers handle None)
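The change to call_llm() might look roughly like the sketch below. It assumes a chat-completions-style response body with usage.completion_tokens and choices[0].message.content, which this issue does not confirm; send_request is a hypothetical callable standing in for the existing HTTP call, and the logger name is made up.

```python
import logging
import time

MAX_ATTEMPTS = 2
RETRY_DELAY_SECONDS = 5
MIN_COMPLETION_TOKENS = 100

logger = logging.getLogger("daily-highlights")  # hypothetical logger name


def call_llm(send_request, sleep=time.sleep):
    """Return the model's text, or None once every attempt has failed.

    send_request() is a hypothetical callable that performs the API call and
    returns the parsed JSON body (assumed chat-completions-style layout).
    """
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            body = send_request()
            usage = body.get("usage", {})
            completion_tokens = usage.get("completion_tokens", 0)
            # Token usage logging, which the function currently lacks.
            logger.info(
                "attempt %d: prompt_tokens=%s completion_tokens=%s",
                attempt, usage.get("prompt_tokens"), completion_tokens,
            )
            if completion_tokens < MIN_COMPLETION_TOKENS:
                raise ValueError(f"only {completion_tokens} completion tokens")
            return body["choices"][0]["message"]["content"]
        except Exception as exc:
            logger.warning("attempt %d/%d failed: %s", attempt, MAX_ATTEMPTS, exc)
            if attempt < MAX_ATTEMPTS:
                sleep(RETRY_DELAY_SECONDS)
    # Loop exhausted: existing callers already handle None.
    return None
```

Returning None instead of raising keeps the contract that the existing callers rely on, so no call sites need to change.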

1C. scripts/integrations/discord/webhook.py -- Add retry to summarize()

  • Add MAX_LLM_ATTEMPTS = 2, RETRY_DELAY_SECONDS = 3
  • Wrap requests.post in summarize() (line 99) with retry loop
  • On exhaustion, fall back to smart_truncate() (existing behavior)
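As a rough sketch of the fallback behavior described above: request_summary is a hypothetical callable wrapping the requests.post call at line 99, and the smart_truncate shown here is only a stand-in for the script's existing helper, whose real signature and behavior may differ.

```python
import time

MAX_LLM_ATTEMPTS = 2
RETRY_DELAY_SECONDS = 3


def smart_truncate(text, limit=200):
    """Stand-in for the existing helper (assumed behavior: cut at a word boundary)."""
    if len(text) <= limit:
        return text
    return text[:limit].rsplit(" ", 1)[0] + "..."


def summarize(text, request_summary, sleep=time.sleep):
    """Ask the LLM for a summary; fall back to truncation if every attempt fails.

    request_summary(text) -> str is a hypothetical wrapper around the HTTP call.
    """
    for attempt in range(1, MAX_LLM_ATTEMPTS + 1):
        try:
            summary = request_summary(text)
            if summary:
                return summary
            raise ValueError("empty summary")
        except Exception as exc:
            print(f"summarize attempt {attempt}/{MAX_LLM_ATTEMPTS} failed: {exc}")
            if attempt < MAX_LLM_ATTEMPTS:
                sleep(RETRY_DELAY_SECONDS)
    # Retries exhausted: keep the existing degraded-but-working behavior.
    return smart_truncate(text)
```

Because the webhook is a notification path, degrading to a truncated message after two attempts seems preferable to failing the whole post.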

1D. .github/workflows/generate-council-briefing.yml -- Quality gate + retry

Same pattern as extract_daily_facts.yml (PR #52):

  • Add "Check generation quality" step (grep for "status": "success")
  • Add "Retry generation on failure" step (30s delay)
  • Quality-gate the daily.json permalink copy
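The workflow steps themselves would be YAML with a grep, but the gating logic can be illustrated in Python. This sketch assumes "status" is a top-level key of the generated JSON, which is stricter than a grep that matches the string anywhere in the file; the function names and paths are hypothetical.

```python
import json
import shutil
from pathlib import Path


def passes_quality_gate(briefing_path):
    """True only if the file parses as JSON and reports status == "success".

    Assumption: "status" sits at the top level of the briefing JSON.
    """
    try:
        data = json.loads(Path(briefing_path).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing file, unreadable bytes, or invalid JSON all fail the gate.
        return False
    return isinstance(data, dict) and data.get("status") == "success"


def publish_if_good(briefing_path, permalink_path):
    """Copy the briefing to the daily.json permalink only when it passes the gate."""
    if not passes_quality_gate(briefing_path):
        return False
    shutil.copyfile(briefing_path, permalink_path)
    return True
```

With this shape, a failed generation leaves the previous daily.json in place instead of overwriting it with an error stub.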

1E. .gitignore

  • Add the-council/council_briefing/.debug/ and the-council/highlights/.debug/

Files to modify

File                                               Est. lines changed
scripts/etl/generate-council-context.py            ~80 added
scripts/etl/generate-daily-highlights.py           ~50 added
scripts/integrations/discord/webhook.py            ~30 added
.github/workflows/generate-council-briefing.yml    ~30 added
.gitignore                                         2 lines

Already done (PR #52)

  • scripts/etl/extract-facts.py -- retry, token check, debug sidecar
  • scripts/etl/aggregate-sources.py -- size cap, token warning
  • .github/workflows/extract_daily_facts.yml -- quality gate, retry, alert
  • .gitignore -- the-council/facts/.debug/
  • Backfilled all 9 error stubs

Labels: enhancement (New feature or request)
