Pipeline Resilience Phase 3: Add schema validation to ETL output files #55

@madjin

Context

No formal schema validation exists for the pipeline's JSON output files. Upstream changes (e.g., images: [] -> images: null) and LLM output drift cause silent failures in downstream consumers (posters, Discord, website). Currently only extract-facts.py has manual field-presence checks.

Approach

Add pure-Python validators to each script, with no new dependencies such as Pydantic, to respect the script independence pattern. Validation runs after the LLM output is parsed and before the file is written. Failures are logged and recorded in _metadata but do not block the file write (degraded operation).
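A minimal sketch of the log-and-record flow described above, assuming each validator returns a list of error strings. The helper name `apply_validation` is hypothetical; the `_metadata.schema_validation` and `_metadata.schema_errors` field names are taken from section 3A.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def apply_validation(parsed: dict, validator: Callable[[dict], list]) -> dict:
    """Run a validator on parsed LLM output and record the result in _metadata.

    Hypothetical helper: it never raises on schema errors, so the caller
    can still write the file (degraded operation).
    """
    errors = validator(parsed)
    metadata = parsed.setdefault("_metadata", {})
    metadata["schema_validation"] = "failed" if errors else "passed"
    metadata["schema_errors"] = errors
    for error in errors:
        logger.warning("Schema validation error: %s", error)
    return parsed
```

Because the function returns the same dict either way, the existing file-write code would not need a separate failure path; downstream consumers can decide how to react to a "failed" status.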

Scope

3A. scripts/etl/extract-facts.py -- Add validate_facts_schema()

~80-line function validating:

  • Required top-level: briefing_date (str), overall_summary (str), categories (dict)
  • Optional: key_facts (list[str]), open_questions (list[str])
  • Tags: themes (list), sentiment (dict with overall + context), story_type (list)
  • Categories: github_updates (dict), all others (list)

Call it after tags are merged (~line 682) and before _metadata is written.
Set _metadata.schema_validation to "passed" or "failed" and add a _metadata.schema_errors list.
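One possible shape for `validate_facts_schema()`, shown as a sketch rather than the final implementation. Two points are assumptions not stated in this issue: the tag fields are assumed to live under a top-level `tags` key, and they are only checked when present. The category name `discord` in the usage note is also illustrative.

```python
def validate_facts_schema(facts) -> list:
    """Return a list of human-readable schema errors (empty means valid)."""
    errors = []

    def check(obj, key, expected, required, path=""):
        """Record an error if key is missing (when required) or mistyped."""
        if key not in obj:
            if required:
                errors.append(f"{path}{key}: missing required field")
            return False
        if not isinstance(obj[key], expected):
            errors.append(
                f"{path}{key}: expected {expected.__name__}, "
                f"got {type(obj[key]).__name__}"
            )
            return False
        return True

    if not isinstance(facts, dict):
        return [f"root: expected dict, got {type(facts).__name__}"]

    # Required top-level fields
    check(facts, "briefing_date", str, True)
    check(facts, "overall_summary", str, True)

    # Optional lists of strings
    for key in ("key_facts", "open_questions"):
        if check(facts, key, list, False):
            if not all(isinstance(item, str) for item in facts[key]):
                errors.append(f"{key}: all items must be str")

    # Categories: github_updates is a dict, every other category is a list
    if check(facts, "categories", dict, True):
        for name in facts["categories"]:
            expected = dict if name == "github_updates" else list
            check(facts["categories"], name, expected, True, "categories.")

    # ASSUMPTION: tags sit under a top-level "tags" key and are optional
    if check(facts, "tags", dict, False):
        tags = facts["tags"]
        check(tags, "themes", list, False, "tags.")
        check(tags, "story_type", list, False, "tags.")
        if check(tags, "sentiment", dict, False, "tags."):
            for sub in ("overall", "context"):
                if sub not in tags["sentiment"]:
                    errors.append(f"tags.sentiment.{sub}: missing required field")

    return errors
```

With this sketch, an upstream change such as a category switching from `[]` to `null` would surface as an explicit error like `categories.discord: expected list, got NoneType` instead of failing silently downstream.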

3B. scripts/etl/generate-council-context.py -- Annotate existing validation

Lines 354-387 already validate the full nested schema. Just add:

  • _metadata.schema_validation = "passed" on success (~line 402)
  • _metadata.schema_validation = "failed" in ValueError catch (line 419)
  • ~5 lines total
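The annotation could look roughly like the following. The actual control flow in generate-council-context.py is not shown in this issue, so the wrapper name `annotate_schema_validation` and the `validate` callable are hypothetical stand-ins for the existing validation at lines 354-387, which is assumed to raise ValueError on failure.

```python
def annotate_schema_validation(context: dict, validate) -> dict:
    """Record the outcome of an existing raise-on-error validator in _metadata."""
    metadata = context.setdefault("_metadata", {})
    try:
        validate(context)  # stand-in for the existing nested-schema checks
    except ValueError:
        metadata["schema_validation"] = "failed"
    else:
        metadata["schema_validation"] = "passed"
    return context
```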

3C. scripts/etl/generate-daily-highlights.py -- Add validate_highlights_schema()

~50-line function validating:

  • Required: date (str), highlights (list)
  • Each highlight: headline (str), body (str), character (str), sources (list)

Call in generate_highlights() before return.
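A sketch of what `validate_highlights_schema()` might look like, assuming it returns a list of error strings in the same style as the other validators. The helper `_check_field` and the `HIGHLIGHT_FIELDS` table are illustrative names, not existing code.

```python
# Required fields of each highlight and their expected types (from the list above)
HIGHLIGHT_FIELDS = {"headline": str, "body": str, "character": str, "sources": list}


def _check_field(obj, key, expected, path, errors) -> bool:
    """Append an error if key is missing or has the wrong type."""
    if key not in obj:
        errors.append(f"{path}{key}: missing required field")
        return False
    if not isinstance(obj[key], expected):
        errors.append(
            f"{path}{key}: expected {expected.__name__}, "
            f"got {type(obj[key]).__name__}"
        )
        return False
    return True


def validate_highlights_schema(data) -> list:
    """Return a list of schema errors for the highlights output (empty means valid)."""
    errors = []
    if not isinstance(data, dict):
        return [f"root: expected dict, got {type(data).__name__}"]

    _check_field(data, "date", str, "", errors)
    if _check_field(data, "highlights", list, "", errors):
        for i, item in enumerate(data["highlights"]):
            if not isinstance(item, dict):
                errors.append(
                    f"highlights[{i}]: expected dict, got {type(item).__name__}"
                )
                continue
            for field, expected in HIGHLIGHT_FIELDS.items():
                _check_field(item, field, expected, f"highlights[{i}].", errors)
    return errors
```

Indexing each error by highlight position keeps the log readable when only one LLM-generated entry drifts from the expected shape.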

Files to modify

| File | Est. lines changed |
| --- | --- |
| scripts/etl/extract-facts.py | ~90 added |
| scripts/etl/generate-council-context.py | ~5 added |
| scripts/etl/generate-daily-highlights.py | ~60 added |

Verification

Run extract-facts.py against daily.json and confirm that _metadata.schema_validation is "passed" in the output.
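The check could be scripted along these lines. The output file location is not given in this issue, so the path is left as a parameter, and the function name `schema_status` is hypothetical.

```python
import json
from pathlib import Path
from typing import Optional, Tuple


def schema_status(output_path: str) -> Tuple[Optional[str], list]:
    """Read a pipeline output file and return (schema_validation, schema_errors)."""
    data = json.loads(Path(output_path).read_text(encoding="utf-8"))
    metadata = data.get("_metadata", {})
    return metadata.get("schema_validation"), metadata.get("schema_errors", [])
```

A status of `None` would indicate that the file was written before validation was added, which is worth distinguishing from "failed".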

Labels: enhancement
