Commit f322478

Merge pull request #165 from cbwinslow/codex/create-congress.gov-folder-with-migration-scripts
Add Congress.gov schema and ingestion toolkit
2 parents 3584ab8 + ced3a26 commit f322478

8 files changed, +1301 -0 lines changed

DIFF_20251109_030348.md

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
## Diff Summary (${timestamp})

- Added `congress.gov/README.md` describing the new Congress.gov integration toolkit directory.
- Authored `congress.gov/data_model.md` outlining the relational schema aligned with Congress.gov collections.
- Created PostgreSQL migrations under `congress.gov/migrations/` to materialize the core, legislative, and activity tables.
- Implemented `congress.gov/ingest_congress_data.py`, an asynchronous, GPU-aware ingestion pipeline for api.congress.gov.

## Diff Summary (2025-11-09 03:03:48 UTC)

- Added `congress.gov/README.md` describing the Congress.gov integration toolkit directory.
- Authored `congress.gov/data_model.md` outlining the relational schema aligned with Congress.gov collections.
- Created PostgreSQL migrations under `congress.gov/migrations/` to materialize the core, legislative, and activity tables.
- Implemented `congress.gov/ingest_congress_data.py`, an asynchronous, GPU-aware ingestion pipeline for api.congress.gov.

RECOMMENDATIONS_20251109_030359.md

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
## Recommendations (20251109_030359)

- Add dedicated loaders for remaining Congress.gov collections (treaties, nominations, congressional record) by subclassing `BaseResourceLoader` to fully cover the schema.
- Introduce automated schema migration tooling (e.g., Alembic or sqitch) to manage versioned deployments across environments.
- Configure integration tests that mock the Congress.gov API and assert end-to-end database persistence for the ingestion pipeline.

## Additional Notes (2025-11-09 03:03:59 UTC)

- Add dedicated loaders for remaining Congress.gov collections (treaties, nominations, congressional record) by subclassing `BaseResourceLoader` to fully cover the schema.
- Introduce automated schema migration tooling (e.g., Alembic or sqitch) to manage versioned deployments across environments.
- Configure integration tests that mock the Congress.gov API and assert end-to-end database persistence for the ingestion pipeline.
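
The loader recommendation above could look roughly like the sketch below. The interface of `BaseResourceLoader` is not shown in this commit view, so the base class here is a hypothetical stand-in, and the `resource` attribute, the `transform` method, and the treaty field names are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseResourceLoader(ABC):
    """Hypothetical stand-in; the real class in ingest_congress_data.py may differ."""

    # Assumed attribute: API path segment for the collection.
    resource: str = ""

    @abstractmethod
    def transform(self, item: dict[str, Any]) -> dict[str, Any]:
        """Map one API item to a flat row for the target table."""


class TreatyLoader(BaseResourceLoader):
    """Illustrative loader for the treaties collection."""

    resource = "treaty"

    def transform(self, item: dict[str, Any]) -> dict[str, Any]:
        # Field names are illustrative and not confirmed against the API.
        return {
            "congress": item["congress"],
            "treaty_number": item["number"],
            "topic": item.get("topic"),
        }


row = TreatyLoader().transform({"congress": 117, "number": 3, "topic": "Extradition"})
print(row)
```

Keeping the per-collection logic in a single `transform` method is one way to let the shared fetch and persistence code stay in the base class.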

congress.gov/README.md

Lines changed: 44 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,44 @@
# Congress.gov Data Integration Toolkit

This directory contains documentation, database migrations, and ingestion tooling to replicate core
entities exposed by [Congress.gov](https://www.congress.gov/) and its public API at
`https://api.congress.gov/`.

## Contents

- `data_model.md`: Detailed explanation of the logical data model distilled from the public
  website and API documentation.
- `migrations/`: PostgreSQL-compatible SQL migration scripts that materialize the schema needed
  to store the Congress.gov dataset.
- `ingest_congress_data.py`: Asynchronous, GPU-aware ingestion pipeline that streams data from
  the public API into the relational schema. The script supports sampling, parallel downloads, and
  resumable checkpoints for efficient development and production workflows.

## Prerequisites

1. **Database**: PostgreSQL 14+ with the `pgcrypto` extension enabled (used for UUID generation
   and hashing utilities).
2. **Python environment**: Python 3.10+ with dependencies listed in the module-level docstring of
   `ingest_congress_data.py`.
3. **Congress.gov API key**: Request an API key and export it via `CONGRESS_API_KEY` or supply it
   with the `--api-key` command line flag.
4. **GPU acceleration (optional)**: Install the RAPIDS stack (`cudf`, `cupy`) if a CUDA-capable
   GPU is available. The ingestion pipeline auto-detects GPU libraries and falls back to CPU-only
   processing when they are absent.

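
The GPU fallback described in item 4 can be approximated with a standard-library import probe. This is a minimal sketch, not the script's actual detection code: the function name `pick_backend` and the `"gpu"`/`"cpu"` labels are assumptions, and the probe only checks that the packages are installed, not that a CUDA device is usable.

```python
import importlib.util
from typing import Callable, Optional


def pick_backend(
    find_spec: Callable[[str], Optional[object]] = importlib.util.find_spec,
) -> str:
    """Return "gpu" if both RAPIDS libraries can be located, otherwise "cpu"."""
    gpu_libs = ("cudf", "cupy")
    # find_spec returns None when a top-level package is not installed.
    if all(find_spec(name) is not None for name in gpu_libs):
        return "gpu"
    return "cpu"


backend = pick_backend()
print(f"Selected backend: {backend}")
```

Passing the lookup function as a parameter keeps the decision testable on machines without a GPU.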
## Usage Overview

1. Run the migrations in `migrations/` (in lexical order) against your target PostgreSQL database.
2. Configure access credentials via environment variables or CLI flags.
3. Execute `python ingest_congress_data.py --resource bills` (or any supported resource) to ingest
   records. Use `--sample-size` to limit the number of items while testing.

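
Step 2 above, credentials from either the environment or a flag, is commonly wired up as shown in this sketch. The flag names and `CONGRESS_API_KEY` come from this README; the defaults, the error message, and the helper names `build_parser` and `parse_args` are assumptions, and the real script may parse its arguments differently.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Congress.gov ingestion (sketch)")
    parser.add_argument("--resource", default="bills")
    parser.add_argument("--sample-size", type=int, default=None)
    # The environment variable supplies the default; an explicit flag overrides it.
    parser.add_argument("--api-key", default=os.environ.get("CONGRESS_API_KEY"))
    return parser


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    if not args.api_key:
        parser.error("provide --api-key or set CONGRESS_API_KEY")
    return args


args = parse_args(["--resource", "bills", "--sample-size", "25", "--api-key", "demo"])
print(args.resource, args.sample_size)
```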
## Extensibility

The toolkit is intentionally modular. You can:

- Add new SQL migrations to extend the schema when Congress.gov publishes new collections.
- Implement additional resource loaders by subclassing `BaseResourceLoader`.
- Adjust concurrency knobs (`--max-concurrent-requests`, `--thread-pool-size`) to match your
  infrastructure capacity.

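
A cap such as `--max-concurrent-requests` is often enforced with an `asyncio.Semaphore`. The sketch below is an assumption about how such a limit could work, not the script's implementation: `fetch_all` is a hypothetical helper, and `asyncio.sleep` stands in for a real HTTP request. It also records the peak number of in-flight tasks so the cap can be checked.

```python
import asyncio


async def fetch_all(urls: list[str], max_concurrent_requests: int = 4) -> tuple[list[str], int]:
    semaphore = asyncio.Semaphore(max_concurrent_requests)
    in_flight = 0
    peak = 0

    async def fetch(url: str) -> str:
        nonlocal in_flight, peak
        async with semaphore:  # at most max_concurrent_requests tasks pass this point
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # placeholder for a real HTTP request
            in_flight -= 1
            return f"fetched {url}"

    # gather returns results in the same order as the input URLs.
    results = await asyncio.gather(*(fetch(u) for u in urls))
    return list(results), peak


results, peak = asyncio.run(
    fetch_all([f"page-{i}" for i in range(10)], max_concurrent_requests=3)
)
print(len(results), peak)
```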

congress.gov/data_model.md

Lines changed: 86 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,86 @@
# Congress.gov Relational Data Model

The schema described below synthesizes official Congress.gov documentation, publicly available data
samples, and structural cues from the website. It is designed to accommodate every major collection
published by the API while remaining normalized and query-friendly.

> **Note:** Congress.gov evolves continuously. Treat this model as a strong baseline and monitor the
> API changelog for new fields or entities that may require schema extensions.

## Core Reference Tables

| Table | Purpose |
| --- | --- |
| `congress.sessions` | One row per numbered Congress (e.g., 118th), including start/end dates and calendar year range. |
| `congress.chambers` | Enumerates the House, Senate, and Joint designations used by multiple resources. |
| `congress.parties` | Catalogues political parties for member affiliation history. |
| `congress.states` | ISO-like references for U.S. states and territories, re-used across members and committees. |

## People and Organizations

- **Members**: Stored in `congress.members` with biographical metadata. Temporal service information
  lives in `congress.member_terms`: each member can have many term rows, and each term row references
  one Congress session, one chamber, and one party.
- **Committees**: Captured by `congress.committees` with optional `parent_committee_id` for
  subcommittees. Committee membership (including leadership roles) is managed via
  `congress.committee_members`.

## Legislative Instruments

| Entity | Tables | Highlights |
| --- | --- | --- |
| Bills & Resolutions | `congress.bills`, `congress.bill_titles`, `congress.bill_actions`, `congress.bill_text_versions`, `congress.bill_summaries`, `congress.bill_subjects`, `congress.bill_cosponsors`, `congress.related_bills` | Covers metadata, multi-lingual titles, action history, full-text versions, CRS summaries, topical subjects, and co-sponsorship data. |
| Amendments | `congress.amendments`, `congress.amendment_actions`, `congress.amendment_sponsors` | Mirrors the bill structure for amendment records. |
| Nominations | `congress.nominations`, `congress.nomination_actions`, `congress.nomination_candidates` | Tracks presidential nominations and Senate action. |
| Treaties | `congress.treaties`, `congress.treaty_actions`, `congress.treaty_topics` | Stores treaty documents and consideration steps. |
| Congressional Records | `congress.congressional_record_sections`, `congress.congressional_record_pages` | Enables ingestion of daily Congressional Record text and metadata. |
| Committee Materials | `congress.committee_reports`, `congress.hearings`, `congress.hearing_witnesses` | Supports published reports, hearing schedules, and witness rosters. |

## Legislative Activity

- **Actions and Votes**: All legislative actions are normalized in `congress.bill_actions` and related
  tables. Roll call votes from both chambers live in `congress.roll_calls` with individual positions in
  `congress.roll_call_votes`.
- **Calendars**: Floor calendars and schedule entries are modeled in `congress.floor_calendars` and
  `congress.floor_calendar_entries`.

## Supporting Structures

- **Documents & Media**: `congress.documents` stores references to PDFs, XML, and other artifacts,
  linking them back to primary entities through join tables.
- **Search Indexing**: The schema includes `tsvector` columns (e.g., `search_document`) in several
  tables to enable PostgreSQL full-text search acceleration.
- **Audit Columns**: Every table carries `created_at`, `updated_at`, and immutable natural keys from
  the API, so ingesting the same record again can be applied as an upsert on its natural key, which
  keeps the ingestion process idempotent.

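
The idempotency that natural keys provide can be illustrated with an upsert on a composite key. This is a sketch under stated assumptions: it uses an in-memory SQLite database as a stand-in for PostgreSQL (both accept this `ON CONFLICT ... DO UPDATE` form), and the table and column names are hypothetical rather than taken from the migrations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE bills (
        congress   INTEGER NOT NULL,
        bill_type  TEXT    NOT NULL,
        number     INTEGER NOT NULL,
        title      TEXT,
        updated_at TEXT,
        PRIMARY KEY (congress, bill_type, number)  -- natural composite key
    )
    """
)

UPSERT = """
INSERT INTO bills (congress, bill_type, number, title, updated_at)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (congress, bill_type, number)
DO UPDATE SET title = excluded.title, updated_at = excluded.updated_at
"""

record = (118, "hr", 1, "Example title", "2025-11-09T03:03:48Z")
conn.execute(UPSERT, record)
conn.execute(UPSERT, record)  # replaying the same record must not add a row

row_count = conn.execute("SELECT COUNT(*) FROM bills").fetchone()[0]
print(row_count)
```

Because the conflict target is the natural key, a retried or resumed ingestion run rewrites existing rows instead of duplicating them.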
## Entity Relationship Diagram (Textual)

```
members 1---* member_terms *---1 chambers
members 1---* bill_sponsors *---1 bills
bills 1---* bill_actions
bills 1---* bill_text_versions
bills *---* subjects (via bill_subjects)
bills *---* committees (via bill_committees)
bills *---* roll_calls *---* members (via roll_call_votes)
amendments *---1 bills (via parent_bill_id)
```

## API Alignment

- **Pagination**: API collections are paged with `offset` and `limit` query parameters, and each
  response includes a `pagination.next` URL for the following page. The ingestion pipeline stores the
  `next` value in `congress.ingest_checkpoints` for resumability.
- **Change Tracking**: Congress.gov publishes `updateDate` fields. These populate
  `updated_at` columns and support incremental refreshes with `--changed-since` CLI filters.
- **Identifiers**: Primary keys follow the API's composite keys (e.g., bill type + number + congress)
  to avoid synthetic IDs where unnecessary.

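
The resumable pagination described above can be sketched as a loop that persists the `next` value after every page. Everything here is illustrative: `FAKE_PAGES` replaces real HTTP responses, a plain dict stands in for `congress.ingest_checkpoints`, and the `ingest` helper and its parameters are hypothetical.

```python
from typing import Callable, Optional

# Fake API responses keyed by page token; stands in for HTTP calls.
FAKE_PAGES: dict[str, dict] = {
    "start": {"items": [1, 2], "next": "p2"},
    "p2": {"items": [3, 4], "next": "p3"},
    "p3": {"items": [5], "next": None},
}


def ingest(
    resource: str,
    fetch_page: Callable[[str], dict],
    checkpoints: dict[str, Optional[str]],
    start: str = "start",
) -> list[int]:
    """Walk pages from the stored checkpoint, or from `start` on a first run."""
    token = checkpoints.get(resource, start)
    items: list[int] = []
    while token is not None:
        page = fetch_page(token)
        items.extend(page["items"])
        token = page["next"]
        checkpoints[resource] = token  # persist progress after each page
    return items


fresh_checkpoints: dict[str, Optional[str]] = {}
full = ingest("bills", FAKE_PAGES.__getitem__, fresh_checkpoints)

resumed = ingest("bills", FAKE_PAGES.__getitem__, {"bills": "p2"})
print(full, resumed)
```

If a run crashes between processing a page and saving the checkpoint, that page is fetched again on resume, which is harmless as long as writes are idempotent.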
## Future Extensions

1. **Historical Data Normalization**: Some early Congresses have incomplete metadata. Consider
   augmenting the schema with archival datasets when available.
2. **Event Streams**: If near-real-time updates are required, add Kafka topics fed by the ingestion
   script and consumers that apply change sets to the database.
3. **Analytic Warehousing**: Mirror the normalized schema into star schemas in your analytics layer
   for simplified reporting (e.g., `fact_votes`, `dim_member`).
