Fix stale calibration targets by deriving time_period from dataset by baogorek · Pull Request #505 · PolicyEngine/policyengine-us-data

baogorek · 2026-02-02T15:36:42Z

Summary

- Fixes state/CD calibration using stale 2022-2023 targets instead of the correct 2024 values
- Removes the hardcoded CBO_YEAR and TREASURY_YEAR constants from etl_national_targets.py
- Adds a --dataset CLI argument to specify the source dataset
- Derives time_period from sim.default_calculation_period, so the dataset itself is now the single source of truth

Root Cause

The ETL had hardcoded year constants:

CBO_YEAR = 2023  # was pulling 2023 CBO values
TREASURY_YEAR = 2023  # was pulling 2023 Treasury values

But the calibration runs at time_period=2024. For income tax alone, this left the target at $2,051B instead of $2,426B, so the correct value is about 18% higher than the stale one.
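
As a quick check on the size of that gap, the two figures quoted above can be compared directly. This is only arithmetic on the numbers in this description, assuming $2,051B is the stale target and $2,426B the 2024 one; the function name is illustrative and not part of the repository.

```python
def relative_gap(stale: float, correct: float) -> float:
    """Return how far `correct` sits above `stale`, as a fraction of `stale`."""
    return (correct - stale) / stale


stale_income_tax = 2_051e9    # target pulled with the hardcoded year
correct_income_tax = 2_426e9  # target expected at time_period=2024

gap = relative_gap(stale_income_tax, correct_income_tax)
print(f"{gap:.1%}")  # prints 18.3%
```

Measured against the 2024 value instead of the stale one, the same difference is about 15%, so the 18% figure reads as "relative to the stale target".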

The Fix

Instead of hardcoding years, we now derive the time period from the dataset:

sim = Microsimulation(dataset=args.dataset)
time_period = int(sim.default_calculation_period)  # e.g., 2024

This ensures CBO/Treasury targets always match the dataset's year, preventing future drift when updating to new base years annually.
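
A minimal sketch of why deriving the year matters, under stated assumptions: StubSimulation is a stand-in for Microsimulation (only default_calculation_period is modeled, and it is assumed to be castable with int()), and the lookup table is illustrative, treating the $2,051B figure as the 2023 value and $2,426B as the 2024 value.

```python
from dataclasses import dataclass


@dataclass
class StubSimulation:
    """Stand-in for Microsimulation; only the attribute used here is modeled."""

    default_calculation_period: str


# Illustrative lookup table keyed by year (figures taken from this description).
INCOME_TAX_TARGETS = {2023: 2_051e9, 2024: 2_426e9}


def target_for(sim, targets: dict) -> float:
    """Pick the target whose year matches the dataset's calculation period."""
    time_period = int(sim.default_calculation_period)
    if time_period not in targets:
        # Fail loudly rather than silently falling back to another year.
        raise KeyError(f"No target available for {time_period}")
    return targets[time_period]


sim = StubSimulation(default_calculation_period="2024")
print(target_for(sim, INCOME_TAX_TARGETS))
```

With a hardcoded year, the lookup key and the dataset can disagree without any error; here a mismatch either selects the right year or raises.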

Usage

# Default: uses HuggingFace production dataset
python policyengine_us_data/db/etl_national_targets.py

# Or specify a local dataset
python policyengine_us_data/db/etl_national_targets.py \
  --dataset /path/to/stratified_extended_cps.h5

Test plan

Run make database to regenerate policy_data.db
Verify CBO/Treasury targets now show 2024 values
Verify income_tax target is ~$2,426B (not $2,051B)

Closes #503

🤖 Generated with Claude Code

- Remove hardcoded CBO_YEAR and TREASURY_YEAR constants
- Add --dataset CLI argument to etl_national_targets.py
- Derive time_period from sim.default_calculation_period
- Default to HuggingFace production dataset

The dataset itself is now the single source of truth for the calibration year, preventing future drift when updating to new base years.

Closes #503

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The CBO income_tax parameter represents positive-only receipts (refundable credit payments in excess of liability are classified as outlays, not negative receipts). Using income_tax_positive matches this definition.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
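
To illustrate the distinction in that commit message, here is a small sketch with made-up tax units. It assumes income_tax_positive is the per-unit floor of income_tax at zero, which the message implies but does not spell out; the helper names and numbers are hypothetical.

```python
def weighted_total(values, weights):
    """Sum of value * weight across tax units."""
    return sum(v * w for v, w in zip(values, weights))


def positive_part(values):
    """Floor each value at zero, dropping net refunds from the total."""
    return [max(v, 0.0) for v in values]


# Hypothetical tax units: negative values are net refundable credits.
income_tax = [5_000.0, -1_200.0, 800.0, -300.0]
weights = [100.0, 50.0, 200.0, 80.0]

net_total = weighted_total(income_tax, weights)
positive_total = weighted_total(positive_part(income_tax), weights)

print(net_total)       # 576000.0
print(positive_total)  # 660000.0
```

Calibrating the net total against a receipts-only target would push the weights to make up the difference, which is the mismatch the commit removes.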

All ETL scripts now derive their target year from the dataset's default_calculation_period instead of hardcoding years. This ensures all calibration targets stay synchronized when updating to a new base year annually.

Updated scripts:
- create_initial_strata.py
- etl_age.py
- etl_irs_soi.py (with configurable --lag for IRS data delay)
- etl_medicaid.py
- etl_snap.py
- etl_state_income_tax.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Update parse_ucgid to recognize both 5001800US (118th) and 5001900US (119th Congress)
- Expand Puerto Rico and territory filters to handle both Congress code formats
- Update TERRITORY_UCGIDS and NON_VOTING_GEO_IDS with 119th Congress codes

This ensures consistent redistricting alignment: 2024 ACS data uses 119th Congress codes natively, and IRS SOI data is converted via the 116th→119th mapping matrix.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
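
A rough sketch of what accepting both prefixes can look like. This is not the repository's parse_ucgid: the function name, return shape, and the assumption that the suffix is a two-digit state FIPS code followed by a two-digit district number are all illustrative.

```python
# Prefixes named in the commit message: 118th and 119th Congress.
CD_PREFIXES = ("5001800US", "5001900US")


def parse_cd_ucgid(ucgid: str):
    """Return (state_fips, district) for a district UCGID, else None.

    Assumes the suffix is SSDD: two digits of state FIPS, two of district.
    """
    for prefix in CD_PREFIXES:
        if ucgid.startswith(prefix):
            suffix = ucgid[len(prefix):]
            if len(suffix) == 4 and suffix.isdigit():
                return int(suffix[:2]), int(suffix[2:])
    return None


print(parse_cd_ucgid("5001800US0601"))
print(parse_cd_ucgid("5001900US0601"))
print(parse_cd_ucgid("0400000US06"))
```

Keeping the prefixes in one tuple means any filter or constructor that needs them can share the same definition instead of repeating a single Congress code.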

Revert deterministic hash-based medicaid/SSI seed logic in cps.py, update Makefile seed to 3526.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Needed for income_tax_positive variable used in loss.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

baogorek · 2026-02-05T19:13:24Z

@MaxGhenis we're doing pretty well on the new income tax target from CBO.

The SNAP CBO target looks equally good.

We're roughly 25% off on Social Security, SSI, and EITC, which is obviously not great.

I would still highly recommend pushing this through; we can adjust from here. We'll finally be in 2024 for local areas and mapped to the 119th Congress.

MaxGhenis · 2026-02-05T21:52:27Z

PR Review

🔴 Critical (Must Fix)

Hardcoded target values tagged with dynamic time_period but only valid for 2024: The direct sum targets in etl_national_targets.py (e.g., medicaid: 871.7e9, net_worth: 160e12, rent: 735e9, social_security_retirement: 1_060e9, etc.) are specific dollar amounts for 2024, yet they're now tagged with year: time_period which is dynamically derived from the dataset. If the dataset's default_calculation_period is ever not 2024, these hardcoded dollar targets would be misattributed to the wrong year. The PR description says the fix is about deriving the year from the dataset, but the dollar values themselves are still hardcoded for 2024 — only CBO_YEAR and TREASURY_YEAR parameter lookups actually benefit from the dynamic year. Consider either:
- Keeping the hardcoded targets explicitly labeled as year: 2024 (since they are 2024 values), or
- Parameterizing these values too (e.g., reading from a parameter tree or scaling with uprating factors)
The same issue exists in loss.py where HARD_CODED_TOTALS are still used directly without year awareness.
pull_soi_targets.py still hardcodes 5001800US (118th Congress) for GEO_ID construction (pull_soi_targets.py:252): While the PR adds 119th Congress codes (5001900US) to exclusion/filter sets, the SOI target extraction still constructs GEO_IDs with the 118th Congress prefix. When the IRS releases SOI data for 2024 (which will use 119th Congress districts), this will break or produce mismatched IDs.
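
One way to read the first option above (keep the hardcoded values explicitly labeled as 2024) is sketched below, with a guard added so a different dataset year fails loudly. The dollar values are the ones quoted in this review; the constant names, record shape, and function are hypothetical and do not reflect the actual code in etl_national_targets.py.

```python
HARDCODED_TARGETS_YEAR = 2024

# Values quoted in the review; all are 2024 dollar amounts.
HARDCODED_TARGETS = {
    "medicaid": 871.7e9,
    "net_worth": 160e12,
    "rent": 735e9,
    "social_security_retirement": 1_060e9,
}


def label_hardcoded_targets(time_period: int) -> list:
    """Tag hardcoded targets with their true year, refusing other years."""
    if time_period != HARDCODED_TARGETS_YEAR:
        raise ValueError(
            f"Hardcoded targets are for {HARDCODED_TARGETS_YEAR}, "
            f"but the dataset year is {time_period}"
        )
    return [
        {"variable": name, "value": value, "year": HARDCODED_TARGETS_YEAR}
        for name, value in HARDCODED_TARGETS.items()
    ]


records = label_hardcoded_targets(2024)
print(records[0])
```

The second option (parameterizing or uprating the values) would replace the guard with a lookup, at the cost of needing a source for each year.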

🟡 Should Address

DEFAULT_DATASET constant placed between imports: In multiple ETL files (e.g., etl_age.py:9, etl_snap.py:14, etl_irs_soi.py:12, create_initial_strata.py:11, etl_state_income_tax.py:21), the DEFAULT_DATASET constant is defined between import blocks, splitting the imports. Per PEP 8, all imports should be grouped at the top before module-level constants. Move these constants below all imports.
ssi_resource_test_seed added to cps.py without corresponding variable in policyengine-us: The new seed ssi_resource_test_seed is added to the dataset (cps.py:216), but there's no grep match for this variable name elsewhere in this repo. Confirm this variable exists in policyengine-us and is expected by a model variable. If it's not yet defined upstream, the seed would be unused data.
get_pseudo_input_variables() gutted to return set() without deprecation path: The function body was replaced with documentation and return set() (calibration_utils.py:251-272). The rationale in the docstring is clear and well-reasoned, but the function signature is preserved while the body is removed. Consider either deleting the function and updating callers, or adding a deprecation warning, rather than leaving a no-op function that could confuse future readers.
Repeated boilerplate across 7 ETL scripts: Each ETL script now has near-identical argparse + Microsimulation + year derivation logic (~15 lines each). Consider extracting this into a shared utility function (e.g., def derive_year_from_dataset(parser_description) -> (args, year)) to reduce duplication and ensure consistency.
Makefile stratified CPS parameters changed without explanation in PR description: The Makefile changes 10500 to 12000 --top=99.5 --seed=3526, but the PR description doesn't mention why these specific values were chosen. The seed enables reproducibility (good), but what motivated 12000 households and the 99.5th percentile threshold? A comment in the Makefile or changelog would help.
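
For the boilerplate point above, a shared helper might look roughly like this. It is a sketch, not the repository's code: the simulation_factory and argv parameters are additions beyond the signature suggested in the review (they keep the example runnable without policyengine-us), the stub class stands in for Microsimulation, and the default dataset string is a placeholder.

```python
import argparse

# Placeholder only; the real default dataset location is not shown in this thread.
DEFAULT_DATASET = "path/or/url/to/default_dataset.h5"


def derive_year_from_dataset(description, simulation_factory, argv=None):
    """Parse --dataset, build a simulation, and return (args, year)."""
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument(
        "--dataset",
        default=DEFAULT_DATASET,
        help="Source dataset used to derive the target year",
    )
    args = parser.parse_args(argv)
    sim = simulation_factory(dataset=args.dataset)
    year = int(sim.default_calculation_period)
    return args, year


class StubSim:
    """Stand-in for Microsimulation so the sketch runs on its own."""

    def __init__(self, dataset):
        self.dataset = dataset
        self.default_calculation_period = "2024"


args, year = derive_year_from_dataset(
    "ETL demo", StubSim, argv=["--dataset", "local.h5"]
)
print(args.dataset, year)
```

Each ETL script would then call the helper once and add its own extra arguments (such as --lag) on top, rather than repeating the parser setup.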

🟢 Suggestions

Changelog entry could be split: The single changelog line covers three distinct changes (stale targets, income_tax_positive, 119th Congress support). Consider splitting into separate entries for clarity, since these are logically independent fixes.
income_tax → income_tax_positive is a significant conceptual change: The switch from calibrating to income_tax (which can be negative due to refundable credits) to income_tax_positive (only positive tax liability) is well-justified by CBO accounting conventions. The comment references are helpful. This is a good change.
The IRS_SOI_LAG_YEARS = 2 constant in etl_irs_soi.py is a good pattern — making the lag configurable via CLI (--lag) is thoughtful engineering.
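
The lag pattern mentioned in the last suggestion can be sketched as follows. It assumes the SOI target year is simply the dataset year minus the lag, which the thread suggests but does not state; the function name is hypothetical, while IRS_SOI_LAG_YEARS and --lag are the names used in the review.

```python
import argparse

IRS_SOI_LAG_YEARS = 2  # default lag named in the review


def soi_year(dataset_year: int, lag: int = IRS_SOI_LAG_YEARS) -> int:
    """Return the IRS SOI data year implied by the dataset year and lag."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    return dataset_year - lag


parser = argparse.ArgumentParser(description="IRS SOI lag demo")
parser.add_argument("--lag", type=int, default=IRS_SOI_LAG_YEARS)

default_args = parser.parse_args([])
override_args = parser.parse_args(["--lag", "3"])

print(soi_year(2024, default_args.lag))   # 2022
print(soi_year(2024, override_args.lag))  # 2021
```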

Validation Summary

| Check | Result |
| --- | --- |
| CI Status | ✅ All passing (lint, smoke test, full test) |
| Core fix (derive year from dataset) | ✅ Correctly implemented for CBO/Treasury parameters |
| income_tax → income_tax_positive | ✅ Consistent between `loss.py` and `etl_national_targets.py` |
| 119th Congress district support | ⚠️ Exclusion filters updated but `pull_soi_targets.py` GEO_ID construction still 118th-only |
| Hardcoded dollar targets | ⚠️ Values are 2024-specific but tagged as dynamic `time_period` |
| Code duplication | ⚠️ Argparse boilerplate repeated across 7 files |
| Test coverage | ℹ️ No new tests added, but CI passes with existing tests |

Recommendation: COMMENT

The core fix (deriving CBO/Treasury year from the dataset) is sound and addresses the 18% income tax gap described in #503. The income_tax_positive change is well-reasoned. However, the critical issue of hardcoded dollar values being labeled with a dynamic year should be addressed to prevent future confusion when datasets change base years.

baogorek and others added 2 commits February 2, 2026 10:36

baogorek force-pushed the fix-stale-calibration-targets-503 branch from ee54587 to 69406d6 February 2, 2026 18:04

baogorek and others added 6 commits February 2, 2026 13:29

Use deterministic hash for medicaid_take_up_seed

634a75d

Remove seed-related changes to reduce PR scope

22bbe20

Add ssi_resource_test_seed using standard generator convention

c0fd193

Upgrade policyengine-us to 1.550.1 in uv.lock

b618feb

baogorek requested a review from MaxGhenis February 5, 2026 15:55

baogorek wants to merge 8 commits into main from fix-stale-calibration-targets-503
