Fix stale calibration targets by deriving time_period from dataset #505

Open
baogorek wants to merge 8 commits into main from fix-stale-calibration-targets-503

@baogorek
Collaborator

@baogorek baogorek commented Feb 2, 2026

Summary

  • Fixes state/CD calibration using stale 2022-2023 targets instead of correct 2024 values
  • Removes hardcoded CBO_YEAR and TREASURY_YEAR constants from etl_national_targets.py
  • Adds --dataset CLI argument to specify the source dataset
  • Derives time_period from sim.default_calculation_period - the dataset itself is now the single source of truth

Root Cause

The ETL had hardcoded year constants:

CBO_YEAR = 2023  # was pulling 2023 CBO values
TREASURY_YEAR = 2023  # was pulling 2023 Treasury values

But the calibration runs at time_period=2024. This caused an 18% gap for income tax alone ($2,051B vs $2,426B).
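The 18% figure can be reproduced from the two dollar amounts quoted above. The sketch below assumes the gap is measured relative to the stale value; it also prints the gap relative to the correct value for comparison. The variable names are illustrative only.

```python
# Income tax targets quoted in the PR description, in billions of dollars.
stale_target = 2_051    # value the ETL was pulling with the hardcoded year
correct_target = 2_426  # value for the 2024 calibration year

difference = correct_target - stale_target

# Gap relative to the stale value (assumed basis of the 18% figure).
gap_vs_stale = difference / stale_target
# Gap relative to the correct value, for comparison.
gap_vs_correct = difference / correct_target

print(f"relative to stale target:   {gap_vs_stale:.1%}")
print(f"relative to correct target: {gap_vs_correct:.1%}")
```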

The Fix

Instead of hardcoding years, we now derive the time period from the dataset:

sim = Microsimulation(dataset=args.dataset)
time_period = int(sim.default_calculation_period)  # e.g., 2024

This ensures CBO/Treasury targets always match the dataset's year, preventing future drift when updating to new base years annually.
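A minimal, runnable sketch of the same idea follows. It replaces Microsimulation with a stand-in object so it runs without policyengine_us installed; the only behavior assumed is that the simulation exposes default_calculation_period, as in the snippet above. The helper name derive_time_period is hypothetical.

```python
from types import SimpleNamespace


def derive_time_period(sim) -> int:
    """Return the dataset's default calculation year as an integer.

    `sim` is any object exposing `default_calculation_period`; in the PR
    this is a Microsimulation built from the chosen dataset.
    """
    return int(sim.default_calculation_period)


# Stand-in for Microsimulation(dataset=...). Only the one attribute is
# assumed; the value is given as a string to show why int() is applied.
fake_sim = SimpleNamespace(default_calculation_period="2024")

time_period = derive_time_period(fake_sim)
print(time_period)
```

Every year-dependent lookup can then take time_period as an argument instead of reading a module-level constant.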

Usage

# Default: uses HuggingFace production dataset
python policyengine_us_data/db/etl_national_targets.py

# Or specify a local dataset
python policyengine_us_data/db/etl_national_targets.py \
  --dataset /path/to/stratified_extended_cps.h5

Test plan

  • Run make database to regenerate policy_data.db
  • Verify CBO/Treasury targets now show 2024 values
  • Verify income_tax target is ~$2,426B (not $2,051B)

Closes #503

🤖 Generated with Claude Code

baogorek and others added 2 commits February 2, 2026 10:36
- Remove hardcoded CBO_YEAR and TREASURY_YEAR constants
- Add --dataset CLI argument to etl_national_targets.py
- Derive time_period from sim.default_calculation_period
- Default to HuggingFace production dataset

The dataset itself is now the single source of truth for the
calibration year, preventing future drift when updating to new
base years.

Closes #503

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The CBO income_tax parameter represents positive-only receipts (refundable
credit payments in excess of liability are classified as outlays, not
negative receipts). Using income_tax_positive matches this definition.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@baogorek baogorek force-pushed the fix-stale-calibration-targets-503 branch from ee54587 to 69406d6 February 2, 2026 18:04
baogorek and others added 6 commits February 2, 2026 13:29
All ETL scripts now derive their target year from the dataset's
default_calculation_period instead of hardcoding years. This ensures
all calibration targets stay synchronized when updating to a new
base year annually.

Updated scripts:
- create_initial_strata.py
- etl_age.py
- etl_irs_soi.py (with configurable --lag for IRS data delay)
- etl_medicaid.py
- etl_snap.py
- etl_state_income_tax.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
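The --lag option for etl_irs_soi.py can be pictured with the sketch below. It assumes the SOI year is the dataset year minus the lag, and it uses the default of 2 named by IRS_SOI_LAG_YEARS in the review further down; the function name soi_data_year and the argument help text are hypothetical, not taken from the PR.

```python
import argparse

# Default lag between the dataset year and the latest IRS SOI release.
IRS_SOI_LAG_YEARS = 2


def soi_data_year(dataset_year: int, lag: int = IRS_SOI_LAG_YEARS) -> int:
    """Return the SOI year to load for a given dataset year (assumed rule)."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    return dataset_year - lag


parser = argparse.ArgumentParser(description="IRS SOI ETL (sketch)")
parser.add_argument(
    "--lag",
    type=int,
    default=IRS_SOI_LAG_YEARS,
    help="Years by which IRS SOI data trails the dataset year",
)

# An explicit argument list keeps the sketch independent of sys.argv.
args = parser.parse_args(["--lag", "2"])
print(soi_data_year(2024, args.lag))
```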
- Update parse_ucgid to recognize both 5001800US (118th) and 5001900US (119th Congress)
- Expand Puerto Rico and territory filters to handle both Congress code formats
- Update TERRITORY_UCGIDS and NON_VOTING_GEO_IDS with 119th Congress codes

This ensures consistent redistricting alignment: 2024 ACS data uses 119th Congress
codes natively, and IRS SOI data is converted via the 116th→119th mapping matrix.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
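As an illustration of recognizing both prefixes, the sketch below parses a congressional district UCGID under either Congress. It is not the repository's parse_ucgid, whose signature and return value are not shown here; the helper name, the return shape, and the assumption that the suffix is a two-digit state FIPS code followed by a two-digit district number are all illustrative.

```python
from typing import Optional, Tuple

# Congressional district prefixes named in the commit message.
CD_PREFIX_TO_CONGRESS = {
    "5001800US": 118,
    "5001900US": 119,
}


def parse_cd_ucgid(ucgid: str) -> Optional[Tuple[int, str, str]]:
    """Return (congress, state_fips, district), or None if not a CD code.

    Hypothetical helper: assumes the part after the prefix is a 2-digit
    state FIPS code followed by a 2-digit district number.
    """
    for prefix, congress in CD_PREFIX_TO_CONGRESS.items():
        if ucgid.startswith(prefix):
            suffix = ucgid[len(prefix):]
            if len(suffix) == 4 and suffix.isdigit():
                return congress, suffix[:2], suffix[2:]
    return None


print(parse_cd_ucgid("5001800US0601"))
print(parse_cd_ucgid("5001900US0601"))
print(parse_cd_ucgid("0400000US06"))
```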
Revert deterministic hash-based medicaid/SSI seed logic in cps.py,
update Makefile seed to 3526.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Needed for income_tax_positive variable used in loss.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@baogorek baogorek requested a review from MaxGhenis February 5, 2026 15:55
@baogorek
Collaborator Author

baogorek commented Feb 5, 2026

@MaxGhenis we're doing pretty well on the new income tax target from CBO
image

The SNAP CBO target looks equally good.

We're roughly 25% off on Social Security, SSI, and EITC, which is obviously not great.

I would still highly recommend pushing this through; we can adjust from here. We'll finally be in 2024 for local areas and mapped to the 119th Congress.

@MaxGhenis
Contributor

PR Review

🔴 Critical (Must Fix)

  1. Hardcoded target values tagged with dynamic time_period but only valid for 2024: The direct sum targets in etl_national_targets.py (e.g., medicaid: 871.7e9, net_worth: 160e12, rent: 735e9, social_security_retirement: 1_060e9, etc.) are specific dollar amounts for 2024, yet they're now tagged with year: time_period which is dynamically derived from the dataset. If the dataset's default_calculation_period is ever not 2024, these hardcoded dollar targets would be misattributed to the wrong year. The PR description says the fix is about deriving the year from the dataset, but the dollar values themselves are still hardcoded for 2024 — only CBO_YEAR and TREASURY_YEAR parameter lookups actually benefit from the dynamic year. Consider either:

    • Keeping the hardcoded targets explicitly labeled as year: 2024 (since they are 2024 values), or
    • Parameterizing these values too (e.g., reading from a parameter tree or scaling with uprating factors)

    The same issue exists in loss.py where HARD_CODED_TOTALS are still used directly without year awareness.

  2. pull_soi_targets.py still hardcodes 5001800US (118th Congress) for GEO_ID construction (pull_soi_targets.py:252): While the PR adds 119th Congress codes (5001900US) to exclusion/filter sets, the SOI target extraction still constructs GEO_IDs with the 118th Congress prefix. When the IRS releases SOI data for 2024 (which will use 119th Congress districts), this will break or produce mismatched IDs.
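One way to act on the first critical item is to keep the hardcoded values explicitly labeled as 2024 and refuse to attach them to any other year. The sketch below uses the four dollar amounts quoted in that item; the constant names, the record layout, and the choice to raise an error (rather than skip or uprate) are assumptions for illustration, not the repository's code.

```python
from typing import Dict, List

# Year the hardcoded dollar amounts actually describe.
HARDCODED_TARGET_YEAR = 2024

# Values quoted in the review (dollars).
HARDCODED_TARGETS: Dict[str, float] = {
    "medicaid": 871.7e9,
    "net_worth": 160e12,
    "rent": 735e9,
    "social_security_retirement": 1_060e9,
}


def build_hardcoded_targets(time_period: int) -> List[dict]:
    """Return target records, failing loudly on a year mismatch."""
    if time_period != HARDCODED_TARGET_YEAR:
        raise ValueError(
            f"Hardcoded targets describe {HARDCODED_TARGET_YEAR}, "
            f"but the dataset year is {time_period}"
        )
    return [
        {"variable": name, "value": value, "year": HARDCODED_TARGET_YEAR}
        for name, value in HARDCODED_TARGETS.items()
    ]


for record in build_hardcoded_targets(2024):
    print(record)
```

With this guard, moving the dataset to a new base year surfaces the stale dollar amounts immediately instead of silently relabeling them.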

🟡 Should Address

  1. DEFAULT_DATASET constant placed between imports: In multiple ETL files (e.g., etl_age.py:9, etl_snap.py:14, etl_irs_soi.py:12, create_initial_strata.py:11, etl_state_income_tax.py:21), the DEFAULT_DATASET constant is defined between import blocks, splitting the imports. Per PEP 8, all imports should be grouped at the top before module-level constants. Move these constants below all imports.

  2. ssi_resource_test_seed added to cps.py without corresponding variable in policyengine-us: The new seed ssi_resource_test_seed is added to the dataset (cps.py:216), but there's no grep match for this variable name elsewhere in this repo. Confirm this variable exists in policyengine-us and is expected by a model variable. If it's not yet defined upstream, the seed would be unused data.

  3. get_pseudo_input_variables() gutted to return set() without deprecation path: The function body was replaced with documentation and return set() (calibration_utils.py:251-272). The rationale in the docstring is clear and well-reasoned, but the function signature is preserved while the body is removed. Consider either deleting the function and updating callers, or adding a deprecation warning, rather than leaving a no-op function that could confuse future readers.

  4. Repeated boilerplate across 7 ETL scripts: Each ETL script now has near-identical argparse + Microsimulation + year derivation logic (~15 lines each). Consider extracting this into a shared utility function (e.g., def derive_year_from_dataset(parser_description) -> (args, year)) to reduce duplication and ensure consistency.

  5. Makefile stratified CPS parameters changed without explanation in PR description: The Makefile changes 10500 to 12000 --top=99.5 --seed=3526, but the PR description doesn't mention why these specific values were chosen. The seed enables reproducibility (good), but what motivated 12000 households and the 99.5th percentile threshold? A comment in the Makefile or changelog would help.
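The shared utility proposed in item 4 might look like the sketch below. The review suggests the name derive_year_from_dataset and the (args, year) return value; the sim_factory and argv parameters are additions made here so the example runs without policyengine_us, and the DEFAULT_DATASET value is a hypothetical placeholder rather than the real default.

```python
import argparse
from types import SimpleNamespace
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical placeholder; the real default dataset location is not shown.
DEFAULT_DATASET = "path/to/default_dataset.h5"


def derive_year_from_dataset(
    parser_description: str,
    sim_factory: Callable[[str], object],
    argv: Optional[Sequence[str]] = None,
) -> Tuple[argparse.Namespace, int]:
    """Parse --dataset, build a simulation, and return (args, year)."""
    parser = argparse.ArgumentParser(description=parser_description)
    parser.add_argument(
        "--dataset",
        default=DEFAULT_DATASET,
        help="Source dataset used to derive the target year",
    )
    args = parser.parse_args(argv)
    sim = sim_factory(args.dataset)
    return args, int(sim.default_calculation_period)


def fake_sim_factory(dataset: str) -> SimpleNamespace:
    """Stand-in for `lambda path: Microsimulation(dataset=path)`."""
    return SimpleNamespace(default_calculation_period=2024)


args, year = derive_year_from_dataset(
    "ETL for age targets (sketch)",
    fake_sim_factory,
    argv=["--dataset", "local.h5"],
)
print(args.dataset, year)
```

Injecting the simulation constructor keeps the helper testable; each ETL script would pass a factory that builds the real Microsimulation.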

🟢 Suggestions

  1. Changelog entry could be split: The single changelog line covers three distinct changes (stale targets, income_tax_positive, 119th Congress support). Consider splitting into separate entries for clarity, since these are logically independent fixes.

  2. income_tax → income_tax_positive is a significant conceptual change: The switch from calibrating to income_tax (which can be negative due to refundable credits) to income_tax_positive (only positive tax liability) is well-justified by CBO accounting conventions. The comment references are helpful. This is a good change.

  3. The IRS_SOI_LAG_YEARS = 2 constant in etl_irs_soi.py is a good pattern — making the lag configurable via CLI (--lag) is thoughtful engineering.
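To see why the income_tax → income_tax_positive switch changes the calibrated total, the sketch below compares a net aggregate with a positive-only aggregate on made-up tax units. It assumes income_tax_positive is the per-unit liability floored at zero, which matches the description above but is not confirmed against the model code; the liabilities and weights are invented for illustration.

```python
from typing import Sequence


def net_total(liabilities: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of liabilities, letting refunds offset receipts."""
    return sum(w * x for x, w in zip(liabilities, weights))


def positive_only_total(
    liabilities: Sequence[float], weights: Sequence[float]
) -> float:
    """Weighted sum with each liability floored at zero (assumed rule)."""
    return sum(w * max(x, 0.0) for x, w in zip(liabilities, weights))


# Invented tax units: the second has refundable credits exceeding liability.
liabilities = [5_000.0, -2_000.0, 1_200.0]
weights = [100.0, 150.0, 50.0]

print(net_total(liabilities, weights))            # 260000.0
print(positive_only_total(liabilities, weights))  # 560000.0
```

The positive-only total is always at least as large as the net total, so calibrating the net variable against a positive-only CBO figure would push the weights in the wrong direction.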


Validation Summary

| Check | Result |
| --- | --- |
| CI Status | ✅ All passing (lint, smoke test, full test) |
| Core fix (derive year from dataset) | ✅ Correctly implemented for CBO/Treasury parameters |
| income_tax → income_tax_positive | ✅ Consistent between loss.py and etl_national_targets.py |
| 119th Congress district support | ⚠️ Exclusion filters updated but pull_soi_targets.py GEO_ID construction still 118th-only |
| Hardcoded dollar targets | ⚠️ Values are 2024-specific but tagged as dynamic time_period |
| Code duplication | ⚠️ Argparse boilerplate repeated across 7 files |
| Test coverage | ℹ️ No new tests added, but CI passes with existing tests |

Recommendation: COMMENT

The core fix (deriving CBO/Treasury year from the dataset) is sound and addresses the 18% income tax gap described in #503. The income_tax_positive change is well-reasoned. However, the critical issue of hardcoded dollar values being labeled with a dynamic year should be addressed to prevent future confusion when datasets change base years.

Development

Successfully merging this pull request may close these issues.

State calibration (policy_data.db) uses stale 2022-2023 targets for 2024 sim

2 participants