Skip to content

Move all randomness to data package for deterministic country package#451

Merged
MaxGhenis merged 8 commits intomainfrom
migrate-random-to-data-upstream
Feb 7, 2026
Merged

Move all randomness to data package for deterministic country package#451
MaxGhenis merged 8 commits intomainfrom
migrate-random-to-data-upstream

Conversation

@MaxGhenis
Copy link
Contributor

@MaxGhenis MaxGhenis commented Dec 3, 2025

Summary

Move ALL random number generation from policyengine-us into dataset generation, making the country package a purely deterministic rules engine. Adds name-based seeding for order-independent reproducibility and covers all 10 stochastic programs.

MERGE ORDER: This PR must be merged BEFORE PolicyEngine/policyengine-us#7326

Background

This PR and its companion (policyengine-us#7326) supersede the earlier policyengine-us#6635, which was built against an older data package design (shared-seed RNG, no WIC draws, no SSI, uniform Medicaid rate). Rather than rebasing that branch through months of conflicts, we started fresh on current main to incorporate name-based seeding, state-specific Medicaid, and Section 1931 deprivation rules in one shot.

Changes

Name-based seeding

  • Ported _stable_string_hash from policyengine-core to utils/randomness.py
  • Each variable gets its own RNG via seeded_rng(variable_name), making results independent of generation order
  • seeded_rng supports an optional salt parameter for per-subgroup streams (e.g., congressional districts), while the unsalted default remains distinct from any salted variant

State-specific Medicaid takeup rates

  • Replaced uniform 93% rate with per-state rates (53% Utah to 99% Colorado/DC/Indiana)
  • 51 jurisdictions (50 states + DC) from MACPAC enrollment targets

New variables generated

  • meets_ssi_resource_test (bool, rate=0.4)
  • wic_takeup_draw (float [0,1))
  • wic_nutritional_risk_draw (float [0,1))

Existing variables updated

  • EITC, DC PTC, SNAP, ACA, Medicaid, Head Start, Early Head Start now use per-variable seeded RNG

Related PRs

Test plan

  • pytest passes for stochastic variable tests
  • Verified takeup rates match YAML parameters (~0.672 for ACA, etc.)
  • Name-based seeding produces identical results across runs
  • Salt parameter produces distinct but reproducible streams

@MaxGhenis MaxGhenis force-pushed the migrate-random-to-data-upstream branch from c5844cf to a502f1a Compare December 3, 2025 22:42
@MaxGhenis MaxGhenis requested a review from baogorek December 10, 2025 16:14
@baogorek
Copy link
Collaborator

I haven't forgotten about this. Will get to it (to review) Tues, Dec 16th.

Copy link
Collaborator

@baogorek baogorek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaxGhenis apologies for taking so long on this review.

My first thought is, do we really want to add the responsibility of dealing with parameters (the take up rates) in this package when they're already handled in policyengine-us? That's a lot of yaml files if we don't need them, and it feels like duplication.

I'm not crazy about the order of the computations mattering. There will almost certainly be breaks in reproducibility if the generator is used in between these take-up decisions.

MaxGhenis and others added 5 commits February 5, 2026 17:32
This change moves ALL random number generation from policyengine-us into the
dataset generation in policyengine-us-data. The country package is now a
purely deterministic rules engine.

## Key Changes

### policyengine-us-data:
- Add take-up rate YAML parameter files in `parameters/take_up/`
- Generate all stochastic boolean take-up decisions in CPS dataset
- Use seeded RNG (seed=100) for full reproducibility

### Stochastic variables generated:
**Take-up decisions (boolean):**
- takes_up_snap_if_eligible
- takes_up_aca_if_eligible
- takes_up_medicaid_if_eligible
- takes_up_eitc (already boolean)
- takes_up_dc_ptc (already boolean)

All random generation now uses np.random.default_rng(seed=100) for full
reproducibility across dataset builds.

## Trade-offs

**IMPORTANT**: Take-up rates can no longer be adjusted dynamically via policy
reforms or in the web app. They are fixed in the microdata. This is an
acceptable trade-off for the cleaner architecture of keeping the country
package purely deterministic. To adjust take-up rates, the microdata must be
regenerated.

Related: policyengine-us PR (must be merged after this)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Create takeup parameter files with rates from NIEER report
- Head Start: 40% (pre-pandemic), 30% (pandemic 2020-2021)
- Early Head Start: 9%
- Generate stochastic takeup in CPS dataset using same pattern as SNAP/Medicaid
- Coordinates with policyengine-us PR adding takeup variables
Tests verify:
- Take-up rate parameters load correctly (EITC, SNAP, Medicaid, etc.)
- Seeded RNG produces deterministic results
- Take-up proportions match expected rates

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Replace shared RNG (seed=100) with per-variable name-based seeding using
_stable_string_hash for order-independent reproducibility. Add state-specific
Medicaid takeup rates (53%-99%), SSI resource test pass rate, and WIC
takeup/nutritional risk draw variables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Supports per-subgroup RNG streams (e.g., congressional districts)
while keeping the unsalted default distinct from any salted variant.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
rng = seeded_rng("meets_ssi_resource_test")
data["meets_ssi_resource_test"] = rng.random(n_persons) < ssi_pass_rate

# WIC draws (country package compares against category-specific rates)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parallel to the -us pr, lets change these to bools

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI

❯ Do you think we did the right thing? Do you regret anything you coded? Should we have pushed back? 

● Honest take: the change is directionally right but has a couple of warts.

  Why Max is right: The float draws were a half-measure. The whole architecture says "data package owns randomness, country package is deterministic." But
  with draws, the country package was still doing draw < rate[category] - it was half-owning the randomness. Now would_claim_wic is just a data-read bool,
  same as every other takeup variable. That's clean.

  What I don't love:

  1. Parameter duplication. The WIC rates now live in both repos. If USDA publishes 2023 rates, someone has to update them in two places. The other takeup
  variables (SNAP, EITC, etc.) don't have this problem because their rates were always only in the data package. We could have deleted the parameter files
  from policyengine-us since nothing references them anymore - but that's a separate PR concern.
  2. The category gate in is_wic_at_nutritional_risk. With draws, NONE people got 0 < 0 = False for free - the rate structure handled it naturally. Now with
  default_value = True, we need an explicit category != NONE check. It works, but it's compensating for a thing the draw pattern handled more elegantly.

  Should we have pushed back? No. Consistency across all takeup variables is worth more than the cleverness of the draw pattern. But I'd flag the parameter
  duplication in the PR description - long term, the rates in policyengine-us are dead code now and should probably be removed.

…riables

Resolves category-specific rate comparisons at data generation time so only
bools (would_claim_wic, wic_nutritional_risk_imputed) are stored in the
dataset, matching the pattern used by all other takeup variables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@baogorek baogorek self-requested a review February 6, 2026 19:22
@baogorek baogorek dismissed their stale review February 6, 2026 19:22

I am baogorek and I've taken over the code writing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@MaxGhenis MaxGhenis merged commit 09641de into main Feb 7, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants