Move all randomness to data package for deterministic country package#451
Move all randomness to data package for deterministic country package#451
Conversation
c5844cf to
a502f1a
Compare
|
I haven't forgotten about this. Will get to it (to review) Tues, Dec 16th. |
There was a problem hiding this comment.
@MaxGhenis apologies for taking so long on this review.
My first thought is, do we really want to add the responsibility of dealing with parameters (the take up rates) in this package when they're already handled in policyengine-us? That's a lot of yaml files if we don't need them, and it feels like duplication.
I'm not crazy about the order of the computations mattering. There will almost certainly be breaks in reproducibility if the generator is used in between these take-up decisions.
This change moves ALL random number generation from policyengine-us into the dataset generation in policyengine-us-data. The country package is now a purely deterministic rules engine. ## Key Changes ### policyengine-us-data: - Add take-up rate YAML parameter files in `parameters/take_up/` - Generate all stochastic boolean take-up decisions in CPS dataset - Use seeded RNG (seed=100) for full reproducibility ### Stochastic variables generated: **Take-up decisions (boolean):** - takes_up_snap_if_eligible - takes_up_aca_if_eligible - takes_up_medicaid_if_eligible - takes_up_eitc (already boolean) - takes_up_dc_ptc (already boolean) All random generation now uses np.random.default_rng(seed=100) for full reproducibility across dataset builds. ## Trade-offs **IMPORTANT**: Take-up rates can no longer be adjusted dynamically via policy reforms or in the web app. They are fixed in the microdata. This is an acceptable trade-off for the cleaner architecture of keeping the country package purely deterministic. To adjust take-up rates, the microdata must be regenerated. Related: policyengine-us PR (must be merged after this) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
- Create takeup parameter files with rates from NIEER report - Head Start: 40% (pre-pandemic), 30% (pandemic 2020-2021) - Early Head Start: 9% - Generate stochastic takeup in CPS dataset using same pattern as SNAP/Medicaid - Coordinates with policyengine-us PR adding takeup variables
Tests verify: - Take-up rate parameters load correctly (EITC, SNAP, Medicaid, etc.) - Seeded RNG produces deterministic results - Take-up proportions match expected rates 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Replace shared RNG (seed=100) with per-variable name-based seeding using _stable_string_hash for order-independent reproducibility. Add state-specific Medicaid takeup rates (53%-99%), SSI resource test pass rate, and WIC takeup/nutritional risk draw variables. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
83cdc3f to
7f67999
Compare
Supports per-subgroup RNG streams (e.g., congressional districts) while keeping the unsalted default distinct from any salted variant. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| rng = seeded_rng("meets_ssi_resource_test") | ||
| data["meets_ssi_resource_test"] = rng.random(n_persons) < ssi_pass_rate | ||
|
|
||
| # WIC draws (country package compares against category-specific rates) |
There was a problem hiding this comment.
parallel to the -us pr, lets change these to bools
There was a problem hiding this comment.
FYI
❯ Do you think we did the right thing? Do you regret anything you coded? Should we have pushed back?
● Honest take: the change is directionally right but has a couple of warts.
Why Max is right: The float draws were a half-measure. The whole architecture says "data package owns randomness, country package is deterministic." But
with draws, the country package was still doing draw < rate[category] - it was half-owning the randomness. Now would_claim_wic is just a data-read bool,
same as every other takeup variable. That's clean.
What I don't love:
1. Parameter duplication. The WIC rates now live in both repos. If USDA publishes 2023 rates, someone has to update them in two places. The other takeup
variables (SNAP, EITC, etc.) don't have this problem because their rates were always only in the data package. We could have deleted the parameter files
from policyengine-us since nothing references them anymore - but that's a separate PR concern.
2. The category gate in is_wic_at_nutritional_risk. With draws, NONE people got 0 < 0 = False for free - the rate structure handled it naturally. Now with
default_value = True, we need an explicit category != NONE check. It works, but it's compensating for a thing the draw pattern handled more elegantly.
Should we have pushed back? No. Consistency across all takeup variables is worth more than the cleverness of the draw pattern. But I'd flag the parameter
duplication in the PR description - long term, the rates in policyengine-us are dead code now and should probably be removed.
…riables Resolves category-specific rate comparisons at data generation time so only bools (would_claim_wic, wic_nutritional_risk_imputed) are stored in the dataset, matching the pattern used by all other takeup variables. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
I am baogorek and I've taken over the code writing
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Move ALL random number generation from policyengine-us into dataset generation, making the country package a purely deterministic rules engine. Adds name-based seeding for order-independent reproducibility and covers all 10 stochastic programs.
MERGE ORDER: This PR must be merged BEFORE PolicyEngine/policyengine-us#7326
Background
This PR and its companion (policyengine-us#7326) supersede the earlier policyengine-us#6635, which was built against an older data package design (shared-seed RNG, no WIC draws, no SSI, uniform Medicaid rate). Rather than rebasing that branch through months of conflicts, we started fresh on current main to incorporate name-based seeding, state-specific Medicaid, and Section 1931 deprivation rules in one shot.
Changes
Name-based seeding
State-specific Medicaid takeup rates
New variables generated
Existing variables updated
Related PRs
Test plan