Phase 4: SMAP soil moisture, wired into the spine as an encoder feature by tmart234 · Pull Request #11 · tmart234/OpenFlow

tmart234 · 2026-05-18T02:16:33Z

Consolidates nasa_moisture into a single main(lat, lon, start, end) ->
DataFrame[Date, soil_moisture] and deletes the three competing (and unused)
implementations (appeears.py, soilmoisture.py, soilmoisture2.py).

Wiring:

- combine_data fetches SMAP per station alongside SWE; lazy-imports
  nasa_moisture so the training env doesn't fail to start when
  earthaccess/h5py aren't installed
- merge_dataframes treats soil_moisture like SWE: slow-varying interior
  interpolation, missing rows default to 0 (never drop a row)
- normalize_data adds soil_moisture to OPTIONAL_NUMERIC with the same
  z-score path
- windowing adds soil_moisture to ENCODER_FEATURES only; decoder window
  stays flow- / SWE- / SM-free (no skillful 14-day SM forecast exists)
- train.py sanity-check includes the new column

Failure modes (HUC8 lookup, Earthdata auth, search, download, extraction)
all degrade gracefully to an empty SMAP series so the spine still produces
rows. OPENFLOW_DISABLE_SMAP=1 short-circuits the fetch for the ablation
baseline.
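
A minimal sketch of the lazy import, the graceful degradation, and the ablation switch described above. The wrapper name `fetch_smap_safely`, its `module_name` and `env` parameters, and the comparison against the literal string "1" are assumptions for illustration; only `nasa_moisture.main(lat, lon, start, end)` and `OPENFLOW_DISABLE_SMAP` come from this PR. A plain empty list stands in for the empty DataFrame[Date, soil_moisture] to keep the sketch dependency-free.

```python
import importlib
import os


def fetch_smap_safely(lat, lon, start, end, module_name="nasa_moisture", env=None):
    """Return SMAP rows for one station, or an empty series on any failure.

    Hypothetical wrapper: an empty list stands in for the empty
    DataFrame[Date, soil_moisture] that the PR describes.
    """
    env = os.environ if env is None else env

    # Ablation switch: skip the fetch entirely for the no-SMAP baseline.
    if env.get("OPENFLOW_DISABLE_SMAP") == "1":
        return []

    try:
        # Lazy import: a missing earthaccess/h5py only surfaces here,
        # not when the training code is first imported.
        module = importlib.import_module(module_name)
        return module.main(lat, lon, start, end)
    except Exception as exc:  # HUC8 lookup, auth, search, download, extraction
        print(f"SMAP fetch failed ({exc!r}); continuing without soil moisture")
        return []
```

Catching the broad `Exception` is deliberate in this sketch: the caller prefers a station with no soil moisture over a station with no rows at all.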

CI:

- ml_training.yml installs earthaccess/h5py/shapely after tensorflow so
  pip resolves shared transitive deps against the tensorflow pin; passes
  EARTHDATA_USERNAME / EARTHDATA_PASSWORD to the training step
- tests/test_nasa_moisture.py covers extraction, granule date parsing,
  auth/search/download failure paths, and the end-to-end main() flow
  with mocked earthaccess + a tiny real HDF5 fixture
- test_combine_data, test_normalize_data, test_windowing updated for the
  new soil_moisture column


…handling

The Phase 4 review flagged three issues that would substantively bias the SMAP signal during training; this fixes all three so the with-vs-without-SMAP ablation actually measures the feature, not the imputation noise.

1. Better missing-value handling for soil_moisture

- fillna(0) said "Sahara desert" on every SMAP gap (RFI, frozen ground, sensor outage). Replaced with: forward-fill + back-fill (SM is slow-varying) -> station median -> 0 only if the station has no observations at all.
- New sm_observed indicator (0/1) tells the model when soil_moisture is a real / short-gap-interpolated retrieval vs imputed via the fallback. Wired through normalize_data (NOT z-scored -- binary semantics) and into windowing.ENCODER_FEATURES (encoder-only, like soil_moisture).

2. SMAP retrieval_qual_flag filtering

- _extract_polygon_mean now drops pixels where bit 0 of the recommended-quality flag is set. Wintertime Colorado retrievals routinely include frozen-ground pixels that look numerically valid but the SMAP team flags as not recommended; without this filter they corrupted the polygon mean.
- Falls back to the prior behavior (no quality filter) when the dataset is absent in the granule, since older versions sometimes omit it.

3. Granule cache to avoid the obvious redownload-per-station waste

- SPL3SMP_E granules are global daily files (~30-100 MB each), so every station sees the same granule for a given day. New module-level _GRANULE_PATH_CACHE plus a persistent cache dir (OPENFLOW_SMAP_CACHE_DIR, defaults to /tmp/openflow_smap_cache) means each granule is fetched once per process and -- with the env var set to a stable path -- once across runs.
- Replaced the TemporaryDirectory in main() since deleting the cache after every per-station call defeated the point.

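The imputation cascade from item 1 can be sketched roughly as follows. The function name `impute_soil_moisture` and the `max_gap_days` limit are hypothetical; the commit message does not state how long a gap may be before it counts as imputed rather than short-gap filled.

```python
import pandas as pd


def impute_soil_moisture(sm: pd.Series, max_gap_days: int = 7):
    """Fill gaps in a daily soil-moisture series and flag trusted values.

    Returns (filled, sm_observed). max_gap_days is an assumed limit,
    not a value taken from the PR.
    """
    # Short gaps: carry neighbouring retrievals forward, then backward.
    short_filled = sm.ffill(limit=max_gap_days).bfill(limit=max_gap_days)

    # 1 = real or short-gap-filled retrieval, 0 = fallback imputation.
    sm_observed = short_filled.notna().astype(int)

    # Longer gaps: station median (NaNs are skipped); 0 only as a last
    # resort when the station has no observations at all.
    median = sm.median()
    fallback = 0.0 if pd.isna(median) else float(median)
    filled = short_filled.fillna(fallback)
    return filled, sm_observed
```

Keeping the indicator as a separate 0/1 column, rather than encoding missingness in the value itself, lets the model learn to discount fallback values.
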
Tests: 4 new cases for quality-flag filtering (recommended / all-rejected / dataset-missing / disabled), 2 new cases for the granule cache (in-process dedup and disk-resume), 3 new cases for combine_data SM handling (indicator truth, median fallback, zero-only-as-last-resort), 2 new cases for normalize_data preserving binary sm_observed. Full local suite: 102 passed.
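
For the retrieval_qual_flag filtering in item 2, a minimal sketch of the masking step, assuming the soil-moisture pixels, their quality flags, and a polygon-membership mask are already loaded as same-shaped arrays and that fill values have been converted to NaN. The function name `polygon_mean` and its signature are illustrative and not the actual `_extract_polygon_mean`.

```python
import numpy as np


def polygon_mean(soil_moisture, qual_flag=None, inside=None):
    """Mean soil moisture over the pixels that pass all masks.

    qual_flag=None mimics granules that lack the quality dataset,
    in which case no quality filtering is applied.
    """
    sm = np.asarray(soil_moisture, dtype=float)
    keep = np.isfinite(sm)

    if inside is not None:
        # Restrict to pixels that fall inside the station's polygon.
        keep &= np.asarray(inside, dtype=bool)

    if qual_flag is not None:
        # Bit 0 set means "not recommended"; drop those pixels.
        flags = np.asarray(qual_flag).astype(np.int64)
        keep &= (flags & 1) == 0

    if not keep.any():
        return float("nan")  # every pixel rejected
    return float(sm[keep].mean())
```

Returning NaN when every pixel is rejected leaves the gap visible to the downstream imputation instead of inventing a value.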

Adds four new data sources -- two wired as encoder features, two structured as baseline-comparison hooks for train.py.

== New encoder features ==

USDM drought (data/get_drought.py)

- Clean public USDM data service per HUC8; weekly snapshots of percent area in each drought category (D0..D4) collapsed to a single intensity index (D0*1 + D1*2 + ... + D4*5, range 0..500).
- Weekly snapshots forward-filled to daily in combine_data with a 14-day ffill limit (USDM cadence is ~weekly so 14 days of headroom is safe).
- 0 means "no drought", which IS the legitimate default for missing rows.

USBR RISE reservoirs (data/get_reservoir.py)

- REST fetch of daily storage + release for any USBR-managed reservoir, via the public RISE JSON:API. Per-page pagination is followed via the JSON:API `links.next` cursor.
- Station -> reservoir(s) mapping lives in .github/reservoir_mapping.txt (commented template shipped; unmapped stations are treated as unregulated and the two reservoir columns + reservoir_observed indicator stay at 0).
- For sites with multiple upstream reservoirs, storage and release are summed (total water held back / total outflow).
- reservoir_observed (0/1) tells the model when the storage / release columns are real data vs the unregulated default, the same way sm_observed flags real-vs-imputed soil moisture.

== Baseline comparison hooks (stubs) ==

NOAA CBRFC (data/get_cbrfc.py) + USBR S2F (data/get_s2f.py)

- Define the train.py integration point (`baseline_predictions(test_samples) -> Optional[np.ndarray]`) and the per-sample fetch API.
- Both return None until the historical archive is wired in; train.py treats None as "skip this baseline" and the persistence comparison is unaffected.
- The module docstrings spell out the data-access gap and the obvious follow-up paths (AHPS tarball archive for CBRFC; per-basin CSV/PDF scrape for S2F, plus the seasonal->daily disaggregation it needs to be comparable to our 14-day horizon).

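The USDM intensity index and the weekly-to-daily fill described under "New encoder features" can be sketched as below. This assumes a weekly DataFrame indexed by snapshot date with columns D0..D4 holding the percent area in each category (non-cumulative, so the index tops out at 500). The column names and both function names are assumptions, not the actual get_drought.py or combine_data API.

```python
import pandas as pd

# Category weights from the commit message: D0*1 + D1*2 + ... + D4*5.
WEIGHTS = {"D0": 1, "D1": 2, "D2": 3, "D3": 4, "D4": 5}


def drought_index(weekly: pd.DataFrame) -> pd.Series:
    """Collapse per-category percent area into one intensity index."""
    return sum(weekly[cat] * weight for cat, weight in WEIGHTS.items())


def weekly_to_daily(index_weekly: pd.Series, start, end, max_gap_days: int = 14):
    """Forward-fill weekly snapshots onto a daily calendar.

    Days more than max_gap_days after the last snapshot, and days before
    the first snapshot, fall back to 0 ("no drought").
    """
    days = pd.date_range(start, end, freq="D")
    daily = index_weekly.reindex(days).ffill(limit=max_gap_days)
    return daily.fillna(0.0)
```

With a snapshot on 2024-01-09 and a 14-day limit, the value carries through 2024-01-23 and drops to the 0 default from 2024-01-24 onward.
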
== Spine wiring ==

combine_data

- merge_dataframes accepts drought_data + reservoir_data; per-source interpolation limits (MAX_DROUGHT_GAP_DAYS=14, MAX_RESERVOIR_GAP_DAYS=14) mirror the SWE / SMAP patterns.
- fetch_and_process_data returns a single dict instead of a growing tuple; each source has its own try/except so a single fetch failure can't take down the rest of the spine.
- Per-source ablation env vars: OPENFLOW_DISABLE_DROUGHT, OPENFLOW_DISABLE_RESERVOIR (joining OPENFLOW_DISABLE_SMAP). Surfaced as workflow_dispatch inputs in ml_training.yml so you can run an ablation from the GitHub UI without touching yaml.

normalize_data + windowing

- OPTIONAL_NUMERIC now includes drought_index + reservoir_{storage,release}.
- INDICATOR_COLUMNS adds reservoir_observed (kept binary, NOT z-scored).
- ENCODER_FEATURES grows from 9 to 13 (the new auxiliaries are all history-window features; the decoder window stays clean of them since none has a skillful 14-day forecast).

train.py

- Reports CBRFC + S2F per-horizon MAE alongside the persistence comparison whenever those modules return a non-None prediction tensor. With the current stubs, they're silently skipped; the integration is ready to activate the moment fetch() returns real data.

Tests: 130 passed locally. New test files for drought (8 cases), reservoir (7 cases), and external baselines (7 cases). Existing combine_data, normalize_data, windowing tests updated for the new columns.
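
The per-source isolation in fetch_and_process_data might look roughly like the sketch below. The function name `fetch_all_sources`, the `fetchers` mapping, and the use of None for a failed or disabled source are assumptions; only the dict return shape, the per-source try/except, and the OPENFLOW_DISABLE_* variable names come from the commit message.

```python
import os


def fetch_all_sources(fetchers, station_id, env=None):
    """Run every fetcher independently and collect the results in a dict.

    fetchers maps a source name ("smap", "drought", "reservoir") to a
    callable taking a station id. A disabled or failing source yields
    None here; the real code may use an empty frame instead.
    """
    env = os.environ if env is None else env
    results = {}
    for name, fetch in fetchers.items():
        # Per-source ablation switch, e.g. OPENFLOW_DISABLE_DROUGHT=1.
        if env.get(f"OPENFLOW_DISABLE_{name.upper()}") == "1":
            results[name] = None
            continue
        try:
            results[name] = fetch(station_id)
        except Exception as exc:
            # One source failing must not take down the others.
            print(f"{name} fetch failed for {station_id}: {exc!r}")
            results[name] = None
    return results
```

Returning a dict keyed by source name means adding a fifth source does not change the call sites that unpack the existing four.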

claude added 3 commits May 17, 2026 03:38

tmart234 merged commit aab12da into dev May 18, 2026
4 checks passed

tmart234 deleted the claude/phase-4-smap-moisture-vdyYW branch May 18, 2026 02:16
