
Phase 4: SMAP soil moisture, wired into the spine as an encoder feature #11

Merged
tmart234 merged 3 commits into dev from claude/phase-4-smap-moisture-vdyYW on May 18, 2026

@tmart234 (Owner)

Consolidates nasa_moisture into a single main(lat, lon, start, end) ->
DataFrame[Date, soil_moisture] and deletes the three competing (and unused)
implementations (appeears.py, soilmoisture.py, soilmoisture2.py).

Wiring:

  • combine_data fetches SMAP per station alongside SWE; lazy-imports
    nasa_moisture so the training env doesn't fail to start when
    earthaccess/h5py aren't installed
  • merge_dataframes treats soil_moisture like SWE: slow-varying interior
    interpolation, missing rows default to 0 (never drop a row)
  • normalize_data adds soil_moisture to OPTIONAL_NUMERIC with the same
    z-score path
  • windowing adds soil_moisture to ENCODER_FEATURES only; decoder window
    stays flow- / SWE- / SM-free (no skillful 14-day SM forecast exists)
  • train.py sanity-check includes the new column
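The encoder-only rule in the windowing bullet can be sketched as follows. This is a minimal illustration, not the project's windowing module: the feature lists, the 30-day history length, and the `make_window` helper are all assumptions made for the example; only the 14-day horizon and the idea that soil_moisture stays out of the decoder come from the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical feature lists for illustration; the real lists are longer.
ENCODER_FEATURES = ["flow", "swe", "soil_moisture", "temp_max"]
DECODER_FEATURES = ["temp_max"]  # assumed: only inputs with a usable forecast
HORIZON_DAYS = 14


def make_window(df: pd.DataFrame, t: int, history_days: int = 30):
    """Split one sample at row t into encoder input, decoder input and target."""
    if t < history_days or t + HORIZON_DAYS > len(df):
        raise ValueError("not enough rows around t for a full window")
    past = df.iloc[t - history_days:t]
    future = df.iloc[t:t + HORIZON_DAYS]
    # History window: observed-only features such as soil_moisture are allowed.
    encoder = past[ENCODER_FEATURES].to_numpy(dtype=float)
    # Forecast window: no flow, SWE or soil moisture leaks in from the future.
    decoder = future[DECODER_FEATURES].to_numpy(dtype=float)
    target = future["flow"].to_numpy(dtype=float)
    return encoder, decoder, target
```

Keeping the two lists separate makes the "no future soil moisture" guarantee a property of the column selection rather than something each caller has to remember.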

Failure modes (HUC8 lookup, Earthdata auth, search, download, extraction)
all degrade gracefully to an empty SMAP series so the spine still produces
rows. OPENFLOW_DISABLE_SMAP=1 short-circuits the fetch for the ablation
baseline.
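A rough sketch of how the lazy import, the graceful degradation, and the ablation switch might fit together. The wrapper name `fetch_smap`, the `fetcher` injection parameter, and the `empty_smap` helper are assumptions for illustration; only `nasa_moisture.main(lat, lon, start, end)`, the `Date` / `soil_moisture` columns, and `OPENFLOW_DISABLE_SMAP=1` come from the description.

```python
import os

import pandas as pd


def empty_smap() -> pd.DataFrame:
    """Empty series with the expected columns, so downstream merges still work."""
    return pd.DataFrame({
        "Date": pd.Series(dtype="datetime64[ns]"),
        "soil_moisture": pd.Series(dtype="float64"),
    })


def fetch_smap(lat, lon, start, end, fetcher=None) -> pd.DataFrame:
    # Ablation baseline: skip the fetch entirely.
    if os.environ.get("OPENFLOW_DISABLE_SMAP") == "1":
        return empty_smap()
    try:
        if fetcher is None:
            # Lazy import: a missing earthaccess/h5py only matters here,
            # not when the training environment starts.
            import nasa_moisture
            fetcher = nasa_moisture.main
        return fetcher(lat, lon, start, end)
    except Exception as exc:  # lookup, auth, search, download, extraction
        print(f"SMAP fetch failed, continuing without it: {exc}")
        return empty_smap()
```

Catching `Exception` also covers the `ImportError` raised when the optional dependencies are absent, so that case takes the same empty-series path as a failed download.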

CI:

  • ml_training.yml installs earthaccess/h5py/shapely after tensorflow so
    pip resolves shared transitive deps against the tensorflow pin; passes
    EARTHDATA_USERNAME / EARTHDATA_PASSWORD to the training step
  • tests/test_nasa_moisture.py covers extraction, granule date parsing,
    auth/search/download failure paths, and the end-to-end main() flow
    with mocked earthaccess + a tiny real HDF5 fixture
  • test_combine_data, test_normalize_data, test_windowing updated for the
    new soil_moisture column

claude added 3 commits May 17, 2026 03:38
…handling

The Phase 4 review flagged three issues that would substantively bias the
SMAP signal during training; this fixes all three so the with-vs-without-SMAP
ablation actually measures the feature, not the imputation noise.

1. Better missing-value handling for soil_moisture
   - fillna(0) said "Sahara desert" on every SMAP gap (RFI, frozen ground,
     sensor outage). Replaced with: forward-fill + back-fill (SM is slow-
     varying) -> station median -> 0 only if the station has no observations
     at all.
   - New sm_observed indicator (0/1) tells the model when soil_moisture is a
     real / short-gap-interpolated retrieval vs imputed via the fallback.
     Wired through normalize_data (NOT z-scored -- binary semantics) and
     into windowing.ENCODER_FEATURES (encoder-only, like soil_moisture).
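The fill chain in item 1 could look roughly like this in pandas. It is a sketch under assumptions: the helper name `fill_soil_moisture` and the `max_gap_days` limit are hypothetical (the commit does not state the gap limit used), and the series is assumed to be one station's daily values with NaN for gaps.

```python
import numpy as np
import pandas as pd


def fill_soil_moisture(sm: pd.Series, max_gap_days: int = 7) -> pd.DataFrame:
    """Fill SMAP gaps and flag which rows are backed by a real retrieval."""
    # Step 1: short gaps are bridged from neighbouring observations.
    short = sm.ffill(limit=max_gap_days).bfill(limit=max_gap_days)
    # Rows still missing after step 1 will come from the fallback.
    observed = short.notna().astype(int)
    # Step 2: station median; step 3: zero only when nothing was observed.
    median = sm.median()
    fallback = 0.0 if pd.isna(median) else float(median)
    return pd.DataFrame({
        "soil_moisture": short.fillna(fallback),
        "sm_observed": observed,
    })
```

The indicator is computed before the fallback is applied, which is what lets the model distinguish a real or short-gap value from a station-median placeholder.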

2. SMAP retrieval_qual_flag filtering
   - _extract_polygon_mean now drops pixels where bit 0 of retrieval_qual_flag
     (the recommended-quality bit) is set. Wintertime Colorado retrievals
     routinely include frozen-ground pixels that look numerically valid but
     that the SMAP team flags as not recommended; without this filter they
     corrupted the polygon mean.
   - Falls back to the prior behavior (no quality filter) when the dataset is
     absent in the granule, since older versions sometimes omit it.
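A minimal sketch of the bit-0 filter with the no-flag fallback. The function name `polygon_mean`, its signature, and the assumption that fill values have already been converted to NaN are illustrative; the actual `_extract_polygon_mean` may differ.

```python
from typing import Optional

import numpy as np


def polygon_mean(
    sm: np.ndarray,
    in_polygon: np.ndarray,
    qual_flag: Optional[np.ndarray] = None,
) -> Optional[float]:
    """Mean soil moisture over polygon pixels, skipping flagged retrievals."""
    valid = in_polygon & np.isfinite(sm)
    if qual_flag is not None:
        # Bit 0 set means "not recommended"; keep only pixels where it is clear.
        valid = valid & ((qual_flag & 1) == 0)
    # qual_flag=None models a granule without the dataset: no quality filter.
    if not valid.any():
        return None
    return float(sm[valid].mean())
```

Returning `None` when every pixel is rejected keeps an all-flagged day distinguishable from a genuinely dry one.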

3. Granule cache to avoid the obvious redownload-per-station waste
   - SPL3SMP_E granules are global daily files (~30-100 MB each), so every
     station sees the same granule for a given day. New module-level
     _GRANULE_PATH_CACHE plus a persistent cache dir
     (OPENFLOW_SMAP_CACHE_DIR, defaults to /tmp/openflow_smap_cache) means
     each granule is fetched once per process and -- with the env var set to
     a stable path -- once across runs.
   - Replaced the TemporaryDirectory in main() since deleting the cache after
     every per-station call defeated the point.

Tests: 4 new cases for quality-flag filtering (recommended / all-rejected /
dataset-missing / disabled), 2 new cases for the granule cache (in-process
dedup and disk-resume), 3 new cases for combine_data SM handling (indicator
truth, median fallback, zero-only-as-last-resort), 2 new cases for
normalize_data preserving binary sm_observed. Full local suite: 102 passed.

Adds four new data sources -- two wired as encoder features, two structured
as baseline-comparison hooks for train.py.

== New encoder features ==

USDM drought (data/get_drought.py)
  - Clean public USDM data service per HUC8; weekly snapshots of percent area
    in each drought category (D0..D4) collapsed to a single intensity index
    (D0*1 + D1*2 + ... + D4*5, range 0..500).
  - Weekly snapshots forward-filled to daily in combine_data with a 14-day
    ffill limit (USDM cadence is ~weekly so 14 days of headroom is safe).
  - 0 means "no drought", which IS the legitimate default for missing rows.
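As an illustration of the intensity index and the weekly-to-daily step, here is a small pandas sketch. It assumes the D0..D4 columns hold non-cumulative percent area (each category counted once), which is what the stated 0..500 range implies; the function names are hypothetical.

```python
import pandas as pd

# Weight per drought category, as in D0*1 + D1*2 + ... + D4*5.
WEIGHTS = {"D0": 1, "D1": 2, "D2": 3, "D3": 4, "D4": 5}
MAX_DROUGHT_GAP_DAYS = 14


def drought_index(weekly: pd.DataFrame) -> pd.Series:
    """Collapse percent-area columns D0..D4 into one 0..500 intensity index."""
    return sum(weekly[cat] * weight for cat, weight in WEIGHTS.items())


def to_daily(index_weekly: pd.Series, end) -> pd.Series:
    """Forward-fill weekly snapshots to daily; stale or missing days become 0."""
    days = pd.date_range(index_weekly.index.min(), end, freq="D")
    daily = index_weekly.reindex(days).ffill(limit=MAX_DROUGHT_GAP_DAYS)
    return daily.fillna(0.0)
```

With the 14-day limit, a snapshot keeps its value for at most two weeks after its date; beyond that the series falls back to 0, the "no drought" default.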

USBR RISE reservoirs (data/get_reservoir.py)
  - REST fetch of daily storage + release for any USBR-managed reservoir,
    via the public RISE JSON:API. Per-page pagination is followed via the
    JSON:API `links.next` cursor.
  - Station -> reservoir(s) mapping lives in .github/reservoir_mapping.txt
    (commented template shipped; unmapped stations are treated as unregulated
    and the two reservoir columns + reservoir_observed indicator stay at 0).
  - For sites with multiple upstream reservoirs, storage and release are
    summed (total water held back / total outflow).
  - reservoir_observed (0/1) tells the model when the storage / release
    columns are real data vs the unregulated default, the same way
    sm_observed flags real-vs-imputed soil moisture.
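Following the JSON:API `links.next` cursor can be sketched generically. The helper below is an assumption, not the code in get_reservoir.py: it takes an injected `get_json` callable instead of naming a real RISE endpoint, and it accepts `next` either as a plain URL string or as a link object with an `href`, both of which the JSON:API format allows.

```python
from typing import Any, Callable, Dict, List


def fetch_all_pages(
    first_url: str,
    get_json: Callable[[str], Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Collect the `data` records from every page of a JSON:API collection."""
    records: List[Dict[str, Any]] = []
    seen = set()
    url = first_url
    while url and url not in seen:  # the seen-set guards against cursor loops
        seen.add(url)
        page = get_json(url)
        records.extend(page.get("data") or [])
        nxt = (page.get("links") or {}).get("next")
        if isinstance(nxt, dict):
            nxt = nxt.get("href")
        url = nxt
    return records
```

In practice `get_json` would wrap an HTTP GET with a timeout and JSON decoding; passing it in keeps the pagination logic testable without network access.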

== Baseline comparison hooks (stubs) ==

NOAA CBRFC (data/get_cbrfc.py) + USBR S2F (data/get_s2f.py)
  - Define the train.py integration point (`baseline_predictions(test_samples)
    -> Optional[np.ndarray]`) and the per-sample fetch API.
  - Both return None until the historical archive is wired in; train.py
    treats None as "skip this baseline" and the persistence comparison is
    unaffected.
  - The module docstrings spell out the data-access gap and the obvious
    follow-up paths (AHPS tarball archive for CBRFC; per-basin CSV/PDF
    scrape for S2F, plus the seasonal->daily disaggregation it needs to be
    comparable to our 14-day horizon).

== Spine wiring ==

combine_data
  - merge_dataframes accepts drought_data + reservoir_data; per-source
    interpolation limits (MAX_DROUGHT_GAP_DAYS=14, MAX_RESERVOIR_GAP_DAYS=14)
    mirror the SWE / SMAP patterns.
  - fetch_and_process_data returns a single dict instead of a growing tuple;
    each source has its own try/except so a single fetch failure can't take
    down the rest of the spine.
  - Per-source ablation env vars: OPENFLOW_DISABLE_DROUGHT,
    OPENFLOW_DISABLE_RESERVOIR (joining OPENFLOW_DISABLE_SMAP). Surfaced as
    workflow_dispatch inputs in ml_training.yml so you can run an ablation
    from the GitHub UI without touching yaml.
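The dict-returning fetch with per-source isolation might look like the sketch below. The function name `fetch_all_sources`, the zero-argument fetcher callables, and the rule that maps a source name to `OPENFLOW_DISABLE_<NAME>` are assumptions for illustration; the three env var names themselves are from the commit message.

```python
import os
from typing import Callable, Dict

import pandas as pd


def fetch_all_sources(
    fetchers: Dict[str, Callable[[], pd.DataFrame]],
) -> Dict[str, pd.DataFrame]:
    """Fetch every source independently; a failure yields an empty frame."""
    results: Dict[str, pd.DataFrame] = {}
    for name, fetch in fetchers.items():
        # e.g. "drought" -> OPENFLOW_DISABLE_DROUGHT (assumed naming rule)
        if os.environ.get(f"OPENFLOW_DISABLE_{name.upper()}") == "1":
            results[name] = pd.DataFrame()
            continue
        try:
            results[name] = fetch()
        except Exception as exc:
            # One broken source must not take down the others.
            print(f"{name} fetch failed, continuing without it: {exc}")
            results[name] = pd.DataFrame()
    return results
```

Returning a dict keyed by source name means adding a fifth source does not change the signature callers unpack, which was the weakness of the growing tuple.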

normalize_data + windowing
  - OPTIONAL_NUMERIC now includes drought_index + reservoir_{storage,release}.
  - INDICATOR_COLUMNS adds reservoir_observed (kept binary, NOT z-scored).
  - ENCODER_FEATURES grows from 9 to 13 (the new auxiliaries are all
    history-window features; the decoder window stays clean of them since
    none has a skillful 14-day forecast).
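To make the z-score versus indicator split concrete, here is a small sketch. The list contents follow the column names mentioned in these commits, but the `zscore_optional` helper, the population standard deviation (`ddof=0`), and the zero-variance guard are assumptions rather than the behaviour of normalize_data.

```python
import pandas as pd

OPTIONAL_NUMERIC = [
    "soil_moisture", "drought_index", "reservoir_storage", "reservoir_release",
]
INDICATOR_COLUMNS = ["sm_observed", "reservoir_observed"]


def zscore_optional(df: pd.DataFrame) -> pd.DataFrame:
    """Z-score the optional numeric columns; leave 0/1 indicators untouched."""
    out = df.copy()
    for col in OPTIONAL_NUMERIC:
        if col not in out.columns:
            continue  # optional: a source may be disabled or unavailable
        std = out[col].std(ddof=0)
        if pd.isna(std) or std == 0:
            # Constant column (e.g. unregulated station): avoid divide-by-zero.
            out[col] = 0.0
        else:
            out[col] = (out[col] - out[col].mean()) / std
    # INDICATOR_COLUMNS are deliberately skipped to keep their binary meaning.
    return out
```

Scaling a 0/1 flag would turn it into two arbitrary real values that differ per station, which is why the indicators bypass this path.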

train.py
  - Reports CBRFC + S2F per-horizon MAE alongside the persistence comparison
    whenever those modules return a non-None prediction tensor. With the
    current stubs, they're silently skipped; the integration is ready to
    activate the moment fetch() returns real data.
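The "None means skip" contract and the per-horizon MAE report can be sketched as below. The `report_baselines` helper and its dict-of-callables input are assumptions; the `baseline_predictions(test_samples) -> Optional[np.ndarray]` shape of each callable follows the hook described above, and predictions are assumed to have shape (samples, horizon).

```python
from typing import Callable, Dict, Optional

import numpy as np


def per_horizon_mae(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """MAE for each forecast day; inputs have shape (samples, horizon)."""
    return np.mean(np.abs(y_true - y_pred), axis=0)


def report_baselines(
    y_true: np.ndarray,
    test_samples,
    baselines: Dict[str, Callable[[object], Optional[np.ndarray]]],
) -> Dict[str, np.ndarray]:
    """Score every baseline that returns predictions; skip those returning None."""
    results: Dict[str, np.ndarray] = {}
    for name, baseline_predictions in baselines.items():
        pred = baseline_predictions(test_samples)
        if pred is None:
            continue  # stub with no historical archive wired in yet
        if pred.shape != y_true.shape:
            raise ValueError(f"{name}: expected shape {y_true.shape}, got {pred.shape}")
        results[name] = per_horizon_mae(y_true, pred)
    return results
```

Because a stub simply drops out of the result dict, the persistence comparison is computed the same way whether or not the external baselines are available.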

Tests: 130 passed locally. New test files for drought (8 cases), reservoir
(7 cases), and external baselines (7 cases). Existing combine_data,
normalize_data, windowing tests updated for the new columns.
tmart234 merged commit aab12da into dev on May 18, 2026
4 checks passed
tmart234 deleted the claude/phase-4-smap-moisture-vdyYW branch on May 18, 2026 02:16