feat: extensible transforms, process registry, temporal resampling, and reprojection (CLIM-679) #87

Merged: turban merged 23 commits into main from restore/temporal-resampling on May 9, 2026

Contributor

@turban turban commented May 9, 2026

Summary

This PR consolidates three areas of new functionality — all following the same plugin pattern:

Extensible transforms pipeline

  • Dataset YAMLs declare a transforms: list of dotted-path functions applied during zarr build (after download, before writing)
  • Built-in transforms live in climate_api/transforms/; custom transforms can be loaded from any importable package under plugins_dir
  • Built-in transforms: convert_units, deaccumulate_era5, reproject_to_instance_crs
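A minimal sketch of how a dotted-path transforms list could be resolved and applied in order. The helper names (`resolve_dotted_path`, `run_transforms`) and the forwarding of extra entry keys as keyword arguments are assumptions for illustration, not the PR's actual code; stdlib functions stand in for real transforms so the example runs on its own.

```python
from importlib import import_module
from typing import Any, Callable


def resolve_dotted_path(path: str) -> Callable[..., Any]:
    """Import 'package.module.function' and return the function object."""
    module_name, _, attr = path.rpartition(".")
    if not module_name:
        raise ValueError(f"not a dotted path: {path!r}")
    return getattr(import_module(module_name), attr)


def run_transforms(data: Any, transforms: list[dict[str, Any]]) -> Any:
    """Apply each configured transform in order, feeding each output to the next."""
    for entry in transforms:
        if "function" not in entry:
            raise ValueError(f"transform entry is missing 'function': {entry!r}")
        # Assumption: any remaining keys are forwarded as keyword arguments.
        kwargs = {key: value for key, value in entry.items() if key != "function"}
        data = resolve_dotted_path(entry["function"])(data, **kwargs)
    return data


# Stand-in for a parsed `transforms:` list; stdlib callables replace real transforms.
example_transforms = [
    {"function": "builtins.sorted", "reverse": True},
    {"function": "builtins.tuple"},
]
result = run_transforms([2, 3, 1], example_transforms)
```

Because each entry is only a string reference, a plugin package can contribute a transform without any change to the core pipeline, provided the package is importable.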

Reprojection transform

  • reproject_to_instance_crs (rioxarray-backed) reprojects source data to the instance CRS configured in climate-api.yaml
  • No-op when source CRS already matches the instance CRS — WGS84 instances incur zero overhead
  • Wired into chirps3, era5_land (temperature and precipitation), and worldpop dataset templates
  • Adds rioxarray>=0.17 as an explicit dependency
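The no-op behaviour can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in climate_api/transforms/reproject.py: the function name `reproject_if_needed` is hypothetical, the CRS comparison is a plain string match (real code would more likely compare parsed CRS objects), and `_FakeRio` / `_FakeDataset` are stand-ins for the rioxarray `.rio` accessor so the example runs without a PROJ installation. The only rioxarray members assumed are `rio.crs`, `rio.write_crs`, and `rio.reproject`.

```python
from typing import Any


def reproject_if_needed(dataset: Any, instance_crs: str, source_crs: str = "EPSG:4326") -> Any:
    """Reproject `dataset` to `instance_crs`, or return it untouched on a match."""
    crs = dataset.rio.crs
    if crs is None:
        # Assumption: data without CRS metadata is tagged with the declared source CRS.
        dataset = dataset.rio.write_crs(source_crs)
        crs = dataset.rio.crs
    # Simplification: compare CRS identifiers as case-insensitive strings.
    if str(crs).upper() == instance_crs.upper():
        return dataset  # no-op path: nothing is resampled or copied
    return dataset.rio.reproject(instance_crs)


class _FakeRio:
    """Stand-in for the rioxarray accessor; records reprojection calls."""

    def __init__(self, crs: str) -> None:
        self.crs = crs
        self.reprojected_to: list[str] = []

    def reproject(self, dst_crs: str) -> str:
        self.reprojected_to.append(dst_crs)
        return f"reprojected:{dst_crs}"


class _FakeDataset:
    def __init__(self, crs: str) -> None:
        self.rio = _FakeRio(crs)


wgs84 = _FakeDataset("EPSG:4326")
unchanged = reproject_if_needed(wgs84, "EPSG:4326")   # same CRS: returned as-is
projected = reproject_if_needed(wgs84, "EPSG:32736")  # different CRS: reproject called
```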

Process registry with plugin support

  • Pluggable process registry backed by YAML files — same pattern as the dataset registry
  • Built-in processes in climate_api/data/processes/; custom processes via plugins_dir/processes/
  • Each process YAML declares an execution_function dotted path; POST /processes/{id}/execution dispatches generically — no hardcoded process-id checks
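As a rough sketch, merging built-in and plugin process definitions might look like the following. Plain dicts stand in for parsed YAML files; the function names, the "plugin wins on an id clash" rule, and the plugin entry `my_plugins.anomaly.execute_anomaly` are assumptions for illustration only. The `name` and `execution_function` checks mirror the validation described in this PR.

```python
from typing import Any

ProcessDef = dict[str, Any]


def is_dotted_path(value: object) -> bool:
    """True for strings such as 'package.module.function'."""
    if not isinstance(value, str):
        return False
    parts = value.split(".")
    return len(parts) >= 2 and all(part.isidentifier() for part in parts)


def validate_process(process_id: str, definition: ProcessDef) -> None:
    if not definition.get("name"):
        raise ValueError(f"process {process_id!r} is missing 'name'")
    if not is_dotted_path(definition.get("execution_function")):
        raise ValueError(f"process {process_id!r} has an invalid 'execution_function'")


def build_registry(builtin: dict[str, ProcessDef], plugins: dict[str, ProcessDef]) -> dict[str, ProcessDef]:
    """Merge definitions; assumption: a plugin entry replaces a built-in with the same id."""
    merged = {**builtin, **plugins}
    for process_id, definition in merged.items():
        validate_process(process_id, definition)
    return merged


builtin = {
    "resample": {
        "name": "Temporal resampling",
        "execution_function": "climate_api.processing.services.execute_resample",
    }
}
plugins = {  # hypothetical plugin process
    "anomaly": {
        "name": "Anomaly (hypothetical)",
        "execution_function": "my_plugins.anomaly.execute_anomaly",
    }
}
registry = build_registry(builtin, plugins)
```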

Temporal resampling (first built-in process)

  • Aggregate a source dataset to a coarser temporal resolution using pandas frequency aliases (1D, W-MON, MS, etc.)
  • Supported methods: mean, sum, min, max
  • Produces a derived dataset stored as a new zarr alongside the source
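To illustrate the aggregation itself, here is a small pandas sketch that resamples a daily series to month-start (`MS`) periods. It is not the PR's implementation, which operates on zarr-backed datasets; `resample_series` and `SUPPORTED_METHODS` are illustrative names, and the input data is synthetic.

```python
import pandas as pd

SUPPORTED_METHODS = ("mean", "sum", "min", "max")


def resample_series(series: pd.Series, frequency: str, method: str) -> pd.Series:
    """Aggregate a time-indexed series to a coarser frequency."""
    if method not in SUPPORTED_METHODS:
        raise ValueError(f"unsupported method {method!r}; expected one of {SUPPORTED_METHODS}")
    # `frequency` is a pandas offset alias such as '1D', 'W-MON' or 'MS'.
    return series.resample(frequency).agg(method)


# 59 daily values of 1.0 covering January and February 2026.
daily = pd.Series(1.0, index=pd.date_range("2026-01-01", periods=59, freq="D"))
monthly = resample_series(daily, "MS", "sum")
# monthly holds 31.0 for 2026-01-01 and 28.0 for 2026-02-01
```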

Plugin pattern (consistent across datasets, transforms, processes)

| Extension point | Built-in location | Plugin override location | Reference mechanism |
|---|---|---|---|
| Datasets | climate_api/data/datasets/ | plugins_dir/datasets/ | YAML with ingestion.function dotted path |
| Transforms | climate_api/transforms/ | any importable package under plugins_dir | dotted path in dataset YAML transforms: list |
| Processes | climate_api/data/processes/ | plugins_dir/processes/ | YAML with execution_function dotted path |

Key files changed

| File | Change |
|---|---|
| climate_api/transforms/__init__.py | Transform registry + built-in exports |
| climate_api/transforms/reproject.py | reproject_to_instance_crs: rioxarray reprojection |
| climate_api/transforms/unit_conversion.py | convert_units: unit conversion via metpy |
| climate_api/transforms/deaccumulate.py | deaccumulate_era5: ERA5 accumulation fix |
| climate_api/data_registry/services/processes.py | Process registry (list, get, plugin load) |
| climate_api/processing/resample.py | Core temporal resampling logic |
| climate_api/processing/routes.py | Generic process dispatch via registry |
| climate_api/processing/services.py | execute_resample entry point |
| climate_api/data/processes/resample.yaml | Built-in resample process definition |
| climate_api/data/datasets/*.yaml | Transform lists added to all built-in datasets |
| pyproject.toml | Added rioxarray>=0.17 |

Test plan

  • make lint — clean (ruff, mypy, pyright)
  • pytest — all tests pass (transforms, reproject, process registry, resampling, routes)
  • POST /processes/resample/execution with valid request returns 200
  • POST /processes/resample/execution with invalid method/frequency returns 400
  • POST /processes/unknown/execution returns 404
  • Process registry loads built-in resample from data/processes/resample.yaml
  • Custom process YAML in plugins_dir/processes/ merges with built-ins
  • reproject_to_instance_crs no-ops when source CRS matches instance CRS
  • reproject_to_instance_crs calls rio.reproject with the correct target CRS

Supersedes #63. Incorporates #86 (transforms pipeline) and #93 (reprojection transform).

turban added 5 commits May 9, 2026 17:41
Resample parameters (source_dataset_id, period_type, method) are now
passed directly to POST /processes/resample/execution instead of being
declared on a YAML template with sync_kind: derived.

Derived dataset IDs are auto-generated as {source}_{period}_{method}.
The derived sync_kind and processing validation blocks are removed from
the registry, SyncKind enum, and sync engine.
Expose the raw pandas offset alias (e.g. '1D', 'W-MON', 'MS', '10D')
directly in the resample request instead of mapping through a fixed set
of named period types. This removes _resample_frequency(), _PERIOD_ORDER,
and the period hierarchy guard, and unlocks any frequency xarray accepts
(bi-weekly, dekadal, seasonal, etc.) without code changes.

Coverage timestamps for derived artifacts are stored as ISO date strings
via period_type="daily" on the synthetic target dataset dict.
@turban turban changed the base branch from restore/transforms-pipeline to main May 9, 2026 16:19
turban added 3 commits May 9, 2026 18:19
… pattern

- Add climate_api/data/processes/resample.yaml as the built-in resample process definition
- Add climate_api/data_registry/services/processes.py: list_processes(), get_process(),
  plugin loading from plugins_dir/processes/ (same pattern as datasets/plugins_dir/datasets/)
- Route dispatches to process['execution_function'] via registry lookup — no hardcoded process_id check
- Add services.execute_resample() as the generic entry point called by the dispatcher;
  it handles method/frequency validation and returns a JSON-serializable dict
- Custom processes can be added via plugins_dir/processes/*.yaml without touching core code
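The generic dispatch described above could be sketched like this. It is framework-free on purpose: `execute_process` returns a `(status, body)` tuple instead of a FastAPI response, and the registry entries point at stdlib callables (`builtins.dict`, `builtins.len`) purely as hypothetical stand-ins for real execution functions. The status mapping (404 for an unknown id, 400 for invalid or mismatched arguments) follows the behaviour listed in this PR.

```python
from importlib import import_module
from typing import Any


def execute_process(registry: dict[str, dict[str, Any]], process_id: str,
                    payload: dict[str, Any]) -> tuple[int, dict[str, Any]]:
    """Look up a process by id and call its execution function with the payload."""
    process = registry.get(process_id)
    if process is None:
        return 404, {"detail": f"Unknown process: {process_id}"}
    module_name, _, attr = process["execution_function"].rpartition(".")
    func = getattr(import_module(module_name), attr)
    try:
        return 200, func(**payload)
    except (TypeError, ValueError) as exc:
        # TypeError covers payload keys the function does not accept;
        # ValueError covers validation failures raised by the function itself.
        return 400, {"detail": str(exc)}


# Hypothetical registry: stdlib callables stand in for real process functions.
registry = {
    "echo": {"execution_function": "builtins.dict"},
    "strict": {"execution_function": "builtins.len"},
}
ok = execute_process(registry, "echo", {"method": "mean"})
bad = execute_process(registry, "strict", {"unexpected": 1})
missing = execute_process(registry, "unknown", {})
```

One trade-off of catching `TypeError` broadly is that a genuine bug inside an execution function would also surface as a 400 rather than a 500.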
@turban turban marked this pull request as draft May 9, 2026 17:48
turban added 3 commits May 9, 2026 19:55
Introduces a rioxarray-backed reprojection transform that converts
source datasets to the instance CRS during ingestion. The transform is
a no-op when the source CRS already matches the configured instance CRS,
so WGS84 instances incur no overhead.

- Add climate_api/transforms/reproject.py with reproject_to_instance_crs
- Wire the transform into chirps3, era5_land (both variables), and worldpop dataset YAMLs
- Add rioxarray>=0.17 as an explicit dependency
- Add tests using a mocked .rio accessor to avoid local PROJ database conflicts
feat: add reproject_to_instance_crs transform to zarr build pipeline
Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a pluggable process registry (YAML-backed, plugin-overridable) and a generic processing execution route, introducing temporal resampling as the first built-in process and extending the transforms pipeline with built-in transforms (unit conversion, deaccumulation, reprojection). Also expands time-period handling to support weekly ISO week strings across sync and coverage.

Changes:

  • Introduces process registry (+ plugin override support) and POST /processes/{process_id}/execution generic dispatcher.
  • Implements resampling materialization workflow (derived Zarr artifacts) and adds weekly period parsing/normalization.
  • Adds transforms pipeline + built-in transforms, updates dataset YAMLs to use dotted-path transforms, and adds comprehensive tests.

Reviewed changes

Copilot reviewed 30 out of 32 changed files in this pull request and generated 8 comments.

| File | Description |
|---|---|
| tests/test_transforms_reproject.py | Adds tests for the reprojection transform behavior and rioxarray integration. |
| tests/test_transforms.py | Adds tests for unit conversion, deaccumulation, and the dotted-path transforms pipeline. |
| tests/test_shared_time.py | Extends coverage for weekly period normalization/parsing. |
| tests/test_processing_routes.py | Adds route-level tests for /processes/resample/execution behavior and error cases. |
| tests/test_processing_resample.py | Adds extensive tests for resample materialization, edge-period dropping, reuse/overwrite behavior, and publishing. |
| tests/test_process_registry.py | Tests built-in + plugin process registry loading and override behavior. |
| tests/test_datasets_sync.py | Updates sync tests for weekly support and refines unsupported period-type expectations. |
| tests/test_dataset_registry.py | Updates registry tests to reflect new ingestion function strings and removes some ingestion-validation tests. |
| pyproject.toml | Adds rioxarray dependency for .rio accessor support. |
| climate_api/transforms/unit_conversion.py | Implements built-in unit conversion transform (scale + offset). |
| climate_api/transforms/reproject.py | Implements built-in reprojection transform to instance CRS. |
| climate_api/transforms/deaccumulate.py | Implements built-in ERA5 deaccumulation transform. |
| climate_api/transforms/__init__.py | Exposes built-in transforms for dotted-path references. |
| climate_api/shared/time.py | Adds weekly period support and weekly handling in numpy datetime conversions. |
| climate_api/publications/services.py | Refactors managed dataset id generation into a reusable function. |
| climate_api/processing/services.py | Adds execution function for resample with validation and response formatting. |
| climate_api/processing/schemas.py | Adds Pydantic request/response schemas and supported methods constant. |
| climate_api/processing/routes.py | Adds generic process execution endpoint dispatched via process registry. |
| climate_api/processing/resample.py | Implements derived resampling materialization, artifact reuse, completeness checks, and Zarr writing. |
| climate_api/processing/__init__.py | Adds processing package module. |
| climate_api/main.py | Registers processing routes in the FastAPI app. |
| climate_api/ingestions/sync_engine.py | Adds weekly period arithmetic for sync planning. |
| climate_api/ingestions/services.py | Adds helper to store locally materialized Zarr artifacts and supports weekly default end. |
| climate_api/data_registry/services/processes.py | Adds YAML-backed process registry with plugin merging and dotted-path execution loading. |
| climate_api/data_manager/services/downloader.py | Adds transforms execution hook (_run_transforms) into dataset build flow. |
| climate_api/data_accessor/services/accessor.py | Normalizes period-string scalars when computing coverage. |
| climate_api/data/processes/resample.yaml | Registers the built-in resample process and its execution function. |
| climate_api/data/datasets/worldpop.yaml | Adds reprojection transform to dataset definition. |
| climate_api/data/datasets/era5_land.yaml | Switches preprocessing to transforms pipeline and adjusts display range after unit conversion. |
| climate_api/data/datasets/chirps3.yaml | Adds reprojection transform to dataset definition. |
| .gitignore | Ignores derived data directory (data/derived). |

Comment thread climate_api/data_registry/services/processes.py
Comment thread climate_api/data_registry/services/processes.py
Comment thread climate_api/data_registry/services/processes.py Outdated
Comment thread climate_api/processing/routes.py Outdated
Comment thread climate_api/processing/resample.py Outdated
Comment thread climate_api/shared/time.py Outdated
Comment thread pyproject.toml
Comment thread climate_api/data_manager/services/downloader.py
@turban turban changed the title from "feat: temporal resampling for derived datasets (CLIM-679)" to "feat: extensible transforms, process registry, temporal resampling, and reprojection (CLIM-679)" May 9, 2026
turban added 6 commits May 9, 2026 20:29
Remove reproject_to_instance_crs from dataset YAML transforms lists and
call it automatically in build_dataset_zarr after user-defined transforms.
Source CRS defaults to EPSG:4326; datasets with a different source CRS
can declare source_crs in their YAML template.
…ss registry

- shared/time.py: replace np.vectorize with pd.DatetimeIndex.isocalendar() for weekly period strings
- processing/resample.py: derive period_type from frequency alias instead of hardcoding daily
- processing/routes.py: catch TypeError from mismatched kwargs and return HTTP 400
- downloader.py: validate transform entries have required 'function' key with clear error message
- data_registry/services/processes.py: validate execution_function is a valid dotted path
- tests: update weekly/monthly coverage assertions to use correct period string format
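For the weekly period strings, a vectorized conversion based on `pd.DatetimeIndex.isocalendar()` might look like the sketch below. The `YYYY-Www` output format and the function name `to_iso_week_strings` are assumptions; the actual string format used in shared/time.py is not shown in this PR page.

```python
import pandas as pd


def to_iso_week_strings(timestamps) -> list[str]:
    """Convert timestamps to ISO week labels without a Python-level loop."""
    index = pd.DatetimeIndex(timestamps)
    iso = index.isocalendar()  # DataFrame with 'year', 'week' and 'day' columns
    # Use the ISO year, not the calendar year: they differ around New Year.
    labels = iso["year"].astype(str) + "-W" + iso["week"].astype(str).str.zfill(2)
    return labels.tolist()


weeks = to_iso_week_strings(["2026-01-01", "2026-05-09", "2027-01-01"])
# 2027-01-01 falls in ISO week 53 of ISO year 2026
```

Using the ISO year matters at year boundaries: formatting the calendar year with the ISO week number would label 2027-01-01 as a non-existent "2027-W53".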
… block

Replace flat top-level sync_kind, sync_execution, sync_availability fields
with a nested sync: block (kind/execution/availability) in dataset YAMLs.
Update all code reads and validation error messages to match.
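The shape of that change can be illustrated with a small migration helper. This is a hypothetical sketch, not code from the PR: `nest_sync_fields` is an invented name and the field values are placeholders, since the real values used in the dataset YAMLs are not shown here.

```python
from typing import Any

# Old flat key -> new key inside the nested `sync:` block.
FLAT_TO_NESTED = {
    "sync_kind": "kind",
    "sync_execution": "execution",
    "sync_availability": "availability",
}


def nest_sync_fields(template: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a dataset template with flat sync_* keys moved under 'sync'."""
    migrated = {key: value for key, value in template.items() if key not in FLAT_TO_NESTED}
    sync = dict(migrated.get("sync", {}))
    for flat_key, nested_key in FLAT_TO_NESTED.items():
        if flat_key in template:
            sync[nested_key] = template[flat_key]
    if sync:
        migrated["sync"] = sync
    return migrated


# Placeholder values; real templates define their own.
old_template = {
    "id": "example_dataset",
    "sync_kind": "<kind>",
    "sync_execution": "<execution>",
    "sync_availability": "<availability>",
}
new_template = nest_sync_fields(old_template)
```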
- stac/services.py: remove stale convert_units fallback (no longer exists in datasets)
- sync_engine.py: fix error message to say sync.kind instead of sync_kind
- processes.py: validate name field is present in process definitions
- processing/services.py: inline _SUPPORTED_RESAMPLE_METHODS, delete unused schemas.py
- tests: update error message assertion and add missing-name validation test
@turban turban marked this pull request as ready for review May 9, 2026 19:18
@turban turban merged commit 6652d70 into main May 9, 2026
1 check passed