Add NISAR azimuth-block splitting for parallel full-frame processing#717
Open
mgovorcin wants to merge 14 commits into
Open
Add NISAR azimuth-block splitting for parallel full-frame processing#717mgovorcin wants to merge 14 commits into
mgovorcin wants to merge 14 commits into
Conversation
Splits a single non-OPERA-burst frame (e.g. NISAR GSLC) into N azimuth
blocks processed in parallel as synthetic "bursts". Each block reuses
the full input file list but applies its own output_options.bounds, so
wrapped_phase.run masks per block via _get_mask. The existing burst pool
runs the blocks concurrently and stitching_bursts.run mosaics them.
Changes
-------
- New _block_split.split_frame_into_blocks: computes per-block (bounds,
epsg) from cfg georef with a minimum-halo formula
max(half_window_y,
similarity_search_radius * stride_y,
(corr_window_y // 2) * stride_y) + 5
so block edges aren't cropped by SHP/similarity/stitcher windows.
- displacement.py: new module-level _GroupedInputs dataclass and
_prepare_grouped_inputs helper that detects OPERA vs non-OPERA-burst
mode from group_by_burst behavior. In non-OPERA mode, optional file
lists (amp_disp, amp_mean, layover_shadow) are broadcast to every
block id so second-batch reuse works.
- _create_burst_cfg gains a bounds= kwarg that writes
output_options.bounds/_epsg per block. OPERA path passes None and is
unchanged.
- config.InputOptions adds azimuth_blocks: int = 1 (default = no
splitting, OPERA path unchanged) and halo_rows: int | None.
When running azimuth-block mode with n_parallel_bursts > 1, multiple worker processes read the same NISAR GSLC (HDF5) files concurrently at different spatial extents. On NFS/Lustre filesystems this triggers HDF5 file-locking hangs. Set HDF5_USE_FILE_LOCKING=FALSE (via setdefault, so an explicit caller setting is honoured) before spawning workers. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
scottstanie
pushed a commit
to scottstanie/dolphin
that referenced
this pull request
May 24, 2026
Replaces the cube → per-layer tif export with a single per-layer VRT shim that points at ``ZARR:cube.zarr:/<var>:i`` as its source. GDAL's ZARR driver (3.10+) reads the slice natively, so the existing tif-based downstream pipeline (interferogram formation, stitching, unwrap, timeseries) sees ordinary 2D rasters with no pixel-data duplication. Highlights: - Default `zarr_format=2` for the cube (GDAL's ZARR read path supports it cleanly; v3's `bytes` codec isn't implemented on the GDAL read side yet). `zarr_format=3` is still selectable via writer kwarg. - New `emit_layer_vrts`, `zarr_subdataset_uri`, `layer_vrt_path` helpers in `dolphin.io._geozarr`. VRTs use absolute paths so they survive the per-ministack → top-level rename in `sequential.py`. - `_ZarrLayerStack.export_tifs` now writes VRT shims instead of dumping layer data into tif files. `_make_stack_output` returns `.slc.vrt` paths in GEOZARR mode. - Skip the GeoTIFF-only `repack_rasters` passes in GEOZARR mode (the cube is already chunked and compressed; repack would create one tif per VRT, undoing the no-duplication win). - `_ensure_data_array` rejects the reserved coord names (x/y/spatial_ref) and checks shape against any existing array of the same name to catch accidental writes that would silently corrupt a coord. - `wrapped_phase.py` cached/skip path and `sequential._get_outputs_from_folder` also glob for `.vrt` so GEOZARR runs can resume. Parallelism: matches PR isce-framework#717. Each "burst" (or synthetic-burst azimuth-split block) process gets its own work directory and its own cube — no cross-process zarr coordination. Inside a process the cube writer drains on a single background thread, the direct analog of `BackgroundStackWriter`'s single-thread tif feeder, so the parallel ThreadPoolExecutor in `single.py` sees the same write-side throughput either way. Tests updated: - `test_io_geozarr.py`: added VRT-readable-via-rasterio, VRT-survives-move, shape-mismatch rejection, reserved-coord-name rejection, and default-is-v2 cases. - `test_workflows_single.py::test_sequential_geozarr` and `test_workflows_displacement.py::test_displacement_run_geozarr` now assert *no* per-date `.slc.tif` are written, that VRTs are emitted, and that VRT reads round-trip to the cube layer. https://claude.ai/code/session_01UC8HNQWpNsKBmBNhdMhsLu
`format_nc_filename` always emitted `NETCDF:"file":"//ds"`. CF-compliant HDF5s (OPERA CSLCs, COMPASS static_layers) work fine because GDAL's NETCDF driver accepts them, but raw HDF5 files without CF metadata (NISAR GSLCs) hit `RuntimeError: ... No such file or directory` when GDAL's NETCDF driver refuses to open the subdataset. The real split isn't by extension — it's whether the file carries CF metadata. NISAR raw HDF5 doesn't; every other .h5 we care about does. Use a basename prefix check: files starting with NISAR_ get the HDF5 driver, everything else gets NETCDF. Extend the allowlist if/when another non-CF raw-HDF5 source appears. Backport of 7e8f3a3 + f134bcc onto upstream main for the azimuth-block-fixes branch.
Splits a single non-OPERA-burst frame (e.g. NISAR GSLC) into N azimuth
blocks processed in parallel as synthetic "bursts". Each block reuses
the full input file list but applies its own output_options.bounds, so
wrapped_phase.run masks per block via _get_mask. The existing burst pool
runs the blocks concurrently and stitching_bursts.run mosaics them.
Changes
-------
- New _block_split.split_frame_into_blocks: computes per-block (bounds,
epsg) from cfg georef with a minimum-halo formula
max(half_window_y,
similarity_search_radius * stride_y,
(corr_window_y // 2) * stride_y) + 5
so block edges aren't cropped by SHP/similarity/stitcher windows.
- displacement.py: new module-level _GroupedInputs dataclass and
_prepare_grouped_inputs helper that detects OPERA vs non-OPERA-burst
mode from group_by_burst behavior. In non-OPERA mode, optional file
lists (amp_disp, amp_mean, layover_shadow) are broadcast to every
block id so second-batch reuse works.
- _create_burst_cfg gains a bounds= kwarg that writes
output_options.bounds/_epsg per block. OPERA path passes None and is
unchanged.
- config.InputOptions adds azimuth_blocks: int = 1 (default = no
splitting, OPERA path unchanged) and halo_rows: int | None.
When running azimuth-block mode with n_parallel_bursts > 1, multiple worker processes read the same NISAR GSLC (HDF5) files concurrently at different spatial extents. On NFS/Lustre filesystems this triggers HDF5 file-locking hangs. Set HDF5_USE_FILE_LOCKING=FALSE (via setdefault, so an explicit caller setting is honoured) before spawning workers. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The original azimuth-block splitter set each block's ``output_options.bounds`` to its read extent (central rows + halo) and wrote phase-linking outputs over that full extent. Adjacent blocks then *overlapped* in the halo region. ``stitching_bursts.run`` → ``gdal_merge.py`` does last-pixel-wins on overlap with no feathering, so on every interior boundary about half of the overlap band ended up sourced from a block's halo (edge-degraded) rather than its central (clean) rows. Fix: each block runs phase linking over ``read_bounds`` (central + halo) exactly as before — the halo is still loaded so SHP/EVD have clean neighborhood context — but after ``wrapped_phase.run`` returns, the stitching-bound outputs are cropped to ``central_bounds`` (no halo) via ``gdal.Translate``. Adjacent centrals are disjoint by construction, so the stitcher has nothing to choose between. ``comp_slcs`` are not cropped — they feed back into the *same* block's next ministack and must match its read extent. Other cleanups in the same diff: - ``_block_split.py``: new ``BlockBounds`` dataclass carrying both read/central bounds + EPSG. Replace inline NETCDF: URI construction and raw GDAL/OSR calls with ``io.format_nc_filename``, ``io.get_raster_crs``, ``io.get_raster_xysize``, ``io.get_raster_gt``. Drop the dead AUTHORITY-code check (was unreachable under ``osr.UseExceptions``) in favor of a single try/except with a clear "geocoded input required" message. - ``_utils.py``: ``_create_burst_cfg`` takes ``BlockBounds | None`` and clears top-level ``bounds_wkt`` before stamping per-block bounds, so cfgs that set ``bounds_wkt`` globally don't fail pydantic validation on every block. - ``displacement.py``: ``HDF5_USE_FILE_LOCKING=FALSE`` moved into the ``ProcessPoolExecutor`` initializer (``_worker_init``) so the env mutation lives in worker processes, not the parent. Drop the redundant ``num_workers`` shadow of ``num_parallel``. Tests (``tests/test_workflows_block_split.py``): 8 new unit tests covering num_blocks=1, central tiling, halo overlap math at frame edges and interior, halo-too-large raise, missing-CRS friendly message, ``crop_to_central`` in-place rewrite, and frozen dataclass.
On real NISAR GSLCs, ``io.get_raster_crs`` returned an empty CRS because
GDAL's HDF5 driver doesn't read the CF ``grid_mapping`` attribute — that
machinery is in the NETCDF driver, which won't open NISAR (no CF
metadata at the file level). Result: the friendly error from my last
commit ("Could not read a usable EPSG... pass num_blocks=1") fired on
every NISAR run, defeating the whole point of the splitter.
Fix: dispatch in ``_read_grid_metadata``. NISAR-prefixed .h5 files now
go through ``_read_nisar_grid_metadata``, which reads via h5py from the
``grids/frequencyA`` group:
- ``projection`` scalar -> EPSG (uint32 or bytes)
- ``xCoordinates`` / ``yCoordinates`` -> grid extent (cell centers,
backed off by ``x/yCoordinateSpacing/2`` for the UL-edge geotransform)
Honors the dataset's ``grid_mapping`` attribute if present, falling
back to the conventional ``projection`` name. Same convention
``opera_utils.nisar._info.get_nisar_bbox`` uses.
Everything else (.tif, OPERA CSLC .h5, .nc) still routes through the
dolphin GDAL helpers with ``format_nc_filename``.
Also drops a dead defensive check for a missing ``subdataset`` on NISAR
input — pydantic already enforces this on any .h5/.nc cfg.
Tests: add ``_make_nisar_h5`` helper that builds a minimal NISAR-shaped
HDF5 (HH + projection + coord datasets + grid_mapping attr), plus two
new cases (metadata roundtrip via h5py path, full splitter on a
synthetic NISAR frame).
GDAL's HDF5 driver doesn't read NISAR's CF ``grid_mapping`` — it returns an identity geotransform and empty projection regardless of what's in the file. On ``main`` this is invisible because nothing rasterizes against a NISAR VRTStack's CRS, but the moment the azimuth-block splitter sets ``output_options.bounds`` to drive ``masking.create_bounds_mask``, the mask comes back all-zero (the UTM polygon doesn't intersect an identity-GT pixel grid) and the run fails with ``MaskingError: No valid pixels left in mask``. Fix in ``VRTStack.__init__``: after the usual GDAL-based metadata read, if the result is the identity-GT + empty-projection signature *and* the input is NISAR-prefixed .h5, override ``self.gt`` / ``self.proj`` / ``self.srs`` from a focused h5py read. The metadata then flows into the .vrt on disk via the existing ``_write()`` path, so every consumer (bounds mask, nodata mask, stitching, downstream GDAL opens) sees the right CRS and geotransform. Mechanics: - New ``io._core.read_nisar_grid_metadata(filename, subdataset)`` returns ``(nx, ny, gt, epsg, wkt)``. Reads the ``projection`` sibling dataset for the EPSG and ``xCoordinates`` / ``yCoordinates`` for the cell-center grid, backing off by ``x/yCoordinateSpacing/2`` to anchor the geotransform at the upper-left edge. Honors the data dataset's ``grid_mapping`` attribute if present. - New ``io._core._is_nisar_h5(path)`` mirrors ``format_nc_filename``'s NISAR detection (basename prefix). - ``_block_split.py`` drops its private copies and uses the shared helpers, so the NISAR convention lives in one place. Test: new ``test_vrtstack_patches_nisar_metadata`` builds a minimal NISAR-shaped HDF5 and verifies ``VRTStack`` exposes real CRS+GT both as attributes and in the resulting ``.vrt`` on disk. All 11 block-split tests + the broader workflows/io suites (119 total) pass.
Stitching failed with ``RuntimeError: Recursion detected`` from gdal_merge when merging per-block interferograms. Root cause: dolphin's per-block IFGs are virtual interferograms (``.vrt``), and ``crop_to_central`` was calling ``gdal.Translate(tmp.vrt, src.vrt, projWin=...)`` then renaming ``tmp`` over ``src``. ``gdal.Translate`` on a VRT input writes an output VRT whose ``<SourceFilename>`` is the *input path string*, so after the rename the file references itself — gdal_merge follows the chain and trips GDAL's recursion guard. Fix: always materialize through ``format="GTiff"``. For ``.tif`` inputs the in-place rewrite still works; for ``.vrt`` inputs the result is a sibling ``.tif`` and the original ``.vrt`` is removed (no self-reference possible). ``crop_to_central`` now returns the output ``Path`` so callers can substitute the possibly-extension-changed path back into their file lists. ``displacement.run`` does that via list-comprehension reassignment over each block's stitching-bound output lists. The cost is materializing per-block IFGs (extra disk + the multiply-conjugate they encoded virtually). For typical NISAR runs with ~20 ifgs per ministack and 8 blocks this is ~tens of GB — not ideal, but small relative to the GSLC stack and acceptable as the correctness-first fix. A leaner alternative would be a VRT-XML-level crop (adjust ``SrcRect``/``DstRect`` in place) which preserves the virtual nature; deferred. Tests: new ``test_crop_to_central_materializes_vrt`` builds a VRT pointing at a GeoTIFF (mirrors how the IFG VRTs are structured) and asserts the cropped output is a non-recursive GTiff at the central extent. Updated ``test_crop_to_central_replaces_in_place`` to also check the returned path.
Azimuth block fixes. Thank you Sara
for more information, see https://pre-commit.ci
…nputs The merge left two copies of the dataclass and its builder; the stale copy used dict[str, tuple[Bbox, int]] for block_bounds and triggered mypy no-redef + assignment/arg-type errors. Drop the duplicate and the now-unused Bbox import. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
mirzaees
approved these changes
May 27, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Adds a block-as-burst pattern so NISAR full-frame GSLCs can be processed in parallel using the existing burst pool.
For burst inputs nothing changes. For NISAR (non-burst) inputs, when
input_options.azimuth_blocks > 1the frame is split into N azimuth blocks at the dispatch layer, each block is processed as a synthetic burst throughProcessPoolExecutor, and per-block outputs are stitched by the existingstitching_bursts.runstep. No changes towrapped_phase.runor the stitcher are required.Change Summary
workflows/_block_split.py(new) —split_frame_into_blocks()divides the pixel grid into N azimuth blocks with a halo;_min_halo_rows()auto-computes the minimum safe overlap from half-window, similarity search radius, and stride settings.workflows/config/_common.py—InputOptionsgainsazimuth_blocks: int = 1andhalo_rows: Optional[int] = None.workflows/displacement.py— extracts dispatch into a testable_prepare_grouped_inputs()function; setsHDF5_USE_FILE_LOCKING=FALSEbefore spawning workers when multiple processes share the same HDF5 files (prevents hangs on NFS/Lustre).workflows/_utils.py—_create_burst_cfgaccepts an optionalboundsparameter to setoutput_options.boundsper block.Typical config for a 4-block run:
CLAUDE NOTES
Key insight: n_parallel_bursts is dual-purpose. It's the outer pool size AND a budget that gets re-allocated to inner I/O workers when there are fewer "bursts" than pool slots.
n_parallel_bursts = 8, with N bursts available:
N = 1 (FULL) → 1 active process × 8 inner I/O threads
N = 4 → 4 active processes × 2 inner I/O threads each
N = 8 (BLOCKED) → 8 active processes × 1 inner I/O thread each
FULL vs BLOCKED
azimuth_blocks=1)azimuth_blocks=N)vmap(within 1 block)Each worker still uses ~1 core; you just have N of them running simultaneously on different rows.
Per-Knob Impact
n_parallel_burstsazimuth_blocksfor maximum block parallelismthreads_per_workerSets:
OMP_NUM_THREADSMKL_NUM_THREADSControls BLAS threads inside JAX kernels:
eighsolvematmulSeparate from JAX/XLA’s own thread pool.
Typical values:
JAX / XLA
JAX’s XLA backend maintains its own thread pool, sized by default to:
Local benchmark full frame vs blocked results
displacement.run end-to-end through PS+PL+stitch benchmark run with taskset -c 0-33
GSLCs : 7 (Track 042 Frame 070 Descending orbit)
half_window : 5
num_blocks : 8
n_parallel_bursts : 8 (both modes)
threads_per_worker : 4 (both modes)
block_shape: 2048x2048 (both modes)
active threads FULL : 4
active threads BLK : 32
FULL wall = 8387.3s (139.79 min)
BLOCKED wall = 2159.2s ( 35.99 min)
SPEEDUP (BLOCKED vs FULL) : 3.88×
Checklist