Add option to change file loading engine#334

Merged
dschwoerer merged 5 commits into master from netcdf4-engine
Mar 6, 2026

Conversation

@mikekryjak
Collaborator

@mikekryjak mikekryjak commented Feb 11, 2026

This is a workaround for bugs with the h5netcdf binaries: #329

If you don't install h5netcdf/h5py from your distribution package manager or Spack, and instead install them e.g. from pip, then you will likely hit an HDF5 error upon loading any dataset.

There are three possible fixes. This PR implements fix 3:

  1. Install both from source against a single shared HDF5:
    sudo apt install libhdf5-dev libnetcdf-dev
    pip install --no-binary netCDF4,h5py netCDF4 h5py

  2. Install both from your distribution package manager,
    e.g. apt, conda or Spack

  3. Switch to the netcdf4 engine:
    import xbout
    xbout.load.file_engine = 'netcdf4'

There is also a helpful error message telling the user about these fixes if they hit the error when loading a results dataset. This is not done for grid datasets, to keep things simple.
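As a rough sketch of how such guidance could wrap the load call (the function and message here are hypothetical illustrations, not the exact xbout code):

```python
_HDF5_HINT = """If this is an HDF5 binary-compatibility error, possible fixes:
  1. pip install --no-binary netCDF4,h5py netCDF4 h5py  (build against one shared HDF5)
  2. install netCDF4/h5py from your distribution package manager (apt, conda, Spack)
  3. switch engines: import xbout; xbout.load.file_engine = 'netcdf4'"""


def open_with_hint(opener, path, engine="h5netcdf"):
    """Call an xarray-style opener, re-raising failures with fix guidance."""
    try:
        return opener(path, engine=engine)
    except OSError as err:
        # Attach the human-readable hint to the original low-level error
        raise OSError(f"{err}\n{_HDF5_HINT}") from err
```

In xbout the equivalent logic would sit around the `xr.open_mfdataset`/`xr.open_dataset` calls inside `open_boutdataset`.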

@dschwoerer dschwoerer force-pushed the netcdf4-engine branch 2 times, most recently from d720287 to b0681fc Compare February 11, 2026 13:28
@mikekryjak
Collaborator Author

mikekryjak commented Feb 12, 2026

The tests are failing on a read-only assignment error, but I changed nothing related to that, and other PRs' tests pass! This feels like it's linked to #331, where Xarray suddenly decided it's not going to allow a particular type of coordinate assignment. This is extremely confusing because it looks like this type of assignment hasn't been allowed for quite some time... and because Xarray seemingly complains about it intermittently??

See the stack trace from the error as of writing this comment:

=================================== FAILURES ===================================
_____________ TestBoutDatasetMethods.test_integrate_midpoints_slab _____________

self = <xbout.tests.test_boutdataset.TestBoutDatasetMethods object at 0x7f816a628740>
bout_xyt_example_files = <function _bout_xyt_example_files at 0x7f816a5dcf40>

    def test_integrate_midpoints_slab(self, bout_xyt_example_files):
        # Create data
        dataset_list = bout_xyt_example_files(
            None, lengths=(4, 100, 110, 120), nxpe=1, nype=1, nt=1, syn_data_type=1
        )
        ds = open_boutdataset(dataset_list)
        t = np.linspace(0.0, 8.0, 4)[:, np.newaxis, np.newaxis, np.newaxis]
        x = np.linspace(0.05, 9.95, 100)[np.newaxis, :, np.newaxis, np.newaxis]
        y = np.linspace(0.1, 21.9, 110)[np.newaxis, np.newaxis, :, np.newaxis]
        z = np.linspace(0.15, 35.85, 120)[np.newaxis, np.newaxis, np.newaxis, :]
>       ds["t"].data[...] = t.squeeze()
        ^^^^^^^^^^^^^^^^^
E       ValueError: assignment destination is read-only
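The failure is reproducible with plain NumPy: when xarray hands back a read-only view, in-place assignment raises exactly this error, and working on a copy sidesteps it (this is just an illustration of the error mechanics, not a claim about why xarray changed behaviour):

```python
import numpy as np

a = np.linspace(0.0, 8.0, 4)
a.setflags(write=False)        # simulate the read-only array xarray returns

try:
    a[...] = np.zeros(4)       # same in-place assignment as the failing test
except ValueError as err:
    print(err)                 # assignment destination is read-only

b = a.copy()                   # a writable copy accepts the assignment
b[...] = np.zeros(4)
```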

@mikekryjak
Collaborator Author

mikekryjak commented Feb 20, 2026

From Peter: better to have the engine choice as a string. The defaults should prioritise users not running into issues; if that breaks CI, the CI tests should be modified to override the engine.

To simplify things, we could make the engine flag a global variable.

@ZedThree
Member

An alternative might be to change the dependency on h5netcdf and h5py to just h5netcdf[pyfive].

@mikekryjak
Collaborator Author

@ZedThree pyfive doesn't work. We are missing 8 bytes!!

import xbout
import os
os.environ["H5NETCDF_READ_BACKEND"] = "pyfive"

case = "neutlim-base-init_only"
ds = xbout.load.open_boutdataset(
    datapath=rf"/home/mike/work/cases/devtests/{case}/BOUT.dmp.0.nc",
    inputfilepath=rf"/home/mike/work/cases/devtests/{case}/BOUT.inp",
    keep_xboundaries=True,
    keep_yboundaries=True,
    info=True,
)

Gives:

(xbout-h5netcdf-pyfive) mike@P0728-Ubuntu:~/work/notebooks/xbout-dev$ python test-load.py
Traceback (most recent call last):
  File "/home/mike/pyenvs/xbout-h5netcdf-pyfive/lib/python3.12/site-packages/xarray/backends/file_manager.py", line 219, in _acquire_with_cache_info
    file = self._cache[self._key]
  File "/home/mike/pyenvs/xbout-h5netcdf-pyfive/lib/python3.12/site-packages/xarray/backends/lru_cache.py", line 56, in __getitem__
    value = self._cache[key]
KeyError: [, ('/home/mike/work/cases/devtests/neutlim-base-init_only/BOUT.dmp.0.nc',), 'r', (('decode_vlen_strings', True), ('driver', None), ('format', 'NETCDF4'), ('invalid_netcdf', None), ('phony_dims', 'access')), 'ee29f736-e928-42f2-a109-3b24c80d907d']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mike/work/notebooks/xbout-dev/test-load.py", line 9, in <module>
    ds = xbout.load.open_boutdataset(
  File "/home/mike/work/xbout/xbout/load.py", line 182, in open_boutdataset
    input_type = check_dataset_type(datapath)
  File "/home/mike/work/xbout/xbout/load.py", line 587, in check_dataset_type
    ds = xr.open_dataset(filepaths[0], engine=filetype)
  File "/home/mike/pyenvs/xbout-h5netcdf-pyfive/lib/python3.12/site-packages/xarray/backends/api.py", line 607, in open_dataset
    backend_ds = backend.open_dataset(
  File "/home/mike/pyenvs/xbout-h5netcdf-pyfive/lib/python3.12/site-packages/xarray/backends/h5netcdf_.py", line 540, in open_dataset
    store = H5NetCDFStore.open(
  File "/home/mike/pyenvs/xbout-h5netcdf-pyfive/lib/python3.12/site-packages/xarray/backends/h5netcdf_.py", line 242, in open
    return cls(
  File "/home/mike/pyenvs/xbout-h5netcdf-pyfive/lib/python3.12/site-packages/xarray/backends/h5netcdf_.py", line 152, in __init__
    self.filename = find_root_and_group(self.ds)[0].filename
  File "/home/mike/pyenvs/xbout-h5netcdf-pyfive/lib/python3.12/site-packages/xarray/backends/h5netcdf_.py", line 260, in ds
    return self._acquire()
  File "/home/mike/pyenvs/xbout-h5netcdf-pyfive/lib/python3.12/site-packages/xarray/backends/h5netcdf_.py", line 252, in _acquire
    with self._manager.acquire_context(needs_lock) as root:
  File "/home/mike/.pyenv/versions/3.12.5/lib/python3.12/contextlib.py", line 137, in __enter__
    return next(self.gen)
  File "/home/mike/pyenvs/xbout-h5netcdf-pyfive/lib/python3.12/site-packages/xarray/backends/file_manager.py", line 207, in acquire_context
    file, cached = self._acquire_with_cache_info(needs_lock)
  File "/home/mike/pyenvs/xbout-h5netcdf-pyfive/lib/python3.12/site-packages/xarray/backends/file_manager.py", line 225, in _acquire_with_cache_info
    file = self._opener(*self._args, **kwargs)
  File "/home/mike/pyenvs/xbout-h5netcdf-pyfive/lib/python3.12/site-packages/h5netcdf/core.py", line 1962, in __init__
    super().__init__(self, self._h5path)
  File "/home/mike/pyenvs/xbout-h5netcdf-pyfive/lib/python3.12/site-packages/h5netcdf/core.py", line 1137, in __init__
    v = self._h5group[k]
  File "/home/mike/pyenvs/xbout-h5netcdf-pyfive/lib/python3.12/site-packages/pyfive/high_level.py", line 70, in __getitem__
    return self.__getitem_lazy_control(y, noindex=False)
  File "/home/mike/pyenvs/xbout-h5netcdf-pyfive/lib/python3.12/site-packages/pyfive/high_level.py", line 138, in __getitem_lazy_control
    return Dataset(obj_name, DatasetID(dataobjs, noindex=noindex), self)
  File "/home/mike/pyenvs/xbout-h5netcdf-pyfive/lib/python3.12/site-packages/pyfive/h5d.py", line 118, in __init__
    self._meta = DatasetMeta(dataobject)
  File "/home/mike/pyenvs/xbout-h5netcdf-pyfive/lib/python3.12/site-packages/pyfive/h5d.py", line 760, in __init__
    self.attributes = dataobject.get_attributes()
  File "/home/mike/pyenvs/xbout-h5netcdf-pyfive/lib/python3.12/site-packages/pyfive/dataobjects.py", line 194, in get_attributes
    name, value = self.unpack_attribute(offset)
  File "/home/mike/pyenvs/xbout-h5netcdf-pyfive/lib/python3.12/site-packages/pyfive/dataobjects.py", line 270, in unpack_attribute
    return self._parse_attribute_msg(self.msg_data, offset)
  File "/home/mike/pyenvs/xbout-h5netcdf-pyfive/lib/python3.12/site-packages/pyfive/dataobjects.py", line 321, in _parse_attribute_msg
    value = self._attr_value(ptype, buffer, items, offset)
  File "/home/mike/pyenvs/xbout-h5netcdf-pyfive/lib/python3.12/site-packages/pyfive/dataobjects.py", line 358, in _attr_value
    vlen, vlen_data = self._vlen_size_and_data(buf, offset)
  File "/home/mike/pyenvs/xbout-h5netcdf-pyfive/lib/python3.12/site-packages/pyfive/dataobjects.py", line 393, in _vlen_size_and_data
    vlen_data = gheap.objects[gheap_id["object_index"]]
  File "/home/mike/pyenvs/xbout-h5netcdf-pyfive/lib/python3.12/site-packages/pyfive/misc_low_level.py", line 161, in objects
    info = _unpack_struct_from(GLOBAL_HEAP_OBJECT, self.heap_data, offset)
  File "/home/mike/pyenvs/xbout-h5netcdf-pyfive/lib/python3.12/site-packages/pyfive/core.py", line 55, in _unpack_struct_from
    values = struct.unpack_from(fmt, buf, offset=offset)
struct.error: unpack_from requires a buffer of at least 4088 bytes for unpacking 16 bytes at offset 4072 (actual buffer size is 4080)

According to my LLM of choice, this is a bug in pyfive, because it doesn't fully implement HDF5 global heap reading for all edge cases, which makes it compute the wrong buffer size in this case.
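The struct.error itself is easy to reproduce in isolation: unpack_from needs offset plus item size bytes available, so a heap buffer 8 bytes short of the expected 4088 trips it. The 16-byte format below only mirrors the sizes in the trace, not pyfive's actual global-heap layout:

```python
import struct

GLOBAL_HEAP_OBJECT = "<HHIQ"   # 2+2+4+8 = 16 bytes, matching the trace's item size
buf = bytes(4080)              # buffer 8 bytes too short for an item at offset 4072

try:
    struct.unpack_from(GLOBAL_HEAP_OBJECT, buf, offset=4072)
except struct.error as err:
    print(err)  # requires a buffer of at least 4088 bytes ... actual buffer size is 4080
```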

It even found the bug and a fix. This actually fixes the problem, believe it or not:
[screenshot: the suggested one-line patch to pyfive]

However, there are further issues because Xarray's h5netcdf backend hardcodes h5py in several places... so I'm going to give up on pyfive and continue making the engine user-settable.

@mikekryjak mikekryjak force-pushed the netcdf4-engine branch 2 times, most recently from 5444cc8 to 1886ed8 Compare March 4, 2026 12:59
@mikekryjak
Collaborator Author

I changed it so that you can override the engine with a string instead of forcing netCDF4. I also added a warning message with guidance on the fixes.

@mikekryjak mikekryjak requested a review from dschwoerer March 4, 2026 13:02
@mikekryjak mikekryjak changed the title Add option to force netCDF4 engine Add option to change file loading engine Mar 4, 2026
@mikekryjak mikekryjak mentioned this pull request Mar 6, 2026
mikekryjak and others added 2 commits March 6, 2026 09:33
Co-authored-by: David Bold <dschwoerer@users.noreply.github.com>
@dschwoerer dschwoerer merged commit 9b67b37 into master Mar 6, 2026
@dschwoerer dschwoerer deleted the netcdf4-engine branch March 6, 2026 09:37