Add scientific data and geospatial publishing guides & case study#2243
🚀 Build Preview on IPFS ready
This is great as a specific how-to. Is there another, complementary place we can write about all the ways geospatial users can benefit from IPFS? From live meeting:
Co-authored-by: Volker Mische <volker.mische@gmail.com>
Co-authored-by: Mosh <1306020+mishmosh@users.noreply.github.com>
| Users can then access your data by using the `ipns://` prefix. | ||
| TODO: add example. |
| --raw-leaves \ | ||
| --chunker=size-1048576 \ | ||
| --cid-version=1 \ |
Once Kubo 0.40 ships, these flags could be removed and replaced by a one-time `ipfs config profile apply unixfs-v1-2025`, or by setting the `Import.*` values one by one.
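For reference, here is a minimal Python sketch of the import step with the three flags from the diff spelled out explicitly. The helper names (`build_add_command`, `add_directory`) and the `./my-zarr-dataset` path are hypothetical, and `-r` and `-Q` are added here for illustration (recursive add, print only the root CID); this is not taken from the guide itself.

```python
import subprocess

# Flags quoted in the diff under review; everything else here is illustrative.
IMPORT_FLAGS = ["--raw-leaves", "--chunker=size-1048576", "--cid-version=1"]


def build_add_command(path: str) -> list[str]:
    """Return the argv for recursively adding `path` with explicit import flags."""
    # -r adds the directory recursively; -Q prints only the final root CID.
    return ["ipfs", "add", "-r", "-Q", *IMPORT_FLAGS, path]


def add_directory(path: str) -> str:
    """Run `ipfs add` (requires a local Kubo install) and return the root CID."""
    result = subprocess.run(
        build_add_command(path), check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


# Show the command that would be run for a hypothetical dataset directory.
print(" ".join(build_add_command("./my-zarr-dataset")))
```

If the profile ships as described in this comment, `IMPORT_FLAGS` could presumably be emptied once the profile has been applied to the node.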
| In this guide, you'll find an overview of the problem space, available tools, and architectural patterns for publishing and working with scientific data using IPFS. | ||
| ## A Landscape in Flux |
From Slack:
Condense “A Landscape in Flux”: remove it or significantly shorten it so we get to solutions and new knowledge faster. We can highlight the main points, but you can assume readers already work in geospatial and are familiar with the challenges.
This PR has been updated and includes the addition of three new pages:
It still needs more work, but I think this structure works well. I'd also like to add a more generic retrieval example in Python, using ipfsspec, to the retrieval guide.
Pull request overview
Adds new Scientific Data documentation content to the IPFS docs site, including a hands-on guide for publishing geospatial Zarr datasets and supporting context via a landscape overview and an ORCESTRA case study. Updates the VuePress sidebar to surface the new pages and case study.
Changes:
- Add a new “Publish Geospatial Zarr Data with IPFS” how-to guide.
- Add a new “Scientific Data and IPFS Landscape Guide” overview page.
- Add a new ORCESTRA case study and update VuePress navigation (including sidebar re-organization and case study list).
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 20 comments.
| File | Description |
|---|---|
| docs/how-to/scientific-data/publish-geospatial-zarr-data.md | New step-by-step publishing guide (Zarr + IPFS), including discovery/access patterns. |
| docs/how-to/scientific-data/landscape-guide.md | New overview of scientific data formats, architectural patterns, and ecosystem tooling. |
| docs/case-studies/orcestra.md | New case study describing ORCESTRA’s use of IPFS for scientific data distribution. |
| docs/.vuepress/config.js | Adds the new Scientific Data pages to the How-to sidebar and adds ORCESTRA to case studies; also reorganizes peer-related sidebar entries. |
| Users can then access your data by using the `ipns://` prefix. | ||
| TODO: add example. |
This section still contains a “TODO: add example.” placeholder. Please either add a concrete DNSLink + ipns://… (and/or gateway) example or remove the placeholder before merging so the guide reads as complete.
| Users can then access your data by using the `ipns://` prefix. | |
| TODO: add example. | |
| Users can then access your data by using the `ipns://` prefix. For example, if your Zarr store lives under a `my-zarr-dataset` directory, you can use: | |
| - IPNS URI: `ipns://data.example.org/my-zarr-dataset/` | |
| - HTTP gateway: `https://dweb.link/ipns/data.example.org/my-zarr-dataset/` |
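To illustrate how the two forms in this suggestion relate, here is a small Python sketch that maps an `ipns://` (or `ipfs://`) URI onto a path-style gateway URL. The function name `to_gateway_url` is hypothetical, and `dweb.link` is used only because it appears in the suggestion above; any gateway base URL could be passed instead.

```python
from urllib.parse import urlsplit


def to_gateway_url(uri: str, gateway: str = "https://dweb.link") -> str:
    """Map an ipfs:// or ipns:// URI to a path-style HTTP gateway URL."""
    parts = urlsplit(uri)
    if parts.scheme not in ("ipfs", "ipns"):
        raise ValueError(f"expected an ipfs:// or ipns:// URI, got {uri!r}")
    # The URI authority (a CID or a DNSLink name) becomes the first path segment.
    return f"{gateway.rstrip('/')}/{parts.scheme}/{parts.netloc}{parts.path}"


print(to_gateway_url("ipns://data.example.org/my-zarr-dataset/"))
```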
| Ultimately the choice between these approaches for content-addressed data management comes to down to the following questions: | ||
| - How important is duplication? This is probably a function of the volume of your data and market costs of storage. | ||
| - How important is having copy in content-addressed format around? If no public publishing is expected, you can forego this and only hash . | ||
| - What libraries and which languages using to interact with the data? For example, xarray which makes use of fsspec, can read directly from a local IPFS gateway with [`ipfsspec`](https://github.com/fsspec/ipfsspec) |
This paragraph has a few incomplete/unclear sentences (e.g., “forego this and only hash .” and “What libraries and which languages using…”). Please rewrite for clarity so readers can follow the trade-offs being described.
| Ultimately the choice between these approaches for content-addressed data management comes to down to the following questions: | |
| - How important is duplication? This is probably a function of the volume of your data and market costs of storage. | |
| - How important is having copy in content-addressed format around? If no public publishing is expected, you can forego this and only hash . | |
| - What libraries and which languages using to interact with the data? For example, xarray which makes use of fsspec, can read directly from a local IPFS gateway with [`ipfsspec`](https://github.com/fsspec/ipfsspec) | |
| Ultimately the choice between these approaches for content-addressed data management comes down to the following questions: | |
| - How important is duplication? This is probably a function of the volume of your data and market costs of storage. | |
| - How important is it to maintain a copy of the data in a content-addressed format? If no public publishing is expected and you only need integrity checks, you may choose not to store a full content-addressed replica and instead compute hashes on demand. | |
| - What libraries and which programming languages will you use to interact with the data? For example, Python’s xarray library, via fsspec, can read directly from a local IPFS gateway using [`ipfsspec`](https://github.com/fsspec/ipfsspec). |
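As a possible starting point for the xarray example mentioned in the last bullet, here is a hedged Python sketch. It assumes `xarray`, `zarr`, and `ipfsspec` are installed and that a gateway is reachable; the function name `open_zarr_from_ipfs` is hypothetical, and gateway selection is left to ipfsspec's defaults.

```python
def open_zarr_from_ipfs(url: str):
    """Open a Zarr store addressed by an ipfs:// or ipns:// URL as an xarray Dataset."""
    if not url.startswith(("ipfs://", "ipns://")):
        raise ValueError(f"expected an ipfs:// or ipns:// URL, got {url!r}")

    # Imported lazily so this module loads even without the optional dependencies.
    # Importing ipfsspec ensures its fsspec implementation is loaded.
    import ipfsspec  # noqa: F401
    import xarray as xr

    # xarray hands the URL to fsspec, which dispatches to ipfsspec by protocol.
    return xr.open_dataset(url, engine="zarr")
```

A call such as `open_zarr_from_ipfs("ipns://data.example.org/my-zarr-dataset/")` would then return a lazily loaded dataset, assuming that DNSLink name actually resolves to a Zarr store.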
| ## Next Steps | ||
| - [Publishing Zarr Datasets with IPFS](./publishing-zarr-datasets.md) - A hands-on guide to publishing your first dataset |
The “Next Steps” section links to ./publishing-zarr-datasets.md, but that file doesn’t exist in this directory. Please update the link to an existing guide (possibly publish-geospatial-zarr-data.md) or add the missing page.
| - [Publishing Zarr Datasets with IPFS](./publishing-zarr-datasets.md) - A hands-on guide to publishing your first dataset | |
| - [Publishing Zarr Datasets with IPFS](./publish-geospatial-zarr-data.md) - A hands-on guide to publishing your first dataset |
| ORCESTRA uses IPFS to make scientific data openly accessible, verifiable, and resilient. | ||
| The raw data is processed by the at the Max Planck Institute for Meteorology, who process the data for publishing, where the end result is a set of CIDs corresponding to data from the different sub campaigns. allowing anyone who retrieves the data can independently verify they received exactly what was published, with no trust required in the specific server it was fetched from. |
This paragraph is grammatically broken and hard to follow (“processed by the at the…”, run-on sentence, and a fragment starting with “allowing…”). Please rewrite for clarity so the data publication flow and trust model are unambiguous.
| The raw data is processed by the at the Max Planck Institute for Meteorology, who process the data for publishing, where the end result is a set of CIDs corresponding to data from the different sub campaigns. allowing anyone who retrieves the data can independently verify they received exactly what was published, with no trust required in the specific server it was fetched from. | |
| Raw data from the different sub-campaigns is processed at the Max Planck Institute for Meteorology into publishable datasets. These datasets are added to IPFS, producing content identifiers (CIDs) that correspond to the published data from each sub-campaign. Because each CID is derived from the content itself, anyone who retrieves the data can independently verify that they received exactly what was published, without needing to trust any specific server that served it. |
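To make the "CID is derived from the content itself" point concrete, here is a simplified Python sketch that computes a CIDv1 for a single raw block and checks retrieved bytes against it. This is an illustration only: it assumes the content fits in one raw block hashed with SHA-256, whereas real datasets are chunked into a DAG whose root CID is computed and verified by an IPFS implementation. The function names are hypothetical.

```python
import base64
import hashlib


def raw_block_cid(data: bytes) -> str:
    """Compute the CIDv1 (raw codec, sha2-256, base32) of a single block."""
    digest = hashlib.sha256(data).digest()
    # CIDv1 layout: version 0x01, codec 0x55 (raw),
    # then the multihash: 0x12 (sha2-256), 0x20 (32-byte length), digest.
    cid_bytes = bytes([0x01, 0x55, 0x12, 0x20]) + digest
    # Multibase base32: lowercase, unpadded, prefixed with "b".
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + encoded


def verify_block(data: bytes, expected_cid: str) -> bool:
    """Return True if `data` hashes to `expected_cid`, whoever served it."""
    return raw_block_cid(data) == expected_cid


published = raw_block_cid(b"hello world")
print(published, verify_block(b"hello world", published))
```

Any change to the bytes yields a different CID, which is why the server that delivered them does not need to be trusted.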
| - **Consolidated metadata**: All metadata can be consolidated into a single file for datasets with many arrays | ||
| - **Designed for network access patterns**: Distributed storage tends to have high throughput and high latency | ||
| > Note: To learn more about Zarr, check out the following resources: [Introduction to the Zarr format by Copernicus Marine](https://help.marine.copernicus.eu/en/articles/10401542-introduction-to-the-zarr-format), [# What is Cloud-Optimized Scientific Data?](https://tom-nicholas.com/blog/2025/cloud-optimized-scientific-data/). |
The link text includes an extra “# ” (“[# What is Cloud-Optimized Scientific Data?]”), which looks like an accidental header marker. Consider removing the “#” from the link label.
| > Note: To learn more about Zarr, check out the following resources: [Introduction to the Zarr format by Copernicus Marine](https://help.marine.copernicus.eu/en/articles/10401542-introduction-to-the-zarr-format), [# What is Cloud-Optimized Scientific Data?](https://tom-nicholas.com/blog/2025/cloud-optimized-scientific-data/). | |
| > Note: To learn more about Zarr, check out the following resources: [Introduction to the Zarr format by Copernicus Marine](https://help.marine.copernicus.eu/en/articles/10401542-introduction-to-the-zarr-format), [What is Cloud-Optimized Scientific Data?](https://tom-nicholas.com/blog/2025/cloud-optimized-scientific-data/). |
| #### Mutable File System (MFS) | ||
| MFS provides a familiar filesystem interface for organizing immutable content that in encoded with UnixFS (though not e. You can create directories, move files, and maintain a logical structure while the underlying data remains content-addressed. |
This sentence is truncated/unfinished (“...content that in encoded with UnixFS (though not e.”). Please complete or remove it so the MFS section doesn’t read as broken.
| MFS provides a familiar filesystem interface for organizing immutable content that in encoded with UnixFS (though not e. You can create directories, move files, and maintain a logical structure while the underlying data remains content-addressed. | |
| MFS provides a familiar filesystem interface for organizing immutable content that is encoded with UnixFS. You can create directories, move files, and maintain a logical structure while the underlying data remains content-addressed. |
| TODO: give an example with the `kubo ipfs files api` or maybe an asciicinema | ||
This section contains a TODO placeholder. Please add the referenced example (or remove the TODO) before merging to avoid publishing incomplete guidance.
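One possible shape for that example, sketched in Python around Kubo's `ipfs files` commands. The helper names, the MFS directory, and the entry name are hypothetical placeholders; the sketch assumes a local Kubo node and a CID that has already been added.

```python
import subprocess


def mfs_commands(cid: str, mfs_dir: str, name: str) -> list[list[str]]:
    """Build the `ipfs files` commands that place `cid` under `mfs_dir/name`."""
    mfs_dir = mfs_dir.rstrip("/")
    return [
        # Create the target directory (and any parents) in MFS.
        ["ipfs", "files", "mkdir", "-p", mfs_dir],
        # Reference the existing immutable content by CID under a readable name.
        ["ipfs", "files", "cp", f"/ipfs/{cid}", f"{mfs_dir}/{name}"],
        # Print the CID of the directory, which changes as entries are added.
        ["ipfs", "files", "stat", "--hash", mfs_dir],
    ]


def organize_in_mfs(cid: str, mfs_dir: str, name: str) -> str:
    """Run the commands (requires a local Kubo node); return the directory CID."""
    output = ""
    for cmd in mfs_commands(cid, mfs_dir, name):
        done = subprocess.run(cmd, check=True, capture_output=True, text=True)
        output = done.stdout.strip()
    return output  # stdout of the final `stat --hash` call
```

The copy only adds a reference in the MFS directory tree, so the underlying blocks stay content-addressed and unchanged.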
| The [ORCESTRA data browser](http://browser.orcestra-campaign.org/) provides a web interface for discovering and retrieving datasets. Built on top of [Climate and Forecast (CF) conventions](https://cfconventions.org/) metadata embedded in the Zarr datasets, the browser lets researchers search by variable, time range, sub-campaign, and other dimensions, then retrieve data directly via IPFS. | ||
| The browser leverages both Helia, the TypeScript implementation of IPFS and |
This sentence is incomplete (“The browser leverages both Helia, the TypeScript implementation of IPFS and …”). Please finish the thought (e.g., what else it leverages and how) or remove the fragment.
| The browser leverages both Helia, the TypeScript implementation of IPFS and | |
| The browser leverages Helia, the TypeScript implementation of IPFS. |
| children: [ | ||
| ['/case-studies/arbol', 'Arbol'], | ||
| ['/case-studies/audius', 'Audius'], | ||
| ['/case-studies/orcestra', 'Orcestra'], |
Sidebar label is “Orcestra”, but the project is consistently styled as “ORCESTRA” in the case study. Consider updating the display label to match the official capitalization for consistency.
| ['/case-studies/orcestra', 'Orcestra'], | |
| ['/case-studies/orcestra', 'ORCESTRA'], |
| By the end of this guide, you will publish a Zarr dataset to the IPFS network in a way that is retrievable directly within [Xarray](https://xarray.dev/). | ||
| If you are interested in a real-world example following the patterns in this guide, check out the [The ORCESTRA campaign](https://orcestra-campaign.org/intro.html). |
In the “real-world example” sentence, the link text reads “the [The ORCESTRA campaign]”, which is grammatically incorrect (double “the”). Consider changing the anchor text to “the ORCESTRA campaign”.
| If you are interested in a real-world example following the patterns in this guide, check out the [The ORCESTRA campaign](https://orcestra-campaign.org/intro.html). | |
| If you are interested in a real-world example following the patterns in this guide, check out the [ORCESTRA campaign](https://orcestra-campaign.org/intro.html). |
What
Add a new geospatial publishing guide
Preview: https://bafybeiaj6f6pnzpoq4ktj5ubgw4de7fk7fzfixvcj7fipignkdwmo2ubk4.ipfs.dweb.link/how-to/publish-geospatial-data/