Add scientific data and geospatial publishing guides & case study#2243

Open
2color wants to merge 10 commits into ipfs:main from 2color:geospatial-guide

@2color (Member) commented Jan 23, 2026

@github-actions bot (Contributor) commented Jan 23, 2026

🚀 Build Preview on IPFS ready

@mishmosh (Collaborator)

This is great as a specific how-to. Is there another, complementary place we can write about all the ways geospatial users can benefit from IPFS?

From live meeting:

  • Consider title “Scientific Data” as category
    • Ecosystem Tooling
    • Guide to Publishing Scientific Data
  • IPFS is used by the geospatial community for better collaboration, data integrity, and open access.
    (make sure we can describe some of the architectures used)
    • Connecting kubo to your existing data repositories (stac catalog)
    • Private clusters (but open retrieval) or “Collaborative publishing”
    • Provenance

@2color marked this pull request as ready for review February 4, 2026 16:14

@mishmosh (Collaborator) left a comment

A few suggestions and comments inline, but I'm confident you can take it from here. Would also like to see @vmx review.

2color and others added 6 commits February 6, 2026 14:34
Co-authored-by: Volker Mische <volker.mische@gmail.com>
Co-authored-by: Mosh <1306020+mishmosh@users.noreply.github.com>
Co-authored-by: Mosh <1306020+mishmosh@users.noreply.github.com>

Users can then access your data by using the `ipns://` prefix.

TODO: add example.
Member Author

Note to self: add example

@2color 2color requested a review from vmx February 6, 2026 14:46
Comment on lines +103 to +105
--raw-leaves \
--chunker=size-1048576 \
--cid-version=1 \
Member

Once Kubo 0.40 ships, these could be removed and replaced by a one-time `ipfs config profile apply unixfs-v1-2025`, or by setting the `Import.*` values one by one.
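
For reference, a minimal sketch of how the three flags under discussion could be assembled into a full `ipfs add` invocation from Python. The three import flags are the ones quoted in this thread; the `--recursive` flag, the helper name `build_ipfs_add_command`, and the directory name `my-dataset.zarr` are assumptions added for illustration, not part of the guide.

```python
import shlex

# Import flags quoted in the review comment above.
IMPORT_FLAGS = ["--raw-leaves", "--chunker=size-1048576", "--cid-version=1"]


def build_ipfs_add_command(path: str) -> list[str]:
    """Return the argv for adding `path` to Kubo with explicit import settings.

    "--recursive" is an assumption: a Zarr store is a directory tree,
    so the whole directory has to be added.
    """
    return ["ipfs", "add", "--recursive", *IMPORT_FLAGS, path]


if __name__ == "__main__":
    # "my-dataset.zarr" is a hypothetical directory name.
    print(shlex.join(build_ipfs_add_command("my-dataset.zarr")))
```

The resulting list can be handed to `subprocess.run` on a machine where Kubo is installed. If the import profile mentioned above is applied instead, the explicit flags would presumably become redundant.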


In this guide, you'll find an overview of the problem space, available tools, and architectural patterns for publishing and working with scientific data using IPFS.

## A Landscape in Flux
Member Author

From Slack:

Condense “A Landscape in Flux” - Remove or significantly shorten so we get to solutions/new knowledge faster; we can highlight the main points but you can assume readers already work in geospatial and are familiar with the challenges

@2color (Member, Author) commented Feb 13, 2026

This PR has been updated and includes the addition of three new pages:

It still needs more work, but I think this structure works well. I'd like to also maybe add a more generic retrieval example in Python using ipfsspec for the retrieval guide.

@2color changed the title from "Add geospatial publishing guide" to "Add scientific data and geospatial publishing guides & case study" Feb 13, 2026

Copilot AI (Contributor) left a comment

Pull request overview

Adds new Scientific Data documentation content to the IPFS docs site, including a hands-on guide for publishing geospatial Zarr datasets and supporting context via a landscape overview and an ORCESTRA case study. Updates the VuePress sidebar to surface the new pages and case study.

Changes:

  • Add a new “Publish Geospatial Zarr Data with IPFS” how-to guide.
  • Add a new “Scientific Data and IPFS Landscape Guide” overview page.
  • Add a new ORCESTRA case study and update VuePress navigation (including sidebar re-organization and case study list).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 20 comments.

| File | Description |
| --- | --- |
| docs/how-to/scientific-data/publish-geospatial-zarr-data.md | New step-by-step publishing guide (Zarr + IPFS), including discovery/access patterns. |
| docs/how-to/scientific-data/landscape-guide.md | New overview of scientific data formats, architectural patterns, and ecosystem tooling. |
| docs/case-studies/orcestra.md | New case study describing ORCESTRA’s use of IPFS for scientific data distribution. |
| docs/.vuepress/config.js | Adds the new Scientific Data pages to the How-to sidebar and adds ORCESTRA to case studies; also reorganizes peer-related sidebar entries. |

Comment on lines +180 to +182
Users can then access your data by using the `ipns://` prefix.

TODO: add example.
Copilot AI Feb 13, 2026

This section still contains a “TODO: add example.” placeholder. Please either add a concrete DNSLink + ipns://… (and/or gateway) example or remove the placeholder before merging so the guide reads as complete.

Suggested change
Users can then access your data by using the `ipns://` prefix. For example, if your Zarr store lives under a `my-zarr-dataset` directory, you can use:
- IPNS URI: `ipns://data.example.org/my-zarr-dataset/`
- HTTP gateway: `https://dweb.link/ipns/data.example.org/my-zarr-dataset/`
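
A small sketch of how the two address forms in this suggestion relate to each other, assuming a DNSLink-enabled domain. The helper name `dnslink_urls` is hypothetical, `data.example.org` and `my-zarr-dataset` are the placeholder values from the suggestion, and the trailing slash assumes the path points at a directory.

```python
from urllib.parse import quote


def dnslink_urls(domain: str, path: str = "",
                 gateway: str = "https://dweb.link") -> dict[str, str]:
    """Build the native IPNS URI and a gateway URL for a DNSLink domain.

    Assumes `path` names a directory, so both results end with "/".
    """
    # Percent-encode each path segment and drop empty segments.
    clean = "/".join(quote(segment) for segment in path.split("/") if segment)
    suffix = f"/{clean}/" if clean else "/"
    return {
        "ipns_uri": f"ipns://{domain}{suffix}",
        "gateway_url": f"{gateway.rstrip('/')}/ipns/{domain}{suffix}",
    }


if __name__ == "__main__":
    for name, url in dnslink_urls("data.example.org", "my-zarr-dataset").items():
        print(f"{name}: {url}")
```

Both forms resolve the same DNSLink record; the gateway form is only a fallback for clients without native IPFS support.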

Comment on lines +73 to +77
Ultimately the choice between these approaches for content-addressed data management comes to down to the following questions:

- How important is duplication? This is probably a function of the volume of your data and market costs of storage.
- How important is having copy in content-addressed format around? If no public publishing is expected, you can forego this and only hash .
- What libraries and which languages using to interact with the data? For example, xarray which makes use of fsspec, can read directly from a local IPFS gateway with [`ipfsspec`](https://github.com/fsspec/ipfsspec)
Copilot AI Feb 13, 2026

This paragraph has a few incomplete/unclear sentences (e.g., “forego this and only hash .” and “What libraries and which languages using…”). Please rewrite for clarity so readers can follow the trade-offs being described.

Suggested change
Ultimately the choice between these approaches for content-addressed data management comes down to the following questions:
- How important is duplication? This is probably a function of the volume of your data and market costs of storage.
- How important is it to maintain a copy of the data in a content-addressed format? If no public publishing is expected and you only need integrity checks, you may choose not to store a full content-addressed replica and instead compute hashes on demand.
- What libraries and which programming languages will you use to interact with the data? For example, Python’s xarray library, via fsspec, can read directly from a local IPFS gateway using [`ipfsspec`](https://github.com/fsspec/ipfsspec).


## Next Steps

- [Publishing Zarr Datasets with IPFS](./publishing-zarr-datasets.md) - A hands-on guide to publishing your first dataset
Copilot AI Feb 13, 2026

The “Next Steps” section links to ./publishing-zarr-datasets.md, but that file doesn’t exist in this directory. Please update the link to an existing guide (possibly publish-geospatial-zarr-data.md) or add the missing page.

Suggested change
- [Publishing Zarr Datasets with IPFS](./publish-geospatial-zarr-data.md) - A hands-on guide to publishing your first dataset


ORCESTRA uses IPFS to make scientific data openly accessible, verifiable, and resilient.

The raw data is processed by the at the Max Planck Institute for Meteorology, who process the data for publishing, where the end result is a set of CIDs corresponding to data from the different sub campaigns. allowing anyone who retrieves the data can independently verify they received exactly what was published, with no trust required in the specific server it was fetched from.
Copilot AI Feb 13, 2026

This paragraph is grammatically broken and hard to follow (“processed by the at the…”, run-on sentence, and a fragment starting with “allowing…”). Please rewrite for clarity so the data publication flow and trust model are unambiguous.

Suggested change
Raw data from the different sub-campaigns is processed at the Max Planck Institute for Meteorology into publishable datasets. These datasets are added to IPFS, producing content identifiers (CIDs) that correspond to the published data from each sub-campaign. Because each CID is derived from the content itself, anyone who retrieves the data can independently verify that they received exactly what was published, without needing to trust any specific server that served it.
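
To illustrate the verification claim in this suggestion, here is a minimal sketch of deriving a CID from content and checking retrieved bytes against it. It only covers the simplest case and rests on stated assumptions: the data fits in a single block, uses the `raw` codec with SHA2-256, and is encoded as a base32 CIDv1. Larger files are chunked into a DAG, so their root CID cannot be reproduced by hashing the whole file this way. The function names are hypothetical.

```python
import base64
import hashlib


def raw_block_cid(data: bytes) -> str:
    """Compute the CIDv1 (raw codec, sha2-256, base32) of a single block."""
    digest = hashlib.sha256(data).digest()
    # CIDv1 bytes: version 0x01, codec raw 0x55,
    # then the multihash: sha2-256 code 0x12, digest length 0x20, digest.
    cid_bytes = bytes([0x01, 0x55, 0x12, 0x20]) + digest
    # Multibase prefix "b" marks lowercase base32 without padding.
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + encoded


def verify_block(data: bytes, expected_cid: str) -> bool:
    """Return True if `data` hashes to `expected_cid`, whoever served it."""
    return raw_block_cid(data) == expected_cid


if __name__ == "__main__":
    published = b"example block contents"
    cid = raw_block_cid(published)
    print(cid, verify_block(published, cid), verify_block(b"tampered", cid))
```

In practice, IPFS implementations and verifying gateways perform this check block by block, which is what removes the need to trust the serving host.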

- **Consolidated metadata**: All metadata can be consolidated into a single file for datasets with many arrays
- **Designed for network access patterns**: Distributed storage tends to have high throughput and high latency

> Note: To learn more about Zarr, check out the following resources: [Introduction to the Zarr format by Copernicus Marine](https://help.marine.copernicus.eu/en/articles/10401542-introduction-to-the-zarr-format), [# What is Cloud-Optimized Scientific Data?](https://tom-nicholas.com/blog/2025/cloud-optimized-scientific-data/).
Copilot AI Feb 13, 2026

The link text includes an extra “# ” (“[# What is Cloud-Optimized Scientific Data?]”), which looks like an accidental header marker. Consider removing the “#” from the link label.

Suggested change
> Note: To learn more about Zarr, check out the following resources: [Introduction to the Zarr format by Copernicus Marine](https://help.marine.copernicus.eu/en/articles/10401542-introduction-to-the-zarr-format), [What is Cloud-Optimized Scientific Data?](https://tom-nicholas.com/blog/2025/cloud-optimized-scientific-data/).


#### Mutable File System (MFS)

MFS provides a familiar filesystem interface for organizing immutable content that in encoded with UnixFS (though not e. You can create directories, move files, and maintain a logical structure while the underlying data remains content-addressed.
Copilot AI Feb 13, 2026

This sentence is truncated/unfinished (“...content that in encoded with UnixFS (though not e.”). Please complete or remove it so the MFS section doesn’t read as broken.

Suggested change
MFS provides a familiar filesystem interface for organizing immutable content that is encoded with UnixFS. You can create directories, move files, and maintain a logical structure while the underlying data remains content-addressed.

Comment on lines +167 to +168
TODO: give an example with the `kubo ipfs files api` or maybe an asciicinema

Copilot AI Feb 13, 2026

This section contains a TODO placeholder. Please add the referenced example (or remove the TODO) before merging to avoid publishing incomplete guidance.


The [ORCESTRA data browser](http://browser.orcestra-campaign.org/) provides a web interface for discovering and retrieving datasets. Built on top of [Climate and Forecast (CF) conventions](https://cfconventions.org/) metadata embedded in the Zarr datasets, the browser lets researchers search by variable, time range, sub-campaign, and other dimensions, then retrieve data directly via IPFS.

The browser leverages both Helia, the TypeScript implementation of IPFS and
Copilot AI Feb 13, 2026

This sentence is incomplete (“The browser leverages both Helia, the TypeScript implementation of IPFS and …”). Please finish the thought (e.g., what else it leverages and how) or remove the fragment.

Suggested change
The browser leverages Helia, the TypeScript implementation of IPFS.

children: [
['/case-studies/arbol', 'Arbol'],
['/case-studies/audius', 'Audius'],
['/case-studies/orcestra', 'Orcestra'],
Copilot AI Feb 13, 2026

Sidebar label is “Orcestra”, but the project is consistently styled as “ORCESTRA” in the case study. Consider updating the display label to match the official capitalization for consistency.

Suggested change
['/case-studies/orcestra', 'ORCESTRA'],


By the end of this guide, you will publish a Zarr dataset to the IPFS network in a way that is retrievable directly within [Xarray](https://xarray.dev/).

If you are interested in a real-world example following the patterns in this guide, check out the [The ORCESTRA campaign](https://orcestra-campaign.org/intro.html).
Copilot AI Feb 13, 2026

In the “real-world example” sentence, the link text reads “the [The ORCESTRA campaign]”, which is grammatically incorrect (double “the”). Consider changing the anchor text to “the ORCESTRA campaign”.

Suggested change
If you are interested in a real-world example following the patterns in this guide, check out the [ORCESTRA campaign](https://orcestra-campaign.org/intro.html).

4 participants