Kerchunk! Bless you #8

pbranson · 2022-08-05T22:32:30Z

pbranson
Aug 5, 2022

Title

Kerchunk! Bless you - indexing your way to enhanced analysis of someone else's big data

Summary

So you found an open data bucket with a heap of data you want to analyse. Maybe it is dense grids of numerical model data at global scale in NetCDF format or a stack of daily satellite earth observation TIFF files. If only you could just open them all as if they were one dataset.... Well maybe you can! The Kerchunk library builds on top of the fsspec library and provides a interface via Zarr to create an XArray dataset overlay that can span many files stored in object storage.

Personnel

Paul Branson @pbranson

+++

Specific tasks

Build an understanding of how kerchunk works.
Identify some priority open data datasets to kerchunk
Test out a few file storage options for kerchunk indexes
Discuss options for sharing a kerchunk index
Demonstrate how a kerchunk index can provide the flour for the Pangeo-Forge Bakery. See https://www.frontiersin.org/articles/10.3389/fclim.2021.782909/full

Data sets and infrastructure support

A number of awesome NOAA datasets are freely available in us-east-1 AWS region:
https://registry.opendata.aws/noaa-goes/
https://registry.opendata.aws/noaa-gefs-reforecast/

Libraries (pip installable):
kerchunk
intake
fsspec
xarray
zarr
h5netcdf
rasterio
cfgrib
gribscan
git+https://github.com/TomAugspurger/cogrib

The problem

Analysing a large stack of dense array data stored in open data buckets with traditional libraries can be hard, slow and sometimes not even possible. Whilst in an ideal world, datasets would be published in an Analysis Ready format, this is frequently not the case and not universally possible due to the variety of access patterns that may preclude a notionally ideal chunking layout.

Given that these datasets are large, owned by someone else and many researchers have limited organisational capacity to rechunk and mirror such large datasets, solutions that enhance the ability to analyse published datasets 'as-is' are valuable. Recent examples here and here of such an approach are spurring a flurry of activity of people 'kerchunk'-ing available datasets as evidenced by the burgeoning kerchunk issues list!

But once a dataset has been kerchunk-ed once, others could use that index - they typically compress considerably into small (<100 MB) files that could be shared for others to reuse or to feed into a Pangeo-Forge recipe.

This project will dive into using kerchunk, making some indexes and brainstorming about platforms for sharing them, perhaps on the InterPlanetary File System (cc @d70-t)

Application example

Excellent example that analyse geostationary SST have featured at OHW2020:
https://nbviewer.org/github/oceanhackweek/ohw-tutorials/blob/OHW20/10-satellite-data-access/goes-cmp-netcdf-zarr.ipynb
And more recently using kerchunk:
https://medium.com/pangeo/fake-it-until-you-make-it-reading-goes-netcdf4-data-on-aws-s3-as-zarr-for-rapid-data-access-61e33f8fe685
However the Zarr dataset and kerchunk indexes (which take some effort to build) are not readily available.

Australian dataset examples:
https://github.com/IOMRC/intake-aodn

Existing methods

How would you or others traditionally try to address this problem?
Whilst not exactly 'traditional' more cutting-edge, the Pangeo-Forge allows for the generation of recipes to re-publish datasets in analysis ready Zarr stores.

Proposed methods/tools

Building from what you learn at Oceanhackweek, what new approaches would you like to try to implement?

Contribute examples to kerchunk documentation of index creation and dataset analysis with a published index.

GRIB related issues:
fsspec/kerchunk#150
fsspec/kerchunk#127

pbranson · 2022-08-05T22:33:51Z

pbranson
Aug 5, 2022
Author

Hopefully this is of interest to other participants. Appreciate any feedback to help shape this into something that is useful for the community!

1 reply

NickMortimer Aug 10, 2022

Happy to look at this ;)

abkfenris · 2022-08-10T15:53:07Z

abkfenris
Aug 10, 2022
Maintainer

As someone who has done their own dabbling in kerchunk indexing, I'd suggest starting to explore from the Pangeo-Forge side of things.

While Kerchunk's capabilities are expanding, it's not always the right choice for every dataset (not that Pangeo-Forge is either). Pangeo-Forge also provides the closest thing there currently is to common place for Kerchunk indexes.

I think it's also an easier way to get to a good mental model of what kerchunk does, rather than diving directly into it.

0 replies

rsignell-usgs · 2022-08-16T16:10:18Z

rsignell-usgs
Aug 16, 2022

Kerchunk allows any collection of scientific format files to be as performant as then can be on the Cloud, and also provides an easy way to make non-CF compliant datasets compliant. But it doesn't reformat or rechunk data, so if your files have tiny chunks (e.g. 100k chunks or something), or the chunk shapes are such that you need to read 100,000 grib files to extract a time series at a point, it will be slow and you need to rechunk to improve performance.

@abkfenris - Is that what you meant when you said "it's not always the right choice for every dataset" ?

2 replies

abkfenris Aug 17, 2022
Maintainer

That's part of it, but not everything is kerchunk-able yet (maybe there is only an OpenDAP endpoint), or it's already in Zarr.

At least to me, Kerchunk is more of a data provider tool, where as Pangeo-Forge uses Kerchunk and other techniques from a data user approach.

martindurant Aug 17, 2022

I mildly disagree. kerchunk is new and rapidly changing, so we need people to plug directly into its capabilities and code, as well as to make examples to add momentum and enthusiasm to the project. pangeo-forge tries to present a uniform interface over kerchunk, but it does this by hiding abilities and adding an extra layer of indirection. I would say that both of these are for data providers, not users. Users use intake, stac, xarray interfaces only.

pbranson · 2022-08-17T07:03:43Z

pbranson
Aug 17, 2022
Author

Ok so we have a repository https://github.com/oceanhackweek/ohw22-proj-kerchunk
And a start on some example notebooks.

After some digging around on AWS turns out there is a L3 NOAA Gridded SST dataset from the Himawari geostationary satellite available that is an excellent candidate for kerchunk.

@martindurant I noticed your comment specifically in relation to subsetting and previous discussions about using parquet as a container for references. The Himawari dataset is hourly, near hemispheric data at 2km resolution, so many many chunks. At this point I'm going to reuse the RefZarrStackSource intake driver from intake-aodn but I wonder what you had in mind for subsetting. Maybe adding additional columns to the parquet that could be filtered before instantiating the ReferenceFilesystem?

1 reply

pbranson Aug 17, 2022
Author

This dataset also seems am excellent (simple) candidate for pangeo-forge too

martindurant · 2022-08-17T14:41:43Z

martindurant
Aug 17, 2022

I'm not certain which point you meant by "subsetting", there are a few tings I want to get done:

the ability to reference parts of an (uncompressed) chunk as separate chunks. This is like cheap sharding and improves random access at the cost of larger reference files and somewhat slowwer throughput
subsampling images to make a pyramid (involves storing the smaller images somewhere else)
only loading some of the references.

Note that preffs already has a parquet implementation for referencefilesystem, but without lazy loading or filtering. I intend to build off this, but I can't promise when. It might work well for the >100k references that would be required here, but I have found that Zstd compression is pretty good for file size and load speed with JSON too. How you access the data chunks is another matter and workflow dependent.

0 replies

NickMortimer · 2022-08-18T01:27:42Z

NickMortimer
Aug 18, 2022

@martindurant I'm working on getting templated access to lots of small netcdf files using kerchunk. Making the references is quick and it allows me to do things like find all the files that were manually qc'd without fetching each file because I can check the reference file attrs!

Also I can see what variable names are in them etc.

So I have zipped them into a zipfile, but I'm struggling with a way to template the name of the reff file that I want out of the zip using an intake catalogue.

args:
  urlpath: "reference://"
  storage_options:
    target_options:
      requester_pays: true
    fo: 'simplecache::zip://reffs/IMOS_ANMN-QLD*_{{mooring}}_FV01*.json::{{CATALOG_DIR}}/aodn_refs.zip'
    remote_options:
      requester_pays: true
      anon: true
    remote_protocol: s3

4 replies

pbranson Aug 18, 2022
Author

@NickMortimer I think if you define mooring as a parameter in the catalog entry it should allow you to instantiate the entry with that parameter. However I am not sure if the storage_options can be templated. I checked what we did in intake-aodn and we used the urlpath with our custom driver and urlpath allows templating.

https://intake.readthedocs.io/en/latest/catalog.html#templating

martindurant Aug 18, 2022

Yes, it is an outstanding problem that Intake does not descend inside the structure of complex arguments like storage_options, to apply templates to components. It would open a rather complex world. Perhaps if we can define how it should work, and whether the user-parameters can be complex too, then we can implement it.

pbranson Aug 18, 2022
Author

I just tested out substituting a parameter into fo and it works!

Catalog test_param.yaml :

metadata:
  version: 1
sources:
  test_param:
    driver: intake_xarray.xzarr.ZarrSource
    description: 'Test parameters'    
    parameters:
      some_param:
        description: String parameter
        type: str 
    args:
      urlpath: 'reference://'
      storage_options:
        simple_templates: True
        target_options:
          compression: 'zstd'
        fo: '{{CATALOG_DIR}}{{some_param}}'
        remote_options:
          anon: true
        remote_protocol: s3

Running:

import intake
cat = intake.open_catalog('test_param.yaml')
cat.test_param(some_param="HelloCatalog").to_dask()

Returns:

FileNotFoundError: [Errno 2] No such file or directory: '/home/jovyan/ohw22-proj-kerchunk/catalog/HelloCatalog'

martindurant Aug 18, 2022

It occurs to me, that probably you want to combine these JSON files rather than refer to them jointly. Have you tried kerchunk.combine to make a single global dataset?

NickMortimer · 2022-10-11T07:20:14Z

NickMortimer
Oct 11, 2022

Hey Paul, I just pointed the reference maker at the moorings .nc files and it works well and quick but I guess the thing is handling the multiple files that do not concatenate? If you've got time for a quick chat I can explain Nick From: Paul Branson ***@***.***> Sent: Wednesday, August 17, 2022 3:06 PM To: oceanhackweek/discussions ***@***.***> Cc: Mortimer, Nick (O&A, IOMRC Crawley) ***@***.***>; Comment ***@***.***> Subject: Re: [oceanhackweek/discussions] Kerchunk! Bless you (Discussion #8) This dataset also seems am excellent (simple) candidate for pangeo-forge too - Reply to this email directly, view it on GitHub<#8 (reply in thread)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ABBDKH5UC24C5VTY5BPWDFLVZSFMJANCNFSM55XOREPQ>. You are receiving this because you commented.Message ID: ***@***.******@***.***>>

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OceanHackWeek (OHW)

Kerchunk! Bless you #8

{{title}}

Replies: 7 comments 8 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Kerchunk! Bless you #8

Title

Summary

Personnel

Specific tasks

Data sets and infrastructure support

The problem

Application example

Existing methods

Proposed methods/tools

Replies: 7 comments · 8 replies

pbranson Aug 5, 2022 Author

abkfenris Aug 10, 2022 Maintainer

abkfenris Aug 17, 2022 Maintainer

pbranson Aug 17, 2022 Author

pbranson Aug 17, 2022 Author

pbranson Aug 18, 2022 Author

pbranson Aug 18, 2022 Author

Replies: 7 comments 8 replies

pbranson
Aug 5, 2022
Author

abkfenris
Aug 10, 2022
Maintainer

abkfenris Aug 17, 2022
Maintainer

pbranson
Aug 17, 2022
Author

pbranson Aug 17, 2022
Author

pbranson Aug 18, 2022
Author

pbranson Aug 18, 2022
Author