
Building recipes from files located within a large tar.gz file #442

Open
jbusecke opened this issue Nov 16, 2022 · 6 comments

@jbusecke
Contributor

I want to highlight a use case that I have encountered multiple times over the past few weeks and that is only partially supported by pangeo-forge-recipes.

The core situation is always the following:

The files that are the source for a recipe are contained in a large compressed file (see pangeo-forge/staged-recipes#219 (comment) for an example). As a recipe builder, I want to be able to work with the files contained in there and, e.g., merge or concatenate them.

As I learned, if the container is a .zip/.tar file you can already index into it, but .gz does not offer that possibility.

I wonder whether the functionality of pangeo-forge-recipes could be expanded to allow some syntax that triggers caching/unpacking of .gz files while still maintaining something akin to a URL per file.

Suppose, e.g., you have a file container.tar.gz which contains file1.nc and file2.nc and can be downloaded from http://zenodo.org/<project>/container.tar.gz.

Would it be at all possible to have some special command like GUNZIP that one could insert into a URL, like this:

def build_urls(filenumber):
    return f"http://zenodo.org/<project>/container.tar.gz/GUNZIP/file{filenumber}.nc"

If pangeo-forge-recipes could recognize this 'command' (there is probably a better word for this), then the recipe could just require the data to be cached locally, unpack it, and do its usual thing?
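To make the idea concrete, here is a minimal sketch of what recognizing such a marker could look like; resolve_gunzip_url, the /GUNZIP/ URL layout, and the cache_dir default are all hypothetical, invented for illustration rather than existing pangeo-forge-recipes behavior:

import os
import tarfile

import fsspec


def resolve_gunzip_url(url, cache_dir="/tmp/pgf-cache"):
    # Split "http://.../container.tar.gz/GUNZIP/file1.nc" into the
    # archive URL and the member path inside the archive
    archive_url, member = url.split("/GUNZIP/")
    local_archive = os.path.join(cache_dir, os.path.basename(archive_url))
    os.makedirs(cache_dir, exist_ok=True)
    # Cache the whole archive locally once (gzip cannot be indexed into)
    if not os.path.exists(local_archive):
        with fsspec.open(archive_url, "rb") as src, open(local_archive, "wb") as dst:
            dst.write(src.read())
    # Unpack just the requested member and hand back a local path
    with tarfile.open(local_archive, mode="r:gz") as tf:
        tf.extract(member, path=cache_dir)
    return os.path.join(cache_dir, member)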

@rabernat
Contributor

Thanks for opening this issue. I agree we need to support this workflow somehow, since these kinds of archives are unfortunately very common.

I think once #369 is done, it will be much clearer how to do this. Basically, we will just create a custom PTransform to do the unzipping.

@cisaacstern
Member

I agree we'll still want to wait for the beam-refactor to go in before approaching this, and the following is not necessarily a drop-in fix, but I'm noting what seems to be a related line of work:

https://github.com/sozip/sozip-spec/blob/master/blog/01-announcement.md

via

https://twitter.com/howardbutler/status/1612457687949901825?s=20&t=krbnOD1DVC6BeyEsfPz3_g

@martindurant
Contributor

I have read about sozip after @rabernat pointed it out to me elsewhere. I would add a couple of things (see the sketch after this list for the fsspec patterns):

  • ZIP already allows access to any contained file with a simple index lookup.
  • fsspec can open files in a remote archive with chained URLs like "zip://memberfile::protocol://archive.path".
  • kerchunk can scan uncompressed files within ZIP or TAR archives and build an index for them: https://github.com/fsspec/kerchunk/blob/main/kerchunk/utils.py#L267
  • fsspec can open remote gzip-compressed files and pass them to the tar filesystem. However, random access then amounts to reading from the start every time. You could couple this with fsspec's caching to write out uncompressed versions of the member files locally or elsewhere, but it won't be as performant as curl | tar.
  • NOTHING can split up a gzip stream once it is written. If you have control over the writing, rather than trying to play tricks, don't use gzip. bzip2, xz, zstd, blosc, etc. all have internal stream blocks that can come close to random access with the right settings.
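For illustration, a rough sketch of the two fsspec patterns mentioned above; the archive URLs and member names are placeholders, not real datasets:

import fsspec

# Direct member access inside a remote ZIP: random access works,
# because ZIP stores a central directory index
with fsspec.open("zip://file1.nc::https://example.org/archive.zip") as f:
    header = f.read(64)

# The same pattern for a gzipped tar, routed through simplecache so
# the archive is downloaded once rather than re-read from the start
# on every access
of = fsspec.open(
    "tar://file1.nc::simplecache::https://example.org/archive.tar.gz",
    tar={"compression": "gzip"},
)
with of as f:
    data = f.read()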

@cisaacstern
Member

Thanks for the clarifications, @martindurant!

@rabernat
Contributor

I spent a little time playing with Python's tarfile and got the following little code snippet working:

import fsspec
import tarfile

url = "https://zenodo.org/record/6609035/files/datasets.tar.gz?download=1"

# Open the remote archive lazily, then wrap the file object in a
# gzip-aware TarFile
fp = fsspec.open(url)
tf = tarfile.open(fileobj=fp.open(), mode='r:gz')

# Walk the members sequentially; gzip only supports forward reads
while True:
    member = tf.next()
    if member is None:
        break
    print(member)

This could be the basis for a Beam PTransform that emits each file as an element.

https://gist.github.com/rabernat/616deabf2e12576f999470cbd82e9950
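For concreteness, a minimal sketch of what such a PTransform might look like, assuming apache_beam and fsspec are installed; the class name ExtractTarMembers and the (member name, bytes) element format are my own invention, not an existing pangeo-forge-recipes API:

import tarfile

import apache_beam as beam
import fsspec


class ExtractTarMembers(beam.PTransform):
    # Hypothetical transform: takes archive URLs and emits one
    # (member_name, raw_bytes) pair per regular file in the archive

    def expand(self, pcoll):
        return pcoll | beam.FlatMap(self._extract)

    @staticmethod
    def _extract(url):
        with fsspec.open(url, "rb") as f:
            # "r|gz" streams the tar forward-only, which is all
            # gzip allows anyway
            with tarfile.open(fileobj=f, mode="r|gz") as tf:
                for member in tf:
                    if member.isfile():
                        yield member.name, tf.extractfile(member).read()


with beam.Pipeline() as p:
    urls = p | beam.Create(
        ["https://zenodo.org/record/6609035/files/datasets.tar.gz?download=1"]
    )
    urls | ExtractTarMembers() | beam.Map(lambda kv: print(kv[0]))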

@martindurant
Contributor

The fsspec one-liner might be

allfiles = fsspec.open_files(
    "tar://*::https://zenodo.org/record/6609035/files/datasets.tar.gz?download=1",
    tar={"compression": "gzip"},
)

but it still must read the entire stream through. Other versions of the command are possible, but you can't get around gzip's single monolithic stream.
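Assuming that call succeeds, iterating the resulting OpenFile objects streams each member out in turn, re-reading the gzip stream from the start for each one, per the caveat above:

for openfile in allfiles:
    with openfile as f:
        print(openfile.path, len(f.read()))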
