Building recipes from files located within a large tar.gz file #442
Comments
Thanks for opening this issue. I agree we need to support this workflow somehow, since these kinds of archives are unfortunately very common. I think once #369 is done, it will be much more clear how to do this. Basically we will just create a custom PTransform to do the unzipping. |
I agree we'll still want to wait for
via
|
I have read about sozip following @rabernat pointing it out to me elsewhere. I would add a couple of things:
|
Thanks for the clarifications, @martindurant! |
I spent a little time playing with Python's tarfile and got the following little code snippet working:

    import fsspec
    import tarfile

    url = "https://zenodo.org/record/6609035/files/datasets.tar.gz?download=1"
    fp = fsspec.open(url)
    tf = tarfile.open(fileobj=fp.open(), mode='r:gz')
    while True:
        member = tf.next()
        if member is None:
            break
        print(member)

This could be the basis for a Beam PTransform that emits each file as an element. https://gist.github.com/rabernat/616deabf2e12576f999470cbd82e9950 |
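To make the "emit each file as an element" idea a bit more concrete, here is a minimal sketch using only the standard library. The function name `iter_tar_gz_members` is made up for illustration and is not part of pangeo-forge-recipes or Beam. It opens the archive in tarfile's streaming mode (`r|gz`), which reads forward only, so it also works on file objects that cannot seek.

```python
import io
import tarfile
from typing import BinaryIO, Iterator, Tuple


def iter_tar_gz_members(fileobj: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (member name, contents) for each regular file in a .tar.gz stream.

    Hypothetical helper for illustration only.
    """
    # "r|gz" treats the archive as a forward-only stream: members are visited
    # in the order they were written and no seeking is required.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as tf:
        for member in tf:
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            extracted = tf.extractfile(member)
            if extracted is None:
                continue
            # In stream mode the data must be read before advancing to the
            # next member, so read it fully here.
            yield member.name, extracted.read()
```

A generator of this shape could presumably be handed to something like Beam's `FlatMap`, but that wiring is an assumption and not shown. Each member is read fully into memory, so very large files inside the archive would need a different approach.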
The fsspec one-liner might be
but it still must read the entire stream through. Other versions of the command are possible, but you can't get around gzip's single monolithic stream. |
I wanted to highlight a use case I have encountered multiple times in the past weeks and which is only partially supported by pangeo-forge-recipes.
The core situation always is the following:
The files that are the source for a recipe are contained in a large compressed file (see pangeo-forge/staged-recipes#219 (comment) for an example). As a recipe builder, I want to be able to work with the files contained in there and, e.g., merge or concatenate them.
As I learned, if the container is a .zip/.tar file you can already index into it, but .gzip does not have that possibility.
I wonder if there is a possibility to expand the functionality of pgf-recipes and allow some syntax that requires caching/unpacking of .gz files but still maintains something akin to a url per file.
If, e.g., you have a file container.tar.gz which contains file1.nc and file2.nc and can be downloaded at http://zenodo.org/<project>/container.tar.gz, would it be at all possible to have some special command like GUNZIP that one could insert into a URL like this:

If pgf-recipes could recognize this 'command' (there is probably a better word for this), then the recipe could just require the data to be cached locally, unpack it and do its usual thing?
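As a rough sketch of the "cache locally, unpack, then keep something like a URL per file" step, assuming the archive has already been downloaded to local disk: the helper name `unpack_tar_gz` and the choice of `file://` URLs are purely illustrative and not an existing pgf-recipes feature.

```python
import shutil
import tarfile
from pathlib import Path
from typing import Dict


def unpack_tar_gz(archive_path: str, cache_dir: str) -> Dict[str, str]:
    """Unpack a local .tar.gz into cache_dir and map member names to file:// URLs.

    Hypothetical helper for illustration only.
    """
    cache_root = Path(cache_dir).resolve()
    urls: Dict[str, str] = {}
    with tarfile.open(archive_path, mode="r:gz") as tf:
        for member in tf:
            if not member.isfile():
                continue  # only regular files become recipe inputs
            target = (cache_root / member.name).resolve()
            # Refuse members whose path would escape the cache directory.
            if cache_root not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
            target.parent.mkdir(parents=True, exist_ok=True)
            source = tf.extractfile(member)
            if source is None:
                continue
            with source, open(target, "wb") as out:
                shutil.copyfileobj(source, out)
            urls[member.name] = target.as_uri()
    return urls
```

For the container.tar.gz example, the returned dict would have the keys file1.nc and file2.nc, each pointing at a local file that could then be treated like any other per-file input.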