
Reading from raw bytes? #145

Open · calebwin opened this issue Mar 27, 2021 · 17 comments

@calebwin

I'm downloading a Parquet file over the network using AWSS3.jl. Can I parse this into a DataFrame using Parquet.jl?

@quinnj (Member) commented Mar 30, 2021

@tanmaykm, I think this would be helpful to have; in Arrow.jl and CSV.jl, we ultimately always do all the file parsing/processing on a Vector{UInt8}, which makes it really convenient for cases like the one the OP mentioned.
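
For context, a minimal sketch of what this looks like with Arrow.jl today (the bucket and key names below are placeholders); the ask here is for Parquet.jl to accept the same kind of in-memory byte vector:

```julia
using AWSS3, Arrow

# s3_get returns the object body, here a Vector{UInt8}
bytes = AWSS3.s3_get("my-bucket", "data/table.arrow")

# Arrow.jl parses directly from the in-memory bytes; no file path needed
tbl = Arrow.Table(bytes)
```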

@tanmaykm (Member)

Having the processing functions work on Vector{UInt8} will be useful for files that would fit into memory. It would also work for files on disk that can be memory mapped.
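
For the memory-mapped case, a sketch (the path is a placeholder):

```julia
using Mmap

# Map the whole file into a Vector{UInt8} without reading it eagerly;
# the OS pages bytes in as the parser touches them.
bytes = Mmap.mmap("data/table.parquet")
```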

But for files being loaded from blob stores like S3, would a File-type abstraction over them be better? One that can fetch chunks from byte offsets as and when needed?

@calebwin (Author)

@tanmaykm I don't have first-hand knowledge of this, but it seems like you could easily incur network overhead on each fetch of bytes, so wouldn't a File abstraction be undesirable?

@tanmaykm (Member)

Yes, the reads would need to be buffered by the abstraction, of course. And most of the data access in this package is actually for reasonably large chunks of data, with byte-level access done from internal buffers, which I thought would suit this approach.

@calebwin (Author) commented Mar 31, 2021

I see. It looks like AWSS3.jl supports reading byte ranges from files in S3. But if this were behind an implementation of File (is there even such a thing as an AbstractFile?), does Parquet.jl support reading from a File, or does it have to be a filename for some reason?
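
For reference, a small sketch of the byte-range read mentioned above (bucket and key are placeholders; byte_range in AWSS3.jl is 1-based and inclusive):

```julia
using AWSS3

# Fetch only the first four bytes of the object; Parquet files start with
# the magic bytes "PAR1", so this doubles as a cheap sanity check.
header = AWSS3.s3_get("my-bucket", "data/table.parquet"; byte_range = 1:4)
@assert String(header) == "PAR1"
```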

@tanmaykm (Member)

The filepath is not used apart from the initial opening of the file and for filtering partitioned datasets. Those may work too, with minor changes, if we use URLs instead.

I have not come across an AbstractFile. Maybe we should have one; we would probably only need methods for filesize, seek, and reading a range of bytes implemented for S3 access.
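
A hypothetical sketch of that interface (none of these names exist yet; a real implementation would buffer larger chunks, per the earlier comments):

```julia
using AWSS3

# Hypothetical S3-backed "file": only filesize, seek, and ranged reads.
mutable struct S3File
    bucket::String
    key::String
    pos::Int    # current position, 0-based
    size::Int   # total object size in bytes
end

function S3File(bucket, key)
    meta = AWSS3.s3_get_meta(bucket, key)
    S3File(bucket, key, 0, parse(Int, meta["Content-Length"]))
end

Base.filesize(f::S3File) = f.size
Base.seek(f::S3File, pos::Integer) = (f.pos = pos; f)

# Read up to n bytes from the current position with a single ranged GET.
function readrange(f::S3File, n::Integer)
    stop = min(f.pos + n, f.size)   # exclusive end, 0-based
    data = AWSS3.s3_get(f.bucket, f.key; byte_range = (f.pos + 1):stop)
    f.pos = stop
    Vector{UInt8}(data)
end
```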

@calebwin (Author)

> The filepath is not used apart from the initial opening of the file and for filtering partitioned datasets. Those may work too, with minor changes, if we use URLs instead.

Got it.

> I have not come across an AbstractFile. Maybe we should have one; we would probably only need methods for filesize, seek, and reading a range of bytes implemented for S3 access.

I feel like the Julia I/O ecosystem is really great thanks to hard work by you and others, but there really needs to be a better unifying abstraction for reading datasets from files. I'm working on something like Dask for Julia and strongly feeling the need for something similar to fsspec. FilePathsBase.jl and FileIO.jl are great but not sufficient for multi-file datasets.

@calebwin (Author) commented Apr 5, 2021

@tanmaykm @quinnj Unfortunately, I don't have the time to develop this at the moment. Do you think this might be a valid case for just using S3FS via FUSE?

@tanmaykm (Member) commented Apr 5, 2021

Yes, I think S3FS via FUSE may work well in this case.
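
(Assuming the bucket has been mounted via s3fs, say at /mnt/s3, Parquet.jl can then read it like any local file; the path below is a placeholder.)

```julia
using Parquet, DataFrames

# Reads go through the FUSE mount; s3fs translates them into S3 requests.
df = DataFrame(read_parquet("/mnt/s3/data/fx_states.parquet"))
```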

@calebwin (Author) commented Apr 5, 2021

@tanmaykm Okay, my only concern: do you know whether S3FS will download files to disk when it isn't using a cache? I would hope that it would just download ranges of bytes into memory.

@tanmaykm (Member)

It does seem that way from the s3fs documentation, and I did not see files being written to disk when I tried it. It does claim that using a cache may make it faster, and there is an option to limit the cache size.

@cwiese commented Apr 12, 2021

I am having the same issue: reading a Parquet file on S3 and hoping to benefit from reading only a specific column. I would think this is a very popular use case.

@cwiese commented Apr 12, 2021

```julia
using AWSS3, Arrow

root = "s3://$(bucket)/$(path)/$(run_date)"
fxpath = "$(root)/fx_states.parquet"
p = S3Path(fxpath, config=config)
f = read(p)                  # Vector{UInt8} containing the Parquet file's bytes
ar = Arrow.Stream(f, pos=1)  # fails: these bytes are Parquet, not Arrow
```

This gets me "ERROR: ArgumentError: no arrow ipc messages found in provided input".

I figured I'd go straight to Arrow, and I can create a DataFrame from it if needed. Perhaps @quinnj can correct me here?

@calebwin (Author)

That definitely won't work: Arrow.Stream expects Arrow data, but you provided it Parquet data. Arrow and Parquet are different formats. Also note that the Arrow format is the same (or almost the same) regardless of whether the data lives on disk, comes over the network, or sits in memory.

@cwiese commented Apr 12, 2021

Right! But I do not see a way to construct a Parquet.File with an S3Path. If I want to read Parquet files on AWS S3, it seems I will need to use Python for now. Now the challenge is how to avoid copying all this data multiple times while getting it into a Julia DataFrame.
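
One possible stopgap under the current API, sketched with placeholder bucket, key, and config: write the downloaded bytes to a temporary file and hand that path to Parquet.jl. It still costs an extra copy to disk, which is exactly the overhead this issue is about avoiding.

```julia
using AWSS3, Parquet, DataFrames

bytes = read(S3Path("s3://my-bucket/fx_states.parquet", config=config))
df = mktemp() do tmppath, io
    write(io, bytes)                  # spill the downloaded bytes to a temp file
    close(io)
    DataFrame(read_parquet(tmppath))  # Parquet.jl reads from the local path
end
```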

@layne-sadler commented Aug 28, 2021

Looking at the Python equivalent:

pyarrow.parquet.read_table accepts a "path... or file-like objects":
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html

which in turn enables

pandas.read_parquet(), which likewise accepts a "path... or file-like objects":
https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html

@jkcoxson

I also need to read from raw bytes for a library I'm writing. What would it take to implement this?
