Hi guys. I've been spending quite some time on this, so I decided to start this thread. Any help would be really appreciated.
I'm writing Python code to read the Parquet files generated by the AWS S3 Export feature for Aurora PostgreSQL and load them back into an Aurora PostgreSQL database. I use wr.s3.read_parquet to read the Parquet datasets from S3, pandas' to_csv to serialize them to CSV in memory, and PostgreSQL's COPY ... FROM STDIN to bulk-load them into the target database. The table structure is created with pg_dump, so it's identical between the original source and the target.
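For context, here's roughly what the loader looks like (a simplified sketch; the bucket path, table name, and connection details are placeholders):

```python
import io

import awswrangler as wr
import psycopg2

# Read the Aurora S3 Export output back into a pandas DataFrame.
df = wr.s3.read_parquet(path="s3://my-bucket/my-export/my_table/", dataset=True)

# Serialize the DataFrame to CSV in memory.
buf = io.StringIO()
df.to_csv(buf, index=False, header=False)
buf.seek(0)

# Bulk-load into the target table with COPY ... FROM STDIN.
conn = psycopg2.connect("host=... dbname=... user=... password=...")
with conn, conn.cursor() as cur:
    cur.copy_expert("COPY my_table FROM STDIN WITH (FORMAT csv)", buf)
```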
It works fine most of the time. However, when it comes across tables that have columns of the bytea data type (i.e. BLOBs), the load fails with ERROR: invalid input syntax for type bytea. Here's an example of a column value that fails:
--Original column data (bytea data type, 1 row) in Postgres before the S3 Export:
\xaced0005757200175b4c6a6176612e696f2e53657269616c697a61626c653baed009ac53d7ed490200007870000000027372000e6a6176612e6c616e672e4c6f6e673b8be490cc8f23df0200014a000576616c7565787200106a6176612e6c616e672e4e756d62657286ac951d0b94e08b020000787000000000000818197371007e00020000000000000002
--The same column value in the exported Parquet file in S3 (it appears to be base64-encoded, which is fine). The column type here is BINARY:
rO0ABXVyABdbTGphdmEuaW8uU2VyaWFsaXphYmxlO67QCaxT1+1JAgAAeHAAAAACc3IADmphdmEubGFuZy5Mb25nO4vkkMyPI98CAAFKAAV2YWx1ZXhyABBqYXZhLmxhbmcuTnVtYmVyhqyVHQuU4IsCAAB4cAAAAAAACBgZc3EAfgACAAAAAAAAAAI=
--And the same column value as returned by wr.s3.read_parquet. The dtype here is object:
b'\xac\xed\x00\x05ur\x00\x17[Ljava.io.Serializable;\xae\xd0\t\xacS\xd7\xedI\x02\x00\x00xp\x00\x00\x00\x02sr\x00\x0ejava.lang.Long;\x8b\xe4\x90\xcc\x8f#\xdf\x02\x00\x01J\x00\x05valuexr\x00\x10java.lang.Number\x86\xac\x95\x1d\x0b\x94\xe0\x8b\x02\x00\x00xp\x00\x00\x00\x00\x00\x08\x18\x19sq\x00~\x00\x02\x00\x00\x00\x00\x00\x00\x00\x02'
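As a quick sanity check, base64-decoding the Parquet value does give back exactly the original hex, so the export itself is lossless:

```python
import base64

# The BINARY value from the Parquet file, as shown above.
b64 = "rO0ABXVyABdbTGphdmEuaW8uU2VyaWFsaXphYmxlO67QCaxT1+1JAgAAeHAAAAACc3IADmphdmEubGFuZy5Mb25nO4vkkMyPI98CAAFKAAV2YWx1ZXhyABBqYXZhLmxhbmcuTnVtYmVyhqyVHQuU4IsCAAB4cAAAAAAACBgZc3EAfgACAAAAAAAAAAI="

# Prints aced0005757200175b4c... — the original bytea hex.
print(base64.b64decode(b64).hex())
```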
So wr.s3.read_parquet transforms the data to the point where I haven't found a way to load it back into Postgres looking like the original. If it kept the same value as in the Parquet file, I could fix it up on the Postgres side so that it matched the original exactly (e.g. load it as text and then decode it back to bytea), but as you can see that's not the case.
Any suggestions on how I could deal with this type of binary data and end up with exactly the original value (or even with the value as stored in the Parquet file, i.e. without it being transformed when calling wr.s3.read_parquet)?
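To make the goal concrete, this is the kind of per-column fix-up I imagine applying before the COPY. It's a rough sketch only: the column name "payload" is made up, and it assumes read_parquet returns the values as Python bytes and that the COPY uses CSV format (where backslashes aren't escape characters):

```python
# Convert the raw bytes from read_parquet into the \x-prefixed hex
# representation that a bytea column accepts on input.
def to_bytea_hex(value):
    if value is None:
        return None  # keep SQL NULLs as NULLs
    return "\\x" + value.hex()

# "payload" is a hypothetical bytea column in the DataFrame.
df["payload"] = df["payload"].map(to_bytea_hex)
```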
Thank you