Is your feature request related to a problem? Please describe.
There is a new requirement at my workplace to embed metadata directly in all our cloud data (driven by the need to move data between different hosting solutions). This means our fallback data formats have become Avro/Parquet, since you can attach metadata directly to the schemas. However, there is currently no direct way to do this using the `s3.to_parquet` function, so I wonder if this capability could be added.
Just FYI, I think the `s3.to_parquet` functionality is brilliant and saves so much effort when making Glue tables with partitions etc., so I would really like to keep using it in our workflows rather than write custom boto3/pyarrow logic.
Describe the solution you'd like
Extra metadata can be added to the Parquet schema using the `metadata` parameter of `pa.schema` (https://arrow.apache.org/docs/python/generated/pyarrow.schema.html). Currently, the pyarrow schema is created in the `write` method of `_S3WriteStrategy` via the `_data_types.pyarrow_schema_from_pandas` function.
What I propose is that we add a new `metadata` key to the existing `pyarrow_additional_kwargs` dictionary. This avoids any changes to the API, so only a minor version bump would be needed.
This would also add the same capability for ORC files via the `pyarrow_additional_kwargs` argument of the `s3.to_orc` function.
From there the metadata can be extracted and validated in the `_S3WriteStrategy` class (or separately in the `_S3ParquetWriteStrategy`/`_S3ORCWriteStrategy` child classes if the two formats have different metadata constraints; I haven't researched this part yet). We can then pass the metadata to an amended `_data_types.pyarrow_schema_from_pandas` function.
Describe alternatives you've considered
After digging into the code a bit more, I can see that you can attach your own schema directly via `pyarrow_additional_kwargs`, which then overrides the schema generated by awswrangler here.
However, I would still argue that there is a need for the feature described above as I want awswrangler to make the schema for me, and there should be a way to simply pass a dictionary of file metadata to the schema generator function.
Maybe `pyarrow_additional_kwargs` isn't the best place for it though, as the dictionary is expanded directly into `pyarrow.parquet.ParquetWriter`, so the `metadata` key would have to be popped out before that point.
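The extraction could be as simple as popping the key before the kwargs are forwarded (a sketch; `split_metadata` is a hypothetical helper, not an existing awswrangler function):

```python
def split_metadata(pyarrow_additional_kwargs):
    """Separate the proposed 'metadata' key from the kwargs that
    will be expanded into pyarrow.parquet.ParquetWriter."""
    kwargs = dict(pyarrow_additional_kwargs or {})
    # 'metadata' must not reach ParquetWriter, which would reject
    # an unexpected keyword argument.
    metadata = kwargs.pop("metadata", None)
    return metadata, kwargs
```

The returned `metadata` would then be handed to the schema generator, while the remaining `kwargs` pass through unchanged.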
Let me know your thoughts.
Additional considerations
I know that there are several other functions in this library for handling Parquet/ORC metadata (e.g. `read_parquet_metadata`, `read_orc_metadata`, `store_parquet_metadata`), so we would need to check that these still work correctly. I expect they will, as they are designed to work with the Parquet/ORC specifications.
I am willing to submit a PR for this feature if approved.
@walter9388, contributions are always welcome. We can discuss whether `pyarrow_additional_kwargs` is indeed the best input argument to hold metadata on your PR.