How to use Tables.partitioner to read Spark partitioned datasets #242

mahiki · 2021-07-11T21:16:28Z

I apologize in advance for not understanding Table iterators in this context. Perhaps this is very readable for a developer, but I am a data scientist trying to work around a problem in the Parquet.jl package.

This is a request for examples added to the documentation or help in usage.

Use Case:
I need to read files (CSV or Parquet format) into a DataFrame from a partitioned directory structure like the following, adding a string-type column containing "XXX" wherever the file path contains the pattern "column_name=XXX":

tree ./datasets
└── parquet
    └── jobs
        ├── d_monthly_table
        │   ├── dataset_date=2021-05-31
        │   │   ├── device_family=ABC
        │   │   │   └── part-00000-591b098d-a69d-4fd9-b163-6bd1ee22f3bb.c000.snappy.parquet
        │   │   ├── device_family=DEF
        │   │   │   └── part-00000-c8345286-a65a-4b61-a9d4-e6bb13ea3bfd.c000.snappy.parquet
        │   │   └── device_family=GHI
        │   │       └── part-00000-f873f5bc-82dc-4085-839a-745cfc1aa855.c000.snappy.parquet
        │   ├── dataset_date=2021-06-30
        │   │   ├── device_family=AVS
        │   │   │   └── part-00000-6655178e-0bbb-4741-ab6a-8efe6aebbded.c000.snappy.parquet
        │   │   ├── device_family=EFD
        │   │   │   └── part-00000-358d8620-c4e3-44cb-89d5-d00eb23ad52b.c000.snappy.parquet
        │   │   └── device_family=FTV
        │   │       └── part-00000-c10db044-3b38-4032-ad22-aab3edb8e1ba.c000.snappy.parquet
        │   └── dataset_date=2021-06-30_$folder$
        └── d_monthly_table_$folder$

I think if I pass a vector of file paths to Tables.partitioner, as in the docs example, I could append a constant column column_name with content "XXX" by extracting the pattern from each data file's path. This would take me many hours to figure out; could a concrete example be added to the documentation? It's the best way to learn, IMO.

# example from Tables.jl docs:
for tbl in Tables.partitions(Tables.partitioner(CSV.File, list_of_csv_files))
    Threads.@spawn begin
        cols = Tables.columns(tbl)
        # do stuff with cols  --- this is where I append a column to each table (?) with content pulled from the file path.
    end
end
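For the path-parsing half of this, a minimal sketch in plain Julia might look like the following. The helper name `partition_pairs` is hypothetical (it is not part of Tables.jl, CSV.jl, or Parquet.jl), and the file name in the usage example is shortened and made up. It only assumes that partition directories are named `key=value`, as in the tree above.

```julia
# Hypothetical helper: collect the `key=value` directory names found in a
# Hive/Spark-style partitioned file path, in the order they appear.
function partition_pairs(path::AbstractString)
    found = Pair{String,String}[]
    # Look at directory components only; the file name itself is skipped.
    for part in splitpath(dirname(path))
        m = match(r"^([^=]+)=(.*)$", part)
        m === nothing || push!(found, String(m.captures[1]) => String(m.captures[2]))
    end
    return found
end

# Usage with a shortened, made-up file name under the layout shown above.
path = joinpath("datasets", "parquet", "jobs", "d_monthly_table",
                "dataset_date=2021-05-31", "device_family=ABC",
                "part-00000.snappy.parquet")
partition_pairs(path)
```

Each returned pair could then be turned into a constant column for the table read from that file, with one value repeated for every row.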


quinnj · 2021-10-23T02:57:54Z

Sorry for the slow reply; CSV.jl now supports this without the need for Tables.partitioner. See the source keyword argument documentation here. Essentially, you can pass a vector of filenames as strings together with the source keyword, as in CSV.File(files; source=:filename), and that will automatically generate a column in which the rows from each file have that file's name as their value. Hope that helps.
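As a rough illustration of the `source` keyword described above, here is a sketch that assumes CSV.jl 0.9 or newer is installed. The directory layout, file contents, and the column name `source_file` are made up for the example; only `CSV.File` and its `source` keyword come from CSV.jl.

```julia
using CSV  # assumption: CSV.jl 0.9 or newer

# Build a tiny partitioned layout in a temporary directory (made-up data).
root = mktempdir()
files = String[]
for family in ("ABC", "DEF")
    dir = joinpath(root, "dataset_date=2021-05-31", "device_family=$family")
    mkpath(dir)
    file = joinpath(dir, "part-00000.csv")
    write(file, "id,value\n1,10\n2,20\n")
    push!(files, file)
end

# Reading a vector of files with `source` adds a column (here named
# `source_file`) that records which input each row came from.
f = CSV.File(files; source=:source_file)
f.source_file
```

Since the added column holds the input name, partition values such as `device_family` would still need to be pulled out of that string afterwards, for example with a regular expression on the `key=value` path segments.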

mahiki · 2021-10-25T19:50:02Z

Thank you, this is fantastic. I can't wait to test it out.

quinnj closed this as completed Oct 23, 2021
