
temp_ table creation inferred a dtype from a sample that couldn't accommodate values from full data set #108

Open
MattTriano opened this issue Apr 4, 2023 · 0 comments


Ran into an issue when ingesting the latest update to the parcel sales data set. The task update_socrata_table.load_data_tg.load_csv_data.ingest_csv_data inferred the wrong dtype for the sale_document_num column when creating the temp_ table in data_raw, as it was previously set to infer dtypes from a sample of 2M rows. That sample must have only included rows where the sale_document_num value was strictly numeric, but in the full data set (which has 2.15M rows), one row had an alphanumeric value (shown below).

```
...
[2023-04-04, 01:01:47 CDT] {socrata_tasks.py:299} INFO - Failed to ingest flat file to temp table. Error: invalid input syntax for type bigint: "2106116060B"
CONTEXT:  COPY temp_cook_county_parcel_sales, line 2038793, column sale_document_num: "2106116060B"
...
```

I got the pipeline to run successfully by simply raising the sample size, but this is not ideal because A) pandas has to load that whole sample into memory (which raises the amount of memory the system needs to run), and B) a bad value could just as easily appear in row n+1 or beyond.
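The failure mode can be reproduced in miniature with pandas alone (this toy CSV is illustrative, not real parcel data): when only the leading rows are sampled, pandas infers an integer dtype, which maps to bigint in Postgres; reading the full file, with its one alphanumeric value, forces an object dtype instead.

```python
import io

import pandas as pd

# Toy file: the first rows are strictly numeric, but the last row holds the
# alphanumeric value that broke the COPY into the temp_ table.
csv = (
    "sale_document_num\n"
    + "\n".join(str(2106116060 + i) for i in range(5))
    + "\n2106116060B\n"
)

# Inferring from a sample of the first rows misses the bad value entirely.
sample = pd.read_csv(io.StringIO(csv), nrows=5)
# Inferring from the full file picks the dtype that actually fits the data.
full = pd.read_csv(io.StringIO(csv))

print(sample["sale_document_num"].dtype)  # int64 -> becomes bigint in postgres
print(full["sale_document_num"].dtype)    # object -> would have become text
```

No sample size short of the full row count is guaranteed to see the offending row, which is why raising the sample only papers over the problem.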

## Ideation on durable strategies

### Use dtypes from the persistent data_raw table

Only possible for an existing table. On the first pull of a new data set, it would have to fall back to the current implementation. This would have short-circuited the issue this time, but it wouldn't have prevented the problem on an initial pull (although that might not be so bad, as the developer would be actively engaged at the moment the bug emerged).
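This strategy could look something like the sketch below: fetch the persistent table's column types from Postgres's standard information_schema.columns view and translate them into a dtype dict for pandas. The PG_TO_PANDAS mapping and the helper name are hypothetical, not existing code in this repo.

```python
# A standard query to fetch the persistent table's column types
# (run via any postgres client, e.g. psycopg2):
#   SELECT column_name, data_type
#   FROM information_schema.columns
#   WHERE table_schema = 'data_raw' AND table_name = %s;

# Hypothetical mapping from postgres data_type values to pandas dtypes.
PG_TO_PANDAS = {
    "bigint": "Int64",            # nullable integer, survives rows with NULLs
    "integer": "Int64",
    "double precision": "float64",
    "numeric": "float64",
    "boolean": "boolean",
    "text": "string",
}

def dtypes_from_information_schema(rows):
    """Turn (column_name, data_type) pairs from information_schema.columns
    into a dict usable as pandas.read_csv(..., dtype=...). Unknown postgres
    types fall back to string, which any value can be COPYed into."""
    return {col: PG_TO_PANDAS.get(pg_type, "string") for col, pg_type in rows}
```

With this in place, the sample-based inference would only run when data_raw has no persistent table for the data set yet.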

### Implement an error handler that learns from invalid types

In practice, this would still infer dtypes from a reasonable sample, but if incompatible values are found when ingesting the full data set into the temp_ table, the error-handling path could factor that information in, alter the relevant column's type, and retry the ingestion.
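One way to sketch that path, assuming the COPY failure surfaces the message shape seen in the log above: parse the error text to learn which column choked, widen that column, and retry. The regex, function name, and the commented retry loop are assumptions for illustration, not existing code in this repo.

```python
import re

# Matches the two-part message postgres emits on a failed COPY, e.g.:
#   invalid input syntax for type bigint: "2106116060B"
#   CONTEXT:  COPY temp_..., line 2038793, column sale_document_num: "..."
COPY_ERR = re.compile(
    r'invalid input syntax for type (?P<pg_type>\w+): "[^"]*"'
    r'.*?column (?P<column>\w+):',
    re.DOTALL,
)

def widen_target(error_text):
    """Return (column_name, fallback_type) parsed from a COPY error message,
    or None if the message doesn't match the expected shape."""
    m = COPY_ERR.search(error_text)
    if m is None:
        return None
    # text accommodates any value; a later pass could try narrower types first
    return m.group("column"), "text"

# The retry loop would then be roughly (hypothetical helpers, psycopg2 names):
#   try:
#       copy_csv_into_temp_table(cur)
#   except psycopg2.errors.InvalidTextRepresentation as err:
#       col, new_type = widen_target(str(err))
#       cur.execute(f"ALTER TABLE ... ALTER COLUMN {col} TYPE {new_type}")
#       copy_csv_into_temp_table(cur)  # retry with the widened column
```

A cap on retries would keep a pathologically messy file from looping forever.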
