Ran into an issue when ingesting the latest update to parcel sales data. The task update_socrata_table.load_data_tg.load_csv_data.ingest_csv_data inferred the wrong dtype for the sale_document_num column when creating the temp_ table in data_raw, as it was previously set to infer from a sample of 2M rows. That sample must have only included rows where the sale_document_num value was strictly numeric, but in the full data set (which has 2.15M rows), one row had an alphanumeric value (shown below).
...
[2023-04-04, 01:01:47 CDT] {socrata_tasks.py:299} INFO - Failed to ingest flat file to temp table. Error: invalid input syntax for type bigint: "2106116060B"
CONTEXT: COPY temp_cook_county_parcel_sales, line 2038793, column sale_document_num: "2106116060B"
...
I got the pipeline to run successfully by simply raising the sample size, but this is not ideal because A) pandas has to load that sample into memory (which raises the amount of memory the system needs to run), and B) a bad value could just as easily appear in row n+1 or later.
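A minimal sketch of the failure mode (not the pipeline's actual code): the sampled rows are all numeric, so pandas infers int64 (which maps to bigint), but a value past the sample boundary can't be cast to that dtype.

```python
import io

import pandas as pd

# Toy CSV standing in for the Socrata export; the alphanumeric value sits
# outside the rows covered by the dtype-inference sample.
csv_data = io.StringIO(
    "sale_document_num\n"
    "2106116059\n"
    "2106116060\n"
    "2106116060B\n"
)

sample = pd.read_csv(csv_data, nrows=2)
print(sample["sale_document_num"].dtype)  # int64 -> temp_ column becomes bigint

csv_data.seek(0)
full = pd.read_csv(csv_data)
print(full["sale_document_num"].dtype)    # object -> the column really needs text
```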
Ideation on durable strategies
Use dtypes from the persistent data_raw table
Only possible for an existing table; on the first pull of a new data set, it would have to fall back to the current implementation. This would have short-circuited the issue this time, but it wouldn't have prevented the problem on an initial pull (although that might not be so bad, as the developer would be engaged at the time the bug emerged). A rough sketch is below.
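A rough sketch of this strategy, assuming a psycopg2 connection; the table names and the helper name are illustrative, and the real task would plug this into ingest_csv_data:

```python
import psycopg2

def get_existing_dtypes(conn, schema: str, table: str) -> dict:
    """Return {column_name: data_type} for a table that already exists in
    data_raw, or an empty dict (meaning: fall back to sample-based inference)."""
    query = """
        SELECT column_name, data_type
        FROM information_schema.columns
        WHERE table_schema = %s AND table_name = %s;
    """
    with conn.cursor() as cur:
        cur.execute(query, (schema, table))
        return dict(cur.fetchall())

# Usage idea: if dtypes are found, build the temp_ table from them;
# otherwise infer from a sample as is done today.
# with psycopg2.connect(...) as conn:
#     dtypes = get_existing_dtypes(conn, "data_raw", "cook_county_parcel_sales")
```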
Implement an error handler that learns from invalid types
In practice, it would still try to infer from a reasonable sample, but if incompatible values are found when ingesting the full data set into temp_, the error-handling path could factor this information in, alter the relevant column, and try ingesting again.
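A hedged sketch of this strategy: attempt the COPY with the inferred schema, and if a value can't be cast, widen the offending column to TEXT and retry. Parsing the column name out of the CONTEXT line PostgreSQL emits (e.g. "COPY temp_..., line N, column sale_document_num: ...") is an assumption about the error format, not the project's actual handler, and the helper name is hypothetical.

```python
import re

import psycopg2

COPY_CONTEXT_RE = re.compile(r"column (\w+):")

def copy_with_type_fallback(conn, copy_sql: str, csv_path: str, table: str,
                            max_retries: int = 3) -> None:
    """COPY a flat file into the temp_ table, widening any column whose
    inferred dtype rejects a value, then retrying the COPY."""
    for _ in range(max_retries):
        try:
            with conn.cursor() as cur, open(csv_path) as f:
                cur.copy_expert(copy_sql, f)
            conn.commit()
            return
        except psycopg2.errors.InvalidTextRepresentation as err:
            conn.rollback()
            match = COPY_CONTEXT_RE.search(err.diag.context or "")
            if not match:
                raise
            bad_column = match.group(1)
            with conn.cursor() as cur:
                # Widen the column so values like "2106116060B" fit.
                # (Sketch only: identifiers aren't sanitized here.)
                cur.execute(
                    f"ALTER TABLE {table} ALTER COLUMN {bad_column} TYPE TEXT;"
                )
            conn.commit()
    raise RuntimeError(f"COPY into {table} still failing after retries")
```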