
temp_ table creation inferred a dtype from a sample that couldn't accommodate values from full data set #108

Open
MattTriano opened this issue Apr 4, 2023 · 0 comments


Ran into an issue when ingesting the latest update to the parcel sales data set. The task update_socrata_table.load_data_tg.load_csv_data.ingest_csv_data inferred the wrong dtype for the sale_document_num column when creating the temp_ table in data_raw, as it was previously set to infer dtypes from a sample of 2M rows. That sample must have only included rows where the sale_document_num value was strictly numeric, but in the full data set (which has 2.15M rows), one row had an alphanumeric value (shown below).

```
...
[2023-04-04, 01:01:47 CDT] {socrata_tasks.py:299} INFO - Failed to ingest flat file to temp table. Error: invalid input syntax for type bigint: "2106116060B"
CONTEXT:  COPY temp_cook_county_parcel_sales, line 2038793, column sale_document_num: "2106116060B"
...
```

I got the pipeline to run successfully by simply raising the sample size, but this is not ideal because A) pandas has to load that whole sample into memory (which raises the amount of memory the system needs to run), and B) a bad value could just as easily appear in row n+1 or beyond.
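The failure mode can be reproduced in miniature with pandas alone (this toy CSV is illustrative, not real parcel data): when only the leading rows are sampled, pandas infers an integer dtype, which maps to bigint in Postgres; reading the full file, with its one alphanumeric value, forces an object dtype instead.

```python
import io

import pandas as pd

# Toy file: the first rows are strictly numeric, but the last row holds the
# alphanumeric value that broke the COPY into the temp_ table.
csv = (
    "sale_document_num\n"
    + "\n".join(str(2106116060 + i) for i in range(5))
    + "\n2106116060B\n"
)

# Inferring from a sample of the first rows misses the bad value entirely.
sample = pd.read_csv(io.StringIO(csv), nrows=5)
# Inferring from the full file picks the dtype that actually fits the data.
full = pd.read_csv(io.StringIO(csv))

print(sample["sale_document_num"].dtype)  # int64 -> becomes bigint in postgres
print(full["sale_document_num"].dtype)    # object -> would have become text
```

No sample size short of the full row count is guaranteed to see the offending row, which is why raising the sample only papers over the problem.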

## Ideation on durable strategies

### Use dtypes from the persistent data_raw table

Only possible for an existing table. On the first pull of a new data set, it would have to fall back to the current implementation. This would have short-circuited the issue this time, but it wouldn't have prevented the problem on an initial pull (although that might not be so bad, as the developer would be actively engaged at the moment the bug emerged).
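This strategy could look something like the sketch below: fetch the persistent table's column types from Postgres's standard information_schema.columns view and translate them into a dtype dict for pandas. The PG_TO_PANDAS mapping and the helper name are hypothetical, not existing code in this repo.

```python
# A standard query to fetch the persistent table's column types
# (run via any postgres client, e.g. psycopg2):
#   SELECT column_name, data_type
#   FROM information_schema.columns
#   WHERE table_schema = 'data_raw' AND table_name = %s;

# Hypothetical mapping from postgres data_type values to pandas dtypes.
PG_TO_PANDAS = {
    "bigint": "Int64",            # nullable integer, survives rows with NULLs
    "integer": "Int64",
    "double precision": "float64",
    "numeric": "float64",
    "boolean": "boolean",
    "text": "string",
}

def dtypes_from_information_schema(rows):
    """Turn (column_name, data_type) pairs from information_schema.columns
    into a dict usable as pandas.read_csv(..., dtype=...). Unknown postgres
    types fall back to string, which any value can be COPYed into."""
    return {col: PG_TO_PANDAS.get(pg_type, "string") for col, pg_type in rows}
```

With this in place, the sample-based inference would only run when data_raw has no persistent table for the data set yet.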

### Implement an error handler that learns from invalid types

In practice, this would still infer dtypes from a reasonable sample, but if incompatible values are found when ingesting the full data set into the temp_ table, the error-handling path could factor that information in, alter the relevant column's type, and retry the ingestion.
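One way to sketch that path, assuming the COPY failure surfaces the message shape seen in the log above: parse the error text to learn which column choked, widen that column, and retry. The regex, function name, and the commented retry loop are assumptions for illustration, not existing code in this repo.

```python
import re

# Matches the two-part message postgres emits on a failed COPY, e.g.:
#   invalid input syntax for type bigint: "2106116060B"
#   CONTEXT:  COPY temp_..., line 2038793, column sale_document_num: "..."
COPY_ERR = re.compile(
    r'invalid input syntax for type (?P<pg_type>\w+): "[^"]*"'
    r'.*?column (?P<column>\w+):',
    re.DOTALL,
)

def widen_target(error_text):
    """Return (column_name, fallback_type) parsed from a COPY error message,
    or None if the message doesn't match the expected shape."""
    m = COPY_ERR.search(error_text)
    if m is None:
        return None
    # text accommodates any value; a later pass could try narrower types first
    return m.group("column"), "text"

# The retry loop would then be roughly (hypothetical helpers, psycopg2 names):
#   try:
#       copy_csv_into_temp_table(cur)
#   except psycopg2.errors.InvalidTextRepresentation as err:
#       col, new_type = widen_target(str(err))
#       cur.execute(f"ALTER TABLE ... ALTER COLUMN {col} TYPE {new_type}")
#       copy_csv_into_temp_table(cur)  # retry with the widened column
```

A cap on retries would keep a pathologically messy file from looping forever.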
