S3 connect - how to handle bad/incomplete/incorrect data records #8

Open
chipmaurer opened this issue Apr 28, 2021 · 1 comment

@chipmaurer

First, check the Hive connector to see what it does with bogus data, and do something similar. Test with things like a CSV that has blank rows, missing fields, values of the wrong type, etc.
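
As a starting point for test data, here is a minimal Python sketch that sorts rows into those three failure categories. The schema, the sample rows, and the `classify` helper are all hypothetical and only for illustration; nothing here is taken from the Hive or S3 connector code.

```python
import csv
import io

# Hypothetical schema: (column name, converter) pairs. Not the connector's schema.
SCHEMA = [("id", str), ("title", str), ("year", int)]

# Hypothetical sample: one good row, one blank row, one row with a missing
# field, and one row with a non-numeric value in the numeric column.
SAMPLE = "s1,First,2017\n\ns2,Second\ns3,Third,not-a-year\n"


def classify(fields):
    """Return 'ok' or a short label describing why the row is bad."""
    if not fields or all(f.strip() == "" for f in fields):
        return "blank"
    if len(fields) != len(SCHEMA):
        return "wrong_field_count"
    for (name, convert), value in zip(SCHEMA, fields):
        try:
            convert(value)
        except ValueError:
            return "bad_type:" + name
    return "ok"


# csv.reader yields an empty list for a blank line.
results = [classify(row) for row in csv.reader(io.StringIO(SAMPLE))]
print(results)
```

Whether a bad row should be skipped, turned into NULLs, or fail the query is the open question; the labels just make each case easy to exercise separately.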

@chipmaurer

Here is a problematic row that needs to be addressed:

s94,Movie,27: Gone Too Soon,Simon Napier-Bell,"Janis Joplin, Jimi Hendrix, Amy Winehouse, Jim Morrison, Kurt Cobain",United Kingdom,1-May-18,2017,TV-MA,70 min,Documentaries,"Explore the circumstances surrounding the tragic deaths at 27 of Jimi Hendrix, Jim Morrison, Brian Jones, Janis Joplin, Kurt Cobain and Amy Winehouse."

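When a CSV cell contains a quoted, comma-separated list, the S3 column decoder gets confused, apparently by treating the commas inside the quotes as field separators. The columns then shift, and you can end up with this error: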

Query 20210921_190308_00129_7i234 failed: For input string: " Jim Morrison"
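
To illustrate the suspected cause, here is a small Python sketch comparing a plain split on commas with a quote-aware parse of the row above. It uses Python's standard `csv` module, not the connector's decoder, and it assumes the eighth column (index 7) is the numeric one; that is a guess based on where `2017` sits in the row.

```python
import csv
import io

# The row from this issue, reproduced verbatim.
ROW = (
    's94,Movie,27: Gone Too Soon,Simon Napier-Bell,'
    '"Janis Joplin, Jimi Hendrix, Amy Winehouse, Jim Morrison, Kurt Cobain",'
    'United Kingdom,1-May-18,2017,TV-MA,70 min,Documentaries,'
    '"Explore the circumstances surrounding the tragic deaths at 27 of '
    'Jimi Hendrix, Jim Morrison, Brian Jones, Janis Joplin, Kurt Cobain '
    'and Amy Winehouse."'
)

# Assumption: index 7 is the numeric column (the one holding 2017).
NUMERIC_INDEX = 7

# Plain split: commas inside the quoted cells are treated as separators.
naive_fields = ROW.split(",")

# Quote-aware parse: quoted cells stay in one piece.
quoted_fields = next(csv.reader(io.StringIO(ROW)))

print(repr(naive_fields[NUMERIC_INDEX]))   # ' Jim Morrison'
print(repr(quoted_fields[NUMERIC_INDEX]))  # '2017'
```

With the plain split, the value that lands in the numeric slot is `" Jim Morrison"`, which matches the string in the error message, so a quote-aware split looks like the fix to try first.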
