Possible encoding issue when processing CSV files #235

Open
realrolfje opened this issue Sep 3, 2024 · 0 comments

realrolfje (Owner) commented Sep 3, 2024

Describe the bug
When processing large numbers of CSV records in production, we sometimes see an IOException with the message: IOException reading next record: java.io.IOException: (line 1) invalid char between encapsulated token and delimiter.

When this happens, Anonimatron stops writing to the output file and leaves it incomplete (the input file is approx. 50 MB; the output file is left at 2.2 MB).

To Reproduce
Read CSV files from customers. We do not yet know exactly what causes this; it happens irregularly.
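
To help narrow down which records trigger the error, the following is a minimal diagnostic sketch. It rests on the assumption that the message is raised when a closing quote is followed by a character that is neither the delimiter nor whitespace, which matches our reading of the Commons CSV lexer but has not been confirmed against the failing files. `QuoteCheck` and `firstStrayCharAfterQuote` are hypothetical names, not part of Anonimatron or Commons CSV, and the helper only looks at single-line records (quoted fields with embedded newlines are not handled).

```java
import java.util.OptionalInt;

/** Hypothetical diagnostic helper; not part of Anonimatron or Commons CSV. */
final class QuoteCheck {

    /**
     * Returns the 0-based index of the first character that follows a closing
     * quote but is neither the delimiter nor whitespace, or empty if the line
     * contains no such character.
     */
    static OptionalInt firstStrayCharAfterQuote(String line, char delimiter, char quote) {
        boolean inQuotes = false;
        boolean atFieldStart = true;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c != quote) {
                    continue; // ordinary character inside a quoted field
                }
                if (i + 1 < line.length() && line.charAt(i + 1) == quote) {
                    i++; // doubled quote: an escaped quote inside the field
                    continue;
                }
                // Closing quote: only whitespace may precede the next delimiter.
                inQuotes = false;
                int j = i + 1;
                while (j < line.length() && line.charAt(j) != delimiter) {
                    if (!Character.isWhitespace(line.charAt(j))) {
                        return OptionalInt.of(j);
                    }
                    j++;
                }
                i = j; // now at the delimiter or at the end of the line
                atFieldStart = true;
            } else if (c == delimiter) {
                atFieldStart = true;
            } else if (c == quote && atFieldStart) {
                inQuotes = true; // a quote only opens a field at the field start
                atFieldStart = false;
            } else {
                atFieldStart = false;
            }
        }
        return OptionalInt.empty();
    }
}
```

Running this over each input line before parsing would let a log message name the offending column, for example for a line such as `"abc"x,def`, where the `x` directly after the closing quote is the stray character.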

Expected behavior
There are a few things we expect:

  • Anonimatron should correctly handle file encodings.
  • If the file encoding is correct but the content is malformed, the error should be more informative about the problem.
  • When writing to the output file stops because of an error, the output file should be deleted and a non-zero exit code should indicate that anonymization failed.
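
As an illustration of the first and last points, here is a minimal sketch using only the JDK. It decodes the input strictly, so undecodable bytes raise an exception instead of being silently replaced, and it writes to a temporary file that only replaces the real output on success. `SafeFileProcessor` and `process` are hypothetical names, the per-line copy is a placeholder for the real anonymization step, and UTF-8 in `main` is an assumption about the input encoding. This is not how Anonimatron currently works.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

/** Hypothetical sketch; not Anonimatron's actual implementation. */
final class SafeFileProcessor {

    static void process(Path input, Path output, Charset charset) throws IOException {
        // REPORT makes the decoder throw on bytes that are invalid in the charset.
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);

        // Write next to the final output so the move stays on one file system.
        Path dir = output.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, output.getFileName().toString(), ".tmp");
        try {
            try (BufferedReader reader = new BufferedReader(
                         new InputStreamReader(Files.newInputStream(input), decoder));
                 BufferedWriter writer = Files.newBufferedWriter(temp, charset)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    writer.write(line); // placeholder for per-record anonymization
                    writer.newLine();
                }
            }
            // Only a fully written file ever appears under the final name.
            Files.move(temp, output, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException | RuntimeException e) {
            Files.deleteIfExists(temp); // never leave a partial file behind
            throw e;
        }
    }

    public static void main(String[] args) {
        try {
            process(Paths.get(args[0]), Paths.get(args[1]), StandardCharsets.UTF_8);
        } catch (IOException | RuntimeException e) {
            System.err.println("Anonymization failed: " + e);
            System.exit(1); // non-zero exit code signals failure to the caller
        }
    }
}
```

With this shape, a failure halfway through a 50 MB input would leave no file under the output name, and a calling script could detect the failure from the exit code instead of inspecting file sizes.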

Logs, screenshots

Error in log while processing file 5xxxxxxxxxx0000105353
Anonymizing from /efs/ccv/backup/5/5_xxxxxxx0000105353.20240425T040655Z.361E5E026D6F0A07FC611B35C2FEF093.complete
              to /efs/ccv/anonymized/5_xxxxxxx0000105353.20240425T040655Z.361E5E026D6F0A07FC611B35C2FEF093.complete
Exception in thread "main" java.io.UncheckedIOException: IOException reading next record: java.io.IOException: (line 1) invalid char between encapsulated token and delimiter
        at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:150)
        at org.apache.commons.csv.CSVParser$CSVRecordIterator.hasNext(CSVParser.java:160)
        at java.base/java.util.Iterator.forEachRemaining(Iterator.java:132)
        at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1845)
        at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
        at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
        at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
        at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
        at org.apache.commons.csv.CSVParser.getRecords(CSVParser.java:650)
        at xxx.anonymize.csv.CSVReader.parseLine(CSVReader.java:85)
        at xxx.anonymize.csv.CSVReader.read(CSVReader.java:46)
        at com.rolfje.anonimatron.file.FileAnonymizerService.anonymize(FileAnonymizerService.java:183)
        at com.rolfje.anonimatron.file.FileAnonymizerService.anonymize(FileAnonymizerService.java:87)
        at com.rolfje.anonimatron.Anonimatron.anonymize(Anonimatron.java:103)
        at com.rolfje.anonimatron.Anonimatron.main(Anonimatron.java:67)
Caused by: java.io.IOException: (line 1) invalid char between encapsulated token and delimiter
        at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:369)
        at org.apache.commons.csv.Lexer.nextToken(Lexer.java:290)
        at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:770)
        at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:148)
        ... 15 more

Further details were sent by personal mail because they may contain sensitive data or customer information.

Desktop (please complete the following information):

  • OS: CentOS
  • Java version: OpenJDK 17
  • Anonimatron v1.15

Additional context
Please let us know when this problem is fixed, so that we can apply the fix and re-process the incomplete files.
