Possible encoding issue when processing CSV files #235

realrolfje · 2024-09-03T07:12:51Z

Describe the bug
In a production situation processing large amounts of CSV records, we sometimes see an IO exception with the text: IOException reading next record: java.io.IOException: (line 1) invalid char between encapsulated token and delimiter.

When this happens, Anonimatron stops writing to the output file, leaving it incomplete (the input file is approx. 50MB, the output file is left at 2.2MB).

To Reproduce
Read CSV files from customers. We are not yet sure what exactly causes this; it happens irregularly.
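To help narrow down whether encoding is involved at all, a suspect input file could be checked with a strict decoder. The following is only a minimal sketch: the class and method names are hypothetical and not part of Anonimatron, and it assumes the files are supposed to be UTF-8. Note that a single-byte charset such as ISO-8859-1 accepts any byte sequence, so the check is only meaningful for charsets like UTF-8.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical diagnostic helper, not part of Anonimatron.
class EncodingCheck {

    /** Returns true if the whole file decodes without errors in the given charset. */
    static boolean decodesCleanly(Path file, Charset charset) throws IOException {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try (InputStream in = Files.newInputStream(file);
             Reader reader = new InputStreamReader(in, decoder)) {
            char[] buffer = new char[8192];
            while (reader.read(buffer) != -1) {
                // Only decoding matters here; the characters are discarded.
            }
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

If a file that triggers the exception passes this check for the expected charset, the cause is more likely in the CSV contents than in the encoding.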

Expected behavior
There are a few things we expect:

Anonimatron should correctly handle file encodings.
If the file encoding is correct but the contents are broken, the error should be more informative about the problem.
When writing to the output file stops because of an error, the output file should be deleted and the exit code should indicate that anonymization failed.
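The last expectation could be met by writing to a temporary file and only moving it into place on success. This is a minimal sketch, not the current Anonimatron implementation: SafeOutput, RecordWriter, writeAllOrNothing and run are hypothetical names, and UTF-8 output is an assumption.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Anonimatron.
class SafeOutput {

    /** Hypothetical callback that writes the anonymized records. */
    interface RecordWriter {
        void writeTo(BufferedWriter out) throws IOException;
    }

    /** Writes to a temporary file next to the target and moves it into place only on success. */
    static void writeAllOrNothing(Path target, RecordWriter body) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, target.getFileName().toString(), ".partial");
        try {
            try (BufferedWriter out = Files.newBufferedWriter(temp, StandardCharsets.UTF_8)) {
                body.writeTo(out);
            }
            Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException | RuntimeException e) {
            // The parser in the trace throws UncheckedIOException, so both kinds are handled.
            Files.deleteIfExists(temp);
            throw e;
        }
    }

    /** Returns a process exit code: 0 on success, 1 if anonymization failed. */
    static int run(Path target, RecordWriter body) {
        try {
            writeAllOrNothing(target, body);
            return 0;
        } catch (IOException | RuntimeException e) {
            System.err.println("Anonymization failed for " + target + ": " + e);
            return 1;
        }
    }
}
```

A main method would pass the result of run to System.exit, so that calling scripts can detect the failure and no partial file is left at the target path.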

Logs, screenshots

Error in log during file 5xxxxxxxxxx0000105353
Anonymizing from /efs/ccv/backup/5/5_xxxxxxx0000105353.20240425T040655Z.361E5E026D6F0A07FC611B35C2FEF093.complete
              to /efs/ccv/anonymized/5_xxxxxxx0000105353.20240425T040655Z.361E5E026D6F0A07FC611B35C2FEF093.complete
Exception in thread "main" java.io.UncheckedIOException: IOException reading next record: java.io.IOException: (line 1) invalid char between encapsulated token and delimiter
        at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:150)
        at org.apache.commons.csv.CSVParser$CSVRecordIterator.hasNext(CSVParser.java:160)
        at java.base/java.util.Iterator.forEachRemaining(Iterator.java:132)
        at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1845)
        at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
        at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
        at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
        at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
        at org.apache.commons.csv.CSVParser.getRecords(CSVParser.java:650)
        at xxx.anonymize.csv.CSVReader.parseLine(CSVReader.java:85)
        at xxx.anonymize.csv.CSVReader.read(CSVReader.java:46)
        at com.rolfje.anonimatron.file.FileAnonymizerService.anonymize(FileAnonymizerService.java:183)
        at com.rolfje.anonimatron.file.FileAnonymizerService.anonymize(FileAnonymizerService.java:87)
        at com.rolfje.anonimatron.Anonimatron.anonymize(Anonimatron.java:103)
        at com.rolfje.anonimatron.Anonimatron.main(Anonimatron.java:67)
Caused by: java.io.IOException: (line 1) invalid char between encapsulated token and delimiter
        at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:369)
        at org.apache.commons.csv.Lexer.nextToken(Lexer.java:290)
        at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:770)
        at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:148)
        ... 15 more
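
As far as we understand, Commons CSV raises this message when the closing quote of a quoted field is followed by a character that is neither the delimiter, whitespace, nor a line end. A more informative error could point at that character. The following is a rough, line-based sketch under that assumption: QuoteCheck and firstStrayCharAfterQuote are hypothetical names, and quoted fields that span multiple lines are not handled.

```java
// Hypothetical diagnostic helper, not part of Anonimatron or Commons CSV.
class QuoteCheck {

    /**
     * Returns the 0-based index of the first non-whitespace character that
     * follows a closing quote before the next delimiter, or -1 if there is none.
     */
    static int firstStrayCharAfterQuote(String line, char delimiter, char quote) {
        boolean inQuotes = false;
        boolean atFieldStart = true;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c != quote) {
                    continue;
                }
                if (i + 1 < line.length() && line.charAt(i + 1) == quote) {
                    i++; // a doubled quote is an escaped quote inside the field
                    continue;
                }
                // Closing quote found: everything up to the next delimiter must be whitespace.
                inQuotes = false;
                int j = i + 1;
                while (j < line.length() && line.charAt(j) != delimiter) {
                    if (!Character.isWhitespace(line.charAt(j))) {
                        return j;
                    }
                    j++;
                }
                i = j; // now at the delimiter or at the end of the line
                atFieldStart = true;
            } else if (c == delimiter) {
                atFieldStart = true;
            } else if (c == quote && atFieldStart) {
                inQuotes = true;
                atFieldStart = false;
            } else {
                atFieldStart = false;
            }
        }
        return -1;
    }
}
```

Reporting this index together with the record number would make it much easier to locate the broken record in a 50MB file.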

Further details are in personal mail because of possibly sensitive data or customer information.

Desktop (please complete the following information):

OS: CentOS
Java version OpenJDK 1.17
Anonimatron v1.15

Additional context
Please let us know when this problem is fixed, so that we can (fix and) re-process the incomplete files.
