Cvs n rows issue #337 #51

Open · wants to merge 9 commits into master
Conversation

@EIjo commented Jul 18, 2024

This pull request resolves issue #337 and adds a row count to the csv report.
It does so by using the scanned number of lines as the number of rows. Furthermore, I have edited the test csv scan report to include the row count as well.

@janblom commented Aug 6, 2024

I think this problem actually has a bit more to it than the label "good first issue" suggests.

  • in case of a database, rowCount is set to the total number of records in the table (this can be a cheap/fast operation, depending on the database type);
  • setting rowCount to the number of lines actually processed in case of a csv file only means the same as above when there are fewer lines in the csv than the sample size; otherwise it is a number the user of WR already knows (the sample size);
  • the only way to get the total number of lines right in case of a csv file is by counting them (as the issue suggests by referring to the unix wc command), but this could cause a long runtime for very large files; a cleverer strategy for approximating the actual number might be wiser.

@EIjo commented Aug 6, 2024

If I understand the workings of processCsvFile correctly, it iterates over all the lines in the file using the iterator of the ReadTextFile class. ReadTextFile uses a BufferedReader and will read all lines in a file, which, if counted, should come out to the same value as the unix wc command would give. To account for the header in the .csv file, I decrement the counted rows by one. Additionally, since the current implementation reuses an existing for loop, (almost) no additional computational cost is incurred.

The only ways to break out of the for loop over all lines are when no scanValues are set (in which case nothing would be scanned anyway) or when an upper bound for the sampleSize is set and lineNr is greater than sampleSize. If sampleSize has been set, the N rows column should always show the upper bound value, whereas N rows checked could be lower if any rows contain a formatting error.
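
A minimal sketch of the counting approach described above, using a plain BufferedReader as a stand-in for WhiteRabbit's ReadTextFile iterator (the class and method names here are illustrative, not the actual processCsvFile code):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class CsvLineCounter {

    // Counts lines roughly the way `wc -l` does, then subtracts one for
    // the csv header. The sampleSize cap mirrors the break condition in
    // the scan loop: counting stops once the sample size is exceeded.
    public static long countDataRows(String csvPath, long sampleSize) throws IOException {
        long lineNr = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(csvPath))) {
            while (reader.readLine() != null) {
                lineNr++;
                if (sampleSize > 0 && lineNr > sampleSize) {
                    break; // an upper bound is set: stop scanning
                }
            }
        }
        return lineNr - 1; // exclude the header line
    }
}
```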

I also tested the functionality for the cases of formatting errors and trailing newlines.

Formatting error

If one row contains a formatting error, the expected behavior is that the N rows column shows one additional row compared to N rows checked. In this case, the value of N rows is equal to that of the unix wc command minus one.
(screenshots of the scan report omitted)
wc person-header.csv > 31 31 2148 person-header.csv

Trailing newlines

In the case of trailing newlines, the expected behavior is that the N rows column shows additional rows, equal to the number of trailing newlines, compared to N rows checked. In this case too, the value of N rows is equal to that of the unix wc command minus one.
(screenshots of the scan report omitted)
wc person-header.csv > 33 31 2195 person-header.csv

I hope this clarifies my way of implementing this feature; as far as I have tested it, its output has been equal to that of the unix wc command minus one. However, if I have overlooked anything, please let me know.

@janblom commented Aug 7, 2024

What happens when sampleSize is set to a positive value and the csv has more lines than that value (a likely scenario)? By default, sampleSize is set to 100,000 in the GUI ("Rows per table" in the Scan tab), and also in the example config files.

Unless I am missing something, the proposed implementation would lead to inconsistencies in the "N rows" column for csv files where the actual number of lines exceeds the sample size: it would no longer reflect the actual number of lines (rows) in the file.

@EIjo commented Aug 7, 2024

Ah yes, that would indeed be confusing. I also don't see a way to count the number of lines without incurring a performance cost when a sampleSize is set (I assume that if sampleSize is set, you do not want this additional performance cost). Would it still be a worthwhile feature to add N rows in the case where no sampleSize is set, as you could then see whether any rows are incorrectly formatted? Or would this still be redundant, as you would expect N rows to be equal to N rows checked if all rows are formatted correctly?

@janblom commented Aug 7, 2024

What I am thinking of, though we would need to discuss this more extensively for feasibility:

  • measure the file size (a fast operation)
  • if the file size is below a threshold (TBD), count all lines
  • if the file size is above the threshold, estimate the average line size (we need to define a strategy for that), and estimate the number of lines as (file size divided by average line size); see the sketch after this comment
  • update the documentation to emphasize that, in the case of large text files, it is an estimate

Part of the reason I am considering this is that in the case of SQL databases, we now also do a count, which could also be slow. However, there are "cheap" ways to obtain an estimate from (some) SQL databases as well. Giving a rough estimate aligns better with the intent of the WhiteRabbit scan: a "quick" survey of what the data looks like.
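
For concreteness, a rough sketch of the estimation strategy outlined above; the size threshold and the number of sampled lines are placeholder values, not decisions made in this thread:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class RowCountEstimator {

    private static final long SIZE_THRESHOLD = 100L * 1024 * 1024; // 100 MB, TBD
    private static final int LINES_TO_SAMPLE = 1_000;              // placeholder

    public static long estimateRows(Path csvFile) throws IOException {
        long fileSize = Files.size(csvFile); // fast: reads metadata only
        if (fileSize <= SIZE_THRESHOLD) {
            // Small file: count all lines exactly.
            try (Stream<String> lines = Files.lines(csvFile)) {
                return lines.count();
            }
        }
        // Large file: average the byte length of the first lines and
        // extrapolate; the +1 accounts for the line separator.
        long sampledBytes = 0;
        int sampledLines = 0;
        try (BufferedReader reader = Files.newBufferedReader(csvFile)) {
            String line;
            while (sampledLines < LINES_TO_SAMPLE && (line = reader.readLine()) != null) {
                sampledBytes += line.getBytes().length + 1;
                sampledLines++;
            }
        }
        if (sampledLines == 0) {
            return 0; // empty file
        }
        double avgLineSize = (double) sampledBytes / sampledLines;
        return Math.round(fileSize / avgLineSize);
    }
}
```

Multi-byte line separators and charset width mean the extrapolated figure is only approximate, which is in line with the documentation note above.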
