Cvs n rows issue #337 #51

Open · wants to merge 9 commits into master
Conversation

@EIjo commented Jul 18, 2024

This pull request resolves issue #337 and adds a row count to the csv report.
It does so by using the scanned number of lines as the number of rows. Furthermore, I have edited the test csv scan report to include the row count as well.

@janblom commented Aug 6, 2024

I think this problem actually has a bit more to it than the label "good first issue" suggests.

  • in case of a database, rowCount is set to the total number of records in the table (this can be a cheap/fast operation, depending on the database type);
  • setting rowCount to the number of lines actually processed in case of a csv file only means the same as above when there are fewer lines in the csv than the sample size; otherwise it is a number the user of WR already knows (the sample size);
  • the only way to get the total number of lines right in case of a csv file is by counting them (as the issue suggests by referring to the unix wc command), but this could cause a long runtime for very large files; a cleverer strategy for approximating the actual number might be wiser.

@EIjo commented Aug 6, 2024

If I understand the workings of processCsvFile correctly, it iterates over all the lines in the file using the iterator of the ReadTextFile class. ReadTextFile uses a BufferedReader and will read all lines in a file, which, if counted, should come out to the same value as the unix wc command would give. To account for the header in the .csv file, I decrement the counted rows by one. Additionally, since the current implementation reuses an existing for loop, (almost) no additional computational cost is incurred.

The only ways to break out of the for loop over all lines are when no scanValues are set (in which case nothing would be scanned anyway) or when an upper bound for the sampleSize is set and lineNr is greater than sampleSize. If sampleSize has been set, the N rows column should always show the upper bound value, whereas N rows checked could be lower if any rows contain a formatting error.
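
A minimal sketch of the counting approach described above, using a plain BufferedReader as a stand-in for WhiteRabbit's ReadTextFile iterator (the class and method names here are illustrative, not the actual processCsvFile code):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class CsvLineCounter {

    // Counts lines roughly the way `wc -l` does, then subtracts one for
    // the csv header. The sampleSize cap mirrors the break condition in
    // the scan loop: counting stops once the sample size is exceeded.
    public static long countDataRows(String csvPath, long sampleSize) throws IOException {
        long lineNr = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(csvPath))) {
            while (reader.readLine() != null) {
                lineNr++;
                if (sampleSize > 0 && lineNr > sampleSize) {
                    break; // an upper bound is set: stop scanning
                }
            }
        }
        return lineNr - 1; // exclude the header line
    }
}
```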

I also tested the functionality for the cases of formatting errors and trailing newlines.

Formatting error

If one row contains a formatting error, the expected behavior is that the N rows column shows one additional row compared to N rows checked. In this case, the value of N rows is equal to that of the unix wc command minus one.
(screenshots of the scan report omitted)
wc person-header.csv > 31 31 2148 person-header.csv

Trailing newlines

In the case of trailing newlines, the expected behavior is that the N rows column shows additional rows, equal to the number of trailing newlines, compared to N rows checked. In this case too, the value of N rows is equal to that of the unix wc command minus one.
(screenshots of the scan report omitted)
wc person-header.csv > 33 31 2195 person-header.csv

I hope this clarifies my way of implementing this feature; as far as I have tested it, its output has been equal to that of the unix wc command minus one. However, if I have overlooked anything, please let me know.

@janblom commented Aug 7, 2024

What happens when sampleSize is set to a positive value and the csv has more lines than that value (a likely scenario)? By default, sampleSize is set to 100,000 in the GUI ("Rows per table" in the Scan tab), and also in the example config files.

Unless I am missing something, the proposed implementation would lead to inconsistencies in the "N rows" column for csv files where the actual number of lines exceeds the sample size: it would no longer reflect the actual number of lines (rows) in the file.

@EIjo commented Aug 7, 2024

Ah yes, that would indeed be confusing. I also don't see a way to count the number of lines without incurring a performance cost when a sampleSize is set (I assume that if sampleSize is set, you do not want this additional performance cost). Would it still be a worthwhile feature to add N rows in the case where no sampleSize is set, as you could then see whether any rows are incorrectly formatted? Or would this still be redundant, as you would expect N rows to be equal to N rows checked if all rows are formatted correctly?

@janblom commented Aug 7, 2024

What I am thinking of, though we would need to discuss this more extensively for feasibility:

  • measure the file size (a fast operation)
  • if the file size is below a threshold (TBD), count all lines
  • if the file size is above the threshold, estimate the average line size (we need to define a strategy for that), and estimate the number of lines as (file size divided by average line size); see the sketch after this comment
  • update the documentation to emphasize that, in the case of large text files, it is an estimate

Part of the reason I am considering this is that in the case of SQL databases, we now also do a count, which could also be slow. However, there are "cheap" ways to obtain an estimate from (some) SQL databases as well. Giving a rough estimate aligns better with the intent of the WhiteRabbit scan: a "quick" survey of what the data looks like.
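
For concreteness, a rough sketch of the estimation strategy outlined above; the size threshold and the number of sampled lines are placeholder values, not decisions made in this thread:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class RowCountEstimator {

    private static final long SIZE_THRESHOLD = 100L * 1024 * 1024; // 100 MB, TBD
    private static final int LINES_TO_SAMPLE = 1_000;              // placeholder

    public static long estimateRows(Path csvFile) throws IOException {
        long fileSize = Files.size(csvFile); // fast: reads metadata only
        if (fileSize <= SIZE_THRESHOLD) {
            // Small file: count all lines exactly.
            try (Stream<String> lines = Files.lines(csvFile)) {
                return lines.count();
            }
        }
        // Large file: average the byte length of the first lines and
        // extrapolate; the +1 accounts for the line separator.
        long sampledBytes = 0;
        int sampledLines = 0;
        try (BufferedReader reader = Files.newBufferedReader(csvFile)) {
            String line;
            while (sampledLines < LINES_TO_SAMPLE && (line = reader.readLine()) != null) {
                sampledBytes += line.getBytes().length + 1;
                sampledLines++;
            }
        }
        if (sampledLines == 0) {
            return 0; // empty file
        }
        double avgLineSize = (double) sampledBytes / sampledLines;
        return Math.round(fileSize / avgLineSize);
    }
}
```

Multi-byte line separators and charset width mean the extrapolated figure is only approximate, which is in line with the documentation note above.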
