-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Memory mapping #87
Memory mapping #87
Conversation
Codecov Report
@@ Coverage Diff @@
## master #87 +/- ##
==========================================
+ Coverage 67.54% 74.56% +7.02%
==========================================
Files 9 10 +1
Lines 1103 1152 +49
==========================================
+ Hits 745 859 +114
+ Misses 358 293 -65
Continue to review full report at Codecov.
|
@shashi This should be ready for merging. I reran https://github.com/davidanthoff/csv-comparison on this branch vs master, and most things get a little bit better (but not much). There is a small regression for date time, but it all looks more like noise to me, I think in terms of performance this is mostly a wash. BUT, I also tried it in combination with #78 and a CSV file that is way larger than main memory. And in particular, I read only one column from a 50GB file on a 16GB main memory machine. It all works beautifully. So just for that I think it is worth merging this. @andreasnoack just pinging you here as well because @shashi mentioned that you at some point also worked on the memory mapped story of this package. |
Looks good to me. Although I still think we should have tried to use weakrefstring instead of a new type. I'm ok with merging this. |
This is very much WIP. Essentially the strategy is to have a small wrapper around the
Vector{UInt8}
that is the memory map that implements enough of theString
interface so that the parsing from that works.I also tried to do this with
WeakRefString
(by pinning the buffer withGC.@preserve
), but that consistently segfaulted during tests...I also think that this is actually not the ideal use-case for
WeakRefString
: the main reason for that package is that one can stack-allocate that type, but we don't need that at all. In fact, in this case here, nothing unsafe, or weak has to happen. So my sense is that using a different type for the string here, that is much safer to use, is a better option.There is another reason I think a custom string type makes sense here: I think one strategy for eventually implementing #24 is to just have another thin wrapper like the one here that does the same thing, but for a different encoding. So in my mind, these small string wrappers might be able to solve that issue eventually as well.
This PR does pass all tests right now, and when I run https://github.com/davidanthoff/csv-comparison, I get another small increase in performance (I think, I might have mixed up baselines, but I am certain that it is either neutral or an improvement).
Things still todo: