Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory mapping #87

Merged
merged 5 commits into from
Dec 15, 2018
Merged

Memory mapping #87

merged 5 commits into from
Dec 15, 2018

Conversation

davidanthoff
Copy link
Member

@davidanthoff davidanthoff commented Nov 29, 2018

This is very much WIP. Essentially the strategy is to have a small wrapper around the Vector{UInt8} that is the memory map that implements enough of the String interface so that the parsing from that works.

I also tried to do this with WeakRefString (by pinning the buffer with GC.@preserve), but that consistently segfaulted during tests...

I also think that this is actually not the ideal use-case for WeakRefString: the main reason for that package is that one can stack-allocate that type, but we don't need that at all. In fact, in this case here, nothing unsafe, or weak has to happen. So my sense is that using a different type for the string here, that is much safer to use, is a better option.

There is another reason I think a custom string type makes sense here: I think one strategy for eventually implementing #24 is to just have another thin wrapper like the one here that does the same thing, but for a different encoding. So in my mind, these small string wrappers might be able to solve that issue eventually as well.

This PR does pass all tests right now, and when I run https://github.com/davidanthoff/csv-comparison, I get another small increase in performance (I think, I might have mixed up baselines, but I am certain that it is either neutral or an improvement).

Things still todo:

  • find a better name for the type
  • write some more tests
  • rerun performance comparison

@codecov-io
Copy link

codecov-io commented Nov 29, 2018

Codecov Report

Merging #87 into master will increase coverage by 7.02%.
The diff coverage is 54.9%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #87      +/-   ##
==========================================
+ Coverage   67.54%   74.56%   +7.02%     
==========================================
  Files           9       10       +1     
  Lines        1103     1152      +49     
==========================================
+ Hits          745      859     +114     
+ Misses        358      293      -65
Impacted Files Coverage Δ
src/utf8optimizations.jl 76.31% <ø> (+21.05%) ⬆️
src/field.jl 78.87% <100%> (+8.25%) ⬆️
src/csv.jl 78.4% <50%> (+6.67%) ⬆️
src/VectorBackedStrings.jl 54.34% <54.34%> (ø)
src/util.jl 65.38% <0%> (+0.76%) ⬆️
src/guesstype.jl 84.48% <0%> (+1.72%) ⬆️
src/record.jl 74.35% <0%> (+6.82%) ⬆️
src/lib/result.jl 42.85% <0%> (+7.14%) ⬆️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a46e4b3...c5ad2e6. Read the comment docs.

@davidanthoff davidanthoff changed the title WIP Memory mapping Memory mapping Dec 14, 2018
@davidanthoff
Copy link
Member Author

davidanthoff commented Dec 14, 2018

@shashi This should be ready for merging.

I reran https://github.com/davidanthoff/csv-comparison on this branch vs master, and most things get a little bit better (but not much). There is a small regression for date time, but it all looks more like noise to me, I think in terms of performance this is mostly a wash.

BUT, I also tried it in combination with #78 and a CSV file that is way larger than main memory. And in particular, I read only one column from a 50GB file on a 16GB main memory machine. It all works beautifully. So just for that I think it is worth merging this.

@andreasnoack just pinging you here as well because @shashi mentioned that you at some point also worked on the memory mapped story of this package.

@shashi
Copy link
Collaborator

shashi commented Dec 15, 2018

Looks good to me. Although I still think we should have tried to use weakrefstring instead of a new type. I'm ok with merging this.

@davidanthoff davidanthoff merged commit 74682a9 into queryverse:master Dec 15, 2018
@davidanthoff davidanthoff deleted the mmap2 branch December 15, 2018 05:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants