Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Constant space sort #61

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

bgamari
Copy link
Contributor

@bgamari bgamari commented May 17, 2020

This introduces an interface providing constant-space sorting of eventlogs via on-disk merge-sort.
I have confirmed that the implementation indeed maintains constant-space behavior, requiring about 79s seconds to sort a 200MB eventlog while not exceeding 40 MBytes residency (using a chunk size of 1e5 events).

This is in contrast to the old in-memory codepath which requires 35 seconds and nearly 5GBytes of residency to sort this same eventlog.

If I increase the chunk size by an order of magnitude (to 1e6 events) then the constant-space sort has essentially the same runtime as the in-memory codepath but runs in merely 300MBytes.

Note: In testing this I realized that several of the encoding paths treated strings incorrectly. Consequently, this depends upon (and contains commits from) #62.

To-do

  • add a test
  • ensure that temporary files are cleaned up on failure
  • documentation
  • optimise throughput
  • sort out how this should be exposed in the interface; should it supercede the existing sortEvents?
  • split out string encoding fixes into new MR
  • decide on a sensible chunk size (decided on 5e5 as a reasonable compromise between memory and throughput)

Previously we used the Binary instance for Text to serialise the event
name. This is wrong.

We now first encode to UTF-8 and use this in the eventlog encoding.
@bgamari bgamari force-pushed the constant-space-sort branch 3 times, most recently from 73ef6f3 to 3cc5c14 Compare May 17, 2020 01:24
@bgamari bgamari mentioned this pull request May 17, 2020
@facundominguez
Copy link

facundominguez commented May 18, 2020

Is it necessary to sort all the events as if they were completely unordered? I was assuming, at the time I wrote my patch, that the events of each capability would come in order within each capability, so all that was necessary was merging the capability streams back again.

Perhaps there is some inconvenience in splitting the capability streams that I'm missing?

Copy link
Member

@maoe maoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have much time for today so I'll take a look again tomorrow.

src/GHC/RTS/Events/Sort.hs Show resolved Hide resolved
@maoe
Copy link
Member

maoe commented May 26, 2020

I think facundominguez's point is valid. Also it seems that #14 gets in the way if we're going to replace the existing sortEvents with this constant space sorting.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants