Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bounded memory sorting #32

Open
facundominguez opened this issue Aug 17, 2017 · 7 comments
Open

Bounded memory sorting #32

facundominguez opened this issue Aug 17, 2017 · 7 comments

Comments

@facundominguez
Copy link

facundominguez commented Aug 17, 2017

While writing an analysis tool, the need for sorting the events in an eventlog file arose. The problem with the existing GHC.RTS.Events.sortEvents is that it is not memory bounded, leading to heap exhaustion on large inputs.

How about we provide a memory bounded version?

I ended up writing something similar to this. It splits events per capability to different temporary files, and then merges the events back into a single list.

I could submit a PR if it looks useful to others.

This version is a bit wasteful in that it has to deserialize and serialize all the events, where for the sake of splitting the events, only the capability number and the encoded form of the event would be needed.

Another odd aspect is that I couldn't make it stream when using Data.ByteString.Builder to write the events to the temporary files. Heap would be exhausted the same.

@maoe
Copy link
Member

maoe commented Sep 4, 2017

Sorry for the delayed response.

I haven't looked into the patch yet but in general I think external sorting is a useful addition to the API.

@osa1
Copy link

osa1 commented Aug 17, 2018

I recently had a similar problem. Loading a 1.3G eventlog uses more than 32G memory. Profiling output shows that most allocations are done by sortEvents and EventParserUtils.getString.

@bgamari
Copy link
Contributor

bgamari commented May 16, 2020

Indeed, I came here to suggest precisely @facundominguez's patch. Sorting via secondary storage would be great; I routinely run into issues like this when processing eventlogs.

@bgamari
Copy link
Contributor

bgamari commented May 17, 2020

This is addressed by #61.

@TeofilC
Copy link
Contributor

TeofilC commented Apr 27, 2022

I feel like it should be possible to read an eventlog in sorted order with constant memory overhead and without an intermediate file.
You'd have a cursor into the file for each capability. Each of those cursors would ignore blocks that belong to another capability (by looking at the block marker, which tells us the size of the block). Then you can get a sorted eventlog by just keeping the cursors sorted by timestamps and always taking the minimum, as events are sorted within a block. This would avoid the need for an intermediate file.
In essence, this is the same algorithm as merging N sorted linked-lists.

Does that sound reasonable? If so, I can try to implement it at some point when I have time

@Mikolaj
Copy link
Member

Mikolaj commented Apr 27, 2022

Hi @TeofilC. That was a long time ago, so perhaps it's changed, but I remember fixing bugs in Threadscope that came from assuming evens for a single capability come sorted. If they still don't, I think the perturbation is probably minimal, one or two positions, but it's worth checking on a few different computers (slow, fast, under load, etc.) and, if needed. be prepared to react. OTOH, perhaps memory deceives me and it was events for threads, not HECs, coming out of order or events that are expected not arriving at all.

@TeofilC
Copy link
Contributor

TeofilC commented Aug 15, 2023

I never got around to working on this. Since then my thinking has shifted. I'm not really sure if eventlogs need to be sorted at all.

I think most use cases are better solved by having some sort of index of how the different Capabilities' streams are interleaved. This can be produced without having to modify the eventlog at all and allow us to efficiently read the events in sorted order.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants