Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Very simple LRU-based garbage collector (GC) #144

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

nicelhc13
Copy link
Contributor

This PR invokes an LRU-based GC for each task execution. Cholesky on various setting worked correctly. This PR is not optimized well performance yet, but aims to implement a baseline. This PR includes possible bug fixes on the runtime.

@@ -1479,6 +1531,8 @@ def __init__(self, environments: Collection[TaskEnvironment], n_threads: Optiona
self._device_launched_datamove_task_counts = {
dev: 0 for dev in self._available_resources.get_resources()}

self._lrum = LRUManager()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just looking through this bc we'll be looking at it again soon and been thinking about memory counting.
It should be a separate LRUManager for each device object (memory space), shouldn't it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends on the design. My original idea was to have a single GC (not necessarily now but I thouhgt a daemon like a scheduler thread) and manage all memory regardless of its location. It might be simpler than a GC per a device? Could you please let me know your idea and its pros?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason we would call into the EvictionPolicy is when a specific device is full and we want to bring more data onto it. This would look at only data on that device to start evicting (which means it needs a device specific hash-list structure for LRU). I am not sure how this would be handled by a global one for all devices without evicting data from other devices unnecessarily.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also (just terminology), specifically I'm using "GC" for cleaning up data that will not be used in the future and "Eviction" for kicking out any data (possibly data that will be used in the future) due to space limitations on a device.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am on board with you. But I think there is a different way to implement what you said. What I meant was the global memory manager can still manage the whole devices. For example, as you said, we maintain a list of arrays for each device and the global one can traverse and evicts/dellocates subset of them to get more memory. Or we can have a dedicated memory manager thread for each device and let each of them manage the owned device parrays. I think this is implementation detail?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants