
Initialize Array::m_executeOnGPU #1434

Merged: 7 commits merged into develop from bugfix/whitlock/array_initialization on Oct 2, 2024

Conversation

BradWhitlock (Member)

Summary

Lack of initialization of Array::m_executeOnGPU played a role in causing host-initialization of device-allocated memory in certain situations. I was not able to create a reproducer on the develop branch, but these changes eliminate warnings about uninitialized memory in axom::Array under Valgrind and fix the crashes on my MIR branch when using CUDA-allocated memory.

- int m_allocator_id;
- bool m_executeOnGPU;
+ int m_allocator_id = INVALID_ALLOCATOR_ID;
+ bool m_executeOnGPU {false};
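For context, a minimal sketch of the failure mode these defaults prevent (illustrative only; the class below is not the real axom::Array). A member declared without an initializer holds an indeterminate value until a constructor assigns it, and reading it before then is undefined behavior, which is what Valgrind reported:

struct ArrayLike
{
  ArrayLike() { }  // a constructor path that forgets to assign the flag

  // Before the fix: indeterminate value, so the read below is UB.
  //   bool m_executeOnGPU;
  // After the fix: guaranteed to start as false.
  bool m_executeOnGPU {false};
};

void initialize(ArrayLike& a)
{
  if(a.m_executeOnGPU)
  {
    // with the uninitialized member, this could randomly take the
    // device path for host memory (or vice versa)
  }
}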
Member

Why do we use different variable initialization syntax? Do we have a convention that we follow?

BradWhitlock (Author)

I'm not aware of a convention, but point taken. This file uses assignment-style initialization for the other members, so I'll change the bool to match.

Member

@publixsubfan -- I think in the past, you had a reason to prefer the curly brace initialization syntax. Do you recall what it was?

Once this is resolved, it might be good to update our coding conventions w/ a preference. At the very least, we should be consistent w/in a single file.
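For context, one commonly cited advantage of brace initialization (not confirmed as the reason in this thread) is that it rejects implicit narrowing conversions at compile time:

int a = 3.7;     // compiles; silently truncates to 3
int b {3.7};     // error: narrowing conversion from 'double' to 'int'
bool c {false};  // fine; no conversion involved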

@kennyweiss (Member) left a comment

Thanks @BradWhitlock -- if possible, could you please add a regression test that triggers the behavior you observed?

Please also update the RELEASE_NOTES w/ the bugfix (and add a new "Unreleased" section if it's not already there).

@@ -1164,6 +1170,8 @@ Array<T, DIM, SPACE>::Array(Array&& other) noexcept
  m_capacity = other.m_capacity;
  m_resize_ratio = other.m_resize_ratio;
  m_allocator_id = other.m_allocator_id;
  m_executeOnGPU = axom::detail::getAllocatorSpace(m_allocator_id) ==
    axom::MemorySpace::Device;
@kennyweiss (Member) commented on Oct 1, 2024

Suggestion: This is a one-liner, but it is somewhat tricky to get right and perhaps subject to change. Should it be a function, either here or in axom/core/memory_management?

BradWhitlock (Author)

Hi Kenny, good idea. I'm doing that now. I was running into trouble on CI anyway, since MemorySpace::Device only exists when Umpire is enabled.

BradWhitlock (Author)

As for a reproducer, I could not get it to happen with the test I wrote (so I junked it). The fixes do clear up my problems with CUDA on my MIR branch, though.

Member

From CI:

2024-10-01T19:41:44.0228174Z /home/axom/axom/src/axom/core/../../axom/core/Array.hpp:293:28: error: no member named 'Device' in 'axom::MemorySpace'
2024-10-01T19:41:44.0244447Z         axom::MemorySpace::Device;
2024-10-01T19:41:44.0245077Z         ~~~~~~~~~~~~~~~~~~~^

https://dev.azure.com/healy22/d4445b8f-6e95-49cd-9e5b-8c8ca1ccad31/_apis/build/builds/10807/logs/58

More reason to move this to a function. Device is only defined when Umpire is available.
See:

enum class MemorySpace
{
  Dynamic,
#ifdef AXOM_USE_UMPIRE
  Host,
  Device,
  Unified,
  Pinned,
  Constant
#endif
};

+1 for the team for having CI configs w/o Umpire!
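A sketch of what such a helper could look like, keeping the MemorySpace::Device reference behind the same AXOM_USE_UMPIRE guard as the enum above. The name matches the axom::isDeviceAllocator() function mentioned after the merge below, but the body here is an assumption, not the merged implementation:

namespace axom
{

// Sketch: report whether an allocator ID refers to device memory.
// Without Umpire there are no device allocators, so always false.
inline bool isDeviceAllocator(int allocatorID)
{
#ifdef AXOM_USE_UMPIRE
  return axom::detail::getAllocatorSpace(allocatorID) ==
    axom::MemorySpace::Device;
#else
  (void)allocatorID;  // silence unused-parameter warnings
  return false;
#endif
}

}  // namespace axom

Centralizing the check this way means non-Umpire CI configurations never see the MemorySpace::Device token at all.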

Member

> As for a reproducer, I could not get it to happen with the test I wrote (so I junked it). The fixes do clear up my problems with CUDA on my MIR branch, though.

Probably still good to have test cases that exercise the true and false branches.
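A minimal sketch of such a test, assuming GoogleTest and Axom's existing getDefaultAllocatorID()/getUmpireResourceAllocatorID() helpers; the actual test added in this PR may differ:

#include "gtest/gtest.h"
#include "axom/core/memory_management.hpp"

TEST(core_array, is_device_allocator)
{
  // False branch: the default allocator is not a device allocator.
  EXPECT_FALSE(axom::isDeviceAllocator(axom::getDefaultAllocatorID()));

#if defined(AXOM_USE_UMPIRE) && defined(UMPIRE_ENABLE_DEVICE)
  // True branch: a device allocator should be reported as such.
  const int deviceID = axom::getUmpireResourceAllocatorID(
    umpire::resource::MemoryResourceType::Device);
  EXPECT_TRUE(axom::isDeviceAllocator(deviceID));
#endif
}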


On Oct 1, 2024, @kennyweiss added the labels bug (Something isn't working) and GPU (Issues related to GPU development).
@rhornung67 (Member) left a comment

@BradWhitlock looks reasonable. It would be good to think about adding tests to catch cases that may show breakages -- maybe make an issue that we can discuss?

@publixsubfan (Contributor)

@rhornung67: I think we'd need to test with asan/ubsan enabled in order to reliably capture a problem like this. Alternatively, -Weffc++ on Clang might be able to statically warn about uninitialized variables, but it might also spit out a bunch of other warnings that we'd have to work through.
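For illustration, the kind of constructor -Weffc++ flags (GCC's wording below; Clang's coverage of the -Weffc++ checks differs), using a hypothetical class rather than Axom code:

struct Widget
{
  Widget() { }  // warning: 'Widget::m_flag' should be initialized in
                // the member initialization list [-Weffc++]
  bool m_flag;
};

Strictly speaking, uninitialized reads are primarily MemorySanitizer's territory rather than ASan's, though Valgrind's memcheck catches them too, as it did here.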

BradWhitlock merged commit f6b1f59 into develop on Oct 2, 2024. 13 checks passed.
BradWhitlock (Author)

I added a test that approximates what I was doing in my MIR branch. There was some weirdness with RelWithDebInfo, but no crash. The test calls a new function, axom::isDeviceAllocator(), to exercise it.

BradWhitlock deleted the bugfix/whitlock/array_initialization branch on October 2, 2024 at 00:22.
Successfully merging this pull request may close these issues.

axom::Array constructor crash on CUDA.