Commit 5261637

Fix l0 teardown and sycl windows teardown (#17869)
Signed-off-by: Neil R. Spruit <[email protected]> Co-authored-by: Chris Perkins <[email protected]>
1 parent 8ee2aaa commit 5261637

18 files changed

+208
-128
lines changed

sycl/doc/design/GlobalObjectsInRuntime.md

+66-7
Original file line numberDiff line numberDiff line change
@@ -55,9 +55,65 @@ Deinitialization is platform-specific. Upon application shutdown, the DPC++
5555
runtime frees memory pointed by `GlobalHandler` global pointer, which triggers
5656
destruction of nested `std::unique_ptr`s.
5757

58+
### Shutdown Tasks and Challenges
59+
60+
As the user's app ends, the SYCL runtime is responsible for releasing any UR
61+
adapters that have been acquired, and for tearing down the plugins/adapters themselves.
62+
Additionally, we need to stop deferring any new buffer releases and clean up
63+
any memory whose release was deferred.
64+
65+
To this end, the shutdown occurs in two phases: early and late. The purpose
66+
for early shutdown is primarily to stop any further deferring of memory release.
67+
This is because the deferred memory release is based on threads and on Windows
68+
the threads will be abandoned. So as soon as possible we want to stop deferring
69+
memory and try to release any that has been deferred. The purpose of late
70+
shutdown is to hold onto the handles and adapters longer than the user's
71+
application. We don't want to initiate late shutdown until after all the user's
72+
static and thread local vars have been destroyed, in case those destructors are
73+
calling SYCL.
74+
75+
In the early shutdown we stop deferring, tell the scheduler to prepare for release, and
76+
try releasing the memory that has been deferred so far. Following this, if
77+
the user has any global or static handles to sycl objects, they'll be destroyed.
78+
Finally, the late shutdown routine is called, and the last of the UR handles and
79+
adapters are let go, as is the GlobalHandler itself.
80+
81+
82+
#### Threads
83+
The deferred memory marshalling is built on a thread pool, but there is a
84+
challenge here in that on Windows, once the end of the user's `main()` is reached
85+
and their app is shutting down, the Windows OS will abandon all remaining
86+
in-flight threads. Calling `.join()` on these threads returns instantly, but
87+
the threads are not completed. Further, any thread-specific variables
88+
(or `thread_local static` vars) will NOT have their destructors called. Note
89+
that the standard while-loop-over-condition-var pattern will cause a hang -
90+
we cannot "wait" on abandoned threads.
91+
On Windows, short of adding some user called API to signal this, there is
92+
no way to detect or avoid this. None of the "end-of-library" lifecycle events
93+
occurs before the threads are abandoned (not `std::atexit()`, not globals or
94+
`static`, or `static thread_local` var destruction, not `DllMain(DLL_PROCESS_DETACH)`).
95+
This means that on Windows, once we arrive at `shutdown_early()` we cannot wait on
96+
host events or the thread pool.
97+
98+
For the deferred memory itself, there is no issue here. The Windows OS will
99+
reclaim the memory for us. The issue of which we must be wary is placing UR
100+
handles (and similar) in host threads. The RAII mechanism of unique and
101+
shared pointers will not work in any thread that is abandoned on Windows.
102+
103+
One last note about threads. It is entirely the OS's discretion when to
104+
start or schedule a thread. If the main process is very busy then it is
105+
possible that threads the SYCL library creates (`host_tasks`/`thread_pool`)
106+
won't even be started until AFTER the host application `main()` function is done.
107+
This is not a normal occurrence, but it can happen if there is no call to `queue.wait()`.
108+
109+
58110
### Linux
59111

60-
On Linux DPC++ runtime uses `__attribute__((destructor))` property with low
112+
On Linux, `shutdown_early()` is triggered by the destruction of a static
113+
`StaticVarShutdownHandler` object, which is initialized by
114+
`platform::get_platforms()`.
115+
116+
The timing of `shutdown_late()` relies on the `__attribute__((destructor))` property with low
61117
priority value 110. This approach does not guarantee, that `GlobalHandler`
62118
destructor is the last thing to run, as user code may contain a similar function
63119
with the same priority value. At the same time, users may specify priorities
@@ -72,10 +128,14 @@ times, the memory leak may impact code performance.
72128

73129
### Windows
74130

75-
To identify shutdown moment on Windows, DPC++ runtime uses default `DllMain`
76-
function with `DLL_PROCESS_DETACH` reason. This guarantees, that global objects
77-
deinitialization happens right before `sycl.dll` is unloaded from process
78-
address space.
131+
Unlike on Linux, on Windows `shutdown_early()` is triggered by
132+
`DllMain(DLL_PROCESS_DETACH)`, unless the runtime is statically linked.
133+
134+
`shutdown_late()` is triggered by the destruction of a
135+
static `StaticVarShutdownHandler` object, which is initialized by
136+
`platform::get_platforms()`. (On Linux, this is when we do `shutdown_early()`.
137+
Go figure.) This is as late as we can manage, but it is later than any user
138+
application global, `static`, or `thread_local` variable destruction.
79139

80140
### Recommendations for DPC++ runtime developers
81141

@@ -109,8 +169,7 @@ for (adapter in initializedAdapters) {
109169
urLoaderTearDown();
110170
```
111171

112-
Which in turn is called by either `shutdown_late()` or `shutdown_win()`
113-
depending on platform.
172+
Which in turn is called by `shutdown_late()`.
114173

115174
![](images/adapter-lifetime.jpg)
116175
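
The two-phase shutdown described in this document can be illustrated with a small, self-contained sketch. It is not the DPC++ implementation: `ToyRuntime` and all of its members are hypothetical names invented for illustration, and the "adapters" are reduced to a single boolean. The sketch only shows the ordering idea: the early phase stops deferring and drains what was deferred, while the late phase lets go of the long-lived handles.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for the runtime's global state (not GlobalHandler).
class ToyRuntime {
public:
  // Returns true if the release was deferred, false if it ran immediately.
  bool releaseOrDefer(std::function<void()> Release) {
    if (OkToDefer) {
      Deferred.push_back(std::move(Release));
      return true;
    }
    Release(); // After early shutdown, nothing is deferred any more.
    return false;
  }

  // Phase 1: stop deferring, then make a pass over what was deferred so far.
  void shutdownEarly() {
    OkToDefer = false;
    for (auto &Release : Deferred)
      Release();
    Deferred.clear();
  }

  // Phase 2: let go of the long-lived handles (modelled here as one flag).
  void shutdownLate() { AdaptersLoaded = false; }

  bool isOkToDefer() const { return OkToDefer; }
  bool adaptersLoaded() const { return AdaptersLoaded; }
  std::size_t pendingReleases() const { return Deferred.size(); }

private:
  bool OkToDefer = true;
  bool AdaptersLoaded = true;
  std::vector<std::function<void()>> Deferred;
};
```

In this sketch a caller would invoke `shutdownEarly()` first and `shutdownLate()` only after every user object that might still touch the runtime has been destroyed, which mirrors the ordering the document asks for.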

sycl/source/detail/global_handler.cpp

+63-36
Original file line numberDiff line numberDiff line change
@@ -37,9 +37,11 @@ using LockGuard = std::lock_guard<SpinLock>;
3737
SpinLock GlobalHandler::MSyclGlobalHandlerProtector{};
3838

3939
// forward decl
40-
void shutdown_win(); // TODO: win variant will go away soon
4140
void shutdown_early();
4241
void shutdown_late();
42+
#ifdef _WIN32
43+
BOOL isLinkedStatically();
44+
#endif
4345

4446
// Utility class to track references on object.
4547
// Used for GlobalHandler now and created as thread_local object on the first
@@ -60,10 +62,6 @@ class ObjectUsageCounter {
6062

6163
LockGuard Guard(GlobalHandler::MSyclGlobalHandlerProtector);
6264
MCounter--;
63-
GlobalHandler *RTGlobalObjHandler = GlobalHandler::getInstancePtr();
64-
if (RTGlobalObjHandler) {
65-
RTGlobalObjHandler->prepareSchedulerToRelease(!MCounter);
66-
}
6765
} catch (std::exception &e) {
6866
__SYCL_REPORT_EXCEPTION_TO_STREAM("exception in ~ObjectUsageCounter", e);
6967
}
@@ -254,24 +252,37 @@ void GlobalHandler::releaseDefaultContexts() {
254252
MPlatformToDefaultContextCache.Inst.reset(nullptr);
255253
}
256254

257-
struct EarlyShutdownHandler {
258-
~EarlyShutdownHandler() {
255+
// Shutdown is split into two parts. shutdown_early() stops any more
256+
// objects from being deferred and takes an initial pass at freeing them.
257+
// shutdown_late() finishes and releases the adapters and the GlobalHandler.
258+
// For Windows, early shutdown is typically called from DllMain,
259+
// and late shutdown is here.
260+
// For Linux, early shutdown is here, and late shutdown is called from
261+
// a low priority destructor.
262+
struct StaticVarShutdownHandler {
263+
264+
~StaticVarShutdownHandler() {
259265
try {
260266
#ifdef _WIN32
261-
// on Windows we keep to the existing shutdown procedure
262-
GlobalHandler::instance().releaseDefaultContexts();
267+
// If statically linked, DllMain will not be called. So we do its work
268+
// here.
269+
if (isLinkedStatically()) {
270+
shutdown_early();
271+
}
272+
273+
shutdown_late();
263274
#else
264275
shutdown_early();
265276
#endif
266277
} catch (std::exception &e) {
267-
__SYCL_REPORT_EXCEPTION_TO_STREAM("exception in ~EarlyShutdownHandler",
268-
e);
278+
__SYCL_REPORT_EXCEPTION_TO_STREAM(
279+
"exception in ~StaticVarShutdownHandler", e);
269280
}
270281
}
271282
};
272283

273-
void GlobalHandler::registerEarlyShutdownHandler() {
274-
static EarlyShutdownHandler handler{};
284+
void GlobalHandler::registerStaticVarShutdownHandler() {
285+
static StaticVarShutdownHandler handler{};
275286
}
276287

277288
bool GlobalHandler::isOkToDefer() const { return OkToDefer; }
@@ -304,35 +315,29 @@ void GlobalHandler::prepareSchedulerToRelease(bool Blocking) {
304315
#ifndef _WIN32
305316
if (Blocking)
306317
drainThreadPool();
318+
#endif
307319
if (MScheduler.Inst)
308320
MScheduler.Inst->releaseResources(Blocking ? BlockingT::BLOCKING
309321
: BlockingT::NON_BLOCKING);
310-
#endif
311322
}
312323

313324
void GlobalHandler::drainThreadPool() {
314325
if (MHostTaskThreadPool.Inst)
315326
MHostTaskThreadPool.Inst->drain();
316327
}
317328

318-
#ifdef _WIN32
319-
// because of something not-yet-understood on Windows
320-
// threads may be shutdown once the end of main() is reached
321-
// making an orderly shutdown difficult. Fortunately, Windows
322-
// itself is very aggressive about reclaiming memory. Thus,
323-
// we focus solely on unloading the adapters, so as to not
324-
// accidentally retain device handles. etc
325-
void shutdown_win() {
326-
GlobalHandler *&Handler = GlobalHandler::getInstancePtr();
327-
Handler->unloadAdapters();
328-
}
329-
#else
330329
void shutdown_early() {
331330
const LockGuard Lock{GlobalHandler::MSyclGlobalHandlerProtector};
332331
GlobalHandler *&Handler = GlobalHandler::getInstancePtr();
333332
if (!Handler)
334333
return;
335334

335+
#if defined(XPTI_ENABLE_INSTRUMENTATION) && defined(_WIN32)
336+
if (xptiTraceEnabled())
337+
return; // When doing xpti tracing, we can't safely shutdown on Win.
338+
// TODO: figure out why XPTI prevents release.
339+
#endif
340+
336341
// Now that we are shutting down, we will no longer defer MemObj releases.
337342
Handler->endDeferredRelease();
338343

@@ -354,6 +359,12 @@ void shutdown_late() {
354359
if (!Handler)
355360
return;
356361

362+
#if defined(XPTI_ENABLE_INSTRUMENTATION) && defined(_WIN32)
363+
if (xptiTraceEnabled())
364+
return; // When doing xpti tracing, we can't safely shutdown on Win.
365+
// TODO: figure out why XPTI prevents release.
366+
#endif
367+
357368
// First, release resources, that may access adapters.
358369
Handler->MPlatformCache.Inst.reset(nullptr);
359370
Handler->MScheduler.Inst.reset(nullptr);
@@ -370,7 +381,6 @@ void shutdown_late() {
370381
delete Handler;
371382
Handler = nullptr;
372383
}
373-
#endif
374384

375385
#ifdef _WIN32
376386
extern "C" __SYCL_EXPORT BOOL WINAPI DllMain(HINSTANCE hinstDLL,
@@ -391,23 +401,18 @@ extern "C" __SYCL_EXPORT BOOL WINAPI DllMain(HINSTANCE hinstDLL,
391401
if (PrintUrTrace)
392402
std::cout << "---> DLL_PROCESS_DETACH syclx.dll\n" << std::endl;
393403

394-
#ifdef XPTI_ENABLE_INSTRUMENTATION
395-
if (xptiTraceEnabled())
396-
return TRUE; // When doing xpti tracing, we can't safely call shutdown.
397-
// TODO: figure out what XPTI is doing that prevents
398-
// release.
399-
#endif
400-
401404
try {
402-
shutdown_win();
405+
shutdown_early();
403406
} catch (std::exception &e) {
404-
__SYCL_REPORT_EXCEPTION_TO_STREAM("exception in shutdown_win", e);
407+
__SYCL_REPORT_EXCEPTION_TO_STREAM("exception in DLL_PROCESS_DETACH", e);
405408
return FALSE;
406409
}
410+
407411
break;
408412
case DLL_PROCESS_ATTACH:
409413
if (PrintUrTrace)
410414
std::cout << "---> DLL_PROCESS_ATTACH syclx.dll\n" << std::endl;
415+
411416
break;
412417
case DLL_THREAD_ATTACH:
413418
break;
@@ -416,6 +421,28 @@ extern "C" __SYCL_EXPORT BOOL WINAPI DllMain(HINSTANCE hinstDLL,
416421
}
417422
return TRUE; // Successful DLL_PROCESS_ATTACH.
418423
}
424+
BOOL isLinkedStatically() {
425+
// If the exePath is the same as the dllPath,
426+
// or if the module handle for DllMain is not retrievable,
427+
// then we are linked statically
428+
// Otherwise we are dynamically linked or loaded.
429+
HMODULE hModule = nullptr;
430+
auto LpModuleAddr = reinterpret_cast<LPCSTR>(&DllMain);
431+
if (!GetModuleHandleExA(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS, LpModuleAddr,
432+
&hModule)) {
433+
return true; // not retrievable, therefore statically linked
434+
}
435+
char dllPath[MAX_PATH];
436+
if (GetModuleFileNameA(hModule, dllPath, MAX_PATH)) {
437+
char exePath[MAX_PATH];
438+
if (GetModuleFileNameA(NULL, exePath, MAX_PATH)) {
439+
if (std::string(dllPath) == std::string(exePath)) {
440+
return true; // paths identical, therefore statically linked
441+
}
442+
}
443+
}
444+
return false; // Otherwise dynamically linked or loaded
445+
}
419446
#else
420447
// Setting low priority on destructor ensures it runs after all other global
421448
// destructors. Priorities 0-100 are reserved by the compiler. The priority

sycl/source/detail/global_handler.hpp

+1-2
Original file line numberDiff line numberDiff line change
@@ -73,7 +73,7 @@ class GlobalHandler {
7373
XPTIRegistry &getXPTIRegistry();
7474
ThreadPool &getHostTaskThreadPool();
7575

76-
static void registerEarlyShutdownHandler();
76+
static void registerStaticVarShutdownHandler();
7777

7878
bool isOkToDefer() const;
7979
void endDeferredRelease();
@@ -95,7 +95,6 @@ class GlobalHandler {
9595

9696
bool OkToDefer = true;
9797

98-
friend void shutdown_win();
9998
friend void shutdown_early();
10099
friend void shutdown_late();
101100
friend class ObjectUsageCounter;

sycl/source/detail/host_task.hpp

+9
Original file line numberDiff line numberDiff line change
@@ -13,6 +13,7 @@
1313
#pragma once
1414

1515
#include <detail/cg.hpp>
16+
#include <detail/global_handler.hpp>
1617
#include <sycl/detail/cg_types.hpp>
1718
#include <sycl/handler.hpp>
1819
#include <sycl/interop_handle.hpp>
@@ -33,6 +34,10 @@ class HostTask {
3334
bool isInteropTask() const { return !!MInteropTask; }
3435

3536
void call(HostProfilingInfo *HPI) {
37+
if (!GlobalHandler::instance().isOkToDefer()) {
38+
return;
39+
}
40+
3641
if (HPI)
3742
HPI->start();
3843
MHostTask();
@@ -41,6 +46,10 @@ class HostTask {
4146
}
4247

4348
void call(HostProfilingInfo *HPI, interop_handle handle) {
49+
if (!GlobalHandler::instance().isOkToDefer()) {
50+
return;
51+
}
52+
4453
if (HPI)
4554
HPI->start();
4655
MInteropTask(handle);
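
The guard added to `HostTask::call()` above skips the task body once shutdown has begun. A minimal sketch of that pattern follows, assuming a process-wide atomic flag in place of `GlobalHandler::instance().isOkToDefer()`; `okToDefer()` and `ToyHostTask` are hypothetical names, not part of the SYCL runtime.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical process-wide flag standing in for the runtime's
// "still OK to defer" state. It starts out true and is cleared at shutdown.
inline std::atomic<bool> &okToDefer() {
  static std::atomic<bool> Flag{true};
  return Flag;
}

class ToyHostTask {
public:
  explicit ToyHostTask(std::function<void()> Fn) : MFn(std::move(Fn)) {}

  // Returns true if the body ran, false if it was skipped because
  // shutdown had already started.
  bool call() {
    if (!okToDefer().load())
      return false;
    MFn();
    return true;
  }

private:
  std::function<void()> MFn;
};
```

The return value exists only so the sketch can be observed from outside; the `call()` overloads in the diff return `void` and simply return early.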

sycl/source/detail/platform_impl.cpp

+1-1
Original file line numberDiff line numberDiff line change
@@ -171,7 +171,7 @@ std::vector<platform> platform_impl::get_platforms() {
171171

172172
// This initializes a function-local variable whose destructor is invoked as
173173
// the SYCL shared library is first being unloaded.
174-
GlobalHandler::registerEarlyShutdownHandler();
174+
GlobalHandler::registerStaticVarShutdownHandler();
175175

176176
return Platforms;
177177
}

sycl/source/detail/program_manager/program_manager.cpp

+2
Original file line numberDiff line numberDiff line change
@@ -3873,7 +3873,9 @@ extern "C" void __sycl_register_lib(sycl_device_binaries desc) {
38733873
// Executed as a part of current module's (.exe, .dll) static initialization
38743874
extern "C" void __sycl_unregister_lib(sycl_device_binaries desc) {
38753875
// Partial cleanup is not necessary at shutdown
3876+
#ifndef _WIN32
38763877
if (!sycl::detail::GlobalHandler::instance().isOkToDefer())
38773878
return;
38783879
sycl::detail::ProgramManager::getInstance().removeImages(desc);
3880+
#endif
38793881
}

sycl/source/detail/queue_impl.hpp

+11-1
Original file line numberDiff line numberDiff line change
@@ -264,7 +264,17 @@ class queue_impl {
264264
destructorNotification();
265265
#endif
266266
throw_asynchronous();
267-
getAdapter()->call<UrApiKind::urQueueRelease>(MQueue);
267+
auto status =
268+
getAdapter()->call_nocheck<UrApiKind::urQueueRelease>(MQueue);
269+
// If loader is already closed, it'll return a not-initialized status
270+
// which the UR should convert to SUCCESS code. But that isn't always
271+
// working on Windows. This is a temporary workaround until that is fixed.
272+
// TODO: Remove this workaround when UR is fixed, and restore
273+
// ->call<>() instead of ->call_nocheck<>() above.
274+
if (status != UR_RESULT_SUCCESS &&
275+
status != UR_RESULT_ERROR_UNINITIALIZED) {
276+
__SYCL_CHECK_UR_CODE_NO_EXC(status);
277+
}
268278
} catch (std::exception &e) {
269279
__SYCL_REPORT_EXCEPTION_TO_STREAM("exception in ~queue_impl", e);
270280
}
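
The workaround in `~queue_impl` above treats a "not initialized" status from the release call as benign, because the loader may already be closed at that point. The status filter can be sketched on its own as follows; `ToyResult` and `shouldReportReleaseStatus` are hypothetical stand-ins for the UR result codes and the check in the diff, not real UR API.

```cpp
#include <cassert>

// Hypothetical result codes; the real code uses ur_result_t values such as
// UR_RESULT_SUCCESS and UR_RESULT_ERROR_UNINITIALIZED.
enum class ToyResult { Success, ErrorUninitialized, ErrorInvalidQueue };

// Returns true only for statuses that should still be reported when
// releasing a queue during shutdown. "Uninitialized" is tolerated because
// it is assumed to mean the loader was torn down first.
inline bool shouldReportReleaseStatus(ToyResult Status) {
  return Status != ToyResult::Success &&
         Status != ToyResult::ErrorUninitialized;
}
```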
