Safe darktable shutdown #18061

jenshannoschwalm · 2024-12-24T18:47:46Z

EDIT again 01/04

This is a comprehensive collection of everything necessary to avoid darktable crashing after closing the app via ctrl-q or click on close-button. There have been lengthy discussions in previous pr's #18023 and #17984 for those of you interested in the progress we made. It has been a very hard and bumpy ride to get this all sorted out, with this pr it is very hard to get dt crashing after closing the ui. Thanks to all of you having tested or reviewed commented on those pr's.

I had still observed rare and spurious problems with the late PRs if there are many backjobs running exporting and flipping the same images for example in rawspeed loader, exif blob reading and mem access in nirvana.

Also there might be issues in dbus as reported by @zisoft but i couldn't trigger those.
The lua terminations problems have been finally understood and fixed.

The good one: The bunch is intended for master and 5.0 branch if reviewed for this.
The bad one for 5.0, to work safely the first 3 commits! are required

Commit 1 Fixes the order in control shutdown
Commit 2+3 fix crashes related to wrong cache usage. All 3 big bugs were not detected and were likely not happening if backthread processing on images had much lower cache pressure.

Proposal 4 Some additional easy fixes related to mipmap caching and bad dt_image_t returned and not fully checking for safe results. Not as important ...

Let's go into details from the commit messages for easier understanding why this all necessary.

Control quit and shutdown order issue (shutdown part 1)

How is darktable shutdown working safely?

dt_control_quit() is called via user request, it sets control->running anatomically to DT_CONTROL_STATE_CLEANUP implying that dt_control_running() will not be TRUE any more.
It finally calls gtk_main_quit() so with current code we don't have gtk events after that. Avoid race conditions by using an anatomic flag ensuring the valid code part is executed only once.
Quitting gtk also means we are exactly here to continue the shutdown. Anything requiring a still active UI must be done before ...
dt_control_shutdown() first waits for all threads to be joined, that means waiting for all pending jobs to be done completely.
So we have to ensure:
1. a full working software stack including image_cache, mipmap_cache and darktable.imageio to allow processing the pixelpipe and full access to all images.
2. we don't access darktable.lib related stuff with dereferencing.
3. The pipeline processing uses access to gui related data - focus, active module ... so make sure to avoid those too.
4. As lua events might be fired by the backthreads it's state mutex must still be unlocked.
After dt_control_shutdown() has finished we are sure there are no background threads running any more so we can safely close all mentioned subsystems and continue.
In shutdown mode - tested via dt_control_running() is FALSE - we will not add new jobs. Avoids unlikely but possible race conditions for now.

Progress destroy and safety (shutdown part 2)

A very annoying bug in dt_control_progress_cancel() has been fixed.

Calling the progress->cancel callback is somewhat tricky as that might include destroying the progress structure which also includes destroying the process->mutex.

dt_control_progress_cancel_callback() is doing that as it deep-down calls dt_control_job_cancel() which finally wants to destroy the mutex but as it's locked it can't do so.

So, if progress->cancel == &dt_control_progress_cancel_callback we must not call progress->cancel with the mutex locked as that would make the mutex destroy failing leading to assert() failures in debug builds and we can't use that mutex anyway leading to a crash.

Otherwise do the cancel callback in mutex locked state.
We strictly avoid progress de-referencing if being NULL for safety, in case of getters we return safe dummies,
in case of putters we avoid the function.

Fix dt_image_update_final_size() and dt_image_get_final_size() (shutdown part 3)

In dt_image_update_final_size() we must not use dt_cache_release() if the entry is NULL because of dereferencing.

In dt_image_get_final_size() we kept a pointer to a dt_image_t struct and later used it after the original dt_image_t has been released.

If system is under heavy cache load this could very well fail - and it does.

dt_image_t not available fixes (proposal shutdown part 4)

If we get dt_image_t *img = dt_image_cache_get() from the image cache img can be NULL for various reasons.

In all those cases we must not de-reference img in any way and must use a meaningful value - like when getting the imgid - and avoid further processing on that data.

We had that being checked in most cases, a few ones have been overseen yet.

Above has been edited to current commit status

wpferguson · 2024-12-24T20:37:19Z

As dt_lua_finalize() or the dt_lua_finalize_early() variant has been called before, the mutex behind dt_lua_lock() is not safe any longer and as the lua subsystem is not fully functional any more, we have to strictly avoid processing (problematic) lua code from now on.
So take care, anything happening in lua code must be prepared to respect dt_control_running() for signals or a responding gui.

dt_lua_finalize_early() isn't a variant, it's the preparatory for dt_lua_finalize(). dt_lua_finalize_early() sends the exit event to all the running scripts letting them know that darktable is ending. dt_lua_lock() is functional in dt_lua_finalize_early() so that the scripts can exit cleanly. dt_lua_finalize() destroys the Lua context, so Lua is no longer running after that.

Lua doesn't need the GUI to run, except for the case of darktable.gui.action() being called in darkroom. The GUI is used by some scripts just to get parameters so that they can perform the requested functions. So if a script has been triggered to run from a the GUI, it will continue running even if the GUI disappears.

For sure the dt_lua_lock() can't work and executions leads to crashes ( or assert() failing in debug builds ) if the lock mutex has been destroyed already.

Can you give me an instance or scenario where it fails or causes a crash? Or are you talking about the final lock in dt_lua_finalize() where it never unlocks, because Lua isn't running anymore so therefore it can't.

problematic lua code

Can I have a definition or example of that so I have some idea what to test?

jenshannoschwalm · 2024-12-24T21:17:42Z

dt_lua_finalize_early() isn't a variant, it's the preparatory for dt_lua_finalize(). dt_lua_finalize_early() sends the exit event to all the running scripts letting them know that darktable is ending. dt_lua_lock() is functional in dt_lua_finalize_early() so that the scripts can exit cleanly. dt_lua_finalize() destroys the Lua context, so Lua is no longer running after that.

OK. So this remarks should be changed. Maybe you make a proposal or i just take your comment ?

Lua doesn't need the GUI to run, except for the case of darktable.gui.action() being called in darkroom. The GUI is used by some scripts just to get parameters so that they can perform the requested functions. So if a script has been triggered to run from a the GUI, it will continue running even if the GUI disappears.

Ok, as long as as does not read "fresh" data or uses the lib stuff.

For sure the dt_lua_lock() can't work and executions leads to crashes ( or assert() failing in debug builds ) if the lock mutex has been destroyed already.

Can you give me an instance or scenario where it fails or causes a crash? Or are you talking about the final lock in dt_lua_finalize() where it never unlocks, because Lua isn't running anymore so therefore it can't.

problematic lua code

Can I have a definition or example of that so I have some idea what to test?

Sure - just look at the commit code :-)

_import_internal uses dt_lua_lock(). Had been good this way until we started to using the backjobs. So there might be running import jobs after that lock had been destroyed.
export_with_flags is the same story.
Both crashed here and iirc also on @zisoft machine, for sure this is absolutely not safe and on debug builds runs into assert() issues.

wpferguson · 2024-12-24T21:47:23Z

What are backjobs? Is that in 5.0?

EDIT: So why not move dt_lua_finalize() after all the backjobs? dt_lua_lock() works until dt_lua_finalize() runs.

wpferguson · 2024-12-25T01:23:28Z

Lua has been working just fine until backjobs were added. So since the only thing that changed was backjobs, that's where the problem is, not in Lua.

The "problematic" code is that the backjobs are firing events that Lua is supposed to handle. The reason for the crashes is that the backjobs were added after Lua has stopped. So the fix is to put the backjobs before Lua finalizes so that Lua can handle the events.

jenshannoschwalm · 2024-12-25T03:59:04Z

The reason for the crashes is that the backjobs were added after Lua has stopped. So the fix is to put the backjobs before Lua finalizes so that Lua can handle the events.

Well, the backjobs were not added but continue running ...

Just mentioning this, i was also very surprised why dt started behaving like this and it took me a long time to understand :-)

So why not move dt_lua_finalize() after all the backjobs

I agree. I was a bit reluctant to do so as that has been in that position for a "long time" . Had that here in "test commit" and indeed could not find any issue the last days doing that, so squashed and pushing that now.

I would appreciate testing on your side as you will know best if anythings behaves strange.
(BTW i couldn't spot the code part that actually and finally unlocks and destroys the lua state mutex and cond.

jenshannoschwalm · 2024-12-25T05:48:37Z

Fixed, squashed and force-pushed after above discussion, updating the top "presentation" to current status

src/common/exif.cc

jenshannoschwalm · 2024-12-25T14:08:08Z

Again fixed, squashed and force-pushed after above discussion, updating the top "presentation" to current status

jenshannoschwalm · 2024-12-26T16:30:44Z

Added a 4-th commit with two additional and i think rather important fixes.

Affected were dt_image_update_final_size() and dt_image_get_final_size(), under heavy cache load both can fail miserably.

jenshannoschwalm · 2025-01-04T14:53:43Z

@TurboGit i did a lot more testing the last week or so, checking what could possibly be done for the 5.0 branch as a bugfix. Remember this all is about safe closing without a crash if there is a background job still running when you do ctrl-q.

If we would take only a497c56 the chance for a safe close is already pretty good (all the other commits fix bugs that happen mostly with heavy cache pressure by multiple background jobs or by discarding a job) so i personally would suggest to take that for 5.0 too.

curiousercreative · 2025-01-14T14:55:54Z

Potentially fixes #18208

jenshannoschwalm · 2025-01-14T20:15:27Z

Potentially fixes #18208

Nope. This PR is fixing instabilities and crashes while shutting down in the background.

How is darktable shutdown working safely? 1. dt_control_quit() is called via user request, it sets control->running anatomically to DT_CONTROL_STATE_CLEANUP implying that dt_control_running() will not be TRUE any more. It finally calls gtk_main_quit() so with current code we don't have gtk events after that. 2. Quitting gtk also means **we are exactly here** to continue the shutdown. Anything requiring a still active UI must be done before ... 3. dt_control_shutdown() first waits for all threads to be joined, that means waiting for all pending jobs to be done completely. 4. So we have to ensure: 1) a full working software stack including image_cache, mipmap_cache and darktable.imageio to allow processing the pixelpipe and full access to all images. 2) we don't access darktable.lib related stuff with dereferencing. 3) The pipeline processing uses access to gui related data - focus, active module ... so make sure to avoid those too. 4) As lua events might be fired by the backthreads it's state mutex must still be unlocked. 5. After dt_control_shutdown() has finished we are sure there are no background threads running any more so we can safely close all mentioned subsystems and continue. 6. In shutdown mode - tested via dt_control_running() is FALSE - we will not add new jobs. Avoids unlikely but possible race conditions for now.

1. A very annoying bug in dt_control_progress_cancel() has been fixed. Calling the progress->cancel callback is somewhat tricky as that might include destroying the progress structure which also includes destroying the process->mutex. _control_progress_cancel_callback() doing in this order --> dt_control_job_cancel() --> _control_job_set_state --> dt_control_progress_destroy So in this case - we used dt_control_progress_attach_job() a) we **must not** call progress->cancel with the mutex locked as that would make the mutex destroy failing leading to assert() failures in debug builds and an undefined behaviour in release builds. b) We unlock a mutex immediately after that but that location is not existing any more. Otherwise, do the cancel callback in mutex locked state as the progress struct will not be destroyed. 2. We strictly avoid progress de-referencing if being NULL for safety, in case of getters we return safe dummies, in case of putters we avoid the function.

…own part 3) In dt_image_update_final_size() we **must not** use dt_cache_release() if the entry is NULL. In dt_image_get_final_size() we kept a pointer to a dt_image_t struct and later used it after the original dt_image_t has been released. If system is under heavy cache load this could very well fail - and it does.

If we get `dt_image_t *img = dt_image_cache_get()` from the image cache, img can be NULL for various reasons. In all those cases we must not de-reference img in any way and must use a meaningful value - like when getting the imgid - and avoid further processing on that data. We had that being checked in most cases, a few ones have been overseen yet.

jenshannoschwalm added bugfix pull request fixing a bug priority: high core features are broken and not usable at all, software crashes difficulty: hard big changes across different parts of the code base lua labels Dec 24, 2024

jenshannoschwalm force-pushed the processing_in_the_dark branch from df0bad0 to 8e00c68 Compare December 25, 2024 03:36

jenshannoschwalm force-pushed the processing_in_the_dark branch from 8e00c68 to 3ef0845 Compare December 25, 2024 05:42

jenshannoschwalm mentioned this pull request Dec 25, 2024

Threads, jobs and safety maintenance. #18023

Closed

jenshannoschwalm force-pushed the processing_in_the_dark branch from 3ef0845 to 8093805 Compare December 25, 2024 12:37

jenshannoschwalm requested a review from LebedevRI as a code owner December 25, 2024 12:37

zisoft reviewed Dec 25, 2024

View reviewed changes

src/common/exif.cc Outdated Show resolved Hide resolved

jenshannoschwalm force-pushed the processing_in_the_dark branch from 8093805 to 5319fa8 Compare December 25, 2024 13:30

jenshannoschwalm force-pushed the processing_in_the_dark branch 3 times, most recently from a2b085a to 8442b6a Compare December 31, 2024 04:26

jenshannoschwalm added this to the 5.2 milestone Jan 2, 2025

jenshannoschwalm force-pushed the processing_in_the_dark branch from 8442b6a to d59c97e Compare January 4, 2025 06:51

MStraeten mentioned this pull request Jan 14, 2025

darktable process persists after application is quit and windows close #18208

Open

jenshannoschwalm force-pushed the processing_in_the_dark branch from d59c97e to 77c1255 Compare January 15, 2025 05:40

ralfbrown mentioned this pull request Jan 19, 2025

Darktable window closes, but process never terminates (unless running darktable inside gdb) #18240

Open

jenshannoschwalm mentioned this pull request Jan 21, 2025

Avoid null pointer dereference in dt_mipmap_cache_get_with_caller #18254

Open

jenshannoschwalm force-pushed the processing_in_the_dark branch 4 times, most recently from 76e2e09 to 81250e1 Compare January 25, 2025 19:54

jenshannoschwalm added 4 commits January 28, 2025 06:05

jenshannoschwalm force-pushed the processing_in_the_dark branch from 81250e1 to 2a52893 Compare January 28, 2025 05:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Safe darktable shutdown #18061

Safe darktable shutdown #18061

jenshannoschwalm commented Dec 24, 2024 •

edited

Loading

wpferguson commented Dec 24, 2024

jenshannoschwalm commented Dec 24, 2024

wpferguson commented Dec 24, 2024 •

edited

Loading

wpferguson commented Dec 25, 2024

jenshannoschwalm commented Dec 25, 2024 •

edited

Loading

jenshannoschwalm commented Dec 25, 2024

jenshannoschwalm commented Dec 25, 2024

jenshannoschwalm commented Dec 26, 2024

jenshannoschwalm commented Jan 4, 2025

curiousercreative commented Jan 14, 2025

jenshannoschwalm commented Jan 14, 2025

Safe darktable shutdown #18061

Are you sure you want to change the base?

Safe darktable shutdown #18061

Conversation

jenshannoschwalm commented Dec 24, 2024 • edited Loading

wpferguson commented Dec 24, 2024

jenshannoschwalm commented Dec 24, 2024

wpferguson commented Dec 24, 2024 • edited Loading

wpferguson commented Dec 25, 2024

jenshannoschwalm commented Dec 25, 2024 • edited Loading

jenshannoschwalm commented Dec 25, 2024

jenshannoschwalm commented Dec 25, 2024

jenshannoschwalm commented Dec 26, 2024

jenshannoschwalm commented Jan 4, 2025

curiousercreative commented Jan 14, 2025

jenshannoschwalm commented Jan 14, 2025

jenshannoschwalm commented Dec 24, 2024 •

edited

Loading

wpferguson commented Dec 24, 2024 •

edited

Loading

jenshannoschwalm commented Dec 25, 2024 •

edited

Loading