Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance Thread Safety in Core Data Management #12805

Open
wants to merge 17 commits into
base: trunk
Choose a base branch
from

Conversation

pmusolino
Copy link
Member

@pmusolino pmusolino commented May 21, 2024

Closes: #12786

Description

It seems there are multiple crashes in the app, even very old crashes, which have to do with issues such as data corruption and deadlocks related to CoreData. #12785 is one of these (extra info here peaMlT-DX), and as usual, this is one of those crashes that is impossible to replicate, since concurrency issues often involve race conditions, where the timing of multiple threads accessing or modifying shared resources leads to unpredictable behavior. The exact timing of these operations can vary with each run of the application, making it hard to consistently reproduce the issue.

In this PR I attempt a series of enhancements to our Core Data management strategy, focusing on improving thread safety and resolving potential deadlocks, both on CoreDataManager and EntityListener. These changes should ensure that Core Data operations are performed in a safe and efficient manner across different threads.

For reviewing this PR -which must make sense to you, there is also the possibility that this code will not be merged at the end-; I added all the people who in the past have been involved with major changes to these classes, or participated in debugging some of these crashes. But given also the potential impact under the hood of these changes, I need your opinion and extra extra pair of eyes. <- If this PR is merged and we notice no impact at all, I can revert the changes.

Key Changes

  • Thread Safety: A dedicated synchronization queue (syncQueue) is now utilized to manage access to the persistentContainer and its contexts, ensuring thread-safe operations. The dispatch queue ensures that only one thread can access or modify the shared resources (e.g., persistentContainer, viewContext) at a time.
  • Code Refactoring: Various functions have been refactored to either include proper synchronization or to utilize existing thread-safe methods provided by NSPersistentContainer.
  • EntityListener: other than CoreDataManager, the other class that should be thread safe I think it should be EntityListener, so also here I introduced the same synchronization queue.

Testing instructions

  • As mentioned above, the exact timing of the operations can vary with each run of the application, making it hard to consistently reproduce the issues. Given that, try to do smoke testing, ensuring the most important functions of the app work as expected.
  • Unit tests have been updated to reflect the changes and ensure that all Core Data operations behave as expected.
  • If this PR will be merged, it is important to closely monitor the application for any changes in the frequency and nature of the crashes that have been occurring, specifically a reduction in the crashes that have been previously reported. This will help us to determine the effectiveness of the changes introduced in this PR in addressing the underlying issues.

  • I have considered if this change warrants user-facing release notes and have added them to RELEASE-NOTES.txt if necessary.

@pmusolino pmusolino added type: enhancement A request for an enhancement. type: crash The worst kind of bug. labels May 21, 2024
@pmusolino pmusolino added this to the 18.8 milestone May 21, 2024
@wpmobilebot
Copy link
Collaborator

wpmobilebot commented May 21, 2024

WooCommerce iOS📲 You can test the changes from this Pull Request in WooCommerce iOS by scanning the QR code below to install the corresponding build.

App NameWooCommerce iOS WooCommerce iOS
Build Numberpr12805-96e8270
Version18.9
Bundle IDcom.automattic.alpha.woocommerce
Commit96e8270
App Center BuildWooCommerce - Prototype Builds #9474
Automatticians: You can use our internal self-serve MC tool to give yourself access to App Center if needed.

Copy link
Contributor

@iamgabrielma iamgabrielma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this !

What would be the trade-offs of following this approach?

My understanding is that one of the trade-offs for safety would be a performance hit due to the overhead of queue synchronisation (which should be fine in our case right now, but we should find a way to measure it).

My main concern is that while we would be sure to run all these processes asynchronously in the syncQueue to assure safety, this would also block the thread until the operation is completed, isn't it? Since our action dispatcher always runs in the main thread by design, would upsert-related actions potentially be a source of deadlocks?

Comment on lines +226 to +298

func test_viewStorage_is_thread_safe() {
// Given
let manager = CoreDataManager(name: storageIdentifier, crashLogger: MockCrashLogger())
let expectation = self.expectation(description: "Thread Safety")

// When
DispatchQueue.global(qos: .background).async {
let viewContext = manager.viewStorage as? NSManagedObjectContext
// Then
XCTAssertNotNil(viewContext)
expectation.fulfill()
}

wait(for: [expectation], timeout: Constants.expectationTimeout)
}

func test_reset_is_thread_safe() {
// Given
let manager = CoreDataManager(name: storageIdentifier, crashLogger: MockCrashLogger())
let expectation = self.expectation(description: "Thread Safety")

// When
DispatchQueue.global(qos: .background).async {
manager.reset()
// Then
expectation.fulfill()
}

wait(for: [expectation], timeout: Constants.expectationTimeout)
}

func test_concurrent_access_to_core_data_manager() {
// Given
let manager = CoreDataManager(name: storageIdentifier, crashLogger: MockCrashLogger())
let expectation1 = self.expectation(description: "Thread 1")
let expectation2 = self.expectation(description: "Thread 2")

// When
DispatchQueue.global(qos: .background).async {
manager.reset()
// Then
expectation1.fulfill()
}

DispatchQueue.global(qos: .background).async {
let viewContext = manager.viewStorage as? NSManagedObjectContext
// Then
XCTAssertNotNil(viewContext)
expectation2.fulfill()
}

wait(for: [expectation1, expectation2], timeout: Constants.expectationTimeout)
}

func test_saving_operations_are_thread_safe() {
// Given
let manager = CoreDataManager(name: storageIdentifier, crashLogger: MockCrashLogger())
let expectation = self.expectation(description: "Thread Safety")

// When
DispatchQueue.global(qos: .background).async {
let derivedContext = manager.writerDerivedStorage as? NSManagedObjectContext
derivedContext?.performAndWait {
_ = derivedContext?.insertNewObject(ofType: ShippingLine.self)
derivedContext?.saveIfNeeded()
}
// Then
expectation.fulfill()
}

wait(for: [expectation], timeout: Constants.expectationTimeout)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure how useful are these tests or if are actually testing the changes as we intend to, if we remove the syncQueue.sync in this PR and re-run the tests, all of them still pass.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure how useful are these tests or if are actually testing the changes as we intend to, if we remove the syncQueue.sync in this PR and re-run the tests, all of them still pass.

I'm not sure why I lost this comment. By the way, @iamgabrielma, you're right, but at the same time, it's quite difficult to replicate thread safety issues, so these tests serve more as a sort of sanity check rather than unit tests. By the way, if you have any suggestions for improvement, I'm open to hearing them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely tricky.

I don't have a better suggestion for the test themselves, but I think that we should change the names at a minimum so they doesn't conflate that are testing something that aren't, in this case we don't know if the operations are thread safe, for example: Accessing viewStorage from a background thread and verifying its non-nil state does not confirm thread-safety, or the "reset" test just demonstrates that the function completes without immediate errors on a background thread, not its thread safety.

Which bring the point of: Are these tests really necessary? If they are, they should be renamed to not miss-guide what they test.

Yosemite/YosemiteTests/Tools/EntityListenerTests.swift Outdated Show resolved Hide resolved
expectation.fulfill()
}

wait(for: [expectation], timeout: Constants.expectationTimeout)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see expectationTimeout has been with us for a long time, but 10 seconds feels extreme. Having a specific number of seconds for timeout is arbitrary, but if we change it to 1 second the test suite still passes, and around 0.1-0.5 starts to become flaky.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This constant is used almost everywhere in the unit tests, so if I need a custom timeout for these methods, I would prefer to set it individually rather than change it globally. This is because tests might still pass on your local machine but could require more time on the CI server. In any case, for most methods where it is applied, this value simply acts as the timeout limit. However, if a test is successful, it typically doesn't require the full duration of this timeout to complete. I would make a separate PR to change the timeout since it impacts hundreds of tests


// Then
wait(for: [expectation1, expectation2, expectation3], timeout: 5.0)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as with the above tests, if we remove the syncqueue changes to make the calls single-threaded this test still passes, so it's unlikely that's testing the changes in the way we want.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this either. How about using DispatchQueue.concurrentPerform like this suggestion?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried it, implementing it in this way:

func test_thread_safety_for_readOnlyEntity() {
    // Given: A sample storage account is inserted and saved.
    let storageAccount = storageManager.insertSampleAccount()
    viewContext.saveIfNeeded()

    // And: An EntityListener is set up with the readOnly entity being the inserted storage account.
    let listener = EntityListener(viewContext: viewContext, readOnlyEntity: storageAccount.toReadOnly())

    // When: Multiple threads access the readOnlyEntity property concurrently.
    let iterations = 100
    let expectation = self.expectation(description: "Concurrent access")

    DispatchQueue.global(qos: .userInitiated).async {
        DispatchQueue.concurrentPerform(iterations: iterations) { index in
            print("Iteration \(index)")
            _ = listener.readOnlyEntity
        }
        expectation.fulfill()
    }

    // Then: If no crashes or exceptions occurred, the test is considered successful.
    wait(for: [expectation], timeout: 10.0)
    XCTAssert(true)
}

But no luck, the test still passes. I would say that it's probably also "normal"? Since it's quite difficult to replicate thread safety issues?

Comment on lines 77 to 128
let container = NSPersistentContainer(name: name, managedObjectModel: modelsInventory.currentModel)
container.persistentStoreDescriptions = [storeDescription]

let migrationDebugMessages = migrateDataModelIfNecessary(using: container.persistentStoreCoordinator)
let container = NSPersistentContainer(name: name, managedObjectModel: modelsInventory.currentModel)
container.persistentStoreDescriptions = [storeDescription]

container.loadPersistentStores { [weak self] (storeDescription, error) in
guard let `self` = self, let persistentStoreLoadingError = error else {
return
}
let migrationDebugMessages = migrateDataModelIfNecessary(using: container.persistentStoreCoordinator)

DDLogError("⛔️ [CoreDataManager] loadPersistentStore failed. Attempting to recover... \(persistentStoreLoadingError)")

/// Remove the old Store which is either corrupted or has an invalid model we can't migrate from
///
var persistentStoreRemovalError: Error?
do {
try container.persistentStoreCoordinator.destroyPersistentStore(at: self.storeURL,
ofType: storeDescription.type,
options: nil)
} catch {
persistentStoreRemovalError = error
}

/// Retry!
///
container.loadPersistentStores { [weak self] (storeDescription, underlyingError) in
guard let underlyingError = underlyingError as NSError? else {
container.loadPersistentStores { [weak self] (storeDescription, error) in
guard let `self` = self, let persistentStoreLoadingError = error else {
return
}

let error = CoreDataManagerError.recoveryFailed
DDLogError("⛔️ [CoreDataManager] loadPersistentStore failed. Attempting to recover... \(persistentStoreLoadingError)")

/// Remove the old Store which is either corrupted or has an invalid model we can't migrate from
///
var persistentStoreRemovalError: Error?
do {
try container.persistentStoreCoordinator.destroyPersistentStore(at: self.storeURL,
ofType: storeDescription.type,
options: nil)
} catch {
persistentStoreRemovalError = error
}

/// Retry!
///
container.loadPersistentStores { [weak self] (storeDescription, underlyingError) in
guard let underlyingError = underlyingError as NSError? else {
return
}

let error = CoreDataManagerError.recoveryFailed
let logProperties: [String: Any?] = ["persistentStoreLoadingError": persistentStoreLoadingError,
"persistentStoreRemovalError": persistentStoreRemovalError,
"retryError": underlyingError,
"appState": UIApplication.shared.applicationState.rawValue,
"migrationMessages": migrationDebugMessages]
self?.crashLogger.logFatalErrorAndExit(error,
userInfo: logProperties.compactMapValues { $0 })
}

let logProperties: [String: Any?] = ["persistentStoreLoadingError": persistentStoreLoadingError,
"persistentStoreRemovalError": persistentStoreRemovalError,
"retryError": underlyingError,
"appState": UIApplication.shared.applicationState.rawValue,
"migrationMessages": migrationDebugMessages]
self?.crashLogger.logFatalErrorAndExit(error,
userInfo: logProperties.compactMapValues { $0 })
self.crashLogger.logMessage("[CoreDataManager] Recovered from persistent store loading error",
properties: logProperties.compactMapValues { $0 },
level: .info)
}

let logProperties: [String: Any?] = ["persistentStoreLoadingError": persistentStoreLoadingError,
"persistentStoreRemovalError": persistentStoreRemovalError,
"appState": UIApplication.shared.applicationState.rawValue,
"migrationMessages": migrationDebugMessages]
self.crashLogger.logMessage("[CoreDataManager] Recovered from persistent store loading error",
properties: logProperties.compactMapValues { $0 },
level: .info)
}

return container
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This diff seems unnecessary and just adds mental overhead to review the PR/find changes in the history.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This diff seems unnecessary and just adds mental overhead to review the PR/find changes in the history.

I think there was a wrong indentation in the code, that has been like that for years

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the updated indent is incorrect, can you revert that please?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check if they have been fixed now?

Yosemite/YosemiteTests/Tools/EntityListenerTests.swift Outdated Show resolved Hide resolved
@pmusolino
Copy link
Member Author

Thanks, @iamgabrielma for the review!

What would be the trade-offs of following this approach?
My understanding is that one of the trade-offs for safety would be a performance hit due to the overhead of queue synchronisation (which should be fine in our case right now, but we should find a way to measure it).

My main concern is that while we would be sure to run all these processes asynchronously in the syncQueue to assure safety, this would also block the thread until the operation is completed, isn't it? Since our action dispatcher always runs in the main thread by design, would upsert-related actions potentially be a source of deadlocks?

As you said, there is a trade-off. Using syncQueue.sync to ensure thread safety introduces synchronization overhead. Every time a shared resource is accessed, it must be synchronized, which can slow down operations. And, if there are multiple tasks accessing the syncQueue, there might be delays due to contention, as tasks will be executed serially.
But at the same time, there are no such heavy operations we do on the database that we consider a problem right now.

Wanting to be more pragmatic, the main problem I want to solve all the way, is thread safety, which must be guaranteed by us, because by default each NSManagedObjectContext is not, and should always run in the same thread or queue.

To address the problem and determine if it's truly a thread safety issue, we could also use syncQueue.async (note the 'a'). By doing this, we ensure that the context, even when nested within a different thread or queue, operates within a specific queue for all Core Data-related activities, but it requires a lot more work to make everything async (a nightmare).

PS: discovered for instance, DefaultStoresManager has its own queue, and if we also perform database access within it, this means that the database access and write operations would not be thread-safe since they are executed in different queues. 💥 But also other classes, like OrderStore or some classes in Alamofire use different queues, which can generate this type of issue if we don't make this change.

@jaclync jaclync modified the milestones: 18.8, 18.9 May 24, 2024
childManagedObjectContext.parent = persistentContainer.viewContext
childManagedObjectContext.mergePolicy = NSMergeByPropertyObjectTrumpMergePolicy
return childManagedObjectContext
return syncQueue.sync {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using DispatchQueue and closures, how about making CoreDataManager an actor?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another possibility, but do you see a real benefit over DispatchQueue? I'm concerned that making this class an actor might entail many changes elsewhere.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A benefit is to have a more modern syntax with no requirement for using closures, making the code more readable. I understand the potential side effects of changes required at call sites though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe, and let me know if you agree, that if this PR is merged and it fixes the crashes on Core Data, it's worth investing some time later in migrating it to actor, unless the impact is huge.

@hafizrahman hafizrahman modified the milestones: 18.9, 19.0 May 31, 2024
Copy link
Contributor

@itsmeichigo itsmeichigo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @pmusolino - I'm not confident about the changes making all access to the storage sync instead of async due to the potential hang this could cause, so I cannot approve this PR. I'll ping random folks here for more opinions on this @selanthiraiyan @jaclync @joshheald.

RELEASE-NOTES.txt Outdated Show resolved Hide resolved
@selanthiraiyan selanthiraiyan modified the milestones: 19.0, 19.1 Jun 7, 2024
@Ecarrion Ecarrion modified the milestones: 19.1, 19.2 Jun 14, 2024
guard let note = ManagedObjectsDidChangeNotification(notification: notification) else {
return
}
queue.sync {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ Do we also need a queue here? i.e. Do we have to execute these callbacks synchronously?

@Ecarrion Ecarrion modified the milestones: 19.2, 19.3 Jun 21, 2024
@malinajirka malinajirka modified the milestones: 19.3, 19.4 Jun 28, 2024
@malinajirka
Copy link
Contributor

I have update the target milestone to 19.4 as part of the release process for 19.3.

@iamgabrielma iamgabrielma removed this from the 19.4 milestone Jul 5, 2024
@dangermattic
Copy link
Collaborator

1 Warning
⚠️ PR is not assigned to a milestone.

Generated by 🚫 Danger

@iamgabrielma
Copy link
Contributor

I'll remove the tag for the moment until the PR is ready for a specific milestone.

@@ -15,6 +15,9 @@ public final class CoreDataManager: StorageManagerType {

private let modelsInventory: ManagedObjectModelsInventory

// A dispatch queue for synchronizing access to shared attributes
private let syncQueue = DispatchQueue(label: "com.automattic.woocommerce.CoreDataManager.syncQueue")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This queue is used to prevent concurrent access to the read context and write context themselves. They can still be used to perform concurrent Core Data changes.

If I understand it correctly, you'd want synchronized data writing to prevent database corruption. I don't think this new serial queue will help that.

NotificationCenter.default.post(name: .StorageManagerDidResetStorage, object: self)
DDLogVerbose("💣 [CoreDataManager] Stack Destroyed!")
NotificationCenter.default.post(name: .StorageManagerDidResetStorage, object: self)
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there is a potential dead lock here.

Imagine reset is called on the main thread. The main thread now waits on the syncQueue to finish running. The syncQueue waits viewContext to complete. If the viewContext is bound to the main context, it waits for the main thread, which is blocked.

The above scenario is pretty easy to set up in unit tests. I'd suggesting adding an unit test to see if there is a dead-lock here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: crash The worst kind of bug. type: enhancement A request for an enhancement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error Domain=NSSQLiteErrorDomain Code=11 "(null)" UserInfo={NSFilePath=/var/mobile/Containers/Dat...