-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Firestore.terminate hangs, and never calls callback #12522
Comments
I couldn't figure out how to label this issue, so I've labeled it for a human to triage. Hang tight. |
Not sure if this is relevant, but I noticed that the tcp connection stops receiving any bytes after this condition is triggered... wondering if there's something here that's triggering some TCP blockage because of the amount of network happening... |
I think I have an idea of why it starts happening, but can't yet explain it in code... what happens is that after some event, we'll create some changelog documents on a collection like "/changelog/user_id", which might contain 100s or 1000s of documents created in a very short span of time, but outside of any database transaction, so it looks like it gets 100s of serial updates of 1 document each... for some reason, this seems to lock firestore... if I remove this changelog subscription, I can't reproduce the hang any more |
Hi @sergiocampama, thank you for filing this issue. Quick question, what were the steps when you produced this issue? Terminating firestore hangs, how is it related to " there's an overload of firestore document changes."? Are you terminating the firestore right after 100 documents mutation? |
Through some mechanism, we're creating/modifying 1000s of changelog documents. We have a listener for that collection of changelogs, which is updated 1 at a time, since the changelog documents are created from different cloud functions (they're onModified triggers for another set of documents). At some point during this listening, it seems firestore locks up and stops receiving any kind of updates. We notice this lack of updates because the UI does not reflect that a document which should have been deleted, is not removed from the screen. It is at this point that if you call Firestore.terminate, that since it is somehow locked, the terminate's callback won't get called, and it will get stuck on that backtrace shown above. |
My colleague has pointed out it might be because Could you please try changing the
|
Made the change locally, but it didn't help, still getting stuck. I also replaced the cacheSettings with |
it's not looking like a dispatch issue, more like one of the GrpcStreams is getting blocked |
Ruling out the persistence, the next suspect is the grpc. In the next ios sdk release we are doing a significant upgrade of the grpc dependency, which may fix the issue. Would you like to try again after the new release? Or, you can try applying #12398 locally to upgrade grpc in advance. Could you please provide debug log as well? It might give us some clue. |
No dice on updated gRPC. I've enabled logging, and it's a bit massive and contains some sensitive information for a public setting, and I also don't think it provides much value, since there are no errors, just a massive list of watch/unwatch/response logs... I could share it if you start an email thread with me at sergiocampama at gmail dot com... but I did found this... when the connection seems to hang because of the massive number of document updates, I can still send requests to the backend to listen to new documents (so it's not a dispatch queue issue), but the backend does not reply with a response for the initial document read nor the registered target_id, so it's starting to look like the WatchStream from Firestore's Datastore is getting hung on the backend side (could be head of line blocking on that h2 stream)... if this is indeed a possibility, it seems harder to debug from your side... not sure if I can give you any identifier for the connection so that someone could look more into it... it would be great if the firestore grpc channel had some sort of ping mechanism to detect that the connection is hanging and to try to force recover it by reregistering the target_ids, so that we don't lose uptime just on that single connection blocking... if not, force closing the watch_stream and write_stream from remote_store would also help so that we can at least recover by letting the user perform the action that would involve firestore's termination for now we're mitigating by trying to reduce the number of document reads to not trigger this hang, but given we don't have a root cause yet, we fear it might still be triggered in the future. |
Actually, looking at the logs for the last response log, I see this just after it:
(we're using 10.22 Firestore but 10.17 FirebaseCore for some reason) |
setting a breakpoint in datastore.cc's PollGrpcQueue to when is there a way to get more information on why the grpc completion queue failed to write to the stream? |
I'm fairly confident that when we get |
just tested that I could continue to use the write stream by creating new firestore documents after getting the ideally, we should be able to find the root cause of the write failure, but sans that, allowing firestore to recover from this state without requiring an app restart should be the higher priority IMO |
One final update: I was able to create a small repro case that results in the same behavior. This overloads the WatchStream until there's a read error ( not sure if this is the same failure mode as what I'm seeing above, but it looks similar in that if the watch or write streams are overloaded, firestore gets corrupted with no way of recovering |
Hi @sergiocampama, thank you for the repro app. I am able to reproduce the "operation of type 2 failed" error. |
awesome, thanks for confirming... let me know when you get a chance to review with the team, to see if there are any quick workarounds that we can use |
ok, found a workaround that works for us for now... if I replace
I can avoid the stream crapping out.... it won't get the listen updates, but I can pair this with a polling read every n seconds, and if it fails, I can call terminate, which won't lock up any more. |
In fact, with that code above, which looks terrible since it's calling on_success for the error case, I'm seeing that the connection continues working normally, it never locks up, and it never locks up the terminate call, and it sort of recovers on its own... sure, I'll lose some events, but in those cases we can just call terminate and restart the listeners... I disagree with "all errors are unrecoverable" |
tracing to where the ok success status comes, was able to find this error:
after that, only a single bit is propagated to signal the failure of the completion. |
Figured out how to enable grpc tracing, was able to see that we are indeed getting a RST_STREAM from the backend with code 0 here is the trace logs: https://gist.githubusercontent.com/sergiocampama/c7ebecd21f42ca948358cd45bc1d5a77/raw/0fd3b4e2e8ff8c5639c9c3bca48be1be9a7d8f1e/gistfile1.txt what I was able to find was that the watch stream does indeed get a RST_STREAM message, and then we restart that stream, as you can see there are multiple what I think is happening is that on overload, the grpc++ batch processor continues sending messages to the backend, after the stream has closed... so it's not detecting that event and canceling the subsequent messages, that generates a discardable error, but an error nonetheless, and that indeed is taken as a (wrong) signal that the stream is broken and that we need to close it, which in turns puts Firestore into a broken state where it can not be recovered from |
Created a gRPC++ issue here: grpc/grpc#36115 |
Created a fork with this change for now [1]. Hopefully you can all reproduce and provide a more robust fix over time, but this is sufficient for our needs for now. |
This is amazing. Thank you for sharing the debugging process, it is very informative! I will bring this to our team discussion and keep you updated. |
@sergiocampama you deserve an award for this investigation. |
Hi, I'm from the gRPC team. I don't completely understand the issue here, but from a gRPC perspective - if a write on a stream/RPC is failing, there is no point in writing to that stream again. Whether an RPC is done or not is decided by what's happening on reads and whether a status is available for that RPC. In this case, a Please feel free to reach out to me (or schedule a meeting) in case there are any questions as to how gRPC works. |
Thanks @yashykt, I was basing my batching assumption on the error being generated in the I'll let the Firestore experts figure it out then. |
Description
There are certain scenarios where I'll call Firestore's terminate function, and it will hang, never calling the callback. I haven't been able to create a reproducer, but can consistently get my product's code to get into this state (not shareable for obvious reasons).
This appears to happen when there's an overload of firestore document changes. I am able to see that the
com.google.firebase.firestore
queue's target queue iscom.apple.root.default-qos.overcommit
though not sure what this means. It looks like the queue starts with this value set as the target queue, could be a red herring.Here's a screenshot of the backtrace of where firestore hangs... It looks like it's unable to close the gRPC watch stream for the RemoteStore, stuck on WaitUntilOffQueue.
I'm basically stuck trying to debug this to see where things go wrong and what I can do to avoid this. I'm very unfamiliar with this codebase, so I don't even know what to look for to find where the code is getting stuck. Any help with pointers on where to look or what to check would be very appreciated.
Reproducing the issue
Can't share reproduction steps, as I can't share the relevant code.
Firebase SDK Version
10.22
Xcode Version
15.3
Installation Method
CocoaPods
Firebase Product(s)
Firestore
Targeted Platforms
iOS
Relevant Log Output
If using Swift Package Manager, the project's Package.resolved
None
If using CocoaPods, the project's Podfile.lock
Can't share
The text was updated successfully, but these errors were encountered: