INSERTs can severely slow down backends on startup (v17-only) #10281
Comments
@MMeent mentioned that traces look like everything is just slow because of backpressure (compute is doing a lot of writes), so it's not fully stuck, just slow. It's still interesting why it reproduces only with PG 17 and why we can slow everything down with such a simple load. Moving to backlog for now. |
Which backpressure do you mean - compute or PS side? Compute-side backpressure blocks only writers, while PS-side blocks only readers. Unfortunately, I was not able to reproduce the problem locally - in my case the connect delay never exceeds one second. |
That's interesting. I could reproduce this with:
results in:
By the way, after the script exits, I can see a backend stuck on startup for much longer than 20 seconds:
|
Maybe backpressure throttling happens in a context where some buffer is locked, so starting backends are blocked on this buffer. If you are able to reproduce it (I still cannot), can you please once again capture stack traces of the involved backends. |
Yes, of course, though I've restarted the script, so the pids are different:
This time we see no "backpressure" in the backend title, but it changes over time.
So the backtrace of that backend is:
|
Thank you. My hypothesis is that PS is waiting for an LSN which was not delivered yet, because the backend that initiated this transaction was blocked by backpressure. But to prove it, it is necessary to inspect the SK WAL and the LSN which 2240648 is waiting for. |
One more run:
The backtrace of the first "starting" backend:
I'll try to make a Dockerfile for reproducing this... |
And are you able to reproduce it using a release build? |
Yes, I've reproduced it with release too:
Sorry for the delay -- it took me a while to understand that "cargo neon {command}" doesn't work with the release build. |
And one more time:
with the backtrace of the backend waiting (inside check_hba()):
|
Yes, those are waiting for Pageserver to respond. If you can figure out why Pageserver doesn't respond fast enough then that'd be nice, but based on the info provided here this doesn't seem like it is a Compute issue. |
Yeah, with debug logging added, log_min_messages = DEBUG3, and:
I can see the following for a slow-starting backend:
So reading each block by this backend takes about 4 seconds... |
So it looks like the problem is on the PS side. Unfortunately, I am still not able to reproduce it locally. Can you please send me the page server log? Maybe there is some useful information explaining the lag of PS... My current hypothesis is the following:
4Mb of WAL per second seems too small. Also, it is strange to me that this time is so stable. Are you sure that you are using release binaries of Neon? I never started Neon locally using
This is the CPU usage on my system:
As you can see, the backend consumes 100% CPU and the page server just 30% (but it may be more bounded by disk IO). |
Having analyzed compute logs for v16 and v17, I see a significant difference:
v17 (here 67822fad.36374f, 67822fba.36375c, 67822fc5.363766 - backends executing "SELECT 1", with the latter stuck on startup):
That is, the latter backend starts jumping over large distances, and this results in waiting for the requested LSN to be received by the pageserver (each wait takes more than 3 seconds due to backpressure). And one more run with additional logging:
That is, a backend starts normally when it can get the LSN from the LSN cache, but when it can't (because of cache overflow), it gets a larger LSN value, which necessitates waiting for the page to go through safekeeper and pageserver... And while the backend is waiting for the requested page, the LSN cache is overflowed again (by the INSERT'ing backend), so the backend gets into the same situation when requesting the next page. To check this theory, I've tried 'lsn_cache_size = 10485760' with v17, and the test passed. At the same time, on v16 the test passes even with lsn_cache_size = 16384. |
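To make that mechanism concrete, here is a minimal, self-contained C model of the behaviour described above; the cache size, eviction policy and names are invented for illustration and are not the actual Neon implementation:

```c
#include <stdint.h>
#include <stdio.h>

#define LWLSN_CACHE_SIZE 4          /* tiny on purpose, to show eviction */

typedef uint64_t XLogRecPtr;

typedef struct {
    uint32_t   block;               /* key: block number                 */
    XLogRecPtr lsn;                 /* last LSN that modified this block */
    int        valid;
} LwLsnEntry;

static LwLsnEntry cache[LWLSN_CACHE_SIZE];
static XLogRecPtr maxLastWrittenLsn = 0;    /* fallback for evicted blocks */

/* Record a write; when an entry is evicted, its LSN is folded into the
 * conservative fallback so correctness is preserved. */
static void set_last_written(uint32_t block, XLogRecPtr lsn)
{
    int slot = block % LWLSN_CACHE_SIZE;    /* trivial eviction policy */
    if (cache[slot].valid && cache[slot].block != block &&
        cache[slot].lsn > maxLastWrittenLsn)
        maxLastWrittenLsn = cache[slot].lsn;
    cache[slot] = (LwLsnEntry){ block, lsn, 1 };
}

/* On a read: a cache hit yields the precise LSN of the block, a miss falls
 * back to maxLastWrittenLsn, which may be far ahead and causes a long wait. */
static XLogRecPtr get_request_lsn(uint32_t block)
{
    int slot = block % LWLSN_CACHE_SIZE;
    if (cache[slot].valid && cache[slot].block == block)
        return cache[slot].lsn;
    return maxLastWrittenLsn;
}

int main(void)
{
    for (uint32_t b = 0; b < 16; b++)       /* heavy writer overflows cache */
        set_last_written(b, 1000 + b);

    printf("cached block 15 -> request at LSN %llu\n",
           (unsigned long long) get_request_lsn(15));
    printf("evicted block 0 -> request at LSN %llu\n",
           (unsigned long long) get_request_lsn(0));
    return 0;
}
```

In this toy model the evicted block has to be requested at the conservative fallback LSN, which is ahead of its real last-written LSN, so the pageserver must ingest more WAL before it can answer; that is the extra wait the slow-starting backends are seeing.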
By the way, maybe it would make sense to get rid of the excessive locking at: I also could not find where the previous function GetLastWrittenLSN() (the non-vectorized version) is used in v17. Maybe it should be removed? |
Backpressure should not slow down My understanding is that these 3-second delays are not caused by backpressure, but by PS lagging ~3 seconds behind in applying WAL. So for some reason PS is slow and cannot catch up with the inserts (not reproduced in my case). This statement can be proved (or disproved) by changing
I agree with that. But in any case, overflow of the LwLSN cache becomes critical only because PS is lagging. Only in this case can waiting for the most recent LSN take a lot of time. So the original problem is why PS is so slow. It is expected for a debug build, but not for a release build. Maybe slow disks and some expensive background operations (compaction, ...) can cause it? A separate question is why it happens only on pg17. Is the LwLSN cache overflown faster on pg17 (maybe because of more page evictions)? Or is pg17 applying WAL more slowly than pg16, which causes PS to lag? I think that looking at metrics ( |
I tried to match safekeeper and pageserver logs, and what I'm seeing is:
And I see similar delays with v16 as well. So this is a part of the issue, but not the difference between versions.
I think the culprit here is the v17-only function SetLastWrittenLSNForBlockv(), which advances maxLastWrittenLsn. (In v16, only SetLastWrittenLSNForBlockRange() can move it.) (I'm still investigating this issue with a debug build, where it reproduces easily; I guess it can be reproduced with a release build too, with some repro adjustments/increased load.) |
|
## Problem

See #10281. pg17 performs an extra lock/unlock operation when fetching LwLSN.

## Summary of changes

Perform all lookups under one lock, moving initialization of not-found keys to a separate loop. Related Postgres PR: neondatabase/postgres#553

Co-authored-by: Konstantin Knizhnik <[email protected]>
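A rough sketch of the locking change this PR describes, using a plain pthread mutex and a stubbed cache probe; the real code uses Postgres LWLocks and the Neon LwLSN hash, so treat the names and signatures here as assumptions:

```c
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

typedef uint64_t XLogRecPtr;

static pthread_mutex_t lwlsn_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stub standing in for the real LwLSN hash probe (called with lock held). */
static int lwlsn_cache_probe(uint32_t block, XLogRecPtr *lsn)
{
    (void) block;
    (void) lsn;
    return 0;                               /* pretend every lookup misses */
}

/* Before: one lock/unlock round trip per requested block. */
static void get_lwlsn_per_key(const uint32_t *blocks, XLogRecPtr *lsns,
                              size_t n, XLogRecPtr fallback_lsn)
{
    for (size_t i = 0; i < n; i++) {
        pthread_mutex_lock(&lwlsn_lock);
        if (!lwlsn_cache_probe(blocks[i], &lsns[i]))
            lsns[i] = fallback_lsn;
        pthread_mutex_unlock(&lwlsn_lock);
    }
}

/* After: all lookups happen under a single lock acquisition; blocks that
 * were not found are initialized in a separate loop outside the lock. */
static void get_lwlsn_batched(const uint32_t *blocks, XLogRecPtr *lsns,
                              size_t n, XLogRecPtr fallback_lsn)
{
    pthread_mutex_lock(&lwlsn_lock);
    for (size_t i = 0; i < n; i++)
        if (!lwlsn_cache_probe(blocks[i], &lsns[i]))
            lsns[i] = 0;                    /* mark as "not found" for now */
    pthread_mutex_unlock(&lwlsn_lock);

    for (size_t i = 0; i < n; i++)          /* second pass: fill in misses */
        if (lsns[i] == 0)
            lsns[i] = fallback_lsn;
}

int main(void)
{
    uint32_t blocks[3] = { 1, 2, 3 };
    XLogRecPtr lsns[3];
    get_lwlsn_per_key(blocks, lsns, 3, 42);
    get_lwlsn_batched(blocks, lsns, 3, 42);
    return 0;
}
```

The per-key variant takes and releases the lock n times per batch; the batched variant takes it once, which matters when a vectorized read asks for many blocks at a time.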
@alexanderlaw, can it still be reproduced after a039f83? |
Yes, I've reproduced it on 624a507 with the same recipe: #10281 (comment)
|
@alexanderlaw can we clarify the blocking part? I.e., if you don't put any timeout, will it succeed eventually? So far, it looks like a backpressure/ingestion speed problem, which is bad, but shouldn't result in blocking. |
Yes, it's not an indefinite lock; it's just that every read request takes 3+ seconds, so even backend startup/authentication can take more than a minute. I guess any follow-up activity of the backend would suffer from the page round-trips and backpressure while the load is active, so I'd classify this as a performance issue. |
Is it release or debug build? |
That recipe contains |
As I wrote many times, for a debug build this is expected behaviour: the debug version of PS is too slow to catch up with compute. In any case, not being able to reproduce it locally, I do not know what else I can do with this ticket. |
@knizhnik, please take a look at the issue title: I've created it primarily because this behavior is specific to v17.
Please also look at the answer I gave upthread:
Perhaps you could try the repro attached above (literally, without any changes like debug->release). |
There seems to be no surprise here: if PS is lagging, i.e. not able to apply WAL fast enough, then certainly there will be delays between WAL being sent by SK and received by PS, just because PS is busy with apply and is not requesting the next portion of WAL.
Oh, good catch. I had not noticed it.
And it really can cause a significant performance degradation for pg17 because of useless waits for LSN at PS. |
Are you able to reproduce it with #10474? |
No, the issue is not reproduced on branch fix_set_lwlsnv, commit ea41da4. Thanks for working on the fix! |
…okv (#10474)

## Problem

See #10281. `SetLastWrittenLSNForBlockv` is assigning max(lsn) to `maxLastWrittenLsn`, while it should contain only the max LSN not present in the LwLSN cache. This causes unnecessary waits in PS.

## Summary of changes

Restore the status quo for pg17. Related Postgres PR: neondatabase/postgres#563

Co-authored-by: Konstantin Knizhnik <[email protected]>
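As a rough illustration of the change this PR describes (names and the cache API are simplified stand-ins, not the actual neondatabase/postgres patch), the difference is when the conservative fallback LSN is allowed to advance:

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t XLogRecPtr;

static XLogRecPtr maxLastWrittenLsn = 0;    /* conservative fallback LSN */

/* Stub for the LwLSN cache insert: returns 1 if the (block, lsn) pair was
 * stored, 0 if there was no room and the entry had to be dropped. */
static int lwlsn_cache_put(uint32_t block, XLogRecPtr lsn)
{
    (void) block;
    (void) lsn;
    return 1;                               /* pretend everything fits */
}

/* pg17 behaviour before the fix: every vectorized update advances the global
 * fallback, so every later cache miss has to wait for this (large) LSN. */
static void set_lwlsn_for_blocks_before(const uint32_t *blocks, size_t n,
                                        XLogRecPtr lsn)
{
    for (size_t i = 0; i < n; i++)
        lwlsn_cache_put(blocks[i], lsn);
    if (lsn > maxLastWrittenLsn)
        maxLastWrittenLsn = lsn;
}

/* Intended behaviour: only an LSN that could not be kept in the cache may
 * raise the fallback, restoring the pre-v17 status quo. */
static void set_lwlsn_for_blocks_after(const uint32_t *blocks, size_t n,
                                       XLogRecPtr lsn)
{
    int all_cached = 1;
    for (size_t i = 0; i < n; i++)
        if (!lwlsn_cache_put(blocks[i], lsn))
            all_cached = 0;
    if (!all_cached && lsn > maxLastWrittenLsn)
        maxLastWrittenLsn = lsn;
}

int main(void)
{
    uint32_t blocks[2] = { 7, 8 };
    set_lwlsn_for_blocks_before(blocks, 2, 100);   /* fallback jumps to 100 */
    set_lwlsn_for_blocks_after(blocks, 2, 200);    /* cached, fallback stays */
    return (int) (maxLastWrittenLsn == 100 ? 0 : 1);
}
```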
Fixed by #10474 |
The following (endless, or just long-running) script, when run against v17:
occasionally effectively blocks other backends on startup.
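The actual script is in the attached repro-hang.txt and is not reproduced here; a hypothetical libpq equivalent of the described INSERT load might look like the following (connection string, table name and row sizes are guesses, not the original script):

```c
#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    /* Assumed local Neon compute endpoint; adjust host/port/dbname as needed. */
    PGconn *conn = PQconnectdb("host=127.0.0.1 port=55432 dbname=postgres");
    if (PQstatus(conn) != CONNECTION_OK) {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    PGresult *res = PQexec(conn, "CREATE TABLE IF NOT EXISTS t(x text)");
    PQclear(res);

    /* Endless INSERT load: enough WAL to trigger backpressure and to keep
     * churning the compute-side last-written-LSN cache. */
    for (;;) {
        res = PQexec(conn,
                     "INSERT INTO t SELECT repeat('x', 1000) "
                     "FROM generate_series(1, 1000)");
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "insert failed: %s", PQerrorMessage(conn));
        PQclear(res);
    }
}
```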
With this additional script:
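Again, only a hypothetical stand-in for the attached script: a loop that keeps opening fresh connections, runs `SELECT 1`, and reports how long each connection attempt took (same assumed endpoint as in the load sketch above).

```c
#include <stdio.h>
#include <time.h>
#include <libpq-fe.h>

int main(void)
{
    for (;;) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        /* Each iteration starts a brand-new backend on the compute node. */
        PGconn *conn = PQconnectdb("host=127.0.0.1 port=55432 dbname=postgres");
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double connect_sec = (double) (t1.tv_sec - t0.tv_sec) +
                             (double) (t1.tv_nsec - t0.tv_nsec) / 1e9;

        if (PQstatus(conn) == CONNECTION_OK) {
            PGresult *res = PQexec(conn, "SELECT 1");
            PQclear(res);
        }
        printf("connect took %.1f s\n", connect_sec);
        PQfinish(conn);
    }
}
```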
I can see:
(Note the CPU utilization column.)
where 150080 is waiting inside:
and the backtrace of 150016:
The complete reproducer: repro-hang.txt.