Very High 95% latency #40

Open
DavidM-D opened this issue Oct 25, 2023 · 18 comments

@DavidM-D

We've been observing quite high relayer latency when creating and running add-key on FastAuth.

This is reflected in this Grafana dashboard, which shows a fairly consistent 95th percentile latency of 20 seconds, higher than I'd expect.

Do we know what's causing this and what could be done to improve it?

@anthony-near
Collaborator

@anshal-savla do you know if something with our deployment infra could be causing this? When I run the relayer locally, which involves making calls to remote RPCs, it never takes more than 10 seconds, whereas the relayer deployed on GCP should have RPCs within the same availability zone.

@anshal-savla
Contributor

@anthony-near Nothing in our infra setup should be causing such high latency.

@DavidM-D
Author

Any idea what might be causing it then @anthony-near?

@firatNEAR
Contributor

Could we maybe get some tracing and metrics on the relayer to track the latency?
I suspect that it might also be the RPC causing this delay, so it would be good to understand the time spent in the relayer itself.
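
For what it's worth, here is a minimal sketch of what per-request latency logging could look like, assuming an axum 0.7 based service with the `tracing` crate. The framework, route, and handler names are assumptions on my part, not the relayer's actual code:

```rust
use axum::{extract::Request, middleware::Next, response::Response};
use std::time::Instant;

/// Logs method, path, status, and elapsed time for every request.
pub async fn log_latency(req: Request, next: Next) -> Response {
    let method = req.method().clone();
    let path = req.uri().path().to_owned();
    let start = Instant::now();

    let response = next.run(req).await;

    tracing::info!(
        %method,
        %path,
        status = response.status().as_u16(),
        latency_ms = start.elapsed().as_millis() as u64,
        "request completed"
    );
    response
}

// Attached to the router roughly like this (route and handler names are illustrative):
// let app = Router::new()
//     .route("/relay", post(relay_handler))
//     .layer(axum::middleware::from_fn(log_latency));
```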

@anthony-near
Collaborator

I ran a profiler locally and 70%+ of CPU time (as distinct from wall time, which includes network requests) is spent making RPC requests and formatting the responses. The only network calls made are RPC requests; locally they were the source of latency and bottlenecks. I also suspect RPC request latency is the cause here.

@anthony-near
Collaborator

I will try to do more profiling to include network calls and go through the logs on prod to verify for sure that it is RPC.

@anthony-near
Collaborator

[flamegraph image: relayer_flamegraph_cpu_time]
[flamegraph image: relayer_flamegraph_cpu_time_pt2]

As you can see from the flame chart profiling of a test run, the vast majority of the CPU time is spent on networking, aside from unwrapping an error in the test itself (relayer-53450093a1a2da8a`core::result::unwrap_failed).

@DavidM-D
Author

DavidM-D commented Nov 3, 2023

How long does this test take to run on the wall clock?
Are we seeing the latency that is causing the issues in your test setup or is it only observable in prod?

@anthony-near
Collaborator

anthony-near commented Nov 3, 2023

The test took 4.207 seconds of total wall clock time.

There's latency in the call to the RPC locally, but that's accounted for in the wall clock time and is nothing like what we see in the 99th percentile in prod. I'm not sure how the Grafana request latency metrics are collected, and I'm not entirely sure the methodology used there is accurate.

@anthony-near
Collaborator

I also just ran a load test where I made 100 serial (non-parallelized) requests hitting the /relay endpoint locally. The load test took 4 minutes and 50 seconds, giving an average response time of (4*60+50)/100 = 2.9 seconds.
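
For reference, the load test was roughly the following shape: a minimal sketch using a blocking reqwest client, with the local port and request body as placeholders (real /relay requests carry a Borsh-serialized payload), not the exact script I ran:

```rust
use std::time::Instant;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Requires reqwest with the "blocking" feature enabled.
    let client = reqwest::blocking::Client::new();
    let url = "http://127.0.0.1:3030/relay"; // assumed local port
    let payload: Vec<u8> = Vec::new(); // placeholder; real requests carry a Borsh-serialized body

    let start = Instant::now();
    for i in 0..100 {
        // Serial (non-parallelized) requests: each waits for the previous to finish.
        let resp = client.post(url).body(payload.clone()).send()?;
        println!("request {i}: status {}", resp.status());
    }

    let total = start.elapsed().as_secs_f64();
    println!("total: {total:.1}s, average: {:.2}s per request", total / 100.0);
    Ok(())
}
```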

@DavidM-D
Author

DavidM-D commented Nov 3, 2023

Hmm, then it doesn't seem to be capturing the latency we've observed through Grafana and the MPC service's metrics. Would it be possible to add logging so we can see which requests are taking so long?

@anthony-near
Collaborator

anthony-near commented Dec 1, 2023

Using the broadcast_tx_async RPC endpoint to send the txn and then calling the view-account-changes RPC endpoint (which supports querying updates for multiple account_ids in a single query) in a separate thread fixes the observed high edge-case latency #56 (comment). This will require a new /view-account-changes endpoint on the relayer that accepts multiple account_ids and returns the result.
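
A rough sketch of that flow over raw JSON-RPC is below. The RPC URL, base64 transaction, and account ids are placeholders, and the actual relayer implementation will likely differ (e.g. use an RPC client crate rather than hand-built requests):

```rust
use serde_json::{json, Value};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Requires reqwest with the "blocking" and "json" features enabled.
    let client = reqwest::blocking::Client::new();
    let rpc_url = "https://rpc.testnet.near.org"; // placeholder RPC endpoint

    // 1. Submit the signed transaction without waiting for execution;
    //    broadcast_tx_async returns the transaction hash immediately.
    let submitted: Value = client
        .post(rpc_url)
        .json(&json!({
            "jsonrpc": "2.0",
            "id": "dontcare",
            "method": "broadcast_tx_async",
            "params": ["<base64-encoded SignedTransaction>"]
        }))
        .send()?
        .json()?;
    println!("submitted, tx hash: {}", submitted["result"]);

    // 2. Separately (e.g. in another thread), query state changes for
    //    several accounts in a single call.
    let changes: Value = client
        .post(rpc_url)
        .json(&json!({
            "jsonrpc": "2.0",
            "id": "dontcare",
            "method": "EXPERIMENTAL_changes",
            "params": {
                "changes_type": "account_changes",
                "account_ids": ["alice.testnet", "bob.testnet"],
                "finality": "final"
            }
        }))
        .send()?
        .json()?;
    println!("account changes: {changes}");
    Ok(())
}
```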

@firatNEAR
Contributor

Did we test this out with the new release cut on devnet? Is this still an ongoing issue?

@DavidM-D
Author

This appears to be an ongoing issue. We're still seeing a 95th percentile of 20+ seconds. Observing the logs, it seems that most requests from FastAuth take 15-40 seconds, and the quicker requests are to the other endpoints, which seem to come in at a few ms (not quite sure what those other endpoints are doing). That's reflected in the Grafana dashboard, which shows the 50th percentile of requests at 5 ms, which is too fast to do anything on chain.

@firatNEAR
Contributor

I was under the impression that some of the requests hit async endpoints, which would explain the low-latency calls. AFAIK 'Create Account' calls were taking very long due to the atomic nature of the transaction (it shouldn't take 40 seconds though, I agree on that).

Have you guys also tried testing this on the testnet relayer? That one runs the latest release with the changes. cc: @anthony-near

@anthony-near
Collaborator

anthony-near commented Feb 16, 2024

The updated version that has tracing for request latencies has been deployed on the testnet dev env for ~1 week now. Here's a query for the request latency, which is usually milliseconds and occasionally (at the 99th percentile) 10-20 seconds at most: https://console.cloud.google.com/logs/query;query=resource.type%20%3D%20%22cloud_run_revision%22%0Aresource.labels.service_name%20%3D%20%22relayer-fastauth-testnet%22%0Aresource.labels.location%20%3D%20%22us-central1%22%0Aseverity%3E%3DDEBUG;cursorTimestamp=2024-02-16T21:08:53.862672Z;resultsSearch=httpRequest.latency;duration=P7D?project=pagoda-discovery-platform-prod

The /send_meta_tx_async endpoint has a max latency of 1.5s, with a median of a few ms. This is the most performant endpoint, but it's new in this release: https://console.cloud.google.com/logs/query;query=resource.type%20%3D%20%22cloud_run_revision%22%0Aresource.labels.service_name%20%3D%20%22relayer-fastauth-testnet%22%0Aresource.labels.location%20%3D%20%22us-central1%22%0Aseverity%3E%3DDEBUG%0ASEARCH%2528%22%60%2Fsend_meta_tx_async%60%22%2529;cursorTimestamp=2024-02-16T18:13:45.876595Z;resultsSearch=httpRequest.latency;duration=P7D?project=pagoda-discovery-platform-prod

The /send_meta_tx endpoint, the most stable endpoint for relaying transactions, has a max latency of 7.6s, in line with my local testing: https://console.cloud.google.com/logs/query;query=resource.type%20%3D%20%22cloud_run_revision%22%0Aresource.labels.service_name%20%3D%20%22relayer-fastauth-testnet%22%0Aresource.labels.location%20%3D%20%22us-central1%22%0Aseverity%3E%3DDEBUG%0ASEARCH%2528%22%60%2Fsend_meta_tx%60%22%2529%0A%2528httpRequest.latency%3E%3D%227606ms%22%20OR%0AjsonPayload.latency%3E%3D%227606ms%22%2529;cursorTimestamp=2024-02-16T18:37:59.099269Z;resultsSearch=httpRequest.latency;duration=P7D?project=pagoda-discovery-platform-prod

The /create_account_atomic endpoint requests all take ~22s, which is somewhat expected since we need to force that txn through and it requires multiple steps and RPC calls: https://console.cloud.google.com/logs/query;query=resource.type%20%3D%20%22cloud_run_revision%22%0Aresource.labels.service_name%20%3D%20%22relayer-fastauth-testnet%22%0Aresource.labels.location%20%3D%20%22us-central1%22%0Aseverity%3E%3DDEBUG%0ASEARCH%2528%22%60%2Fcreate_account_atomic%60%22%2529;cursorTimestamp=2024-02-16T09:50:56.138514Z;resultsSearch=httpRequest.latency;duration=P7D?project=pagoda-discovery-platform-prod

The /relay endpoint yielded some surprising results, and it may be a culprit for extra latency, as this endpoint is more or less deprecated (Borsh serialization/deserialization adds unnecessary time). The 2xx responses on this endpoint were fine and in line with the /send_meta_tx endpoints, but the 5xx responses took >20s: https://console.cloud.google.com/logs/query;query=resource.type%20%3D%20%22cloud_run_revision%22%0Aresource.labels.service_name%20%3D%20%22relayer-fastauth-testnet%22%0Aresource.labels.location%20%3D%20%22us-central1%22%0Aseverity%3E%3DDEBUG%0ASEARCH%2528%22%60%2Frelay%60%22%2529%0A%2528httpRequest.latency%3E%3D%2219737ms%22%20OR%0AjsonPayload.latency%3E%3D%2219737ms%22%2529;cursorTimestamp=2024-02-16T14:08:47.349072Z;resultsSearch=httpRequest.latency;duration=P7D?project=pagoda-discovery-platform-prod

There were no calls to the /get_allowance, /update_allowance, or /update_all_allowances endpoints, which purely interact with Redis and don't touch the RPC, so I'm not concerned about them.

I created a dashboard with a breakdown of all requests across the Cloud Run services here: https://console.cloud.google.com/monitoring/dashboards/builder/0b2ba8d8-8810-447e-9e1e-ce44a78219fb;duration=P7D?project=pagoda-discovery-platform-prod. Note that there are inaccuracies in the reporting of the 99th percentile times: the max time I saw for any response was 29s, not the 300+ seconds reported on the dashboard's 99% 5xx failures line. I don't think that line is specific to the relayer, since all Cloud Run request latencies are included, but I manually went through all the logs from the past week and the max response time I saw was 29s, which is still long, but it was on a deprecated endpoint and a 5xx response.

@s-n-park

Not 100% sure if this is related to the relayer, but in case it is, I wanted to share a sample video of trying to log in via FastAuth and sitting on a "recovering account" screen for 15 seconds.

https://drive.google.com/file/d/10d6YgLzkU0I-mAcLTviUPyI8eCmcFGMz/view
