Very High 95% latency #40

Open
DavidM-D opened this issue Oct 25, 2023 · 18 comments

@DavidM-D

We've been observing quite high relayer latency when creating and running add-key on FastAuth.

This is reflected in this Grafana dashboard, which shows a fairly consistent 95th percentile latency of 20 seconds, higher than I'd expect.

Do we know what's causing this and what could be done to improve it?

@anthony-near
Collaborator

@anshal-savla do you know if something with our deployment infra could be causing this? When I run the relayer locally, which involves making calls to remote RPCs, it never takes more than 10 seconds, whereas the relayer deployed on GCP should have RPCs within the same availability zone.

@anshal-savla
Contributor

@anthony-near Nothing in our infra setup should be causing such high latency.

@DavidM-D
Author

Any idea what might be causing it then @anthony-near?

@firatNEAR
Contributor

Could we maybe get some tracing and metrics on the relayer to track the latency?
I suspect that it might also be the RPC causing this delay, so it would be good to understand the time spent in the relayer itself.
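
For what it's worth, here is a minimal sketch of what per-request latency logging could look like, assuming an axum 0.7 based service with the `tracing` crate. The framework, route, and handler names are assumptions on my part, not the relayer's actual code:

```rust
use axum::{extract::Request, middleware::Next, response::Response};
use std::time::Instant;

/// Logs method, path, status, and elapsed time for every request.
pub async fn log_latency(req: Request, next: Next) -> Response {
    let method = req.method().clone();
    let path = req.uri().path().to_owned();
    let start = Instant::now();

    let response = next.run(req).await;

    tracing::info!(
        %method,
        %path,
        status = response.status().as_u16(),
        latency_ms = start.elapsed().as_millis() as u64,
        "request completed"
    );
    response
}

// Attached to the router roughly like this (route and handler names are illustrative):
// let app = Router::new()
//     .route("/relay", post(relay_handler))
//     .layer(axum::middleware::from_fn(log_latency));
```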

@anthony-near
Collaborator

I ran a profiler locally and 70%+ of CPU time (as distinct from wall time, which includes network requests) is spent making RPC requests and formatting the responses. The only network calls made are RPC requests; locally they were the source of latency and bottlenecks. I also suspect RPC request latency is the cause here.

@anthony-near
Collaborator

I will try to do more profiling to include network calls and go through the logs on prod to verify for sure that it is RPC.

@anthony-near
Collaborator

[flamegraph image: relayer_flamegraph_cpu_time]
[flamegraph image: relayer_flamegraph_cpu_time_pt2]

As you can see from the flame chart profiling of a test run, the vast majority of the CPU time is spent on networking, aside from unwrapping an error in the test itself (relayer-53450093a1a2da8a`core::result::unwrap_failed).

@DavidM-D
Author

DavidM-D commented Nov 3, 2023

How long does this test take to run on the wall clock?
Are we seeing the latency that is causing the issues in your test setup or is it only observable in prod?

@anthony-near
Collaborator

anthony-near commented Nov 3, 2023

The test took 4.207 seconds of total wall clock time.

There's latency in the call to the RPC locally, but that's accounted for in the wall clock time and is nothing like what we see in the 99th percentile in prod. I'm not sure how the Grafana request latency metrics are collected, and I'm not entirely sure the methodology used there is accurate.

@anthony-near
Collaborator

I also just ran a load test where I made 100 serial (non-parallelized) requests hitting the /relay endpoint locally. The load test took 4 minutes and 50 seconds, giving an average response time of (4*60+50)/100 = 2.9 seconds.
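
For reference, the load test was roughly the following shape: a minimal sketch using a blocking reqwest client, with the local port and request body as placeholders (real /relay requests carry a Borsh-serialized payload), not the exact script I ran:

```rust
use std::time::Instant;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Requires reqwest with the "blocking" feature enabled.
    let client = reqwest::blocking::Client::new();
    let url = "http://127.0.0.1:3030/relay"; // assumed local port
    let payload: Vec<u8> = Vec::new(); // placeholder; real requests carry a Borsh-serialized body

    let start = Instant::now();
    for i in 0..100 {
        // Serial (non-parallelized) requests: each waits for the previous to finish.
        let resp = client.post(url).body(payload.clone()).send()?;
        println!("request {i}: status {}", resp.status());
    }

    let total = start.elapsed().as_secs_f64();
    println!("total: {total:.1}s, average: {:.2}s per request", total / 100.0);
    Ok(())
}
```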

@DavidM-D
Author

DavidM-D commented Nov 3, 2023

Hmm, then it doesn't seem to be capturing the latency we've observed through Grafana and the MPC service's metrics. Would it be possible to add logging so we can see which requests are taking so long?

@anthony-near
Collaborator

anthony-near commented Dec 1, 2023

Using the broadcast_tx_async RPC endpoint to send the txn and then calling the view-account-changes RPC endpoint (which supports querying updates for multiple account_ids in a single query) in a separate thread fixes the observed high edge-case latency #56 (comment). This will require a new /view-account-changes endpoint on the relayer that accepts multiple account_ids and returns the result.
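
A rough sketch of that flow over raw JSON-RPC is below. The RPC URL, base64 transaction, and account ids are placeholders, and the actual relayer implementation will likely differ (e.g. use an RPC client crate rather than hand-built requests):

```rust
use serde_json::{json, Value};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Requires reqwest with the "blocking" and "json" features enabled.
    let client = reqwest::blocking::Client::new();
    let rpc_url = "https://rpc.testnet.near.org"; // placeholder RPC endpoint

    // 1. Submit the signed transaction without waiting for execution;
    //    broadcast_tx_async returns the transaction hash immediately.
    let submitted: Value = client
        .post(rpc_url)
        .json(&json!({
            "jsonrpc": "2.0",
            "id": "dontcare",
            "method": "broadcast_tx_async",
            "params": ["<base64-encoded SignedTransaction>"]
        }))
        .send()?
        .json()?;
    println!("submitted, tx hash: {}", submitted["result"]);

    // 2. Separately (e.g. in another thread), query state changes for
    //    several accounts in a single call.
    let changes: Value = client
        .post(rpc_url)
        .json(&json!({
            "jsonrpc": "2.0",
            "id": "dontcare",
            "method": "EXPERIMENTAL_changes",
            "params": {
                "changes_type": "account_changes",
                "account_ids": ["alice.testnet", "bob.testnet"],
                "finality": "final"
            }
        }))
        .send()?
        .json()?;
    println!("account changes: {changes}");
    Ok(())
}
```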

@firatNEAR
Contributor

Did we test this out with the new release cut on devnet? Is this still an ongoing issue?

@DavidM-D
Author

This appears to be an ongoing issue. We're still seeing a 95th percentile of 20+ seconds. Observing the logs, it seems that most requests from FastAuth take 15-40 seconds, and the quicker requests are to the other endpoints, which seem to come in at a few ms (not quite sure what those other endpoints are doing). That's reflected in the Grafana dashboard, which shows the 50th percentile of requests at 5 ms, which is too fast to do anything on chain.

@firatNEAR
Contributor

I was under the impression that some of the requests hit async endpoints, which would explain the low-latency calls. AFAIK 'Create Account' calls were taking very long due to the atomic nature of the transaction (it shouldn't take 40 seconds though, I agree on that).

Have you guys also tried testing this on the testnet relayer? That one runs the latest release with the changes. cc: @anthony-near

@anthony-near
Collaborator

anthony-near commented Feb 16, 2024

The updated version that has tracing for request latencies has been deployed on the testnet dev env for ~1 week now. Here's a query for the request latency, which is usually milliseconds and occasionally (at the 99th percentile) 10-20 seconds at most: https://console.cloud.google.com/logs/query;query=resource.type%20%3D%20%22cloud_run_revision%22%0Aresource.labels.service_name%20%3D%20%22relayer-fastauth-testnet%22%0Aresource.labels.location%20%3D%20%22us-central1%22%0Aseverity%3E%3DDEBUG;cursorTimestamp=2024-02-16T21:08:53.862672Z;resultsSearch=httpRequest.latency;duration=P7D?project=pagoda-discovery-platform-prod

The /send_meta_tx_async endpoint has a max latency of 1.5s, with a median of a few ms. This is the most performant endpoint, but it's new in this release: https://console.cloud.google.com/logs/query;query=resource.type%20%3D%20%22cloud_run_revision%22%0Aresource.labels.service_name%20%3D%20%22relayer-fastauth-testnet%22%0Aresource.labels.location%20%3D%20%22us-central1%22%0Aseverity%3E%3DDEBUG%0ASEARCH%2528%22%60%2Fsend_meta_tx_async%60%22%2529;cursorTimestamp=2024-02-16T18:13:45.876595Z;resultsSearch=httpRequest.latency;duration=P7D?project=pagoda-discovery-platform-prod

The /send_meta_tx endpoint, the most stable endpoint for relaying transactions, has a max latency of 7.6s, in line with my local testing: https://console.cloud.google.com/logs/query;query=resource.type%20%3D%20%22cloud_run_revision%22%0Aresource.labels.service_name%20%3D%20%22relayer-fastauth-testnet%22%0Aresource.labels.location%20%3D%20%22us-central1%22%0Aseverity%3E%3DDEBUG%0ASEARCH%2528%22%60%2Fsend_meta_tx%60%22%2529%0A%2528httpRequest.latency%3E%3D%227606ms%22%20OR%0AjsonPayload.latency%3E%3D%227606ms%22%2529;cursorTimestamp=2024-02-16T18:37:59.099269Z;resultsSearch=httpRequest.latency;duration=P7D?project=pagoda-discovery-platform-prod

The /create_account_atomic endpoint requests all take ~22s, which is somewhat expected since we need to force that txn through and it requires multiple steps and RPC calls: https://console.cloud.google.com/logs/query;query=resource.type%20%3D%20%22cloud_run_revision%22%0Aresource.labels.service_name%20%3D%20%22relayer-fastauth-testnet%22%0Aresource.labels.location%20%3D%20%22us-central1%22%0Aseverity%3E%3DDEBUG%0ASEARCH%2528%22%60%2Fcreate_account_atomic%60%22%2529;cursorTimestamp=2024-02-16T09:50:56.138514Z;resultsSearch=httpRequest.latency;duration=P7D?project=pagoda-discovery-platform-prod

The /relay endpoint yielded some surprising results, and it may be a culprit for extra latency, as this endpoint is more or less deprecated (Borsh serialization/deserialization adds unnecessary time). The 2xx responses on this endpoint were fine and in line with the /send_meta_tx endpoints, but the 5xx responses took >20s: https://console.cloud.google.com/logs/query;query=resource.type%20%3D%20%22cloud_run_revision%22%0Aresource.labels.service_name%20%3D%20%22relayer-fastauth-testnet%22%0Aresource.labels.location%20%3D%20%22us-central1%22%0Aseverity%3E%3DDEBUG%0ASEARCH%2528%22%60%2Frelay%60%22%2529%0A%2528httpRequest.latency%3E%3D%2219737ms%22%20OR%0AjsonPayload.latency%3E%3D%2219737ms%22%2529;cursorTimestamp=2024-02-16T14:08:47.349072Z;resultsSearch=httpRequest.latency;duration=P7D?project=pagoda-discovery-platform-prod

There were no calls to the /get_allowance, /update_allowance, or /update_all_allowances endpoints, which purely interact with Redis and don't touch the RPC, so I'm not concerned about them.

I created a dashboard with a breakdown of all requests across the Cloud Run services here: https://console.cloud.google.com/monitoring/dashboards/builder/0b2ba8d8-8810-447e-9e1e-ce44a78219fb;duration=P7D?project=pagoda-discovery-platform-prod. Note that there are inaccuracies in the reporting of the 99th percentile times: the max time I saw for any response was 29s, not the 300+ seconds reported on the dashboard's 99% 5xx failures line. I don't think that line is specific to the relayer, since all Cloud Run request latencies are included, but I manually went through all the logs from the past week and the max response time I saw was 29s, which is still long, but it was on a deprecated endpoint and a 5xx response.

@s-n-park

Not 100% sure if this is related to the relayer, but in case it is, I wanted to share a sample video of trying to log in via FastAuth and sitting on a "recovering account" screen for 15 seconds.

https://drive.google.com/file/d/10d6YgLzkU0I-mAcLTviUPyI8eCmcFGMz/view
