Occasional 404/502 Errors from RPC Proxy #14

Open
chainzero opened this issue Aug 26, 2024 · 18 comments
@chainzero
Collaborator

chainzero commented Aug 26, 2024

When testing the current implementation/deployment of the RPC Proxy, approximately 8 out of 10 requests are serviced properly, but 10-20% of requests receive 404 Not Found or 502 Bad Gateway responses.

Version - 0.0.1-rc2

Testable via - curl -ks https://peers.akash.network:443/rpc
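
To put a number on the failure rate, the curl check can be repeated and the status codes tallied. The following is a minimal sketch, not part of the proxy: the helper names (`fetch_status`, `sample_statuses`, `error_rate`) and the sample size of 50 are assumptions made for illustration, and TLS verification is skipped only to mirror `curl -k`.

```python
import ssl
import urllib.error
import urllib.request
from collections import Counter
from typing import Callable

URL = "https://peers.akash.network:443/rpc"  # endpoint from the report


def fetch_status(url: str) -> int:
    """Return the HTTP status of one GET, skipping TLS verification like curl -k."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False  # must be disabled before setting CERT_NONE
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, context=ctx, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx replies are raised as HTTPError


def sample_statuses(n: int, fetch: Callable[[str], int] = fetch_status,
                    url: str = URL) -> Counter:
    """Issue n requests and tally the status codes seen."""
    return Counter(fetch(url) for _ in range(n))


def error_rate(counts: Counter) -> float:
    """Fraction of sampled requests that did not return 200."""
    total = sum(counts.values())
    return 0.0 if total == 0 else 1 - counts[200] / total


if __name__ == "__main__":
    counts = sample_statuses(50)  # hypothetical sample size
    print(dict(counts), f"error rate: {error_rate(counts):.0%}")
```

Passing the fetch function as a parameter keeps the tally logic testable without touching the network. Connection-level failures (timeouts, DNS errors) are not caught here and would abort the run.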

Example - received a 502 error moments ago, and the proxy logs show no evidence of an issue:

Aug 26 22:21:29 rpc-proxy akash-rpc-proxy[52783]: 2024/08/26 22:21:29 INFO updated server list total=17
Aug 26 22:26:29 rpc-proxy akash-rpc-proxy[52783]: 2024/08/26 22:26:29 INFO updated server list total=17
Aug 26 22:31:29 rpc-proxy akash-rpc-proxy[52783]: 2024/08/26 22:31:29 INFO updated server list total=17
Aug 26 22:32:19 rpc-proxy akash-rpc-proxy[52783]: 2024/08/26 22:32:19 INFO proxying request name=Notional url=https://rpc-akash-ia.cosmosia.notional.ventures:443
Aug 26 22:32:19 rpc-proxy akash-rpc-proxy[52783]: 2024/08/26 22:32:19 INFO request done name=Notional avg=443.96568ms last=405.591407ms
Aug 26 22:33:57 rpc-proxy akash-rpc-proxy[52783]: 2024/08/26 22:33:57 INFO proxying request name=c29r3 url=http://akash.c29r3.xyz:80
Aug 26 22:33:58 rpc-proxy akash-rpc-proxy[52783]: 2024/08/26 22:33:58 INFO request done name=c29r3 avg=377.501588ms last=330.60033ms
Aug 26 22:33:59 rpc-proxy akash-rpc-proxy[52783]: 2024/08/26 22:33:59 INFO proxying request name="AutoStake 🛡️ Slash Protected" url=https://akash-mainnet-rpc.autostake.com:443
Aug 26 22:34:00 rpc-proxy akash-rpc-proxy[52783]: 2024/08/26 22:34:00 INFO request done name="AutoStake 🛡️ Slash Protected" avg=444.685044ms last=473.324183ms
Aug 26 22:34:03 rpc-proxy akash-rpc-proxy[52783]: 2024/08/26 22:34:03 INFO proxying request name=Kleomedes url=https://akash-rpc.kleomedes.network
Aug 26 22:34:03 rpc-proxy akash-rpc-proxy[52783]: 2024/08/26 22:34:03 INFO request done name=Kleomedes avg=518.63857ms last=500.393158ms
Aug 26 22:34:05 rpc-proxy akash-rpc-proxy[52783]: 2024/08/26 22:34:05 INFO proxying request name=Stakeflow url=https://rpc-akash-01.stakeflow.io
Aug 26 22:34:05 rpc-proxy akash-rpc-proxy[52783]: 2024/08/26 22:34:05 INFO request done name=Stakeflow avg=462.42117ms last=457.665301ms
Aug 26 22:34:06 rpc-proxy akash-rpc-proxy[52783]: 2024/08/26 22:34:06 INFO proxying request name="Cosmonaut Stakes" url=https://akash-mainnet-rpc.cosmonautstakes.com:443
Aug 26 22:34:07 rpc-proxy akash-rpc-proxy[52783]: 2024/08/26 22:34:07 INFO request done name="Cosmonaut Stakes" avg=281.433645ms last=308.477328ms
Aug 26 22:34:08 rpc-proxy akash-rpc-proxy[52783]: 2024/08/26 22:34:08 INFO proxying request name=w3coins url=https://akash-rpc.w3coins.io
Aug 26 22:34:08 rpc-proxy akash-rpc-proxy[52783]: 2024/08/26 22:34:08 INFO request done name=w3coins avg=382.23996ms last=565.146647ms
Aug 26 22:34:10 rpc-proxy akash-rpc-proxy[52783]: 2024/08/26 22:34:10 INFO proxying request name="Allnodes ⚡️ Nodes & Staking" url=https://akash-rpc.publicnode.com:443
Aug 26 22:34:10 rpc-proxy akash-rpc-proxy[52783]: 2024/08/26 22:34:10 INFO request done name="Allnodes ⚡️ Nodes & Staking" avg=328.144464ms last=228.557756ms
Aug 26 22:34:13 rpc-proxy akash-rpc-proxy[52783]: 2024/08/26 22:34:13 INFO proxying request name=ValidatorNode url=https://akash-rpc.validatornode.com
Aug 26 22:34:13 rpc-proxy akash-rpc-proxy[52783]: 2024/08/26 22:34:13 INFO request done name=ValidatorNode avg=154.788507ms last=181.662926ms
Aug 26 22:34:14 rpc-proxy akash-rpc-proxy[52783]: 2024/08/26 22:34:14 INFO proxying request name="WhisperNode 🤐" url=https://rpc-akash.whispernode.com:443
Aug 26 22:34:14 rpc-proxy akash-rpc-proxy[52783]: 2024/08/26 22:34:14 INFO request done name="WhisperNode 🤐" avg=355.714206ms last=436.637606ms
Aug 26 22:34:16 rpc-proxy akash-rpc-proxy[52783]: 2024/08/26 22:34:16 INFO proxying request name=Stakewolle url=https://public.stakewolle.com
Aug 26 22:34:16 rpc-proxy akash-rpc-proxy[52783]: 2024/08/26 22:34:16 INFO request done name=Stakewolle avg=295.560864ms last=318.479477ms
@chainzero chainzero changed the title from "Occasional 404/502 Errors from Proxy" to "Occasional 404/502 Errors from RPC Proxy" Aug 26, 2024
@caarlos0
Contributor

that seems to be running an older version of this project. the most recent one should include the status in that log, and also properly ignore servers that are erroring

see #12

@caarlos0
Contributor

yeah, this is not the latest version: https://peers.akash.network

@chainzero
Collaborator Author

Installed the just-cut release v0.0.1-rc3, and status messages are indeed written to the logs in this version.

When a request receives a 404/502 response such as:

curl -ks https://peers.akash.network:443/rpc
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.26.1</center>
</body>
</html>

RPC proxy server logs show:

Aug 27 13:12:10 rpc-proxy akash-rpc-proxy[54476]: 2024/08/27 13:12:10 INFO request done name=Stakewolle avg=294.04119ms last=294.04119ms status=404
Aug 27 13:14:31 rpc-proxy akash-rpc-proxy[54476]: 2024/08/27 13:14:31 INFO request done name=Notional avg=398.930145ms last=398.930145ms status=502
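
Since rc3 writes a `status=` field on each `request done` line, the failing upstream nodes can be counted straight from the logs. A rough sketch, assuming the log format shown above; the function names are made up for illustration and are not part of the proxy.

```python
import shlex
from collections import Counter
from typing import Dict, Iterable, Optional

MARKER = "INFO request done "


def parse_request_done(line: str) -> Optional[Dict[str, str]]:
    """Return the key=value fields of a 'request done' line, or None for other lines."""
    _, found, fields = line.partition(MARKER)
    if not found:
        return None
    # shlex keeps quoted values such as name="Cosmonaut Stakes" in one token.
    return dict(tok.split("=", 1) for tok in shlex.split(fields) if "=" in tok)


def failures_by_node(lines: Iterable[str]) -> Counter:
    """Count non-200 responses per upstream node name."""
    counts: Counter = Counter()
    for line in lines:
        rec = parse_request_done(line)
        # Lines without a status field (older releases) are skipped.
        if rec and rec.get("status") not in (None, "200"):
            counts[rec.get("name", "?")] += 1
    return counts
```

Feeding it the journal output for a day, for example `failures_by_node(open("proxy.log"))` with a hypothetical log file, gives a per-node failure count that can inform which seeds to drop.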

@caarlos0
Contributor

cool!

there are a few nodes that I think always error - maybe we should remove them from the seed file as well?

@troian
Member

troian commented Aug 27, 2024

@caarlos0 are Stakewolle and Notional the ones that give errors?

@troian
Member

troian commented Aug 27, 2024

@caarlos0 can you capture reply status and rule out nodes with 404/502 errors?

@chainzero
Collaborator Author

We could definitely consider removing nodes from seeds if those nodes consistently error.

But in addition - do we currently attempt an additional seed if the first attempted is unresponsive/returns error?

@caarlos0
Contributor

> @caarlos0 can you capture reply status and rule out nodes with 404/502 errors?

it already does that... but after a while it will try again... it's designed for intermittent errors, not exactly for things that always error...

> But in addition - do we currently attempt an additional seed if the first attempted is unresponsive/returns error?

we don't, easy enough to do though, can pr it later today

@caarlos0
Contributor

> @caarlos0 are Stakewolle and Notional the ones that give errors?

these seem to be all the problematic ones

@caarlos0
Contributor

The error rate is time-boxed too; it can be customized by setting AKASH_PROXY_HEALTHY_ERROR_RATE_BUCKET_TIMEOUT (default is 1m, which is good for dev, probably too low for prod)

https://github.com/akash-network/rpc-proxy/blob/main/config.md
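
One way to picture a time-boxed error rate is a counter that starts over once the bucket timeout has elapsed, so a node that failed earlier gets another chance later. The sketch below is only a guess at that idea under stated assumptions, not the proxy's actual implementation: the class name `ErrorRateBucket`, its fields, and the reset-on-expiry behaviour are all hypothetical, with the 60-second default standing in for the documented `1m`.

```python
import time
from typing import Callable


class ErrorRateBucket:
    """Tracks an error rate that starts over once the bucket timeout elapses."""

    def __init__(self, timeout_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s  # stands in for the 1m default
        self.clock = clock          # injectable so tests can fake time
        self._reset()

    def _reset(self) -> None:
        self.started = self.clock()
        self.requests = 0
        self.errors = 0

    def _expire_if_needed(self) -> None:
        if self.clock() - self.started >= self.timeout_s:
            self._reset()

    def record(self, status: int) -> None:
        """Count one proxied response; anything other than 200 is an error."""
        self._expire_if_needed()
        self.requests += 1
        if status != 200:
            self.errors += 1

    def rate(self) -> float:
        """Error fraction within the current bucket (0.0 when empty)."""
        self._expire_if_needed()
        return self.errors / self.requests if self.requests else 0.0
```

With a short timeout, a node that always errors is retried every bucket, which matches the sporadic failures reported here; a longer timeout keeps such a node out of rotation for longer between retries.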

@troian
Member

troian commented Aug 27, 2024

> But in addition - do we currently attempt an additional seed if the first attempted is unresponsive/returns error?

if the original node returns an error, pass it to the client; don't try another node

@chainzero
Collaborator Author

> cool!
>
> there are a few nodes that I think always error - maybe we should remove them from the seed file as well?

@caarlos0 - was there intent to remove these unresponsive nodes? We are still getting sporadic occurrences of 404/502, and these same identified nodes are used in such cases. Example - this morning we got 502 errors when https://rpc-akash.cosmos-spaces.cloud/ was used.

@caarlos0
Contributor

I can remove them yes, @troian wdyt?

@caarlos0
Contributor

this one seems to be back up though

@troian
Member

troian commented Aug 29, 2024

@caarlos0 let's do the following.
if a node replies with anything other than 200:

  • pass the reply back to the user.
  • mark the node as unhealthy.
  • run a healthcheck query (for example, query the validator set) until it starts replying 200 within 10 consecutive attempts.
  • return the node to the active set if the query check passes.
  • if a node becomes unhealthy 3 times within 30 minutes, exclude it completely from the set until rpc-proxy restarts.
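
The steps above amount to a small per-node state machine. A minimal sketch follows, assuming "10 consecutive attempts" means ten health checks in a row returning 200; the class and method names are hypothetical, and running the actual health-check query and forwarding the reply to the client are left to the caller.

```python
import time
from collections import deque
from typing import Callable

ACTIVE, UNHEALTHY, EXCLUDED = "active", "unhealthy", "excluded"


class NodeHealth:
    """Per-node health state following the proposed rules."""

    def __init__(self, required_successes: int = 10, max_strikes: int = 3,
                 strike_window_s: float = 30 * 60,
                 clock: Callable[[], float] = time.monotonic):
        self.required_successes = required_successes
        self.max_strikes = max_strikes
        self.strike_window_s = strike_window_s
        self.clock = clock
        self.state = ACTIVE
        self.streak = 0         # consecutive successful health checks
        self.strikes = deque()  # times at which the node became unhealthy

    def on_response(self, status: int) -> None:
        """Record a proxied reply; the reply itself still goes back to the client."""
        if self.state != ACTIVE or status == 200:
            return
        now = self.clock()
        self.strikes.append(now)
        # Forget strikes older than the 30-minute window.
        while now - self.strikes[0] > self.strike_window_s:
            self.strikes.popleft()
        self.streak = 0
        # Third strike inside the window: excluded until the process restarts.
        self.state = EXCLUDED if len(self.strikes) >= self.max_strikes else UNHEALTHY

    def on_health_check(self, status: int) -> None:
        """Record one health-check result for an unhealthy node."""
        if self.state != UNHEALTHY:
            return
        self.streak = self.streak + 1 if status == 200 else 0
        if self.streak >= self.required_successes:
            self.state = ACTIVE
```

Only nodes in the `ACTIVE` state would be eligible for proxying. Because `EXCLUDED` has no exit transition, a restart (which rebuilds the objects) is the only way back, as in the proposal.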

@caarlos0
Contributor

caarlos0 commented Sep 1, 2024

that sounds a bit overcomplicated... maybe something like this works better: #11

for the cases here, in which the server is always 5xx/404, it should solve the issue I think 🤔

@troian
Member

troian commented Sep 3, 2024

sadly, the nature of RPC nodes on Cosmos calls for sophisticated health checks; historically they were pretty bad.
for example, a node can become unhealthy simply from being sent way too many requests and then get back to normal shortly after

@nick134-bit

nick134-bit commented Sep 6, 2024

> @caarlos0 are Stakewolle and Notional the ones that give errors?

> these seem to be all the problematic ones

https://public.stakewolle.com/cosmos/akash/rpc/ <- note the trailing slash; it is causing issues because they seem to use a proxy over the publicly available RPCs themselves but don't filter them. cosmos-spaces and Notional are consistently down and are handled. The others work for me.
