Missing Blocks After RPC Failure, Resolved by Pruning DB #1008

Closed
iHiteshAgrawal opened this issue Jul 31, 2024 · 3 comments

Comments

@iHiteshAgrawal

iHiteshAgrawal commented Jul 31, 2024

We are experiencing an issue where a number of blocks are missing from our index after encountering RPC failures. We are using a load balancer for our RPC connections.

Summary:

  1. RPC calls fail at specific block heights (the RPC was down; a verification sketch follows below).
  2. The blocks at those heights are missing from the indexer after switching to a healthy RPC (without pruning the database).

Fix: Pruning the database and restarting the indexer successfully recovered the missing blocks.

PS: The load balancer doesn't seem to switch to another RPC URL after one fails at a particular request.

Related: #974 and #861
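
A minimal sketch of how one might confirm which heights the RPC could not serve, assuming Viem's createPublicClient and its exported BlockNotFoundError (the URL and block range are placeholders):

```ts
import { createPublicClient, http, BlockNotFoundError } from "viem";
import { mainnet } from "viem/chains";

// Placeholder endpoint; swap in the RPC behind the load balancer.
const client = createPublicClient({
  chain: mainnet,
  transport: http("https://rpc.example.com"),
});

// Walk a block range and collect heights the RPC cannot serve.
async function findUnservableBlocks(from: bigint, to: bigint) {
  const missing: bigint[] = [];
  for (let n = from; n <= to; n++) {
    try {
      await client.getBlock({ blockNumber: n });
    } catch (error) {
      if (error instanceof BlockNotFoundError) missing.push(n);
      else throw error; // surface transport errors, e.g. HTTP 500s
    }
  }
  return missing;
}
```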

@0xOlias
Collaborator

0xOlias commented Aug 5, 2024

Thanks for opening. You're correct that the loadBalance transport does not "skip" an inner transport if it starts returning errors. We're working on a new transport that combines the behavior of loadBalance and Viem's fallback transport to handle this scenario.
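
In the meantime, a possible workaround is to nest Viem's fallback transport inside each load-balanced slot, so a failing provider is skipped rather than retried indefinitely. A minimal sketch, assuming loadBalance is exported from @ponder/utils (the URLs are placeholders):

```ts
import { loadBalance } from "@ponder/utils";
import { fallback, http } from "viem";

// Each load-balanced slot is itself a fallback chain: if the primary
// in a slot starts erroring, Viem moves to that slot's backup instead
// of the slot going dark. URLs are placeholders.
export const transport = loadBalance([
  fallback([
    http("https://rpc-a-primary.example.com"),
    http("https://rpc-a-backup.example.com"),
  ]),
  fallback([
    http("https://rpc-b-primary.example.com"),
    http("https://rpc-b-backup.example.com"),
  ]),
]);
```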

Regarding the missing blocks issue, it would be helpful to understand how specifically the RPC was "down". If it was fully down (e.g., returning 500 status codes), it seems unlikely that Ponder would continue marking block ranges as cached. However, if it was serving incorrect data (e.g., an incorrect block.logsBloom or incomplete eth_getLogs responses), that could cause the issue you're seeing. Any additional info you have here would be helpful.
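
One way to probe for that failure mode (this is not Ponder's internal validation, just a hypothetical consistency check with a placeholder URL) is to compare a block's logsBloom against what eth_getLogs returns for the same height:

```ts
import { createPublicClient, http } from "viem";
import { mainnet } from "viem/chains";

// A zeroed 256-byte bloom filter (512 hex characters).
const EMPTY_BLOOM = "0x" + "0".repeat(512);

const client = createPublicClient({
  chain: mainnet,
  transport: http("https://rpc.example.com"), // placeholder
});

async function checkBlockConsistency(blockNumber: bigint) {
  const block = await client.getBlock({ blockNumber });
  const logs = await client.getLogs({ fromBlock: blockNumber, toBlock: blockNumber });
  const bloomEmpty = block.logsBloom === EMPTY_BLOOM;
  if (bloomEmpty && logs.length > 0) {
    // Definitely inconsistent: logs exist but the bloom says none do.
    console.warn(`block ${blockNumber}: empty logsBloom but ${logs.length} logs returned`);
  }
  if (!bloomEmpty && logs.length === 0) {
    // Suspicious but not conclusive: bloom filters admit false
    // positives, so a non-empty bloom with zero logs can be benign.
    console.warn(`block ${blockNumber}: non-empty logsBloom but no logs returned`);
  }
}
```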

Also, if you happen to still have the "corrupted" database around, it would be helpful to inspect it. Let me know if you do and perhaps you could DM me a connection string (or database file if using SQLite) on Telegram.

@iHiteshAgrawal
Author

Thanks for the insight on the load balancer behavior. We're looking forward to the new transport that addresses this.

To clarify the RPC issue, it was indeed completely down, returning 500 status codes. We also observed errors in the Ponder logs, specifically messages like "BlockNotFoundError: Block at hash .... could not be found". This suggests that Ponder was attempting to process blocks, couldn't access them, and then skipped those blocks after we switched RPCs.

Unfortunately, we no longer have access to the corrupted database. We performed the pruning operation as a way to quickly recover and resume normal operations.

@0xOlias
Collaborator

0xOlias commented Aug 7, 2024

Got it, thanks for the response. Without the corrupted database, there's not much we can do. However, we are working on an improvement to the internal realtime sync engine, and before we ship that we will improve testing for the "temporary 500" case to make sure we aren't marking invalid data as cached/valid.

Closing for now. If this happens again, please keep the corrupted database so we can inspect it, and re-open. Thanks again for reporting, every issue helps.

@0xOlias 0xOlias closed this as completed Aug 7, 2024