WAL corruption on 2/9 ingesters after upgrading from GEM v1.7.0 -> v2.0.1 #266

Open
replay opened this issue Jun 15, 2022 · 0 comments

What did you do?

A customer upgraded 9 GEM ingesters from v1.7.0 to v2.0.1. The corresponding Prometheus versions are:

v1.7.0: github.com/grafana/prometheus-private v0.0.0-20211105104652-a882d28d367e
v2.0.1: github.com/grafana/mimir-prometheus v0.0.0-20220210151959-f8e3195f7500

(Note that prometheus-private has been renamed to mimir-prometheus, so both versions come from the same repo.)
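
For anyone who wants to diff the two versions themselves: both strings above are Go module pseudo-versions of the form `vX.Y.Z-yyyymmddhhmmss-abcdefabcdef`, where the middle part is the UTC commit time and the last part is the 12-character commit hash prefix. Below is a minimal sketch of pulling those two pieces apart. The helper name `parsePseudoVersion` is hypothetical, and it only handles this simple three-part form, not pseudo-versions derived from a tagged base version.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// parsePseudoVersion (hypothetical helper) splits a Go module pseudo-version
// of the simple form vX.Y.Z-yyyymmddhhmmss-abcdefabcdef into the UTC commit
// timestamp and the 12-character commit hash prefix.
// Any other form is rejected with an error.
func parsePseudoVersion(v string) (time.Time, string, error) {
	parts := strings.Split(v, "-")
	if len(parts) != 3 {
		return time.Time{}, "", fmt.Errorf("unexpected pseudo-version format: %q", v)
	}
	// The timestamp has no separators; Go's reference time spells the layout.
	ts, err := time.Parse("20060102150405", parts[1])
	if err != nil {
		return time.Time{}, "", fmt.Errorf("parsing timestamp in %q: %w", v, err)
	}
	if len(parts[2]) != 12 {
		return time.Time{}, "", fmt.Errorf("expected 12-character hash in %q", v)
	}
	return ts, parts[2], nil
}

func main() {
	// The two versions quoted in this issue.
	versions := []string{
		"v0.0.0-20211105104652-a882d28d367e",
		"v0.0.0-20220210151959-f8e3195f7500",
	}
	for _, v := range versions {
		ts, commit, err := parsePseudoVersion(v)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s commit=%s committed=%s\n", v, commit, ts.Format(time.RFC3339))
	}
}
```

The two extracted hashes can then be passed to something like `git diff a882d28d367e f8e3195f7500 -- <path>` in a checkout of the repo, with `<path>` set to whichever directory holds the WAL code.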

What did you expect to see?

We expected the new version to replay the WAL successfully.

What did you see instead? Under which circumstances?

After updating the 9 ingesters, 2 of them encountered WAL corruption during startup. I have checked the diff between the two Prometheus versions, and the only modifications to the WAL that I can see are in this PR, but I don't know how this change would lead to corruption during replay.

The customer also reported that WAL replay took much longer than usual on the other 7 ingesters, which came back successfully. I'm not sure whether that is relevant to this issue.

I only have screenshots of the logs.

This is the WAL corruption in the ingester log:

[Screenshot: ingester log showing the WAL corruption]

This is from the log of one of the ingesters that came up successfully. Note that the reported replay duration is 2h22min, whereas the customer said that restarting an ingester usually takes 5-10min:

[Screenshot: ingester log showing the WAL replay duration of 2h22min]

Side note: This could also simply indicate an issue with the storage volumes in use, but I still wanted to submit this issue in case somebody is aware of a change that could lead to this.

replay changed the title from "On-prem customer reported that after upgrading 9 Ingesters from GEM v1.7.0 -> v2.0.1 2/9 encountered a WAL corruption during startup" to "WAL corruption on 2/9 ingesters after upgrading from GEM v1.7.0 -> v2.0.1" on Jun 15, 2022