
Pending seems like the wrong state for a cluster with 1 / 6 OSDs down #6

ChrisMacNaughton opened this issue Sep 2, 2016 · 4 comments

ChrisMacNaughton commented Sep 2, 2016

root@juju-06235e16-9387-4d0b-88a7-e9a908e16803-machine-7:# ./safe -e
Current OSD statuses:
● 1: Pending
● 2: Pending
● 3: Pending
● 4: Pending
● 5: Pending
root@juju-06235e16-9387-4d0b-88a7-e9a908e16803-machine-7:# ceph -s
cluster 577e36dc-7127-11e6-be6f-fa163e71fd29
health HEALTH_WARN 80 pgs degraded; 80 pgs stuck unclean; 1/6 in osds are down
monmap e2: 3 mons at {juju-06235e16-9387-4d0b-88a7-e9a908e16803-machine-7=10.5.6.0:6789/0,juju-06235e16-9387-4d0b-88a7-e9a908e16803-machine-8=10.5.5.254:6789/0,juju-06235e16-9387-4d0b-88a7-e9a908e16803-machine-9=10.5.5.255:6789/0}, election epoch 8, quorum 0,1,2 juju-06235e16-9387-4d0b-88a7-e9a908e16803-machine-8,juju-06235e16-9387-4d0b-88a7-e9a908e16803-machine-9,juju-06235e16-9387-4d0b-88a7-e9a908e16803-machine-7
osdmap e25: 6 osds: 5 up, 6 in
pgmap v47: 192 pgs, 3 pools, 0 bytes data, 0 objects
222 MB used, 55007 MB / 55229 MB avail
80 active+degraded
112 active+clean

0X1A (Contributor) commented Sep 3, 2016

@ChrisMacNaughton Hm, that doesn't look good. Those PGs are marked active+degraded; they should be marked as non-safe. This isn't a side effect of the timing we talked about in #3, is it?

ChrisMacNaughton (Author)

@0X1A The issue here is that at least one of those OSDs should, in theory, be removable. 1/6 is down, so there are degraded PGs, but I'd guess (possibly incorrectly) that we can get away with removing one more OSD before removal actually becomes impossible.
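
As a concrete, entirely made-up illustration of that reasoning, here is a toy sketch (not the project's code) assuming replicated pools with size 3 and min_size 2, with osd.5 standing in for the down OSD:

```python
# Toy example only: hypothetical acting sets for two PGs.
# Pool size is 3, min_size is 2, and osd.5 is the OSD that is down.
min_size = 2
down = {5}
acting_sets = {
    "1.0": {0, 1, 5},  # degraded: only two live copies remain
    "1.1": {2, 3, 4},  # clean: all three copies are live
}

def can_remove(osd):
    # An OSD is removable only if every PG would still keep at least
    # min_size live copies after that OSD is taken out as well.
    return all(len((acting - down) - {osd}) >= min_size
               for acting in acting_sets.values())

print([osd for osd in range(6) if osd not in down and can_remove(osd)])
# -> [2, 3, 4]; removing osd.0 or osd.1 would leave PG 1.0 with one copy.
```

So whether a second OSD can go depends entirely on which degraded PGs it shares with the down OSD, which is why a blanket Pending is conservative rather than outright wrong.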

0X1A (Contributor) commented Sep 6, 2016

@ChrisMacNaughton Ah yes, I meant to say Pending. I think this is a matter of digging deeper into the current implementation because, as you said, those OSDs may still be removable depending on whether the affected objects have been replicated enough times IIRC (or at least once). The Ceph documentation for this case lists ways to look into which objects are lost, but I'm not sure what a better suggestion than Pending would be while we are unsure of the actual state.
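
For reference, that inspection can be scripted. A rough sketch follows; the `ceph pg dump_stuck`, `ceph pg <pgid> query`, and `ceph pg <pgid> list_missing` commands are standard Ceph CLI, but the JSON field layout is an assumption and differs between releases:

```python
#!/usr/bin/env python3
"""Rough sketch: list stuck PGs and query each one to see which OSDs
still hold copies. `ceph pg <pgid> list_missing` (not run here) is the
documented way to enumerate objects that are actually missing/lost."""
import json
import subprocess

def ceph_json(*args):
    # Shell out to the ceph CLI and parse its JSON output.
    return json.loads(subprocess.check_output(["ceph", *args, "--format", "json"]))

raw = ceph_json("pg", "dump_stuck", "unclean")
# Older releases return a bare list, newer ones wrap it; accept either.
stuck = raw.get("pg_stats", []) if isinstance(raw, dict) else raw

for pg in stuck:
    detail = ceph_json("pg", pg["pgid"], "query")
    print(pg["pgid"], pg["state"], "acting:", detail.get("acting"))
```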

ChrisMacNaughton (Author)

@0X1A I suspect that the best solution is to build out the actual PG map and work out whether any disks can be removed, given that this is the exhaustive check.
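
A sketch of what that exhaustive check could look like, built from `ceph osd dump` and `ceph pg dump` (real commands; the JSON field names are assumed from the Jewel-era output, and the acting-set/min_size logic is my reading of the discussion, not the project's implementation):

```python
#!/usr/bin/env python3
"""Sketch of the exhaustive check: for every up+in OSD, ask whether any
PG would drop below its pool's min_size if that OSD were removed."""
import json
import subprocess

def ceph_json(*args):
    # Shell out to the ceph CLI and parse its JSON output.
    return json.loads(subprocess.check_output(["ceph", *args, "--format", "json"]))

def removable_osds():
    osd_dump = ceph_json("osd", "dump")
    pg_dump = ceph_json("pg", "dump")

    # pg_stats sits at the top level in older releases and under
    # "pg_map" in newer ones; accept either (assumption).
    pg_stats = pg_dump.get("pg_map", pg_dump).get("pg_stats", [])

    # min_size per pool, keyed by pool id, and the set of up+in OSDs.
    min_size = {p["pool"]: p["min_size"] for p in osd_dump["pools"]}
    up_in = {o["osd"] for o in osd_dump["osds"] if o["up"] and o["in"]}

    safe = set(up_in)
    for pg in pg_stats:
        pool_id = int(pg["pgid"].split(".")[0])
        # Surviving copies are the acting OSDs that are still up+in.
        acting = set(pg["acting"]) & up_in
        for osd in list(safe):
            if osd in acting and len(acting) - 1 < min_size[pool_id]:
                # Removing this OSD would push the PG below min_size.
                safe.discard(osd)
    return safe

if __name__ == "__main__":
    print("OSDs that look safe to remove:", sorted(removable_osds()))
```

Whether any OSD passes this test on the cluster above depends on how the 80 degraded PGs are spread across the surviving OSDs, which is exactly the information the blanket Pending state hides.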
