
Pending seems like the wrong state for a cluster with 1 / 6 OSDs down #6

ChrisMacNaughton opened this issue Sep 2, 2016 · 4 comments

ChrisMacNaughton commented Sep 2, 2016

root@juju-06235e16-9387-4d0b-88a7-e9a908e16803-machine-7:# ./safe -e
Current OSD statuses:
● 1: Pending
● 2: Pending
● 3: Pending
● 4: Pending
● 5: Pending
root@juju-06235e16-9387-4d0b-88a7-e9a908e16803-machine-7:# ceph -s
cluster 577e36dc-7127-11e6-be6f-fa163e71fd29
health HEALTH_WARN 80 pgs degraded; 80 pgs stuck unclean; 1/6 in osds are down
monmap e2: 3 mons at {juju-06235e16-9387-4d0b-88a7-e9a908e16803-machine-7=10.5.6.0:6789/0,juju-06235e16-9387-4d0b-88a7-e9a908e16803-machine-8=10.5.5.254:6789/0,juju-06235e16-9387-4d0b-88a7-e9a908e16803-machine-9=10.5.5.255:6789/0}, election epoch 8, quorum 0,1,2 juju-06235e16-9387-4d0b-88a7-e9a908e16803-machine-8,juju-06235e16-9387-4d0b-88a7-e9a908e16803-machine-9,juju-06235e16-9387-4d0b-88a7-e9a908e16803-machine-7
osdmap e25: 6 osds: 5 up, 6 in
pgmap v47: 192 pgs, 3 pools, 0 bytes data, 0 objects
222 MB used, 55007 MB / 55229 MB avail
80 active+degraded
112 active+clean

0X1A (Contributor) commented Sep 3, 2016

@ChrisMacNaughton Hm, that doesn't look good. Those PGs are marked active+degraded; they should be marked as non-safe. This isn't a side effect of the timing we talked about in #3, is it?

ChrisMacNaughton (Author)

@0X1A The issue here is that at least one of those OSDs should, in theory, be removable. 1/6 is down, so there are degraded PGs, but I'd guess (possibly incorrectly) that we can get away with removing one more OSD before removal actually becomes impossible.
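
As a concrete, entirely made-up illustration of that reasoning, here is a toy sketch (not the project's code) assuming replicated pools with size 3 and min_size 2, with osd.5 standing in for the down OSD:

```python
# Toy example only: hypothetical acting sets for two PGs.
# Pool size is 3, min_size is 2, and osd.5 is the OSD that is down.
min_size = 2
down = {5}
acting_sets = {
    "1.0": {0, 1, 5},  # degraded: only two live copies remain
    "1.1": {2, 3, 4},  # clean: all three copies are live
}

def can_remove(osd):
    # An OSD is removable only if every PG would still keep at least
    # min_size live copies after that OSD is taken out as well.
    return all(len((acting - down) - {osd}) >= min_size
               for acting in acting_sets.values())

print([osd for osd in range(6) if osd not in down and can_remove(osd)])
# -> [2, 3, 4]; removing osd.0 or osd.1 would leave PG 1.0 with one copy.
```

So whether a second OSD can go depends entirely on which degraded PGs it shares with the down OSD, which is why a blanket Pending is conservative rather than outright wrong.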

0X1A (Contributor) commented Sep 6, 2016

@ChrisMacNaughton Ah yes, I meant to say Pending. I think this is a matter of digging deeper into the current implementation because, as you said, those OSDs may still be removable depending on whether the affected objects have been replicated enough times IIRC (or at least once). The Ceph documentation for this case lists ways to look into which objects are lost, but I'm not sure what a better suggestion than Pending would be while we are unsure of the actual state.
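
For reference, that inspection can be scripted. A rough sketch follows; the `ceph pg dump_stuck`, `ceph pg <pgid> query`, and `ceph pg <pgid> list_missing` commands are standard Ceph CLI, but the JSON field layout is an assumption and differs between releases:

```python
#!/usr/bin/env python3
"""Rough sketch: list stuck PGs and query each one to see which OSDs
still hold copies. `ceph pg <pgid> list_missing` (not run here) is the
documented way to enumerate objects that are actually missing/lost."""
import json
import subprocess

def ceph_json(*args):
    # Shell out to the ceph CLI and parse its JSON output.
    return json.loads(subprocess.check_output(["ceph", *args, "--format", "json"]))

raw = ceph_json("pg", "dump_stuck", "unclean")
# Older releases return a bare list, newer ones wrap it; accept either.
stuck = raw.get("pg_stats", []) if isinstance(raw, dict) else raw

for pg in stuck:
    detail = ceph_json("pg", pg["pgid"], "query")
    print(pg["pgid"], pg["state"], "acting:", detail.get("acting"))
```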

ChrisMacNaughton (Author)

@0X1A I suspect that the best solution is to build out the actual PG map and work out whether any disks can be removed, given that this is the exhaustive check.
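
A sketch of what that exhaustive check could look like, built from `ceph osd dump` and `ceph pg dump` (real commands; the JSON field names are assumed from the Jewel-era output, and the acting-set/min_size logic is my reading of the discussion, not the project's implementation):

```python
#!/usr/bin/env python3
"""Sketch of the exhaustive check: for every up+in OSD, ask whether any
PG would drop below its pool's min_size if that OSD were removed."""
import json
import subprocess

def ceph_json(*args):
    # Shell out to the ceph CLI and parse its JSON output.
    return json.loads(subprocess.check_output(["ceph", *args, "--format", "json"]))

def removable_osds():
    osd_dump = ceph_json("osd", "dump")
    pg_dump = ceph_json("pg", "dump")

    # pg_stats sits at the top level in older releases and under
    # "pg_map" in newer ones; accept either (assumption).
    pg_stats = pg_dump.get("pg_map", pg_dump).get("pg_stats", [])

    # min_size per pool, keyed by pool id, and the set of up+in OSDs.
    min_size = {p["pool"]: p["min_size"] for p in osd_dump["pools"]}
    up_in = {o["osd"] for o in osd_dump["osds"] if o["up"] and o["in"]}

    safe = set(up_in)
    for pg in pg_stats:
        pool_id = int(pg["pgid"].split(".")[0])
        # Surviving copies are the acting OSDs that are still up+in.
        acting = set(pg["acting"]) & up_in
        for osd in list(safe):
            if osd in acting and len(acting) - 1 < min_size[pool_id]:
                # Removing this OSD would push the PG below min_size.
                safe.discard(osd)
    return safe

if __name__ == "__main__":
    print("OSDs that look safe to remove:", sorted(removable_osds()))
```

Whether any OSD passes this test on the cluster above depends on how the 80 degraded PGs are spread across the surviving OSDs, which is exactly the information the blanket Pending state hides.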
