Ceph nodes showing degraded until OSD Start

100percentjake

New Member
Jan 3, 2017
Hi everyone,

Got a basic 3-node Ceph+Proxmox HA cluster set up, and it worked great for roughly a day. Each node has one OSD occupying the full space of its RAID array. Everything was working fine, and I spun up a couple of machines to test it. A day later, nodes two and three are down and out until I go to the OSD tab of the GUI and hit "start" on them, at which point they come back just fine. At first I thought a Ceph service might not be auto-starting, since it consistently happened after a server reboot, but Googling turns up nothing to suggest that's the case, nor does it explain why the Ceph cluster would randomly fall apart.
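To check my auto-start theory, here's roughly what I've been running on each node (a sketch assuming a systemd-managed Ceph install; the OSD IDs 0 and 1 are the ones from my log below, adjust as needed):

Code:
# Is the per-OSD systemd unit enabled to start at boot?
systemctl is-enabled ceph-osd@0.service

# Why didn't the OSD come up after the last reboot?
systemctl status ceph-osd@1.service
journalctl -b -u ceph-osd@1.service

# What does the cluster itself think is up/in right now?
ceph osd tree
ceph -s

So far nothing in there jumps out at me, but maybe I'm looking in the wrong place.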

Looking in the Ceph log at the time of the nodes going down and out shows this:

Code:
2017-01-16 14:05:22.032015 mon.0 10.10.30.220:6789/0 93 : cluster [INF] osd.0 10.10.30.220:6800/6857 failed (4 reports from 1 peers after 20.220493 >= grace 20.000000)
2017-01-16 14:05:22.032069 mon.0 10.10.30.220:6789/0 94 : cluster [INF] osd.1 10.10.30.221:6800/2461 failed (4 reports from 1 peers after 20.220286 >= grace 20.000000)
2017-01-16 14:05:23.165177 mon.0 10.10.30.220:6789/0 97 : cluster [INF] osdmap e43: 3 osds: 1 up, 3 in
2017-01-16 14:05:23.219525 mon.0 10.10.30.220:6789/0 98 : cluster [INF] pgmap v49956: 64 pgs: 53 stale+active+clean, 11 peering; 2700 MB data, 8180 MB used, 3105 GB / 3113 GB avail
2017-01-16 14:05:24.239117 mon.0 10.10.30.220:6789/0 99 : cluster [INF] osdmap e44: 3 osds: 1 up, 3 in
2017-01-16 14:05:24.298586 mon.0 10.10.30.220:6789/0 100 : cluster [INF] pgmap v49957: 64 pgs: 53 stale+active+clean, 11 peering; 2700 MB data, 8180 MB used, 3105 GB / 3113 GB avail
2017-01-16 14:05:28.900159 mon.0 10.10.30.220:6789/0 101 : cluster [INF] pgmap v49958: 64 pgs: 64 active+undersized+degraded; 2700 MB data, 8181 MB used, 3105 GB / 3113 GB avail; 1400/2100 objects degraded (66.667%)
2017-01-16 14:05:31.131221 mon.0 10.10.30.220:6789/0 102 : cluster [INF] HEALTH_WARN; 64 pgs degraded; 63 pgs stuck unclean; 64 pgs undersized; recovery 1400/2100 objects degraded (66.667%); too few PGs per OSD (21 < min 30); 2/3 in osds are down
2017-01-16 14:06:03.903965 mon.0 10.10.30.220:6789/0 103 : cluster [INF] pgmap v49959: 64 pgs: 64 active+undersized+degraded; 2700 MB data, 8181 MB used, 3105 GB / 3113 GB avail; 1400/2100 objects degraded (66.667%)
2017-01-16 14:06:18.904832 mon.0 10.10.30.220:6789/0 104 : cluster [INF] pgmap v49960: 64 pgs: 64 active+undersized+degraded; 2700 MB data, 8181 MB used, 3105 GB / 3113 GB avail; 450 B/s rd, 0 op/s; 1400/2100 objects degraded (66.667%)
2017-01-16 14:06:23.907947 mon.0 10.10.30.220:6789/0 105 : cluster [INF] pgmap v49961: 64 pgs: 64 active+undersized+degraded; 2700 MB data, 8181 MB used, 3105 GB / 3113 GB avail; 73505 B/s rd, 102 kB/s wr, 31 op/s; 1400/2100 objects degraded (66.667%)