Hi,
I'm doing some testing on a Ceph cluster before putting VMs into production on this environment, but we are seeing a strange problem.
When I reboot a node (clean OS shutdown), everything works as expected in the Ceph Manager: the node's OSDs become DOWN and the cluster carries on fine.
But if we simulate a node power failure by pulling the power cords out of the server (dirty shutdown), the Ceph Manager still shows that node's OSDs as UP/IN.
The surviving node's logs still show "pgmap v19142: 1024 pgs: 1024 active+clean", and in the Proxmox GUI the OSDs from the failed node still appear as UP/IN.
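For reference, these are just the standard status commands I use to double-check the OSD state from the surviving node's CLI (hostnames and IDs omitted here):

# on the surviving node
ceph -s
ceph health detail
ceph osd tree     # the failed node's OSDs still show as "up" here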
Some more logs I collected from the surviving node:
/var/log/ceph/ceph.log:
cluster [WRN] Health check update: 129 slow ops, oldest one blocked for 537 sec, daemons [mon.pve01-bnu,mon.pve03-bnu] have slow ops. (SLOW_OPS)
/var/log/syslog:
09:40:41.025 7f9781bdd700 -1 osd.6 207 heartbeat_check: no reply from 189.XXX.XXX.XXX:6830 osd.19 since back 2019-10-24 09:30:17.278044 front 2019-10-24 09:30:17.277976 (oldest deadline 2019-10-24 09:30:42.577666)
/var/log/ceph/ceph-mgr.node02.log:
log_channel(cluster) log [DBG] : pgmap v19222: 1024 pgs: 1024 active+clean; 5.2 GiB data, 11 GiB used, 18 TiB / 18 TiB avail
In this situation, I can't access the storage from the surviving node anymore, and the VMs become unstable (read/write errors).
I can only get the environment stable again if I manually mark the OSDs from the failed node as DOWN, using the command: ceph osd down osd.X
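Roughly, the recovery procedure I run on the surviving node looks like this (osd.19 is one of the failed node's OSDs from my logs; the actual IDs depend on which node went down):

# find out which OSDs belong to the failed node
ceph osd tree
# manually mark each of them down, e.g.:
ceph osd down osd.19
# repeat for the remaining OSDs of the failed node, then re-check
ceph -s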