5 Node cluster with ceph and HA. VM/CT not failing over

poxin

Well-Known Member
Jun 27, 2017
I've got a 5-node setup with Ceph running on SSDs. VMs and containers have been set up for HA.

We just had a node crash due to hardware, but the machines that were on that node are not failing over. In the GUI they have a grey question mark next to them, and clicking on them shows "no route to host (595)".

I don't seem to be able to bring these machines up on a different node; they appear completely lost. This does not seem like intended behavior. Any help would be great.


Code:
:~# ceph -w
  cluster:
    id:     972b4d17-8d71-4b77-8ed9-8e44e2c84f16
    health: HEALTH_WARN
            1/5 mons down, quorum phy-hv-sl-0049,phy-hv-sl-0051,phy-hv-sl-0048,phy-hv-sl-0050
 
  services:
    mon: 5 daemons, quorum phy-hv-sl-0049,phy-hv-sl-0051,phy-hv-sl-0048,phy-hv-sl-0050 (age 89m), out of quorum: phy-hv-sl-0047
    mgr: phy-hv-sl-0051(active, since 88m), standbys: phy-hv-sl-0048, phy-hv-sl-0050, phy-hv-sl-0049
    mds: 1/1 daemons up, 3 standby
    osd: 25 osds: 20 up (since 89m), 20 in (since 79m)
 
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 1089 pgs
    objects: 73.61k objects, 158 GiB
    usage:   507 GiB used, 8.6 TiB / 9.1 TiB avail
    pgs:     1089 active+clean
 
  io:
    client:   1.3 KiB/s rd, 993 KiB/s wr, 0 op/s rd, 45 op/s wr

Code:
:~# ha-manager status
quorum OK
master phy-hv-sl-0050 (active, Fri Sep 24 21:23:04 2021)
lrm phy-hv-sl-0047 (old timestamp - dead?, Fri Sep 24 19:49:03 2021)
lrm phy-hv-sl-0048 (active, Fri Sep 24 21:23:04 2021)
lrm phy-hv-sl-0049 (active, Fri Sep 24 21:23:06 2021)
lrm phy-hv-sl-0050 (active, Fri Sep 24 21:23:06 2021)
lrm phy-hv-sl-0051 (active, Fri Sep 24 21:23:01 2021)
service ct:100 (phy-hv-sl-0051, started)
service ct:102 (phy-hv-sl-0048, started)
service ct:104 (phy-hv-sl-0050, started)
service ct:107 (phy-hv-sl-0051, started)
service vm:103 (phy-hv-sl-0048, started)
service vm:106 (phy-hv-sl-0049, started)
 
Interestingly, from another node I can see the conf file for the machines that did not move, still sitting in the offline node's directory. These machines did not have any local resources and are using shared Ceph storage for their disks.

In the ha-manager status above they aren't listed, even though I know every machine was set up for HA before this node went offline.

Code:
root@phy-hv-sl-0048:/# ls /etc/pve/nodes/phy-hv-sl-0047/qemu-server/
105.conf
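
For reference, a quick way to list which guests the HA stack actually has under management (ha-manager config just reads /etc/pve/ha/resources.cfg, so this is only a convenience check):

Code:
# list the configured HA resources and their groups/states
ha-manager config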

Is there a way to force move these for now? Still wondering why they didn't come back up when the node went offline.
 
Guess I'm restoring from backup! I'll play around with the HA simulator to figure out what's going on.
 
It seems you did not put them under HA? (This does not happen automatically.)

service ct:100 (phy-hv-sl-0051, started)
service ct:102 (phy-hv-sl-0048, started)
service ct:104 (phy-hv-sl-0050, started)
service ct:107 (phy-hv-sl-0051, started)
service vm:103 (phy-hv-sl-0048, started)
service vm:106 (phy-hv-sl-0049, started)
Those are the HA-managed guests; 101 and 105 are missing...
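
If they are supposed to be HA-managed, they would need to be (re-)added, e.g. with something along these lines once the guests are available again (resource IDs taken from this thread; group assignment is optional):

Code:
# put the missing guests back under HA management
ha-manager add ct:101 --state started
ha-manager add vm:105 --state started
# optionally also assign an existing HA group with: --group <groupname>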

Is there a way to force move these for now? Still wondering why they didn't come back up when the node went offline.
Yes, there is a way, but only do this if you know what you're doing and have verified that the node they were on is really not running anymore and that the storage exists on the target node.
If you're not careful, you can corrupt your guests!

Code:
# mv /etc/pve/nodes/OLDNODE/qemu-server/ID.conf /etc/pve/nodes/NEWNODE/qemu-server/ID.conf
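
As a concrete example for this thread (a sketch only, assuming phy-hv-sl-0047 is confirmed powered off and VM 105 should come up on phy-hv-sl-0048):

Code:
# check that the remaining cluster still has quorum
pvecm status

# make absolutely sure the old node is really down (powered off / fenced),
# otherwise the same guest could end up running twice and corrupt its disk.
# Then move the config to the new node:
mv /etc/pve/nodes/phy-hv-sl-0047/qemu-server/105.conf \
   /etc/pve/nodes/phy-hv-sl-0048/qemu-server/105.conf

# and start it there
qm start 105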
 
Yeah, that's the thing: they were already configured for HA, but when that node went offline they disappeared from the ha-manager status list.

I got the hardware fixed yesterday on the crashed node, booted it up, and the VMs came back online. They then appeared back in ha-manager status without me doing anything.
 
Hmm, that sounds weird. Can you post the logs from the nodes and the content of /etc/pve/ha/resources.cfg?
 
Absolutely, what log file would be helpful? Here's resources.cfg

Code:
:~# cat /etc/pve/ha/resources.cfg
ct: 100
    group HA
    state started

ct: 102
    group HA
    state started

ct: 104
    group HA
    state started

vm: 103
    group HA
    state started

vm: 106
    group HA
    state started

ct: 107
    group HA
    state started

ct: 101
    group HA
    state started

ct: 108
    group HA
    state started

vm: 105
    group HA
    state started
 
OK, the resources.cfg looks normal AFAICS.

what log file would be helpful?
Syslog (or journalctl output) from the time the node went offline until a few minutes after, ideally from all remaining nodes.
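
Something along these lines should capture the relevant window on each remaining node (the timestamps are only a guess based on the dead LRM timestamp above; adjust them to the actual outage window):

Code:
# run on every remaining node; adjust --since/--until as needed
journalctl --since "2021-09-24 19:40" --until "2021-09-24 21:30" \
    -u pve-cluster -u corosync -u pve-ha-crm -u pve-ha-lrm > /tmp/ha-log-$(hostname).txt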
 
