5 Node cluster with ceph and HA. VM/CT not failing over

poxin

Well-Known Member
Jun 27, 2017
I've got a 5-node setup with Ceph running on SSDs. VMs and containers have been set up for HA.

We just had a node crash due to hardware, but the machines that were on that node are not failing over. In the GUI they have a grey question mark next to them, and clicking on them shows "no route to host (595)".

I don't seem to be able to bring these machines up on a different node; they appear completely lost. This does not seem like intended behavior. Any help would be great.


Code:
:~# ceph -w
  cluster:
    id:     972b4d17-8d71-4b77-8ed9-8e44e2c84f16
    health: HEALTH_WARN
            1/5 mons down, quorum phy-hv-sl-0049,phy-hv-sl-0051,phy-hv-sl-0048,phy-hv-sl-0050
 
  services:
    mon: 5 daemons, quorum phy-hv-sl-0049,phy-hv-sl-0051,phy-hv-sl-0048,phy-hv-sl-0050 (age 89m), out of quorum: phy-hv-sl-0047
    mgr: phy-hv-sl-0051(active, since 88m), standbys: phy-hv-sl-0048, phy-hv-sl-0050, phy-hv-sl-0049
    mds: 1/1 daemons up, 3 standby
    osd: 25 osds: 20 up (since 89m), 20 in (since 79m)
 
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 1089 pgs
    objects: 73.61k objects, 158 GiB
    usage:   507 GiB used, 8.6 TiB / 9.1 TiB avail
    pgs:     1089 active+clean
 
  io:
    client:   1.3 KiB/s rd, 993 KiB/s wr, 0 op/s rd, 45 op/s wr

Code:
:~# ha-manager status
quorum OK
master phy-hv-sl-0050 (active, Fri Sep 24 21:23:04 2021)
lrm phy-hv-sl-0047 (old timestamp - dead?, Fri Sep 24 19:49:03 2021)
lrm phy-hv-sl-0048 (active, Fri Sep 24 21:23:04 2021)
lrm phy-hv-sl-0049 (active, Fri Sep 24 21:23:06 2021)
lrm phy-hv-sl-0050 (active, Fri Sep 24 21:23:06 2021)
lrm phy-hv-sl-0051 (active, Fri Sep 24 21:23:01 2021)
service ct:100 (phy-hv-sl-0051, started)
service ct:102 (phy-hv-sl-0048, started)
service ct:104 (phy-hv-sl-0050, started)
service ct:107 (phy-hv-sl-0051, started)
service vm:103 (phy-hv-sl-0048, started)
service vm:106 (phy-hv-sl-0049, started)
 
Interestingly, from another node I can see the conf file for the machines that did not move, still sitting in the offline node's directory. These machines did not have any local resources and are using shared Ceph storage for their disks.

In the ha-manager status above they aren't listed, even though I know every machine was set up for HA before this node went offline.

Code:
root@phy-hv-sl-0048:/# ls /etc/pve/nodes/phy-hv-sl-0047/qemu-server/
105.conf
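
For reference, a quick way to list which guests the HA stack actually has under management (ha-manager config just reads /etc/pve/ha/resources.cfg, so this is only a convenience check):

Code:
# list the configured HA resources and their groups/states
ha-manager config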

Is there a way to force move these for now? Still wondering why they didn't come back up when the node went offline.
 
Guess I'm restoring from backup! I'll play around with the HA simulator to figure out what's going on.
 
It seems you did not put them under HA? (This does not happen automatically.)

service ct:100 (phy-hv-sl-0051, started)
service ct:102 (phy-hv-sl-0048, started)
service ct:104 (phy-hv-sl-0050, started)
service ct:107 (phy-hv-sl-0051, started)
service vm:103 (phy-hv-sl-0048, started)
service vm:106 (phy-hv-sl-0049, started)
Those are the HA-managed guests; 101 and 105 are missing...
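
If they are supposed to be HA-managed, they would need to be (re-)added, e.g. with something along these lines once the guests are available again (resource IDs taken from this thread; group assignment is optional):

Code:
# put the missing guests back under HA management
ha-manager add ct:101 --state started
ha-manager add vm:105 --state started
# optionally also assign an existing HA group with: --group <groupname>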

Is there a way to force move these for now? Still wondering why they didn't come back up when the node went offline.
Yes, there is a way, but only do this if you know what you're doing and have verified that the node they were on is really not running anymore and that the storage exists on the target node.
If you're not careful, you can corrupt your guests!

Code:
# mv /etc/pve/nodes/OLDNODE/qemu-server/ID.conf /etc/pve/nodes/NEWNODE/qemu-server/ID.conf
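
As a concrete example for this thread (a sketch only, assuming phy-hv-sl-0047 is confirmed powered off and VM 105 should come up on phy-hv-sl-0048):

Code:
# check that the remaining cluster still has quorum
pvecm status

# make absolutely sure the old node is really down (powered off / fenced),
# otherwise the same guest could end up running twice and corrupt its disk.
# Then move the config to the new node:
mv /etc/pve/nodes/phy-hv-sl-0047/qemu-server/105.conf \
   /etc/pve/nodes/phy-hv-sl-0048/qemu-server/105.conf

# and start it there
qm start 105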
 
Yeah, that's the thing: they were already configured for HA, but when that node went offline they disappeared from the ha-manager status list.

I got the hardware fixed yesterday on the crashed node, booted it up, and the VMs came back online. They then appeared back in ha-manager status without me doing anything.
 
Hmm, that sounds weird. Can you post the logs from the nodes and the content of /etc/pve/ha/resources.cfg?
 
Absolutely, what log file would be helpful? Here's resources.cfg

Code:
:~# cat /etc/pve/ha/resources.cfg
ct: 100
    group HA
    state started

ct: 102
    group HA
    state started

ct: 104
    group HA
    state started

vm: 103
    group HA
    state started

vm: 106
    group HA
    state started

ct: 107
    group HA
    state started

ct: 101
    group HA
    state started

ct: 108
    group HA
    state started

vm: 105
    group HA
    state started
 
OK, the resources.cfg looks normal AFAICS.

what log file would be helpful?
Syslog (or journalctl output) from the time the node went offline until a few minutes after, ideally from all remaining nodes.
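
Something along these lines should capture the relevant window on each remaining node (the timestamps are only a guess based on the dead LRM timestamp above; adjust them to the actual outage window):

Code:
# run on every remaining node; adjust --since/--until as needed
journalctl --since "2021-09-24 19:40" --until "2021-09-24 21:30" \
    -u pve-cluster -u corosync -u pve-ha-crm -u pve-ha-lrm > /tmp/ha-log-$(hostname).txt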
 
