HA failover not working

fractalv

Member
Jan 30, 2020
2
0
6
63
I built a 3 node cluster in Proxmox VE 6.1, 2 nodes sharing Ceph, 3rd node for quorum purposes. When killing one of the primary nodes, VMs transfer to the other node in the GUI, but cannot be pinged or console displayed until the killed node is restored and the VMs return. Manual migration works correctly as expected, only HA migration has this problem. Where can I begin to troubleshoot this problem?
 
When killing one of the primary nodes, VMs transfer to the other node in the GUI

How do you "kill" the node?

Is the VM responsive over the noVNC console through the webinterface (need to connect through the online node's webinterface).

2 nodes sharing Ceph

Only two? Ceph requires at least three nodes to work in a HA environment. Also for quorum purpose.
Else it will get read only as soon as a "primary", as you call it, node goes down and the VMs cannot work as expected.
That'd explain your symptoms.
 
Thank you for the explanation. Ok, installed Ceph on node 3, upgraded nodes 1 and 2 so all are the same version. Added node3 monitor and manager. Allowed Ceph to finish distribution, everything green.

I "kill" node 1 by disconnecting the DCA cable connecting node1 to the switch. VMs migrate to node2 as they should, but an error like the one below appears in the task history for each VM and the VMs are unreachable:

TASK ERROR: start failed: command '/usr/bin/kvm -id 102 -name {VM name} -chardev 'socket,id=qmp,path=/var/run/qemu-server/102.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/102.pid -daemonize -smbios 'type=1,uuid={uuid number}' -smp '6,sockets=1,cores=6,maxcpus=6' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/102.vnc,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 8192 -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'vmgenid,guid={guid number}' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'VGA,id=vga,bus=pci.0,addr=0x2' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:23534b4d8f20' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'ahci,id=ahci0,multifunction=on,bus=pci.0,addr=0x7' -drive 'file=rbd:ceph_data/vm-102-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/ceph_data.keyring,if=none,id=drive-sata0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'ide-hd,bus=ahci0.0,drive=drive-sata0,id=sata0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap102i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac={mac addr},netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -netdev 'type=tap,id=net1,ifname=tap102i1,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac={mac addr},netdev=net1,bus=pci.0,addr=0x13,id=net1,bootindex=301' -netdev 'type=tap,id=net2,ifname=tap102i2,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac={mac addr},netdev=net2,bus=pci.0,addr=0x14,id=net2,bootindex=302' -machine 'type=pc+pve1'' failed: got timeout

Thank you so much for the help.
 
Thank you for the explanation. Ok, installed Ceph on node 3, upgraded nodes 1 and 2 so all are the same version. Added node3 monitor and manager. Allowed Ceph to finish distribution, everything green.

I "kill" node 1 by disconnecting the DCA cable connecting node1 to the switch. VMs migrate to node2 as they should, but an error like the one below appears in the task history for each VM and the VMs are unreachable:

TASK ERROR: start failed: command '/usr/bin/kvm -id 102 -name {VM name} -chardev 'socket,id=qmp,path=/var/run/qemu-server/102.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/102.pid -daemonize -smbios 'type=1,uuid={uuid number}' -smp '6,sockets=1,cores=6,maxcpus=6' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/102.vnc,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 8192 -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'vmgenid,guid={guid number}' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'VGA,id=vga,bus=pci.0,addr=0x2' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:23534b4d8f20' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'ahci,id=ahci0,multifunction=on,bus=pci.0,addr=0x7' -drive 'file=rbd:ceph_data/vm-102-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/ceph_data.keyring,if=none,id=drive-sata0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'ide-hd,bus=ahci0.0,drive=drive-sata0,id=sata0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap102i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac={mac addr},netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -netdev 'type=tap,id=net1,ifname=tap102i1,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac={mac addr},netdev=net1,bus=pci.0,addr=0x13,id=net1,bootindex=301' -netdev 'type=tap,id=net2,ifname=tap102i2,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac={mac addr},netdev=net2,bus=pci.0,addr=0x14,id=net2,bootindex=302' -machine 'type=pc+pve1'' failed: got timeout

Thank you so much for the help.
were you able to solve it? it happens the same to me.
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!