5.1 HA Cluster failed on single node failure

Haider

Hello all,

I have a 4-node cluster with multiple OSDs, 9 per node. I have the pool set to 2/2 (size/min_size).

Recently one of the nodes (the master) failed. All VMs were migrated to node2, but none of them came up. More strangely, my VMs on node3 and node4 also stopped working. I must have made some huge mistake in my setup.

This is not how it is supposed to work.

Here are the relevant syslog entries from the event:

Apr 17 23:41:11 hosting-b pve-ha-lrm[3963]: start failed: command '/usr/bin/kvm -id 101 -chardev 'socket,id=qmp,path=/var/run/qemu-server/101.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -pidfile /var/run/qemu-server/101.pid -daemonize -smbios 'type=1,uuid=dda46807-1e54-492a-af87-c1a9aef58e7f' -name HybridITServices -smp '2,sockets=2,cores=1,maxcpus=2' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga std -vnc unix:/var/run/qemu-server/101.vnc,x509,password -cpu kvm64,+lahf_lm,+sep -m 20480 -k en-us -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:303b856cc6' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=100' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=rbd:rbd/vm-101-disk-1:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/rbd_vm.keyring,if=none,id=drive-scsi0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=200' -netdev 'type=tap,id=net0,ifname=tap101i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown' -device 'e1000,mac=F6:1F:51:C8:F9:24,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -machine 'accel=tcg'' failed: got timeout

Apr 17 23:41:11 hosting-b pve-ha-lrm[3960]: <root@pam> end task UPID:hosting-b:00000F7B:00090AE8:5AD6E869:qmstart:101:root@pam: start failed: command '/usr/bin/kvm -id 101 -chardev 'socket,id=qmp,path=/var/run/qemu-server/101.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -pidfile /var/run/qemu-server/101.pid -daemonize -smbios 'type=1,uuid=dda46807-1e54-492a-af87-c1a9aef58e7f' -name HybridITServices -smp '2,sockets=2,cores=1,maxcpus=2' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga std -vnc unix:/var/run/qemu-server/101.vnc,x509,password -cpu kvm64,+lahf_lm,+sep -m 20480 -k en-us -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:303b856cc6' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=100' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=rbd:rbd/vm-101-disk-1:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/rbd_vm.keyring,if=none,id=drive-scsi0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=200' -netdev 'type=tap,id=net0,ifname=tap101i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown' -device 'e1000,mac=F6:1F:51:C8:F9:24,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -machine 'accel=tcg'' failed: got timeout

Apr 17 23:41:11 hosting-b pve-ha-lrm[3960]: service status vm:101 started

Apr 17 23:41:11 hosting-b pve-ha-lrm[3966]: start failed: command '/usr/bin/kvm -id 104 -chardev 'socket,id=qmp,path=/var/run/qemu-server/104.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -pidfile /var/run/qemu-server/104.pid -daemonize -smbios 'type=1,uuid=68658ece-d058-411a-8944-be3b7bf8bc39' -name Hybrid-VPSv4-174-79-24-141 -smp '2,sockets=1,cores=2,maxcpus=2' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga std -vnc unix:/var/run/qemu-server/104.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 20480 -k en-us -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:303b856cc6' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=rbd:rbd/vm-104-disk-1:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/rbd_vm.keyring,if=none,id=drive-scsi0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap104i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown' -device 'e1000,mac=6A:56:69:D8:ED:89,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -netdev 'type=tap,id=net1,ifname=tap104i1,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown' -device 'e1000,mac=F6:F9:22:67:E2:0E,netdev=net1,bus=pci.0,addr=0x13,id=net1,bootindex=301'' failed: got timeout

Apr 17 23:41:11 hosting-b pve-ha-lrm[3964]: <root@pam> end task UPID:hosting-b:00000F7E:00090AEB:5AD6E869:qmstart:104:root@pam: start failed: command '/usr/bin/kvm -id 104 -chardev 'socket,id=qmp,path=/var/run/qemu-server/104.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -pidfile /var/run/qemu-server/104.pid -daemonize -smbios 'type=1,uuid=68658ece-d058-411a-8944-be3b7bf8bc39' -name Hybrid-VPSv4-174-79-24-141 -smp '2,sockets=1,cores=2,maxcpus=2' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga std -vnc unix:/var/run/qemu-server/104.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 20480 -k en-us -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:303b856cc6' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=rbd:rbd/vm-104-disk-1:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/rbd_vm.keyring,if=none,id=drive-scsi0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap104i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown' -device 'e1000,mac=6A:56:69:D8:ED:89,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -netdev 'type=tap,id=net1,ifname=tap104i1,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown' -device 'e1000,mac=F6:F9:22:67:E2:0E,netdev=net1,bus=pci.0,addr=0x13,id=net1,bootindex=301'' failed: got timeout
 
Recently one of the nodes (the master) failed. All VMs were migrated to node2, but none of them came up. More strangely, my VMs on node3 and node4 also stopped working. I must have made some huge mistake in my setup.
What did/does ceph status say?

A 2/2 setup means that as soon as Ceph cannot keep 2 full replicas of an object, it does not allow writes anymore. So with 9 of your 36 OSDs missing, the remaining 27 would have to have enough space to hold all data twice. Is that the case?
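
You can check this from any node. For example (assuming your pool is named rbd, as the RBD paths in your logs suggest):

ceph status
ceph df
ceph osd pool get rbd size
ceph osd pool get rbd min_size

ceph df shows raw and per-pool usage, and the two pool get calls show the current size/min_size of the pool.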

also this:
-machine 'accel=tcg'

Is it by design that you disabled KVM for this VM? That is not really good for performance.
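
You can see whether KVM is disabled in the VM config; a quick check (using VM 101 from your logs as the example) would be:

qm config 101 | grep kvm

If that prints "kvm: 0", hardware virtualization has been explicitly disabled for that VM.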
 
Ceph status was unhealthy, with all sorts of warnings.

My VMs migrate and work fine if I do it manually, but in this scenario that didn't happen.

Sorry for my ignorance, but where do I change this option?

-machine 'accel=tcg'

What is the recommended pool size, and how can I change the existing pool? Or can I destroy it and create a new one? This is all in production.
 
You set a 2/2 replication, which is the cause of your issues; this is expected behavior.

Make sure that all your nodes are up again and monitor the Ceph health status. As soon as everything is up again, change to a replication of 3/2 (which is the default).

If you change an essential setting like the replication, you really need to understand in detail what you are doing; otherwise please go with the defaults.
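
For reference, the replication can be changed in the GUI under the Ceph pool settings, or on the command line roughly like this (assuming the pool name rbd from your logs):

ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2

Raising the size triggers a rebalance, so expect recovery traffic until the cluster reports HEALTH_OK again.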
 
Thank you for your guidance. I have changed it to 3/2.

About the -machine 'accel=tcg' option: where do I change it to get better performance? Can I do it in production, and will it have an impact on existing VMs?
 
About the -machine 'accel=tcg' option: where do I change it to get better performance? Can I do it in production, and will it have an impact on existing VMs?
In the Options tab of the VM there is an option 'KVM hardware virtualization'. Enable this.

For this to take effect, you have to shut down the VM and then start it again; a reboot from inside the VM is not enough.
This is the default for VMs; I do not know why you turned it off in the first place (TCG should only be used for testing).
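
If you prefer the command line, roughly the following should do it (a sketch using VM 101 as the example; adapt the VMID):

qm set 101 --kvm 1
qm shutdown 101
qm start 101

qm set 101 --kvm 1 re-enables KVM hardware virtualization in the VM config; the shutdown/start cycle is needed for the setting to take effect.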
 
Oh I see, I thought that option was for nested virtualization. Thanks for clarifying and for all your help.
 
