Startup of VMs fails because of timeout

Dec 19, 2019
Hi all,

Since upgrading to Proxmox 6, I sometimes have problems starting VMs. In the web interface it looks like this:
TASK ERROR: start failed: command '/usr/bin/kvm -id 105 -name test3 ... -machine 'type=pc'' failed: got timeout

When I invoke 'qm showcmd 105 | bash', it just takes some time, then the VM starts and everything works fine.

I use CIFS over 10GbE in a cluster of 4 nodes with about 10-12 VMs. The storage system is a Synology with 4 spinning HDDs and an SSD r/w cache; overall storage performance looks good.
It doesn't happen with local storage. I haven't been able to reproduce the problem reliably, but it occurs very regularly. Once the problem shows up for one VM, starting all other VMs fails as well. When I start the VM manually via showcmd and then shut it down, I can usually start it again via the Web UI without any problems. A day (or a few hours) later the problem is back.

I tried to trace the issue and thought it might be related to storage I/O performance, but I have not been able to confirm that so far. I used cifsiostat and saw only one I/O request in the time between invoking the kvm process and the actual startup of the VM. dd on the hypervisor shows stable, good throughput to the mount point without any delays.
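
Since dd throughput looks fine, a latency check on a single small read might be more telling than a throughput test. A rough sketch follows; the helper name probe_read is made up for illustration, and it assumes GNU dd and GNU date (for iflag= and %N) as shipped with Debian:

```shell
# probe_read FILE [IFLAG]
# Hypothetical helper: reads the first 4 KiB of FILE once and prints how long
# that took, in milliseconds. IFLAG is passed to dd's iflag= (e.g. "direct" to
# bypass the page cache, similar in spirit to the cache=none drive setting).
probe_read() (
    file=$1
    flag=${2:-}
    start=$(date +%s%N)
    if [ -n "$flag" ]; then
        dd if="$file" of=/dev/null bs=4k count=1 iflag="$flag" 2>/dev/null || exit 1
    else
        dd if="$file" of=/dev/null bs=4k count=1 2>/dev/null || exit 1
    fi
    end=$(date +%s%N)
    # nanoseconds to milliseconds
    echo $(( (end - start) / 1000000 ))
)
```

For example, 'probe_read /mnt/pve/HDMZ/images/105/vm-105-disk-0.qcow2 direct', run once while starts work and once while they time out, would show whether the first read on the share is what stalls.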

With Debian Stretch and Proxmox 5 the problem did not occur. Any ideas or hints on how to proceed with troubleshooting, or where to look?

Detailed error message:
TASK ERROR: start failed: command '/usr/bin/kvm -id 105 -name test3 -chardev 'socket,id=qmp,path=/var/run/qemu-server/105.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/105.pid -daemonize -smbios 'type=1,uuid=393adff8-6e37-4c6e-b55a-6123235839bd' -smp '1,sockets=1,cores=1,maxcpus=1' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/105.vnc,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 512 -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'vmgenid,guid=971a536c-8d1b-4d2d-9826-157893157b3c' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'VGA,id=vga,bus=pci.0,addr=0x2' -chardev 'socket,path=/var/run/qemu-server/105.qga,server,nowait,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:ad17a1a73553' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=/mnt/pve/HDMZ/images/105/vm-105-disk-0.qcow2,if=none,id=drive-scsi0,format=qcow2,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap105i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=7A:47:9B:35:55:34,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -machine 'type=pc'' failed: got timeout
 
When I invoke 'qm showcmd 105 | bash', it just takes some time, then the VM starts and everything works fine.

What is "some time", roughly? 10 seconds? A minute? Multiple minutes?

Are you running the latest 6.1, with the 5.3-based kernel booted?
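
To put a number on "some time", a small wrapper like the one below could be used. It is only a sketch: time_cmd is a made-up name, it measures whole seconds, and it discards the output of the wrapped command.

```shell
# time_cmd COMMAND [ARGS...]
# Hypothetical helper: runs the command, discards its output, and prints the
# elapsed wall-clock seconds together with the command's exit status.
time_cmd() (
    start=$(date +%s)
    "$@" >/dev/null 2>&1 && rc=0 || rc=$?
    end=$(date +%s)
    echo "elapsed=$((end - start))s exit=$rc"
)
```

Called as "time_cmd sh -c 'qm showcmd 105 | bash'", it reports how long the manual start takes until the kvm command returns, which can then be compared across a working and a failing period.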
 
It varies: sometimes about 10 seconds, sometimes about 30 seconds, sometimes about a minute, but not more.

root@proxmox1:~# pveversion
pve-manager/6.0-15/52b91481 (running kernel: 5.0.21-5-pve)
root@proxmox1:~# uname -a
Linux proxmox1 5.0.21-5-pve #1 SMP PVE 5.0.21-10 (Wed, 13 Nov 2019 08:27:10 +0100) x86_64 GNU/Linux
 

You should definitely update. 6.0 was a new release; I'd move to 6.1 and see how things are.
 
I upgraded to 6.1, same picture :-(

root@proxmox1:~# pveversion
pve-manager/6.1-5/9bf06119 (running kernel: 5.3.13-1-pve)
root@proxmox1:~# uname -a
Linux proxmox1 5.3.13-1-pve #1 SMP PVE 5.3.13-1 (Thu, 05 Dec 2019 07:18:14 +0100) x86_64 GNU/Linux