[SOLVED] TASK ERROR: start failed... got timeout

Nov 14, 2019
Hello,

we are having problems recovering from a node outage.

The scenario is:
  • 4 nodes
  • a VM on node A (part of a HA group)
  • we cut off power to node A
  • after a while the VM is migrated to node B
  • the start task on the new node fails (see the error below), but the VM status shows "running" and the HA state is "started"
  • no answer to ping
  • no VNC
Code:
TASK ERROR: start failed: command '/usr/bin/kvm -id 102 -name debian1 -chardev 'socket,id=qmp,path=/var/run/qemu-server/102.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/102.pid -daemonize -smbios 'type=1,uuid=1ff44064-a6a1-4c53-8ca4-a1952157d65e' -smp '4,sockets=2,cores=2,maxcpus=4' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/102.vnc,password -cpu kvm64,enforce,+kvm_pv_eoi,+kvm_pv_unhalt,+lahf_lm,+sep -m 2048 -object 'iothread,id=iothread-virtioscsi0' -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'pci-bridge,id=pci.3,chassis_nr=3,bus=pci.0,addr=0x5' -device 'vmgenid,guid=516f5b55-7636-47d5-ba06-dd468370d4ce' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'VGA,id=vga,bus=pci.0,addr=0x2' -chardev 'socket,path=/var/run/qemu-server/102.qga,server,nowait,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:abaf84a6f7c7' -drive 'file=/mnt/pve/cephfs/template/iso/debian-10.6.0-amd64-netinst.iso,if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'virtio-scsi-pci,id=virtioscsi0,bus=pci.3,addr=0x1,iothread=iothread-virtioscsi0' -drive 'file=rbd:rbd_pool/vm-102-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/rbd_pool.keyring,if=none,id=drive-scsi0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=virtioscsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,rotation_rate=1,bootindex=100' -netdev 
'type=tap,id=net0,ifname=tap102i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=1E:A4:4B:7A:57:DB,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -machine 'type=pc+pve0'' failed: got timeout

After a long time (~13 minutes) the VM finally starts properly. The logs contain many entries like:

Code:
Oct 13 11:20:49 px01 pvedaemon[2594]: VM 102 qmp command failed - VM 102 qmp command 'guest-ping' failed - unable to connect to VM 102 qga socket - timeout after 31 retries

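To see whether the guest agent socket answers at all, it can be probed by hand with the qm CLI (a sketch; the exact subcommand syntax may differ between PVE versions, so check `qm help`):

```shell
#!/bin/sh
# Sketch: ping the QEMU guest agent of VM 102 manually to see whether
# the qga socket responds (same check the "guest-ping" log line refers to).
VMID=102
# Guarded so this is a no-op on machines without the qm CLI.
if command -v qm >/dev/null; then
    qm agent "$VMID" ping \
        && echo "qga on VM $VMID is reachable" \
        || echo "no response from qga on VM $VMID"
fi
```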
This issue seems related to https://forum.proxmox.com/threads/task-error-start-failed.72450/, but I'm not sure.

Could you help us?

Cedric
 
I also have this "got timeout" issue sometimes. In my case it happens only with VMs that use PCI passthrough, usually after the hypervisor has been running for quite a while. I don't know why that is, but I have a workaround that always seems to work.

What I do is copy the entire kvm command from the error message and paste it into my Proxmox shell. Pay attention to the quotes in the command (copy from after the first quote until just before the last one).

Running the command by hand gives the VM as much time as it needs to start, with no timeout.

Hope this helps anybody.
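A similar effect can probably be had without copy-pasting the raw kvm command, since `qm start` accepts a `--timeout` option in seconds (a sketch; verify the option with `qm help start` on your PVE version):

```shell
#!/bin/sh
# Sketch: start the VM with a much longer start timeout instead of
# re-running the raw kvm command by hand.
VMID=102
TIMEOUT=$((13 * 60))    # ~13 minutes, matching the delay reported above
# Guarded so this is a no-op on machines without the qm CLI.
if command -v qm >/dev/null; then
    qm start "$VMID" --timeout "$TIMEOUT"
fi
```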
 
Sounds like Proxmox cannot free memory fast enough to pin all of the VM's memory into actual host RAM (required because of passthrough). Starting the VM with less memory, stopping another VM, and/or reducing the ZFS ARC size might also help.
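Capping the ARC is done through the `zfs_arc_max` module parameter. A minimal sketch, assuming an 8 GiB cap is appropriate for the host (size it for your own workload):

```shell
#!/bin/sh
# Sketch: cap the ZFS ARC so it stops holding memory the VM needs.
# 8 GiB is an arbitrary example value; choose one that fits your host.
ARC_MAX=$((8 * 1024 * 1024 * 1024))
# Runtime change (resets on reboot); no-op on hosts without ZFS loaded.
if [ -w /sys/module/zfs/parameters/zfs_arc_max ]; then
    echo "$ARC_MAX" > /sys/module/zfs/parameters/zfs_arc_max
fi
# To persist across reboots, put
#   options zfs zfs_arc_max=8589934592
# into /etc/modprobe.d/zfs.conf and refresh the initramfs.
```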
 
The VM uses 16 GB of memory, and indeed all memory is taken by the ZFS ARC. I'm wondering why this only seems to happen with PCI passthrough. Is memory handled differently then?
 
Because PCI(e) devices can perform DMA at any time without warning, all VM memory needs to be pinned into actual host RAM. Ballooning therefore won't work, but memory hot-(un)plug does.
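The pinning can be observed from the host: the kernel reports locked memory per process as `VmLck` in `/proc/<pid>/status`. A sketch, assuming the standard PVE pidfile location:

```shell
#!/bin/sh
# Sketch: show how much of the VM's memory is currently locked (pinned)
# in host RAM. With PCI passthrough, VmLck should grow to roughly the
# full VM memory size during startup.
VMID=102
PIDFILE="/var/run/qemu-server/${VMID}.pid"   # standard PVE pidfile path
if [ -r "$PIDFILE" ]; then
    grep VmLck "/proc/$(cat "$PIDFILE")/status"
fi
```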
 
How did you solve it? I have the same problem: my virtual machine also takes about 12 minutes to start.
 
