[SOLVED] TASK ERROR: start failed... got timeout

some-admins · Oct 13, 2020

Hello,

We have problems recovering from a node outage.

The scenario is:

4 nodes
a VM on node A (part of a HA group)
we cut off power to node A
after a while the VM is migrated to node B
the start task on the new node fails (see error below), but the status shows "running" and the HA state shows "started"
no answer to ping
no VNC

Code:

TASK ERROR: start failed: command '/usr/bin/kvm -id 102 -name debian1 -chardev 'socket,id=qmp,path=/var/run/qemu-server/102.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/102.pid -daemonize -smbios 'type=1,uuid=1ff44064-a6a1-4c53-8ca4-a1952157d65e' -smp '4,sockets=2,cores=2,maxcpus=4' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/102.vnc,password -cpu kvm64,enforce,+kvm_pv_eoi,+kvm_pv_unhalt,+lahf_lm,+sep -m 2048 -object 'iothread,id=iothread-virtioscsi0' -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'pci-bridge,id=pci.3,chassis_nr=3,bus=pci.0,addr=0x5' -device 'vmgenid,guid=516f5b55-7636-47d5-ba06-dd468370d4ce' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'VGA,id=vga,bus=pci.0,addr=0x2' -chardev 'socket,path=/var/run/qemu-server/102.qga,server,nowait,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:abaf84a6f7c7' -drive 'file=/mnt/pve/cephfs/template/iso/debian-10.6.0-amd64-netinst.iso,if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'virtio-scsi-pci,id=virtioscsi0,bus=pci.3,addr=0x1,iothread=iothread-virtioscsi0' -drive 'file=rbd:rbd_pool/vm-102-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/rbd_pool.keyring,if=none,id=drive-scsi0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=virtioscsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,rotation_rate=1,bootindex=100' -netdev 
'type=tap,id=net0,ifname=tap102i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=1E:A4:4B:7A:57:DB,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -machine 'type=pc+pve0'' failed: got timeout

After a long time (~13 minutes) the VM starts properly. The logs contain many entries like this:

Code:

Oct 13 11:20:49 px01 pvedaemon[2594]: VM 102 qmp command failed - VM 102 qmp command 'guest-ping' failed - unable to connect to VM 102 qga socket - timeout after 31 retries

This issue seems related to https://forum.proxmox.com/threads/task-error-start-failed.72450/, but I'm not sure.

Could you help us?

Cedric

some-admins · Oct 15, 2020

Where can I modify timeouts like the one above?

some-admins · Oct 16, 2020

Changing size/min_size from 3/3 to 3/2 seems to fix the issue.
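
For readers asking where this change is made: the size/min_size pair presumably refers to the replication settings of the Ceph pool that backs the VM disk (the kvm command above references rbd_pool). That is an assumption, since the post does not say so explicitly. Below is a minimal sketch of why 3/3 could stall a VM start after a node loss, plus the ceph CLI calls that would inspect and change the value. The helper names are made up for illustration, and the model deliberately ignores recovery and backfill.

```python
from __future__ import annotations


def pool_accepts_io(size: int, min_size: int, replicas_lost: int) -> bool:
    """Return True if a placement group still serves I/O after losing replicas.

    Simplified model: Ceph blocks I/O on a placement group once the number
    of available replicas drops below min_size.
    """
    available = size - replicas_lost
    return available >= min_size


def min_size_commands(pool: str, min_size: int) -> list[str]:
    """Build (but do not run) the ceph CLI calls to inspect and change min_size."""
    return [
        f"ceph osd pool get {pool} size",
        f"ceph osd pool get {pool} min_size",
        f"ceph osd pool set {pool} min_size {min_size}",
    ]


if __name__ == "__main__":
    # One node down means one replica of the affected placement groups is gone.
    print("3/3 serves I/O:", pool_accepts_io(size=3, min_size=3, replicas_lost=1))
    print("3/2 serves I/O:", pool_accepts_io(size=3, min_size=2, replicas_lost=1))
    # Pool name taken from the kvm command line in the first post.
    for command in min_size_commands("rbd_pool", 2):
        print(command)
```

With 3/3, losing a single replica already puts the affected placement groups below min_size, so disk I/O would hang until Ceph has restored the third copy. That might explain the long delay before the VM came up. Keep in mind that min_size 2 lets the pool accept writes while only two copies exist.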

M4XWELL · Jan 25, 2022

some-admins said:
Changing size/min_size from 3/3 to 3/2 seems to fix the issue.

Where did you make this change? I have the same problem and the same error. If I disable KVM virtualization for the VM, it starts. My other VMs have KVM virtualization enabled and start fine.

stooovie · Jun 4, 2023

Solved how?

richieman · Jun 23, 2023

I also get this "got timeout" issue sometimes. In my case it happens only with VMs that use PCI passthrough, usually after the hypervisor has been running for quite a while. I don't know why that is, but I have a workaround that always seems to work.

What I do is copy the entire command from the error message and paste it into the Proxmox shell. Pay attention to the quotes in the command (copy from after the first quote until just before the last quote).

Run this way, the VM gets as much time as it needs to start, because the task timeout does not apply.

Hope this helps anybody.
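
The quote handling in this workaround is easy to get wrong by hand, so here is a small sketch that pulls the command out of the error text. It assumes the message has the shape shown in the first post (command '...' failed: got timeout). The function name is made up for illustration, and the sample message is a shortened version of the real one.

```python
def extract_command(task_error: str) -> str:
    """Return the text between the first and the last single quote.

    The command itself contains quoted arguments, so only the outermost
    pair of quotes is stripped.
    """
    start = task_error.index("'") + 1   # raises ValueError if no quote at all
    end = task_error.rindex("'")
    if end < start:
        raise ValueError("expected an opening and a closing quote")
    return task_error[start:end]


if __name__ == "__main__":
    # Shortened sample; the real message carries the full kvm command line.
    message = (
        "TASK ERROR: start failed: command "
        "'/usr/bin/kvm -id 102 -name debian1 -machine 'type=pc+pve0'' "
        "failed: got timeout"
    )
    print(extract_command(message))
```

The printed line is what would be pasted into a root shell on the node. The inner quotes are kept on purpose, because the shell needs them to group the arguments.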

leesteken · Jun 23, 2023

richieman said:
I also get this "got timeout" issue sometimes. In my case it happens only with VMs that use PCI passthrough, usually after the hypervisor has been running for quite a while. I don't know why that is, but I have a workaround that always seems to work.

What I do is copy the entire command from the error message and paste it into the Proxmox shell. Pay attention to the quotes in the command (copy from after the first quote until just before the last quote).

Run this way, the VM gets as much time as it needs to start, because the task timeout does not apply.

Sounds like Proxmox cannot free enough memory fast enough to pin all the VM memory in actual memory (because of passthrough). Starting the VM with less memory, stopping another VM and/or reducing the ZFS memory might also help.
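
To check whether this applies on a given host, one could compare the VM's memory with what the host can hand out right now and look at the current ARC size. The sketch below is only a rough heuristic. It assumes the usual Linux /proc/meminfo layout and the OpenZFS arcstats file at /proc/spl/kstat/zfs/arcstats. The function names are made up for illustration, and the 16384 MiB figure is just an example value.

```python
from pathlib import Path


def parse_meminfo_kib(text: str, key: str) -> int:
    """Return a /proc/meminfo value in KiB, for example key='MemAvailable'."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == key:
            return int(rest.split()[0])
    raise KeyError(key)


def parse_arc_size_bytes(text: str) -> int:
    """Return the current ARC size in bytes from arcstats text."""
    for line in text.splitlines():
        fields = line.split()  # columns: name, type, data
        if len(fields) == 3 and fields[0] == "size":
            return int(fields[2])
    raise KeyError("size")


def start_is_likely_slow(vm_mib: int, mem_available_kib: int) -> bool:
    """Heuristic: the VM needs more RAM than is available without reclaiming."""
    return vm_mib * 1024 > mem_available_kib


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    arcstats = Path("/proc/spl/kstat/zfs/arcstats")
    if meminfo.exists():
        available = parse_meminfo_kib(meminfo.read_text(), "MemAvailable")
        print("MemAvailable (KiB):", available)
        print("16384 MiB VM likely slow to start:", start_is_likely_slow(16384, available))
    if arcstats.exists():
        print("ARC size (bytes):", parse_arc_size_bytes(arcstats.read_text()))
```

If the ARC turns out to hold most of the RAM, capping it with the zfs_arc_max module parameter is the usual way to reduce ZFS memory use.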

richieman · Jul 4, 2023

leesteken said:
Sounds like Proxmox cannot free enough memory fast enough to pin all the VM memory in actual memory (because of passthrough). Starting the VM with less memory, stopping another VM and/or reducing the ZFS memory might also help.

The VM uses 16 GB of memory, and indeed all memory is taken by the ZFS ARC. I'm wondering why this only seems to happen when PCI passthrough is in use. Is memory handled differently then?

leesteken · Jul 4, 2023

richieman said:
The VM uses 16 GB of memory, and indeed all memory is taken by the ZFS ARC. I'm wondering why this only seems to happen when PCI passthrough is in use. Is memory handled differently then?

Because PCI(e) devices can do DMA at any time without warning, all VM memory needs to be pinned into actual host RAM. Ballooning therefore won't work but memory hot-(un)plug does.
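
To see which VMs this affects, one could scan a VM config for passthrough entries. The sketch below assumes the usual Proxmox config layout (key: value lines such as hostpci0: and memory:, with memory given as a plain integer in MiB, and snapshot sections starting with a [name] line). The function name and the sample config are made up for illustration.

```python
from __future__ import annotations

import re


def pinned_memory_mib(config_text: str) -> int | None:
    """Return the MiB that must be pinned up front, or None.

    None means no hostpci entry (or no memory entry) was found, so the
    full-pinning requirement described above would not apply.
    """
    memory = None
    has_passthrough = False
    for line in config_text.splitlines():
        if line.startswith("["):
            break  # snapshot sections follow the current config
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        if key == "memory":
            memory = int(value.strip())
        elif re.fullmatch(r"hostpci\d+", key):
            has_passthrough = True
    return memory if has_passthrough else None


if __name__ == "__main__":
    # Hypothetical config for a passthrough VM with 16 GiB of memory.
    sample = "balloon: 0\nhostpci0: 01:00.0,pcie=1\nmemory: 16384\nname: gpu-vm\n"
    print(pinned_memory_mib(sample))
```

A VM for which this returns a number needs that much host RAM free (or quickly reclaimable) at start time, which is where a large ARC gets in the way.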
