This also stops backups from running once the backup job reaches the failed VM. Any chance there's some forward motion on this? Anything I can do on my end to help?
[CODE]
root@pve-4:~# qm status 150
status: running
root@pve-4:~# qm terminal 150
unable to find a serial interface
root@pve-4:~# qm reset 150
VM 150 qmp command 'system_reset' failed - unable to connect to VM 150 qmp socket - timeout after 31 retries
root@pve-4:~# qm stop 150
VM quit/powerdown failed - terminating now with SIGTERM
VM still running - terminating now with SIGKILL
root@pve-4:~# ps ax | grep 150
10097 pts/0    S+     0:00 grep 150
root@pve-4:~# qm start 150
timeout waiting on systemd
root@pve-4:~#
root@pve-4:~# qm migrate 150 pve-3
2020-05-29 09:41:12 starting migration of VM 150 to node 'pve-3' (192.168.100.112)
^Ccommand '/usr/bin/qemu-img info '--output=json' /mnt/pve/Slow-NAS/images/150/vm-150-disk-0.qcow2' failed: interrupted by signal
could not parse qemu-img info command output for '/mnt/pve/Slow-NAS/images/150/vm-150-disk-0.qcow2'
2020-05-29 09:46:55 migration finished successfully (duration 00:05:43)
root@pve-4:~#
[/CODE]
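For what it's worth, since the error points at the VM's systemd scope, one thing worth trying before rebooting the node (a hedged sketch, assuming the usual "<VMID>.scope" unit name that qemu-server creates for guests; 150 is just the VMID from the output above) is to inspect and clear the stale scope by hand:

[CODE]
# Check whether the scope unit for the VM is still lingering:
systemctl status 150.scope

# If the scope is still around although no kvm process is left, stop it
# and clear any "failed" state so "qm start" can create a fresh scope:
systemctl stop 150.scope
systemctl reset-failed 150.scope
[/CODE]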
[CODE]
root@hv01:~# pvecm status
Cluster information
-------------------
Name:             pvecluchaosinc
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Jun  4 15:24:01 2020
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.289
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.99.0.99
0x00000002          1 10.99.0.1 (local)
0x00000003          1 10.99.0.9
[/CODE]
[CODE]
root@hv01:~# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.41-1-pve)
pve-manager: 6.2-4 (running version: 6.2-4/9824574a)
pve-kernel-5.4: 6.2-2
pve-kernel-helper: 6.2-2
pve-kernel-5.4.41-1-pve: 5.4.41-1
ceph-fuse: 12.2.13-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-2
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-1
pve-cluster: 6.1-8
pve-container: 3.1-6
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-2
pve-qemu-kvm: 5.0.0-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1
[/CODE]
[QUOTE]We cannot reproduce this, and do not understand how it could still happen with current Proxmox VE 6.2.[/QUOTE]
Hey Proxmox folks... what can I do to help you solve this issue?
TASK ERROR: timeout waiting on systemd
[QUOTE]What is your VM OS, and what happened on it before the issue?[/QUOTE]
I have the same issue with one of my VMs.
My setup is:
2 x Intel(R) Xeon(R) Gold 6130
768 GB of RAM
Proxmox 6.2 with latest updates from apt
Disks are LVM-Thin
Do you want access to my host to check the error?
TASK ERROR: timeout waiting on systemd
[QUOTE]It happens when a guest OS fails/hangs and stops responding to the host, caused by bugs like the zerocopy issue, vswitch driver kernel panics, etc.[/QUOTE]
We cannot reproduce this, and do not understand how it could still happen with current Proxmox VE 6.2.
We checked the VM systemd scopes really closely, and the timeouts are set such that running into them would normally indicate that something is extremely slow (close to hanging).
While investigating this problem closely during the 5.x release, we came to the conclusion that there can be a race/timing issue, due to how systemd behaves, if we only trust the "systemctl" command. This led to developing a solution which talks over D-Bus directly to systemd to poll the current status in a safe way.
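For reference, you can exercise that D-Bus route by hand with busctl; the following is only an illustrative sketch of the mechanism (the unit name 150.scope is an assumed example, not something stated in this post):

[CODE]
# Ask systemd's manager for the D-Bus object path of the unit
# (GetUnit fails if the unit is not currently loaded):
busctl call org.freedesktop.systemd1 /org/freedesktop/systemd1 \
    org.freedesktop.systemd1.Manager GetUnit s 150.scope

# Then poll the unit's state directly over D-Bus, instead of parsing
# "systemctl" output (substitute the object path returned above):
busctl get-property org.freedesktop.systemd1 <object-path-from-above> \
    org.freedesktop.systemd1.Unit ActiveState
[/CODE]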
With this in place we could not reproduce the issue at all anymore, and we're talking hundreds of machines, production and testing, running all sorts of tasks which go through this code path.
For now, I can only recommend ensuring your setups are updated to the latest 6.x release, that nothing weird is in the logs, and that nothing is actually hanging, which would make this message just a side effect. NFS in particular is prone to getting processes stuck in the infamous "D" (uninterruptible IO) state if the network or the share goes down.
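A quick, generic way to check for such hangs (not Proxmox-specific) is to list processes stuck in uninterruptible sleep:

[CODE]
# Show processes in "D" state together with the kernel function they
# are blocked in; dead NFS mounts typically show up here:
ps axo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'
[/CODE]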
Hints about anything done out of the ordinary on your setup(s) could help us reproduce this, and we would be happy to hear them.
[QUOTE]It happens when a guest OS fails/hangs and stops responding to the host, caused by bugs like the zerocopy issue, vswitch driver kernel panics, etc.[/QUOTE]
Yeah, I mean in that case this error is totally expected, and is only a side effect of the real error: your VM freezing.
It seems that the host can't correctly stop all of the frozen VM's processes, and it keeps the guest's previous "running" status until the node is rebooted.
So are only the systemd scopes involved in this?