root@pve-4:~# qm status 150
status: running
root@pve-4:~# qm terminal 150
unable to find a serial interface
root@pve-4:~# qm reset 150
VM 150 qmp command 'system_reset' failed - unable to connect to VM 150 qmp socket - timeout after 31 retries
root@pve-4:~# qm stop 150
VM quit/powerdown failed - terminating now with SIGTERM
VM still running - terminating now with SIGKILL
root@pve-4:~# ps ax | grep 150
10097 pts/0 S+ 0:00 grep 150
root@pve-4:~# qm start 150
timeout waiting on systemd
root@pve-4:~#
root@pve-4:~# qm migrate 150 pve-3
2020-05-29 09:41:12 starting migration of VM 150 to node 'pve-3' (192.168.100.112)
^Ccommand '/usr/bin/qemu-img info '--output=json' /mnt/pve/Slow-NAS/images/150/vm-150-disk-0.qcow2' failed: interrupted by signal
could not parse qemu-img info command output for '/mnt/pve/Slow-NAS/images/150/vm-150-disk-0.qcow2'
2020-05-29 09:46:55 migration finished successfully (duration 00:05:43)
root@pve-4:~#
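In case it helps anyone else hitting "timeout waiting on systemd": a rough, unofficial check is whether a stale systemd scope for the VM is still registered (this assumes PVE names the scope "<vmid>.scope", e.g. 150.scope).
Code:
systemctl status 150.scope
# if the scope is left over in a failed state, stopping/resetting it may let
# "qm start" register a fresh scope - verify on your setup before running
systemctl stop 150.scope
systemctl reset-failed 150.scope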
root@hv01:~# pvecm status
Cluster information
-------------------
Name: pvecluchaosinc
Config Version: 3
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Thu Jun 4 15:24:01 2020
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1.289
Quorate: Yes
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.99.0.99
0x00000002 1 10.99.0.1 (local)
0x00000003 1 10.99.0.9
root@hv01:~# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.41-1-pve)
pve-manager: 6.2-4 (running version: 6.2-4/9824574a)
pve-kernel-5.4: 6.2-2
pve-kernel-helper: 6.2-2
pve-kernel-5.4.41-1-pve: 5.4.41-1
ceph-fuse: 12.2.13-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-2
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-1
pve-cluster: 6.1-8
pve-container: 3.1-6
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-2
pve-qemu-kvm: 5.0.0-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1
Hey Proxmox folk... What can I do to help you solve this issue?
TASK ERROR: timeout waiting on systemd
I have the same issue with one of my VMs.
My setup is:
HP DL380G10:
2 x Intel(R) Xeon(R) Gold 6130
768 GB of RAM
Proxmox 6.2 with latest updates from apt
Single Node
Disks are LVM-Thin
Do you want access to my host to check the error?
Code:
TASK ERROR: timeout waiting on systemd
At first I only used LVM-Thin, and now RAIDZ1. For those who have these problems - do you use hardware RAID?
We cannot reproduce this, and do not understand how it could still happen with current Proxmox VE 6.2.
We check the VM systemd scopes very closely, and the timeouts are set so that running into them would normally indicate that something is extremely slow (close to hanging).
While investigating this problem closely during the 5.x release, we came to the conclusion that there can be a race/timing issue due to how systemd behaves if we only trust the "systemctl" command. This led to developing a solution which talks to systemd directly over DBus to poll the current status in a safe way:
https://git.proxmox.com/?p=pve-comm...d7e877a9fd1910daf4e7cd937aa4bca8;hb=HEAD#l142
With this we could not reproduce the issue at all anymore, and we're talking about hundreds of machines, production and testing, running various numbers of tasks which go through this code path.
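For illustration only (the real logic is the Perl code linked above; "150.scope" is just an assumed example of a VM's scope unit name), the idea boils down to asking systemd itself for the unit state over the bus instead of parsing "systemctl" output.
Code:
# resolve the unit's object path via the systemd manager D-Bus interface
busctl call org.freedesktop.systemd1 /org/freedesktop/systemd1 \
    org.freedesktop.systemd1.Manager GetUnit s 150.scope
# then read its state properties ("systemctl show" queries the same data)
systemctl show 150.scope --property=ActiveState,SubState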
For now, I can only recommend ensuring your setups are updated to the latest 6.x release, that nothing weird is in the logs, and that nothing is really hanging which would make this message just a side effect - NFS in particular is prone to getting into the infamous "D" (uninterruptible IO) state if the network or the share is down.
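A quick, unofficial way to spot processes stuck in that state (the ps column selection is just one option, and the storage path is only the example from the transcript above):
Code:
# list processes whose state starts with "D" (uninterruptible IO)
ps axo pid,stat,wchan:32,cmd | awk 'NR==1 || $2 ~ /^D/'
# and check whether the share still answers at all
timeout 5 ls /mnt/pve/Slow-NAS >/dev/null && echo OK || echo "storage not responding"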
Hints about anything out of the ordinary done on your setup(s) could help us reproduce this, and we would be happy to hear them.
It happens when a guest OS fails/hangs and stops responding to the host, caused by bugs like the zerocopy issue, vswitch driver kernel panics, etc.
It seems that the host can't correctly stop all of the frozen VM's processes and keeps the guest's previous running status until the node is rebooted.
So is it only the systemd scopes acting on this?
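To check for such a leftover guest process, something along these lines may help (VMID 150 used as an example; matching on "-id" assumes the usual PVE KVM command line):
Code:
# look for a KVM process still carrying the VM's id
ps aux | grep '[k]vm .*-id 150'
# and check whether its systemd scope is still registered
systemctl status 150.scope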
Up to date, and still having issues. PVE has gone from a remarkably stable product to ultra flakey. Thanks systemd.