TASK ERROR: timeout waiting on systemd

Oct 1, 2020
We have been getting the error TASK ERROR: timeout waiting on systemd during the VM shutdowns/startups we perform for backups/snapshots. This is occurring on multiple PVE servers and multiple VMs (both Ubuntu and Windows guests) and has caused backups to fail. The issue has been discussed on the user forums for a while, but we didn't find a solution.

This began after our upgrade from 5.x to 6.x. It affects Linux systems (which are NOT running an NTFS file system) as well as Windows 10 systems. Because you have announced end-of-life for 5.x, downgrading to 5.x is apparently no longer an option. As a temporary measure we have written and cron'ed babysitting scripts, but this really should not be necessary.
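
For reference, a minimal sketch of such a babysitting script, assuming the standard qm CLI; the stopped-state filter and the vm-babysit logger tag are our own choices for illustration, not anything official:

Code:
#!/bin/sh
# Hypothetical cron'ed "babysitter": restart any VM that has stopped.
# Assumes every VM on this node is expected to be running; adjust the
# awk filter if some guests are intentionally powered off.
for vmid in $(qm list | awk 'NR>1 && $3 == "stopped" {print $1}'); do
    qm start "$vmid" || logger -t vm-babysit "retry failed for VM $vmid"
done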

Servers are HP ProLiant Gen9, with either single or dual Xeon processors and plenty of RAM, well above the total usage of all running VMs. One location is running Proxmox on an HP workstation PC. Most VMs are Ubuntu 18.04 or Windows 10 1909.
 
@wolfgang recently managed to reproduce this issue: if load on the D-Bus system bus is high, we can't talk to systemd and therefore can't start VMs. There are some settings that we can tweak to improve the situation; Wolfgang is working on finalizing proper patches.

Do you by chance have a service running that communicates a lot over D-Bus? Known causes we have found so far are udisks2 or prometheus running on the nodes. You can check with 'dbus-monitor --system'; if you are not currently starting/stopping VMs, it should be (near) silent.
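
For example (the match rule narrowing output to systemd is a standard dbus-monitor match expression, added here for convenience):

Code:
# Watch all traffic on the system bus; on an idle node this should be
# (near) silent.
dbus-monitor --system
# Optionally limit the output to messages addressed to systemd's manager:
dbus-monitor --system "destination='org.freedesktop.systemd1'"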
 
No, not in the VM, on the host ;) PVE talks to systemd via D-Bus when starting a VM; if D-Bus does not respond because of an overload situation, we can't start the VM.
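
You can test that path directly with busctl (part of systemd); this is just a sketch, and the 5 second timeout and the qemu.slice target are arbitrary illustration values:

Code:
# Ask the systemd manager for a unit over the system bus. If the bus is
# overloaded, this call stalls or hits the timeout instead of returning
# the unit's object path.
busctl call --timeout=5 org.freedesktop.systemd1 /org/freedesktop/systemd1 \
    org.freedesktop.systemd1.Manager GetUnit s qemu.slice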
 
We appear to be experiencing the same problem. We have some affected Windows VMs (primarily Active Directory related hosts, such as DCs and a dedicated Azure AD Connect instance), although we mostly host Linux VMs on the affected cluster. We are most probably not observing this problem as much on our other production clusters because the VMs there very rarely get stopped and started as a whole KVM guest (Windows Update restarts wouldn't cause the whole VM to be restarted).

I initially found troubleshooting steps in the following thread:
https://forum.proxmox.com/threads/vm-doesnt-start-proxmox-6-timeout-waiting-on-systemd.56218/

Code:
[admin@kvm1a ~]# systemctl status qemu.slice
● qemu.slice
   Loaded: loaded
   Active: active since Tue 2020-12-22 12:29:36 SAST; 1 weeks 4 days ago
    Tasks: 189
   Memory: 230.9G
   CGroup: /qemu.slice
     <snip>
           └─144.scope
             └─9941 [kvm]

Dec 22 12:31:57 kvm1a QEMU[4935]: kvm: warning: TSC frequency mismatch between VM (2499998 kHz) and host (2499999 kHz), and TSC scaling unavailable
Dec 22 12:31:57 kvm1a QEMU[4935]: kvm: warning: TSC frequency mismatch between VM (2499998 kHz) and host (2499999 kHz), and TSC scaling unavailable
Dec 22 12:31:57 kvm1a QEMU[4935]: kvm: warning: TSC frequency mismatch between VM (2499998 kHz) and host (2499999 kHz), and TSC scaling unavailable
Dec 22 12:31:57 kvm1a QEMU[4935]: kvm: warning: TSC frequency mismatch between VM (2499998 kHz) and host (2499999 kHz), and TSC scaling unavailable
Dec 22 12:31:57 kvm1a QEMU[4935]: kvm: warning: TSC frequency mismatch between VM (2499998 kHz) and host (2499999 kHz), and TSC scaling unavailable
Dec 22 12:36:30 kvm1a ovs-vsctl[9976]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port tap144i0
Dec 22 12:36:30 kvm1a ovs-vsctl[9976]: ovs|00002|db_ctl_base|ERR|no port named tap144i0
Dec 22 12:36:30 kvm1a ovs-vsctl[9977]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port fwln144i0
Dec 22 12:36:30 kvm1a ovs-vsctl[9977]: ovs|00002|db_ctl_base|ERR|no port named fwln144i0
Dec 22 12:36:30 kvm1a ovs-vsctl[9978]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl -- add-port vmbr0 tap144i0 tag=1 vlan_mode=dot1q-tunnel other-config:qi


All normally running VMs have a long list of options following the scope's KVM instance, and the command line typically starts as:
<PID> /usr/bin/kvm -id xxx -name yyy...
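
A quick way to spot affected guests (plain ps, nothing PVE-specific):

Code:
# List all kvm processes with state and command line: healthy VMs show
# "/usr/bin/kvm -id <vmid> ...", stuck ones show state "Z" and just "[kvm]".
ps -o pid,stat,cmd -C kvm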

The bare '[kvm]' identifies the process as defunct (a zombie), confirmed by the process information:

Code:
[admin@kvm1a ~]# ps -Flww -p 9941
F S UID          PID    PPID  C PRI  NI ADDR SZ WCHAN    RSS PSR STIME TTY          TIME CMD
7 Z admin       9941       1  4  80   0 -     0 -          0   2  2020 ?        14:13:38 [kvm] <defunct>


The process can't be killed, so one essentially has to migrate all other VMs off the host and then restart it:
Code:
[admin@kvm1a ~]# systemctl stop 144.scope;
[admin@kvm1a ~]# telinit u; # Restart init just in case it's causing the problem (highly unlikely)
[admin@kvm1a ~]# ps -Flww -p 9941;
F S UID          PID    PPID  C PRI  NI ADDR SZ WCHAN    RSS PSR STIME TTY          TIME CMD
7 Z admin       9941       1  4  80   0 -     0 -          0   2  2020 ?        14:13:38 [kvm] <defunct>
[admin@kvm1a ~]# kill -9 9941;
[admin@kvm1a ~]# ps -Flww -p 9941;
F S UID          PID    PPID  C PRI  NI ADDR SZ WCHAN    RSS PSR STIME TTY          TIME CMD
7 Z admin       9941       1  4  80   0 -     0 -          0   2  2020 ?        14:13:38 [kvm] <defunct>
[admin@kvm1a ~]# cd /proc/9941/task;
[admin@kvm1a task]# kill *;
[admin@kvm1a task]# for f in *; do ps -Flww -p $f; done;
F S UID          PID    PPID  C PRI  NI ADDR SZ WCHAN    RSS PSR STIME TTY          TIME CMD
F S UID          PID    PPID  C PRI  NI ADDR SZ WCHAN    RSS PSR STIME TTY          TIME CMD
F S UID          PID    PPID  C PRI  NI ADDR SZ WCHAN    RSS PSR STIME TTY          TIME CMD
F S UID          PID    PPID  C PRI  NI ADDR SZ WCHAN    RSS PSR STIME TTY          TIME CMD
F S UID          PID    PPID  C PRI  NI ADDR SZ WCHAN    RSS PSR STIME TTY          TIME CMD
F S UID          PID    PPID  C PRI  NI ADDR SZ WCHAN    RSS PSR STIME TTY          TIME CMD
F S UID          PID    PPID  C PRI  NI ADDR SZ WCHAN    RSS PSR STIME TTY          TIME CMD
F S UID          PID    PPID  C PRI  NI ADDR SZ WCHAN    RSS PSR STIME TTY          TIME CMD
F S UID          PID    PPID  C PRI  NI ADDR SZ WCHAN    RSS PSR STIME TTY          TIME CMD
F S UID          PID    PPID  C PRI  NI ADDR SZ WCHAN    RSS PSR STIME TTY          TIME CMD
7 Z admin       9941       1  4  80   0 -     0 -          0   2  2020 ?        14:13:38 [kvm] <defunct>
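
To evacuate the node before the restart, something along these lines works (a sketch: the target node kvm1b is hypothetical, and --online live migration assumes migratable storage, which we have via Ceph):

Code:
# Live-migrate every VM except the stuck one (144) to another cluster
# node, then reboot this host to clear the unkillable zombie.
for vmid in $(qm list | awk 'NR>1 && $1 != 144 {print $1}'); do
    qm migrate "$vmid" kvm1b --online
done
systemctl reboot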


Code:
[admin@kvm1a log]# cat /etc/pve/qemu-server/144.conf
agent: 1
bios: ovmf
boot: cdn
bootdisk: scsi0
cores: 1
cpu: SandyBridge,flags=+pcid
efidisk0: rbd_hdd:vm-144-disk-1,size=1M
ide2: none,media=cdrom
localtime: 1
memory: 4096
name: redacted
net0: virtio=DE:AD:BE:EF:DE:AD,bridge=vmbr0,tag=1
numa: 1
onboot: 1
ostype: win10
protection: 1
scsi0: rbd_hdd:base-116-disk-0/vm-144-disk-0,cache=writeback,discard=on,size=80G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=deadbeef-dead-beef-dead-beefdeadbeef
sockets: 2

The system is running OvS with LACP-bonded Ethernet ports.
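
Given the ovs-vsctl errors in the journal above, the bridge and bond state may be worth checking with the standard Open vSwitch tools:

Code:
# Show the OvS configuration (bridges, ports, VLAN tags) and the LACP
# bond status for the uplink.
ovs-vsctl show
ovs-appctl bond/show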


Code:
[admin@kvm1a ~]# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph: 15.2.8-pve2
ceph-fuse: 15.2.8-pve2
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.3-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.0-1
proxmox-backup-client: 1.0.5-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-1
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1
 
