TASK ERROR: timeout waiting on systemd

Aug 3, 2023
Hi everyone,

I recently faced a big problem in a Proxmox 8 cluster. It all started when a Windows VM wouldn't start and displayed the following error in the task log: TASK ERROR: timeout waiting on systemd

While researching on the Proxmox forums, I found reports that restarting the host solved the problem, so I followed that idea and started migrating the VMs to another host. Then a series of other problems began: some live migrations froze and the VM stopped responding; other migrations failed outright, and I could only migrate with the VM powered off; still others succeeded, but the "migrate" task never finished, reaching 100% without ever reporting success. Meanwhile, the error rbd: rbd2: no lock owners detected kept appearing in the syslog.

In the end, once I had managed to move all the VMs off, I restarted the host and a message appeared on the Proxmox host's console about a timeout while terminating RBD processes. We use Ceph storage (version 17.2) in an 8-node cluster, and this problem happened specifically with Windows VMs that use a TPM device: the TPM process appeared to be stuck on the two hosts that run the Windows VMs.

After restarting those two hosts, everything returned to normal. My concern is that I still don't know how the problem arose; I only know it is related to the RBD process backing the TPM device of the Windows VMs.
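In a similar situation it may be worth checking, before rebooting, whether a stale swtpm process or a leftover RBD watcher is still holding the TPM state image. A rough sketch, assuming the pool/image names from the VM config below (stor-vms/vm-10010-disk-1 is the tpmstate0 volume); the rbd commands are guarded so the snippet does nothing on hosts without Ceph tooling:

```shell
# Hypothetical VMID of an affected Windows guest; adjust as needed.
VMID=10010

# Is a stale swtpm instance still running for this VM?
pgrep -a swtpm | grep -w "$VMID" || echo "no swtpm process found for VM $VMID"

# If Ceph tooling is present: who still watches/locks the TPM state image?
if command -v rbd >/dev/null 2>&1; then
    rbd status "stor-vms/vm-${VMID}-disk-1"    # lists active watchers
    rbd lock ls "stor-vms/vm-${VMID}-disk-1"   # lists exclusive locks, if any
fi
```

If `rbd status` shows a watcher from a host where the VM is no longer running, that would fit the theory that a stuck swtpm/RBD process was holding the image.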

Configuration of the affected Windows VMs:

Code:
agent: 1
bios: ovmf
boot: order=scsi0
cores: 2
cpu: Haswell-noTSX
efidisk0: stor-vms:vm-10010-disk-2,efitype=4m,pre-enrolled-keys=1,size=528K
ide0: stor-vms:vm-10010-cloudinit,media=cdrom
ide2: none,media=cdrom
kvm: 1
machine: pc-i440fx-7.2
memory: 6144
meta: creation-qemu=7.2.0,ctime=1696525080
net0: virtio=BC:24:11:BE:2C:94,bridge=vmbr0,firewall=1,rate=12.5
numa: 0
ostype: win10
scsi0: stor-vms:vm-10010-disk-0,discard=on,iops_rd=2000,iops_wr=1000,mbps_rd=300,mbps_wr=200,size=50G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=4264191e-905d-4d06-a5be-4955a7ae0dc7
sockets: 1
tpmstate0: stor-vms:vm-10010-disk-1,size=4M,version=v2.0
vmgenid: 07cfe099-ca6f-4ba7-ab19-6f1ea61eba6e

Output of pveversion -v:

Code:
proxmox-ve: 8.1.0 (running kernel: 6.5.11-8-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.5: 6.5.11-8
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
ceph: 17.2.7-pve3
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.4-1
proxmox-backup-file-restore: 3.1.4-1
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.4
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-3
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.5-2
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1
 
I'm also interested in this. I'm currently seeing the 'TASK ERROR: timeout waiting on systemd' error and looking both for potential solutions and for an understanding of the cause.
 
After what happened, I updated PVE to version 8.2 and the problem has not recurred, but I never learned its root cause. I hope it stays that way.
 
I'm also seeing this when migrating to a new host running 8.2.7. I'm trying to patch the others. Ceph is still v17.2 and I need to upgrade it.
 
I have also encountered this issue. Searching the forum for "timeout waiting on systemd" turns up many posts like this one.
It seems there is some bug that doesn't clean up the VM properly after a crash, preventing the VM from being started again.

I'm using Proxmox VE 8.3.3 with kernel 6.8.12-7-pve and I'm not using Ceph nor ZFS.

Basically, for some reason the VM crashed, and in dmesg I see:
Code:
[17023.949984] kvm: SMP vm created on host with unstable TSC; guest TSC will not be reliable
[17571.802191] INFO: task pvedaemon worke:1813 blocked for more than 122 seconds.
[17571.802198]       Not tainted 6.8.12-7-pve #1
[17571.802200] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[17571.802203] task:pvedaemon worke state:D stack:0     pid:1813  tgid:1813  ppid:1812   flags:0x00000002
[17571.802207] Call Trace:
[17571.802212]  <TASK>
[17571.802308]  __schedule+0x42b/0x1500
[17571.802467]  ? restore_fpregs_from_fpstate+0x3d/0xd0
[17571.802472]  schedule+0x33/0x110
[17571.802475]  kvm_async_pf_task_wait_schedule+0x171/0x1b0
[17571.802479]  __kvm_handle_async_pf+0x5c/0xe0
[17571.802488]  exc_page_fault+0xb6/0x1b0
[17571.802491]  asm_exc_page_fault+0x27/0x30
[17571.802613] RIP: 0033:0x57453ce9a170
[17571.802725] RSP: 002b:00007ffdccf00030 EFLAGS: 00010202
[17571.802728] RAX: 00007fd0799000b8 RBX: 0000574545e86520 RCX: 00007fd0798e5010
[17571.802730] RDX: 0000000000000007 RSI: 0000574549bdd888 RDI: 00005745446842a0
[17571.802732] RBP: 00000000cc5e3615 R08: 0000000000000000 R09: 00005745495da058
[17571.802733] R10: 000057454cb74e58 R11: 0000574549bdd880 R12: 000057454468c960
[17571.802734] R13: 0000574545e86520 R14: 0000574549bdd888 R15: 00000000cc5e3615
[17571.802808]  </TASK>
[17571.803226] INFO: task CPU 1/KVM:125455 blocked for more than 122 seconds.
[17571.803230]       Not tainted 6.8.12-7-pve #1
[17571.803231] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[17571.803233] task:CPU 1/KVM       state:D stack:0     pid:125455 tgid:125387 ppid:1      flags:0x00000002
[17571.803237] Call Trace:
[17571.803238]  <TASK>
[17571.803240]  __schedule+0x42b/0x1500
[17571.803245]  schedule+0x33/0x110
[17571.803247]  kvm_async_pf_task_wait_schedule+0x171/0x1b0
[17571.803250]  __kvm_handle_async_pf+0x5c/0xe0
[17571.803253]  exc_page_fault+0xb6/0x1b0
[17571.803256]  asm_exc_page_fault+0x27/0x30
[17571.803259] RIP: 0033:0x7b44e4b270d9
[17571.803264] RSP: 002b:00007b44d3dfacf0 EFLAGS: 00010206
[17571.803265] RAX: 000000000001ae91 RBX: 0000000000000190 RCX: 00007b4444005ff0
[17571.803267] RDX: 0000000000000195 RSI: 00007b4444006170 RDI: 0000000000000004
[17571.803268] RBP: 00007b4444000030 R08: 0000000000000000 R09: 0000000000000000
[17571.803269] R10: 0000000000000000 R11: 0000000000000190 R12: 0000000000000180
[17571.803270] R13: 0000000000000019 R14: ffffffffffffda70 R15: 00007b4444000090
[17571.803272]  </TASK>

systemctl status qemu.slice
Code:
● qemu.slice - Slice /qemu
     Loaded: loaded
     Active: active since Tue 2025-04-22 17:26:07 UTC; 48min ago
      Tasks: 3
     Memory: 1.4G
        CPU: 2min 56.626s
     CGroup: /qemu.slice
             └─105.scope
               └─125387 "[kvm]"

systemctl status 105.scope
Code:
○ 105.scope
     Loaded: loaded (/run/systemd/transient/105.scope; transient)
  Transient: yes
     Active: inactive (dead) since Tue 2025-04-22 17:41:51 UTC; 34min ago
   Duration: 15min 43.734s
      Tasks: 3 (limit: 9220)
     Memory: 1.4G
        CPU: 2min 56.626s
     CGroup: /qemu.slice/105.scope
             └─125387 "[kvm]"

ps -up 125387
Code:
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      125387  5.7  0.0      0     0 ?        Zl   17:26   2:54 [kvm] <defunct>

This KVM process can't be killed (kill -9 does nothing).
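That matches the `Zl ... <defunct>` state in the ps output: SIGKILL is only acted upon when a task can be scheduled again, so it has no effect on a zombie (Z, already exited and waiting to be reaped) or on a thread stuck in uninterruptible sleep (D, blocked inside a kernel call). A minimal way to inspect the state directly; $$ (this shell) is used as a stand-in PID only so the snippet runs anywhere, substitute the stuck kvm PID (125387 above):

```shell
# Stand-in PID so the snippet runs anywhere; replace with the stuck kvm PID.
PID=$$

# /proc/<pid>/status reports the scheduler state (R, S, D, Z, ...).
awk -v p="$PID" '/^State:/ {print "PID " p " state: " $2}' "/proc/$PID/status"

# Per-thread entries: a single thread stuck in D inside the kernel is enough
# to keep the whole process (and its systemd scope) from being released.
ls "/proc/$PID/task"
```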

Code:
$ kill -9 125387
$ systemctl stop 105.scope
$ qm start 105
timeout waiting on systemd
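My reading of this (an assumption on my part, not something confirmed by the Proxmox developers): qm start asks systemd to create a fresh transient <vmid>.scope, and that request times out because the old scope cannot be fully released while the unreapable kvm task is still attached to its cgroup. A sketch to check this, assuming VMID 105 and cgroup v2 as on a default PVE 8 install:

```shell
VMID=105   # hypothetical VMID from the output above

# Does systemd still consider the old transient scope loaded?
systemctl show "${VMID}.scope" -p LoadState -p ActiveState -p SubState 2>/dev/null \
    || echo "systemd not reachable"

# cgroup v2: any PID still attached here keeps the scope from being released.
CG="/sys/fs/cgroup/qemu.slice/${VMID}.scope/cgroup.procs"
[ -r "$CG" ] && cat "$CG" || echo "no cgroup left for ${VMID}.scope"
```

If the scope shows as inactive but the cgroup still lists a PID (as in the systemctl status 105.scope output above), only the kernel releasing the stuck thread clears it, which in practice means a host reboot.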


I guess the only workaround at the moment is to reboot the whole PVE host.
 