VM doesn't start Proxmox 6 - timeout waiting on systemd

gbr · May 25, 2020

This also stop backups from running when the backup reached the failed VM. Any chance there's some forward motion on this? Anything i can do on my end to help?

gbr · May 29, 2020

So, the same VM is down this morning. No backups, no stress, just stopped responding at 8:40 AM. It no longer listens to any commands from the web interface, and shows as running, but it clearly isn't.

Code:

root@pve-4:~# qm status 150
status: running
root@pve-4:~# qm terminal 150
unable to find a serial interface
root@pve-4:~# qm reset 150
VM 150 qmp command 'system_reset' failed - unable to connect to VM 150 qmp socket - timeout after 31 retries
root@pve-4:~# qm stop 150
VM quit/powerdown failed - terminating now with SIGTERM
VM still running - terminating now with SIGKILL
root@pve-4:~# ps ax | grep 150
10097 pts/0    S+     0:00 grep 150
root@pve-4:~# qm start 150
timeout waiting on systemd
root@pve-4:~#

root@pve-4:~# qm migrate 150 pve-3
2020-05-29 09:41:12 starting migration of VM 150 to node 'pve-3' (192.168.100.112)
^Ccommand '/usr/bin/qemu-img info '--output=json' /mnt/pve/Slow-NAS/images/150/vm-150-disk-0.qcow2' failed: interrupted by signal
could not parse qemu-img info command output for '/mnt/pve/Slow-NAS/images/150/vm-150-disk-0.qcow2'
2020-05-29 09:46:55 migration finished successfully (duration 00:05:43)
root@pve-4:~#
/CODE]

Migrating the VM to a server that had a previous failure results in the same error. Migrating to a server that has never run this VM (to failure) and I can start it.

kohly · Jun 4, 2020

Hi all,

I have (nearly?) the same issue here:

Some VM got stuck and I have no chance to restart or kill and start it again on the same pve-node (lets call ist '1st').
After killing the VM, I can (live) migrate it to an other pve-host (2nd) and start it there.
Only a reboot of the 1st node recovers the issue.
After the reboot of the 1st node I also can (live) migrate the VM back to the 1st node.

Bash:

root@hv01:~# pvecm status
Cluster information
-------------------
Name:             pvecluchaosinc
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Jun  4 15:24:01 2020
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.289
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.99.0.99
0x00000002          1 10.99.0.1 (local)
0x00000003          1 10.99.0.9

Bash:

root@hv01:~# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.41-1-pve)
pve-manager: 6.2-4 (running version: 6.2-4/9824574a)
pve-kernel-5.4: 6.2-2
pve-kernel-helper: 6.2-2
pve-kernel-5.4.41-1-pve: 5.4.41-1
ceph-fuse: 12.2.13-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-2
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-1
pve-cluster: 6.1-8
pve-container: 3.1-6
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-2
pve-qemu-kvm: 5.0.0-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1

All qcow2 disks are located on a NFS-storage

Catwoolfii · Jun 5, 2020

Good afternoon.
I encounter similar problems usually on old hardware (for example: intel xeon x3400 LGA1156). Most recently with kernel 5.4
I have never encountered such problems on more modern equipment.

gbr · Jun 5, 2020

Hey Proxmox folk... What can I do to help you solve this issue?

Gerald

t.lamprecht · Jun 5, 2020

gbr said:
Hey Proxmox folk... What can I do to help you solve this issue?

We cannot reproduce and understand how this could still happen with current Proxmox VE 6.2.
We check really closely on the VM systemd scopes and the timeouts are set so that running into them would normally indicate that something is really really slow (close to hanging)

While checking out this problem closely during the 5.x release we came to the conclusion that there can be some race/timing issue due to how systemd behaves if we only trust the "systemctl" command. This lead to developing a solution which talks over DBus directly to systemd to poll the current status in a safe way:
https://git.proxmox.com/?p=pve-comm...d7e877a9fd1910daf4e7cd937aa4bca8;hb=HEAD#l142
With this we could not reproduce this at all anymore, and we're talking hundreds of machines, production and testing, doing various amounts of tasks which go through this code path.

For now, I can only recommend ensuring your setups are updated on latest 6.x release, nothing weird is in the logs, and that nothing really hangs which would make this message just a side effect - especially NFS can be prone to get into the infamous "D" (uninterruptible IO) state if network or the share are down.
Hints on things done out of the ordinary on your setup(s) could help to reproduce this, and we would be happy to hear them.

dominiaz · Jun 6, 2020

I have the same issue with 1 of VM.

My setup is:

HP DL380G10:

2 x Intel(R) Xeon(R) Gold 6130
768 GB of RAM

Proxmox 6.2 with latest updates from apt
Single Node
Disks are LVM-Thin

Do you want access to my host and check the error?

Code:

TASK ERROR: timeout waiting on systemd

Catwoolfii · Jun 6, 2020

For those who have these problems - do you use hardware raid?

Asr · Jun 6, 2020

dominiaz said:
I have the same issue with 1 of VM.

My setup is:

HP DL380G10:

2 x Intel(R) Xeon(R) Gold 6130
768 GB of RAM

Proxmox 6.2 with latest updates from apt
Single Node
Disks are LVM-Thin

Do you want access to my host and check the error?

Code:

TASK ERROR: timeout waiting on systemd

What is your VM Os and what happen on it before that issue ?

Asr · Jun 6, 2020

Catwoolfii said:
For those who have these problems - do you use hardware raid?

At first i only use Lvm-Thin and now Raidz1.

Asr · Jun 6, 2020

t.lamprecht said:
We cannot reproduce and understand how this could still happen with current Proxmox VE 6.2.
We check really closely on the VM systemd scopes and the timeouts are set so that running into them would normally indicate that something is really really slow (close to hanging)

While checking out this problem closely during the 5.x release we came to the conclusion that there can be some race/timing issue due to how systemd behaves if we only trust the "systemctl" command. This lead to developing a solution which talks over DBus directly to systemd to poll the current status in a safe way:
https://git.proxmox.com/?p=pve-comm...d7e877a9fd1910daf4e7cd937aa4bca8;hb=HEAD#l142
With this we could not reproduce this at all anymore, and we're talking hundreds of machines, production and testing, doing various amounts of tasks which go through this code path.

For now, I can only recommend ensuring your setups are updated on latest 6.x release, nothing weird is in the logs, and that nothing really hangs which would make this message just a side effect - especially NFS can be prone to get into the infamous "D" (uninterruptible IO) state if network or the share are down.
Hints on things done out of the ordinary on your setup(s) could help to reproduce this, and we would be happy to hear them.

It happen when a guest os fail/hang and no respond to host caused by bugg like zerocopy issue, vswitch drivers kernel panic, etc....
It seams that the host can't stop correctly all the freezing VM process and still keep previous running guest's status until to reboot the node.
So is only the systemd scopes acting on this ?

To reproduce, can you totaly freeze a VM ?

t.lamprecht · Jun 6, 2020

Asr said:
It happen when a guest os fail/hang and no respond to host caused by bugg like zerocopy issue, vswitch drivers kernel panic, etc....
It seams that the host can't stop correctly all the freezing VM process and still keep previous running guest's status until to reboot the node.
So is only the systemd scopes acting on this ?

Yeah, I mean for that this error is totally expected and is only a side effect caused by the real error: your VM freezing.

Note that if the VM process freezes in such away that it cannot be stopped this timeout error will always be shown, because well, the VM scope can never exit as a process in it isn't responding.

A freezing guest normally means a dead NFS or the like, so check there first.

To all, if you're on up to date PVE 5.4 or 6.2 and see this error you're 99.999% not affected by the issues of the thread starters.
They had the issue that the VM could be stopped fine, but the scope was still around even if no process was.

If you issue is that the VM is also still around just please open a new thread as then this error is expected and your real problem is the VM refusing to stop.

Jay Sullivan · Jun 18, 2020

FWIW, we just started to see this problem in our two clusters in the past few days. We upgraded from 5.4 to 6.2 a couple of weeks ago: pve-manager/6.2-4/9824574a (running kernel: 5.4.41-1-pve). No openvswitch; one cluster with PVE (hyperconverged) Ceph, one cluster with it's own dedicated non-PVE-managed Ceph cluster.

I have a few VMs that I can't start right now and I see defunct kvm processes on the host. I see the same systemd timeout that others have seen. These VMs aren't critical so I haven't tried rebooting the host yet, but the defunct kvm process is unkillable so I think I might be looking at a reboot. It's happening on multiple hosts across clusters.

kohly · Jun 20, 2020

it happened again to one guest here.
what i saw in the monitoring (nagios based checkmk) was, that the guest-os was at a high cpu utilization (97% io-wait).
all other guests on this and the other hosts are running fine (atm.)
trying to stop the guest from the web-gui does not work correctly, the guest stops only if i cancel the stop command. (!?)
i know, sounds a little bit strange, but:
if i migrate the so 'stop canceled' guest, the migration process finishes immediately after canceling the migration. (!?)
that is really strange, isn't it?
the stopped guest can be migrated offline back and forth in a normal way from and to other hosts, but once the guest is migrated back to 'hv01' on which the issue has begun, migration is only possible by canceling the migration process.
the vm can be started only on other host then 'hv01'
once the guest was started on an other host, it can not be migrated online back to hv01.
offline migration works instead.
a reboot of hv01 reverts to normal state.

the issue did occur in the past to different guests on different hosts in a random manner.

which logfile/information may i post to analyze this behavior?

sahostking · Jun 23, 2020

Same problem here. Just ran an update on all nodes last night and first servers to go down seem to be the Windows KVM ones. first it lost NFS connections not sure why. Then it the vms which use LVM started freezing. I stopped the vms but then could not start them as I got:

TASK ERROR: timeout waiting on systemd

I checked qemu.slice and it was not running so weird I could not start it. Even tried the fix posted earlier in thread but that didnt help.

Had to reboot two nodes now to get them stable and stop our clients killing us.

Jay Sullivan · Jun 23, 2020

Interesting, it was actually Windows guests that I first noticed it on, too.

gbr · Jun 30, 2020

I get roughly one VM a day, across multiple Linux OSs. No Windows yet. I'm going to start monitoring IOWait to see if that is an issue.

I've been running NFS for storage for years, and have not seen this issue. Plus, none of my other VMs are down, so I can't see it being an NFS issue.

proxwolfe · Jun 30, 2020

Hi,

I, too, am seeing this error.

My Thinkstation P700 is on PVE 6.2-6.

Last night I shutdown the host and the host shutdown my Windows 10 VM. Earlier tonight I started the host again and it booted up without any issues. My Windows VM was set to auto start but it did not come up. I accessed the PVE web gui and found the OP error. In addition, the overall summary page wasn't functioning and looking at the hardware choices (like usb passthrough) for the affected VM only brought up empty lists.

Rebooted the host but the issue is persisting so far.

gbr · Jul 3, 2020

Up to date, and still having issues. PVE has gone from a remarkably stable product to ultra flakey. Thanks systemd.

I'm either going to have to roll back to PVE 5.x or leave Proxmox entirely. This instability is seriously hurting my production servers.

tom · Jul 3, 2020

gbr said:
Up to date, and still having issues. PVE has gone from a remarkably stable product to ultra flakey. Thanks systemd.

My experience is exactly the opposite. As this thread is huge, please open a new thread describing your issue.

VM doesn't start Proxmox 6 - timeout waiting on systemd

Well-Known Member

Well-Known Member

Renowned Member

Renowned Member

Well-Known Member

Proxmox Staff Member

Renowned Member

Renowned Member

New Member

New Member

New Member

Proxmox Staff Member

Member

Renowned Member

Renowned Member

Member

Well-Known Member

Well-Known Member

Well-Known Member

Proxmox Staff Member

We value your privacy