VM doesn't start on Proxmox 6 - timeout waiting on systemd

gbr

Active Member
May 13, 2012
This also stops backups from running once the backup job reaches the failed VM. Any chance there's some forward motion on this? Anything I can do on my end to help?
 

gbr

Active Member
May 13, 2012
So, the same VM is down this morning. No backups, no stress; it just stopped responding at 8:40 AM. It no longer responds to any commands from the web interface, and shows as running, but it clearly isn't.

Code:
root@pve-4:~# qm status 150
status: running
root@pve-4:~# qm terminal 150
unable to find a serial interface
root@pve-4:~# qm reset 150
VM 150 qmp command 'system_reset' failed - unable to connect to VM 150 qmp socket - timeout after 31 retries
root@pve-4:~# qm stop 150
VM quit/powerdown failed - terminating now with SIGTERM
VM still running - terminating now with SIGKILL
root@pve-4:~# ps ax | grep 150
10097 pts/0    S+     0:00 grep 150
root@pve-4:~# qm start 150
timeout waiting on systemd
root@pve-4:~#

root@pve-4:~# qm migrate 150 pve-3
2020-05-29 09:41:12 starting migration of VM 150 to node 'pve-3' (192.168.100.112)
^Ccommand '/usr/bin/qemu-img info '--output=json' /mnt/pve/Slow-NAS/images/150/vm-150-disk-0.qcow2' failed: interrupted by signal
could not parse qemu-img info command output for '/mnt/pve/Slow-NAS/images/150/vm-150-disk-0.qcow2'
2020-05-29 09:46:55 migration finished successfully (duration 00:05:43)
root@pve-4:~#
Migrating the VM to a server where it previously failed results in the same error. If I migrate it to a server that has never run this VM to failure, I can start it.
 

kohly

Active Member
Dec 24, 2011
Hi all,

I have (nearly?) the same issue here:

A VM got stuck, and I had no way to restart it, or to kill it and start it again, on the same PVE node (let's call it the '1st').
After killing the VM, I can (live) migrate it to another PVE host (the '2nd') and start it there.
Only a reboot of the 1st node resolves the issue.
After rebooting the 1st node, I can also (live) migrate the VM back to it.

Bash:
root@hv01:~# pvecm status
Cluster information
-------------------
Name:             pvecluchaosinc
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Jun  4 15:24:01 2020
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.289
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.99.0.99
0x00000002          1 10.99.0.1 (local)
0x00000003          1 10.99.0.9
Bash:
root@hv01:~# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.41-1-pve)
pve-manager: 6.2-4 (running version: 6.2-4/9824574a)
pve-kernel-5.4: 6.2-2
pve-kernel-helper: 6.2-2
pve-kernel-5.4.41-1-pve: 5.4.41-1
ceph-fuse: 12.2.13-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-2
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-1
pve-cluster: 6.1-8
pve-container: 3.1-6
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-2
pve-qemu-kvm: 5.0.0-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1
All qcow2 disks are located on NFS storage.
 

Catwoolfii

Member
Nov 6, 2016
Russia
Good afternoon.
I usually encounter similar problems on old hardware (for example, an Intel Xeon X3400 on LGA1156), most recently with kernel 5.4.
I have never encountered such problems on more modern equipment.
 

gbr

Active Member
May 13, 2012
Hey Proxmox folk... What can I do to help you solve this issue?

Gerald
 

t.lamprecht

Proxmox Staff Member
Staff member
Jul 28, 2015
South Tyrol/Italy
Hey Proxmox folk... What can I do to help you solve this issue?
We cannot reproduce this, and do not understand how it could still happen with current Proxmox VE 6.2.
We check the VM systemd scopes very closely, and the timeouts are set such that running into them would normally indicate that something is really, really slow (close to hanging).

While investigating this problem closely during the 5.x release, we came to the conclusion that there can be a race/timing issue in how systemd behaves if we trust only the "systemctl" command. This led to developing a solution which talks directly to systemd over DBus to poll the current status in a safe way:
https://git.proxmox.com/?p=pve-comm...d7e877a9fd1910daf4e7cd937aa4bca8;hb=HEAD#l142
With this we could not reproduce the issue at all anymore, and we're talking hundreds of machines, production and testing, running all kinds of tasks that go through this code path.

For now, I can only recommend ensuring your setups are updated to the latest 6.x release, that nothing weird is in the logs, and that nothing is actually hanging which would make this message just a side effect - NFS in particular is prone to getting into the infamous "D" (uninterruptible IO) state if the network or the share goes down.
Hints about anything out of the ordinary in your setup(s) could help us reproduce this, and we would be happy to hear them.
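For anyone hitting this, the scope and D-state checks described above can be done by hand before rebooting. A minimal diagnostic sketch, assuming VMID 150 from this thread and the `<vmid>.scope` unit naming that qemu-server uses for VM scopes:

```shell
VMID=150   # example VMID from this thread; adjust to your VM

# Ask systemd about the scope it tracks for the VM. A scope that is
# still "active" while no kvm process exists is exactly the stale
# state described in this thread.
systemctl status "${VMID}.scope" --no-pager || true  # non-zero exit if the scope is gone

# List processes stuck in uninterruptible IO ("D" state) -- the classic
# symptom of a dead NFS mount making qm commands hang as a side effect.
ps -eo pid,stat,cmd | awk 'NR == 1 || $2 ~ /^D/'
```

The `awk` filter keeps the header line plus any process whose state column starts with `D`; if that list is non-empty, check your NFS mounts and network before suspecting the systemd timeout itself.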
 

dominiaz

Member
Sep 16, 2016
I have the same issue with one of my VMs.

My setup is:

HP DL380G10:

2 x Intel(R) Xeon(R) Gold 6130
768 GB of RAM

Proxmox 6.2 with latest updates from apt
Single Node
Disks are LVM-Thin

Do you want access to my host to check the error?

Code:
TASK ERROR: timeout waiting on systemd
 

Asr

New Member
Dec 6, 2019
I have the same issue with one of my VMs.

My setup is:

HP DL380G10:

2 x Intel(R) Xeon(R) Gold 6130
768 GB of RAM

Proxmox 6.2 with latest updates from apt
Single Node
Disks are LVM-Thin

Do you want access to my host to check the error?

Code:
TASK ERROR: timeout waiting on systemd
What is your VM's OS, and what happened on it before the issue?
 

Asr

New Member
Dec 6, 2019
We cannot reproduce this, and do not understand how it could still happen with current Proxmox VE 6.2.
We check the VM systemd scopes very closely, and the timeouts are set such that running into them would normally indicate that something is really, really slow (close to hanging).

While investigating this problem closely during the 5.x release, we came to the conclusion that there can be a race/timing issue in how systemd behaves if we trust only the "systemctl" command. This led to developing a solution which talks directly to systemd over DBus to poll the current status in a safe way:
https://git.proxmox.com/?p=pve-comm...d7e877a9fd1910daf4e7cd937aa4bca8;hb=HEAD#l142
With this we could not reproduce the issue at all anymore, and we're talking hundreds of machines, production and testing, running all kinds of tasks that go through this code path.

For now, I can only recommend ensuring your setups are updated to the latest 6.x release, that nothing weird is in the logs, and that nothing is actually hanging which would make this message just a side effect - NFS in particular is prone to getting into the infamous "D" (uninterruptible IO) state if the network or the share goes down.
Hints about anything out of the ordinary in your setup(s) could help us reproduce this, and we would be happy to hear them.
It happens when a guest OS fails/hangs and stops responding to the host, caused by bugs like the zerocopy issue, vswitch driver kernel panics, etc.
It seems that the host can't correctly stop all of the frozen VM's processes, and keeps the guest's previous 'running' status until the node is rebooted.
So is it only the systemd scopes acting on this?

To reproduce, can you totally freeze a VM?
 

t.lamprecht

Proxmox Staff Member
Staff member
Jul 28, 2015
South Tyrol/Italy
It happens when a guest OS fails/hangs and stops responding to the host, caused by bugs like the zerocopy issue, vswitch driver kernel panics, etc.
It seems that the host can't correctly stop all of the frozen VM's processes, and keeps the guest's previous 'running' status until the node is rebooted.
So is it only the systemd scopes acting on this?
Yeah, I mean for that this error is totally expected and is only a side effect caused by the real error: your VM freezing.

Note that if the VM process freezes in such a way that it cannot be stopped, this timeout error will always be shown, because the VM scope can never exit while a process in it isn't responding.

A freezing guest normally means a dead NFS or the like, so check there first.

To all: if you're on an up-to-date PVE 5.4 or 6.2 and see this error, you're 99.999% not affected by the thread starters' issue.
They had the issue that the VM could be stopped fine, but the scope was still around even though no process was.

If your issue is that the VM process is also still around, please open a new thread, as then this error is expected and your real problem is the VM refusing to stop.
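For completeness, a sketch of how one might check for (and, with care, clear) such a leftover scope by hand. This is an assumption pieced together from the discussion, not an official Proxmox procedure; the `<vmid>.scope` unit name, the cgroup path, and the manual `systemctl stop` are my own illustration:

```shell
VMID=150   # example VMID from this thread
UNIT="${VMID}.scope"

# Does systemd still consider the scope active, even though the VM
# process is gone?
systemctl is-active "$UNIT" || true

# Are there actually any processes left inside it? (cgroup path is an
# assumption for PVE 6.x with cgroup v1)
systemd-cgls "/qemu.slice/$UNIT" --no-pager 2>/dev/null || true

# If the scope is active but empty, stopping it by hand (and clearing
# any failed state) may let `qm start` create a fresh scope again.
# Commented out on purpose -- run only after checking the above.
#systemctl stop "$UNIT"
#systemctl reset-failed "$UNIT"
echo "checked $UNIT"
```

If the scope shows as active with no processes under it, that matches the stale-scope case described above; if a kvm process is still listed, the VM itself is hung and a new thread (plus NFS/storage checks) is the better route.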
 

Jay Sullivan

Member
Mar 27, 2017
FWIW, we just started to see this problem in our two clusters in the past few days. We upgraded from 5.4 to 6.2 a couple of weeks ago: pve-manager/6.2-4/9824574a (running kernel: 5.4.41-1-pve). No openvswitch; one cluster with PVE (hyperconverged) Ceph, one cluster with its own dedicated non-PVE-managed Ceph cluster.

I have a few VMs that I can't start right now and I see defunct kvm processes on the host. I see the same systemd timeout that others have seen. These VMs aren't critical so I haven't tried rebooting the host yet, but the defunct kvm process is unkillable so I think I might be looking at a reboot. It's happening on multiple hosts across clusters.
 

kohly

Active Member
Dec 24, 2011
It happened again to one guest here.
What I saw in the monitoring (Nagios-based checkmk) was that the guest OS was at high CPU utilization (97% IO wait).
All other guests on this and the other hosts are running fine (atm).
Trying to stop the guest from the web GUI does not work correctly; the guest stops only if I cancel the stop command. (!?)
I know it sounds a little strange, but: if I then migrate this 'stop-canceled' guest, the migration process finishes immediately after canceling the migration. (!?)
That is really strange, isn't it?
The stopped guest can be migrated offline back and forth normally between other hosts, but once the guest is migrated back to 'hv01', where the issue began, migration is only possible by canceling the migration process.
The VM can only be started on hosts other than 'hv01'.
Once the guest was started on another host, it cannot be migrated back online to hv01; offline migration works, though.
A reboot of hv01 restores the normal state.

The issue has occurred in the past with different guests on different hosts, in a random manner.

Which logfiles/information should I post to help analyze this behavior?
 

sahostking

Active Member
Same problem here. I just ran an update on all nodes last night, and the first servers to go down seem to be the Windows KVM ones. First they lost their NFS connections, not sure why. Then the VMs which use LVM started freezing. I stopped the VMs, but then could not start them, as I got:

TASK ERROR: timeout waiting on systemd

I checked qemu.slice and it was not running, so it's weird that I could not start them. I even tried the fix posted earlier in the thread, but that didn't help.

I've had to reboot two nodes now to get them stable and stop our clients from killing us.
 

gbr

Active Member
May 13, 2012
I get roughly one VM a day, across multiple Linux OSs. No Windows yet. I'm going to start monitoring IOWait to see if that is an issue.

I've been running NFS for storage for years, and have not seen this issue. Plus, none of my other VMs are down, so I can't see it being an NFS issue.
 

proxwolfe

New Member
Jun 20, 2020
12
2
3
45
Hi,

I, too, am seeing this error.

My Thinkstation P700 is on PVE 6.2-6.

Last night I shut down the host, and the host shut down my Windows 10 VM. Earlier tonight I started the host again, and it booted without any issues. My Windows VM was set to auto-start, but it did not come up. I accessed the PVE web GUI and found the error from the OP. In addition, the overall summary page wasn't functioning, and looking at the hardware options (like USB passthrough) for the affected VM only brought up empty lists.

I rebooted the host, but the issue persists so far.
 

gbr

Active Member
May 13, 2012
Up to date, and still having issues. PVE has gone from a remarkably stable product to ultra flakey. Thanks systemd.

I'm either going to have to roll back to PVE 5.x or leave Proxmox entirely. This instability is seriously hurting my production servers.
 

tom

Proxmox Staff Member
Staff member
Aug 29, 2006
Up to date, and still having issues. PVE has gone from a remarkably stable product to ultra flakey. Thanks systemd.
My experience is exactly the opposite. As this thread is huge, please open a new thread describing your issue.
 
