VM doesn't start on Proxmox 6 - timeout waiting on systemd

Doh! I just upgraded my GPU-est host (4 x Quadro RTX 8000) to 6.3. No problems yet, but it doesn't usually become a problem until she's got a bit of uptime under her belt. :(
And just a few days later it happened again. Out of 16 VMs, only four don't have the "waiting for systemd" problem. A full host reboot is the only (temporary) fix.
 
Are the VMs in a 'stuck' state, meaning they don't continue to run and can't be stopped?
If so, what is running in the VMs and could you provide a syslog for one of them?
 
Are the VMs in a 'stuck' state, meaning they don't run and can't be stopped?
If so, what is running in the VMs and could you provide a syslog for one of them?
This host is running only Windows VMs (I know, yuck), all with Quadro vDWS GPUs. But I have a similar recurring issue with Linux VMs and various Tesla GPUs (in any config, not just vDWS). The Windows VMs rebooted last night, but instead of the OS restarting, the VMs turned off and can't be turned on again. The last time I had time to dig without any pressure to get everything back up ASAP, it looked like
Code:
qm cleanup
was stalled out and couldn't be killed.

The Windows guest logs are super boring. Just rebooting for updates on schedule.
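A quick way to see where such an unkillable process is stuck (a generic sketch; <PID> is a placeholder for the stalled process):
Bash:
ps -o pid,stat,wchan:32,cmd -p <PID>   # STAT 'D' means uninterruptible sleep; WCHAN shows where in the kernel it waits
cat /proc/<PID>/stack                  # kernel stack of the stuck task (needs root)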
 
If you could provide the syslog from one of those Linux VMs once it happens as well as from the host, that would be great.
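For the host side, something like this captures the relevant window from the journal (the timestamps are placeholders, adjust them to when it happened):
Bash:
journalctl --since "YYYY-MM-DD HH:MM" --until "YYYY-MM-DD HH:MM" > host-syslog.txt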

It looks like there are two different issues in this thread which both result in a 'timeout waiting on systemd', but for different reasons.
The one you're describing looks like something actually gets stuck, while the other one is just a 'scope lives too long, stopping it runs into the timeout' situation - there nothing gets stuck, it is more of a timing issue.
 
So not a physical server? What kernel runs on the host? This could well be an issue in netcup's environment - not saying it has to be, but it can be. Getting info about the distro and kernel they use would help.
Nope, no physical machine. Netcup will not tell me anything about the configuration of the host system. I think it must be a Debian host using QEMU for virtualization - but config-wise there is no information I can get from them...
A little update: even though I ran just three Debian machines at the same time, I had the same problems as before. Suddenly a rebooting host, no errors in the log whatsoever... I don't know what I should do.
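For an unexplained host reboot, the previous boot's journal and the wtmp records are the first places to look (this assumes persistent journald storage, otherwise -b -1 shows nothing):
Bash:
journalctl -b -1 -e        # tail of the previous boot's journal
last -x reboot shutdown    # reboot/shutdown history from wtmp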
 
I had that issue too.

The process became a zombie process and I had to reboot the VM host.

I don't know if it's the same problem, but the same VM was also freezing with 100% CPU on one core.
 

Attachments

  • proxmox_hang.txt (26.2 KB)
Hi, thanks for reporting with some actual details!

What is the VM's configuration? qm config VMID

The process became a zombie process
Meaning it died but wasn't collected anymore?
When did that happen, after a backup?
 
When it happens next time I'll try to dump the information.
This is just how it looked when it was in that state.

Like I said, I'm an embedded Linux developer and I can produce some coredumps of interesting processes for you.
Just tell me the names of the processes you need.

No, there was no backup running.

The state of the process was zombie after I killed it with kill -9 PID.

When you look at the state of the PID in the proc fs it normally looks like this:

Code:
cat /proc/6992/status
Name: kvm
Umask: 0027
State: S (sleeping)
...

When the systemd timeout happened it looks like this:

Code:
cat /proc/6992/status
Name: kvm
Umask: 0027
State: Z (zombie)
...

Here is a description:
https://www.howtogeek.com/119815/htg-explains-what-is-a-zombie-process-on-linux/
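A generic one-liner to spot any zombies on the host:
Bash:
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'   # defunct processes plus the parent (PPID) that should reap them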

I tried this:

Code:
kill -s SIGCHLD 1

because it looks to me like systemd is the parent process of the kvm process, but it didn't help.

I think Proxmox wasn't able to restart the kvm process because the PID was still there.

systemctl status qemu.slice looked something like this:

Code:
systemctl status qemu.slice
● qemu.slice
Loaded: loaded
Active: active since Wed 2021-04-14 11:18:11 CEST; 1 day 22h ago
Tasks: 95
Memory: 23.3G
CGroup: /qemu.slice
├─101.scope
│ └─6992 kvm

instead of the normal output when the VM is running:

Code:
systemctl status qemu.slice
● qemu.slice
Loaded: loaded
Active: active since Wed 2021-04-14 11:18:11 CEST; 1 day 22h ago
Tasks: 95
Memory: 23.3G
CGroup: /qemu.slice
├─101.scope
│ └─6992 /usr/bin/kvm -id 101 -name PetersBuildVM -no-shutdown -chardev socket,id=qmp,path=/var/run/qemu-server/101.qmp,server,nowait -mon chardev=qmp,mode=control -chardev socket,id=qmp-event,path=/var/run/qmeventd.s



Code:
qm config 10107
agent: 1
balloon: 2048
boot:
cores: 12
cpu: host
cpuunits: 128
description: Toolchain Builder
ide2: none,media=cdrom
memory: 8192
name: ToolchainFedora34-beta
net0: virtio=E6:86:86:AB:05:B7,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: local-zfs:vm-10107-disk-0,cache=writeback,discard=on,size=10G,ssd=1
scsi1: local-zfs-build:vm-10107-disk-0,cache=writeback,discard=on,size=32G,snapshot=1,ssd=1
scsi2: local-zfs-build:vm-10107-disk-1,cache=writeback,discard=on,size=8G,snapshot=1,ssd=1
scsihw: virtio-scsi-pci
shares: 100
smbios1: uuid=a2a5b585-c9ad-48b3-9cb9-af0abd97aa18
sockets: 1
vmgenid: d36b6f2d-d8ff-4a98-8780-f3fd9e62dfca
 
Like I said, I'm an embedded Linux developer and I can produce some coredumps of interesting processes for you.
Just tell me the names of the processes you need.
A coredump of the whole kvm process would be great. As the whole RAM is dumped, make sure nothing too sensitive is in there.
But IMO that's probably only really helpful for the problem causing the 100% CPU spike in the first place (which may well be guest-internal), not for the systemd-wait issue.
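If it helps, a dump of a still-running process can be taken with gcore from the gdb package (a sketch; the pidfile path is the usual qemu-server location, and 101 is just the example VMID from above):
Bash:
apt install gdb                                                  # provides gcore
gcore -o /tmp/vm101-core "$(cat /var/run/qemu-server/101.pid)"   # briefly suspends the process while dumping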

No, there was no backup running.

OK, just for my understanding, the events were as follows:

  1. QEMU process or VM goes nuts, 100% CPU usage - a possibly unrelated issue
  2. You then issued a SIGKILL to the process. Here there are two questions:
    1. Was a qm stop VMID (or the web-interface equivalent) tried before that?
    2. Was it the main process or a child (thread) of the main QEMU process?
  3. The process is now killed, but not collected. That would be the actual issue preventing the slice from being gone and thus resulting in the error here.

The state of the process was zombie after I killed it with kill -9 PID.
That was the missing piece - you never said you killed it, so it could have been a bug where it died but wasn't collected by the ppid.

When the systemd timeout happened it looks like this:

Code:
cat /proc/6992/status
Name: kvm
Umask: 0027
State: Z (zombie)
...
But that is then after you killed it, or did you try to kill a zombie process (which cannot work)?
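A throwaway demo of why that cannot work - the child below exits after 2 seconds, its parent (an exec'd sleep) never reaps it, and signals sent to the resulting zombie are no-ops:
Bash:
bash -c 'sleep 2 & exec sleep 60' &    # the parent execs into sleep, so it never wait()s for its child
sleep 3
zpid=$(ps -o pid= --ppid "$!")         # the defunct child's PID
ps -o pid,stat,cmd -p "$zpid"          # shows state Z (<defunct>)
kill -9 "$zpid"                        # no effect: a zombie is already dead
ps -o pid,stat,cmd -p "$zpid"          # still Z until the parent exits and init reaps it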

Thx, I'll go out on a limb here and hope that the time spent reading the waitpid and clone manpages and kernel code while studying comp. eng., or hacking around on tools and the kernel, allowed me to grasp the concept well enough to safely skip this :)

I tried this:

Code:
kill -s SIGCHLD 1

because it looks to me like systemd is the parent process of the kvm process, but it didn't help.
Was the coredump still in progress (dumping) at this point? During that time the process will show up with state Z but cannot be reaped by any parent until the core dump handler has finished.
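On kernels 4.15 and newer this can be checked directly, since /proc exposes a CoreDumping flag:
Bash:
grep -E '^(State|CoreDumping)' /proc/6992/status   # CoreDumping: 1 while the dump handler is still writing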
 
I just posted the link to make sure we are talking about the same thing :)

I guess the 100% CPU spike is another problem, but maybe somehow related to the systemd timeout, because in my experience a process goes to the zombie state if it was in a system call which doesn't return and you kill it. So maybe that is the reason for the 100% CPU spike.

But I also believe that the problem may have something to do with the guest OS.

I have no experience in debugging KVM-related kernel problems, just with messed-up kernel drivers :)


In the VM I'm compiling cross compilers for my embedded systems, so it's sometimes using all cores at 100%. All other VMs keep running; I had that issue only with the VMs which build the cross compilers.


I think I had the problem of the VM hanging with different guest Linux distributions, but I just rebooted the VM host and didn't take a closer look at what happened.

I think it happened 2 or 3 times in the last weeks.

I will take a closer look when it happens again to help you figure out what the problem is.



I created the core dump when the VM was hanging the last time, not when it became a zombie. When I created the core dump the VM came back after clicking reset in the web UI; the coredump was finished before I clicked reset.

I will try to create a coredump when the systemd timeout issue happens.


Here is what I did when the kvm process became a zombie (hopefully I remember it correctly :) ):

1. I had a noVNC window open and the VM wasn't responding anymore
2. I clicked reset in the web UI but nothing happened
3. I clicked stop in the web UI but nothing happened
4. The 'timeout waiting on systemd' error message appeared in the log in the web UI
5. I executed "systemctl status qemu.slice" and the VM process was still there, but not in a zombie state (I don't remember if it had the 100% CPU spike)
6. I executed kill -9 on the VM PID and it became a zombie (the main PID shown in "systemctl status qemu.slice", not a thread)
7. I clicked start in the web UI but nothing happened
8. The 'timeout waiting on systemd' error message appeared in the log in the web UI again
9. I executed "systemctl status qemu.slice" and the VM process was still there, but now without the command-line arguments

Something like this:

Code:
systemctl status qemu.slice
● qemu.slice
Loaded: loaded
Active: active since Wed 2021-04-14 11:18:11 CEST; 1 day 22h ago
Tasks: 95
Memory: 23.3G
CGroup: /qemu.slice
├─101.scope
│ └─6992 kvm


So I guess the kvm process wasn't responding to signals anymore before I killed it.

I will now start some compiling jobs in different guest Linux distributions; maybe I can find something that is reproducible.

How can I send you the core dump?
I uploaded it to OneDrive.
 
Sure, no worries - my response to the link was probably a bit too cheeky anyway, especially as I feel like I barely remember what I did two weeks ago :)

In the VM I'm compiling cross compilers for my embedded systems, so it's sometimes using all cores at 100%. All other VMs keep running; I had that issue only with the VMs which build the cross compilers.
I mean, compilers are highly optimized and complex tools which exercise so many code paths that triggering some QEMU or kernel bug with them does not seem completely unlikely.
I will take a closer look when it happens again to help you figure out what the problem is.
Thanks! If you can boil it down to something like "boot Fedora XY, install a gcc cross-compile toolchain for arm64 and compile the Linux kernel in a loop", that would be great to know, as then we may try to reproduce it here and finally observe this behaviour.
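Something along these lines would be the idea (a rough sketch for a Fedora guest; the package list and kernel tree are assumptions, adjust as needed):
Bash:
# inside the guest: aarch64 cross toolchain + kernel build in a loop
sudo dnf install -y gcc-aarch64-linux-gnu make flex bison bc openssl-devel elfutils-libelf-devel git
git clone --depth 1 https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
cd linux
while true; do
    make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- mrproper defconfig
    make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- -j"$(nproc)"
done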

1. I had a noVNC window open and the VM wasn't responding anymore
2. I clicked reset in the web UI but nothing happened
3. I clicked stop in the web UI but nothing happened
4. The 'timeout waiting on systemd' error message appeared in the log in the web UI

And there was nothing else before that in the journal/syslog, such as a segfault, an assertion failure or similar?


9. I executed "systemctl status qemu.slice" and the VM process was still there, but now without the command-line arguments
That means it is then a real zombie process.

Btw., out of interest: do you have systemd-coredump installed, or some other coredump handler set up in /proc/sys/kernel/core_pattern?
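For reference, checking which handler is active:
Bash:
cat /proc/sys/kernel/core_pattern   # a leading '|' means dumps are piped to a handler such as systemd-coredump
coredumpctl list                    # only available if systemd-coredump is installed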

I will now start some compiling jobs in different guest Linux distributions; maybe I can find something that is reproducible.
Yeah, that would be great.

How can I send you the core dump?

You can send me a mail, use my username here in the forum and add @proxmox.com to reach me.
 
A little update from my system: after an upgrade from netcup - I do not know what they did - my system suddenly seems to be stable. And after a while netcup did a KVM upgrade, and all of a sudden nothing bad happened anymore. The system has been stable for about 20 days without any issues or problems. Weird... because at first they told me that no hardware nor software was broken on my machine. Make of that what you will; I think they messed it up and then repaired every bit of it. It was not a Proxmox problem at all.
 
I also had the problem of "timeout waiting on systemd" and I just wanted to say that the problem was solved by rebooting the host machine.
 
Hello All,

I am having the same issue: all the VMs on pve (the second Proxmox node) are not accessible. I rebooted the server but the issue is still not resolved. What else do I have to do to fix this issue?
 
This issue periodically reoccurs for me as well. I had been stable for months, but it appeared at about the same time that a backup process was manually terminated, as it was running indefinitely on an unavailable cephfs.
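In that situation the killed backup typically leaves processes stuck in uninterruptible sleep on the dead mount; a generic way to confirm (not specific to this setup):
Bash:
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'   # 'D' state = blocked in the kernel, often on dead storage
grep ceph /proc/mounts                           # is the cephfs mount still present?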
 
This issue seems like it has never been fixed? Why, is all I ask. I have the same issue now with "timeout waiting on systemd" on 2 of my VMs, and this also happened to a domain controller a few weeks ago which I had to rebuild. A 5th reboot in a row did the trick for one of my VMs, but not the other.

I would really like to know how to troubleshoot these issues correctly. I have tried everything in this thread.

So, my question is: how do I troubleshoot this error message: "timeout waiting on systemd"?
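For reference, the checks that came up earlier in this thread, condensed (100 is a placeholder VMID; the pidfile path is the usual qemu-server location):
Bash:
systemctl status 100.scope                                            # is a stale scope from the previous run still around?
ps -o pid,stat,wchan:32,cmd -p "$(cat /var/run/qemu-server/100.pid)"  # Z = zombie, D = stuck in the kernel
journalctl -b -u pvedaemon -u qmeventd                                # errors around the failed start/stop
systemctl stop 100.scope                                              # may clear a stale scope so the VM can start again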
 
Since the upgrade to
Bash:
root@hv03:~# pveversion
pve-manager/7.1-10/6ddebafe (running kernel: 5.15.12-1-pve)
the error seems to be gone (here).
 
That's the version I am on, but my pve kernel version differs from yours?

Code:
root@pve01:~# pveversion
pve-manager/7.1-10/6ddebafe (running kernel: 5.13.19-3-pve)

But after several reboots today one VM came back, and after a few more reboots I can now get all my VMs to start. Just a weird thing. I would love to know the cause. Anyway, at the moment all my VMs are back up and running.
 
I have the exact same issue after the last update! One VM refuses to start. So far I have rebooted twice to no avail.
 
I have the exact same issue after the last update! One VM refuses to start. So far I have rebooted twice to no avail.

Same story here. Are you booting your VM off a USB device? That's my scenario; the VM will no longer boot after the most recent update. (Running 5.13.19-4-pve where it now fails; came from 5.13.19-3 where it worked.)

Update: Reverting back to 5.13.19-3 for now fixed the issue. USB device passthrough is working again. Something appears to be very wrong with 5.13.19-4.
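For anyone needing to do the same: the older kernel can be selected from the boot menu, or pinned, assuming your proxmox-boot-tool is new enough to have the kernel pin subcommand:
Bash:
proxmox-boot-tool kernel list                # show installed kernels
proxmox-boot-tool kernel pin 5.13.19-3-pve   # boot this kernel by default until unpinned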

Update Update: Here's how to fix this quickly: https://forum.proxmox.com/threads/v...-timeout-waiting-on-systemd.56218/post-449606
 
