i just posted the link to make sure we are talking about the same thing
i guess the 100% cpu spike is another problem but maybe somehow related to the systemd timeout
because from my experience a process goes to the zombie state if it was stuck in a system call which doesn't return and you kill that process
so maybe that is the reason for the 100% cpu spike
but I also believe that the problem may have something to do with the guest os
i have no experience in debugging kvm related kernel problems, just with messed up kernel drivers
in the vm i'm compiling cross compilers for my embedded systems
so it's sometimes using all cores at 100%
all other vms are still running and i had that issue only with the vms which are creating the cross compilers
i think i had the problem of the vm hanging with different guest linux distributions
but i just rebooted the vm host and didn't take a closer look at what happened
i think it happened 2 or 3 times in the last weeks
i will have a closer look now when it happens again to help you figure out what the problem is
i created the core dump when the vm was hanging the last time and not when it became a zombie
when i created the core dump the vm came back after clicking reset in the webui
the core dump was finished before i clicked reset
i will try to create a coredump when the systemd timeout issue happens
here is what i did when the kvm became a zombie (hopefully i remember it correctly)
1. i had a noVNC window open and the vm wasn't responding anymore
2. i clicked reset in the webui but nothing happened
3. i clicked stop in the webui but nothing happened
4. the error message systemd timeout appeared in the log on the webui
5. i executed "systemctl status qemu.slice" and the vm process was still there but not in a zombie state (i don't remember if it had the 100% cpu spike)
6. i executed kill -9 on the vm pid and it became a zombie (the main pid which is shown in "systemctl status qemu.slice", not a thread)
7. i clicked start in the webui but nothing happened
8. the error message systemd timeout appeared in the log on the webui
9. i executed "systemctl status qemu.slice" and the vm process was still there but now without the commandline arguments
something like this
systemctl status qemu.slice
● qemu.slice
     Loaded: loaded
     Active: active since Wed 2021-04-14 11:18:11 CEST; 1 day 22h ago
      Tasks: 95
     Memory: 23.3G
     CGroup: /qemu.slice
             ├─101.scope
             │ └─6992 kvm
so i guess the kvm process wasn't responding anymore to signals before i killed it
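to check that guess next time, something like the following python sketch could be run on the vm host. it reads the "State:" line from /proc/<pid>/task/<tid>/status for every thread of the kvm process, so it should show whether the main pid is a zombie (Z) while some thread is still stuck in uninterruptible sleep (D). the function names parse_state and thread_states and the proc_root parameter are made up by me for this example, and i have not tested it against a hanging vm yet.

```python
from pathlib import Path


def parse_state(status_text):
    """Return the one-letter state code from the text of a /proc status file."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            # e.g. "State:\tD (disk sleep)" -> "D"
            return line.split()[1]
    return "?"  # no State: line found


def thread_states(pid, proc_root="/proc"):
    """Map every thread id of `pid` to its state code (R, S, D, Z, ...)."""
    states = {}
    task_dir = Path(proc_root, str(pid), "task")
    for task in sorted(task_dir.iterdir(), key=lambda p: int(p.name)):
        try:
            states[int(task.name)] = parse_state((task / "status").read_text())
        except OSError:
            # the thread went away while we were reading, skip it
            continue
    return states
```

for the vm from the output above that would be thread_states(6992). if the main pid shows Z and one of the other thread ids shows D, the D thread is the one that is blocked in the kernel and the one worth looking at.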
i will now start some compiling jobs in different guest linux distributions
maybe i can find something that is reproducible
how can i send you the core dump?
i uploaded it to onedrive