systemd 100% CPU and zombie processes

pveuser

New Member
Mar 25, 2023
Hello - we have a number of standalone PVE hosts and also some clusters, all running various versions of PVE 7, for example:

pve-manager/7.3-4/d69b70d4

pve-manager/7.2-11/b76d3178

Last night, after pveupdate ran, all the systems became pretty much locked up, with /sbin/init (systemd, PID 1) churning through 100% CPU and, depending on what was happening on the box, anywhere from a few to thousands of zombie processes.

All VMs are working, so we don't want to reboot at the moment.

Anyone got any ideas?


root@dub-cwt-pve5:/etc# ps axo stat,ppid,pid,comm | grep -w defunct


Zs 1 2851130 pveupdate <defunct>
Z 1 2851893 systemctl <defunct>
Z 1 2851894 grep <defunct>
Z 1 2851895 awk <defunct>
Z 1 2851896 grep <defunct>
Z 1 2851898 systemctl <defunct>
Z 1 2851899 grep <defunct>
Z 1 2851900 awk <defunct>
Z 1 2851901 grep <defunct>
Z 1 2851903 systemctl <defunct>
<snip>


PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 0 168112 11420 7564 R 100.0 0.0 653:37.74 systemd

The problem is that the VMs are running, but I can't use any PVE commands (to do a migration, for example): they just hang because the OS is borked.
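
For anyone triaging the same symptom, here is a minimal sketch that groups the zombies by parent PID and command name instead of scrolling through thousands of lines. The function name `summarize_zombies` is made up for illustration; it only assumes the `stat,ppid,pid,comm` column order used in the `ps` command above.

```shell
#!/bin/sh
# summarize_zombies (hypothetical helper): reads "STAT PPID PID COMM ..."
# lines on stdin and prints "<count> <ppid> <comm>" for every zombie,
# most frequent first. Any STAT value starting with Z counts as a zombie.
summarize_zombies() {
    awk '$1 ~ /^Z/ { count[$2 " " $4]++ }
         END { for (k in count) print count[k], k }' | sort -rn
}

# Usage on a live host (the "=" suffixes suppress the header line):
# ps -eo stat=,ppid=,pid=,comm= | summarize_zombies
```

If nearly every entry has PPID 1, as in the listing above, the zombies are simply waiting for PID 1 to reap them, which fits a systemd that is busy spinning rather than a problem in the individual commands.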
 
Having the exact same issue right now.

top - 17:40:59 up 7 min, 1 user, load average: 1.07, 0.80, 0.40
Tasks: 796 total, 2 running, 790 sleeping, 0 stopped, 4 zombie
%Cpu(s): 0.6 us, 0.7 sy, 0.0 ni, 98.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 257830.0 total, 256140.2 free, 1501.6 used, 188.2 buff/cache
MiB Swap: 8192.0 total, 8192.0 free, 0.0 used. 255116.7 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 0 164560 10876 7744 R 100.0 0.0 6:58.59 systemd
2748 root 20 0 10844 4520 3232 R 0.7 0.0 0:00.07 top
64 root rt 0 0 0 0 S 0.3 0.0 0:01.96 migration/8
 
Code:
-bash-5.1# journalctl -f
-- Journal begins at Sat 2023-03-25 15:20:30 GMT. --
Mar 25 17:34:05 atlas kernel: vmbr0: port 1(enp4s0) entered forwarding state
Mar 25 17:34:05 atlas kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vmbr0: link becomes ready
Mar 25 17:34:13 atlas chronyd[2629]: Selected source 85.199.214.99 (2.debian.pool.ntp.org)
Mar 25 17:34:13 atlas chronyd[2629]: System clock TAI offset set to 37 seconds
Mar 25 17:34:14 atlas sshd[2706]: Accepted password for root from 192.168.0.8 port 51422 ssh2
Mar 25 17:34:14 atlas sshd[2706]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Mar 25 17:36:09 atlas systemd-logind[2527]: Failed to start user service 'user@0.service', ignoring: Connection timed out
Mar 25 17:36:34 atlas systemd-logind[2527]: Failed to start session scope session-1.scope: Connection timed out
Mar 25 17:36:34 atlas sshd[2706]: pam_systemd(sshd:session): Failed to create session: Connection timed out
Mar 25 17:37:09 atlas systemd-logind[2527]: Failed to stop user service 'user@0.service', ignoring: Connection timed out
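
The "Connection timed out" lines above are systemd-logind giving up on requests to the service manager. A rough way to check the same thing by hand, without leaving a hung shell behind, is to wrap a query in `timeout`. This is only a sketch: `probe_responsive` is a made-up helper name, and the assumption that `systemctl is-system-running` would hang in this state is untested here.

```shell
#!/bin/sh
# probe_responsive SECONDS COMMAND [ARGS...]  (hypothetical helper)
# Runs COMMAND under coreutils `timeout` and reports whether it finished.
# Prints "unresponsive" if the time limit was hit (exit status 124),
# "error" if the command could not be run at all (125, 126, 127),
# and "responsive" for any other exit status, including non-zero ones.
probe_responsive() {
    secs=$1
    shift
    timeout "$secs" "$@" >/dev/null 2>&1
    rc=$?
    case "$rc" in
        124)         echo "unresponsive" ;;
        125|126|127) echo "error" ;;
        *)           echo "responsive" ;;
    esac
}

# Usage on an affected host (assumed query, 10 second limit):
# probe_responsive 10 systemctl is-system-running
```

A non-zero exit such as "degraded" still counts as responsive here, since the point is only to tell a slow or failing answer apart from no answer at all.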