PVE services killed by OOMK (Out Of Memory Killer)

May 5, 2022
There was an instance where one PVE node was failing sporadically.
A quick look into the server's management interface reported defective memory.
Since it is an older test server, the integrated remote console is not working properly...
So no worries: I have the PVE web GUI with an integrated shell, so I can stress the RAM from there to verify the faulty module.

Done as planned: I filled the remaining free RAM with the "memtester" program (an example invocation is sketched below).
Some time later, the two other nodes in the cluster reported that the node with the defective RAM was dead and gone (red cross icon).
I took a look at the web GUI of the node with the defective RAM; surprisingly, it was still responsive and reported that the other 2 nodes were okay (green check icon).
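For reference, filling most of the free RAM with memtester looks roughly like this (a sketch; the 50G figure is just an assumption, size it to whatever free -g reports):
Code:
# check how much RAM is actually free
free -g
# lock and test ~50 GiB for one iteration
memtester 50G 1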

What happened? I took a look at /var/log/messages and noticed several OOM kills.
Code:
kernel: pvedaemon worke invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
kernel: oom_kill_process.cold+0xb/0x10
kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=pvedaemon.service,mems_allowed=0-1,global_oom,task_memcg=/system.slice/corosync.service,task=corosync,pid=3584,uid=0
Code:
kernel: pvestatd invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
kernel: oom_kill_process.cold+0xb/0x10
kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=pvestatd.service,mems_allowed=0-1,global_oom,task_memcg=/system.slice/pveproxy.service,task=pveproxy worker,pid=2270813,uid=33
and many more PVE services like pve-ha-lrm, pve-ha-crm, pmxcfs, ... and somewhere in between, the memtester:
Code:
kernel: pve-ha-lrm invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
kernel: oom_kill_process.cold+0xb/0x10
kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=pve-ha-lrm.service,mems_allowed=0-1,global_oom,task_memcg=/user.slice/user-0.slice/session-198.scope,task=memtester,pid=2371817,uid=0
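For anyone looking for these entries themselves, something like this finds them (assuming rsyslog still writes /var/log/messages; the kernel journal works as well):
Code:
grep -i oom /var/log/messages
# or from the kernel ring buffer / journal:
journalctl -k | grep -i oom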
Interestingly, all important services (e.g. pveproxy, ...) seemed to be restarted automatically, except corosync.
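Whether a service comes back automatically depends on its systemd Restart= setting, which can be inspected like this (the actual values depend on the shipped unit files):
Code:
systemctl show -p Restart corosync.service
systemctl show -p Restart pveproxy.service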

A quick ps -ef | grep corosync verified this. Yup, corosync was gone, which explained the broken cluster state.

Starting corosync again with systemctl start corosync fixed the cluster.
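As a stopgap, one could let systemd restart corosync on failure with a drop-in override (a sketch, not an official recommendation):
Code:
# /etc/systemd/system/corosync.service.d/override.conf
[Service]
Restart=on-failure
RestartSec=5

# afterwards:
systemctl daemon-reload
systemctl restart corosync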

Questions here for the Proxmox team:
1. Shouldn't these important cluster services be excluded from the OOM killer, for example by adjusting their scores under /proc/<PID>/oom_score_adj? (A sketch of such an adjustment follows below.)
2. Is there a reason that all the other services restart automatically but corosync does not?
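For illustration, protecting a running corosync from the OOM killer could look like this (-1000 disables OOM killing for that process entirely; a sketch, not an official recommendation):
Code:
# one-shot, lost when the process restarts:
echo -1000 > /proc/$(pidof corosync)/oom_score_adj

# or persistently via a systemd drop-in:
# [Service]
# OOMScoreAdjust=-1000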
 
