PVE services killed by OOMK (Out Of Memory Killer)

May 5, 2022
There was an instance where one PVE node was failing sporadically.
A quick look into the server's management interface reported defective memory.
Since it is an older test server, the integrated remote console is not working properly...
So no worries: I have the PVE web GUI with an integrated shell, so I can stress the RAM from there to verify the faulty module.

Done as planned: I filled the remaining free RAM with the "memtester" program (an example invocation is sketched below).
Some time later, the two other nodes in the cluster reported that the node with the defective RAM was dead and gone (red cross icon).
I took a look at the web GUI of the node with the defective RAM; surprisingly, it was still responsive and reported that the other 2 nodes were okay (green check icon).
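For reference, filling most of the free RAM with memtester looks roughly like this (a sketch; the 50G figure is just an assumption, size it to whatever free -g reports):
Code:
# check how much RAM is actually free
free -g
# lock and test ~50 GiB for one iteration
memtester 50G 1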

What happened? I took a look at /var/log/messages and noticed several OOM kills.
Code:
kernel: pvedaemon worke invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
kernel: oom_kill_process.cold+0xb/0x10
kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=pvedaemon.service,mems_allowed=0-1,global_oom,task_memcg=/system.slice/corosync.service,task=corosync,pid=3584,uid=0
Code:
kernel: pvestatd invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
kernel: oom_kill_process.cold+0xb/0x10
kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=pvestatd.service,mems_allowed=0-1,global_oom,task_memcg=/system.slice/pveproxy.service,task=pveproxy worker,pid=2270813,uid=33
and many more PVE services like pve-ha-lrm, pve-ha-crm, pmxcfs, ... and somewhere in between, the memtester:
Code:
kernel: pve-ha-lrm invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
kernel: oom_kill_process.cold+0xb/0x10
kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=pve-ha-lrm.service,mems_allowed=0-1,global_oom,task_memcg=/user.slice/user-0.slice/session-198.scope,task=memtester,pid=2371817,uid=0
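For anyone looking for these entries themselves, something like this finds them (assuming rsyslog still writes /var/log/messages; the kernel journal works as well):
Code:
grep -i oom /var/log/messages
# or from the kernel ring buffer / journal:
journalctl -k | grep -i oom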
Interestingly, all important services (e.g. pveproxy, ...) seemed to be restarted automatically, except corosync.
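Whether a service comes back automatically depends on its systemd Restart= setting, which can be inspected like this (the actual values depend on the shipped unit files):
Code:
systemctl show -p Restart corosync.service
systemctl show -p Restart pveproxy.service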

A quick ps -ef | grep corosync verified this. Yup, corosync was gone, which explained the broken cluster state.

Starting corosync again with systemctl start corosync fixed the cluster.
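As a stopgap, one could let systemd restart corosync on failure with a drop-in override (a sketch, not an official recommendation):
Code:
# /etc/systemd/system/corosync.service.d/override.conf
[Service]
Restart=on-failure
RestartSec=5

# afterwards:
systemctl daemon-reload
systemctl restart corosync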

Questions here for the Proxmox team:
1. Shouldn't these important cluster services be excluded from the OOM killer, for example by adjusting their scores under /proc/<PID>/oom_score_adj? (A sketch of such an adjustment follows below.)
2. Is there a reason that all the other services restart automatically but corosync does not?
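For illustration, protecting a running corosync from the OOM killer could look like this (-1000 disables OOM killing for that process entirely; a sketch, not an official recommendation):
Code:
# one-shot, lost when the process restarts:
echo -1000 > /proc/$(pidof corosync)/oom_score_adj

# or persistently via a systemd drop-in:
# [Service]
# OOMScoreAdjust=-1000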
 
