There was an instance where one PVE node was sporadically failing.
A quick look into the server's management interface reported a defective memory module.
Since it is an older test server, the integrated remote console is not working properly...
So no worries: the PVE web GUI has an integrated shell, so I could stress the RAM from there to verify the faulty RAM.
Done as thought, I filled the remaining free RAM with the "memtester" program.
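For reference, the invocation was something along these lines (the amount of memory here is only illustrative):
Code:
# test (almost) all of the remaining free RAM, one loop
memtester 50G 1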
Some time later the two other nodes in the cluster reported that the node with the defective RAM was dead and gone (red cross icon).
I took a look at the web GUI of the node with the defective RAM; surprisingly, it was still responsive and reported that the other two nodes were okay (green check icon).
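As an aside, the cluster membership can also be checked from the CLI on each node; on the two healthy nodes this would presumably have shown the third node missing from the membership list:
Code:
pvecm status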
What happened? I took a look at /var/log/messages and noticed several OOM kills: corosync, a pveproxy worker, and many more PVE services like pve-ha-lrm, pve-ha-crm, pmxcfs... and somewhere in between, the memtester itself.
Interestingly, all of the important services (e.g. pveproxy, ...) seemed to be restarted automatically, besides corosync.
Some excerpts from the log:
Code:
kernel: pvedaemon worke invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
kernel: oom_kill_process.cold+0xb/0x10
kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=pvedaemon.service,mems_allowed=0-1,global_oom,task_memcg=/system.slice/corosync.service,task=corosync,pid=3584,uid=0
Code:
kernel: pvestatd invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
kernel: oom_kill_process.cold+0xb/0x10
kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=pvestatd.service,mems_allowed=0-1,global_oom,task_memcg=/system.slice/pveproxy.service,task=pveproxy worker,pid=2270813,uid=33
Code:
kernel: pve-ha-lrm invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
kernel: oom_kill_process.cold+0xb/0x10
kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=pve-ha-lrm.service,mems_allowed=0-1,global_oom,task_memcg=/user.slice/user-0.slice/session-198.scope,task=memtester,pid=2371817,uid=0
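Side note: as the excerpts show, the PVE daemons that triggered the OOM killer run with the default oom_score_adj=0. Whether a given service is protected can be checked directly via /proc, for example for corosync:
Code:
# -1000 means "never kill", 0 is the default, i.e. no protection
cat /proc/$(pidof corosync)/oom_score_adj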
A short:
ps -ef | grep corosync
verified this. Yup, corosync was gone, which explained the broken cluster state.
Starting corosync again with:
systemctl start corosync
fixed the cluster again.
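Why did it stay down in the first place? My guess is the unit's Restart= setting. Assuming the stock unit files, this can be checked with:
Code:
systemctl show corosync -p Restart
If this prints Restart=no, systemd will not bring corosync back after an OOM kill, while units configured with Restart=on-failure or Restart=always come back on their own.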
Questions here to the Proxmox team:
1. Shouldn't these important cluster services be excluded from the OOM killer, for example by adjusting their scores under /proc/<PID>/oom_score_adj? A sketch of what I mean follows below the questions.
2. Is there a reason that all the other services are restarted automatically, but corosync is not?
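To make question 1 concrete, here is a minimal sketch of what I have in mind, as a systemd drop-in for corosync (these are standard systemd options, nothing PVE ships by default as far as I can tell; take it as a sketch, not a tested recommendation):
Code:
# /etc/systemd/system/corosync.service.d/oom.conf (hypothetical drop-in)
[Service]
# Lower the kill priority; -1000 would exempt corosync from the OOM killer entirely
OOMScoreAdjust=-500
# Bring corosync back automatically if it does get killed anyway
Restart=on-failure
RestartSec=5
After creating the drop-in, systemctl daemon-reload followed by systemctl restart corosync applies it. Whether a hard -1000 is a good idea here, or whether that would just shift the OOM kills onto the guests, is part of what I am asking.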