Ceph MDS OOM killed on weekends

k4y53r

Member
Jun 2, 2021
Hi,

I have a 4-node PVE cluster with CephFS deployed, and for a couple of months now I have been getting MDS OOM kills. Sometimes the MDS fails over to another node and gets stuck in the clientreplay state, so I need to restart that MDS again to regain access to CephFS from all clients.
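For reference, the manual recovery I end up doing is roughly this (node name taken from my logs, just as an example):

Code:
# check which rank is stuck in clientreplay and which daemon holds it
ceph fs status
ceph health detail

# restart the stuck MDS daemon on the affected node
systemctl restart ceph-mds@zpveo1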

I have checked scheduled jobs and the Ceph syslog, but I cannot figure out what kind of job could be running that eats all the memory available on the node, so I don't know where to look to avoid this issue.
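The syslog check I did was along these lines (the time window is just an example around one of the OOM kills):

Code:
# kernel messages around the OOM kill show which process was killed and how much memory it held
journalctl -k --since "2025-04-25 23:00" --until "2025-04-26 00:30"

# MDS service log for the same window
journalctl -u ceph-mds@zpveo1 --since "2025-04-25 23:00" --until "2025-04-26 00:30"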

The cluster has 4 nodes, 2 with 96 GB RAM and another 2 with 192 GB RAM, all with 24 cores of Intel(R) Xeon(R) CPU X5675 @ 3.07GHz (2 sockets).
PVE is still on 7.3.3 and Ceph is on version 17.2.5.

There are 2 Ceph pools:
FAST with NVMe disks (2 disks/OSDs of 2 TB per node)
SLOW with HDD disks (2 disks/OSDs of 2 TB per node)

There is only 1 CephFS, on FAST, with 2 active MDS and another 2 MDS on standby. This setup was made after some issues with a single MDS, to keep critical services deployed on CephFS working in case an MDS fails and automatic recovery does not kick in.
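For completeness, the multi-active setup was done roughly like this (the filesystem name is just an example):

Code:
# allow two active MDS ranks on the filesystem
ceph fs set cephfs max_mds 2

# keep standby daemons wanted so a failed rank can be taken over
ceph fs set cephfs standby_count_wanted 2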

I get these errors mostly on weekends, from Friday night until Monday morning (no weekend backups to PBS are enabled at the moment).

Code:
-- Journal begins at Fri 2022-10-28 12:19:59 CEST, ends at Mon 2025-04-28 08:51:01 CEST. --
Apr 25 00:00:00 zpveo1 ceph-mds[782250]: 2025-04-25T00:00:00.080+0200 7f8d3c47c700 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 3615629) UID: 0
Apr 25 00:00:00 zpveo1 ceph-mds[782250]: 2025-04-25T00:00:00.104+0200 7f8d3c47c700 -1 received  signal: Hangup from  (PID: 3615630) UID: 0
Apr 25 23:55:34 zpveo1 systemd[1]: ceph-mds@zpveo1.service: A process of this unit has been killed by the OOM killer.
Apr 25 23:55:38 zpveo1 systemd[1]: ceph-mds@zpveo1.service: Main process exited, code=killed, status=9/KILL
Apr 25 23:55:38 zpveo1 systemd[1]: ceph-mds@zpveo1.service: Failed with result 'oom-kill'.
Apr 25 23:55:38 zpveo1 systemd[1]: ceph-mds@zpveo1.service: Consumed 1d 1h 11min 37.806s CPU time.
Apr 25 23:55:38 zpveo1 systemd[1]: ceph-mds@zpveo1.service: Scheduled restart job, restart counter is at 7.
Apr 25 23:55:38 zpveo1 systemd[1]: Stopped Ceph metadata server daemon.
Apr 25 23:55:38 zpveo1 systemd[1]: ceph-mds@zpveo1.service: Consumed 1d 1h 11min 37.806s CPU time.
Apr 25 23:55:39 zpveo1 systemd[1]: Started Ceph metadata server daemon.
Apr 25 23:55:39 zpveo1 ceph-mds[794009]: starting mds.zpveo1 at
Apr 26 00:00:00 zpveo1 ceph-mds[794009]: 2025-04-26T00:00:00.091+0200 7f8bc83c9700 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 798340) UID: 0
Apr 26 00:00:00 zpveo1 ceph-mds[794009]: 2025-04-26T00:00:00.127+0200 7f8bc83c9700 -1 received  signal: Hangup from  (PID: 798341) UID: 0
Apr 27 00:00:00 zpveo1 ceph-mds[794009]: 2025-04-27T00:00:00.097+0200 7f8bc83c9700 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 2171840) UID: 0
Apr 27 00:00:00 zpveo1 ceph-mds[794009]: 2025-04-27T00:00:00.121+0200 7f8bc83c9700 -1 received  signal: Hangup from  (PID: 2171841) UID: 0
Apr 28 00:00:00 zpveo1 ceph-mds[794009]: 2025-04-28T00:00:00.074+0200 7f8bc83c9700 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 3543923) UID: 0
Apr 28 00:00:00 zpveo1 ceph-mds[794009]: 2025-04-28T00:00:00.102+0200 7f8bc83c9700 -1 received  signal: Hangup from  (PID: 3543924) UID: 0
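As far as I can tell, the Hangup lines at 00:00 are just the daily log rotation (the killall -1 line matches the postrotate script from ceph-common), so they are probably unrelated to the OOM kill itself. That can be double-checked with:

Code:
# the postrotate script that sends SIGHUP to the ceph daemons
cat /etc/logrotate.d/ceph-common

# when logrotate actually runs
systemctl list-timers logrotate.timer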

[Attachment: 1745827680028.png]
Each memory peak results in an oom-kill event on the node, and most of the RAM was being used by the MDS.
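To compare the daemon's cache against its configured target, something like this should show it (daemon name is an example, run on the node hosting it):

Code:
# configured cache target for the MDS daemons
ceph config get mds mds_cache_memory_limit

# cache actually used, as reported by the daemon itself
ceph daemon mds.zpveo1 cache status

# client sessions currently holding capabilities
ceph daemon mds.zpveo1 session ls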
[Attachment: 1745827777390.png]

I tried to limit the memory usage, but with no luck:
[Attachment: 1745827845696.png]
[Attachment: 1745827899369.png]
[Attachment: 1745828019161.png]
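In case the screenshots are unclear, the kind of limits I mean are along these lines (values are only examples):

Code:
# lower the MDS cache target cluster-wide (4 GiB as an example value)
ceph config set mds mds_cache_memory_limit 4294967296

# optionally cap the service itself with a systemd override
systemctl edit ceph-mds@zpveo1
#   [Service]
#   MemoryMax=16G

As far as I understand, mds_cache_memory_limit is only a target for the cache and not a hard cap on the daemon's RSS, so the process can still grow past it.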

I've checked the Ceph documentation without finding a clue about the root cause, only some references to a possible memory leak in Ceph, but nothing on how to check for it.
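The closest I could find to checking for a leak is dumping the daemon's own memory accounting, something like this (assuming the daemon is built with tcmalloc, which the PVE packages are as far as I know):

Code:
# perf counters of the MDS, including memory-related ones
ceph daemon mds.zpveo1 perf dump

# tcmalloc heap statistics
ceph daemon mds.zpveo1 heap stats

# ask tcmalloc to return freed memory to the OS
ceph daemon mds.zpveo1 heap release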

Any clue about this??

Thanks and regards
 
Hello k4y53r,

It very much looks like the service was killed because the host does not have enough memory for its current load; from the graphs above, the host is often hitting the limit of its available memory.

Do you use ZFS for the root filesystem or for VMs? If so, could you please send us the output of `arcstat`? This will tell us how much memory the (ZFS) ARC can take.

Do you have a swap partition?
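Something along these lines would give us the relevant numbers:

Code:
# ZFS ARC usage (only relevant if ZFS is in use)
arcstat

# overall memory and swap situation on the node
free -h
swapon --show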
 
Hi,

No ZFS, and no swap is used at all:
[Attachment: 1745963336044.png]
[Attachment: 1745963357771.png]
It seems it's not about how much RAM is available; today the MDS failed on a 192 GB RAM node, as you can see below.
[Attachment: 1745963507309.png]
I checked the logs and something breaks at midnight:

Code:
-- Journal begins at Fri 2022-10-28 12:10:38 CEST, ends at Tue 2025-04-29 23:49:34 CEST. --
Apr 26 00:00:31 zpveo2 ceph-mds[2474769]: 2025-04-26T00:00:31.742+0200 7fbc5ee29700 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 1034873) UID: 0
Apr 26 00:00:31 zpveo2 ceph-mds[2474769]: 2025-04-26T00:00:31.766+0200 7fbc5ee29700 -1 received  signal: Hangup from  (PID: 1034874) UID: 0
Apr 27 00:00:31 zpveo2 ceph-mds[2474769]: 2025-04-27T00:00:31.741+0200 7fbc5ee29700 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 2466053) UID: 0
Apr 27 00:00:31 zpveo2 ceph-mds[2474769]: 2025-04-27T00:00:31.773+0200 7fbc5ee29700 -1 received  signal: Hangup from  (PID: 2466054) UID: 0
Apr 28 00:00:31 zpveo2 ceph-mds[2474769]: 2025-04-28T00:00:31.743+0200 7fbc5ee29700 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 3895442) UID: 0
Apr 28 00:00:31 zpveo2 ceph-mds[2474769]: 2025-04-28T00:00:31.767+0200 7fbc5ee29700 -1 received  signal: Hangup from  (PID: 3895443) UID: 0
Apr 28 18:21:47 zpveo2 ceph-mds[2474769]: did not load config file, using default settings.
Apr 28 18:21:47 zpveo2 ceph-mds[2474769]: 2025-04-28T18:21:47.686+0200 7f86c7f44780 -1 Errors while parsing config file!
Apr 28 18:21:47 zpveo2 ceph-mds[2474769]: 2025-04-28T18:21:47.686+0200 7f86c7f44780 -1 can't open ceph.conf: (2) No such file or directory
Apr 28 18:21:47 zpveo2 ceph-mds[2474769]: ignoring --setuser ceph since I am not root
Apr 28 18:21:47 zpveo2 ceph-mds[2474769]: ignoring --setgroup ceph since I am not root
Apr 28 18:21:55 zpveo2 ceph-mds[2474769]: unable to get monitor info from DNS SRV with service name: ceph-mon
Apr 28 18:21:55 zpveo2 ceph-mds[2474769]: 2025-04-28T18:21:55.722+0200 7f86c7f44780 -1 failed for service _ceph-mon._tcp
Apr 28 18:21:55 zpveo2 ceph-mds[2474769]: 2025-04-28T18:21:55.722+0200 7f86c7f44780 -1 monclient: get_monmap_and_config cannot identify monitors to contact
Apr 28 18:21:55 zpveo2 ceph-mds[2474769]: failed to fetch mon config (--no-mon-config to skip)
Apr 28 18:21:55 zpveo2 systemd[1]: ceph-mds@zpveo2.service: Main process exited, code=exited, status=1/FAILURE
Apr 28 18:21:55 zpveo2 systemd[1]: ceph-mds@zpveo2.service: Failed with result 'exit-code'.
Apr 28 18:21:55 zpveo2 systemd[1]: ceph-mds@zpveo2.service: Consumed 1d 3h 56min 47.404s CPU time.
Apr 28 18:21:55 zpveo2 systemd[1]: ceph-mds@zpveo2.service: Scheduled restart job, restart counter is at 3.
Apr 28 18:21:55 zpveo2 systemd[1]: Stopped Ceph metadata server daemon.
Apr 28 18:21:55 zpveo2 systemd[1]: ceph-mds@zpveo2.service: Consumed 1d 3h 56min 47.404s CPU time.
Apr 28 18:21:56 zpveo2 systemd[1]: Started Ceph metadata server daemon.
Apr 28 18:21:56 zpveo2 ceph-mds[797254]: starting mds.zpveo2 at
Apr 29 00:00:31 zpveo2 ceph-mds[797254]: 2025-04-29T00:00:31.753+0200 7f604d754700 -1 Fail to open '/proc/1140492/cmdline' error = (2) No such file or directory
Apr 29 00:00:31 zpveo2 ceph-mds[797254]: 2025-04-29T00:00:31.753+0200 7f604d754700 -1 received  signal: Hangup from <unknown> (PID: 1140492) UID: 0
Apr 29 00:00:31 zpveo2 ceph-mds[797254]: 2025-04-29T00:00:31.785+0200 7f604d754700 -1 received  signal: Hangup from  (PID: 1140493) UID: 0
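Not sure if it is related, but the "can't open ceph.conf" part at 18:21 made me double-check the usual PVE setup, where /etc/ceph/ceph.conf is a symlink into the /etc/pve mount:

Code:
# on PVE the config is normally a symlink into the pmxcfs mount
ls -l /etc/ceph/ceph.conf

# pmxcfs provides /etc/pve; if it is not mounted, the config appears missing
systemctl status pve-cluster
findmnt /etc/pve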

No cron jobs are configured on my side:
[Attachment: 1745963773339.png]
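Just to be sure, this is roughly how I looked for anything scheduled outside of my own crontab:

Code:
# all systemd timers, including inactive ones
systemctl list-timers --all

# package-provided and per-user cron jobs
cat /etc/crontab
ls /etc/cron.d /etc/cron.daily /etc/cron.weekly /etc/cron.monthly
for u in $(cut -d: -f1 /etc/passwd); do crontab -l -u "$u" 2>/dev/null; done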
Today's MDS migration was smooth, with no impact at all, but when it happens on a weekend the MDS restart mostly fails and needs a manual restart.
[Attachment: 1745963882040.png]
Also, there is no disk usage pressure on the pool, as you can see below:
[Attachment: 1745964094802.png]
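For reference, the CLI equivalent I used to double-check the pool usage:

Code:
# per-pool usage and remaining capacity
ceph df detail

# per-OSD fill level, to rule out a single nearly full OSD
ceph osd df tree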

Any log I should check?
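If it helps, I can also raise the MDS log level for the next weekend, something along these lines (levels are just an example, and I would revert them afterwards since the logs grow quickly):

Code:
# temporarily raise MDS debug logging
ceph config set mds debug_mds 10
ceph config set mds debug_ms 1

# revert to defaults afterwards
ceph config rm mds debug_mds
ceph config rm mds debug_ms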

Thanks for your help