Ceph MDS OOM killed on weekends

k4y53r

Member
Jun 2, 2021
Hi,

I have a 4-node PVE cluster with CephFS deployed. For a couple of months now I have been getting MDS OOM kills; sometimes the MDS fails over to another node and gets stuck in clientreplay status, so I need to restart that MDS again to regain access to CephFS from all clients.

I have checked the scheduled jobs and the Ceph syslog but cannot find what kind of job could be running that uses all the memory available on the node, so I don't know where to look to avoid these issues.
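
For reference, these are roughly the checks I ran for scheduled jobs and the journal (nothing stood out):

Code:
# cron jobs and systemd timers that could fire at night or on weekends
crontab -l
ls /etc/cron.d /etc/cron.daily /etc/cron.weekly
systemctl list-timers --all

# journal around the time of one of the kills
journalctl --since "2025-04-25 22:00" --until "2025-04-26 01:00" | grep -iE 'oom|mds'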

The cluster has 4 nodes, 2 with 96 GB RAM and the other 2 with 192 GB RAM, all with 24 cores: Intel(R) Xeon(R) CPU X5675 @ 3.07GHz (2 sockets).
PVE is still on 7.3.3 and Ceph is at version 17.2.5.

There are 2 Ceph pools:
FAST with NVMe disks (2 disks / OSDs, 2 TB each, per node)
SLOW with HDD disks (2 disks / OSDs, 2 TB each, per node)

There is only 1 CephFS, on FAST, with 2 active MDS and another 2 MDS on standby. This setup was made after some issues with a single MDS, to keep critical services deployed on CephFS working in case an MDS fails and automatic recovery does not work (roughly the commands sketched below).
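
In case it matters, the multi-active part was set up with something like this (sketch only, using the filesystem name zk8scephfso1 from the status output further down; I'm not 100% sure these were the exact commands used back then):

Code:
# run two active MDS ranks; the remaining daemons stay as standby
ceph fs set zk8scephfso1 max_mds 2

# optionally let a standby follow the active journal for faster takeover
ceph fs set zk8scephfso1 allow_standby_replay true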

I get these errors mostly on weekends, from Friday night until Monday morning (no weekend backups to PBS are enabled at the moment).

Code:
-- Journal begins at Fri 2022-10-28 12:19:59 CEST, ends at Mon 2025-04-28 08:51:01 CEST. --
Apr 25 00:00:00 zpveo1 ceph-mds[782250]: 2025-04-25T00:00:00.080+0200 7f8d3c47c700 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 3615629) UID: 0
Apr 25 00:00:00 zpveo1 ceph-mds[782250]: 2025-04-25T00:00:00.104+0200 7f8d3c47c700 -1 received  signal: Hangup from  (PID: 3615630) UID: 0
Apr 25 23:55:34 zpveo1 systemd[1]: ceph-mds@zpveo1.service: A process of this unit has been killed by the OOM killer.
Apr 25 23:55:38 zpveo1 systemd[1]: ceph-mds@zpveo1.service: Main process exited, code=killed, status=9/KILL
Apr 25 23:55:38 zpveo1 systemd[1]: ceph-mds@zpveo1.service: Failed with result 'oom-kill'.
Apr 25 23:55:38 zpveo1 systemd[1]: ceph-mds@zpveo1.service: Consumed 1d 1h 11min 37.806s CPU time.
Apr 25 23:55:38 zpveo1 systemd[1]: ceph-mds@zpveo1.service: Scheduled restart job, restart counter is at 7.
Apr 25 23:55:38 zpveo1 systemd[1]: Stopped Ceph metadata server daemon.
Apr 25 23:55:38 zpveo1 systemd[1]: ceph-mds@zpveo1.service: Consumed 1d 1h 11min 37.806s CPU time.
Apr 25 23:55:39 zpveo1 systemd[1]: Started Ceph metadata server daemon.
Apr 25 23:55:39 zpveo1 ceph-mds[794009]: starting mds.zpveo1 at
Apr 26 00:00:00 zpveo1 ceph-mds[794009]: 2025-04-26T00:00:00.091+0200 7f8bc83c9700 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 798340) UID: 0
Apr 26 00:00:00 zpveo1 ceph-mds[794009]: 2025-04-26T00:00:00.127+0200 7f8bc83c9700 -1 received  signal: Hangup from  (PID: 798341) UID: 0
Apr 27 00:00:00 zpveo1 ceph-mds[794009]: 2025-04-27T00:00:00.097+0200 7f8bc83c9700 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 2171840) UID: 0
Apr 27 00:00:00 zpveo1 ceph-mds[794009]: 2025-04-27T00:00:00.121+0200 7f8bc83c9700 -1 received  signal: Hangup from  (PID: 2171841) UID: 0
Apr 28 00:00:00 zpveo1 ceph-mds[794009]: 2025-04-28T00:00:00.074+0200 7f8bc83c9700 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 3543923) UID: 0
Apr 28 00:00:00 zpveo1 ceph-mds[794009]: 2025-04-28T00:00:00.102+0200 7f8bc83c9700 -1 received  signal: Hangup from  (PID: 3543924) UID: 0

1745827680028.png
Each memory peak results in an oom-kill event on the node, and most of the RAM was used by the MDS.
1745827777390.png

I tried to limit memory usage, but no luck:
1745827845696.png

1745827899369.png

1745828019161.png
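
In case the screenshots above are unclear, the kind of limit I mean is the MDS cache memory target. As far as I understand it is only a target the daemon tries to respect, not a hard cap, so by itself it does not prevent an OOM kill. Roughly:

Code:
# cache memory target per MDS daemon (4 GiB in this example)
ceph config set mds mds_cache_memory_limit 4294967296

# verify what is currently applied
ceph config get mds mds_cache_memory_limit
ceph daemon mds.zpveo1 config show | grep mds_cache_memory_limit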

I've checked the Ceph documentation with no clue about the root cause, only some references to a possible memory leak in Ceph, but no way to check it.
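
The closest I got to checking it myself is querying the MDS through the admin socket; if someone knows a better way to confirm a leak, please tell me:

Code:
# cache usage compared to the configured limit
ceph daemon mds.zpveo1 cache status

# breakdown by internal memory pool
ceph daemon mds.zpveo1 dump_mempools

# tcmalloc heap statistics, and optionally return unused memory to the OS
ceph tell mds.zpveo1 heap stats
ceph tell mds.zpveo1 heap release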

Any clue about this?

Thanks and regards
 
Hello k4y53r,

It very much looks like the service was killed because the host does not have enough memory for its current load; from the graphs above, the host often comes close to exhausting its available memory.

Do you use ZFS for the root filesystem or VMs? If so, could you please send us the output of `arcstat`? This will tell us how much memory the (ZFS) ARC can take.

Do you have a swap partition?
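
You can check both quickly on each node with:

Code:
# ZFS ARC usage (the command is only present if ZFS is installed)
arcstat

# overall memory and swap
free -h
swapon --show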
 
Hi,

No ZFS, and no swap is used at all.
1745963336044.png
1745963357771.png
It seems it's not about how much RAM is available; today the MDS failed on a 192 GB RAM node, as you can see below.
1745963507309.png
I checked the logs and something breaks at midnight:

Code:
-- Journal begins at Fri 2022-10-28 12:10:38 CEST, ends at Tue 2025-04-29 23:49:34 CEST. --
Apr 26 00:00:31 zpveo2 ceph-mds[2474769]: 2025-04-26T00:00:31.742+0200 7fbc5ee29700 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 1034873) UID: 0
Apr 26 00:00:31 zpveo2 ceph-mds[2474769]: 2025-04-26T00:00:31.766+0200 7fbc5ee29700 -1 received  signal: Hangup from  (PID: 1034874) UID: 0
Apr 27 00:00:31 zpveo2 ceph-mds[2474769]: 2025-04-27T00:00:31.741+0200 7fbc5ee29700 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 2466053) UID: 0
Apr 27 00:00:31 zpveo2 ceph-mds[2474769]: 2025-04-27T00:00:31.773+0200 7fbc5ee29700 -1 received  signal: Hangup from  (PID: 2466054) UID: 0
Apr 28 00:00:31 zpveo2 ceph-mds[2474769]: 2025-04-28T00:00:31.743+0200 7fbc5ee29700 -1 received  signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror  (PID: 3895442) UID: 0
Apr 28 00:00:31 zpveo2 ceph-mds[2474769]: 2025-04-28T00:00:31.767+0200 7fbc5ee29700 -1 received  signal: Hangup from  (PID: 3895443) UID: 0
Apr 28 18:21:47 zpveo2 ceph-mds[2474769]: did not load config file, using default settings.
Apr 28 18:21:47 zpveo2 ceph-mds[2474769]: 2025-04-28T18:21:47.686+0200 7f86c7f44780 -1 Errors while parsing config file!
Apr 28 18:21:47 zpveo2 ceph-mds[2474769]: 2025-04-28T18:21:47.686+0200 7f86c7f44780 -1 can't open ceph.conf: (2) No such file or directory
Apr 28 18:21:47 zpveo2 ceph-mds[2474769]: ignoring --setuser ceph since I am not root
Apr 28 18:21:47 zpveo2 ceph-mds[2474769]: ignoring --setgroup ceph since I am not root
Apr 28 18:21:55 zpveo2 ceph-mds[2474769]: unable to get monitor info from DNS SRV with service name: ceph-mon
Apr 28 18:21:55 zpveo2 ceph-mds[2474769]: 2025-04-28T18:21:55.722+0200 7f86c7f44780 -1 failed for service _ceph-mon._tcp
Apr 28 18:21:55 zpveo2 ceph-mds[2474769]: 2025-04-28T18:21:55.722+0200 7f86c7f44780 -1 monclient: get_monmap_and_config cannot identify monitors to contact
Apr 28 18:21:55 zpveo2 ceph-mds[2474769]: failed to fetch mon config (--no-mon-config to skip)
Apr 28 18:21:55 zpveo2 systemd[1]: ceph-mds@zpveo2.service: Main process exited, code=exited, status=1/FAILURE
Apr 28 18:21:55 zpveo2 systemd[1]: ceph-mds@zpveo2.service: Failed with result 'exit-code'.
Apr 28 18:21:55 zpveo2 systemd[1]: ceph-mds@zpveo2.service: Consumed 1d 3h 56min 47.404s CPU time.
Apr 28 18:21:55 zpveo2 systemd[1]: ceph-mds@zpveo2.service: Scheduled restart job, restart counter is at 3.
Apr 28 18:21:55 zpveo2 systemd[1]: Stopped Ceph metadata server daemon.
Apr 28 18:21:55 zpveo2 systemd[1]: ceph-mds@zpveo2.service: Consumed 1d 3h 56min 47.404s CPU time.
Apr 28 18:21:56 zpveo2 systemd[1]: Started Ceph metadata server daemon.
Apr 28 18:21:56 zpveo2 ceph-mds[797254]: starting mds.zpveo2 at
Apr 29 00:00:31 zpveo2 ceph-mds[797254]: 2025-04-29T00:00:31.753+0200 7f604d754700 -1 Fail to open '/proc/1140492/cmdline' error = (2) No such file or directory
Apr 29 00:00:31 zpveo2 ceph-mds[797254]: 2025-04-29T00:00:31.753+0200 7f604d754700 -1 received  signal: Hangup from <unknown> (PID: 1140492) UID: 0
Apr 29 00:00:31 zpveo2 ceph-mds[797254]: 2025-04-29T00:00:31.785+0200 7f604d754700 -1 received  signal: Hangup from  (PID: 1140493) UID: 0

No cron jobs are configured on my side.
1745963773339.png
Today the MDS migration was smooth with no impact at all, but when it happens on a weekend the MDS restart usually fails and needs a manual restart.
1745963882040.png
Also, there is no disk usage pressure on the pool, as you can see below.
1745964094802.png

Is there any log I should check?
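
These are the ones I am already pulling, in case I'm missing something obvious:

Code:
# MDS service log on the node where the daemon died
journalctl -u ceph-mds@zpveo2 --since "2025-04-28" --no-pager

# kernel side of any OOM kill
journalctl -k | grep -i oom

# crash reports recorded by the cluster, if any
ceph crash ls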

Thanks for your help
 
Hi again,

another MDS error today, not on a weekend; one MDS stays frozen while in active status, processing almost no requests.
Code:
ceph fs status
zk8scephfso1 - 78 clients
============
RANK  STATE    MDS       ACTIVITY     DNS    INOS   DIRS   CAPS
 0    active  zpveo3  Reqs:    4 /s  83.7k  61.6k  21.1k  60.8k
 1    active  zpveo2  Reqs:    3 /s  30.5k  25.9k   785   23.2k
         POOL            TYPE     USED  AVAIL
zk8scephfso1_metadata  metadata  8399M  1327G
  zk8scephfso1_data      data    4788G  1327G
STANDBY MDS
   zpveo4
   zpveo1
MDS version: ceph version 17.2.5 (e04241aa9b639588fa6c864845287d2824cb6b55) quincy (stable)

No memory pressure today, only yesterday...
1746535212594.png
The last drop in used memory was when I manually restarted the MDS deployed on this node, not due to an oom-kill event.
1746535193194.png
But the MDS on node3 is still failing.
1746535309778.png
Code:
[WRN] FS_DEGRADED: 1 filesystem is degraded

After restarting mds.0, the filesystem and the affected services recovered.

Code:
ceph fs status
zk8scephfso1 - 83 clients
============
RANK  STATE    MDS       ACTIVITY     DNS    INOS   DIRS   CAPS
 0    active  zpveo1  Reqs:  354 /s   116k  88.7k  14.4k  18.2k
 1    active  zpveo2  Reqs:  162 /s  11.4k  11.3k   392   10.0k
         POOL            TYPE     USED  AVAIL
zk8scephfso1_metadata  metadata  7252M  1324G
  zk8scephfso1_data      data    4790G  1324G
STANDBY MDS
   zpveo4
   zpveo3
MDS version: ceph version 17.2.5 (e04241aa9b639588fa6c864845287d2824cb6b55) quincy (stable)
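
For the record, by "restart" I mean either restarting the systemd unit on the node holding the stuck rank or failing that rank so a standby takes over, something like:

Code:
# restart the daemon on the node that held rank 0
systemctl restart ceph-mds@zpveo3

# or mark it failed so one of the standby MDS takes over the rank
ceph mds fail zpveo3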

I'll keep monitoring and updating until this gets resolved...
 

Attachments

  • 1746535167617.png
Hi,

things seem to be getting worse: 3 MDS errors in the last 24 h, the last one a couple of minutes ago. The MDS gets stuck in clientreplay status and errors show up in the syslog (I cannot paste the full log due to post limits, see the attached txt file).

Code:
May 12 14:29:19 zpveo2 ceph-mds[2264075]: ./src/mds/CDentry.h: In function 'virtual CDentry::~CDentry()' thread 7f36b4c3c700 time 2025-05-12T14:29:19.734946+0200
May 12 14:29:19 zpveo2 ceph-mds[2264075]: ./src/mds/CDentry.h: 130: FAILED ceph_assert(batch_ops.empty())
May 12 14:29:19 zpveo2 ceph-mds[2264075]:  ceph version 17.2.5 (e04241aa9b639588fa6c864845287d2824cb6b55) quincy (stable)
May 12 14:29:19 zpveo2 ceph-mds[2264075]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x124) [0x7f36c017223c]
May 12 14:29:19 zpveo2 ceph-mds[2264075]:  2: /usr/lib/ceph/libceph-common.so.2(+0x25a3da) [0x7f36c01723da]
May 12 14:29:19 zpveo2 ceph-mds[2264075]:  3: (CDentry::~CDentry()+0x484) [0x564a70d7b2e4]
May 12 14:29:19 zpveo2 ceph-mds[2264075]:  4: (CDir::remove_dentry(CDentry*)+0x1f3) [0x564a70d87de3]
May 12 14:29:19 zpveo2 ceph-mds[2264075]:  5: (MDCache::trim_dentry(CDentry*, std::map<int, boost::intrusive_ptr<MCacheExpire>, std::less<int>, std::allocator<std::pair<int const, boost::intrusive_ptr<MCacheExpire> > > >&)+0x4c9) [0x564a70c39599]
May 12 14:29:19 zpveo2 ceph-mds[2264075]:  6: (MDCache::trim_lru(unsigned long, std::map<int, boost::intrusive_ptr<MCacheExpire>, std::less<int>, std::allocator<std::pair<int const, boost::intrusive_ptr<MCacheExpire> > > >&)+0x5d1) [0x564a70c3ab31]
May 12 14:29:19 zpveo2 ceph-mds[2264075]:  7: (MDCache::trim(unsigned long)+0xa5) [0x564a70c56935]
May 12 14:29:19 zpveo2 ceph-mds[2264075]:  8: (Migrator::export_finish(CDir*)+0x7df) [0x564a70d4e2df]
May 12 14:29:19 zpveo2 ceph-mds[2264075]:  9: (Migrator::export_logged_finish(CDir*)+0x79f) [0x564a70d4f05f]
May 12 14:29:19 zpveo2 ceph-mds[2264075]:  10: (MDSContext::complete(int)+0x5b) [0x564a70e4f3cb]
May 12 14:29:19 zpveo2 ceph-mds[2264075]:  11: (MDSIOContextBase::complete(int)+0x524) [0x564a70e4fb44]
May 12 14:29:19 zpveo2 ceph-mds[2264075]:  12: (MDSLogContextBase::complete(int)+0x41) [0x564a70e4fed1]
May 12 14:29:19 zpveo2 ceph-mds[2264075]:  13: (Finisher::finisher_thread_entry()+0x18d) [0x7f36c020a32d]
May 12 14:29:19 zpveo2 ceph-mds[2264075]:  14: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7f36bfe8aea7]
May 12 14:29:19 zpveo2 ceph-mds[2264075]:  15: clone()

1747055176317.png

Also, all nodes showed unknown status until I restarted the MDS and migrated it to another node.

1747055219993.png

This happens on up to 3 nodes, not only on the affected node zpveo2.

Code:
ceph fs status
zk8scephfso1 - 54 clients
============
RANK     STATE       MDS       ACTIVITY     DNS    INOS   DIRS   CAPS
 0    clientreplay  zpveo2                 60.6k  35.5k  2383   1263
 1       active     zpveo3  Reqs:   10 /s   195k   194k  46.2k  24.5k
         POOL            TYPE     USED  AVAIL
zk8scephfso1_metadata  metadata  7957M  1130G
  zk8scephfso1_data      data    5370G  1130G
STANDBY MDS
   zpveo4
   zpveo1
MDS version: ceph version 17.2.5 (e04241aa9b639588fa6c864845287d2824cb6b55) quincy (stable)

After restarting the MDS that had been stuck in clientreplay status for more than 10 minutes, Ceph came back online and all nodes show a green check. Any help will be appreciated.
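
In the meantime I'm trying to capture the assert with the crash module, in case it ends up being useful for a bug report:

Code:
# list recorded crashes, then dump the full backtrace of the relevant one
ceph crash ls
ceph crash info <crash-id>

# archive it after reporting so the health warning clears
ceph crash archive <crash-id>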
 

Attachments