I have a 3-node PVE 7.4-18 cluster running Ceph 15.2.17. There is one OSD per node, so pretty simple. I'm using 3 replicas, so the data should basically be mirrored across all OSDs in the cluster.
Everything has been running fine for months, but I've suddenly lost the ability to get my OSDs up and running.
The ceph-osd on each node keeps crashing on startup, and it looks like it's being killed by the Linux OOM killer:
[ 4530.421204] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=system-ceph\x2dosd.slice,mems_allowed=0,global_oom,task_memcg=/system.slice/system-ceph\x2dosd.slice/ceph-osd@4.service,task=ceph-osd,pid=37704,uid=64045
[ 4530.421315] Out of memory: Killed process 37704 (ceph-osd) total-vm:39459496kB, anon-rss:31373092kB, file-rss:756kB, shmem-rss:0kB, UID:64045 pgtables:76112kB oom_score_adj:0
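To put those numbers in more readable units, here is a minimal Python sketch that parses the kernel's "Out of memory" line and converts the kB fields to GiB. The helper names `parse_oom_kill` and `kb_to_gib` are made up for illustration, and the regex assumes the line format shown above.

```python
import re

# The OOM line from dmesg, copied from the log above (timestamp dropped).
OOM_LINE = (
    "Out of memory: Killed process 37704 (ceph-osd) "
    "total-vm:39459496kB, anon-rss:31373092kB, file-rss:756kB, "
    "shmem-rss:0kB, UID:64045 pgtables:76112kB oom_score_adj:0"
)


def parse_oom_kill(line):
    """Return pid, task name, and all '<name>:<n>kB' fields, or None if no match."""
    m = re.search(r"Killed process (\d+) \(([^)]+)\)", line)
    if m is None:
        return None
    # Only fields that end in 'kB' are collected (UID and oom_score_adj are skipped).
    fields = {name: int(value) for name, value in re.findall(r"([\w-]+):(\d+)kB", line)}
    return {"pid": int(m.group(1)), "task": m.group(2), "kb": fields}


def kb_to_gib(kb):
    """Convert kB (as reported by the kernel, 1 kB = 1024 bytes) to GiB."""
    return kb / (1024 * 1024)


if __name__ == "__main__":
    info = parse_oom_kill(OOM_LINE)
    for name, kb in info["kb"].items():
        print(f"{name}: {kb_to_gib(kb):.2f} GiB")
```

By this reading, the ceph-osd process had roughly 30 GiB of anonymous memory resident when it was killed.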
In the ceph-osd log, the last message before the crash is like this:
2025-01-07T14:35:42.066-0500 7f3bf2418d80 0 osd.4 36581 load_pgs
This seemed to come on suddenly and it's affecting all OSDs in the cluster. So I guess it must be something to do with the data, and I wonder if there's a way to recover it.
I found this article and wonder if I should go through this process, but wanted to find out if anyone had experienced anything similar.
https://www.croit.io/blog/how-to-solve-the-oom-killer-process-from-killing-your-osds
Happy to provide any other detail, but I'm not sure what else would be helpful.
TIA!