Ceph keeps crashing, but only on a single node

anaxagoras

I've been trying to figure this out for over a week and I'm getting nowhere. I have 3 machines with identical hardware, each with 3 enterprise NVMe drives: 2x 4 TB Samsung M.2 PM983 and 1x 8 TB Samsung U.2 PM983a (I think this is an OEM drive for Amazon).

For some reason PVE2 keeps getting Ceph crashes and errors. Earlier today I destroyed all of the OSDs on PVE2, destroyed the ceph-mon, ceph-mgr, and ceph-mds, deleted pve2 from the CRUSH map, did a secure erase on all the drives, and then re-added everything to the cluster. I don't see anything wrong in the SMART data, and nothing in the OS to indicate hardware or drive issues. I've run memtest86+ and various benchmarks on the machine as a stress test, but can't replicate any of the issues I've been having on PVE2.

Despite all of that, PVE2 had crashes of ceph-mon and ceph-mgr a couple of hours AFTER the recovery finished...

OSDs 0, 3 and 8 are all on PVE2. I'm not getting crashes on PVE3 or PVE4. (There is no PVE1; it's retired.)

Code:
root@pve2:/var/log/ceph# ceph crash ls
ID                                                                ENTITY    NEW
2025-03-08T09:02:04.422973Z_0a1ef0d0-a2e1-409b-b906-dbdf60e99b42  osd.3        
2025-04-06T12:21:39.334260Z_eac93866-64e1-42fd-8684-5e6bd6b78ed8  osd.3        
2025-04-30T22:22:23.606109Z_6f8a53d4-2a79-4a15-beeb-f4b47933a728  osd.8        
2025-05-01T23:54:28.590957Z_662ba25a-4d1f-40b0-8535-e5669cde47cb  osd.8        
2025-05-04T03:20:37.659471Z_10e04b3e-d68a-4c1b-bc49-853a795b44ba  osd.0        
2025-05-04T03:20:38.764210Z_10a46c7a-00fd-40d3-8700-911e1483d5ee  osd.8        
2025-05-04T06:17:33.229329Z_136203d3-690b-475f-a540-c4a9b330b771  osd.0        
2025-05-07T12:55:45.875588Z_4b98d3b1-7d78-498b-a5e9-988a60d568fd  mgr.pve2      
2025-05-08T22:59:06.340977Z_3977e900-d87b-45ea-872e-f737804f78e3  osd.0        
2025-05-09T00:40:51.579106Z_d849e41e-889e-485c-a8e2-8f7a6c6dd03e  mgr.pve2      
2025-05-09T03:30:05.519605Z_1cf6f13e-0de2-4faa-8856-e1b9ff313555  osd.8        
2025-05-09T15:18:02.068671Z_3ceff48b-e47e-4fa4-b90b-3beb6df219ac  mon.pve2      
2025-05-09T15:19:42.533056Z_d3d60b71-dec2-46be-8843-70e10202f576  mon.pve2      
2025-05-09T15:21:07.883912Z_bf737bea-f9df-4d21-9669-9279b8169c6e  mon.pve2      
2025-05-09T15:23:25.558031Z_fe4671ca-a628-41d1-a116-8dc7f5eff2db  mon.pve2      
2025-05-09T15:23:41.865995Z_4f6e50e9-f441-47d6-afdf-281e5917d19f  mon.pve2      
2025-05-09T15:23:55.375545Z_0c30f2bc-0861-4226-888c-5a51acd0bb91  mon.pve2      
2025-05-09T15:30:01.815767Z_b66aaf1d-781f-4195-bcdc-6434008c35f6  osd.0        
2025-05-10T00:20:16.062190Z_87482866-80ec-4ff9-99bb-3b5c9825eb0c  osd.0        
2025-05-10T05:17:52.227734Z_f71eec52-dc6c-4f0c-ae40-238841694c1b  osd.8        
2025-05-10T17:22:49.941053Z_4119cb0a-9443-4bff-8ee6-5aebac3a56db  osd.0        
2025-05-10T19:25:38.840688Z_ac59de83-5e5f-4341-964e-705baf3e8e0b  osd.8        
2025-05-11T10:22:16.202949Z_23e0457e-43b8-4620-85d3-e4b358e1b387  mon.pve2      
2025-05-11T23:34:38.100949Z_3d496dce-4f19-44f9-8112-ba2e5ab4ac5c  osd.8        
2025-05-14T01:28:48.230247Z_dfdf3317-1c63-496a-a2fe-fc218ab5e81f  osd.0        
2025-05-15T03:53:57.244401Z_45ca5c58-f23f-4453-b30c-19ce41716b86  mgr.pve2      
2025-05-15T08:46:37.737160Z_bcaf4a75-bfa3-459a-b07d-cba600308f52  osd.0        
2025-05-15T08:46:56.291875Z_b5214405-c5a5-4d14-82e5-63bcbf6465f7  osd.0        
2025-05-15T08:47:16.315073Z_750f8f25-8b2b-4f4f-8241-4e60b603772a  osd.0        
2025-05-15T08:47:37.188887Z_1efa19e5-0eed-4544-b56c-96be6ca2ac43  osd.0        
2025-05-15T09:28:15.079537Z_71d39be8-d494-4b60-8185-b15c89a51099  osd.0        
2025-05-15T09:28:35.087848Z_3d9128f1-21b3-4536-b471-98bacac9d35b  osd.0        
2025-05-15T09:28:55.474514Z_ec7455c3-1194-4284-8591-06e032256218  osd.0        
2025-05-15T09:33:11.154229Z_59dd26a9-9b88-49c0-95a2-7058da251494  osd.0        
2025-05-15T09:33:32.172022Z_fbb9eca4-0d31-4c42-8dda-293c8ddba991  osd.0        
2025-05-15T09:33:52.204986Z_92606717-45b4-44ba-a2b6-9a6457efd1f7  osd.0        
2025-05-15T09:35:07.614505Z_466e90ae-2166-441c-a6db-2e4dc4d2ed3b  osd.0        
2025-05-15T09:35:22.775400Z_f4c915c8-6ed8-42bc-929b-01ccbc38f2be  osd.0        
2025-05-15T09:35:43.260798Z_b818083e-3772-470f-8101-383fe5d59d71  osd.0        
2025-05-15T09:44:31.592511Z_c61766b5-5794-46dd-835d-0c243470e2f0  osd.0        
2025-05-15T09:44:52.002511Z_f7955f15-56c9-4ac1-a9ce-50f72e1e06d9  osd.0        
2025-05-15T09:45:08.332280Z_327c13fd-34ff-4ccd-93d6-f1881b042d69  osd.0        
2025-05-16T18:49:03.356772Z_211e8571-471e-44fd-93e4-9d5f84af145a  mon.pve2      
2025-05-16T19:02:08.462218Z_d1f5fd9e-1eca-4f0a-881b-33b971570f25  mon.pve2      
2025-05-17T04:50:50.408230Z_89a406f1-6f21-4586-97ca-d8ec9a262f0f  osd.8        
2025-05-17T14:47:31.097463Z_85173b19-3f5b-4699-a01a-54c3fcdcc60f  osd.3        
2025-05-18T15:04:22.321243Z_231c0c05-426d-4db0-aa95-62ebd5bd2b94  mgr.pve2      
2025-05-19T00:52:56.602501Z_e8882ff1-3d42-476e-bf7a-1b14078da96e  osd.3        
2025-05-19T02:07:45.252358Z_bee8ebde-4b7c-43f9-af13-fcc9d831de30  osd.8        
2025-05-19T04:11:09.443414Z_a6934bcf-6473-49cc-a65f-fb3d5d4156d9  mon.pve2      
2025-05-19T04:12:04.076445Z_b6746f4f-b3d8-4cce-a4dd-9ce251ef8a73  mon.pve2      
2025-05-19T04:15:38.313004Z_03415c9e-9890-4251-aa79-40db1ecbc420  mon.pve2      
2025-05-19T04:16:29.763155Z_29f7a38f-7eea-4cd3-8f72-147231e8eeef  mon.pve2      
2025-05-19T04:16:46.161168Z_b3ec4ac8-15a3-462d-bd1a-e25394ffe0c1  mon.pve2      
2025-05-19T04:17:09.074646Z_8189c791-b471-4c98-80d4-4c63fc6a5001  mon.pve2      
2025-05-19T08:22:19.989183Z_0a41b287-b7a3-4d5b-a3f5-2ef0bd53519f  mgr.pve2      
2025-05-19T15:45:30.822715Z_037ba8de-c0d9-4dc8-9870-948b3f7eeb9e  osd.0        
2025-05-19T20:18:55.358298Z_90b4ecaf-846a-42ee-9b13-8e2cb3533e77  osd.3        
2025-05-19T20:19:18.552775Z_ad18c3a4-493c-48f6-85e8-e7fc44062ab5  osd.3        
2025-05-20T15:23:33.755591Z_237d9241-8d67-4512-a1eb-2eb2f48c9fe5  mon.pve2      
2025-05-20T15:23:45.345998Z_93ca8888-7824-4de2-bb54-35ac4e66c276  mon.pve2      
2025-05-20T15:23:56.857746Z_0b1c47d2-9319-4ce9-9dfa-bcdafd26884d  mon.pve2      
2025-05-20T15:24:08.347289Z_74daf345-6d40-466b-88bd-a71126ddd6e5  mon.pve2      
2025-05-20T15:24:21.449721Z_7ac2e9f2-6f39-48ff-96b2-3d7c6f860633  mon.pve2      
2025-05-21T00:17:06.437811Z_5cbe3490-baf7-4ca6-bb93-9c9d7d4cdd52  mgr.pve2   *  
2025-05-21T01:40:27.574018Z_4da27986-0c78-472e-aab0-0c8951043693  mon.pve2   *
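A listing this long is easier to read as a per-daemon count. Below is a minimal Python sketch, assuming the column layout shown above (crash ID, entity, optional NEW marker). The function name `tally_crashes` is made up for illustration, and the sample is only a four-line excerpt of the listing, not the whole thing.

```python
from collections import Counter


def tally_crashes(listing: str) -> Counter:
    """Count crash entries per entity in `ceph crash ls` style output."""
    counts = Counter()
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and a pasted shell prompt:
        # real rows start with a crash ID of the form <UTC timestamp>Z_<uuid>.
        if len(fields) < 2 or "Z_" not in fields[0]:
            continue
        counts[fields[1]] += 1
    return counts


# Excerpt of the listing above (last rows only).
sample = """\
ID                                                                ENTITY    NEW
2025-05-19T20:19:18.552775Z_ad18c3a4-493c-48f6-85e8-e7fc44062ab5  osd.3
2025-05-20T15:24:21.449721Z_7ac2e9f2-6f39-48ff-96b2-3d7c6f860633  mon.pve2
2025-05-21T00:17:06.437811Z_5cbe3490-baf7-4ca6-bb93-9c9d7d4cdd52  mgr.pve2   *
2025-05-21T01:40:27.574018Z_4da27986-0c78-472e-aab0-0c8951043693  mon.pve2   *
"""

for entity, n in tally_crashes(sample).most_common():
    print(f"{entity}: {n}")
```

To run it against the real listing, the `sample` string would be replaced with the command's output read from stdin or a file.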
Code:
May 20 21:40:27 pve2 ceph-mon[2005]: *** Caught signal (Segmentation fault) **
May 20 21:40:27 pve2 ceph-mon[2005]:  in thread 78b0c46946c0 thread_name:rocksdb:low
May 20 21:40:27 pve2 ceph-mon[2005]:  ceph version 18.2.7 (4cac8341a72477c60a6f153f3ed344b49870c932) reef (stable)
May 20 21:40:27 pve2 ceph-mon[2005]:  1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x78b0c7bf5050]
May 20 21:40:27 pve2 ceph-mon[2005]:  2: (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_append(char const*, unsigned long)+0x1b) [0x634fedb6584b]
May 20 21:40:27 pve2 ceph-mon[2005]:  3: (rocksdb::BlockBuilder::AddWithLastKeyImpl(rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::Slice const*, unsigned long)+0x15a) [0x634fedafa61a]
May 20 21:40:27 pve2 ceph-mon[2005]:  4: (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::SubcompactionState*)+0x14c6) [0x634feda5f776]
May 20 21:40:27 pve2 ceph-mon[2005]:  5: (rocksdb::CompactionJob::Run()+0x338) [0x634feda61878]
May 20 21:40:27 pve2 ceph-mon[2005]:  6: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0xda7) [0x634fed751027]
May 20 21:40:27 pve2 ceph-mon[2005]:  7: (rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0x141) [0x634fed753261]
May 20 21:40:27 pve2 ceph-mon[2005]:  8: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x87) [0x634fed753b67]
May 20 21:40:27 pve2 ceph-mon[2005]:  9: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x539) [0x634fedb0b7b9]
May 20 21:40:27 pve2 ceph-mon[2005]:  10: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x64) [0x634fedb0bd64]
May 20 21:40:27 pve2 ceph-mon[2005]:  11: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd44a3) [0x78b0c7f6e4a3]
May 20 21:40:27 pve2 ceph-mon[2005]:  12: /lib/x86_64-linux-gnu/libc.so.6(+0x891f5) [0x78b0c7c421f5]
May 20 21:40:27 pve2 ceph-mon[2005]:  13: /lib/x86_64-linux-gnu/libc.so.6(+0x10989c) [0x78b0c7cc289c]
May 20 21:40:27 pve2 ceph-mon[2005]: 2025-05-20T21:40:27.572-0400 78b0c46946c0 -1 *** Caught signal (Segmentation fault) **
May 20 21:40:27 pve2 ceph-mon[2005]:  in thread 78b0c46946c0 thread_name:rocksdb:low
May 20 21:40:27 pve2 ceph-mon[2005]:  ceph version 18.2.7 (4cac8341a72477c60a6f153f3ed344b49870c932) reef (stable)
May 20 21:40:27 pve2 ceph-mon[2005]:  1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x78b0c7bf5050]
May 20 21:40:27 pve2 ceph-mon[2005]:  2: (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_append(char const*, unsigned long)+0x1b) [0x634fedb6584b]
May 20 21:40:27 pve2 ceph-mon[2005]:  3: (rocksdb::BlockBuilder::AddWithLastKeyImpl(rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::Slice const*, unsigned long)+0x15a) [0x634fedafa61a]
May 20 21:40:27 pve2 ceph-mon[2005]:  4: (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::SubcompactionState*)+0x14c6) [0x634feda5f776]
May 20 21:40:27 pve2 ceph-mon[2005]:  5: (rocksdb::CompactionJob::Run()+0x338) [0x634feda61878]
May 20 21:40:27 pve2 ceph-mon[2005]:  6: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0xda7) [0x634fed751027]
May 20 21:40:27 pve2 ceph-mon[2005]:  7: (rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0x141) [0x634fed753261]
May 20 21:40:27 pve2 ceph-mon[2005]:  8: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x87) [0x634fed753b67]
May 20 21:40:27 pve2 ceph-mon[2005]:  9: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x539) [0x634fedb0b7b9]
May 20 21:40:27 pve2 ceph-mon[2005]:  10: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x64) [0x634fedb0bd64]
May 20 21:40:27 pve2 ceph-mon[2005]:  11: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd44a3) [0x78b0c7f6e4a3]
May 20 21:40:27 pve2 ceph-mon[2005]:  12: /lib/x86_64-linux-gnu/libc.so.6(+0x891f5) [0x78b0c7c421f5]
May 20 21:40:27 pve2 ceph-mon[2005]:  13: /lib/x86_64-linux-gnu/libc.so.6(+0x10989c) [0x78b0c7cc289c]
May 20 21:40:27 pve2 ceph-mon[2005]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
May 20 21:40:27 pve2 ceph-mon[2005]:      0> 2025-05-20T21:40:27.572-0400 78b0c46946c0 -1 *** Caught signal (Segmentation fault) **
May 20 21:40:27 pve2 ceph-mon[2005]:  in thread 78b0c46946c0 thread_name:rocksdb:low
May 20 21:40:27 pve2 ceph-mon[2005]:  ceph version 18.2.7 (4cac8341a72477c60a6f153f3ed344b49870c932) reef (stable)
May 20 21:40:27 pve2 ceph-mon[2005]:  1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x78b0c7bf5050]
May 20 21:40:27 pve2 ceph-mon[2005]:  2: (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_append(char const*, unsigned long)+0x1b) [0x634fedb6584b]
May 20 21:40:27 pve2 ceph-mon[2005]:  3: (rocksdb::BlockBuilder::AddWithLastKeyImpl(rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::Slice const*, unsigned long)+0x15a) [0x634fedafa61a]
May 20 21:40:27 pve2 ceph-mon[2005]:  4: (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::SubcompactionState*)+0x14c6) [0x634feda5f776]
May 20 21:40:27 pve2 ceph-mon[2005]:  5: (rocksdb::CompactionJob::Run()+0x338) [0x634feda61878]
May 20 21:40:27 pve2 ceph-mon[2005]:  6: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0xda7) [0x634fed751027]
May 20 21:40:27 pve2 ceph-mon[2005]:  7: (rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0x141) [0x634fed753261]
May 20 21:40:27 pve2 ceph-mon[2005]:  8: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x87) [0x634fed753b67]
May 20 21:40:27 pve2 ceph-mon[2005]:  9: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x539) [0x634fedb0b7b9]
May 20 21:40:27 pve2 ceph-mon[2005]:  10: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x64) [0x634fedb0bd64]
May 20 21:40:27 pve2 ceph-mon[2005]:  11: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd44a3) [0x78b0c7f6e4a3]
May 20 21:40:27 pve2 ceph-mon[2005]:  12: /lib/x86_64-linux-gnu/libc.so.6(+0x891f5) [0x78b0c7c421f5]
May 20 21:40:27 pve2 ceph-mon[2005]:  13: /lib/x86_64-linux-gnu/libc.so.6(+0x10989c) [0x78b0c7cc289c]
May 20 21:40:27 pve2 ceph-mon[2005]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
May 20 21:40:27 pve2 ceph-mon[2005]:      0> 2025-05-20T21:40:27.572-0400 78b0c46946c0 -1 *** Caught signal (Segmentation fault) **
May 20 21:40:27 pve2 ceph-mon[2005]:  in thread 78b0c46946c0 thread_name:rocksdb:low
May 20 21:40:27 pve2 ceph-mon[2005]:  ceph version 18.2.7 (4cac8341a72477c60a6f153f3ed344b49870c932) reef (stable)
May 20 21:40:27 pve2 ceph-mon[2005]:  1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x78b0c7bf5050]
May 20 21:40:27 pve2 ceph-mon[2005]:  2: (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_append(char const*, unsigned long)+0x1b) [0x634fedb6584b]
May 20 21:40:27 pve2 ceph-mon[2005]:  3: (rocksdb::BlockBuilder::AddWithLastKeyImpl(rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::Slice const*, unsigned long)+0x15a) [0x634fedafa61a]
May 20 21:40:27 pve2 ceph-mon[2005]:  4: (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::SubcompactionState*)+0x14c6) [0x634feda5f776]
May 20 21:40:27 pve2 ceph-mon[2005]:  5: (rocksdb::CompactionJob::Run()+0x338) [0x634feda61878]
May 20 21:40:27 pve2 ceph-mon[2005]:  6: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0xda7) [0x634fed751027]
May 20 21:40:27 pve2 ceph-mon[2005]:  7: (rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0x141) [0x634fed753261]
May 20 21:40:27 pve2 ceph-mon[2005]:  8: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x87) [0x634fed753b67]
May 20 21:40:27 pve2 ceph-mon[2005]:  9: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x539) [0x634fedb0b7b9]
May 20 21:40:27 pve2 ceph-mon[2005]:  10: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x64) [0x634fedb0bd64]
May 20 21:40:27 pve2 ceph-mon[2005]:  11: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd44a3) [0x78b0c7f6e4a3]
May 20 21:40:27 pve2 ceph-mon[2005]:  12: /lib/x86_64-linux-gnu/libc.so.6(+0x891f5) [0x78b0c7c421f5]
May 20 21:40:27 pve2 ceph-mon[2005]:  13: /lib/x86_64-linux-gnu/libc.so.6(+0x10989c) [0x78b0c7cc289c]
May 20 21:40:27 pve2 ceph-mon[2005]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
May 20 21:40:27 pve2 systemd[1]: ceph-mon@pve2.service: Main process exited, code=killed, status=11/SEGV
May 20 21:40:27 pve2 systemd[1]: ceph-mon@pve2.service: Failed with result 'signal'.
May 20 21:40:27 pve2 systemd[1]: ceph-mon@pve2.service: Consumed 1min 5.580s CPU time.
May 20 21:40:37 pve2 systemd[1]: ceph-mon@pve2.service: Scheduled restart job, restart counter is at 1.
May 20 21:40:37 pve2 systemd[1]: Stopped ceph-mon@pve2.service - Ceph cluster monitor daemon.
May 20 21:40:37 pve2 systemd[1]: ceph-mon@pve2.service: Consumed 1min 5.580s CPU time.
May 20 21:40:37 pve2 systemd[1]: Started ceph-mon@pve2.service - Ceph cluster monitor daemon.
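The crash IDs in `ceph crash ls` carry UTC timestamps (the trailing `Z`), while the journal lines are in local time (the `-0400` in the log). The sketch below lines the two up. It assumes a fixed UTC-4 offset taken from these logs, and `crash_id_to_journal_time` is a hypothetical helper name, not a Ceph tool.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the node logs in UTC-4, matching the "-0400" in the journal lines.
LOCAL_TZ = timezone(timedelta(hours=-4))


def crash_id_to_journal_time(crash_id: str) -> str:
    """Convert the UTC timestamp at the front of a crash ID to journal-style local time."""
    stamp = crash_id.split("_", 1)[0]  # e.g. 2025-05-21T01:40:27.574018Z
    utc = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
    return utc.astimezone(LOCAL_TZ).strftime("%b %d %H:%M:%S")


# The last mon.pve2 entry from the crash list.
print(crash_id_to_journal_time(
    "2025-05-21T01:40:27.574018Z_4da27986-0c78-472e-aab0-0c8951043693"
))
```

With the default C locale this gives `May 20 21:40:27`, the same second as the ceph-mon segfault in the journal excerpt above.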
 
Code:
May 20 20:17:06 pve2 ceph-mgr[2004]: *** Caught signal (Segmentation fault) **
May 20 20:17:06 pve2 ceph-mgr[2004]:  in thread 7cb342dad6c0 thread_name:msgr-worker-0
May 20 20:17:06 pve2 ceph-mgr[2004]:  ceph version 18.2.7 (4cac8341a72477c60a6f153f3ed344b49870c932) reef (stable)
May 20 20:17:06 pve2 ceph-mgr[2004]:  1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x7cb3452e5050]
May 20 20:17:06 pve2 ceph-mgr[2004]:  2: (MgrMap::ModuleOption::encode(ceph::buffer::v15_2_0::list&) const+0x130) [0x61f92a833dc0]
May 20 20:17:06 pve2 ceph-mgr[2004]:  3: (MMgrBeacon::encode_payload(unsigned long)+0x614) [0x61f92a838874]
May 20 20:17:06 pve2 ceph-mgr[2004]:  4: (Message::encode(unsigned long, int, bool)+0x2a) [0x7cb345cfbbaa]
May 20 20:17:06 pve2 ceph-mgr[2004]:  5: (ProtocolV2::write_event()+0x276) [0x7cb345de6e56]
May 20 20:17:06 pve2 ceph-mgr[2004]:  6: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x5d4) [0x7cb345df53f4]
May 20 20:17:06 pve2 ceph-mgr[2004]:  7: /usr/lib/ceph/libceph-common.so.2(+0x635a21) [0x7cb345dfba21]
May 20 20:17:06 pve2 ceph-mgr[2004]:  8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd44a3) [0x7cb34565e4a3]
May 20 20:17:06 pve2 ceph-mgr[2004]:  9: /lib/x86_64-linux-gnu/libc.so.6(+0x891f5) [0x7cb3453321f5]
May 20 20:17:06 pve2 ceph-mgr[2004]:  10: /lib/x86_64-linux-gnu/libc.so.6(+0x10989c) [0x7cb3453b289c]
May 20 20:17:06 pve2 ceph-mgr[2004]: 2025-05-20T20:17:06.437-0400 7cb342dad6c0 -1 *** Caught signal (Segmentation fault) **
May 20 20:17:06 pve2 ceph-mgr[2004]:  in thread 7cb342dad6c0 thread_name:msgr-worker-0
May 20 20:17:06 pve2 ceph-mgr[2004]:  ceph version 18.2.7 (4cac8341a72477c60a6f153f3ed344b49870c932) reef (stable)
May 20 20:17:06 pve2 ceph-mgr[2004]:  1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x7cb3452e5050]
May 20 20:17:06 pve2 ceph-mgr[2004]:  2: (MgrMap::ModuleOption::encode(ceph::buffer::v15_2_0::list&) const+0x130) [0x61f92a833dc0]
May 20 20:17:06 pve2 ceph-mgr[2004]:  3: (MMgrBeacon::encode_payload(unsigned long)+0x614) [0x61f92a838874]
May 20 20:17:06 pve2 ceph-mgr[2004]:  4: (Message::encode(unsigned long, int, bool)+0x2a) [0x7cb345cfbbaa]
May 20 20:17:06 pve2 ceph-mgr[2004]:  5: (ProtocolV2::write_event()+0x276) [0x7cb345de6e56]
May 20 20:17:06 pve2 ceph-mgr[2004]:  6: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x5d4) [0x7cb345df53f4]
May 20 20:17:06 pve2 ceph-mgr[2004]:  7: /usr/lib/ceph/libceph-common.so.2(+0x635a21) [0x7cb345dfba21]
May 20 20:17:06 pve2 ceph-mgr[2004]:  8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd44a3) [0x7cb34565e4a3]
May 20 20:17:06 pve2 ceph-mgr[2004]:  9: /lib/x86_64-linux-gnu/libc.so.6(+0x891f5) [0x7cb3453321f5]
May 20 20:17:06 pve2 ceph-mgr[2004]:  10: /lib/x86_64-linux-gnu/libc.so.6(+0x10989c) [0x7cb3453b289c]
May 20 20:17:06 pve2 ceph-mgr[2004]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
May 20 20:17:06 pve2 ceph-mgr[2004]:      0> 2025-05-20T20:17:06.437-0400 7cb342dad6c0 -1 *** Caught signal (Segmentation fault) **
May 20 20:17:06 pve2 ceph-mgr[2004]:  in thread 7cb342dad6c0 thread_name:msgr-worker-0
May 20 20:17:06 pve2 ceph-mgr[2004]:  ceph version 18.2.7 (4cac8341a72477c60a6f153f3ed344b49870c932) reef (stable)
May 20 20:17:06 pve2 ceph-mgr[2004]:  1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x7cb3452e5050]
May 20 20:17:06 pve2 ceph-mgr[2004]:  2: (MgrMap::ModuleOption::encode(ceph::buffer::v15_2_0::list&) const+0x130) [0x61f92a833dc0]
May 20 20:17:06 pve2 ceph-mgr[2004]:  3: (MMgrBeacon::encode_payload(unsigned long)+0x614) [0x61f92a838874]
May 20 20:17:06 pve2 ceph-mgr[2004]:  4: (Message::encode(unsigned long, int, bool)+0x2a) [0x7cb345cfbbaa]
May 20 20:17:06 pve2 ceph-mgr[2004]:  5: (ProtocolV2::write_event()+0x276) [0x7cb345de6e56]
May 20 20:17:06 pve2 ceph-mgr[2004]:  6: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x5d4) [0x7cb345df53f4]
May 20 20:17:06 pve2 ceph-mgr[2004]:  7: /usr/lib/ceph/libceph-common.so.2(+0x635a21) [0x7cb345dfba21]
May 20 20:17:06 pve2 ceph-mgr[2004]:  8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd44a3) [0x7cb34565e4a3]
May 20 20:17:06 pve2 ceph-mgr[2004]:  9: /lib/x86_64-linux-gnu/libc.so.6(+0x891f5) [0x7cb3453321f5]
May 20 20:17:06 pve2 ceph-mgr[2004]:  10: /lib/x86_64-linux-gnu/libc.so.6(+0x10989c) [0x7cb3453b289c]
May 20 20:17:06 pve2 ceph-mgr[2004]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
May 20 20:17:06 pve2 ceph-mgr[2004]:      0> 2025-05-20T20:17:06.437-0400 7cb342dad6c0 -1 *** Caught signal (Segmentation fault) **
May 20 20:17:06 pve2 ceph-mgr[2004]:  in thread 7cb342dad6c0 thread_name:msgr-worker-0
May 20 20:17:06 pve2 ceph-mgr[2004]:  ceph version 18.2.7 (4cac8341a72477c60a6f153f3ed344b49870c932) reef (stable)
May 20 20:17:06 pve2 ceph-mgr[2004]:  1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x7cb3452e5050]
May 20 20:17:06 pve2 ceph-mgr[2004]:  2: (MgrMap::ModuleOption::encode(ceph::buffer::v15_2_0::list&) const+0x130) [0x61f92a833dc0]
May 20 20:17:06 pve2 ceph-mgr[2004]:  3: (MMgrBeacon::encode_payload(unsigned long)+0x614) [0x61f92a838874]
May 20 20:17:06 pve2 ceph-mgr[2004]:  4: (Message::encode(unsigned long, int, bool)+0x2a) [0x7cb345cfbbaa]
May 20 20:17:06 pve2 ceph-mgr[2004]:  5: (ProtocolV2::write_event()+0x276) [0x7cb345de6e56]
May 20 20:17:06 pve2 ceph-mgr[2004]:  6: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x5d4) [0x7cb345df53f4]
May 20 20:17:06 pve2 ceph-mgr[2004]:  7: /usr/lib/ceph/libceph-common.so.2(+0x635a21) [0x7cb345dfba21]
May 20 20:17:06 pve2 ceph-mgr[2004]:  8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd44a3) [0x7cb34565e4a3]
May 20 20:17:06 pve2 ceph-mgr[2004]:  9: /lib/x86_64-linux-gnu/libc.so.6(+0x891f5) [0x7cb3453321f5]
May 20 20:17:06 pve2 ceph-mgr[2004]:  10: /lib/x86_64-linux-gnu/libc.so.6(+0x10989c) [0x7cb3453b289c]
May 20 20:17:06 pve2 ceph-mgr[2004]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
May 20 20:17:06 pve2 systemd[1]: ceph-mgr@pve2.service: Main process exited, code=killed, status=11/SEGV
May 20 20:17:06 pve2 systemd[1]: ceph-mgr@pve2.service: Failed with result 'signal'.
May 20 20:17:06 pve2 systemd[1]: ceph-mgr@pve2.service: Consumed 5.825s CPU time.
May 20 20:17:16 pve2 systemd[1]: ceph-mgr@pve2.service: Scheduled restart job, restart counter is at 1.
May 20 20:17:16 pve2 systemd[1]: Stopped ceph-mgr@pve2.service - Ceph cluster manager daemon.
May 20 20:17:16 pve2 systemd[1]: ceph-mgr@pve2.service: Consumed 5.825s CPU time.
May 20 20:17:16 pve2 systemd[1]: Started ceph-mgr@pve2.service - Ceph cluster manager daemon.
May 20 20:17:16 pve2 ceph-mgr[128778]: 2025-05-20T20:17:16.835-0400 7b4792c3d380 -1 mgr[py] Module osd_support has missing NOTIFY_TYPES member
May 20 20:17:16 pve2 ceph-mgr[128778]: 2025-05-20T20:17:16.980-0400 7b4792c3d380 -1 mgr[py] Module balancer has missing NOTIFY_TYPES member
May 20 20:17:17 pve2 ceph-mgr[128778]: 2025-05-20T20:17:17.155-0400 7b4792c3d380 -1 mgr[py] Module volumes has missing NOTIFY_TYPES member
May 20 20:17:17 pve2 ceph-mgr[128778]: 2025-05-20T20:17:17.234-0400 7b4792c3d380 -1 mgr[py] Module telemetry has missing NOTIFY_TYPES member
May 20 20:17:17 pve2 ceph-mgr[128778]: 2025-05-20T20:17:17.272-0400 7b4792c3d380 -1 mgr[py] Module progress has missing NOTIFY_TYPES member
May 20 20:17:17 pve2 ceph-mgr[128778]: 2025-05-20T20:17:17.311-0400 7b4792c3d380 -1 mgr[py] Module alerts has missing NOTIFY_TYPES member
May 20 20:17:17 pve2 ceph-mgr[128778]: 2025-05-20T20:17:17.347-0400 7b4792c3d380 -1 mgr[py] Module selftest has missing NOTIFY_TYPES member
May 20 20:17:17 pve2 ceph-mgr[128778]: 2025-05-20T20:17:17.387-0400 7b4792c3d380 -1 mgr[py] Module snap_schedule has missing NOTIFY_TYPES member
May 20 20:17:17 pve2 ceph-mgr[128778]: 2025-05-20T20:17:17.471-0400 7b4792c3d380 -1 mgr[py] Module orchestrator has missing NOTIFY_TYPES member
May 20 20:17:17 pve2 ceph-mgr[128778]: 2025-05-20T20:17:17.547-0400 7b4792c3d380 -1 mgr[py] Module influx has missing NOTIFY_TYPES member
May 20 20:17:17 pve2 ceph-mgr[128778]: 2025-05-20T20:17:17.582-0400 7b4792c3d380 -1 mgr[py] Module telegraf has missing NOTIFY_TYPES member
May 20 20:17:17 pve2 ceph-mgr[128778]: 2025-05-20T20:17:17.674-0400 7b4792c3d380 -1 mgr[py] Module nfs has missing NOTIFY_TYPES member
May 20 20:17:17 pve2 ceph-mgr[128778]: 2025-05-20T20:17:17.822-0400 7b4792c3d380 -1 mgr[py] Module zabbix has missing NOTIFY_TYPES member
May 20 20:17:17 pve2 ceph-mgr[128778]: 2025-05-20T20:17:17.869-0400 7b4792c3d380 -1 mgr[py] Module pg_autoscaler has missing NOTIFY_TYPES member
May 20 20:17:17 pve2 ceph-mgr[128778]: 2025-05-20T20:17:17.905-0400 7b4792c3d380 -1 mgr[py] Module devicehealth has missing NOTIFY_TYPES member
May 20 20:17:17 pve2 ceph-mgr[128778]: 2025-05-20T20:17:17.951-0400 7b4792c3d380 -1 mgr[py] Module osd_perf_query has missing NOTIFY_TYPES member
May 20 20:17:18 pve2 ceph-mgr[128778]: 2025-05-20T20:17:18.043-0400 7b4792c3d380 -1 mgr[py] Module test_orchestrator has missing NOTIFY_TYPES member
May 20 20:17:18 pve2 ceph-mgr[128778]: 2025-05-20T20:17:18.093-0400 7b4792c3d380 -1 mgr[py] Module crash has missing NOTIFY_TYPES member
May 20 20:17:18 pve2 ceph-mgr[128778]: 2025-05-20T20:17:18.144-0400 7b4792c3d380 -1 mgr[py] Module status has missing NOTIFY_TYPES member
May 20 20:17:18 pve2 ceph-mgr[128778]: 2025-05-20T20:17:18.231-0400 7b4792c3d380 -1 mgr[py] Module rbd_support has missing NOTIFY_TYPES member
May 20 20:17:18 pve2 ceph-mgr[128778]: 2025-05-20T20:17:18.266-0400 7b4792c3d380 -1 mgr[py] Module iostat has missing NOTIFY_TYPES member
May 20 20:17:18 pve2 ceph-mgr[128778]: 2025-05-20T20:17:18.413-0400 7b4792c3d380 -1 mgr[py] Module prometheus has missing NOTIFY_TYPES member
 
Hmm, both ceph-mon and ceph-mgr crashing is really strange. I'm not aware of a Ceph bug in 18.2.7.

Maybe ask on the Ceph mailing list, but to me it looks like a hardware problem (RAM or maybe CPU).
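One rough way to check this from the logs is to see whether the segfaults share a signature. Crashes landing in unrelated daemons and code paths on a single host are a common hint (not proof) of a hardware fault rather than one software bug. The sketch below pulls a short signature out of journal lines in the format pasted in this thread. The helper name `crash_signature` and both regexes are my own assumptions about that format, not part of Ceph.

```python
import re

# Matches e.g. "ceph-mgr[2004]:  in thread 7cb342dad6c0 thread_name:msgr-worker-0"
THREAD_RE = re.compile(r"(ceph-\w+)\[\d+\]:\s+in thread \w+ thread_name:(\S+)")
# Matches symbolized frames such as "]:  2: (MgrMap::ModuleOption::encode(...";
# raw library frames like "1: /lib/.../libc.so.6(+0x3c050)" are skipped.
FRAME_RE = re.compile(r"\]:\s+\d+: \((.+?)\(")


def crash_signature(lines):
    """Return (daemon, thread_name, first symbolized frame) for one backtrace."""
    daemon = thread = frame = None
    for line in lines:
        m = THREAD_RE.search(line)
        if m and thread is None:
            daemon, thread = m.group(1), m.group(2)
        m = FRAME_RE.search(line)
        if m and frame is None:
            frame = m.group(1)
    return daemon, thread, frame


# Excerpts copied from the two journal dumps in this thread.
mgr_lines = [
    "May 20 20:17:06 pve2 ceph-mgr[2004]:  in thread 7cb342dad6c0 thread_name:msgr-worker-0",
    "May 20 20:17:06 pve2 ceph-mgr[2004]:  1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x7cb3452e5050]",
    "May 20 20:17:06 pve2 ceph-mgr[2004]:  2: (MgrMap::ModuleOption::encode(ceph::buffer::v15_2_0::list&) const+0x130) [0x61f92a833dc0]",
]
mon_lines = [
    "May 20 21:40:27 pve2 ceph-mon[2005]:  in thread 78b0c46946c0 thread_name:rocksdb:low",
    "May 20 21:40:27 pve2 ceph-mon[2005]:  1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x78b0c7bf5050]",
    "May 20 21:40:27 pve2 ceph-mon[2005]:  2: (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_append(char const*, unsigned long)+0x1b) [0x634fedb6584b]",
]

print(crash_signature(mgr_lines))
print(crash_signature(mon_lines))
```

For these two dumps the signatures differ in daemon, thread, and top frame (message encoding in the mgr, a string append during RocksDB compaction in the mon), which fits the hardware suspicion but does not prove it.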

I was initially thinking hardware issue, but after running memtest86 and various CPU benchmarks, nothing weird happened. I've also run various storage benchmarks to stress test Ceph. I have another identical node without storage, with new RAM and a new CPU, so I'm going to try swapping the hardware. Unfortunately I'm going to be out of the country for the next week, so I'll just have to hope for the best and follow up with the Ceph mailing list then.
 
I think you were correct about the hardware. It hasn't crashed since I swapped out the entire system about 50 hours ago; before that I was getting daily Ceph crashes and hard lockups on the system...
 