Hello, I have a 3-node Proxmox cluster using Ceph as the storage backend. I've also deployed a CephFS filesystem backed by this same Ceph cluster. The filesystem recently became unavailable because any MDS daemon that attempts to take the active rank for this filesystem segfaults during journal replay.
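For completeness, the filesystem/MDS state and the recorded crash reports can be pulled with the standard Ceph tooling, which is roughly what I've been looking at (the crash id below is a placeholder):

ceph -s
ceph fs status
ceph mds stat
ceph crash ls                 # lists recorded daemon crashes
ceph crash info <crash-id>    # full backtrace for one of them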
Log from crashing MDS:
-8> 2023-05-26T11:57:01.469-0500 7f6559f16700 1 mds.0.journaler.mdlog(ro) _finish_probe_end write_pos = 2388235687 (header had 2388213543). recovered.
-7> 2023-05-26T11:57:01.469-0500 7f6559715700 4 mds.0.log Journal 0x200 recovered.
-6> 2023-05-26T11:57:01.469-0500 7f6559715700 4 mds.0.log Recovered journal 0x200 in format 1
-5> 2023-05-26T11:57:01.469-0500 7f6559715700 2 mds.0.6319 Booting: 1: loading/discovering base inodes
-4> 2023-05-26T11:57:01.469-0500 7f6559715700 0 mds.0.cache creating system inode with ino:0x100
-3> 2023-05-26T11:57:01.469-0500 7f6559715700 0 mds.0.cache creating system inode with ino:0x1
-2> 2023-05-26T11:57:01.473-0500 7f6559f16700 2 mds.0.6319 Booting: 2: replaying mds log
-1> 2023-05-26T11:57:01.473-0500 7f6559f16700 2 mds.0.6319 Booting: 2: waiting for purge queue recovered
0> 2023-05-26T11:57:01.501-0500 7f6558713700 -1 *** Caught signal (Segmentation fault) **
in thread 7f6558713700 thread_name:md_log_replay
ceph version 17.2.6 (995dec2cdae920da21db2d455e55efbc339bde24) quincy (stable)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x13140) [0x7f6565170140]
2: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x66c2) [0x5565fca37372]
3: (EUpdate::replay(MDSRank*)+0x3c) [0x5565fca38abc]
4: (MDLog::_replay_thread()+0x7cb) [0x5565fc9bd0fb]
5: (MDLog::ReplayThread::entry()+0xd) [0x5565fc68fbfd]
6: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7f6565164ea7]
7: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
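From the backtrace the crash happens inside EMetaBlob::replay(), so I assume one of the journal events is damaged or otherwise unexpected during replay. Before touching anything destructive, my plan (based on the CephFS disaster-recovery docs, and assuming the filesystem is named "cephfs", rank 0 is the crashing rank, and the export path is arbitrary) is to take a read-only look at the journal and export a backup of it:

cephfs-journal-tool --rank=cephfs:0 journal inspect      # read-only integrity check of the journal
cephfs-journal-tool --rank=cephfs:0 event get summary    # summarize the journal events
cephfs-journal-tool --rank=cephfs:0 journal export /root/mds0-journal-backup.bin   # keep a copy before any destructive step

I'd rather not reach for "journal reset" until I understand what is actually wrong, since that discards any metadata updates that have not yet been flushed to the backing pools.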