I was messing with my IPv6 Ceph setup (moving it from one IPv6 subnet to another).
Some time later CephFS failed, with dmesg flooded with corrupt OSD errors like this. I really thought I had corrupted the filesystem... I hadn't.
Code:
[ 702.716292] mdsmap: 000004b0: 00 00 01 00 00 00 00 00 00 00 ad 0c 00 00 01 00 ................
[ 702.716293] mdsmap: 000004c0: 00 00 00 00 00 00 59 f3 bf 02 00 00 00 00 00 00 ......Y.........
[ 702.716295] mdsmap: 000004d0: 00 00 00 00 00 00 e7 6d 00 00 00 00 00 01 0e 00 .......m........
[ 702.716296] mdsmap: 000004e0: 00 00 49 53 4f 73 2d 54 65 6d 70 6c 61 74 65 73 ..ISOs-Templates
[ 702.716297] mdsmap: 000004f0: 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................
[ 702.716298] mdsmap: 00000500: 00 00 00 00 00 02 00 00 00 2d 31 00 00 01 00 00 .........-1.....
[ 702.716300] mdsmap: 00000510: 00 00 00 59 f3 bf 02 00 00 00 00 01 00 00 00 59 ...Y...........Y
[ 702.716301] mdsmap: 00000520: f3 bf 02 00 00 00 00 .......
[ 702.716303] ceph: [5e55fd50-d135-413d-bffe-9d0fae0ef5fa 46134264]: error decoding mdsmap -22. Shutting down mount.
[ 702.716306] header: 00000000: 07 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 702.716308] header: 00000010: 15 00 c4 00 02 00 43 05 00 00 00 00 00 00 00 00 ......C.........
[ 702.716309] header: 00000020: 00 00 00 00 01 00 00 00 00 00 00 00 00 01 00 00 ................
[ 702.716311] header: 00000030: 00 fe 4d 85 46 ..M.F
[ 702.716312] front: 00000000: 5e 55 fd 50 d1 35 41 3d bf fe 9d 0f ae 0e f5 fa ^U.P.5A=........
[ 702.716314] front: 00000010: ad 0c 00 00 27 05 00 00 05 04 21 05 00 00 ad 0c ....'.....!.....
[ 702.716315] front: 00000020: 00 00 12 00 00 00 00 00 00 00 00 00 00 00 3c 00 ..............<.
[ 702.716316] front: 00000030: 00 00 2c 01 00 00 00 00 00 00 00 01 00 00 01 00 ..,.............
[ 702.716318] front: 00000040: 00 00 01 00 00 00 59 f3 bf 02 00 00 00 00 0a 04 ......Y.........
[ 702.716319] front: 00000050: 9a 02 00 00 59 f3 bf 02 00 00 00 00 06 00 00 00 ....Y...........
With the help of ChatGPT:
- I narrowed down the place where the errors started (dmesg was so full it couldn't show that far back). journalctl -b was the key, and I found this
- with these lines being the key:
libceph: mon0 (1)[fc00:81::1]:6789 session established
Apr 26 21:28:40 pve1 kernel: libceph: another match of type 1 in addrvec
Apr 26 21:28:40 pve1 kernel: libceph: corrupt full osdmap (-22) epoch 28158 off 3549 (0000000085bec616 of 0000000051a0555c-00000000a0d640c9)
Apr 26 21:28:39 pve1 chronyd[1322]: System clock TAI offset set to 37 seconds
Apr 26 21:28:40 pve1 sh[1540]: Running command: /usr/sbin/ceph-volume lvm trigger 0-1ffc7a4c-2603-4a29-ab70-8ac78d7d1e3e
Apr 26 21:28:40 pve1 sh[1544]: Running command: /usr/sbin/ceph-volume lvm trigger 0-e0c3acf4-83b2-4264-96e7-279b3dbb4fe9
Apr 26 21:28:40 pve1 sh[1547]: Running command: /usr/sbin/ceph-volume lvm trigger 0-ea4a641e-3606-486a-9862-9838834128cf
Apr 26 21:28:40 pve1 ceph-osd[1894]: 2025-04-26T21:28:40.151-0700 79ef4485b880 -1 osd.1 28154 log_to_monitors true
Apr 26 21:28:40 pve1 systemd[1]: Mounting mnt-pve-ISOs\x2dTemplates.mount - /mnt/pve/ISOs-Templates...
Apr 26 21:28:40 pve1 kernel: netfs: FS-Cache loaded
Apr 26 21:28:40 pve1 kernel: Key type ceph registered
Apr 26 21:28:40 pve1 kernel: libceph: loaded (mon/osd proto 15/24)
Apr 26 21:28:40 pve1 kernel: ceph: loaded (mds proto 32)
Apr 26 21:28:40 pve1 kernel: libceph: mon0 (1)[fc00:81::1]:6789 session established
Apr 26 21:28:40 pve1 kernel: libceph: another match of type 1 in addrvec
Apr 26 21:28:40 pve1 kernel: libceph: corrupt full osdmap (-22) epoch 28158 off 3549 (0000000085bec616 of 0000000051a0555c-00000000a0d640c9)
Apr 26 21:28:40 pve1 kernel: osdmap: 00000000: 08 07 a3 3e 00 00 09 01 d0 16 00 00 5e 55 fd 50 ...>........^U.P
Apr 26 21:28:40 pve1 kernel: osdmap: 00000010: d1 35 41 3d bf fe 9d 0f ae 0e f5 fa fe 6d 00 00 .5A=.........m..
Apr 26 21:28:40 pve1 kernel: osdmap: 00000020: ad 2a e2 64 3d b9 9b 02 72 b2 0d 68 23 b0 e9 23 .*.d=...r..h#..#
Apr 26 21:28:40 pve1 kernel: osdmap: 00000030: 08 00 00 00 0e 00 00 00 00 00 00 00 1d 05 64 01 ..............d.
Apr 26 21:28:40 pve1 kernel: osdmap: 00000040: 00 00 01 03 00 02 01 00 00 00 01 00 00 00 00 00 ................
Apr 26 21:28:40 pve1 kernel: osdmap: 00000050: 00 00 00 00 00 00 dc 0a 00 00 00 00 00 00 00 00 ................
Apr 26 21:28:40 pve1 kernel: osdmap: 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Apr 26 21:28:40 pve1 kernel: osdmap: 00000070: 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 ................
Apr 26 21:28:40 pve1 kernel: osdmap: 00000080: 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Apr 26 21:28:40 pve1 kernel: osdmap: 00000090: 00 00 00 00 00 00 00 ff ff ff ff ff ff ff ff 00 ................
Apr 26 21:28:40 pve1 kernel: osdmap: 000000a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
Apr 26 21:28:40 pve1 kernel: osdmap: 000000b0: 00 00 00 00 01 01 01 00 00 00 00 00 00 00 00 00 ................
Apr 26 21:28:40 pve1 kernel: osdmap: 000000c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Apr 26 21:28:40 pve1 kernel: osdmap: 000000d0: 00 00 00 00 00 00 00 80 1a 06 00 00 35 0c 00 00 ............5...
Apr 26 21:28:40 pve1 kernel: osdmap: 000000e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Apr 26 21:28:40 pve1 kernel: osdmap: 000000f0: 00 00 00 00 00 00 00 00 00 00 00 c0 27 09 00 00 ............'...
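The narrowing-down step can be sketched in a few lines of Python: scan the journal text and report the first occurrence of each kernel message quoted in this post. This is only a rough sketch. The names `KEY_PATTERNS` and `first_ceph_errors` are hypothetical, and the pattern list is an assumption built from the three messages seen in the logs here, not a complete list of libceph errors.

```python
import re

# Kernel messages quoted in this post. Treat this list as a starting point
# (assumption): other failures will log different messages.
KEY_PATTERNS = [
    re.compile(r"libceph: another match of type \d+ in addrvec"),
    re.compile(r"libceph: corrupt full osdmap"),
    re.compile(r"ceph: .*error decoding mdsmap"),
]


def first_ceph_errors(lines):
    """Return sorted (line_number, text) pairs, one per pattern, first hit only."""
    hits = {}
    for number, line in enumerate(lines, start=1):
        for pattern in KEY_PATTERNS:
            # Record only the earliest line that matches each pattern.
            if pattern.pattern not in hits and pattern.search(line):
                hits[pattern.pattern] = (number, line.rstrip())
    return sorted(hits.values())
```

One way to use it would be to pipe the output of `journalctl -b -k` into a small script that calls `first_ceph_errors(sys.stdin)` and prints the pairs, so the hex dumps that follow each error don't bury the lines that matter.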
ChatGPT correctly identified that:
- all the OSDs and data were fine: mounting with the FUSE client worked (it apparently ignores mon host IP addresses that don't work)
- the FUSE client didn't care about being given invalid addresses (0.0.0.0) and some stale addresses it was picking up from loopbacks; I hadn't cleaned up the loopback interfaces
- it incorrectly had me zap the metadata for an OSD to try to fix this (I guess this might work in other scenarios)
- I eventually realised that the 0.0.0.0 was IPv4 and that I didn't want to use IPv4
- that's when I realized that somehow I had deleted ms_bind_ipv4 = false from ceph.conf, I think because ChatGPT told me it was best practice to leave it as default (oops). Putting it back fixed the issue
- don't panic when Ceph seems to be bad: it really is very, very robust. I have treated it terribly over the last 4 days and my cluster is working
- ChatGPT can help one design and troubleshoot...
- ...but it absolutely has no real logical reasoning (I already knew this) even if it looks like it, so the key is to take it as very, very illustrative and challenge it if you think it is wrong (pointing it at a docs page can get it to revise itself)
- don't remove ms_bind_ipv4 = false on an IPv6 system
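To make the last point harder to forget, here is a minimal sketch of a pre-flight check that reads ceph.conf text and warns when IPv6 binding is enabled but ms_bind_ipv4 has not been explicitly turned off. The function names are hypothetical. It assumes the options live in the [global] section, that ms_bind_ipv4 is on by default (which is what bit me here), and that booleans are written as true/false or 1/0.

```python
def parse_ceph_conf(text):
    """Parse ceph.conf-style text into {section: {option: value}}."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # skip blank lines and comment lines
        if line.startswith("[") and line.endswith("]"):
            current = sections.setdefault(line[1:-1].strip(), {})
        elif "=" in line and current is not None:
            key, value = line.split("=", 1)
            # Ceph treats spaces, dashes and underscores in option names alike,
            # so normalise "ms bind ipv4" and "ms-bind-ipv4" to "ms_bind_ipv4".
            key = "_".join(key.replace("-", " ").replace("_", " ").split())
            # Drop a trailing inline comment, if any.
            value = value.split("#")[0].split(";")[0]
            current[key] = value.strip().lower()
    return sections


def ipv6_only_warnings(text):
    """Warn if ms_bind_ipv6 is on while ms_bind_ipv4 is left at its default."""
    glob = parse_ceph_conf(text).get("global", {})
    truthy = {"true", "1"}   # assumption: only these spellings are used
    falsy = {"false", "0"}
    warnings = []
    if glob.get("ms_bind_ipv6") in truthy and glob.get("ms_bind_ipv4") not in falsy:
        warnings.append(
            "ms_bind_ipv6 is enabled but ms_bind_ipv4 is not explicitly false"
        )
    return warnings
```

Calling `ipv6_only_warnings(open("/etc/ceph/ceph.conf").read())` after any edit to the file would flag the exact mistake described above. The path is the usual default and may differ on your system.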
https://chatgpt.com/share/680dc0f7-5784-800d-807a-53aef7cd28b8
I am instantly marking this as resolved. This is a warning, a tutorial, and a log of a really weird implication of doing pure IPv6... so if anyone else hits this they may be saved from my lunacy.