We had to take down our three-node cluster today for a routine electrical equipment check. Each node has 4 OSDs. After all nodes were started again, the OSDs on one node did not come back up; the OSDs on the other two servers are fine.
I'm a little bit lost on what to do here. It looks like one disk on this node didn't survive the outage (SMART reports FAILED), but that shouldn't cause all of the node's OSDs to fail. Any idea how to fix this issue? Thanks!
This is the current OSD tree:
Code:
root@proxmox01:~# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 10.47949 root default
-3 3.49316 host proxmox01
0 ssd 0.87329 osd.0 down 0 1.00000
1 ssd 0.87329 osd.1 down 0 1.00000
2 ssd 0.87329 osd.2 down 0 1.00000
3 ssd 0.87329 osd.3 down 0 1.00000
-5 3.49316 host proxmox02
4 ssd 0.87329 osd.4 up 1.00000 1.00000
5 ssd 0.87329 osd.5 up 1.00000 1.00000
6 ssd 0.87329 osd.6 up 1.00000 1.00000
7 ssd 0.87329 osd.7 up 1.00000 1.00000
-7 3.49316 host proxmox03
8 ssd 0.87329 osd.8 up 1.00000 1.00000
9 ssd 0.87329 osd.9 up 1.00000 1.00000
10 ssd 0.87329 osd.10 up 1.00000 1.00000
11 ssd 0.87329 osd.11 up 1.00000 1.00000
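For anyone who wants to scan this kind of output quickly, here is a minimal Python sketch that groups the `down` OSDs by host. It only parses the plain-text layout shown above; the function name `down_osds_by_host` and the shortened sample are hypothetical and used for illustration, they are not part of Ceph.

```python
from collections import defaultdict

# Shortened sample copied from the `ceph osd tree` output above.
TREE = """\
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 10.47949 root default
-3 3.49316 host proxmox01
0 ssd 0.87329 osd.0 down 0 1.00000
1 ssd 0.87329 osd.1 down 0 1.00000
-5 3.49316 host proxmox02
4 ssd 0.87329 osd.4 up 1.00000 1.00000
"""


def down_osds_by_host(tree_text):
    """Group OSDs whose STATUS column is 'down' under their host bucket.

    Assumes the whitespace-separated layout shown above (hypothetical helper).
    """
    result = defaultdict(list)
    host = None
    for line in tree_text.splitlines():
        fields = line.split()
        # Host bucket rows look like: "-3 3.49316 host proxmox01"
        if len(fields) >= 4 and fields[2] == "host":
            host = fields[3]
        # OSD rows look like: "0 ssd 0.87329 osd.0 down 0 1.00000"
        elif len(fields) >= 5 and fields[3].startswith("osd.") and fields[4] == "down":
            result[host].append(fields[3])
    return dict(result)


print(down_osds_by_host(TREE))
```

Run against the full tree above, this would list only proxmox01, with osd.0 through osd.3.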
The logs for all the OSD daemons show various stack traces; this one is from osd.1:
Code:
Dec 28 17:13:51 proxmox01 systemd[1]: Starting ceph-osd@1.service - Ceph object storage daemon osd.1...
Dec 28 17:13:51 proxmox01 systemd[1]: Started ceph-osd@1.service - Ceph object storage daemon osd.1.
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: *** Caught signal (Bus error) **
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: in thread 7c518f8376c0 thread_name:ceph-osd
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: ceph version 18.2.2 (e9fe820e7fffd1b7cde143a9f77653b73fcec748) reef (stable)
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x7c519025b050]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 2: /lib64/ld-linux-x86-64.so.2(+0x22f58) [0x7c51912b6f58]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 3: /lib64/ld-linux-x86-64.so.2(+0x8bd7) [0x7c519129cbd7]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 4: /lib64/ld-linux-x86-64.so.2(+0x90ca) [0x7c519129d0ca]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 5: /lib64/ld-linux-x86-64.so.2(+0x9a48) [0x7c519129da48]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 6: /lib64/ld-linux-x86-64.so.2(+0xe411) [0x7c51912a2411]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 7: /lib64/ld-linux-x86-64.so.2(+0xbc0a) [0x7c519129fc0a]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 8: _dl_catch_exception()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 9: /lib64/ld-linux-x86-64.so.2(+0xb1c6) [0x7c519129f1c6]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 10: _dl_catch_exception()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 11: /lib64/ld-linux-x86-64.so.2(+0xb5b8) [0x7c519129f5b8]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 12: /lib/x86_64-linux-gnu/libc.so.6(+0x85438) [0x7c51902a4438]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 13: _dl_catch_exception()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 14: _dl_catch_error()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 15: /lib/x86_64-linux-gnu/libc.so.6(+0x84f27) [0x7c51902a3f27]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 16: dlopen()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 17: (ceph::ErasureCodePluginRegistry::load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::ErasureCodePlugin**, std::ostream*)+0x1f7) [0x633951fc21c7]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 18: (ceph::ErasureCodePluginRegistry::preload(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::ostream*)+0x9f) [0x633951fc293f]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 19: (global_init_preload_erasure_code(ceph::common::CephContext const*)+0x8a8) [0x633951607848]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 20: main()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 21: /lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7c519024624a]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 22: __libc_start_main()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 23: _start()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 2024-12-28T17:13:53.583+0100 7c518f8376c0 -1 *** Caught signal (Bus error) **
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: in thread 7c518f8376c0 thread_name:ceph-osd
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: ceph version 18.2.2 (e9fe820e7fffd1b7cde143a9f77653b73fcec748) reef (stable)
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x7c519025b050]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 2: /lib64/ld-linux-x86-64.so.2(+0x22f58) [0x7c51912b6f58]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 3: /lib64/ld-linux-x86-64.so.2(+0x8bd7) [0x7c519129cbd7]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 4: /lib64/ld-linux-x86-64.so.2(+0x90ca) [0x7c519129d0ca]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 5: /lib64/ld-linux-x86-64.so.2(+0x9a48) [0x7c519129da48]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 6: /lib64/ld-linux-x86-64.so.2(+0xe411) [0x7c51912a2411]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 7: /lib64/ld-linux-x86-64.so.2(+0xbc0a) [0x7c519129fc0a]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 8: _dl_catch_exception()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 9: /lib64/ld-linux-x86-64.so.2(+0xb1c6) [0x7c519129f1c6]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 10: _dl_catch_exception()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 11: /lib64/ld-linux-x86-64.so.2(+0xb5b8) [0x7c519129f5b8]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 12: /lib/x86_64-linux-gnu/libc.so.6(+0x85438) [0x7c51902a4438]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 13: _dl_catch_exception()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 14: _dl_catch_error()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 15: /lib/x86_64-linux-gnu/libc.so.6(+0x84f27) [0x7c51902a3f27]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 16: dlopen()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 17: (ceph::ErasureCodePluginRegistry::load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::ErasureCodePlugin**, std::ostream*)+0x1f7) [0x633951fc21c7]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 18: (ceph::ErasureCodePluginRegistry::preload(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::ostream*)+0x9f) [0x633951fc293f]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 19: (global_init_preload_erasure_code(ceph::common::CephContext const*)+0x8a8) [0x633951607848]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 20: main()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 21: /lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7c519024624a]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 22: __libc_start_main()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 23: _start()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 0> 2024-12-28T17:13:53.583+0100 7c518f8376c0 -1 *** Caught signal (Bus error) **
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: in thread 7c518f8376c0 thread_name:ceph-osd
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: ceph version 18.2.2 (e9fe820e7fffd1b7cde143a9f77653b73fcec748) reef (stable)
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x7c519025b050]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 2: /lib64/ld-linux-x86-64.so.2(+0x22f58) [0x7c51912b6f58]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 3: /lib64/ld-linux-x86-64.so.2(+0x8bd7) [0x7c519129cbd7]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 4: /lib64/ld-linux-x86-64.so.2(+0x90ca) [0x7c519129d0ca]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 5: /lib64/ld-linux-x86-64.so.2(+0x9a48) [0x7c519129da48]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 6: /lib64/ld-linux-x86-64.so.2(+0xe411) [0x7c51912a2411]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 7: /lib64/ld-linux-x86-64.so.2(+0xbc0a) [0x7c519129fc0a]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 8: _dl_catch_exception()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 9: /lib64/ld-linux-x86-64.so.2(+0xb1c6) [0x7c519129f1c6]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 10: _dl_catch_exception()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 11: /lib64/ld-linux-x86-64.so.2(+0xb5b8) [0x7c519129f5b8]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 12: /lib/x86_64-linux-gnu/libc.so.6(+0x85438) [0x7c51902a4438]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 13: _dl_catch_exception()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 14: _dl_catch_error()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 15: /lib/x86_64-linux-gnu/libc.so.6(+0x84f27) [0x7c51902a3f27]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 16: dlopen()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 17: (ceph::ErasureCodePluginRegistry::load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::ErasureCodePlugin**, std::ostream*)+0x1f7) [0x633951fc21c7]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 18: (ceph::ErasureCodePluginRegistry::preload(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::ostream*)+0x9f) [0x633951fc293f]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 19: (global_init_preload_erasure_code(ceph::common::CephContext const*)+0x8a8) [0x633951607848]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 20: main()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 21: /lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7c519024624a]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 22: __libc_start_main()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 23: _start()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 0> 2024-12-28T17:13:53.583+0100 7c518f8376c0 -1 *** Caught signal (Bus error) **
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: in thread 7c518f8376c0 thread_name:ceph-osd
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: ceph version 18.2.2 (e9fe820e7fffd1b7cde143a9f77653b73fcec748) reef (stable)
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x7c519025b050]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 2: /lib64/ld-linux-x86-64.so.2(+0x22f58) [0x7c51912b6f58]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 3: /lib64/ld-linux-x86-64.so.2(+0x8bd7) [0x7c519129cbd7]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 4: /lib64/ld-linux-x86-64.so.2(+0x90ca) [0x7c519129d0ca]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 5: /lib64/ld-linux-x86-64.so.2(+0x9a48) [0x7c519129da48]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 6: /lib64/ld-linux-x86-64.so.2(+0xe411) [0x7c51912a2411]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 7: /lib64/ld-linux-x86-64.so.2(+0xbc0a) [0x7c519129fc0a]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 8: _dl_catch_exception()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 9: /lib64/ld-linux-x86-64.so.2(+0xb1c6) [0x7c519129f1c6]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 10: _dl_catch_exception()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 11: /lib64/ld-linux-x86-64.so.2(+0xb5b8) [0x7c519129f5b8]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 12: /lib/x86_64-linux-gnu/libc.so.6(+0x85438) [0x7c51902a4438]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 13: _dl_catch_exception()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 14: _dl_catch_error()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 15: /lib/x86_64-linux-gnu/libc.so.6(+0x84f27) [0x7c51902a3f27]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 16: dlopen()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 17: (ceph::ErasureCodePluginRegistry::load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::ErasureCodePlugin**, std::ostream*)+0x1f7) [0x633951fc21c7]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 18: (ceph::ErasureCodePluginRegistry::preload(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::ostream*)+0x9f) [0x633951fc293f]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 19: (global_init_preload_erasure_code(ceph::common::CephContext const*)+0x8a8) [0x633951607848]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 20: main()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 21: /lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7c519024624a]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 22: __libc_start_main()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 23: _start()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Dec 28 17:13:53 proxmox01 systemd[1]: ceph-osd@1.service: Main process exited, code=killed, status=7/BUS
Dec 28 17:13:53 proxmox01 systemd[1]: ceph-osd@1.service: Failed with result 'signal'.
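Since the same trace repeats several times, a small Python sketch like the following can reduce it to the signal and the frames that carry a symbol name. It is only a reading aid under the assumption that the journal lines keep the format shown above; `summarize` and both regular expressions are hypothetical helpers, not a Ceph tool.

```python
import re

# A few lines copied from the journal excerpt above.
LOG = """\
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: *** Caught signal (Bus error) **
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x7c519025b050]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 8: _dl_catch_exception()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 16: dlopen()
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 19: (global_init_preload_erasure_code(ceph::common::CephContext const*)+0x8a8) [0x633951607848]
Dec 28 17:13:53 proxmox01 ceph-osd[3095]: 20: main()
"""

# Matches the signal name inside "Caught signal (...)".
SIGNAL_RE = re.compile(r"Caught signal \(([^)]+)\)")
# Matches numbered frames that start with a symbol name (optionally wrapped in
# an opening parenthesis); frames that only give a library path are skipped.
FRAME_RE = re.compile(r"ceph-osd\[\d+\]: (\d+): \(?([A-Za-z_][\w:]*)\(")


def summarize(log_text):
    """Return (signal, [(frame_number, symbol), ...]) for the given log text."""
    signal = None
    frames = []
    for line in log_text.splitlines():
        match = SIGNAL_RE.search(line)
        if match:
            if signal is None:
                signal = match.group(1)
            continue
        match = FRAME_RE.search(line)
        if match:
            frames.append((int(match.group(1)), match.group(2)))
    return signal, frames


signal, frames = summarize(LOG)
print(signal)
for number, symbol in frames:
    print(number, symbol)
```

On the excerpt above this leaves the chain readable at a glance: the bus error is raised inside `dlopen()`, which is reached from the erasure-code plugin preload during OSD startup.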
The entire log is attached.
Versions (pveversion -v):
Code:
pveversion -v
proxmox-ve: 8.2.0 (running kernel: 6.8.8-2-pve)
pve-manager: 8.2.4 (running version: 8.2.4/faa83925c9641325)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.8-2
proxmox-kernel-6.8.8-2-pve-signed: 6.8.8-2
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
proxmox-kernel-6.5.13-5-pve-signed: 6.5.13-5
proxmox-kernel-6.5: 6.5.13-5
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
proxmox-kernel-6.5.11-4-pve-signed: 6.5.11-4
ceph: 18.2.2-pve1
ceph-fuse: 18.2.2-pve1
corosync: 3.1.7-pve3
criu: 3.17.1-2
dnsmasq: 2.89-1
frr-pythontools: 8.5.2-1+pve1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.7
libpve-cluster-perl: 8.0.7
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.3
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.9
libpve-storage-perl: 8.2.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.7-1
proxmox-backup-file-restore: 3.2.7-1
proxmox-firewall: 0.4.2
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.7
pve-container: 5.1.12
pve-docs: 8.2.2
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.1
pve-firewall: 5.0.7
pve-firmware: 3.12-1
pve-ha-manager: 4.0.5
pve-i18n: 3.2.2
pve-qemu-kvm: 9.0.0-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.4-pve1