Ceph osd crash

innerhippy

New Member
My 4 node cluster has been rock solid for 6 months, until now.
Bash:
root@pve02:~# ceph -s
  cluster:
    id:     3e788c55-0a22-4edc-af28-94b8e4ff1cac
    health: HEALTH_WARN
            Degraded data redundancy: 1105460/3316380 objects degraded (33.333%), 193 pgs degraded, 193 pgs undersized
            12 daemons have recently crashed
 
  services:
    mon: 3 daemons, quorum pve01,pve03,pve02 (age 80m)
    mgr: pve01(active, since 45h), standbys: pve03, pve02
    mds: 1/1 daemons up, 2 standby
    osd: 4 osds: 2 up (since 9h), 2 in (since 22m)
 
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 193 pgs
    objects: 1.11M objects, 1.5 TiB
    usage:   3.0 TiB used, 696 GiB / 3.6 TiB avail
    pgs:     1105460/3316380 objects degraded (33.333%)
             193 active+undersized+degraded
 
  io:
    client:   9.7 KiB/s wr, 0 op/s rd, 1 op/s wr

Bash:
$ ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1         7.27759  root default                             
-3         1.81940      host pve01                           
 0   nvme  1.81940          osd.0       up   1.00000  1.00000
-5         1.81940      host pve02                           
 1   nvme  1.81940          osd.1     down         0  1.00000
-7         1.81940      host pve03                           
 2   nvme  1.81940          osd.2     down         0  1.00000
-9         1.81940      host pve04                           
 3   nvme  1.81940          osd.3       up   1.00000  1.00000

The crash reports are not that revealing

Bash:
ceph crash info 2024-11-07T10:37:23.263067Z_3e208c58-2cd5-409f-ab96-e13ea861a27a
{
    "assert_condition": "false",
    "assert_file": "./src/os/bluestore/HybridAllocator.cc",
    "assert_func": "HybridAllocator::init_rm_free(uint64_t, uint64_t)::<lambda(uint64_t, uint64_t, bool)>",
    "assert_line": 178,
    "assert_msg": "./src/os/bluestore/HybridAllocator.cc: In function 'HybridAllocator::init_rm_free(uint64_t, uint64_t)::<lambda(uint64_t, uint64_t, bool)>' thread 78e76870d840 time 2024-11-07T10:37:23.250154+0000\n./src/os/bluestore/HybridAllocator.cc: 178: FAILED ceph_assert(false)\n",
    "assert_thread_name": "ceph-osd",
    "backtrace": [
        "/lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x78e76925b050]",
        "/lib/x86_64-linux-gnu/libc.so.6(+0x8ae3c) [0x78e7692a9e3c]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x185) [0x5b4b3ec42362]",
        "/usr/bin/ceph-osd(+0x6334a2) [0x5b4b3ec424a2]",
        "/usr/bin/ceph-osd(+0xd70dd7) [0x5b4b3f37fdd7]",
        "(AvlAllocator::_try_remove_from_tree(unsigned long, unsigned long, std::function<void (unsigned long, unsigned long, bool)>)+0x230) [0x5b4b3f3724e0]",
        "(HybridAllocator::init_rm_free(unsigned long, unsigned long)+0xc9) [0x5b4b3f3800a9]",
        "(BlueFS::mount()+0x1e9) [0x5b4b3f353269]",
        "(BlueStore::_open_bluefs(bool, bool)+0x2dd) [0x5b4b3f2551fd]",
        "(BlueStore::_prepare_db_environment(bool, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)+0x27c) [0x5b4b3f255d1c]",
        "(BlueStore::_open_db(bool, bool, bool)+0x37c) [0x5b4b3f26e71c]",
        "(BlueStore::_open_db_and_around(bool, bool)+0x48e) [0x5b4b3f2d6c4e]",
        "(BlueStore::_mount()+0x347) [0x5b4b3f2d9017]",
        "(OSD::init()+0x4b1) [0x5b4b3ed9e3d1]",
        "main()",
        "/lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x78e76924624a]",
        "__libc_start_main()",
        "_start()"
    ],
    "ceph_version": "18.2.4",
    "crash_id": "2024-11-07T10:37:23.263067Z_3e208c58-2cd5-409f-ab96-e13ea861a27a",
    "entity_name": "osd.1",
    "os_id": "12",
    "os_name": "Debian GNU/Linux 12 (bookworm)",
    "os_version": "12 (bookworm)",
    "os_version_id": "12",
    "process_name": "ceph-osd",
    "stack_sig": "5b64dfaf5da18c15ac63e445825a3bfa2cfaab78c531a1ef2cd84f73ebfad950",
    "timestamp": "2024-11-07T10:37:23.263067Z",
    "utsname_hostname": "pve02",
    "utsname_machine": "x86_64",
    "utsname_release": "6.8.12-3-pve",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC PMX 6.8.12-3 (2024-10-23T11:41Z)"
}

This is happening on 2 nodes, pve02 and pve03. Any idea how to recover from this?
 
Is there anything in the syslogs around that time regarding the physical disk? Any I/O errors for example?

Have you tried (re)starting the OSD services?
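
For example, something along these lines (adjust the OSD id and the time window to match the crash) should show both the kernel's view of the disk and what the OSD logs when it starts:
Bash:
# kernel messages around the crash time – look for NVMe resets or I/O errors
journalctl -k --since "2024-11-07 10:30" --until "2024-11-07 10:45"
# restart the OSD and follow its log live
systemctl restart ceph-osd@1.service
journalctl -fu ceph-osd@1.service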
 
Service restart enters failed state after a minute
Bash:
systemctl status ceph-osd@1.service
× ceph-osd@1.service - Ceph object storage daemon osd.1
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
             └─ceph-after-pve-cluster.conf
     Active: failed (Result: signal) since Thu 2024-11-07 11:17:55 GMT; 2min 33s ago
   Duration: 23.563s
    Process: 49073 ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 1 (code=exited, status=0/SUCCESS)
    Process: 49077 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id 1 --setuser ceph --setgroup ceph (code=killed, signal=ABR>
   Main PID: 49077 (code=killed, signal=ABRT)
        CPU: 17.355s

Nov 07 11:17:55 pve02 systemd[1]: ceph-osd@1.service: Scheduled restart job, restart counter is at 3.
Nov 07 11:17:55 pve02 systemd[1]: Stopped ceph-osd@1.service - Ceph object storage daemon osd.1.
Nov 07 11:17:55 pve02 systemd[1]: ceph-osd@1.service: Consumed 17.355s CPU time.
Nov 07 11:17:55 pve02 systemd[1]: ceph-osd@1.service: Start request repeated too quickly.
Nov 07 11:17:55 pve02 systemd[1]: ceph-osd@1.service: Failed with result 'signal'.
Nov 07 11:17:55 pve02 systemd[1]: Failed to start ceph-osd@1.service - Ceph object storage daemon osd.1.
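
The "Start request repeated too quickly" message is just systemd's start rate limit kicking in after the repeated aborts; clearing it lets the unit be retried manually, e.g.:
Bash:
# clear systemd's start rate-limit on the failed unit, then try again
systemctl reset-failed ceph-osd@1.service
systemctl start ceph-osd@1.service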

Nothing obvious in syslogs
Bash:
journalctl -p err -b
Nov 07 09:38:06 pve02 kernel: x86/cpu: SGX disabled by BIOS.
Nov 07 09:38:11 pve02 kernel: ipmi_si hardcode-ipmi-si.0: Interface detection failed
Nov 07 09:38:11 pve02 pmxcfs[1748]: [quorum] crit: quorum_initialize failed: 2
Nov 07 09:38:11 pve02 pmxcfs[1748]: [quorum] crit: can't initialize service
Nov 07 09:38:11 pve02 pmxcfs[1748]: [confdb] crit: cmap_initialize failed: 2
Nov 07 09:38:11 pve02 pmxcfs[1748]: [confdb] crit: can't initialize service
Nov 07 09:38:11 pve02 pmxcfs[1748]: [dcdb] crit: cpg_initialize failed: 2
Nov 07 09:38:11 pve02 pmxcfs[1748]: [dcdb] crit: can't initialize service
Nov 07 09:38:11 pve02 pmxcfs[1748]: [status] crit: cpg_initialize failed: 2
Nov 07 09:38:11 pve02 pmxcfs[1748]: [status] crit: can't initialize service
Nov 07 09:39:58 pve02 systemd[1]: Failed to start ceph-osd@1.service - Ceph object storage daemon osd.1.
Nov 07 10:37:33 pve02 systemd[1]: Failed to start ceph-osd@1.service - Ceph object storage daemon osd.1.
Nov 07 11:17:55 pve02 systemd[1]: Failed to start ceph-osd@1.service - Ceph object storage daemon osd.1.
 
Only thing vaguely interesting is this

systemd[1]: /lib/systemd/system/ceph-volume@.service:8: Unit uses KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update the service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
 
Maybe this indicates something?
Bash:
Nov 07 01:44:40 pve03 kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
Nov 07 01:45:00 pve03 kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
Nov 07 01:45:00 pve03 kernel: Buffer I/O error on dev dm-0, logical block 488378352, async page read
Nov 07 01:45:00 pve03 kernel: Buffer I/O error on dev dm-0, logical block 488378352, async page read
Nov 07 01:45:00 pve03 kernel: Buffer I/O error on dev dm-0, logical block 488378352, async page read
Nov 07 01:45:11 pve03 kernel: Buffer I/O error on dev dm-0, logical block 0, async page read
 
Getting closer
Code:
smartctl -c /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-3-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

Read NVMe Identify Controller failed: NVME_IOCTL_ADMIN_CMD: Input/output error
 
Yeah, those I/O errors do not look good!

What kind of SSDs are those? What happens if you reboot the host or put them into another machine? It is likely that the SSDs are dead, but maybe it was something else causing these I/O errors.
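
If the controller still responds at all, a fuller SMART report and the drive's own error log are worth a look (assuming smartmontools and nvme-cli are installed), for example:
Bash:
# full SMART / health report
smartctl -a /dev/nvme0n1
# the NVMe controller's internal error and health logs (nvme-cli)
nvme error-log /dev/nvme0n1
nvme smart-log /dev/nvme0n1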
 
I've tested both NVMe drives externally and they seem fine. pve03 is now working again.
pve02's smartctl output seems OK now too:

Bash:
smartctl -i /dev/nvme0n1

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-3-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 990 PRO 2TB
Serial Number:                      S7DNNJ0WC78622Z
Firmware Version:                   0B2QJXG7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       2.0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization:            1,941,206,802,432 [1.94 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 4c3143fd51
Local Time is:                      Thu Nov  7 19:44:45 2024 GMT


smartctl -H /dev/nvme0n1

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-3-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

But starting the osd service directly produces the following stack trace
Bash:
/usr/bin/ceph-osd -f --cluster ceph --id 1 --setuser ceph --setgroup ceph

./src/os/bluestore/HybridAllocator.cc: In function 'HybridAllocator::init_rm_free(uint64_t, uint64_t)::<lambda(uint64_t, uint64_t, bool)>' thread 7085fbbca840 time 2024-11-07T19:43:35.271394+0000
./src/os/bluestore/HybridAllocator.cc: 178: FAILED ceph_assert(false)
2024-11-07T19:43:35.270+0000 7085fbbca840 -1 HybridAllocator init_rm_free lambda Uexpected extent:  0xa7c74000~10000
 ceph version 18.2.4 (2064df84afc61c7e63928121bfdd74c59453c893) reef (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12a) [0x5a6a727d7307]
 2: /usr/bin/ceph-osd(+0x6334a2) [0x5a6a727d74a2]
 3: /usr/bin/ceph-osd(+0xd70dd7) [0x5a6a72f14dd7]
 4: (AvlAllocator::_try_remove_from_tree(unsigned long, unsigned long, std::function<void (unsigned long, unsigned long, bool)>)+0x230) [0x5a6a72f074e0]
 5: (HybridAllocator::init_rm_free(unsigned long, unsigned long)+0xc9) [0x5a6a72f150a9]
 6: (BlueFS::mount()+0x1e9) [0x5a6a72ee8269]
 7: (BlueStore::_open_bluefs(bool, bool)+0x2dd) [0x5a6a72dea1fd]
 8: (BlueStore::_prepare_db_environment(bool, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)+0x27c) [0x5a6a72dead1c]
 9: (BlueStore::_open_db(bool, bool, bool)+0x37c) [0x5a6a72e0371c]
 10: (BlueStore::_open_db_and_around(bool, bool)+0x48e) [0x5a6a72e6bc4e]
 11: (BlueStore::_mount()+0x347) [0x5a6a72e6e017]
 12: (OSD::init()+0x4b1) [0x5a6a729333d1]
 13: main()
 14: /lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7085fc64624a]
 15: __libc_start_main()
 16: _start()
*** Caught signal (Aborted) **
 in thread 7085fbbca840 thread_name:ceph-osd
2024-11-07T19:43:35.277+0000 7085fbbca840 -1 ./src/os/bluestore/HybridAllocator.cc: In function 'HybridAllocator::init_rm_free(uint64_t, uint64_t)::<lambda(uint64_t, uint64_t, bool)>' thread 7085fbbca840 time 2024-11-07T19:43:35.271394+0000
./src/os/bluestore/HybridAllocator.cc: 178: FAILED ceph_assert(false)
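
The assert is thrown from BlueStore's free-space allocator while BlueFS mounts, so my guess is the I/O errors left the allocator metadata on that OSD in an inconsistent state. I assume an offline check with ceph-bluestore-tool (with the OSD stopped) would confirm that, something like:
Bash:
# offline consistency check of osd.1's BlueStore metadata – the OSD must be stopped first
systemctl stop ceph-osd@1.service
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-1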
 
So it looks like my Samsung 990 Pros are a bit crap and will need replacing. Is there a definitive guide on replacing OSD drives for this situation? It would have been nice if ceph-osd had caught and reported the issue instead of just barfing, but I guess that's on Ceph rather than Proxmox.
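
I'm guessing the replacement itself is the usual out / stop / purge / recreate sequence, roughly like this (not run yet, so please correct me if there's more to it):
Bash:
# rough outline for osd.1 on pve02 – double-check against the Ceph/Proxmox docs first
ceph osd out 1
systemctl stop ceph-osd@1.service
ceph osd purge 1 --yes-i-really-mean-it
# swap the physical NVMe, then recreate the OSD on the new disk (GUI or pveceph)
pveceph osd create /dev/nvme0n1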
 
