Proxmox 7.3.3 / Ceph 17.2.5 - OSDs crashing while rebooting

fstrankowski

We updated our test cluster to the latest PVE version yesterday. The upgrade itself finished without incident, but while rebooting the nodes afterwards, all OSDs on each system crashed:
Code:
** File Read Latency Histogram By Level [default] **
2023-01-30T10:21:52.827+0100 7f5f16fd1700 -1 received signal: Terminated from /sbin/init (PID: 1) UID: 0
2023-01-30T10:21:52.827+0100 7f5f16fd1700 -1 osd.12 1500643 *** Got signal Terminated ***
2023-01-30T10:21:52.827+0100 7f5f16fd1700  0 osd.12 1500643 Fast Shutdown: - cct->_conf->osd_fast_shutdown = 1, null-fm = 1
2023-01-30T10:21:52.827+0100 7f5f16fd1700 -1 osd.12 1500643 *** Immediate shutdown (osd_fast_shutdown=true) ***
2023-01-30T10:21:52.827+0100 7f5f16fd1700  0 osd.12 1500643 prepare_to_stop telling mon we are shutting down and dead
2023-01-30T10:21:57.827+0100 7f5f16fd1700  0 osd.12 1500643 prepare_to_stop starting shutdown
2023-01-30T10:21:57.835+0100 7f5f0ad12700  0 bluestore(/var/lib/ceph/osd/ceph-12) allocation stats probe 19: cnt: 467772 frags: 490381 size: 3361779712
2023-01-30T10:21:57.835+0100 7f5f0ad12700  0 bluestore(/var/lib/ceph/osd/ceph-12) probe -1: 3894856, 4409481, 24518307840
2023-01-30T10:21:57.835+0100 7f5f0ad12700  0 bluestore(/var/lib/ceph/osd/ceph-12) probe -3: 3473756, 3764265, 22888685568
2023-01-30T10:21:57.835+0100 7f5f0ad12700  0 bluestore(/var/lib/ceph/osd/ceph-12) probe -7: 3478652, 3500519, 24553218048
2023-01-30T10:21:57.835+0100 7f5f0ad12700  0 bluestore(/var/lib/ceph/osd/ceph-12) probe -11: 3447472, 3566519, 23551668224
2023-01-30T10:21:57.835+0100 7f5f0ad12700  0 bluestore(/var/lib/ceph/osd/ceph-12) probe -19: 3310749, 3468410, 22454980608
2023-01-30T10:21:57.835+0100 7f5f0ad12700  0 bluestore(/var/lib/ceph/osd/ceph-12) ------------
2023-01-30T10:21:57.835+0100 7f5f16fd1700  4 rocksdb: [db/db_impl/db_impl.cc:446] Shutdown: canceling all background work
2023-01-30T10:21:57.835+0100 7f5f16fd1700  4 rocksdb: [db/db_impl/db_impl.cc:625] Shutdown complete
2023-01-30T10:21:58.575+0100 7f5f16fd1700  1 bluefs umount
2023-01-30T10:21:58.575+0100 7f5f16fd1700  1 bdev(0x56408272dc00 /var/lib/ceph/osd/ceph-12/block) close
2023-01-30T10:21:58.859+0100 7f5f16fd1700  1 freelist shutdown
2023-01-30T10:21:59.083+0100 7f5f16fd1700  1 fbmap_alloc 0x564081c61440 shutdown
2023-01-30T10:21:59.083+0100 7f5f16fd1700  1 bdev(0x56408272c000 /var/lib/ceph/osd/ceph-12/block) close
2023-01-30T10:22:15.983+0100 7f5f16fd1700 -1 ./src/osd/OSD.cc: In function 'int OSD::shutdown()' thread 7f5f16fd1700 time 2023-01-30T10:22:15.981875+0100
./src/osd/OSD.cc: 4340: FAILED ceph_assert(end_time - start_time_func < cct->_conf->osd_fast_shutdown_timeout)
 ceph version 17.2.5 (e04241aa9b639588fa6c864845287d2824cb6b55) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x124) [0x56407edccf70]
 2: /usr/bin/ceph-osd(+0xc2310e) [0x56407edcd10e]
 3: (OSD::shutdown()+0x135d) [0x56407eec287d]
 4: (SignalHandler::entry()+0x648) [0x56407f548408]
 5: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7f5f1ae83ea7]
 6: clone()
2023-01-30T10:22:15.987+0100 7f5f16fd1700 -1 *** Caught signal (Aborted) **
 in thread 7f5f16fd1700 thread_name:signal_handler
 ceph version 17.2.5 (e04241aa9b639588fa6c864845287d2824cb6b55) quincy (stable)
 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x13140) [0x7f5f1ae8f140]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x17e) [0x56407edccfca]
 5: /usr/bin/ceph-osd(+0xc2310e) [0x56407edcd10e]
 6: (OSD::shutdown()+0x135d) [0x56407eec287d]
 7: (SignalHandler::entry()+0x648) [0x56407f548408]
 8: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7f5f1ae83ea7]
 9: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
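The interesting part is the assert itself: the fast-shutdown path is enabled (osd_fast_shutdown = 1 in the log), and OSD::shutdown() aborts because the shutdown took longer than osd_fast_shutdown_timeout allows. Note the roughly 18 seconds between "prepare_to_stop starting shutdown" at 10:21:57 and the abort at 10:22:15, against a default timeout of 15 seconds (if I read the Quincy defaults correctly). To check what a cluster is actually running with, something like this should work (osd.12 is just the OSD from the log above):

Code:
# show the effective fast-shutdown settings for one OSD
ceph config get osd.12 osd_fast_shutdown
ceph config get osd.12 osd_fast_shutdown_timeout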

The recent-events tail of the same crash dump shows the identical assert:

Code:
    -8> 2023-01-30T10:22:12.663+0100 7f5f08fd1700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2023-01-30T10:21:42.665682+0100)
    -7> 2023-01-30T10:22:13.663+0100 7f5f08fd1700 10 monclient: tick
    -6> 2023-01-30T10:22:13.663+0100 7f5f08fd1700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2023-01-30T10:21:43.665865+0100)
    -5> 2023-01-30T10:22:14.663+0100 7f5f08fd1700 10 monclient: tick
    -4> 2023-01-30T10:22:14.663+0100 7f5f08fd1700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2023-01-30T10:21:44.666015+0100)
    -3> 2023-01-30T10:22:15.663+0100 7f5f08fd1700 10 monclient: tick
    -2> 2023-01-30T10:22:15.663+0100 7f5f08fd1700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2023-01-30T10:21:45.666166+0100)
    -1> 2023-01-30T10:22:15.983+0100 7f5f16fd1700 -1 ./src/osd/OSD.cc: In function 'int OSD::shutdown()' thread 7f5f16fd1700 time 2023-01-30T10:22:15.981875+0100
./src/osd/OSD.cc: 4340: FAILED ceph_assert(end_time - start_time_func < cct->_conf->osd_fast_shutdown_timeout)
 ceph version 17.2.5 (e04241aa9b639588fa6c864845287d2824cb6b55) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x124) [0x56407edccf70]
 2: /usr/bin/ceph-osd(+0xc2310e) [0x56407edcd10e]
 3: (OSD::shutdown()+0x135d) [0x56407eec287d]
 4: (SignalHandler::entry()+0x648) [0x56407f548408]
 5: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7f5f1ae83ea7]
 6: clone()
     0> 2023-01-30T10:22:15.987+0100 7f5f16fd1700 -1 *** Caught signal (Aborted) **
 in thread 7f5f16fd1700 thread_name:signal_handler
 ceph version 17.2.5 (e04241aa9b639588fa6c864845287d2824cb6b55) quincy (stable)
 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x13140) [0x7f5f1ae8f140]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x17e) [0x56407edccfca]
 5: /usr/bin/ceph-osd(+0xc2310e) [0x56407edcd10e]
 6: (OSD::shutdown()+0x135d) [0x56407eec287d]
 7: (SignalHandler::entry()+0x648) [0x56407f548408]
 8: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7f5f1ae83ea7]
 9: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---

I'd be happy to provide the full log if necessary.
 
We have seen the same sort of thing happen: upon reboot, one or more OSDs may (or may not) crash in this manner. So far it seems random.
There is a Ceph tracker issue for exactly this crash:
https://tracker.ceph.com/issues/56292
But there does not seem to be much (if any) activity on it as of now (a possible mitigation sketch follows after the crash report below).

Code:
ceph crash info 2023-03-10T23:22:29.784388Z_f90e9155-1bf1-4398-b5a1-0cf591322072
{
    "assert_condition": "end_time - start_time_func < cct->_conf->osd_fast_shutdown_timeout",
    "assert_file": "./src/osd/OSD.cc",
    "assert_func": "int OSD::shutdown()",
    "assert_line": 4340,
    "assert_msg": "./src/osd/OSD.cc: In function 'int OSD::shutdown()' thread 7f8f317eb700 time 2023-03-10T23:22:29.778913+0000\n./src/osd/OSD.cc: 4340: FAILED ceph_assert(end_time - start_time_func < cct->_conf->osd_fast_shutdown_timeout)\n",
    "assert_thread_name": "signal_handler",
    "backtrace": [
        "/lib/x86_64-linux-gnu/libpthread.so.0(+0x13140) [0x7f8f3599e140]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x17e) [0x564348ab6fca]",
        "/usr/bin/ceph-osd(+0xc2310e) [0x564348ab710e]",
        "(OSD::shutdown()+0x135d) [0x564348bac87d]",
        "(SignalHandler::entry()+0x648) [0x564349232408]",
        "/lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7f8f35992ea7]",
        "clone()"
    ],
    "ceph_version": "17.2.5",
    "crash_id": "2023-03-10T23:22:29.784388Z_f90e9155-1bf1-4398-b5a1-0cf591322072",
    "entity_name": "osd.8",
    "os_id": "11",
    "os_name": "Debian GNU/Linux 11 (bullseye)",
    "os_version": "11 (bullseye)",
    "os_version_id": "11",
    "process_name": "ceph-osd",
    "stack_sig": "dfb0b495bd77f684152c94fe9f9975ec7154f5d2df8322bcc4bd0150fe974c6a",
    "timestamp": "2023-03-10T23:22:29.784388Z",
    "utsname_hostname": "srv04",
    "utsname_machine": "x86_64",
    "utsname_release": "6.1.10-1-pve",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC PVE 6.1.10-1 (2023-02-07T00:00Z)"
}
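Side note: the crash module keeps each of these reports and raises a health warning until they are acknowledged. The usual commands for triaging them:

Code:
ceph crash ls                   # list recorded crashes with their IDs
ceph crash info <crash_id>      # full report, like the one above
ceph crash archive <crash_id>   # acknowledge a single crash
ceph crash archive-all          # acknowledge all of them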
 
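As mentioned above, until the tracker issue sees movement, a possible mitigation (untested on our side, so treat it as a sketch rather than a fix) is to give the shutdown more headroom, or to skip the fast-shutdown logic entirely, so the assert cannot trip during a planned reboot:

Code:
# raise the fast-shutdown deadline for all OSDs (60 s is an arbitrary choice)
ceph config set osd osd_fast_shutdown_timeout 60

# or fall back to the slower, orderly shutdown path
ceph config set osd osd_fast_shutdown false

Since the abort happens on the way down, data should not be at risk either way; the cost of the crash is mainly the backtrace noise and the accumulating crash reports.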
So we can at least say that our problem is unrelated to the kernel version: you're running 6.1 while we're on 5.15, and the issue shows up on both.
 
