CEPH Health Warn

Mino

Hello,

Yesterday, CEPH crashed on one server and now I have a health warning on the dashboard.
Please see logs attached.

Code:
root@proxmox01 [~] # pveversion
pve-manager/6.1-5/9bf06119 (running kernel: 5.3.13-1-pve)

What can I do about this problem, and how can I clear the warning?

[Screenshot attached: ceph-warning.png]

Attachments: ceph-mon_logs.txt
Thanks for the reply, I was able to clear the warning.
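(For anyone finding this later: the "daemons have recently crashed" warning can be cleared by archiving the crash reports with Ceph's crash module, for example:)

Code:
# list the recent crash reports
ceph crash ls
# show the details of one report (presumably where the JSON below comes from)
ceph crash info <crash_id>
# acknowledge/archive all of them, which clears the health warning
ceph crash archive-all
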
By the way, I discovered there was another crash for the same reason some days earlier:

Code:
{
    "os_version_id": "10",
    "assert_condition": "z >= signedspan::zero()",
    "utsname_release": "5.0.21-2-pve",
    "os_name": "Debian GNU/Linux 10 (buster)",
    "entity_name": "mon.proxmox03",
    "assert_file": "/mnt/pve/ceph-dev/ceph/ceph-14.2.4/src/common/ceph_time.h",
    "timestamp": "2019-11-23 08:25:41.759708Z",
    "process_name": "ceph-mon",
    "utsname_machine": "x86_64",
    "assert_line": 485,
    "utsname_sysname": "Linux",
    "os_version": "10 (buster)",
    "os_id": "10",
    "assert_thread_name": "ms_dispatch",
    "utsname_version": "#1 SMP PVE 5.0.21-6 (Fri, 27 Sep 2019 17:17:02 +0200)",
    "backtrace": [
        "(()+0x12730) [0x7f9a7a3a8730]",
        "(gsignal()+0x10b) [0x7f9a79e8b7bb]",
        "(abort()+0x121) [0x7f9a79e76535]",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x7f9a7b4e765f]",
        "(()+0x2807e6) [0x7f9a7b4e77e6]",
        "(Paxos::do_refresh()+0x1a4) [0x55ad55800c14]",
        "(Paxos::handle_commit(boost::intrusive_ptr<MonOpRequest>)+0x2f2) [0x55ad558061c2]",
        "(Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x223) [0x55ad5580b973]",
        "(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x131c) [0x55ad5571a1dc]",
        "(Monitor::_ms_dispatch(Message*)+0x4aa) [0x55ad5571a7ca]",
        "(Monitor::ms_dispatch(Message*)+0x26) [0x55ad55749f76]",
        "(Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x55ad557464a6]",
        "(DispatchQueue::entry()+0x1a49) [0x7f9a7b71daa9]",
        "(DispatchQueue::DispatchThread::entry()+0xd) [0x7f9a7b7cb4dd]",
        "(()+0x7fa3) [0x7f9a7a39dfa3]",
        "(clone()+0x3f) [0x7f9a79f4d4cf]"
    ],
    "utsname_hostname": "proxmox03",
    "assert_msg": "/mnt/pve/ceph-dev/ceph/ceph-14.2.4/src/common/ceph_time.h: In function 'ceph::time_detail::timespan ceph::to_timespan(ceph::time_detail::signedspan)' thread 7f9a7146c700 time 2019-11-23 09:25:41.753470\n/mnt/pve/ceph-dev/ceph/ceph-14.2.4/src/common/ceph_time.h: 485: FAILED ceph_assert(z >= signedspan::zero())\n",
    "crash_id": "2019-11-23_08:25:41.759708Z_c2da2bf7-0136-4a75-b9a2-2e11d01c27e4",
    "assert_func": "ceph::time_detail::timespan ceph::to_timespan(ceph::time_detail::signedspan)",
    "ceph_version": "14.2.4"
}

Is it a known bug?
 
Is it a known bug?

No. This is highly weird!
You ran into an assertion in a timespan helper method. This helper method is only used in a few places by the Ceph monitors when doing Paxos (quorum) work. The usage always looks like:
Code:
auto start = ceph::coarse_mono_clock::now();  // get start time
get_store()->apply_transaction(t); // do some work, e.g., apply transaction
auto end = ceph::coarse_mono_clock::now();  // get end time

logger->tinc("foo", to_timespan(end-start));  // log with duration
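
Judging from the assert message in your report, the helper itself is roughly shaped like this (a simplified, standalone sketch with approximated types, not the exact Ceph source):

Code:
#include <cassert>
#include <chrono>
#include <cstdint>

// rough stand-ins for Ceph's typedefs in src/common/ceph_time.h
using timespan   = std::chrono::duration<uint64_t, std::nano>;  // unsigned span
using signedspan = std::chrono::duration<int64_t,  std::nano>;  // signed span

inline timespan to_timespan(signedspan z) {
    // the assertion that fired in your crash: the span must not be negative
    assert(z >= signedspan::zero());  // ceph_assert() in the real code
    return std::chrono::duration_cast<timespan>(z);
}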

Now, the start and end times come from a monotonic clock. The main feature (or guarantee) of a monotonic clock is that it never jumps backwards (which other clocks can do, for example to correct time drift or for summer/winter time (daylight saving time) changes).

This guarantee should normally ensure that the timespan never becomes negative, as end has to be greater than, or at least equal to, start, at least as long as causality holds up. But monotonic clocks normally require an RTC (real-time clock) to work.
As the code seems pretty simple and I do not remember any similar report, I'd say it is rather an issue with your HW/setup.
It could still be a bug, that's never ruled out 100%, but it seems rather unlikely.
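To illustrate the difference between the two clock types outside of Ceph, here's a minimal standalone C++ sketch (plain std::chrono, not the Ceph types):

Code:
#include <chrono>
#include <iostream>

int main() {
    // system_clock follows wall-clock time; it can jump backwards when the
    // time is stepped (NTP correction, DST change, manual adjustment).
    auto wall_start = std::chrono::system_clock::now();

    // steady_clock is monotonic: a later now() is never smaller than an
    // earlier one, so (end - start) can never be negative. This is the
    // property that ceph_assert(z >= signedspan::zero()) relies on.
    auto mono_start = std::chrono::steady_clock::now();

    // ... some work would happen here ...

    auto wall_end = std::chrono::system_clock::now();
    auto mono_end = std::chrono::steady_clock::now();

    using ns = std::chrono::nanoseconds;
    std::cout << "wall span (can go negative after a clock step): "
              << std::chrono::duration_cast<ns>(wall_end - wall_start).count() << " ns\n";
    std::cout << "monotonic span (guaranteed >= 0): "
              << std::chrono::duration_cast<ns>(mono_end - mono_start).count() << " ns\n";
}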

Do your servers have an RTC? How old are they, and what model are they?
 
I have 3 servers, all set up like this:
- they're synchronized with chrony to public NTP servers, and I never noticed any NTP problem. Chrony copies the system clock to the hardware clock every 11 minutes by default (see the check commands after this list).
- the hardware is an ASROCKRACK X470D4U motherboard with an AMD Ryzen 5 2600 CPU in each
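(For reference, the RTC and the 11-minute sync can be checked like this; paths are the Debian defaults:)

Code:
# does the kernel see a hardware clock, and what does it read?
timedatectl
hwclock --show

# chrony's 11-minute RTC sync is enabled by the 'rtcsync' directive
grep rtcsync /etc/chrony/chrony.conf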
 
they're synchronized with chrony to public NTP servers, and I never noticed any NTP problem. Chrony copies the system clock to the hardware clock every 11 minutes by default.

This isn't an NTP problem directly, as in a problem with unsynchronized time. On the contrary, it can be caused by time synchronization if there is no (working) RTC hardware clock, as then the monotonic clock can be broken by time changes.

That's why systemd-timesyncd does not apply time drift compensation immediately if there's no RTC, but only on reboot.
systemd-timesyncd [...] also saves the local time to disk every time the clock has been synchronized and uses this to possibly advance the system realtime clock on subsequent reboots to ensure it monotonically advances even if the system lacks a battery-buffered RTC chip.
man systemd-timesyncd

I'm just trying to make sense of the symptoms you're experiencing, and for now that's the most plausible cause to me.
 
As timesyncd is installed by default, it was my NTP daemon in the beginning.
But I always had CEPH warnings complaining about time drift (clock skew) between the servers.
Timesyncd wasn't able to keep the servers synchronized.
When I switched to chrony, no more problems. Well, at least until I saw these 2 crashes.

Logging for chrony was disabled by default; I just activated it today. I'll see if I get some logs the next time a ceph-mon crashes.
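(For reference, that just means enabling the log directives in chrony's config, assuming the default Debian paths:)

Code:
# /etc/chrony/chrony.conf
logdir /var/log/chrony
log tracking measurements statistics

Then restart chrony (systemctl restart chrony) so it starts writing tracking.log, measurements.log and statistics.log.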
 
Little update: a ceph-mon crashed again:

Code:
{
    "os_version_id": "10",
    "assert_condition": "z >= signedspan::zero()",
    "utsname_release": "5.3.13-1-pve",
    "os_name": "Debian GNU/Linux 10 (buster)",
    "entity_name": "mon.proxmox03",
    "assert_file": "/mnt/npool/tlamprecht/pve-ceph/ceph-14.2.5/src/common/ceph_time.h",
    "timestamp": "2020-02-08 07:10:40.476912Z",
    "process_name": "ceph-mon",
    "utsname_machine": "x86_64",
    "assert_line": 485,
    "utsname_sysname": "Linux",
    "os_version": "10 (buster)",
    "os_id": "10",
    "assert_thread_name": "ms_dispatch",
    "utsname_version": "#1 SMP PVE 5.3.13-1 (Thu, 05 Dec 2019 07:18:14 +0100)",
    "backtrace": [
        "(()+0x12730) [0x7f400dd79730]",
        "(gsignal()+0x10b) [0x7f400d85c7bb]",
        "(abort()+0x121) [0x7f400d847535]",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x7f400eeb3e79]",
        "(()+0x282000) [0x7f400eeb4000]",
        "(Paxos::do_refresh()+0x1a4) [0x55b2c25494b4]",
        "(Paxos::handle_commit(boost::intrusive_ptr<MonOpRequest>)+0x2f2) [0x55b2c254ea62]",
        "(Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x223) [0x55b2c2554213]",
        "(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x131c) [0x55b2c2489b1c]",
        "(Monitor::_ms_dispatch(Message*)+0x4aa) [0x55b2c248a10a]",
        "(Monitor::ms_dispatch(Message*)+0x26) [0x55b2c24b9a36]",
        "(Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x55b2c24b5f66]",
        "(DispatchQueue::entry()+0x1a49) [0x7f400f0ece69]",
        "(DispatchQueue::DispatchThread::entry()+0xd) [0x7f400f19a9ed]",
        "(()+0x7fa3) [0x7f400dd6efa3]",
        "(clone()+0x3f) [0x7f400d91e4cf]"
    ],
    "utsname_hostname": "proxmox03",
    "assert_msg": "/mnt/npool/tlamprecht/pve-ceph/ceph-14.2.5/src/common/ceph_time.h: In function 'ceph::time_detail::timespan ceph::to_timespan(ceph::time_detail::signedspan)' thread 7f4004e34700 time 2020-02-08 08:10:40.470081\n/mnt/npool/tlamprecht/pve-ceph/ceph-14.2.5/src/common/ceph_time.h: 485: FAILED ceph_assert(z >= signedspan::zero())\n",
    "crash_id": "2020-02-08_07:10:40.476912Z_067d39ad-8f95-470c-bb24-aae841b42f40",
    "assert_func": "ceph::time_detail::timespan ceph::to_timespan(ceph::time_detail::signedspan)",
    "ceph_version": "14.2.5"
}

Here are the logs from chrony; at 07:10:40 something happens:

tracking:
Code:
===================================================================================================================================
   Date (UTC) Time     IP Address   St   Freq ppm   Skew ppm     Offset L Co  Offset sd Rem. corr. Root delay Root disp. Max. error
===================================================================================================================================
2020-02-08 06:01:50 216.239.35.0     2     10.116      0.016 -2.117e-04 N  2  8.667e-04  1.696e-04  6.637e-03  2.364e-04  4.952e-03
2020-02-08 06:19:03 216.239.35.0     2     10.110      0.021 -1.617e-04 N  2  8.921e-04  2.688e-04  6.871e-03  2.295e-04  4.986e-03
2020-02-08 06:36:16 216.239.35.0     2     10.100      0.022 -1.110e-04 N  2  9.076e-04  2.920e-04  6.383e-03  2.226e-04  5.153e-03
2020-02-08 06:53:32 216.239.35.0     2     10.096      0.020 -1.216e-04 N  2  9.870e-04  2.833e-04  6.383e-03  3.024e-04  4.878e-03
2020-02-08 07:10:40 216.239.35.0     2     10.068      0.078 -3.288e-04 N  2  7.077e-04  2.951e-04  6.383e-03  3.738e-04  4.948e-03
2020-02-08 07:28:00 216.239.35.0     2      9.983      0.092 -1.667e-04 N  2  6.608e-04  2.527e-04  6.383e-03  4.094e-04  5.401e-03
2020-02-08 07:45:05 216.239.35.0     2      9.967      0.053 -3.149e-05 N  2  6.754e-04  2.446e-04  6.383e-03  4.257e-04  5.154e-03
2020-02-08 08:02:26 216.239.35.0     2      9.959      0.040 -2.134e-05 N  2  7.036e-04  1.997e-04  6.383e-03  4.108e-04  4.990e-03
2020-02-08 08:19:37 216.239.35.0     2      9.960      0.031  2.027e-05 N  2  7.289e-04  1.751e-04  6.383e-03  4.145e-04  4.898e-03

measurements:
Code:
========================================================================================================================================
   Date (UTC) Time     IP Address   L St 123 567 ABCD  LP RP Score    Offset  Peer del. Peer disp.  Root del. Root disp. Refid     MTxRx
========================================================================================================================================
2020-02-08 06:57:04 216.239.35.12   N  1 111 111 1111  10 10 1.00 -5.434e-04  1.574e-02  9.739e-07  0.000e+00  1.678e-04 474F4F47 4B K K
2020-02-08 07:01:50 162.159.200.123 N  3 111 111 1111  10 10 0.91  1.120e-03  8.287e-03  3.623e-08  1.607e-02  4.120e-04 0AA90804 4B K K
2020-02-08 07:06:06 162.159.200.1   N  3 111 111 1111  10 10 0.91  1.073e-03  7.581e-03  3.584e-08  1.607e-02  7.324e-04 0AA90804 4B K K
2020-02-08 07:06:45 79.160.13.250   N  2 111 111 1111  10 10 0.47  4.837e-03  3.958e-02  3.751e-08  2.579e-03  7.019e-04 51AF05B6 4B K K
2020-02-08 07:10:40 216.239.35.0    N  1 111 111 1111  10 10 0.44  8.825e-04  6.841e-03  9.742e-07  0.000e+00  1.831e-04 474F4F47 4B K K
2020-02-08 07:11:42 5.103.139.163   N  1 111 111 1111  10 10 0.59  4.291e-05  2.643e-02  5.091e-08  0.000e+00  0.000e+00 47505300 4B K K
2020-02-08 07:14:23 216.239.35.12   N  1 111 111 1111  10 10 0.97 -8.441e-04  1.606e-02  9.759e-07  0.000e+00  1.984e-04 474F4F47 4B K K
2020-02-08 07:19:07 162.159.200.123 N  3 111 111 1111  10 10 0.84  1.082e-03  7.985e-03  3.585e-08  1.588e-02  4.730e-04 0AA90804 4B K K
2020-02-08 07:23:21 162.159.200.1   N  3 111 111 1111  10 10 0.88  8.381e-04  7.831e-03  3.651e-08  1.584e-02  7.935e-04 0AA90804 4B K K

statistics:
Code:
====================================================================================================================
   Date (UTC) Time     IP Address    Std dev'n Est offset  Offset sd  Diff freq   Est skew  Stress  Ns  Bs  Nr  Asym
====================================================================================================================
2020-02-08 06:54:21 5.103.139.163    3.300e-05 -5.872e-04  2.386e-05 -1.639e-07  4.199e-08 1.9e-02   7   0   6 -0.12
2020-02-08 06:57:04 216.239.35.12    2.103e-04  6.351e-04  1.324e-04 -8.530e-08  1.393e-07 5.7e+00  33  24   5  0.00
2020-02-08 07:01:50 162.159.200.123  9.442e-05 -1.102e-03  7.385e-05 -1.899e-07  1.171e-07 2.4e-02   7   0   4  0.00
2020-02-08 07:06:06 162.159.200.1    1.024e-04 -1.063e-03  6.399e-05 -1.728e-07  2.040e-07 2.6e-01   8   2   4  0.00
2020-02-08 07:06:45 79.160.13.250    4.592e-05 -4.902e-03  2.941e-05 -1.632e-07  6.124e-08 1.4e-01   7   0   5  0.00
2020-02-08 07:10:40 216.239.35.0     1.194e-04 -7.127e-04  6.623e-05 -1.214e-07  6.679e-08 1.2e-01  10   0   7  0.00
2020-02-08 07:11:42 5.103.139.163    1.181e-04 -1.179e-04  6.871e-05 -6.905e-08  2.203e-07 1.6e+00   8   2   3 -0.11
2020-02-08 07:14:23 216.239.35.12    1.694e-04  7.573e-04  1.270e-04 -9.549e-08  2.321e-07 2.7e-01  10   3   5  0.00
2020-02-08 07:19:07 162.159.200.123  1.026e-04 -1.077e-03  7.411e-05 -2.125e-07  2.302e-07 4.3e-01   8   2   5  0.00
2020-02-08 07:23:21 162.159.200.1    8.585e-05 -8.727e-04  5.442e-05 -1.381e-07  1.187e-07 3.6e-02   7   0   5  0.00
 
