CEPH in WARN state for 54 min

Binary Bandit

Hi All,
So I noticed our 3-node Ceph / Proxmox cluster in a WARN state a few minutes ago. By the time I started investigating, the issue had resolved itself.
---------------------
Problem started at 04:03:42 on 2019.07.06
Problem name: Ceph cluster in WARN state
Host: AIT2 (ProxMox Cluster Node 2)
Severity: High

Original problem ID: 5742
-----------------------------
and then 54 min later ...
-----------------------------
Problem has been resolved at 04:57:43 on 2019.07.06
Problem name: Ceph cluster in WARN state
Host: AIT2 (ProxMox Cluster Node 2)
Severity: High

Original problem ID: 5742
-----------------------------

We monitor the underlying Dell hardware; it didn't / doesn't show any issues.

I'm looking for suggestions as to what might have caused this / what logs to look into, etc.

thanks all,

James
 
Mhh, do you know how to manage Ceph? It's easy to check what happened: use 'ceph -s'. Also check the Ceph tab in PVE and go to the logs, or check the logs at the CLI.

What do your monitoring messages tell us? Nothing, absolutely nothing. On their own they aren't helpful.
 
I should have noted that those emails were from Zabbix at 9:03 and 9:57 PM local time ... about 30 minutes after the 7.1 quake that hit Southern California. These servers are about a 1.5 hour drive from the epicenter but there was still plenty of movement.

I did check PVE / the CLI. It showed everything OK by the time that I got to it.

Well shoot ... it's not nearly as fun to think about but here's what's in the logs:

2019-07-05 21:00:00.000168 mon.ait1 mon.0 172.20.4.12:6789/0 129792 : cluster [WRN] overall HEALTH_WARN clock skew detected on mon.ait3

2019-07-05 21:02:00.511592 mon.ait1 mon.0 172.20.4.12:6789/0 129809 : cluster [WRN] mon.2 172.20.4.14:6789/0 clock skew 0.065839s > max 0.05s

...

2019-07-05 21:52:00.523683 mon.ait1 mon.0 172.20.4.12:6789/0 130209 : cluster [WRN] mon.2 172.20.4.14:6789/0 clock skew 0.0799283s > max 0.05s

2019-07-05 21:57:04.885399 mon.ait1 mon.0 172.20.4.12:6789/0 130239 : cluster [INF] Health check cleared: MON_CLOCK_SKEW (was: clock skew detected on mon.ait3)
2019-07-05 21:57:04.885446 mon.ait1 mon.0 172.20.4.12:6789/0 130240 : cluster [INF] Cluster is now healthy
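For reference, the skew readings in those [WRN] lines can be pulled out and compared against Ceph's default monitor clock-drift limit of 0.05 s. A minimal sketch, using the two warning lines quoted above as sample input (the awk field names and output format here are just for illustration):

```shell
# Extract the "clock skew" readings from the posted ceph.log lines and flag
# any that exceed Ceph's default mon clock-drift limit of 0.05 s.
log='2019-07-05 21:02:00.511592 mon.ait1 mon.0 172.20.4.12:6789/0 129809 : cluster [WRN] mon.2 172.20.4.14:6789/0 clock skew 0.065839s > max 0.05s
2019-07-05 21:52:00.523683 mon.ait1 mon.0 172.20.4.12:6789/0 130209 : cluster [WRN] mon.2 172.20.4.14:6789/0 clock skew 0.0799283s > max 0.05s'

result="$(printf '%s\n' "$log" | awk '/clock skew/ {
    # the value follows the literal word "skew", e.g. "0.065839s"
    for (i = 1; i <= NF; i++) if ($i == "skew") skew = $(i + 1)
    sub(/s$/, "", skew)   # strip the trailing "s" unit
    printf "%s %s skew=%s over=%s\n", $1, $2, skew, (skew + 0 > 0.05 ? "yes" : "no")
}')"
printf '%s\n' "$result"
```

Both samples come out over the limit, which matches the HEALTH_WARN you saw.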


This cluster has been running for about a year with no time-sync issues. Any thoughts on this? Wait and watch? Check / change how it syncs time? Other thoughts?
 
That's fairly typical with Ceph; it only takes 50 ms of skew between any two monitors to throw a warning. Ideally, all three monitor nodes should sync time via NTP to a local NTP server, and that local server should in turn sync to e.g. pool.ntp.org, rather than each of your mons syncing to some remote NTP server directly. The accuracy of the time isn't super critical; what you're worried about is the consistency of the time. If all of your mons are drifting by exactly 150 ms, that's not an issue, but if each of them is drifting by 30 ms with one ahead and another behind, then it's throwing warnings.
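A sketch of that layout with chrony, assuming a local time server reachable at 172.20.4.1 (a hypothetical address for this example):

```
# /etc/chrony/chrony.conf on each mon node (ait1/ait2/ait3)
# assumption: a local NTP server sits at 172.20.4.1 (hypothetical address)
server 172.20.4.1 iburst
# let chrony step the clock on the first few measurements after boot
makestep 1.0 3

# /etc/chrony/chrony.conf on the local NTP server itself
pool pool.ntp.org iburst
# serve time to the cluster subnet
allow 172.20.4.0/24
```

Because every mon chases the same local source, they stay consistent with each other even if that source wanders slightly from true time.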
 