CEPH in WARN state for 54 min

Binary Bandit

Hi All,
So I noticed our 3-node Ceph / Proxmox cluster in a WARN state a few minutes ago. By the time I started investigating, the issue had resolved itself.
---------------------
Problem started at 04:03:42 on 2019.07.06
Problem name: Ceph cluster in WARN state
Host: AIT2 (ProxMox Cluster Node 2)
Severity: High

Original problem ID: 5742
-----------------------------
and then 54 min later ...
-----------------------------
Problem has been resolved at 04:57:43 on 2019.07.06
Problem name: Ceph cluster in WARN state
Host: AIT2 (ProxMox Cluster Node 2)
Severity: High

Original problem ID: 5742
-----------------------------

We monitor the underlying Dell hardware; it didn't / doesn't show any issues.

I'm looking for suggestions as to what might have caused this / what logs to look into, etc.

thanks all,

James
 
Mhh, do you know how to manage Ceph? It's easy to check what happened: use 'ceph -s'. Also check the Ceph tab in PVE and go to the logs, or check the logs at the CLI.

What do your monitoring messages tell us? Nothing, absolutely nothing. On their own they aren't helpful.
 
I should have noted that those emails were from Zabbix at 9:03 and 9:57 PM local time ... about 30 minutes after the 7.1 quake that hit Southern California. These servers are about a 1.5 hour drive from the epicenter but there was still plenty of movement.

I did check PVE / the CLI. It showed everything OK by the time that I got to it.

Well shoot ... it's not nearly as fun to think about but here's what's in the logs:

2019-07-05 21:00:00.000168 mon.ait1 mon.0 172.20.4.12:6789/0 129792 : cluster [WRN] overall HEALTH_WARN clock skew detected on mon.ait3

2019-07-05 21:02:00.511592 mon.ait1 mon.0 172.20.4.12:6789/0 129809 : cluster [WRN] mon.2 172.20.4.14:6789/0 clock skew 0.065839s > max 0.05s

...

2019-07-05 21:52:00.523683 mon.ait1 mon.0 172.20.4.12:6789/0 130209 : cluster [WRN] mon.2 172.20.4.14:6789/0 clock skew 0.0799283s > max 0.05s

2019-07-05 21:57:04.885399 mon.ait1 mon.0 172.20.4.12:6789/0 130239 : cluster [INF] Health check cleared: MON_CLOCK_SKEW (was: clock skew detected on mon.ait3)
2019-07-05 21:57:04.885446 mon.ait1 mon.0 172.20.4.12:6789/0 130240 : cluster [INF] Cluster is now healthy
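For reference, the skew readings in those [WRN] lines can be pulled out and compared against Ceph's default monitor clock-drift limit of 0.05 s. A minimal sketch, using the two warning lines quoted above as sample input (the awk field names and output format here are just for illustration):

```shell
# Extract the "clock skew" readings from the posted ceph.log lines and flag
# any that exceed Ceph's default mon clock-drift limit of 0.05 s.
log='2019-07-05 21:02:00.511592 mon.ait1 mon.0 172.20.4.12:6789/0 129809 : cluster [WRN] mon.2 172.20.4.14:6789/0 clock skew 0.065839s > max 0.05s
2019-07-05 21:52:00.523683 mon.ait1 mon.0 172.20.4.12:6789/0 130209 : cluster [WRN] mon.2 172.20.4.14:6789/0 clock skew 0.0799283s > max 0.05s'

result="$(printf '%s\n' "$log" | awk '/clock skew/ {
    # the value follows the literal word "skew", e.g. "0.065839s"
    for (i = 1; i <= NF; i++) if ($i == "skew") skew = $(i + 1)
    sub(/s$/, "", skew)   # strip the trailing "s" unit
    printf "%s %s skew=%s over=%s\n", $1, $2, skew, (skew + 0 > 0.05 ? "yes" : "no")
}')"
printf '%s\n' "$result"
```

Both samples come out over the limit, which matches the HEALTH_WARN you saw.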


This cluster has been running for about a year with no time-sync issues. Any thoughts on this? Wait and watch? Check / change how it syncs time? Other thoughts?
 
That's fairly typical with Ceph; it only takes 50 ms of skew between any two monitors to throw a warning. Ideally, all three monitor nodes should sync time via NTP to a local NTP server, and that local server should in turn sync to e.g. pool.ntp.org, rather than each of your mons syncing to some remote NTP server directly. The accuracy of the time isn't super critical; what you're worried about is the consistency of the time. If all of your mons are drifting by exactly 150 ms, that's not an issue, but if each of them is drifting by 30 ms with one ahead and another behind, then it's throwing warnings.
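A sketch of that layout with chrony, assuming a local time server reachable at 172.20.4.1 (a hypothetical address for this example):

```
# /etc/chrony/chrony.conf on each mon node (ait1/ait2/ait3)
# assumption: a local NTP server sits at 172.20.4.1 (hypothetical address)
server 172.20.4.1 iburst
# let chrony step the clock on the first few measurements after boot
makestep 1.0 3

# /etc/chrony/chrony.conf on the local NTP server itself
pool pool.ntp.org iburst
# serve time to the cluster subnet
allow 172.20.4.0/24
```

Because every mon chases the same local source, they stay consistent with each other even if that source wanders slightly from true time.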
 