Ceph integration - clock skew

pvps1

Hi,
we have a problem with a 4-node cluster running integrated Ceph (meaning the nodes are PVE and Ceph cluster in one).

3 nodes are ceph mons and osds, 2 of them report:
health HEALTH_WARN
clock skew detected on mon.1
Monitor clock skew detected

we cannot detect why. all nodes are running NTP and have accurate time.

any hints/tips?
networking bottleneck (only 1gbit networking)?

thx in advance,
Peter
 
Hi,
Sometimes after a reboot it can take a while until they are in sync again.
How long have you been getting this message?
 
Could it be that the nodes have different time zones?
 
Could it be that the nodes have different timezone?

No.
see (run with cssh on all 4 nodes):
root@dkcpn0001:~# cat /etc/timezone ; date
Europe/Vienna
Thu Oct 27 10:11:24 CEST 2016

root@dkcpn0002:~# cat /etc/timezone ; date
Europe/Vienna
Thu Oct 27 10:11:24 CEST 2016

root@dkcpn0003:~# cat /etc/timezone ; date
Europe/Vienna
Thu Oct 27 10:11:24 CEST 2016

root@dkcpr0001:~# cat /etc/timezone ; date
Europe/Vienna
Thu Oct 27 10:11:24 CEST 2016
 
Maybe try another NTP server that is close to your cluster.
 
This warning appears when your Ceph mon nodes are not in time sync (> 50 ms difference, and only between the mons).
Try:
Code:
date +"%T.%3N"

running cssh:

root@dkcpn0001:~# date +"%T.%3N"
14:16:05.967
root@dkcpn0002:~# date +"%T.%3N"
14:16:05.888
root@dkcpn0003:~# date +"%T.%3N"
14:16:05.967

don't know if the time difference can come from cssh runtime or the node's load.
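
To put a number on output like the above, here is a minimal sketch that turns the date +"%T.%3N" timestamps into milliseconds and prints the largest difference, which can then be compared with the 50 ms limit mentioned earlier. The function names to_ms and max_skew_ms are made up for this example, and it assumes GNU date output, timestamps from the same day (no midnight wrap), and that all samples were taken at the same instant, which cssh cannot fully guarantee.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Ceph or Proxmox.

# Convert HH:MM:SS.mmm (as printed by date +"%T.%3N") to ms since midnight.
to_ms() {
    echo "$1" | awk -F'[:.]' '{ print (($1 * 60 + $2) * 60 + $3) * 1000 + $4 }'
}

# Print the difference in ms between the earliest and latest timestamp given.
max_skew_ms() {
    min=""
    max=""
    for ts in "$@"; do
        ms=$(to_ms "$ts")
        if [ -z "$min" ] || [ "$ms" -lt "$min" ]; then min=$ms; fi
        if [ -z "$max" ] || [ "$ms" -gt "$max" ]; then max=$ms; fi
    done
    echo $((max - min))
}

# Usage with the three values from the cssh run above:
#   max_skew_ms 14:16:05.967 14:16:05.888 14:16:05.967
```

For the three mons above this gives 79 ms, so the reading is over the 50 ms limit even before deciding how much of it is cssh delay.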

all nodes sync to node dkcpr0001 which is part of the cluster (therefore 1 hop, switched)
 
Realized: it is always mon.1 that is reported. This is node pn0002, which really has a different time of ~100 ms (see last reply).
I did a manual resync with the internal time server, and immediately after that a date +"%T.%3N" again shows between 50 and 100 ms difference...

The node has no extraordinary load; on the contrary, it's the least used node.
Hmmm, I have to brush up on my ntpd knowledge to increase precision on that node or find the reason for the bias.
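
One way to start looking for the bias is the peer table from ntpq -p, where the offset and jitter columns are given in milliseconds. The following is only a sketch: ntp_peer_stats is a made-up helper name, it assumes the usual ten-column ntpq -p layout with two header lines, and reusing 50 ms as the limit is an assumption (that limit applies between the mons, while ntpq reports the offset against the time server).

```shell
#!/bin/sh
# Hypothetical helper: read `ntpq -p` output on stdin and flag peers whose
# absolute offset or jitter (both in ms) exceeds the limit (default 50).
ntp_peer_stats() {
    awk -v limit="${1:-50}" 'NR > 2 && NF >= 10 {
        off = $9 + 0
        if (off < 0) off = -off          # absolute offset
        flag = (off > limit || $10 + 0 > limit) ? "CHECK" : "ok"
        printf "%s offset=%s jitter=%s %s\n", $1, $9, $10, flag
    }'
}

# Usage on the suspect node:
#   ntpq -pn | ntp_peer_stats 50
```

The first character of each peer name is ntpq's tally code; a leading * marks the peer currently selected for synchronization.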
 
I am testing a Proxmox cluster with 3 nodes: 1 old HP and 2 new HPE servers. I don't know why, but the new servers do not sync correctly even though ntpd is running.
ntpq -p shows high network jitter, and I don't know why.
I also synced the new HPE servers against my old server: same problem, jitter is high.
Then I tested chrony, another NTP client, and it works!
chrony has an "Estimation of asymmetric jitter" feature, so maybe that was my problem.
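
If you go the chrony route, chronyc tracking reports how far the system clock is from NTP time. The sketch below pulls that value out as signed milliseconds (positive means fast, negative means slow). chrony_offset_ms is a made-up helper name, and it assumes the "System time" line has the usual form "<seconds> seconds fast|slow of NTP time".

```shell
#!/bin/sh
# Hypothetical helper: read `chronyc tracking` output on stdin and print the
# system clock offset in ms (positive = fast, negative = slow of NTP time).
chrony_offset_ms() {
    awk -F' *: *' '/^System time/ {
        split($2, part, " ")             # part[1]=seconds, part[3]=fast|slow
        ms = part[1] * 1000
        if (part[3] == "slow") ms = -ms
        printf "%.3f\n", ms
    }'
}

# Usage on each mon node:
#   chronyc tracking | chrony_offset_ms
```

Running this on every mon and comparing the values gives a rough idea of whether the nodes stay within the 50 ms that Ceph tolerates between monitors.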
 