[SOLVED] ceph clock skew issue - no way out?

Knuuut

Member
Jun 7, 2018
Hi,

there are plenty of posts about clock skew issues within this forum. I'm affected too.

So far I've tried several approaches to keep 4 nodes with identical hardware permanently in sync, without success.

Even following the advice in this post https://forum.proxmox.com/threads/proxmoxve-ceph-clock-issue.20684/#post-105441 made things worse, so I switched back to timedatectl, where clock skew incidents are less frequent than with ntpd.

With ntpd, I've seen a high jitter value (>200ms) on every node.
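For anyone who wants to compare jitter across nodes, here is a minimal sketch that filters `ntpq -pn` output for peers above a threshold. It assumes the standard ten-column peer table, where the last column is jitter in milliseconds; the function name `high_jitter_peers` and the 200 ms default are illustrative choices, not part of ntpd.

```shell
#!/bin/sh
# Print peers from `ntpq -pn` output (read on stdin) whose jitter exceeds a limit.
# Assumption: standard peer table, last column is jitter in milliseconds.
high_jitter_peers() {
    limit_ms="${1:-200}"
    awk -v limit="$limit_ms" '
        NR <= 2 { next }                  # skip the header row and the ==== separator
        NF >= 10 && $NF + 0 > limit {     # compare the jitter column numerically
            peer = $1
            sub(/^[*+#ox.-]/, "", peer)   # strip the tally character, if any
            print peer, $NF
        }
    '
}
```

Usage would be something like `ntpq -pn | high_jitter_peers 200` on each node; no output means no peer is above the limit.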

With timedatectl, there is no way to get the jitter value afaik.

Maybe a change of the Linux clocksource from tsc to hpet would be a solution?

Any help would be appreciated.

Cheers Knuuut
 
Use a time source that is on the local ceph network and on hardware (not virtual).
 
I'm using the local ntp servers from my Datacenter-Provider.

Again, with the same ntp servers and with ntpd on the nodes, things got worse, and I can't reproduce this behavior on other hardware.

So, my guess is an unstable clocksource (tsc) on all 4 nodes...?

Does anybody have experience with switching the clocksource from tsc to hpet?
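Before switching, it may help to see which clocksource each node is actually using and which ones the kernel offers. A small sketch using the standard sysfs files; the function name `show_clocksource` and its optional directory argument are only there for illustration and testing.

```shell
#!/bin/sh
# Show the active and the available kernel clocksources.
# The optional directory argument defaults to the usual sysfs location.
show_clocksource() {
    dir="${1:-/sys/devices/system/clocksource/clocksource0}"
    echo "current:   $(cat "$dir/current_clocksource")"
    echo "available: $(cat "$dir/available_clocksource")"
}

# To switch at runtime (as root, not persistent across reboots):
#   echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource
# To see whether the kernel itself has complained about tsc:
#   dmesg | grep -i clocksource
```

A permanent change would normally go through the `clocksource=hpet` kernel boot parameter. Whether hpet actually helps on this hardware is untested here.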

Cheers Knuuut
 
We install ntp on each node, then edit /etc/ntp.conf to use our internet router as the ntp server (the router is pfSense running on physical hardware).
 
Try this.

Code:
echo "NTP=10.1.1.11 10.1.1.12 10.1.1.13" >> /etc/systemd/timesyncd.conf

timedatectl set-ntp true
systemctl restart systemd-timesyncd

systemctl status systemd-timesyncd

date

hwclock -w

Replace "10.1.1.11 10.1.1.12 10.1.1.13" with your own ntp server ip addresses
 
Try this.

Code:
echo "Servers=10.1.1.11 10.1.1.12 10.1.1.13" >> /etc/systemd/timesyncd.conf

timedatectl set-ntp true
systemctl restart systemd-timesyncd

systemctl status systemd-timesyncd

date

hwclock -w
Replace "10.1.1.11 10.1.1.12 10.1.1.13" with your own ntp server ip addresses

I think you mean "NTP=" instead of "Servers="

Anyway, that's like my current configuration.

We install ntp on each node, then edit /etc/ntp.conf to use our internet router as the ntp server (the router is pfSense running on physical hardware).

I've already tried this with the ntp servers inside my datacenter, and also with the Debian pool servers. This configuration never reached "health ok".
 
As I wrote before, I can't reproduce this on other (older) hardware.

So my focus is on the current (new) hardware:

Intel S2600STB mainboards with dual Xeons
Intel X520-DA2 dual 10Gb SFP+ nics
LSI 9341-4i

No exotic components at all

Any ideas anybody?
 
I had the same problem on my cluster for a few hours. My firewall was blocking port 123 (NTP), because I place all my servers in a management subnet with minimal access to the internet. Once that was fixed, I simply used the default time server, updated the time, and the issue was resolved.
 
There is no firewall issue here: both ntpq -pn (when running ntpd) and systemctl status systemd-timesyncd.service show that the nodes are reaching their time servers.
 
Finally, I solved this issue by myself.

What I did:

Set
Code:
NTP=0.debian.pool.ntp.org 1.debian.pool.ntp.org 2.debian.pool.ntp.org 3.debian.pool.ntp.org
in /etc/systemd/timesyncd.conf on every node.
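To confirm the setting really is active on every node, a rough check like the following can help. It only looks for an uncommented, non-empty `NTP=` line and does not verify that the line sits under the `[Time]` section; the function name `has_active_ntp` is made up for this sketch.

```shell
#!/bin/sh
# Succeed if the given timesyncd config contains an uncommented, non-empty NTP= line.
# Lines such as "#NTP=" or "FallbackNTP=..." do not count.
has_active_ntp() {
    conf="${1:-/etc/systemd/timesyncd.conf}"
    grep -Eq '^[[:space:]]*NTP=[^[:space:]]' "$conf"
}
```

For example, `has_active_ntp && echo "NTP servers configured"` on each node, followed by `systemctl restart systemd-timesyncd` if the file was changed.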

But the important part was running this:
Code:
hwclock -w
several times on every node.

No more clock skew issues since Friday.