Last Proxmox cluster node updated causes critical problems in Ceph

Hello.

This is the second time this has happened. I don't remember if the last time was when upgrading from Proxmox 5 to 6.

We have 6 Proxmox servers in a cluster, and each of them is also used for Ceph storage.

Both times we migrated all VMs off the Proxmox node, updated it, rebooted, and migrated back.

Both times, everything went great until the reboot of the last node. After that, Ceph started complaining about every OSD being slow.
The problem here is that Ceph just waits for the OSDs, which in turn makes all VMs hang on I/O wait against Ceph.
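When it happens, the hanging ops can at least be inspected like this (a sketch; osd.20 is just an example ID, and the daemon command has to run on the node hosting that OSD):
Code:
# list health problems, including which OSDs report slow ops
ceph health detail
# dump the ops a specific OSD is currently stuck on (run locally on the node hosting osd.20)
ceph daemon osd.20 dump_ops_in_flight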

Both times, the solution was to reboot the last server one more time.

Has anyone else experienced this?

What logs should I pull before they are gone? This was on Friday afternoon.

Thanks :)
 
What logs should I pull before they are gone? This was on Friday afternoon.
All of them. Best use a central log server for long-term storage.
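A minimal sketch of forwarding the node logs to a central host with rsyslog (the log server name and port are placeholders, and it assumes rsyslog is already running on the nodes):
Code:
# forward all syslog messages over TCP to a central log host (placeholder name/port)
echo '*.* @@logserver.example.com:514' > /etc/rsyslog.d/60-central.conf
systemctl restart rsyslog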

Both times, everything went great until the reboot of the last node. After that, Ceph started complaining about every OSD being slow.
The problem here is that Ceph just waits for the OSDs, which in turn makes all VMs hang on I/O wait against Ceph.
This sounds like something with the replica settings, CRUSH, or networking. Best describe your cluster setup and configuration.
 
All of them. Best use a central log server for long-term storage.


This sounds like something with the replica settings, CRUSH, or networking. Best describe your cluster setup and configuration.

(screenshot attachment: 1605619541207.png)

Ceph runs on redundant 10 Gbit: 10.10.10.0/24.
Our servers run on 1 Gbit: 10.0.0.0/24.

Each server has 4 SSDs that run Ceph.

I have pulled all logs from all 6 servers. It's about 400 MB total; which logs do you want?

Thanks :)

Edit: Managed to zip them all down to 156 MB (removed lastlog, which is unreadable anyway).
 
And how is Ceph & the network configured?
 
Each server is configured with dual network cards: 10 Gbit for storage and 1 Gbit for everything else.
We use an active-backup bond for the cards.
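The current state of such a bond can be checked on each node, e.g. (assuming the bond device is called bond1, as mentioned further down):
Code:
# show bonding mode, currently active slave, and link state of each NIC
cat /proc/net/bonding/bond1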

Ceph is set up using the GUI in Proxmox; the only thing we have tried tinkering with is "osd_memory_target".

4x 480 GB Samsung SM863a in each server, one OSD per SSD. Monitor and manager run on all servers except Proxmox4 (which uses some hybrid SSHD boot disks that cause errors if it runs a Ceph monitor or manager: https://forum.proxmox.com/threads/ceph-mon-proxmox4-has-slow-ops-mon-proxmox4-crashed.66804/).

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.10.10.0/24
fsid = 1f6d8776-39b3-44c6-b484-111d3c8b8372
mon_allow_pool_delete = true
mon_host = 10.10.10.11 10.10.10.13 10.10.10.12 10.10.10.15 10.10.10.16
osd_journal_size = 5120
osd_memory_target = 2073741824
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.10.10.0/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring

[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd
device 6 osd.6 class ssd
device 7 osd.7 class ssd
device 8 osd.8 class ssd
device 9 osd.9 class ssd
device 10 osd.10 class ssd
device 11 osd.11 class ssd
device 12 osd.12 class ssd
device 13 osd.13 class ssd
device 14 osd.14 class ssd
device 15 osd.15 class ssd
device 16 osd.16 class ssd
device 17 osd.17 class ssd
device 18 osd.18 class ssd
device 19 osd.19 class ssd
device 20 osd.20 class ssd
device 21 osd.21 class ssd
device 22 osd.22 class ssd
device 23 osd.23 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host proxmox1 {
id -3 # do not change unnecessarily
id -4 class ssd # do not change unnecessarily
id -11 class hdd # do not change unnecessarily
# weight 1.746
alg straw2
hash 0 # rjenkins1
item osd.0 weight 0.437
item osd.1 weight 0.437
item osd.12 weight 0.436
item osd.4 weight 0.436
}
host proxmox3 {
id -5 # do not change unnecessarily
id -6 class ssd # do not change unnecessarily
id -12 class hdd # do not change unnecessarily
# weight 1.746
alg straw2
hash 0 # rjenkins1
item osd.2 weight 0.437
item osd.3 weight 0.437
item osd.14 weight 0.436
item osd.8 weight 0.436
}
host proxmox9 {
id -7 # do not change unnecessarily
id -8 class ssd # do not change unnecessarily
id -13 class hdd # do not change unnecessarily
# weight 0.000
alg straw2
hash 0 # rjenkins1
}
host proxmox2 {
id -9 # do not change unnecessarily
id -10 class ssd # do not change unnecessarily
id -14 class hdd # do not change unnecessarily
# weight 1.746
alg straw2
hash 0 # rjenkins1
item osd.6 weight 0.436
item osd.7 weight 0.436
item osd.13 weight 0.436
item osd.5 weight 0.436
}
host proxmox4 {
id -16 # do not change unnecessarily
id -17 class ssd # do not change unnecessarily
id -18 class hdd # do not change unnecessarily
# weight 1.746
alg straw2
hash 0 # rjenkins1
item osd.9 weight 0.436
item osd.10 weight 0.436
item osd.11 weight 0.436
item osd.15 weight 0.436
}
host proxmox5 {
id -19 # do not change unnecessarily
id -20 class ssd # do not change unnecessarily
id -21 class hdd # do not change unnecessarily
# weight 1.746
alg straw2
hash 0 # rjenkins1
item osd.16 weight 0.436
item osd.17 weight 0.436
item osd.18 weight 0.436
item osd.19 weight 0.436
}
host proxmox6 {
id -22 # do not change unnecessarily
id -23 class ssd # do not change unnecessarily
id -24 class hdd # do not change unnecessarily
# weight 1.746
alg straw2
hash 0 # rjenkins1
item osd.20 weight 0.436
item osd.21 weight 0.436
item osd.22 weight 0.436
item osd.23 weight 0.436
}
root default {
id -1 # do not change unnecessarily
id -2 class ssd # do not change unnecessarily
id -15 class hdd # do not change unnecessarily
# weight 10.476
alg straw2
hash 0 # rjenkins1
item proxmox1 weight 1.746
item proxmox3 weight 1.746
item proxmox9 weight 0.000
item proxmox2 weight 1.746
item proxmox4 weight 1.746
item proxmox5 weight 1.746
item proxmox6 weight 1.746
}

# rules
rule replicated_rule {
id 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule ssdregel {
id 1
type replicated
min_size 1
max_size 10
step take default class ssd
step chooseleaf firstn 0 type host
step emit
}
rule hddregel {
id 2
type replicated
min_size 1
max_size 10
step take default class hdd
step chooseleaf firstn 0 type host
step emit
}

# end crush map

BTW, here is the last time this happened:
https://forum.proxmox.com/threads/ceph-in-critical-condition-after-upgrade-in-production.65471/
 
Which one was the last node that you rebooted?
 
Ceph runs on redundant 10 Gbit: 10.10.10.0/24.
Are the bonds on all of the nodes configured correctly? And are all the switch ports configured the same? If there is some difference, e.g. MTU, then Ceph will hiccup badly.
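One way to verify that jumbo frames actually pass end-to-end between the nodes (a sketch; the target IP is an example from the storage network, and 8972 = 9000 minus 28 bytes of IP/ICMP headers):
Code:
# send non-fragmentable 9000-byte frames across the storage network
# repeat between every pair of nodes (and via both switch paths of the bond)
ping -M do -s 8972 -c 3 10.10.10.12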
 
Are the bonds on all of the nodes configured correctly? And are all the switch ports configured the same? If there is some difference, e.g. MTU, then Ceph will hiccup badly.
On all servers, bond1 (the 10 Gbit bond) has MTU 9000 configured, and this is also enabled on the switch.
 
Oh, sorry, I forgot. Earlier that day we did an OSD swap on Proxmox1, and we changed 1/2 boot disks in both Proxmox1 and Proxmox2.
The hardware upgrade was around 13:00-14:00. The hardware job had no problems.

We started updating the servers from around 16:00, I think. All on Friday, November 13.

From ceph.log.4, from Proxmox6:
2020-11-13 17:43:56.400096 osd.20 (osd.20) 41 : cluster [WRN] slow request osd_op(client.137799463.0:24275 15.2e0 15.f36612e0 (undecoded) ondisk+write+known_if_redirected e6938) initiated 2020-11-13 17:42:56.334132 currently delayed

I think it was around 17:43 that things started locking up.
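To get an overview of when and where the slow requests piled up, the cluster log can be summarized roughly like this (a sketch; in these log lines the third field is the reporting daemon):
Code:
# count slow-request warnings per reporting OSD across the rotated cluster logs
zgrep -h 'slow request' /var/log/ceph/ceph.log* | awk '{print $3}' | sort | uniq -c | sort -rn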
 
Last edited:
All servers are set up the same, with the same NTP server in /etc/systemd/timesyncd.conf.
After the reboot of a server, Ceph usually complains about clock skew, but after a minute it corrects itself using the NTP server.
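A quick way to check the sync state right after a reboot (a sketch; timesync-status needs a reasonably recent systemd):
Code:
# show whether the system clock is NTP-synchronized and which server timesyncd uses
timedatectl
timedatectl timesync-status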

Proxmox6 shut down at 17:37 and was back online at 17:42.

I see that the clocks are important, but when I rebooted proxmox6, proxmox1 was already the manager. The clocks on the other servers were synchronized by then, and all scrubbing and deep scrubbing was done.
 
Looks like it happens when proxmox6 comes back online.
Adding shorter ceph.log excerpts from the perspective of proxmox1, proxmox2, and proxmox6.

Thanks for helping :)

EDIT: If proxmox6 was the manager when it shut off, does proxmox6 still think it's the boss when the server comes back online?
 

Attachments

The clock skew starts immediately after the MONs have done their new election (MON went down).
Code:
2020-11-13 13:15:31.492797 mon.proxmox1 (mon.0) 3 : cluster [INF] mon.proxmox1 is new leader, mons proxmox1,proxmox3,proxmox2,proxmox6 in quorum (ranks 0,1,2,4)
2020-11-13 13:15:31.893849 mon.proxmox1 (mon.0) 4 : cluster [WRN] 1 clock skew 0.0575201s > max 0.05s
2020-11-13 13:15:31.893890 mon.proxmox1 (mon.0) 5 : cluster [WRN] 2 clock skew 0.0574857s > max 0.05s
2020-11-13 13:15:31.893910 mon.proxmox1 (mon.0) 6 : cluster [WRN] 4 clock skew 0.05727s > max 0.05s

How often are the nodes (MONs) restarted? It starts to look like a flapping cluster state.
Code:
2020-11-13 13:15:31.893849 mon.proxmox1 (mon.0) 4 : cluster [WRN] 1 clock skew 0.0575201s > max 0.05s
2020-11-13 13:15:31.893890 mon.proxmox1 (mon.0) 5 : cluster [WRN] 2 clock skew 0.0574857s > max 0.05s
2020-11-13 13:15:31.893910 mon.proxmox1 (mon.0) 6 : cluster [WRN] 4 clock skew 0.05727s > max 0.05s
2020-11-13 13:15:32.043621 mon.proxmox1 (mon.0) 11 : cluster [WRN] Health check failed: clock skew detected on mon.proxmox3, mon.proxmox2, mon.proxmox6 (MON_CLOCK_SKEW)
2020-11-13 13:15:35.418446 mon.proxmox1 (mon.0) 18 : cluster [WRN] 2 clock skew 0.144214s > max 0.05s
2020-11-13 13:15:35.418844 mon.proxmox1 (mon.0) 19 : cluster [WRN] 1 clock skew 0.143686s > max 0.05s
2020-11-13 13:15:35.420147 mon.proxmox1 (mon.0) 20 : cluster [WRN] 3 clock skew 0.142252s > max 0.05s
2020-11-13 13:15:35.423272 mon.proxmox1 (mon.0) 25 : cluster [WRN] Health check update: clock skew detected on mon.proxmox3, mon.proxmox2, mon.proxmox5 (MON_CLOCK_SKEW)
2020-11-13 13:15:35.734534 mon.proxmox1 (mon.0) 27 : cluster [WRN] overall HEALTH_WARN noout flag(s) set; 4 osds down; 1 host (4 osds) down; Degraded data redundancy: 134201/785067 objects degraded (17.094%), 525 pgs degraded, 525 pgs undersized; clock skew detected on mon.proxmox3, mon.proxmox2, mon.proxmox5
2020-11-13 13:16:10.051045 mon.proxmox1 (mon.0) 78 : cluster [INF] Health check cleared: MON_CLOCK_SKEW (was: clock skew detected on mon.proxmox3, mon.proxmox2, mon.proxmox5)
2020-11-13 13:35:23.537330 mon.proxmox1 (mon.0) 4 : cluster [WRN] 4 clock skew 0.110642s > max 0.05s
2020-11-13 13:35:23.564275 mon.proxmox1 (mon.0) 5 : cluster [WRN] 1 clock skew 0.0653874s > max 0.05s
2020-11-13 13:35:23.578514 mon.proxmox1 (mon.0) 10 : cluster [WRN] Health check failed: clock skew detected on mon.proxmox3, mon.proxmox6 (MON_CLOCK_SKEW)
2020-11-13 13:35:23.606302 mon.proxmox1 (mon.0) 12 : cluster [WRN] overall HEALTH_WARN noout flag(s) set; 4 osds down; 1 host (4 osds) down; Degraded data redundancy: 134204/785076 objects degraded (17.094%), 525 pgs degraded, 525 pgs undersized; clock skew detected on mon.proxmox3, mon.proxmox6; 2/5 mons down, quorum proxmox1,proxmox3,proxmox6
2020-11-13 13:35:29.806526 mon.proxmox1 (mon.0) 16 : cluster [WRN] 2 clock skew 0.113288s > max 0.05s
2020-11-13 13:35:29.808487 mon.proxmox1 (mon.0) 17 : cluster [WRN] 4 clock skew 0.111696s > max 0.05s
2020-11-13 13:35:29.812144 mon.proxmox1 (mon.0) 22 : cluster [WRN] Health check update: clock skew detected on mon.proxmox2, mon.proxmox6 (MON_CLOCK_SKEW)
2020-11-13 13:35:29.821844 mon.proxmox1 (mon.0) 24 : cluster [WRN] 1 clock skew 0.110051s > max 0.05s
2020-11-13 13:35:29.871455 mon.proxmox1 (mon.0) 30 : cluster [WRN] overall HEALTH_WARN noout flag(s) set; 4 osds down; 1 host (4 osds) down; Degraded data redundancy: 134204/785076 objects degraded (17.094%), 525 pgs degraded, 525 pgs undersized; clock skew detected on mon.proxmox2, mon.proxmox6; 1/5 mons down, quorum proxmox1,proxmox3,proxmox2,proxmox6
2020-11-13 13:35:30.821030 mon.proxmox1 (mon.0) 47 : cluster [WRN] 4 clock skew 0.11069s > max 0.05s
2020-11-13 13:35:30.821703 mon.proxmox1 (mon.0) 48 : cluster [WRN] 3 clock skew 0.110143s > max 0.05s
2020-11-13 13:35:30.826549 mon.proxmox1 (mon.0) 49 : cluster [WRN] 1 clock skew 0.104985s > max 0.05s
2020-11-13 13:35:30.832412 mon.proxmox1 (mon.0) 54 : cluster [WRN] Health check update: clock skew detected on mon.proxmox3, mon.proxmox5, mon.proxmox6 (MON_CLOCK_SKEW)
2020-11-13 13:35:30.839832 mon.proxmox1 (mon.0) 56 : cluster [WRN] 2 clock skew 0.102911s > max 0.05s
2020-11-13 13:35:30.866083 mon.proxmox1 (mon.0) 57 : cluster [WRN] overall HEALTH_WARN noout flag(s) set; 3 osds down; Degraded data redundancy: 134204/785076 objects degraded (17.094%), 525 pgs degraded, 525 pgs undersized; clock skew detected on mon.proxmox3, mon.proxmox5, mon.proxmox6
2020-11-13 13:35:36.497114 mon.proxmox1 (mon.0) 70 : cluster [WRN] Health check update: clock skew detected on mon.proxmox3, mon.proxmox2, mon.proxmox5, mon.proxmox6 (MON_CLOCK_SKEW)
2020-11-13 13:36:01.501821 mon.proxmox1 (mon.0) 93 : cluster [INF] Health check cleared: MON_CLOCK_SKEW (was: clock skew detected on mon.proxmox3, mon.proxmox2, mon.proxmox5, mon.proxmox6)
2020-11-13 13:56:44.101584 mon.proxmox1 (mon.0) 1251 : cluster [WRN] 2 clock skew 0.410624s > max 0.05s
2020-11-13 13:56:46.973946 mon.proxmox1 (mon.0) 1272 : cluster [WRN] Health check failed: clock skew detected on mon.proxmox2 (MON_CLOCK_SKEW)
2020-11-13 13:57:16.993482 mon.proxmox1 (mon.0) 1297 : cluster [INF] Health check cleared: MON_CLOCK_SKEW (was: clock skew detected on mon.proxmox2)
2020-11-13 16:57:52.322886 mon.proxmox1 (mon.0) 941 : cluster [WRN] 2 clock skew 0.48884s > max 0.05s
2020-11-13 16:57:52.356645 mon.proxmox1 (mon.0) 946 : cluster [WRN] Health check failed: clock skew detected on mon.proxmox2 (MON_CLOCK_SKEW)
2020-11-13 16:57:52.386977 mon.proxmox1 (mon.0) 948 : cluster [WRN] overall HEALTH_WARN noout flag(s) set; 4 osds down; 1 host (4 osds) down; Degraded data redundancy: 126548/784791 objects degraded (16.125%), 496 pgs degraded, 496 pgs undersized; clock skew detected on mon.proxmox2
2020-11-13 16:58:23.957961 mon.proxmox1 (mon.0) 991 : cluster [INF] Health check cleared: MON_CLOCK_SKEW (was: clock skew detected on mon.proxmox2)
2020-11-13 17:03:11.187340 mon.proxmox1 (mon.0) 2025 : cluster [WRN] 1 clock skew 0.178602s > max 0.05s
2020-11-13 17:03:14.024831 mon.proxmox1 (mon.0) 2027 : cluster [WRN] Health check failed: clock skew detected on mon.proxmox3 (MON_CLOCK_SKEW)
2020-11-13 17:03:44.048587 mon.proxmox1 (mon.0) 2075 : cluster [INF] Health check cleared: MON_CLOCK_SKEW (was: clock skew detected on mon.proxmox3)
2020-11-13 17:36:52.833655 mon.proxmox1 (mon.0) 2765 : cluster [WRN] 3 clock skew 0.412501s > max 0.05s
2020-11-13 17:36:52.839099 mon.proxmox1 (mon.0) 2770 : cluster [WRN] Health check failed: clock skew detected on mon.proxmox5 (MON_CLOCK_SKEW)
2020-11-13 17:36:52.860791 mon.proxmox1 (mon.0) 2772 : cluster [WRN] overall HEALTH_WARN noout flag(s) set; 4 osds down; 1 host (4 osds) down; Degraded data redundancy: 125690/784812 objects degraded (16.015%), 492 pgs degraded, 492 pgs undersized; clock skew detected on mon.proxmox5
2020-11-13 17:37:24.722537 mon.proxmox1 (mon.0) 2813 : cluster [INF] Health check cleared: MON_CLOCK_SKEW (was: clock skew detected on mon.proxmox5)
2020-11-13 17:42:52.669619 mon.proxmox1 (mon.0) 4054 : cluster [WRN] 4 clock skew 0.303895s > max 0.05s
2020-11-13 17:42:52.735804 mon.proxmox1 (mon.0) 4059 : cluster [WRN] Health check failed: clock skew detected on mon.proxmox6 (MON_CLOCK_SKEW)
2020-11-13 17:42:52.777560 mon.proxmox1 (mon.0) 4063 : cluster [WRN] overall HEALTH_WARN noout flag(s) set; Degraded data redundancy: 132401/784812 objects degraded (16.870%), 519 pgs degraded, 519 pgs undersized; clock skew detected on mon.proxmox6
2020-11-13 17:43:24.848291 mon.proxmox1 (mon.0) 4117 : cluster [INF] Health check cleared: MON_CLOCK_SKEW (was: clock skew detected on mon.proxmox6)
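To see how often the MONs were actually (re)started, the journal on each node can be checked, something like this (a sketch; the unit name assumes the standard ceph-mon@<hostname> naming):
Code:
# list start/stop events of the local monitor daemon around the incident
journalctl -u ceph-mon@proxmox6.service --since "2020-11-13" --until "2020-11-14" | grep -iE 'starting|started|stopp'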
 
The clock skew starts immediately after the MONs have done their new election (MON went down).


How often are the nodes (MONs) restarted? It starts to look like a flapping cluster state.

We fixed some hardware earlier that day, around 13:00 to 14:00.
Proxmox1 and 2 were restarted in this period.

The update of all the servers (one by one: upgrade, restart, and waiting for Ceph to go green) ran from around 16:00-16:30, I think, till around 18:00.
After each restart it complains about the clock, but this corrects itself within a minute after boot.
 
Hm... try to sync the RTC in the BIOS with the time of the node. That may help to alleviate the clock skew.
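A sketch of how to do that from a running, NTP-synced node (writes the current system time into the hardware clock):
Code:
# copy the current (NTP-synced) system time into the RTC/BIOS clock
hwclock --systohc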
 
I will try that, thanks.
One last question.
If the last server is the Ceph master at shutdown, does that server still think it is the master after coming back online?
Because if the master has a clock skew, it will tell the system that everyone else has a clock skew.
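Which MON currently leads the quorum can be checked like this (a sketch; the interesting field is quorum_leader_name):
Code:
# show the current quorum and its leader
ceph quorum_status --format json-pretty | grep -E 'quorum_leader_name|"name"'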
 
We fixed some hardware earlier that day, around 13:00 to 14:00.
Proxmox1 and 2 were restarted in this period.

The update of all the servers (one by one: upgrade, restart, and waiting for Ceph to go green) ran from around 16:00-16:30, I think, till around 18:00.
After each restart it complains about the clock, but this corrects itself within a minute after boot.
You can also try installing chrony as the NTP client (apt install chrony). It resyncs the clock much faster (maybe 1 s max) than systemd-timesyncd.
To avoid clock drift, you can also check in your BIOS that the power profile always runs the CPU at max clock frequency.
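A minimal sketch of switching a node over to chrony (disabling timesyncd explicitly so the two don't compete; on many setups the chrony package takes care of that itself):
Code:
# replace systemd-timesyncd with chrony and verify the sync state
systemctl disable --now systemd-timesyncd
apt install chrony
chronyc tracking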
 
