Hello,
Together with Damien BROCHARD (Ozerim), we ran some extensive Ceph tests in our lab. Our running and functional lab setup: a simulated two-datacenter cluster with Ceph stretch mode enabled, plus a third DC consisting of a single PVE node acting as tiebreaker (Ceph) / quorum vote (Corosync).
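For context, here is roughly how such a setup is enabled; a minimal sketch only, with placeholder CRUSH bucket/rule names (dc1/dc2/dc3, stretch_rule), not our exact commands:
Code:
# place each monitor in its datacenter CRUSH bucket (likewise for the other mons)
ceph mon set_location training1-vsrv1 datacenter=dc1
ceph mon set_location training2-vsrv1 datacenter=dc2
ceph mon set_location training3-vsrv1 datacenter=dc3

# switch the mon election to the connectivity strategy, then enable stretch
# mode with the DC3 mon as tiebreaker and a rule replicating over both main DCs
ceph mon set election_strategy connectivity
ceph mon enable_stretch_mode training3-vsrv1 stretch_rule datacenter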
Our test: we cut the link between the two main DCs, while both can still talk to the third DC → netsplit.
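For anyone wanting to reproduce it: a rough software equivalent of cutting that link, assuming the DC2 node IPs from the membership output below, would be:
Code:
# on each DC1 node: drop all traffic to/from the DC2 nodes, while the
# DC3 tiebreaker (10.10.1.31) stays reachable from both sides
for ip in 10.10.1.21 10.10.1.22 10.10.1.23 10.10.1.24; do
    iptables -A INPUT  -s "$ip" -j DROP
    iptables -A OUTPUT -d "$ip" -j DROP
done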
In this situation, and we reproduced it several times: the partition that wins the PVE (Corosync) quorum is on one DC, and the partition that wins the Ceph quorum is on the other DC! (And in some rare cases, both end up on the same side but nothing responds.)
When that happens:
- If we shut down the DC that holds the Corosync quorum but not the working Ceph, the DC with the working Ceph comes back up.
- If we shut down the DC that does not hold the Corosync quorum but does have the working Ceph, nothing comes back up.
Two quorum algorithms, two different results, no coordination between them = whole cluster down.
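(Which side each algorithm elected is visible directly in the standard status outputs pasted below, e.g.:)
Code:
# Corosync view: is this partition quorate?
pvecm status | grep -E 'Quorate|Total votes'
# Ceph view: which mons currently form the quorum?
ceph quorum_status --format json-pretty | grep -A6 quorum_names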
Have you ever run into this situation? How do you resolve it?
Best Regards,
Damien & Gautier
Below: some troubleshooting info captured during the problem.
Code:
ssh root@training1-vsrv1
root@training1-vsrv1's password:
Linux training1-vsrv1 6.14.8-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.14.8-2 (2025-07-22T10:04Z) x86_64
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Wed Oct 1 20:48:16 2025 from 192.168.250.70
root@training1-vsrv1:~# pveceph status
  cluster:
    id:     3ee28d49-9366-463e-9296-190cfb3822d2
    health: HEALTH_WARN
            We are missing stretch mode buckets, only requiring 1 of 2 buckets to peer
            2/5 mons down, quorum training3-vsrv1,training2-vsrv1,training2-vsrv2
            1 datacenter (16 osds) down
            16 osds down
            4 hosts (16 osds) down
            Degraded data redundancy: 18508/37016 objects degraded (50.000%), 545 pgs degraded, 545 pgs undersized

  services:
    mon: 5 daemons, quorum training3-vsrv1,training2-vsrv1,training2-vsrv2 (age 13m), out of quorum: training1-vsrv1, training1-vsrv2
    mgr: training2-vsrv4(active, since 12m), standbys: training1-vsrv4, training1-vsrv2, training2-vsrv2
    osd: 32 osds: 16 up (since 12m), 32 in (since 57m)

  data:
    pools:   3 pools, 545 pgs
    objects: 9.25k objects, 28 GiB
    usage:   51 GiB used, 61 GiB / 112 GiB avail
    pgs:     18508/37016 objects degraded (50.000%)
             545 active+undersized+degraded
root@training1-vsrv1:~# pvecm status
Cluster information
-------------------
Name: training
Config Version: 9
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Wed Oct 1 20:51:14 2025
Quorum provider: corosync_votequorum
Nodes: 5
Node ID: 0x00000008
Ring ID: 1.281
Quorate: Yes
Votequorum information
----------------------
Expected votes: 9
Highest expected: 9
Total votes: 5
Quorum: 5
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.10.1.2
0x00000002 1 10.10.1.3
0x00000004 1 10.10.1.4
0x00000008 1 10.10.1.1 (local)
0x00000009 1 10.10.1.31
DC 1 (optiplex1):
- pvecm: quorate
- Ceph mons in quorum: on DC2 and DC3
- working Ceph OSDs: on DC2
========================================================================
Note: all the PVE servers have restarted and the GUI is very unstable
========================================================================
DC 2 (optiplex2):
- pvecm: activity blocked
- Ceph mons in quorum: on DC2 and DC3
- working Ceph OSDs: on DC2
ssh root@training2-vsrv1
root@training2-vsrv1's password:
Permission denied, please try again.
root@training2-vsrv1's password:
Permission denied, please try again.
root@training2-vsrv1's password:
Linux training2-vsrv1 6.14.8-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.14.8-2 (2025-07-22T10:04Z) x86_64
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Wed Sep 17 20:28:30 2025 from 192.168.250.70
root@training2-vsrv1:~# pveceph status
  cluster:
    id:     3ee28d49-9366-463e-9296-190cfb3822d2
    health: HEALTH_WARN
            We are missing stretch mode buckets, only requiring 1 of 2 buckets to peer
            2/5 mons down, quorum training3-vsrv1,training2-vsrv1,training2-vsrv2
            1 datacenter (16 osds) down
            16 osds down
            4 hosts (16 osds) down
            Degraded data redundancy: 18508/37016 objects degraded (50.000%), 545 pgs degraded, 545 pgs undersized

  services:
    mon: 5 daemons, quorum training3-vsrv1,training2-vsrv1,training2-vsrv2 (age 10m), out of quorum: training1-vsrv1, training1-vsrv2
    mgr: training2-vsrv4(active, since 10m), standbys: training1-vsrv4, training1-vsrv2, training2-vsrv2
    osd: 32 osds: 16 up (since 10m), 32 in (since 54m)

  data:
    pools:   3 pools, 545 pgs
    objects: 9.25k objects, 28 GiB
    usage:   51 GiB used, 61 GiB / 112 GiB avail
    pgs:     18508/37016 objects degraded (50.000%)
             545 active+undersized+degraded
root@training2-vsrv1:~# pvecm status
Cluster information
-------------------
Name: training
Config Version: 9
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Wed Oct 1 20:49:37 2025
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000007
Ring ID: 3.375
Quorate: No
Votequorum information
----------------------
Expected votes: 9
Highest expected: 9
Total votes: 4
Quorum: 5 Activity blocked
Flags:
Membership information
----------------------
Nodeid Votes Name
0x00000003 1 10.10.1.22
0x00000005 1 10.10.1.23
0x00000006 1 10.10.1.24
0x00000007 1 10.10.1.21 (local)
root@training2-vsrv1:~# ceph osd status
ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 training2-vsrv1 3757M 3406M 0 0 0 0 exists,up
1 training2-vsrv1 3720M 3443M 0 0 0 0 exists,up
2 training2-vsrv1 3054M 4109M 0 0 0 0 exists,up
3 training2-vsrv1 3223M 3940M 0 0 0 0 exists,up
4 training2-vsrv2 3149M 4014M 0 0 0 0 exists,up
5 training2-vsrv2 3202M 3961M 0 0 0 0 exists,up
6 training2-vsrv2 3554M 3609M 0 0 0 0 exists,up
7 training2-vsrv2 3714M 3449M 0 0 0 0 exists,up
8 training2-vsrv3 3553M 3610M 0 0 0 0 exists,up
9 training2-vsrv3 3549M 3614M 0 0 0 0 exists,up
10 training2-vsrv3 3669M 3494M 0 0 0 0 exists,up
11 training2-vsrv3 2790M 4373M 0 0 0 0 exists,up
12 training2-vsrv4 2890M 4273M 0 0 0 0 exists,up
13 training2-vsrv4 2919M 4244M 0 0 0 0 exists,up
14 training2-vsrv4 2774M 4389M 0 0 0 0 exists,up
15 training2-vsrv4 3077M 4086M 0 0 0 0 exists,up
16 0 0 0 0 0 0 exists
17 0 0 0 0 0 0 exists
18 0 0 0 0 0 0 exists
19 0 0 0 0 0 0 exists
20 0 0 0 0 0 0 exists
21 0 0 0 0 0 0 exists
22 0 0 0 0 0 0 exists
23 0 0 0 0 0 0 exists
24 0 0 0 0 0 0 exists
25 0 0 0 0 0 0 exists
26 0 0 0 0 0 0 exists
27 0 0 0 0 0 0 exists
28 0 0 0 0 0 0 exists
29 0 0 0 0 0 0 exists
30 0 0 0 0 0 0 exists
31 0 0 0 0 0 0 exists
========================================================================
Note: the DC1 OSDs are stopped (the daemons are no longer alive on the DC1 servers).
The OSDs crash; systemd tries to restart them, but the start limit (max retries) is reached and no further restart is attempted (see the sketch after the log excerpt below).
========================================================================
Oct 01 20:54:58 training1-vsrv3 systemd[1]: ceph-osd@20.service: Main process exited, code=exited, status=1/FAILURE
Oct 01 20:54:58 training1-vsrv3 systemd[1]: ceph-osd@20.service: Failed with result 'exit-code'.
Oct 01 20:54:58 training1-vsrv3 systemd[1]: ceph-osd@20.service: Consumed 1.824s CPU time, 192.6M memory peak.
Oct 01 20:55:02 training1-vsrv3 ceph-osd[7690]: 2025-10-01T20:55:02.103+0200 7f0cc7c76680 -1 osd.26 2832 unable to obtain rotating service keys; retrying
Oct 01 20:55:02 training1-vsrv3 ceph-osd[7690]: 2025-10-01T20:55:02.103+0200 7f0cc7c76680 -1 osd.26 2832 init wait_auth_rotating timed out -- maybe I have a clock skew against the monitors?
Oct 01 20:55:02 training1-vsrv3 systemd[1]: ceph-osd@26.service: Main process exited, code=exited, status=1/FAILURE
Oct 01 20:55:02 training1-vsrv3 systemd[1]: ceph-osd@26.service: Failed with result 'exit-code'.
Oct 01 20:55:02 training1-vsrv3 systemd[1]: ceph-osd@26.service: Consumed 1.995s CPU time, 227.8M memory peak.
Oct 01 20:55:02 training1-vsrv3 ceph-osd[7649]: 2025-10-01T20:55:02.183+0200 75feffa29680 -1 osd.16 2832 unable to obtain rotating service keys; retrying
Oct 01 20:55:02 training1-vsrv3 ceph-osd[7649]: 2025-10-01T20:55:02.183+0200 75feffa29680 -1 osd.16 2832 init wait_auth_rotating timed out -- maybe I have a clock skew against the monitors?
Oct 01 20:55:02 training1-vsrv3 systemd[1]: ceph-osd@16.service: Main process exited, code=exited, status=1/FAILURE
Oct 01 20:55:02 training1-vsrv3 systemd[1]: ceph-osd@16.service: Failed with result 'exit-code'.
Oct 01 20:55:02 training1-vsrv3 systemd[1]: ceph-osd@16.service: Consumed 1.846s CPU time, 186M memory peak.
Oct 01 20:55:08 training1-vsrv3 systemd[1]: ceph-osd@20.service: Scheduled restart job, restart counter is at 3.
Oct 01 20:55:08 training1-vsrv3 systemd[1]: ceph-osd@20.service: Start request repeated too quickly.
Oct 01 20:55:08 training1-vsrv3 systemd[1]: ceph-osd@20.service: Failed with result 'exit-code'.
Oct 01 20:55:08 training1-vsrv3 systemd[1]: Failed to start ceph-osd@20.service - Ceph object storage daemon osd.20.
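Once the link is back and the monitors are reachable again, those OSD units do not come back on their own because the systemd start limit was hit. An untested sketch of the manual recovery (using osd.20 from the log above):
Code:
# clear the failed/start-limit state, then start the OSD again
systemctl reset-failed ceph-osd@20.service
systemctl start ceph-osd@20.service

# optionally relax the start limit with a drop-in so the unit keeps retrying
mkdir -p /etc/systemd/system/ceph-osd@.service.d
printf '[Unit]\nStartLimitIntervalSec=0\n' > /etc/systemd/system/ceph-osd@.service.d/override.conf
systemctl daemon-reload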