Ceph stretched cluster: the PVE partition is not the same as the Ceph partition!

ghusson

Hello,

With Damien BROCHARD (Ozerim), we ran some extensive Ceph tests in our lab. The lab is up and functional: we configured a simulation of a 2-DC cluster with Ceph stretched cluster mode, plus a 3rd DC hosting a single PVE node acting as tiebreaker (Ceph) / quorum vote (Corosync).
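For reference, enabling Ceph stretch mode generally involves commands along the following lines (a sketch based on the upstream Ceph stretch-mode documentation rather than a transcript of our exact setup; the datacenter names and the stretch_rule CRUSH rule are placeholders, and the rule itself has to be created in the CRUSH map beforehand):

Code:
# CRUSH buckets per datacenter, with the hosts moved underneath them
ceph osd crush add-bucket dc1 datacenter
ceph osd crush add-bucket dc2 datacenter
ceph osd crush move dc1 root=default
ceph osd crush move dc2 root=default
ceph osd crush move training1-vsrv1 datacenter=dc1
# ... and so on for the other hosts ...

# Monitors get a location and the connectivity election strategy
ceph mon set election_strategy connectivity
ceph mon set_location training1-vsrv1 datacenter=dc1
ceph mon set_location training2-vsrv1 datacenter=dc2
ceph mon set_location training3-vsrv1 datacenter=dc3

# Finally, enable stretch mode, naming the tiebreaker mon and the stretch CRUSH rule
ceph mon enable_stretch_mode training3-vsrv1 stretch_rule datacenter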

Our test: we cut the link between the two main DCs, but both can still talk to the 3rd DC => netsplit.

In this situation, what we reproduced several times: the elected PVE cluster partition is on one DC, and the elected Ceph cluster partition is on the other DC! (And in some rare cases, both end up on the same side, but nothing responds.)

When that happens:
- If we shut down the DC that has the Corosync vote but not the working Ceph, the one with the working Ceph comes back up.
- If we shut down the DC that does not have the Corosync vote but has the working Ceph, nothing comes back up.

Two algorithms, not the same result, no coordination = whole cluster down.

Have you ever encountered this situation? How would you resolve it?

Best Regards,
Damien & Gautier

Below is some troubleshooting info captured during the problem.



Code:
ssh root@training1-vsrv1
root@training1-vsrv1's password:
Linux training1-vsrv1 6.14.8-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.14.8-2 (2025-07-22T10:04Z) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Wed Oct  1 20:48:16 2025 from 192.168.250.70
root@training1-vsrv1:~# pveceph status
  cluster:
    id:     3ee28d49-9366-463e-9296-190cfb3822d2
    health: HEALTH_WARN
            We are missing stretch mode buckets, only requiring 1 of 2 buckets to peer
            2/5 mons down, quorum training3-vsrv1,training2-vsrv1,training2-vsrv2
            1 datacenter (16 osds) down
            16 osds down
            4 hosts (16 osds) down
            Degraded data redundancy: 18508/37016 objects degraded (50.000%), 545 pgs degraded, 545 pgs undersized
 
  services:
    mon: 5 daemons, quorum training3-vsrv1,training2-vsrv1,training2-vsrv2 (age 13m), out of quorum: training1-vsrv1, training1-vsrv2
    mgr: training2-vsrv4(active, since 12m), standbys: training1-vsrv4, training1-vsrv2, training2-vsrv2
    osd: 32 osds: 16 up (since 12m), 32 in (since 57m)
 
  data:
    pools:   3 pools, 545 pgs
    objects: 9.25k objects, 28 GiB
    usage:   51 GiB used, 61 GiB / 112 GiB avail
    pgs:     18508/37016 objects degraded (50.000%)
             545 active+undersized+degraded
 
root@training1-vsrv1:~# pvecm status
Cluster information
-------------------
Name:             training
Config Version:   9
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed Oct  1 20:51:14 2025
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          0x00000008
Ring ID:          1.281
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   9
Highest expected: 9
Total votes:      5
Quorum:           5
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.10.1.2
0x00000002          1 10.10.1.3
0x00000004          1 10.10.1.4
0x00000008          1 10.10.1.1 (local)
0x00000009          1 10.10.1.31


DC 1 (optiplex1):
- pvecm: quorate
- ceph mons in quorum: on DC2 and DC3
- ceph OSDs up: on DC2

========================================================================
Note: all PVE servers have restarted and the GUI is very unstable
========================================================================

DC 2 (optiplex2):
- pvecm: activity blocked (no quorum)
- ceph mons in quorum: on DC2 and DC3
- ceph OSDs up: on DC2


ssh root@training2-vsrv1
root@training2-vsrv1's password:
Permission denied, please try again.
root@training2-vsrv1's password:
Permission denied, please try again.
root@training2-vsrv1's password:
Linux training2-vsrv1 6.14.8-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.14.8-2 (2025-07-22T10:04Z) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Wed Sep 17 20:28:30 2025 from 192.168.250.70
root@training2-vsrv1:~# pveceph status
  cluster:
    id:     3ee28d49-9366-463e-9296-190cfb3822d2
    health: HEALTH_WARN
            We are missing stretch mode buckets, only requiring 1 of 2 buckets to peer
            2/5 mons down, quorum training3-vsrv1,training2-vsrv1,training2-vsrv2
            1 datacenter (16 osds) down
            16 osds down
            4 hosts (16 osds) down
            Degraded data redundancy: 18508/37016 objects degraded (50.000%), 545 pgs degraded, 545 pgs undersized
 
  services:
    mon: 5 daemons, quorum training3-vsrv1,training2-vsrv1,training2-vsrv2 (age 10m), out of quorum: training1-vsrv1, training1-vsrv2
    mgr: training2-vsrv4(active, since 10m), standbys: training1-vsrv4, training1-vsrv2, training2-vsrv2
    osd: 32 osds: 16 up (since 10m), 32 in (since 54m)
 
  data:
    pools:   3 pools, 545 pgs
    objects: 9.25k objects, 28 GiB
    usage:   51 GiB used, 61 GiB / 112 GiB avail
    pgs:     18508/37016 objects degraded (50.000%)
             545 active+undersized+degraded
 
root@training2-vsrv1:~# pvecm status
Cluster information
-------------------
Name:             training
Config Version:   9
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed Oct  1 20:49:37 2025
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000007
Ring ID:          3.375
Quorate:          No

Votequorum information
----------------------
Expected votes:   9
Highest expected: 9
Total votes:      4
Quorum:           5 Activity blocked
Flags:          

Membership information
----------------------
    Nodeid      Votes Name
0x00000003          1 10.10.1.22
0x00000005          1 10.10.1.23
0x00000006          1 10.10.1.24
0x00000007          1 10.10.1.21 (local)

root@training2-vsrv1:~# ceph osd status
ID  HOST              USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE    
 0  training2-vsrv1  3757M  3406M      0        0       0        0   exists,up
 1  training2-vsrv1  3720M  3443M      0        0       0        0   exists,up
 2  training2-vsrv1  3054M  4109M      0        0       0        0   exists,up
 3  training2-vsrv1  3223M  3940M      0        0       0        0   exists,up
 4  training2-vsrv2  3149M  4014M      0        0       0        0   exists,up
 5  training2-vsrv2  3202M  3961M      0        0       0        0   exists,up
 6  training2-vsrv2  3554M  3609M      0        0       0        0   exists,up
 7  training2-vsrv2  3714M  3449M      0        0       0        0   exists,up
 8  training2-vsrv3  3553M  3610M      0        0       0        0   exists,up
 9  training2-vsrv3  3549M  3614M      0        0       0        0   exists,up
10  training2-vsrv3  3669M  3494M      0        0       0        0   exists,up
11  training2-vsrv3  2790M  4373M      0        0       0        0   exists,up
12  training2-vsrv4  2890M  4273M      0        0       0        0   exists,up
13  training2-vsrv4  2919M  4244M      0        0       0        0   exists,up
14  training2-vsrv4  2774M  4389M      0        0       0        0   exists,up
15  training2-vsrv4  3077M  4086M      0        0       0        0   exists,up
16                      0      0       0        0       0        0   exists    
17                      0      0       0        0       0        0   exists    
18                      0      0       0        0       0        0   exists    
19                      0      0       0        0       0        0   exists    
20                      0      0       0        0       0        0   exists    
21                      0      0       0        0       0        0   exists    
22                      0      0       0        0       0        0   exists    
23                      0      0       0        0       0        0   exists    
24                      0      0       0        0       0        0   exists    
25                      0      0       0        0       0        0   exists    
26                      0      0       0        0       0        0   exists    
27                      0      0       0        0       0        0   exists    
28                      0      0       0        0       0        0   exists    
29                      0      0       0        0       0        0   exists    
30                      0      0       0        0       0        0   exists    
31                      0      0       0        0       0        0   exists    


========================================================================

========================================================================

========================================================================


Note: DC1 OSDs are stopped (the daemons are not alive on the DC1 servers)

The OSDs crash. systemd tries to restart them, but the start limit (max retries) is reached and no further restart is attempted.


Oct 01 20:54:58 training1-vsrv3 systemd[1]: ceph-osd@20.service: Main process exited, code=exited, status=1/FAILURE
Oct 01 20:54:58 training1-vsrv3 systemd[1]: ceph-osd@20.service: Failed with result 'exit-code'.
Oct 01 20:54:58 training1-vsrv3 systemd[1]: ceph-osd@20.service: Consumed 1.824s CPU time, 192.6M memory peak.
Oct 01 20:55:02 training1-vsrv3 ceph-osd[7690]: 2025-10-01T20:55:02.103+0200 7f0cc7c76680 -1 osd.26 2832 unable to obtain rotating service keys; retrying
Oct 01 20:55:02 training1-vsrv3 ceph-osd[7690]: 2025-10-01T20:55:02.103+0200 7f0cc7c76680 -1 osd.26 2832 init wait_auth_rotating timed out -- maybe I have a clock skew against the monitors?
Oct 01 20:55:02 training1-vsrv3 systemd[1]: ceph-osd@26.service: Main process exited, code=exited, status=1/FAILURE
Oct 01 20:55:02 training1-vsrv3 systemd[1]: ceph-osd@26.service: Failed with result 'exit-code'.
Oct 01 20:55:02 training1-vsrv3 systemd[1]: ceph-osd@26.service: Consumed 1.995s CPU time, 227.8M memory peak.
Oct 01 20:55:02 training1-vsrv3 ceph-osd[7649]: 2025-10-01T20:55:02.183+0200 75feffa29680 -1 osd.16 2832 unable to obtain rotating service keys; retrying
Oct 01 20:55:02 training1-vsrv3 ceph-osd[7649]: 2025-10-01T20:55:02.183+0200 75feffa29680 -1 osd.16 2832 init wait_auth_rotating timed out -- maybe I have a clock skew against the monitors?
Oct 01 20:55:02 training1-vsrv3 systemd[1]: ceph-osd@16.service: Main process exited, code=exited, status=1/FAILURE
Oct 01 20:55:02 training1-vsrv3 systemd[1]: ceph-osd@16.service: Failed with result 'exit-code'.
Oct 01 20:55:02 training1-vsrv3 systemd[1]: ceph-osd@16.service: Consumed 1.846s CPU time, 186M memory peak.
Oct 01 20:55:08 training1-vsrv3 systemd[1]: ceph-osd@20.service: Scheduled restart job, restart counter is at 3.
Oct 01 20:55:08 training1-vsrv3 systemd[1]: ceph-osd@20.service: Start request repeated too quickly.
Oct 01 20:55:08 training1-vsrv3 systemd[1]: ceph-osd@20.service: Failed with result 'exit-code'.
Oct 01 20:55:08 training1-vsrv3 systemd[1]: Failed to start ceph-osd@20.service - Ceph object storage daemon osd.20.
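
# (Sketch, not part of the original session.) Checks that help on a node whose OSDs
# log "unable to obtain rotating service keys" -- assuming chrony is the time source;
# in this test the real cause is the netsplit, not clock skew:
chronyc tracking                                    # is the local clock actually in sync?
timeout 10 ceph -s                                  # can we still reach a monitor? (hangs during the netsplit)
systemctl list-units --state=failed 'ceph-osd@*'    # which OSD units has systemd given up on?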
 
And here is the case with the same cluster partition on both DCs:

Code:

Other case: PVE quorum on DC2, and OSDs up on DC2 too.
BUT: it is still a mess!

root@training2-vsrv1:~# ceph osd status
ID  HOST              USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
 0  training2-vsrv1  3183M  3980M      0        0       0        0   exists,up
 1  training2-vsrv1  3681M  3482M      0        0       0        0   exists,up
 2  training2-vsrv1  3094M  4069M      0        0       0        0   exists,up
 3  training2-vsrv1  3185M  3978M      0        0       0        0   exists,up
 4  training2-vsrv2  3621M  3542M      0        0       0        0   exists,up
 5  training2-vsrv2  3576M  3587M      0        0       0        0   exists,up
 6  training2-vsrv2  3156M  4007M      0        0       0        0   exists,up
 7  training2-vsrv2  3599M  3564M      0        0       0        0   exists,up
 8  training2-vsrv3  3651M  3512M      0        0       0        0   exists,up
 9  training2-vsrv3  3118M  4045M      0        0       0        0   exists,up
10  training2-vsrv3  3247M  3916M      0        0       0        0   exists,up
11  training2-vsrv3  3262M  3901M      0        0       0        0   exists,up
12  training2-vsrv4  2947M  4216M      0        0       0        0   exists,up
13  training2-vsrv4  2993M  4170M      0        0       0        0   exists,up
14  training2-vsrv4  2857M  4306M      0        0       0        0   exists,up
15  training2-vsrv4  3124M  4039M      0        0       0        0   exists,up
16                      0      0       0        0       0        0   exists
17                      0      0       0        0       0        0   exists
18                      0      0       0        0       0        0   exists
19                      0      0       0        0       0        0   exists
20                      0      0       0        0       0        0   exists
21                      0      0       0        0       0        0   exists
22                      0      0       0        0       0        0   exists
23                      0      0       0        0       0        0   exists
24                      0      0       0        0       0        0   exists
25                      0      0       0        0       0        0   exists
26                      0      0       0        0       0        0   exists
27                      0      0       0        0       0        0   exists
28                      0      0       0        0       0        0   exists
29                      0      0       0        0       0        0   exists
30                      0      0       0        0       0        0   exists
31                      0      0       0        0       0        0   exists

root@training1-vsrv1:~# ceph osd status
==> no output (the command never returns)


root@training1-vsrv1:~# pvecm status
Cluster information
-------------------
Name:             training
Config Version:   9
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed Oct  8 21:19:27 2025
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000008
Ring ID:          1.b84
Quorate:          No

Votequorum information
----------------------
Expected votes:   9
Highest expected: 9
Total votes:      4
Quorum:           5 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.10.1.2
0x00000002          1 10.10.1.3
0x00000004          1 10.10.1.4
0x00000008          1 10.10.1.1 (local)

root@training2-vsrv1:~# pvecm status
Cluster information
-------------------
Name:             training
Config Version:   9
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed Oct  8 21:19:23 2025
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          0x00000007
Ring ID:          3.a68
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   9
Highest expected: 9
Total votes:      5
Quorum:           5
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000003          1 10.10.1.22
0x00000005          1 10.10.1.23
0x00000006          1 10.10.1.24
0x00000007          1 10.10.1.21 (local)
0x00000009          1 10.10.1.31




The OSDs crash. systemd tries to restart them, but the start limit (max retries) is reached and no further restart is attempted:


Oct 08 21:29:06 training1-vsrv1 pvedaemon[1549]: <root@pam> starting task UPID:training1-vsrv1:000039CC:000272D3:68E6BB82:srvstart:osd.18:root@pam:
Oct 08 21:29:06 training1-vsrv1 systemd[1]: ceph-osd@18.service: Start request repeated too quickly.
Oct 08 21:29:06 training1-vsrv1 systemd[1]: ceph-osd@18.service: Failed with result 'exit-code'.
Oct 08 21:29:06 training1-vsrv1 systemd[1]: Failed to start ceph-osd@18.service - Ceph object storage daemon osd.18.
Oct 08 21:29:06 training1-vsrv1 pvedaemon[14796]: command '/bin/systemctl start ceph-osd@18' failed: exit code 1
Oct 08 21:29:06 training1-vsrv1 pvedaemon[1549]: <root@pam> end task UPID:training1-vsrv1:000039CC:000272D3:68E6BB82:srvstart:osd.18:root@pam: command '/bin/systemctl start ceph-osd@18' failed: exit code 1

To resolve:
==> systemctl reset-failed ceph-osd@18.service
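
# (Sketch, not part of the original session.) To clear every OSD unit on a node that
# hit the systemd start limit and bring them back once the monitors are reachable,
# something along these lines can be used (ceph-osd@* / ceph-osd.target are the
# standard unit names shipped with the Ceph packages):
systemctl reset-failed 'ceph-osd@*'
systemctl start ceph-osd.target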
 
Interesting... I believe the Ceph cluster and the PVE cluster should be entirely independent, but as far as I know Proxmox keeps the ceph.conf config file on the PVE cluster shared filesystem, which is not accessible on nodes that are out of quorum, and that may be what prevents the Ceph services from restarting. I guess there should be a way to store the Ceph config on local storage, and maybe that would help...
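A quick way to check this on a node (a sketch; on a standard hyperconverged PVE setup, /etc/ceph/ceph.conf is typically a symlink into the pmxcfs mount /etc/pve):

Code:
# Where does the Ceph config actually live on this node?
ls -l /etc/ceph/ceph.conf
# Is the cluster filesystem still readable right now?
ls -l /etc/pve/ceph.conf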
 
Interesting... I believe the Ceph cluster and the PVE cluster should be entirely independent, but as far as I know Proxmox keeps the ceph.conf config file on the PVE cluster shared filesystem, which is not accessible on nodes that are out of quorum, and that may be what prevents the Ceph services from restarting. I guess there should be a way to store the Ceph config on local storage, and maybe that would help...
Indeed. But this situation should not happen. If a node is fenced, having Ceph out of order is OK for me. The problem with the point you mention is that the Ceph OSDs didn't start again after the cluster rejoined.
 
Our test: we cut the link between the two main DCs, but both can still talk to the 3rd DC => netsplit.
This is a situation that a stretch cluster does not protect you against. The main goal of a stretch cluster is to keep the cluster functional if one location is completely down, for example, due to a fire.

The network between the locations needs to be set up as reliably as possible. How that is achieved, depends on the individual situation, of course. But ideally two direct connections that take different physical routes, so that one incident with an excavator won't bring down the site-to-site network.

And to make it clearer, I added a note about that in the rather new public Stretch Cluster guide https://pve.proxmox.com/wiki/Stretch_Cluster#Physical_Layout
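On the Corosync side, two such routes can also be configured as redundant knet links, so Corosync itself fails over between them. Below is a minimal sketch of a node entry in /etc/pve/corosync.conf with two links; the 10.20.1.0/24 addresses are an assumed second site-to-site network, not taken from this lab:

Code:
node {
    name training1-vsrv1
    nodeid 8
    quorum_votes 1
    ring0_addr 10.10.1.1
    # second link over the assumed 10.20.1.0/24 route; repeat for every node
    ring1_addr 10.20.1.1
}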
 
This is a situation that a stretch cluster does not protect you against. The main goal of a stretch cluster is to keep the cluster functional if one location is completely down, for example, due to a fire.

The network between the locations needs to be set up as reliably as possible. How that is achieved, depends on the individual situation, of course. But ideally two direct connections that take different physical routes, so that one incident with an excavator won't bring down the site-to-site network.

And to make it clearer, I added a note about that in the rather new public Stretch Cluster guide https://pve.proxmox.com/wiki/Stretch_Cluster#Physical_Layout
Hello Aaron.

Thank you very much for this explanation.
You saved us several hours of extensive tests.
Now, do you think there is a workaround if this case occurs, such as shutting down every server in one datacenter?

Good job on the documentation.
Suggestions for improvement:
- mention that a Ceph cluster that has been switched to stretch mode cannot be put back into normal mode
- give the procedure to add a new node to the stretched cluster, for example:

Code:
# Create the monitor via PVE, then stop it so it can be started with a CRUSH location
pveceph mon create
systemctl start ceph-mon@$(hostname).service
systemctl disable ceph-mon@$(hostname).service
systemctl stop ceph-mon@$(hostname).service
# Run the monitor once in the foreground with its datacenter location set
/usr/bin/ceph-mon -f --cluster ceph --id $(hostname) --setuser ceph --setgroup ceph --set-crush-location datacenter="__MY_DC_LOCATION__"
# Ctrl + C once the monitor has joined, then hand it back to systemd
systemctl enable ceph-mon@$(hostname).service
systemctl start ceph-mon@$(hostname).service
# Verify the monitor appears with the expected location
ceph mon dump

And, if needed:
Code:
ceph osd crush reweight-all
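
To verify the result (standard Ceph commands; the grep filter is only an example):
Code:
# The host and its OSDs should now sit under the expected datacenter bucket
ceph osd tree
# Stretch-mode details (tiebreaker, per-mon locations) show up in the mon dump
ceph mon dump | grep -Ei 'stretch|location|tiebreaker'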


BR
Gautier
 
- If we shut down the DC that has the Corosync vote but not the working Ceph, the one with the working Ceph comes back up.
- If we shut down the DC that does not have the Corosync vote but has the working Ceph, nothing comes back up.
Even though PVE and Ceph operate in a hyperconverged setup, the two stacks remain independent, as you’ve discovered. Their quorum/voting mechanisms are separate, which means you can end up in a “hyperconverged split-brain” scenario. As @aaron mentioned, having two independent inter-site links is the best way to reduce this risk.

That said, the behavior you observed is concerning - particularly the fact that a manual DC shutdown didn’t allow the PVE side to recover on the other side. If it were me, I’d definitely invest more time into understanding this specific failure mode.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Even though PVE and Ceph operate in a hyperconverged setup, the two stacks remain independent, as you’ve discovered. Their quorum/voting mechanisms are separate, which means you can end up in a “hyperconverged split-brain” scenario. As @aaron mentioned, having two independent inter-site links is the best way to reduce this risk.

That said, the behavior you observed is concerning - particularly the fact that a manual DC shutdown didn’t allow the PVE side to recover on the other side. If it were me, I’d definitely invest more time into understanding this specific failure mode.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
Hello bbgeek17!
Indeed, that's my intention. First, we will re-test and try to understand the manual shutdown of the servers on one site.