Hello,
Together with Damien BROCHARD (Ozerim), we ran some extensive Ceph tests in our lab. Our running and functional lab setup: a simulated two-datacenter cluster with Ceph stretch mode enabled, plus a third DC consisting of a single PVE node acting as tiebreaker (Ceph) / quorum vote (Corosync).
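For context, here is roughly how such a setup is enabled; a minimal sketch only, with placeholder CRUSH bucket/rule names (dc1/dc2/dc3, stretch_rule), not our exact commands:
Code:
# place each monitor in its datacenter CRUSH bucket (likewise for the other mons)
ceph mon set_location training1-vsrv1 datacenter=dc1
ceph mon set_location training2-vsrv1 datacenter=dc2
ceph mon set_location training3-vsrv1 datacenter=dc3

# switch the mon election to the connectivity strategy, then enable stretch
# mode with the DC3 mon as tiebreaker and a rule replicating over both main DCs
ceph mon set election_strategy connectivity
ceph mon enable_stretch_mode training3-vsrv1 stretch_rule datacenter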
Our test: we cut the link between the two main DCs, while both can still talk to the third DC → netsplit.
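For anyone wanting to reproduce it: a rough software equivalent of cutting that link, assuming the DC2 node IPs from the membership output below, would be:
Code:
# on each DC1 node: drop all traffic to/from the DC2 nodes, while the
# DC3 tiebreaker (10.10.1.31) stays reachable from both sides
for ip in 10.10.1.21 10.10.1.22 10.10.1.23 10.10.1.24; do
    iptables -A INPUT  -s "$ip" -j DROP
    iptables -A OUTPUT -d "$ip" -j DROP
done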
In this situation, and we reproduced it several times: the partition that wins the PVE (Corosync) quorum is on one DC, and the partition that wins the Ceph quorum is on the other DC! (And in some rare cases, both end up on the same side but nothing responds.)
When that happens:
- If we shut down the DC that holds the Corosync quorum but not the working Ceph, the DC with the working Ceph comes back up.
- If we shut down the DC that does not hold the Corosync quorum but does have the working Ceph, nothing comes back up.
Two quorum algorithms, two different results, no coordination between them = whole cluster down.
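(Which side each algorithm elected is visible directly in the standard status outputs pasted below, e.g.:)
Code:
# Corosync view: is this partition quorate?
pvecm status | grep -E 'Quorate|Total votes'
# Ceph view: which mons currently form the quorum?
ceph quorum_status --format json-pretty | grep -A6 quorum_names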
Have you ever run into this situation? How do you resolve it?
Best Regards,
Damien & Gautier
Below: some troubleshooting info captured during the problem.
Code:
ssh root@training1-vsrv1
root@training1-vsrv1's password:
Linux training1-vsrv1 6.14.8-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.14.8-2 (2025-07-22T10:04Z) x86_64
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Wed Oct 1 20:48:16 2025 from 192.168.250.70
root@training1-vsrv1:~# pveceph status
  cluster:
    id:     3ee28d49-9366-463e-9296-190cfb3822d2
    health: HEALTH_WARN
            We are missing stretch mode buckets, only requiring 1 of 2 buckets to peer
            2/5 mons down, quorum training3-vsrv1,training2-vsrv1,training2-vsrv2
            1 datacenter (16 osds) down
            16 osds down
            4 hosts (16 osds) down
            Degraded data redundancy: 18508/37016 objects degraded (50.000%), 545 pgs degraded, 545 pgs undersized

  services:
    mon: 5 daemons, quorum training3-vsrv1,training2-vsrv1,training2-vsrv2 (age 13m), out of quorum: training1-vsrv1, training1-vsrv2
    mgr: training2-vsrv4(active, since 12m), standbys: training1-vsrv4, training1-vsrv2, training2-vsrv2
    osd: 32 osds: 16 up (since 12m), 32 in (since 57m)

  data:
    pools:   3 pools, 545 pgs
    objects: 9.25k objects, 28 GiB
    usage:   51 GiB used, 61 GiB / 112 GiB avail
    pgs:     18508/37016 objects degraded (50.000%)
             545 active+undersized+degraded
root@training1-vsrv1:~# pvecm status
Cluster information
-------------------
Name: training
Config Version: 9
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Wed Oct 1 20:51:14 2025
Quorum provider: corosync_votequorum
Nodes: 5
Node ID: 0x00000008
Ring ID: 1.281
Quorate: Yes
Votequorum information
----------------------
Expected votes: 9
Highest expected: 9
Total votes: 5
Quorum: 5
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.10.1.2
0x00000002 1 10.10.1.3
0x00000004 1 10.10.1.4
0x00000008 1 10.10.1.1 (local)
0x00000009 1 10.10.1.31
DC 1 (optiplex1):
- pvecm: quorate
- Ceph mons in quorum: on DC2 and DC3
- working Ceph OSDs: on DC2
========================================================================
Note: all the PVE servers have restarted and the GUI is very unstable
========================================================================
DC 2 (optiplex2):
- pvecm: activity blocked
- Ceph mons in quorum: on DC2 and DC3
- working Ceph OSDs: on DC2
ssh root@training2-vsrv1
root@training2-vsrv1's password:
Permission denied, please try again.
root@training2-vsrv1's password:
Permission denied, please try again.
root@training2-vsrv1's password:
Linux training2-vsrv1 6.14.8-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.14.8-2 (2025-07-22T10:04Z) x86_64
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Wed Sep 17 20:28:30 2025 from 192.168.250.70
root@training2-vsrv1:~# pveceph status
  cluster:
    id:     3ee28d49-9366-463e-9296-190cfb3822d2
    health: HEALTH_WARN
            We are missing stretch mode buckets, only requiring 1 of 2 buckets to peer
            2/5 mons down, quorum training3-vsrv1,training2-vsrv1,training2-vsrv2
            1 datacenter (16 osds) down
            16 osds down
            4 hosts (16 osds) down
            Degraded data redundancy: 18508/37016 objects degraded (50.000%), 545 pgs degraded, 545 pgs undersized

  services:
    mon: 5 daemons, quorum training3-vsrv1,training2-vsrv1,training2-vsrv2 (age 10m), out of quorum: training1-vsrv1, training1-vsrv2
    mgr: training2-vsrv4(active, since 10m), standbys: training1-vsrv4, training1-vsrv2, training2-vsrv2
    osd: 32 osds: 16 up (since 10m), 32 in (since 54m)

  data:
    pools:   3 pools, 545 pgs
    objects: 9.25k objects, 28 GiB
    usage:   51 GiB used, 61 GiB / 112 GiB avail
    pgs:     18508/37016 objects degraded (50.000%)
             545 active+undersized+degraded
root@training2-vsrv1:~# pvecm status
Cluster information
-------------------
Name: training
Config Version: 9
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Wed Oct 1 20:49:37 2025
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000007
Ring ID: 3.375
Quorate: No
Votequorum information
----------------------
Expected votes: 9
Highest expected: 9
Total votes: 4
Quorum: 5 Activity blocked
Flags:
Membership information
----------------------
Nodeid Votes Name
0x00000003 1 10.10.1.22
0x00000005 1 10.10.1.23
0x00000006 1 10.10.1.24
0x00000007 1 10.10.1.21 (local)
root@training2-vsrv1:~# ceph osd status
ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 training2-vsrv1 3757M 3406M 0 0 0 0 exists,up
1 training2-vsrv1 3720M 3443M 0 0 0 0 exists,up
2 training2-vsrv1 3054M 4109M 0 0 0 0 exists,up
3 training2-vsrv1 3223M 3940M 0 0 0 0 exists,up
4 training2-vsrv2 3149M 4014M 0 0 0 0 exists,up
5 training2-vsrv2 3202M 3961M 0 0 0 0 exists,up
6 training2-vsrv2 3554M 3609M 0 0 0 0 exists,up
7 training2-vsrv2 3714M 3449M 0 0 0 0 exists,up
8 training2-vsrv3 3553M 3610M 0 0 0 0 exists,up
9 training2-vsrv3 3549M 3614M 0 0 0 0 exists,up
10 training2-vsrv3 3669M 3494M 0 0 0 0 exists,up
11 training2-vsrv3 2790M 4373M 0 0 0 0 exists,up
12 training2-vsrv4 2890M 4273M 0 0 0 0 exists,up
13 training2-vsrv4 2919M 4244M 0 0 0 0 exists,up
14 training2-vsrv4 2774M 4389M 0 0 0 0 exists,up
15 training2-vsrv4 3077M 4086M 0 0 0 0 exists,up
16 0 0 0 0 0 0 exists
17 0 0 0 0 0 0 exists
18 0 0 0 0 0 0 exists
19 0 0 0 0 0 0 exists
20 0 0 0 0 0 0 exists
21 0 0 0 0 0 0 exists
22 0 0 0 0 0 0 exists
23 0 0 0 0 0 0 exists
24 0 0 0 0 0 0 exists
25 0 0 0 0 0 0 exists
26 0 0 0 0 0 0 exists
27 0 0 0 0 0 0 exists
28 0 0 0 0 0 0 exists
29 0 0 0 0 0 0 exists
30 0 0 0 0 0 0 exists
31 0 0 0 0 0 0 exists
========================================================================
Note: the DC1 OSDs are stopped (the daemons are no longer alive on the DC1 servers).
The OSDs crash; systemd tries to restart them, but the start limit (max retries) is reached and no further restart is attempted (see the sketch after the log excerpt below).
========================================================================
Oct 01 20:54:58 training1-vsrv3 systemd[1]: ceph-osd@20.service: Main process exited, code=exited, status=1/FAILURE
Oct 01 20:54:58 training1-vsrv3 systemd[1]: ceph-osd@20.service: Failed with result 'exit-code'.
Oct 01 20:54:58 training1-vsrv3 systemd[1]: ceph-osd@20.service: Consumed 1.824s CPU time, 192.6M memory peak.
Oct 01 20:55:02 training1-vsrv3 ceph-osd[7690]: 2025-10-01T20:55:02.103+0200 7f0cc7c76680 -1 osd.26 2832 unable to obtain rotating service keys; retrying
Oct 01 20:55:02 training1-vsrv3 ceph-osd[7690]: 2025-10-01T20:55:02.103+0200 7f0cc7c76680 -1 osd.26 2832 init wait_auth_rotating timed out -- maybe I have a clock skew against the monitors?
Oct 01 20:55:02 training1-vsrv3 systemd[1]: ceph-osd@26.service: Main process exited, code=exited, status=1/FAILURE
Oct 01 20:55:02 training1-vsrv3 systemd[1]: ceph-osd@26.service: Failed with result 'exit-code'.
Oct 01 20:55:02 training1-vsrv3 systemd[1]: ceph-osd@26.service: Consumed 1.995s CPU time, 227.8M memory peak.
Oct 01 20:55:02 training1-vsrv3 ceph-osd[7649]: 2025-10-01T20:55:02.183+0200 75feffa29680 -1 osd.16 2832 unable to obtain rotating service keys; retrying
Oct 01 20:55:02 training1-vsrv3 ceph-osd[7649]: 2025-10-01T20:55:02.183+0200 75feffa29680 -1 osd.16 2832 init wait_auth_rotating timed out -- maybe I have a clock skew against the monitors?
Oct 01 20:55:02 training1-vsrv3 systemd[1]: ceph-osd@16.service: Main process exited, code=exited, status=1/FAILURE
Oct 01 20:55:02 training1-vsrv3 systemd[1]: ceph-osd@16.service: Failed with result 'exit-code'.
Oct 01 20:55:02 training1-vsrv3 systemd[1]: ceph-osd@16.service: Consumed 1.846s CPU time, 186M memory peak.
Oct 01 20:55:08 training1-vsrv3 systemd[1]: ceph-osd@20.service: Scheduled restart job, restart counter is at 3.
Oct 01 20:55:08 training1-vsrv3 systemd[1]: ceph-osd@20.service: Start request repeated too quickly.
Oct 01 20:55:08 training1-vsrv3 systemd[1]: ceph-osd@20.service: Failed with result 'exit-code'.
Oct 01 20:55:08 training1-vsrv3 systemd[1]: Failed to start ceph-osd@20.service - Ceph object storage daemon osd.20.
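Once the link is back and the monitors are reachable again, those OSD units do not come back on their own because the systemd start limit was hit. An untested sketch of the manual recovery (using osd.20 from the log above):
Code:
# clear the failed/start-limit state, then start the OSD again
systemctl reset-failed ceph-osd@20.service
systemctl start ceph-osd@20.service

# optionally relax the start limit with a drop-in so the unit keeps retrying
mkdir -p /etc/systemd/system/ceph-osd@.service.d
printf '[Unit]\nStartLimitIntervalSec=0\n' > /etc/systemd/system/ceph-osd@.service.d/override.conf
systemctl daemon-reload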