4-Node-Ceph-Cluster went down

liszca

Member
May 8, 2020
On my 4-Node Cluster with Ceph I shut down one system to make some BIOS changes.

The issue is the cluster came to a complete stop while doing this.

What I checked beforehand on the shutdown node:
  • No HA rules are applied to any of the VMs or LXCs
  • All are on Ceph Storage
  • No Backup is running on that Node
Where do I have to look to figure out what caused my cluster to fail?
 
If your cluster is running again now, please post the output of ceph status and, if there are errors, also the output of ceph health and ceph health detail.
 
Code:
# ceph status
  cluster:
    id:     ddfe12d5-782f-4028-b499-71f3e6763d8a
    health: HEALTH_OK
 
  services:
    mon: 4 daemons, quorum aegaeon,anthe,atlas,calypso (age 12h)
    mgr: anthe(active, since 12h), standbys: atlas, calypso, aegaeon
    mds: 2/2 daemons up, 2 standby
    osd: 4 osds: 4 up (since 12h), 4 in (since 4w)
 
  data:
    volumes: 2/2 healthy
    pools:   7 pools, 193 pgs
    objects: 98.12k objects, 377 GiB
    usage:   1.1 TiB used, 2.6 TiB / 3.7 TiB avail
    pgs:     193 active+clean
 
  io:
    client:   112 KiB/s wr, 0 op/s rd, 18 op/s wr

Code:
# ceph health
HEALTH_OK

Code:
# ceph health detail
HEALTH_OK
 
Is there anything in the syslog / journal from that time that hints at what happened?

What do you mean by "complete stop" of the cluster? What happened?
 
Could you please share with us the output of `pvecm status`?
 

Code:
# pvecm status
Cluster information
-------------------
Name:             saturn
Config Version:   4
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon Nov  6 23:26:38 2023
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000002
Ring ID:          1.4b2
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.0.10
0x00000002          1 192.168.0.11 (local)
0x00000003          1 192.168.0.12
0x00000004          1 192.168.0.13

Looks like my voting is wrong?

The idea was to design my cluster so that 2 nodes can fail. In general, I want to be able to take one node down for hardware maintenance while the cluster keeps running and still tolerate one more node failing.
 
Corosync requires n+1 votes for quorum if you have either 2n or 2n+1 nodes, so 3 votes if you have 4 nodes. If you add a QDevice, that will bump the total votes to 5 while the quorum stays at 3. This allows two nodes to be down; it also has other benefits.

Do note that with half of the nodes down, the other half cannot know whether the missing nodes are down or running perfectly well together. You need a tie breaker for these cases, and this is precisely the purpose of a QDevice.

Please take a look at [1] for more info.

[1] https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_corosync_external_vote_support
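
The vote arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the majority rule, assuming every node carries exactly one vote and a QDevice contributes one extra vote; the function names are made up for this example and are not part of any Proxmox or corosync tool.

```python
def votes_needed(total_votes: int) -> int:
    """Strict majority of the total votes: floor(total / 2) + 1."""
    return total_votes // 2 + 1


def tolerated_failures(total_votes: int) -> int:
    """How many one-vote members may be down while the rest stay quorate."""
    return total_votes - votes_needed(total_votes)


# (nodes, qdevice votes): assumes one vote per node, one vote for a QDevice
for nodes, qdevice in [(3, 0), (4, 0), (4, 1), (5, 0)]:
    total = nodes + qdevice
    print(f"{nodes} nodes + {qdevice} QDevice: total votes {total}, "
          f"quorum {votes_needed(total)}, "
          f"can lose {tolerated_failures(total)} vote(s)")
```

With these assumptions, 4 nodes tolerate the loss of one vote, the same as 3 nodes, while 4 nodes plus a QDevice tolerate the loss of two.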
 
I haven't figured votes out completely yet. Here are my thoughts from a storage perspective:

Systems with a RAID 1 equivalent have 4 votes
Systems with single disks have 2 votes
Systems with a RAID 0 equivalent have 1 vote

Is this approach feasible?
 
The only way to know what happened is to check the syslog on every node for the time range between when you shut down the one node and when the other nodes "failed". Please post those logs so we have something to look at.

Besides that, I don't understand why you mention "raid" and "ceph", which are mutually exclusive.

Also, the vote results you posted before are perfectly fine. Why do you think they are "wrong", and what do you want to accomplish by changing the votes of some nodes?
 
It took me a while to figure out what went wrong. First of all, it was me who did things wrong.

But let me explain:
After the initial setup of my cluster, consisting of 3 nodes, I made a test run and found that it had no problem with one node being stopped. Then for some reason I thought that with the fourth node I could lose another node, which turned out not to be true.

My plan is now to look into the QDevice.
 
You can also go for 4:2 (size: 4 | min_size: 2) if you have enough disk space. Then you can lose 2 nodes at the same time.

SIZE = number of replicas of each object, each written to a different host. 3 means 3 different hosts.
MINSIZE = number of replicas that need to be online for Ceph to keep serving I/O

SIZE - MINSIZE = number of nodes you can lose at the same time
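
As a quick illustration of the SIZE - MINSIZE rule above, here is a small Python sketch. It assumes one replica per host (host as the failure domain), as described in this post; the function name is invented for the example.

```python
def hosts_you_can_lose(size: int, min_size: int) -> int:
    """Replicas (and thus hosts, at one replica per host) that may be
    offline while a pool still has min_size copies and keeps serving I/O."""
    if not 1 <= min_size <= size:
        raise ValueError("expected 1 <= min_size <= size")
    return size - min_size


for size, min_size in [(3, 2), (4, 2)]:
    print(f"size {size} / min_size {min_size}: "
          f"can lose {hosts_you_can_lose(size, min_size)} host(s)")
```

This only covers the pool replicas. It says nothing yet about monitor quorum or Proxmox VE quorum.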
 
Do not forget about the Monitors in these calculations! They work by forming a Quorum, similar to the Proxmox VE nodes themselves. You always need a majority of Ceph MONs and Proxmox VE nodes up and running to be quorate (have the majority of votes).
 
Oh yeah, you're right! That means the best option would be 5 nodes with 4:2 if you want to be able to lose two nodes at the same time. Size 3 with min_size 2 does not allow losing 2 nodes at the same time, no matter how many MONs or servers you have. Thanks for correcting!

Edit: 5 nodes also have the benefit of being able to lose a node and then fully recover Ceph on a 3:2 setup.
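
To put the three conditions from the last posts side by side (Proxmox VE quorum, Ceph MON quorum, and size/min_size), here is a rough worst-case check in Python. It is a sketch under simplifying assumptions: one vote per node, at most one monitor per node, one replica per host, and the failed nodes are assumed to be the ones holding monitors and replicas. All names are hypothetical.

```python
from dataclasses import dataclass


def majority(votes: int) -> int:
    """Smallest strict majority of `votes`."""
    return votes // 2 + 1


@dataclass
class Verdict:
    pve_quorate: bool  # corosync still has a majority of votes
    mon_quorum: bool   # Ceph monitors still have a majority
    pool_io: bool      # at least min_size replicas remain

    @property
    def ok(self) -> bool:
        return self.pve_quorate and self.mon_quorum and self.pool_io


def worst_case(nodes, mons, size, min_size, failed, qdevice_votes=0):
    """Pessimistic check: failed nodes hit monitors and replicas first."""
    total_votes = nodes + qdevice_votes
    pve_quorate = (nodes - failed) + qdevice_votes >= majority(total_votes)
    mon_quorum = mons - min(failed, mons) >= majority(mons)
    pool_io = size - min(failed, size) >= min_size
    return Verdict(pve_quorate, mon_quorum, pool_io)


# The setup from this thread: 4 nodes, 4 MONs, size 3 / min_size 2
print(worst_case(nodes=4, mons=4, size=3, min_size=2, failed=1))
print(worst_case(nodes=4, mons=4, size=3, min_size=2, failed=2))
# The suggestion above: 5 nodes, 5 MONs, size 4 / min_size 2
print(worst_case(nodes=5, mons=5, size=4, min_size=2, failed=2))
```

Under these assumptions, only the 5-node 4:2 layout passes all three checks with two nodes down.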
 
How do I check whether this is set up correctly?

ceph status gives me the following:
Code:
  services:
    mon: 4 daemons, quorum aegaeon,anthe,atlas,calypso (age 33h)
    mgr: anthe(active, since 33h), standbys: atlas, aegaeon, calypso
    mds: 2/2 daemons up, 2 standby
    osd: 4 osds: 4 up (since 33h), 4 in (since 33h)
 
Okay, it happened again.

Here is what I did:
  1. Shut down one node, atlas (yesterday): all worked as expected, HA did its job
  2. Started node atlas: the working nodes stopped working and even restarted
  3. Waited, it took about 30 minutes: all nodes and VMs came back
Between steps 2 and 3, pvecm status showed what I think is called a split-brain condition, and I had the feeling the nodes got restarted a second time.

Any missing info I should be able to get from the syslog/logs.

The attached screenshots have timestamps in their names:
 

Attachments

  • Screenshot from 2023-11-08 20-28-33.png
  • Screenshot from 2023-11-08 20-18-10.png
  • Screenshot from 2023-11-08 20-11-08.png
  • Screenshot from 2023-11-08 20-02-16.png
  • Screenshot from 2023-11-08 19-58-50.png
  • Screenshot from 2023-11-08 19-37-43.png
  • Screenshot from 2023-11-08 19-33-10.png
A QDevice won't be enough, as it won't help with Ceph quorum. With 2 out of 4 Ceph monitors off, the Ceph cluster will not have quorum and I/O will be suspended, even if the Proxmox quorum is ok.

The recommended deployment is to use an odd number of monitors [1] (3 is more than enough for your use case) + a QDevice when using an even number of nodes:
  • During planned downtime, if you need to take down two of the servers that run Ceph monitors, simply add a monitor on the fourth node and remove the monitor from one of the servers to be shut down. This way you will still have 2 monitors running and keep Ceph quorum, as 2 out of 3 monitors will be up. Proxmox quorum will be ok as there will be 3 votes out of 5 (2 nodes + QDevice).
  • During unplanned downtime of two nodes, the VMs running on the surviving nodes will keep working if two Ceph monitors are on these hosts. If not, VMs will halt their I/O as there won't be Ceph quorum. Proxmox quorum will be ok (again, there will be 3 votes out of 5).
You can't do this with 4 nodes and allow any two nodes to fail, due to Ceph quorum: you can't lose two Ceph monitors, either with 3 (recommended) or 4 monitors installed (what you have now).

What I would do is get something like a NUC or a cheap second-hand server and use it as a fifth Ceph monitor and Proxmox node. It won't have OSDs or run VMs, it will just add its vote.

[1] https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/
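
The "3 monitors + QDevice on 4 nodes" case above can be walked through by enumerating every pair of failed nodes. The Python sketch below uses the node names from this thread, but the monitor placement (monitors on aegaeon, anthe and atlas, none on calypso) is purely a hypothetical choice for illustration, as is the single QDevice vote.

```python
from itertools import combinations

NODES = ["aegaeon", "anthe", "atlas", "calypso"]
MON_NODES = {"aegaeon", "anthe", "atlas"}  # hypothetical monitor placement
QDEVICE_VOTES = 1                          # assumes one QDevice vote


def majority(votes: int) -> int:
    """Smallest strict majority of `votes`."""
    return votes // 2 + 1


def evaluate(failed):
    """Return (proxmox_quorate, ceph_mon_quorum) for a set of failed nodes."""
    up = set(NODES) - set(failed)
    pve_ok = len(up) + QDEVICE_VOTES >= majority(len(NODES) + QDEVICE_VOTES)
    mon_ok = len(MON_NODES & up) >= majority(len(MON_NODES))
    return pve_ok, mon_ok


for failed in combinations(NODES, 2):
    pve_ok, mon_ok = evaluate(failed)
    print(f"down: {', '.join(failed):<18} "
          f"PVE quorate: {pve_ok!s:<5} Ceph MON quorum: {mon_ok}")
```

With this placement, Proxmox VE stays quorate for every pair, but Ceph keeps monitor quorum only when the node without a monitor is one of the two that are down. This is why two arbitrary nodes cannot fail safely in a 4-node cluster.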
 
Still, your problem is that at some point some or all nodes can't see each other via the corosync link(s), which makes them lose quorum and makes HA reboot some or all nodes. That in turn causes Ceph to lose quorum and VM I/O to halt. I would even guess that you are using a single corosync link shared with Ceph, and that corosync goes mad when Ceph starts rebalancing and network latency starts jittering, making some node(s) restart again and adding insult to injury :)

Please post the output of corosync-cfgtool -s on each node while the cluster is up and healthy. Also post /etc/ceph/ceph.conf. If the problem happens again (it looks like you can easily reproduce it), you can also use that command to see if and how each node reaches the others.
 

From what you write, I get the impression that one network for everything is not sufficient?

Code:
 # pvecm status
Cluster information
-------------------
Name:             saturn
Config Version:   4
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed Nov  8 20:16:42 2023
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000004
Ring ID:          1.bb7
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.0.10
0x00000002          1 192.168.0.11
0x00000003          1 192.168.0.12
0x00000004          1 192.168.0.13 (local)

 0 calypso.saturn:/root
 # corosync-cfgtool -s
Local node ID 4, transport knet
LINK ID 0 udp
    addr    = 192.168.0.13
    status:
        nodeid:          1:    connected
        nodeid:          2:    connected
        nodeid:          3:    connected
        nodeid:          4:    localhost

 0 calypso.saturn:/root
 # cat /etc/ceph/ceph.conf
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 192.168.0.10/24
     fsid = ddfe12d5-782f-4028-b499-71f3e6763d8a
     mon_allow_pool_delete = true
     mon_host = 192.168.0.10 192.168.0.11 192.168.0.12 192.168.0.13
     ms_bind_ipv4 = true
     ms_bind_ipv6 = false
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 192.168.0.10/24

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
     keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.aegaeon]
     host = aegaeon
     mds_standby_for_name = pve

[mds.anthe]
     host = anthe
     mds standby for name = pve

[mds.atlas]
     host = atlas
     mds_standby_for_name = pve

[mds.calypso]
     host = calypso
     mds_standby_for_name = pve

[mon.aegaeon]
     public_addr = 192.168.0.10

[mon.anthe]
     public_addr = 192.168.0.11

[mon.atlas]
     public_addr = 192.168.0.12

[mon.calypso]
     public_addr = 192.168.0.13
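
If the problem is reproduced, the corosync-cfgtool -s output from each node can be checked mechanically instead of by eye. The Python sketch below parses text in the layout pasted above; that layout is the only assumption about the format (other corosync versions may print it differently), and the function names are invented for this example.

```python
import re

# Sample copied from the corosync-cfgtool -s output in this post
SAMPLE = """\
Local node ID 4, transport knet
LINK ID 0 udp
    addr    = 192.168.0.13
    status:
        nodeid:          1:    connected
        nodeid:          2:    connected
        nodeid:          3:    connected
        nodeid:          4:    localhost
"""

LINK_RE = re.compile(r"^LINK ID (\d+)")
NODE_RE = re.compile(r"^\s*nodeid:\s*(\d+):\s*(\S+)")


def parse_cfgtool(text):
    """Return {link_id: {node_id: status}} from corosync-cfgtool -s text."""
    links = {}
    current = None
    for line in text.splitlines():
        match = LINK_RE.match(line)
        if match:
            current = int(match.group(1))
            links[current] = {}
            continue
        match = NODE_RE.match(line)
        if match and current is not None:
            links[current][int(match.group(1))] = match.group(2)
    return links


def unreachable(links):
    """List (link_id, node_id) pairs that are neither connected nor local."""
    return [(link, node)
            for link, peers in links.items()
            for node, status in peers.items()
            if status not in ("connected", "localhost")]


links = parse_cfgtool(SAMPLE)
print(f"{len(links)} corosync link(s) configured")
print("unreachable peers:", unreachable(links))
```

For the sample above this reports a single corosync link and no unreachable peers, which fits the guess that corosync shares one network with Ceph (192.168.0.x in ceph.conf).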
 
