4-Node-Ceph-Cluster went down

liszca · Nov 5, 2023

On my 4-Node Cluster with Ceph I shut down one system to make some BIOS changes.

The issue is that the whole cluster came to a complete stop while I was doing this.

What I checked beforehand on the shutdown node:

No HA rules are applied to any of the VMs or LXCs
All are on Ceph Storage
No Backup is running on that Node

Where do I have to look to figure out what caused my cluster to fail?

mgabriel · Nov 5, 2023

If your cluster is running again now, please post the output of ceph status. If there are errors, also post the output of ceph health and ceph health detail.

liszca · Nov 5, 2023

Code:

# ceph status
  cluster:
    id:     ddfe12d5-782f-4028-b499-71f3e6763d8a
    health: HEALTH_OK
 
  services:
    mon: 4 daemons, quorum aegaeon,anthe,atlas,calypso (age 12h)
    mgr: anthe(active, since 12h), standbys: atlas, calypso, aegaeon
    mds: 2/2 daemons up, 2 standby
    osd: 4 osds: 4 up (since 12h), 4 in (since 4w)
 
  data:
    volumes: 2/2 healthy
    pools:   7 pools, 193 pgs
    objects: 98.12k objects, 377 GiB
    usage:   1.1 TiB used, 2.6 TiB / 3.7 TiB avail
    pgs:     193 active+clean
 
  io:
    client:   112 KiB/s wr, 0 op/s rd, 18 op/s wr

Code:

# ceph health
HEALTH_OK

Code:

# ceph health detail
HEALTH_OK

mgabriel · Nov 6, 2023

Is there anything in the syslog / journal around that time that hints at what happened?

What do you mean by "complete stop" of the cluster? What happened?

Maximiliano · Nov 6, 2023

Could you please share with us the output of `pvecm status`?

liszca · Nov 6, 2023

Maximiliano said:
Could you please share with us the output of `pvecm status`?

Code:

# pvecm status
Cluster information
-------------------
Name:             saturn
Config Version:   4
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon Nov  6 23:26:38 2023
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000002
Ring ID:          1.4b2
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.0.10
0x00000002          1 192.168.0.11 (local)
0x00000003          1 192.168.0.12
0x00000004          1 192.168.0.13

Looks like my voting is wrong?

The idea was to design my cluster so that 2 nodes can fail. In general, my concept is that I want to be able to do hardware maintenance on one node while the cluster is running and still allow one more node to fail.

liszca · Nov 6, 2023

mgabriel said:
What do you mean by "complete stop" of the cluster? What happened?

All nodes stopped their VMs and LXCs. Some nodes were even unreachable.

Maximiliano · Nov 7, 2023

liszca said:
The idea was to design my cluster so that 2 nodes can fail. In general, my concept is that I want to be able to do hardware maintenance on one node while the cluster is running and still allow one more node to fail.

Corosync requires n+1 votes for quorum if you have either 2n or 2n+1 nodes. So 3 votes if you have 4 nodes. If you add a QDevice, that bumps the total votes to 5 while the quorum stays at 3. This allows two nodes to be down, and it also has other benefits.

Do note that with half of the nodes down, it is impossible for the other half to know whether the missing nodes are down or running perfectly well together. You need a tie breaker for these cases, and this is precisely the purpose of a QDevice.

Please take a look at [1] for more info.

[1] https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_corosync_external_vote_support
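
As a rough illustration of the vote arithmetic above, here is a minimal Python sketch. It is not anything corosync ships; the function names are made up for this example, and it assumes every node and the QDevice carry exactly one vote.

```python
def quorum(total_votes):
    """Votes needed for a strict majority of total_votes."""
    return total_votes // 2 + 1


def tolerated_losses(total_votes):
    """How many votes can disappear before quorum is lost."""
    return total_votes - quorum(total_votes)


# Assumption: one vote per node, one extra vote for the QDevice.
for label, votes in [("3 nodes", 3), ("4 nodes", 4), ("4 nodes + QDevice", 5)]:
    print(f"{label}: quorum={quorum(votes)}, "
          f"can lose {tolerated_losses(votes)} vote(s)")
```

Going from 3 to 4 nodes raises the quorum from 2 to 3, so the cluster still tolerates only one missing vote. Only the fifth vote raises that to two.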

liszca · Nov 7, 2023

Maximiliano said:
Corosync requires n+1 votes for quorum if you have either 2n or 2n+1 nodes. So 3 votes if you have 4 nodes. If you add a QDevice, that bumps the total votes to 5 while the quorum stays at 3. This allows two nodes to be down, and it also has other benefits.

I haven't figured out votes completely yet. Here are my thoughts from a storage perspective:

Systems with a RAID 1 equivalent have 4 votes
Systems with single disks have 2 votes
Systems with a RAID 0 equivalent have 1 vote

Is this approach feasible?

VictorSTS · Nov 8, 2023

The only way to know what happened is to check the syslog on every node for the time range between shutting down the one node and the other nodes "failing". Please post the logs so we have something to look at and can shed some light on this.

Besides that, I don't understand why you mention "raid" and "ceph", which are mutually exclusive.

Also, the vote results you posted before are perfectly fine. Why do you think they are "wrong", and what do you want to accomplish by changing the votes of some nodes?

liszca · Nov 8, 2023

It took me a while to figure out what went wrong. First of all, it was me who did things wrong.

But let me explain:
After the initial setup of my cluster with 3 nodes I made a test run and found that it had no problem with one node being stopped. Then for some reason I thought that with the fourth node I could lose another node, which turned out not to be true.

My plan is now to have a look into the QDevice.

jsterr · Nov 8, 2023

You can also go for 4:2 (size: 4 | minsize: 2) if you have enough disk space. Then you can lose 2 nodes at the same time.

SIZE = number of replicas of each object, each written to a different host. 3 means 3 different hosts.
MINSIZE = number of replicas that need to be online and working for ceph to keep serving I/O

SIZE - MINSIZE = number of nodes you can lose at the same time
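
The rule of thumb above can be written down as a small Python sketch. This is only an illustration: the function name is hypothetical, it assumes one replica per host, and it deliberately ignores monitor quorum.

```python
def hosts_you_can_lose(size, min_size):
    """SIZE - MINSIZE, assuming each replica lives on a different host."""
    if not 1 <= min_size <= size:
        raise ValueError("expected 1 <= min_size <= size")
    return size - min_size


for size, min_size in [(3, 2), (4, 2)]:
    print(f"{size}:{min_size} -> can lose "
          f"{hosts_you_can_lose(size, min_size)} host(s) and keep I/O running")
```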

aaron · Nov 8, 2023

Do not forget about the Monitors in these calculations! They work by forming a Quorum, similar to the Proxmox VE nodes themselves. You always need a majority of Ceph MONs and Proxmox VE nodes up and running to be quorate (have the majority of votes).

jsterr · Nov 8, 2023

aaron said:
Do not forget about the Monitors in these calculations! They work by forming a Quorum, similar to the Proxmox VE nodes themselves. You always need a majority of Ceph MONs and Proxmox VE nodes up and running to be quorate (have the majority of votes).

Oh yeah, you're right! That means the best option would be 5 nodes with 4:2 if you want to be able to lose two nodes at the same time. Having size 3 and minsize 2 does not allow you to lose 2 nodes at the same time, no matter how many mons and/or servers you have. Thanks for correcting!

Edit: 5 nodes also have the benefit of being able to lose a node and then fully recover ceph on a 3:2 setup.

liszca · Nov 8, 2023

Maximiliano said:
[1] https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_corosync_external_vote_support

Is it possible to give each of my nodes a QDevice? I want to avoid having another machine just for the QDevice.

liszca · Nov 8, 2023

aaron said:
Do not forget about the Monitors in these calculations! They work by forming a Quorum, similar to the Proxmox VE nodes themselves. You always need a majority of Ceph MONs and Proxmox VE nodes up and running to be quorate (have the majority of votes).

How do I check whether this is set up correctly?

ceph status gives me the following:

Code:

  services:
    mon: 4 daemons, quorum aegaeon,anthe,atlas,calypso (age 33h)
    mgr: anthe(active, since 33h), standbys: atlas, aegaeon, calypso
    mds: 2/2 daemons up, 2 standby
    osd: 4 osds: 4 up (since 33h), 4 in (since 33h)

liszca · Nov 8, 2023

Okay, it happened again.

Here is what I did:

1. Shut down one node, atlas (yesterday): all worked as expected, HA did its job.
2. Started node atlas: the working nodes stopped working and even restarted.
3. Waited, it took about 30 minutes: all nodes and VMs came back.

Between steps 2 and 3, pvecm status showed what I think is called a split-brain condition, and I had the feeling the nodes got restarted a second time.

The missing info I should be able to get from syslog/logs.

The attached screenshots have timestamps in their names:

VictorSTS · Nov 8, 2023

QDevice won't be enough, as it won't help with ceph quorum. With 2 out of 4 ceph monitors off, the ceph cluster will not have quorum and I/O will be suspended, even if the Proxmox quorum is ok.

The recommended deployment is to use an odd number of monitors [1] (3 is more than enough for your use case) + a QDevice when using an even number of nodes:

During planned downtime, if you need to take down two of the servers that run ceph monitors, simply add a monitor on the fourth node and remove the monitor from one of the servers to be shut down. This way you will still have 2 monitors running and keep ceph quorum, as 2 out of 3 monitors will be up. Proxmox quorum will be ok as there will be 3 votes out of 5 (2 nodes + qdevice).
During unplanned downtime of two nodes, the VMs running on the surviving nodes will keep working if two ceph monitors are on those hosts. If not, VMs will halt their I/O as there won't be ceph quorum. Proxmox quorum will be ok (again, there will be 3 votes out of 5).
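
To make the two cases above concrete, here is a minimal Python sketch that enumerates every pair of failed nodes. The node names and the monitor placement are assumptions for illustration only (4 nodes, monitors on three of them, a QDevice with one vote that stays reachable). It models nothing beyond simple majority counting.

```python
from itertools import combinations

NODES = ["node1", "node2", "node3", "node4"]   # hypothetical node names
MON_NODES = {"node1", "node2", "node3"}        # assumption: 3 monitors
QDEVICE_VOTES = 1                              # assumption: QDevice is reachable


def majority(total):
    """Smallest count that is a strict majority of total."""
    return total // 2 + 1


def survives(down):
    """Return (proxmox_quorate, ceph_quorate) for a set of failed nodes."""
    nodes_up = len([n for n in NODES if n not in down])
    pve_ok = nodes_up + QDEVICE_VOTES >= majority(len(NODES) + QDEVICE_VOTES)
    mons_up = len(MON_NODES - set(down))
    ceph_ok = mons_up >= majority(len(MON_NODES))
    return pve_ok, ceph_ok


results = {pair: survives(pair) for pair in combinations(NODES, 2)}
for pair, (pve_ok, ceph_ok) in results.items():
    print(f"down={pair}: proxmox quorum={pve_ok}, ceph quorum={ceph_ok}")
```

Under these assumptions Proxmox stays quorate for every pair, while ceph keeps quorum only when the node without a monitor is one of the two that failed.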

liszca said:
The idea was to design my cluster so that 2 nodes can fail. In general, my concept is that I want to be able to do hardware maintenance on one node while the cluster is running and still allow one more node to fail.

You can't do this with 4 nodes and allow any two nodes to fail, due to ceph quorum: you can't lose two ceph monitors, either with 3 monitors (recommended) or with 4 monitors (what you have now).

What I would do is get something like a NUC or a cheap second-hand server and use it as a fifth ceph monitor and proxmox node. It won't have OSDs or run VMs, it will just add its vote.

[1] https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/

VictorSTS · Nov 8, 2023

Still, your problem is that at some point some/all nodes can't see each other via the corosync link(s), so they lose quorum and HA reboots some/all nodes, which makes ceph lose quorum and halts VM I/O. I would even guess that you are using a single corosync link shared with ceph, and that corosync goes mad when ceph starts rebalancing and network latency starts jittering, making some node(s) restart again and adding insult to injury.

Please post the output of corosync-cfgtool -s on each node while the cluster is up and healthy. Also post /etc/ceph/ceph.conf. If the problem happens again (looks like you can easily reproduce it), you can also use that command to see if and how each node reaches the others.
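
If you want to check that output on several nodes without reading it by eye, a small Python sketch like the following could help. It assumes the output looks like the sample posted further down in this thread (a "LINK ID" header followed by "nodeid: N: state" lines), and it treats any state other than "connected" or "localhost" as a problem. The exact wording for a down link may differ between corosync versions, and the function name is made up for this example.

```python
import re

# Sample taken from the corosync-cfgtool -s output posted in this thread.
SAMPLE = """\
Local node ID 4, transport knet
LINK ID 0 udp
    addr    = 192.168.0.13
    status:
        nodeid:          1:    connected
        nodeid:          2:    connected
        nodeid:          3:    connected
        nodeid:          4:    localhost
"""

LINK_LINE = re.compile(r"^LINK ID (\d+)")
NODE_LINE = re.compile(r"^\s*nodeid:\s*(\d+):\s*(\S+)\s*$")


def unreachable_peers(text):
    """Return (link_id, node_id, state) for peers that are not connected."""
    problems = []
    link = None
    for line in text.splitlines():
        match = LINK_LINE.match(line)
        if match:
            link = int(match.group(1))
            continue
        match = NODE_LINE.match(line)
        if match and match.group(2) not in ("connected", "localhost"):
            problems.append((link, int(match.group(1)), match.group(2)))
    return problems


print(unreachable_peers(SAMPLE))
```

On a real node you would feed it the text captured from the command, for example via subprocess.run(["corosync-cfgtool", "-s"], capture_output=True, text=True).stdout.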

liszca · Nov 8, 2023

VictorSTS said:
Still, your problem is that at some point some/all nodes can't see each other via the corosync link(s), so they lose quorum and HA reboots some/all nodes, which makes ceph lose quorum and halts VM I/O. I would even guess that you are using a single corosync link shared with ceph, and that corosync goes mad when ceph starts rebalancing and network latency starts jittering, making some node(s) restart again and adding insult to injury.

Please post the output of corosync-cfgtool -s on each node while the cluster is up and healthy. Also post /etc/ceph/ceph.conf. If the problem happens again (looks like you can easily reproduce it), you can also use that command to see if and how each node reaches the others.

From what you write, I get the impression that a single network for everything is not sufficient?

Code:

 # pvecm status
Cluster information
-------------------
Name:             saturn
Config Version:   4
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed Nov  8 20:16:42 2023
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000004
Ring ID:          1.bb7
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.0.10
0x00000002          1 192.168.0.11
0x00000003          1 192.168.0.12
0x00000004          1 192.168.0.13 (local)

 0 calypso.saturn:/root
 # corosync-cfgtool -s
Local node ID 4, transport knet
LINK ID 0 udp
    addr    = 192.168.0.13
    status:
        nodeid:          1:    connected
        nodeid:          2:    connected
        nodeid:          3:    connected
        nodeid:          4:    localhost

 0 calypso.saturn:/root
 # cat /etc/ceph/ceph.conf
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 192.168.0.10/24
     fsid = ddfe12d5-782f-4028-b499-71f3e6763d8a
     mon_allow_pool_delete = true
     mon_host = 192.168.0.10 192.168.0.11 192.168.0.12 192.168.0.13
     ms_bind_ipv4 = true
     ms_bind_ipv6 = false
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 192.168.0.10/24

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
     keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.aegaeon]
     host = aegaeon
     mds_standby_for_name = pve

[mds.anthe]
     host = anthe
     mds standby for name = pve

[mds.atlas]
     host = atlas
     mds_standby_for_name = pve

[mds.calypso]
     host = calypso
     mds_standby_for_name = pve

[mon.aegaeon]
     public_addr = 192.168.0.10

[mon.anthe]
     public_addr = 192.168.0.11

[mon.atlas]
     public_addr = 192.168.0.12

[mon.calypso]
     public_addr = 192.168.0.13
