Question about Ceph and redundancy

rekordskratch

Hi everyone

I have 4 nodes running 7.1-10, each with 5 SSDs. I have set up Ceph and created monitors, managers, and metadata servers on all of the nodes. I know that not all nodes need these, but the plan is to expand the cluster in the future. I've created OSDs for all of the disks, set up CephFS for my ISOs etc., and also created a pool for VMs and CTs. All of the above was done with the default Proxmox values.

I then installed a test VM and configured HA with shutdown_policy=migrate. Rebooting the node where the VM is running works as expected, i.e. the VM migrates to another node and all is well.

The problem comes when I reboot 2 of the nodes at the same time. When I do this, the other 2 nodes lock up and eventually reboot. I can understand the reboot, as that's probably the Proxmox fencing mechanism at work. To make sure nothing was misbehaving, I've tried both the IPMI watchdog and the default softdog.

Is this expected behaviour? Would changing the Ceph replication sizes make any difference here?

Some additional context: I have 2 x Cisco Nexus 3064 switches set up with vPC for redundancy. Each of my servers has 1 x quad-port 10G NIC and 1 x dual-port 10G NIC.
2 ports in 802.3ad across both switches for management and public access to the VMs (I have set up separate VLANs and bridges, of course)
2 ports in 802.3ad across both switches for Ceph public and cluster traffic (I have set up 2 VLANs here to separate public and cluster)
2 ports without any bonding for the Proxmox corosync cluster (I assigned an IP to each interface and set them up as link 0 and link 1 in the cluster config)

Any help or advice would be very much appreciated
 
Is this expected behaviour? Would changing the Ceph replication sizes make any difference here?
Proxmox VE and Ceph are two different software stacks. Ceph is deployed and managed by Proxmox VE. So let's first take a look at Proxmox VE.

You have 4 nodes. The PVE cluster works by forming a quorum (majority). Only if the cluster is quorate will you be able to perform certain actions, like starting a guest or changing its configuration.

You can check the PVE cluster status by running pvecm status. The last part in particular will tell you the expected number of votes, how many votes are currently present, and how many are needed to form a quorum.
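On a healthy 4-node cluster the quorum section looks roughly like this (illustrative output only, the exact fields can vary slightly between versions):
Code:
Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3
Flags:            Quorate

Here "Quorum: 3" means that 3 of the 4 votes are needed for the cluster to stay quorate.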

What happens if a node loses contact with the quorate part of the cluster? If you do not use HA, not much. Guests will keep running, but as mentioned, certain actions won't work. If you use HA, the situation changes dramatically. If a node that has HA guests running loses contact with the quorate part of the cluster and cannot reestablish that connection within a short time frame (AFAIR 1 minute), it will fence itself to make sure that the HA guests are definitely powered off before the (hopefully still quorate) rest of the cluster starts them again after a grace period of another minute.

So if you power down 2 of 4 nodes, you have only 50% of the votes left, which is not a majority (a 4-node cluster needs at least 3 votes for quorum), and I assume the whole reboot procedure takes longer than one minute. Therefore, the 2 remaining nodes will fence themselves.

Now regarding Ceph: the MONs also work by majority. So if you power down half of the MONs, you also lose quorum and the Ceph side won't work until at least 3 of the 4 votes are available again.
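You can check the MON quorum with the standard Ceph CLI, for example:
Code:
ceph mon stat
ceph quorum_status --format json-pretty

The first gives a one-line summary, the second the full quorum details.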

To avoid a split brain situation, it is a good idea to always have an odd number of nodes in the cluster. While there is the QDevice for Proxmox VE to add an additional vote, there is no such thing for Ceph MONs.
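If you do add a QDevice for the PVE side at some point, it is set up from one cluster node with pvecm, roughly like this; the address is just a placeholder for an external host running corosync-qnetd:
Code:
pvecm qdevice setup <QDEVICE-IP>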


With a size/min_size of 3/2 it is also not a good idea to power down 2 nodes at the same time, because some PGs are bound to have 2 of their 3 replicas on those 2 nodes and will therefore be undersized. The affected pools will have their IO blocked until those PGs are back up to at least min_size.
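You can watch this happen with the usual Ceph status commands, for example:
Code:
ceph -s
ceph health detail

The latter should list the undersized/inactive PGs while the nodes are down.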

You could remedy that by setting the size to 4, but you lose net storage space as more replicas need to be stored in the cluster.

In the end, it all comes down to how many node failures you expect and plan to handle, and scaling your cluster for that. So if you want to be able to lose 2 nodes at the same time with Ceph, you would need a size/min_size of 4/2 and 5 nodes, ideally all with a MON, so that any 2 nodes can fail.
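For reference, size and min_size can be changed per pool in the GUI or on the CLI, roughly like this (replace the pool name with your actual VM/CT pool):
Code:
ceph osd pool set <pool-name> size 4
ceph osd pool set <pool-name> min_size 2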



The network looks good from your description.
2 ports without any bonding for the Proxmox corosync cluster (I assigned an IP to each interface and set them up as link 0 and link 1 in the cluster config)
Hopefully in different IP subnets? For example, one in 192.168.1.x/24 and the other in 192.168.2.x/24?
 
Thank you so much @aaron for this excellent explanation. It makes perfect sense now.

Hopefully in different IP subnets? For example, one in 192.168.1.x/24 and the other in 192.168.2.x/24?
Oops, no... I couldn't find an answer on this, so I put them on the same subnet. My thinking was that in the case of a NIC failure, it would not be able to route to the other subnet, since I have no router involved.

Thinking about it now, I'm realising that Corosync will keep the config in sync regardless, so if link 1 becomes the preferred link, it should switch over on all nodes, if I'm understanding correctly?
 
If you have IPs in the same subnet, the Linux kernel takes the liberty of answering from an interface that has another IP in that subnet, not necessarily the one that received the actual request.
This is not a good idea if you want to separate traffic. For example, if you want to be sure that the Ceph traffic actually uses the physical network you intended for it, you should configure the Ceph network in its own subnet that exists only on those NICs.
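As a minimal sketch of what that could look like in /etc/network/interfaces, assuming a bond called bond1 carrying the Ceph VLANs (the VLAN IDs and subnets here are made up):
Code:
# example only: VLAN IDs and subnets are placeholders
auto bond1.60
iface bond1.60 inet static
    address 10.10.60.11/24
# Ceph public network, its own subnet, only on this VLAN interface

auto bond1.61
iface bond1.61 inet static
    address 10.10.61.11/24
# Ceph cluster network, again its own subnet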

Regarding Corosync... well, it could be seen as a feature, but it goes against the assumptions Corosync makes about its links, and it will make it hard for Corosync to determine whether a link is down.

If you are afraid that, for some reason, both Corosync links could become unusable, feel free to add the other networks as additional Corosync links. Corosync can use up to 8 different links, and in a problematic situation those additional links might help to keep the Corosync communication up. But there is no guarantee.
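For reference, the links are defined per node in /etc/pve/corosync.conf; with two links in separate subnets the nodelist looks roughly like this (hostnames and addresses are made up):
Code:
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.11
    ring1_addr: 192.168.2.11
  }
  # ... one entry per node, example addresses only
}

If you edit /etc/pve/corosync.conf by hand, remember to increase the config_version in the totem section so that the change gets propagated.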

One more thing: if you change the IPs in the corosync.conf file to the new subnet on the other Corosync link, you will have to restart the Corosync service for it to pick up the changed IP addresses on that already existing link.
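On each node that would be something like:
Code:
systemctl restart corosync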


And lastly, if you work on Corosync or its network, it can be a good idea to temporarily stop the HA services while you do so.

First stop the LRM on all nodes:
Code:
systemctl stop pve-ha-lrm

Then the CRM on all nodes:
Code:
systemctl stop pve-ha-crm

Once you are done, you can start them in the same order. First the LRM on all nodes, then the CRM.
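That is, first on all nodes:
Code:
systemctl start pve-ha-lrm

and then, also on all nodes:
Code:
systemctl start pve-ha-crm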
 
