Ceph - one pool crashing can bring down other pools and derail the whole cluster.

tomtom13

Renowned Member
Dec 28, 2016
Hi,
Since we've been migrating more and more stuff to Ceph under Proxmox, we've found some quirky behavior, and I've built a test case for it on my test cluster.

Create a small cluster with a minimum of 4 nodes.
Create one Ceph pool using one disk per node, with 4x replication (size 4) and min_size 2; let's call it pool_1.
Create VMs using this pool as storage; let's call one of them vm_A.
Create a second pool using a single (but separate) disk per node, with 2x replication (size 2) and min_size 1; let's call it pool_2 (it's best to use SATA / USB disks here so they are easy to unplug for failure simulation).
Create a VM using the second pool as storage; let's call it vm_B (a rough command sketch follows this list).
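Roughly, the pool setup boils down to something like the commands below. This is only a sketch: the pool names and the pg_num are placeholders, and you'd still need a CRUSH rule or device class per pool to pin each one to the intended disk per node, which I'm leaving out here.

# pool_1: 4 replicas, writes keep going with 2 copies left
pveceph pool create pool_1 --size 4 --min_size 2

# pool_2: 2 replicas, writes keep going with 1 copy left
pveceph pool create pool_2 --size 2 --min_size 1

# or the same with plain Ceph tooling
ceph osd pool create pool_2 32
ceph osd pool set pool_2 size 2
ceph osd pool set pool_2 min_size 1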

Now, you can switch off two nodes, OR simply unplug two of the disks that are backing pool_2.

What I would expect to happen is for Proxmox to kill / suspend vm_B - and that does happen (well, it gets killed, but hey).
What I would NOT expect to happen is for Proxmox to kill vm_A as well.
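For anyone who tries this, you can check whether the damage really is limited to pool_2, and what PVE thinks of the guests, with something along these lines (pool names match the setup above, the VM IDs are placeholders):

# cluster state and the concrete health messages
ceph health detail
ceph -s

# which OSDs are down and on which host
ceph osd tree

# PGs of the degraded pool only
ceph pg ls-by-pool pool_2

# guest state on the PVE side
qm status <vmid_of_vm_A>
qm status <vmid_of_vm_B>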


I know that some will come after me with pitchforks for using the blasphemous mode of fewer than 3 replicas, or will later try to dismiss the problem with this mode as an excuse, but ¯\_(ツ)_/¯ sorry, this is the easiest way I can show how to replicate the problem. In our case we had joined a few unreliable nodes to the cluster and put one pool on them; those nodes went down for RAM replacement and the whole cluster just imploded.
 
Using garbage and blaming Ceph for having problems with non-enterprise equipment is "blasphemous mode". But yes, you can kill Ceph with an even number of cluster nodes and by removing unreliable nodes without informing the cluster, i.e. just unplugging them. Create monitors and managers on the reliable nodes only, and in an odd number. This should fix the behavior.
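Something like this, one daemon at a time so the monitor quorum is never lost (you pick the node by running the command on it; the mon ID is a placeholder, and this assumes a current pveceph):

# on each reliable node that should carry a monitor / manager
pveceph mon create
pveceph mgr create

# on an unreliable node that should no longer vote
pveceph mon destroy <mon-id>

# check that an odd number of monitors is in quorum
ceph mon stat
ceph quorum_status --format json-pretty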
 
@bbgeek17 - I've illustrated the problem with the most minimal test-cluster setup possible, for anyone interested in testing it; the production cluster is slightly different.

@itNGO - as kindly as possible: I've replicated on a test cluster the problem that we noticed in production and presented it here for anyone interested in replicating it, so very kindly don't call my hardware garbage when you have zero idea of what we call "a few unreliable nodes".
 
"Garbage" and "Blasphemous" definitions aside, this is going to be extremely difficult to diagnose or even give some advice without every little detail of the settings for PVE, the VMs and Ceph. And some logs too, there should be some trace about why VM_A got killed.... Even if anyone takes the time to build a 4 node cluster to replicate your proposed configuration, any little difference may influence the result (VMs halting or not).

I can tell you for sure that I run similar configurations (i.e. different pools for different servers/disks) and failures in a pool never influenced the other ones (as long as Ceph kept quorum, of course).
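As a starting point for those logs, I would look in the usual places on the PVE nodes around the time vm_A stopped (a sketch; the units and paths are the stock ones, adjust the time window and check every node):

# PVE side: journal of the services that start/stop guests
journalctl --since "2 hours ago" -u pvedaemon -u qmeventd

# Ceph side: cluster log on the monitor nodes plus per-daemon logs
less /var/log/ceph/ceph.log
ls /var/log/ceph/

# what Ceph itself recorded (crash ls needs a reasonably recent release)
ceph health detail
ceph crash ls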
 
"Garbage" and "Blasphemous" definitions aside, this is going to be extremely difficult to diagnose or even give some advice without every little detail of the settings for PVE, the VMs and Ceph. And some logs too, there should be some trace about why VM_A got killed.... Even if anyone takes the time to build a 4 node cluster to replicate your proposed configuration, any little difference may influence the result (VMs halting or not).

I can tell you for sure that I run similar configurations (i.e. different pools for different servers/disks) and failures in a pool never influenced the other ones (as long as Ceph kept quorum, of course).
That is interesting! I grant you that maybe my test setup did not replicate the original problem and is simply broken, but this is something I can replicate: if I pull two disks out of pool_2 I get a Ceph ERROR state (not just a warning) and all VMs go down, which is a bit bizarre to me. I wonder if that's because those disks carry both an RBD pool and CephFS, and the CephFS is mounted as an extra storage (you know, just going through the CephFS creation wizard, which also adds the storage in Proxmox).
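In case it helps with diagnosing this, here is roughly what I'd check to see whether the errors and inactive PGs really map only to pool_2 and to the CephFS pools sitting on the same disks (a sketch, names as above):

# map the health errors / stuck PGs back to their pools
ceph health detail
ceph pg dump_stuck inactive

# per-pool I/O and degradation
ceph osd pool stats

# CephFS / MDS state, since its pools share the pulled disks
ceph fs status
ceph mds stat

# how the CephFS storage is wired into PVE
cat /etc/pve/storage.cfg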

Anyway, thanks for the feedback that it doesn't happen for you.
 
