Ceph - one pool crashing can bring down other pools and derail the whole cluster.

tomtom13

Renowned Member
Dec 28, 2016
Hi,
Since we've been migrating more and more stuff to Ceph under Proxmox, we've found some quirky behavior, and I've built a test case for it on my test cluster.

Create a small cluster with a minimum of 4 nodes.
Create one Ceph pool using one disk per node, with 4x replication (size 4) and min_size 2; let's call it pool_1.
Create VMs using this pool as storage; let's call one of them vm_A.
Create a second pool using a single (but separate) disk per node, with 2x replication (size 2) and min_size 1; let's call it pool_2 (it's best to use SATA / USB disks here so they are easy to unplug for failure simulation).
Create a VM using the second pool as storage; let's call it vm_B (a rough command sketch follows this list).
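Roughly, the pool setup boils down to something like the commands below. This is only a sketch: the pool names and the pg_num are placeholders, and you'd still need a CRUSH rule or device class per pool to pin each one to the intended disk per node, which I'm leaving out here.

# pool_1: 4 replicas, writes keep going with 2 copies left
pveceph pool create pool_1 --size 4 --min_size 2

# pool_2: 2 replicas, writes keep going with 1 copy left
pveceph pool create pool_2 --size 2 --min_size 1

# or the same with plain Ceph tooling
ceph osd pool create pool_2 32
ceph osd pool set pool_2 size 2
ceph osd pool set pool_2 min_size 1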

Now, you can switch off two nodes, OR simply unplug two of the disks that are backing pool_2.

What I would expect to happen is for Proxmox to kill / suspend vm_B - and that does happen (well, it gets killed, but hey).
What I would NOT expect to happen is for Proxmox to kill vm_A as well.
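For anyone who tries this, you can check whether the damage really is limited to pool_2, and what PVE thinks of the guests, with something along these lines (pool names match the setup above, the VM IDs are placeholders):

# cluster state and the concrete health messages
ceph health detail
ceph -s

# which OSDs are down and on which host
ceph osd tree

# PGs of the degraded pool only
ceph pg ls-by-pool pool_2

# guest state on the PVE side
qm status <vmid_of_vm_A>
qm status <vmid_of_vm_B>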


I know that some will come after me with pitchforks for using the blasphemous mode of fewer than 3 replicas, or will later try to dismiss the problem with this mode as an excuse, but ¯\_(ツ)_/¯ sorry, this is the easiest way I can show how to replicate the problem. In our case we had joined a few unreliable nodes to the cluster and put one pool on them; those nodes went down for RAM replacement and the whole cluster just imploded.
 
Using garbage and blaming Ceph for having problems with non-enterprise equipment is "blasphemous mode". But yes, you can kill Ceph with an even number of cluster nodes and by removing unreliable nodes without informing the cluster, i.e. just unplugging them. Create monitors and managers on the reliable nodes only, and in an odd number. This should fix the behavior.
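Something like this, one daemon at a time so the monitor quorum is never lost (you pick the node by running the command on it; the mon ID is a placeholder, and this assumes a current pveceph):

# on each reliable node that should carry a monitor / manager
pveceph mon create
pveceph mgr create

# on an unreliable node that should no longer vote
pveceph mon destroy <mon-id>

# check that an odd number of monitors is in quorum
ceph mon stat
ceph quorum_status --format json-pretty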
 
@bbgeek17 - I've illustrated the problem with the most minimal test-cluster setup possible, for anyone interested in testing it; the production cluster is slightly different.

@itNGO - as kindly as possible: I've replicated on a test cluster the problem that we noticed in production and presented it here for anyone interested in replicating it, so very kindly don't call my hardware garbage when you have zero idea of what we call "a few unreliable nodes".
 
"Garbage" and "Blasphemous" definitions aside, this is going to be extremely difficult to diagnose or even give some advice without every little detail of the settings for PVE, the VMs and Ceph. And some logs too, there should be some trace about why VM_A got killed.... Even if anyone takes the time to build a 4 node cluster to replicate your proposed configuration, any little difference may influence the result (VMs halting or not).

I can tell you for sure that I run similar configurations (i.e. different pools for different servers/disks) and failures in a pool never influenced the other ones (as long as Ceph kept quorum, of course).
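As a starting point for those logs, I would look in the usual places on the PVE nodes around the time vm_A stopped (a sketch; the units and paths are the stock ones, adjust the time window and check every node):

# PVE side: journal of the services that start/stop guests
journalctl --since "2 hours ago" -u pvedaemon -u qmeventd

# Ceph side: cluster log on the monitor nodes plus per-daemon logs
less /var/log/ceph/ceph.log
ls /var/log/ceph/

# what Ceph itself recorded (crash ls needs a reasonably recent release)
ceph health detail
ceph crash ls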
 
"Garbage" and "Blasphemous" definitions aside, this is going to be extremely difficult to diagnose or even give some advice without every little detail of the settings for PVE, the VMs and Ceph. And some logs too, there should be some trace about why VM_A got killed.... Even if anyone takes the time to build a 4 node cluster to replicate your proposed configuration, any little difference may influence the result (VMs halting or not).

I can tell you for sure that I run similar configurations (i.e. different pools for different servers/disks) and failures in a pool never influenced the other ones (as long as Ceph kept quorum, of course).
That is interesting! I grant you that maybe my test setup did not replicate the original problem and is simply broken, but this is something I can replicate: if I pull two disks out of pool_2 I get a Ceph ERROR state (not just a warning) and all VMs go down, which is a bit bizarre to me. I wonder if that's because those disks carry both an RBD pool and CephFS, and the CephFS is mounted as an extra storage (you know, just going through the CephFS creation wizard, which also adds the storage in Proxmox).
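In case it helps with diagnosing this, here is roughly what I'd check to see whether the errors and inactive PGs really map only to pool_2 and to the CephFS pools sitting on the same disks (a sketch, names as above):

# map the health errors / stuck PGs back to their pools
ceph health detail
ceph pg dump_stuck inactive

# per-pool I/O and degradation
ceph osd pool stats

# CephFS / MDS state, since its pools share the pulled disks
ceph fs status
ceph mds stat

# how the CephFS storage is wired into PVE
cat /etc/pve/storage.cfg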

Anyway, thanks for the feedback that it doesn't happen for you.
 
