Hello all,
I have 2 Proxmox Nodes with:
- CPU: 2x 18 Core Intel Xeon Gold 6154
- RAM: 16x 32GB ECC DDR4 SDRAM (512 GB)
- Storage: 10x 960GB Samsung PM9A3 NVMe
A third Proxmox node runs as a VM and serves as the quorum device.
I created a Proxmox cluster with all 3 nodes and then installed Ceph on all 3 nodes.
I only want to replicate across the two nodes, so I set the following in the configuration:
Code:
osd_pool_default_min_size = 1
osd_pool_default_size = 2
On all 3 nodes I created a Ceph Monitor, a Ceph Manager and a Ceph Metadata Server.
Then I created OSDs on the NVMes of the two nodes and created a pool for them.
I can install VMs, import VMs and migrate VMs between the two nodes. I can shut down and restart a node and everything works fine.
Then I started testing worst-case scenarios, and that is where I ran into a problem.
When I simulate a node crash with
Code:
echo c > /proc/sysrq-trigger
I get a short cluster timeout and then some laggy PGs. I can't access any VMs, yet the Ceph cluster reports HEALTH_OK. Any VMs that were running on the crashed node are moved to the healthy node, so the HA rules are working.
When the crashed node is back online, everything works fine.
I would expect the cluster to report that it is degraded while everything keeps working, the same as when I shut down a node.
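To make that expectation concrete, here is a minimal sketch of how I understand size and min_size for a replicated pool. It is a simplified model under my own assumptions, not Ceph's actual peering logic, and the function name pg_state is made up for illustration.

```python
def pg_state(size: int, min_size: int, replicas_up: int) -> str:
    """Simplified model of a replicated PG's state.

    Assumption: a PG serves I/O as long as at least min_size replicas
    are up. Real Ceph peering involves more states and timing than this.
    """
    if not 1 <= min_size <= size:
        raise ValueError("expected 1 <= min_size <= size")
    if not 0 <= replicas_up <= size:
        raise ValueError("expected 0 <= replicas_up <= size")

    if replicas_up == size:
        # All replicas present: fully healthy.
        return "active+clean"
    if replicas_up >= min_size:
        # Fewer replicas than desired, but still enough to serve I/O.
        return "active+undersized+degraded"
    # Too few replicas to serve I/O.
    return "inactive"


# My pool settings: size = 2, min_size = 1.
for up in (2, 1, 0):
    print(up, pg_state(size=2, min_size=1, replicas_up=up))
```

With one of the two nodes down (replicas_up = 1), this model says the PGs should stay active but degraded, which is what I see on a clean shutdown and what I expected after the crash as well.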
Does anyone have an idea why the cluster stops working after a node crash?
Thanks in advance!
René