Proxmox Ceph no rebalancing

boero

New Member
Jan 20, 2025
Hi everyone,
I have set up a Proxmox Hyper-Converged Ceph Cluster. The VMs etc. are running as desired, so I am now testing crash scenarios.

Here I run into the following problem: if the storage network interfaces of a node fail (see the current setup below), no rebalancing takes place, the VMs on the affected node are no longer accessible, and migrating them to another node is not possible either.

If all network interfaces (HA and Ceph) fail, the migration works and everything continues to run as desired. If a node is shut down, the migration also works and everything continues to run as desired.

I am currently testing the crash scenario with my node “pve03”. I have attached a few screenshots for an overview.

Does anyone have a solution?

The setup is as follows:
- 3 nodes with 2 OSDs each.
- Full Mesh Routed Setup (with Fallback), two 10 GbE network interfaces each for Ceph
- one 1 GbE network interface each for HA
- PGs for the Ceph pools are configured according to optimal settings as per the pool overview
- Ceph version: 18.2.4
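
For reference, the pool's replica count and PG settings can be checked from any node with the standard tooling; `<poolname>` below is a placeholder, not the actual pool name from my screenshots:
Code:
# list all pools with size, min_size and PG count
pveceph pool ls
# or query a single pool directly
ceph osd pool get <poolname> size
ceph osd pool get <poolname> pg_num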
 

Attachments

  • 2025-01-20 12_52_31-pve01 - Proxmox Virtual Environment und 7 weitere Seiten - Geschäftlich – ...png (127.1 KB)
  • 2025-01-20 12_53_59-pve01 - Proxmox Virtual Environment und 7 weitere Seiten - Geschäftlich – ...png (18.5 KB)
  • 2025-01-20 12_55_23-pve01 - Proxmox Virtual Environment und 7 weitere Seiten - Geschäftlich – ...png (30.8 KB)
That is a rather contrived scenario, where you intentionally removed both NICs. With one gone, everything would keep working. That is what that network setup can protect you from, but not if both are down.

From the POV of Ceph, the node is gone and therefore the OSDs are shown as down and, since they were down long enough, as out. This is only a 3-node cluster, therefore Ceph cannot recover the data to another node, as the remaining two already have replicas. There can only be one replica per node.
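
A minimal way to see that state from one of the remaining nodes, assuming the default tooling (`<poolname>` is a placeholder, and 600 s is simply the Ceph default for marking OSDs out):
Code:
# overall health, degraded PGs and recovery state
ceph -s
# the failed node's OSDs show up as down/out under their host bucket
ceph osd tree
# with a 3/2 pool and failure domain "host", each node holds exactly one replica
ceph osd pool get <poolname> size
# time in seconds after which a down OSD is marked out (default 600)
ceph config get mon mon_osd_down_out_interval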

From the POV of Proxmox VE, the node is still up and running; therefore HA won't recover the VMs to the other nodes. The VMs can no longer read or write to their disk images on Ceph, as the connection to it is completely down.
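
A quick way to confirm that view from one of the healthy nodes, using standard Proxmox VE commands (nothing specific to this cluster assumed):
Code:
# corosync/quorum view: the affected node is still listed as an online member
pvecm status
# HA view: its resources stay "started" there, so no fencing or recovery is triggered
ha-manager status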
 
That is a rather contrived scenario, where you intentionally removed both NICs. With one gone, everything would keep working. That is what that network setup can protect you from, but not if both are down.
Yes, that was done deliberately, for testing. Exactly right: if only one NIC fails, everything continues to run smoothly.

From the POV of Ceph, the node is gone and therefore the OSDs are shown as down and, since they were down long enough, as out. This is only a 3-node cluster, therefore Ceph cannot recover the data to another node, as the remaining two already have replicas. There can only be one replica per node.

Due to the replicas on the other nodes, the VMs should still work, or am I misunderstanding something?

From the POV of Proxmox VE, the node is still up and running; therefore HA won't recover the VMs to the other nodes. The VMs can no longer read or write to their disk images on Ceph, as the connection to it is completely down.

What is the best practice for this scenario?

Thanks for your reply!
 
Due to the replicas on the other nodes, the VMs should still work, or am I misunderstanding something?
The VMs act as Ceph clients and need access to the Ceph Public network. With both NICs down, that access is most likely gone, so from the POV of the guest OS the disk simply stops responding to any I/O.
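
For context, the QEMU/librbd clients first contact the monitors, whose addresses live on the Ceph Public network; they can be listed with a standard Ceph command (no cluster-specific assumptions):
Code:
# monitor addresses are bound to the public_network; if a client cannot reach
# them (or the OSDs), all RBD I/O from its VMs stalls
ceph mon dump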

What is the best practice for this scenario?
Well, the VMs cannot access their disk anymore. Therefore a hard stop will be needed. Once powered off, you can try to do an offline migration to one of the other nodes. If that fails, you can always manually move the VM configs, circumventing all safety checks:
Code:
mv /etc/pve/nodes/{source}/qemu-server/{vmid}.conf /etc/pve/nodes/{target}/qemu-server/
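
A possible sequence, with `<vmid>` and `<target>` as placeholders (hard stop first, then offline migration; the mv above only as a last resort):
Code:
# hard stop the VM whose disk I/O is stuck
qm stop <vmid>
# offline migration to a healthy node
qm migrate <vmid> <target>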
 
The VMs act as Ceph clients and need access to the Ceph Public network. With both NICs down, that access is most likely gone, so from the POV of the guest OS the disk simply stops responding to any I/O.
Okay, so for more fault tolerance I could use a third, separate NIC for the public_network? Currently public_network and cluster_network run over the same NIC / IP address.
Well, the VMs cannot access their disk anymore. Therefore a hard stop will be needed. Once powered off, you can try to do an offline migration to one of the other nodes. If that fails, you can always manually move the VM configs, circumventing all safety checks:
Code:
mv /etc/pve/nodes/{source}/qemu-server/{vmid}.conf /etc/pve/nodes/{target}/qemu-server/
I will try that!
 
Okay, so for more fault tolerance I could use a third, separate NIC for the public_network? Currently public_network and cluster_network run over the same NIC / IP address.
Will it be fast and redundant? Keep in mind that the Ceph Public network is the main Ceph network. The Ceph Cluster network is optional and can be placed on a different physical network to take load away.
https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/
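
As an illustration only (the subnets below are made up, not taken from this thread), both networks are defined in /etc/pve/ceph.conf and can be split like this:
Code:
# /etc/pve/ceph.conf (excerpt) - example subnets, adjust to your own setup
[global]
    # monitors + client traffic, the "main" Ceph network
    public_network = 10.10.10.0/24
    # optional: OSD replication/heartbeat traffic
    cluster_network = 10.10.20.0/24
Note that the monitors bind to addresses in the public_network, so changing it later typically means recreating or re-addressing the monitors.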

The Full-Mesh routed with fallback option helps against the loss of one network cable in the mesh. What do you want to protect against? If both NICs are on the same PCI card, that card itself could fail; in that case, rather add or move one of the ports to a NIC on a different PCI card.
 
Will it be fast and redundant? Keep in mind, that the Ceph Public network is the main Ceph network. The Ceph Cluster network is optional and can be placed on a different physical network to take load away.
https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/
Thanks for the information!
The Full-Mesh routed with fallback option helps against the loss of one network cable in the mesh. What do you want to protect against? If both NICs are on the same PCI card, that card itself could fail; in that case, rather add or move one of the ports to a NIC on a different PCI card.
Makes sense, thank you!


When I stop the VMs and migrate them offline, it works perfectly fine. I have all the information I need. Thanks for your help!
 