Hi there
I am running a 3-node Proxmox cluster with Ceph and about 60 HA-managed VMs on it.
All three nodes are equally sized, with one EPYC CPU, 512 GB of RAM, and some OSDs each.
All VMs have a fixed RAM size.
My problem is with bulk migration of VMs / evacuation of nodes.
I like to evacuate a node to apply patches and reboot into a newer kernel.
This is how it looks before:
node 1 - 30 VMs
node 2 - 30 VMs
node 3 - 0 VMs
If I do a bulk migration of all VMs from node 1 to node 3, everything works as expected.
But since the preselected node in the migration dialogue is the one with the lowest number (not counting the source node itself), which is node 2 in this case, there is a risk of migrating to the already populated node.
I made sure to always double-check, but today was one of those Mondays where I simply forgot.
VMs were migrated until only about 10 GB of RAM were left free on node 2, but that was not enough headroom, so the OOM killer killed two RAM-intensive VMs.
I do not want this to happen again any time soon, and I do not think it is enough to just be cautious while doing admin work. What I would like is a technical solution to my problem, and there may be one that I am not aware of.
Things I could think of:
1. Proxmox preselects the node with the lowest memory load.
2. Proxmox checks that the summed maximum RAM of all VMs to be migrated, plus a security margin for Ceph and whatever else may need RAM, is less than the free RAM on the destination host (see the sketch after this list for a scripted workaround).
3. Something like VMware's DRS would have prevented this by balancing the nodes in time. If I am not mistaken, there is work going on to implement something similar.
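In the meantime, the only workaround I could come up with is scripting the check myself against the API before clicking "Bulk Migrate". Here is a minimal sketch of ideas 1 and 2, assuming the third-party proxmoxer Python library; the hostname, API token, and the 32 GiB margin are placeholders you would have to adapt:

```python
#!/usr/bin/env python3
# Pre-flight check before a bulk migration: pick the node with the most
# free RAM and verify that all running VMs from the source node fit,
# with a safety margin left over. Sketch only -- it assumes the
# third-party "proxmoxer" library; hostname, token, and the 32 GiB
# margin below are made-up placeholders.
from proxmoxer import ProxmoxAPI

SAFETY_MARGIN = 32 * 1024**3  # reserve for Ceph, KVM overhead, OS caches

proxmox = ProxmoxAPI(
    "pve1.example.com",        # placeholder cluster host
    user="root@pam",
    token_name="migrate",      # placeholder API token
    token_value="00000000-0000-0000-0000-000000000000",
    verify_ssl=True,
)

def free_mem(node: str) -> int:
    """Free RAM on a node in bytes, as reported by /nodes/{node}/status."""
    return proxmox.nodes(node).status.get()["memory"]["free"]

def required_mem(node: str) -> int:
    """Sum of the configured (max) RAM of all running VMs on a node."""
    return sum(
        vm["maxmem"]
        for vm in proxmox.nodes(node).qemu.get()
        if vm["status"] == "running"
    )

def least_loaded(exclude: str) -> str:
    """Idea 1: preselect the node with the most free RAM, excluding the source."""
    names = [n["node"] for n in proxmox.nodes.get() if n["node"] != exclude]
    return max(names, key=free_mem)

source = "node1"
target = least_loaded(source)
need = required_mem(source)
have = free_mem(target) - SAFETY_MARGIN  # idea 2: keep a security margin

print(f"target {target}: need {need / 1024**3:.1f} GiB, usable {have / 1024**3:.1f} GiB")
if need > have:
    raise SystemExit("refusing: the target node would run out of RAM")
print("OK to bulk-migrate")
```

That still leaves a race if VMs get started elsewhere while the migration runs, so it is no substitute for a real scheduler, but it would have caught my Monday mistake.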
I'm looking forward to hearing your ideas and how you deal with this.
All the best
123pve!