Understanding LXC Migration Behavior in a Proxmox HA Cluster with 3 Nodes

ensnare

Member
Aug 23, 2020
I'm currently managing a Proxmox cluster with three nodes configured for high availability (HA). I've observed some behaviors regarding LXC container management and failover mechanisms, and I'd appreciate any insights or clarifications you might offer.
  1. Preventing Duplicate LXC Instances: In our HA setup, I'm curious about the safeguards Proxmox has in place to prevent an LXC container from starting on two nodes simultaneously. How does the system ensure that the same container does not accidentally run on multiple nodes at the same time, particularly if a node fails and then comes back online quickly? Is the duplicate-instance scenario possible, and is there anything I can do to prevent it?
  2. Behavior During Sequential Node Failures: In a scenario where we have three nodes, if the node hosting an LXC container fails, the system successfully starts the container on a second node. However, if this second node also fails shortly after, the container does not attempt to migrate to the third node. Is this behavior expected? Are there specific configurations or limitations that prevent the container from migrating to the remaining operational node?
Understanding these aspects will help me better manage our resources and ensure maximum uptime. Thanks in advance for your help and guidance!
 
I do not use containers, but:
  • a service (whether a VM or a container) can only run on one node at any given point in time. This is fundamental, and the core of the PVE software takes care of it
  • when the second node fails (and both stay off/dead), the third one is alone. It loses quorum and will fence itself, i.e. reboot. When it starts back up, quorum is still missing (assuming the first two nodes are still dead), so it will not try to start any LXC or VM
  • only when a second node starts up successfully can they form a quorum (two out of three votes), and these two will then start all HA services as expected
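The majority rule behind those bullets is simple arithmetic; here is a toy illustration (not Proxmox code) of how a partition decides whether it is quorate:

```python
def has_quorum(expected_votes: int, votes_present: int) -> bool:
    """A cluster partition is quorate only if it holds a strict
    majority of the expected votes: floor(n/2) + 1."""
    needed = expected_votes // 2 + 1
    return votes_present >= needed

# Three-node cluster, one vote per node:
print(has_quorum(3, 2))  # True  - two surviving nodes can run HA services
print(has_quorum(3, 1))  # False - a lone node fences itself instead
```

This is why a single surviving node out of three never starts services: 1 vote is below the required 2.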
See also https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_quorum and several threads in this forum...
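On the second question, besides quorum loss, an HA group restriction can also prevent recovery onto a specific node. A sketch of what `/etc/pve/ha/groups.cfg` might look like, with hypothetical group and node names (see the admin guide linked above for the exact syntax):

```
group: prefer-node1
        nodes node1:2,node2:1
        restricted 1
        nofailback 0
```

With `restricted 1` and node3 absent from the node list, a resource in this group would never be recovered onto node3, even if it is the only node left with quorum.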
 
Interesting, thanks. If I could ask more specifically, what might happen in this situation:

A VM is running on Node 1. Nodes 1,2,3 are all connected to the same VLAN. HA is enabled such that the VM can migrate to nodes 2 or 3 if node 1 goes down.

Now, Node 1 is partitioned to a different VLAN and can no longer communicate with Nodes 2 or 3, but still maintains internet access.

Presumably, the VM would begin to run on Node 2 or 3.

My questions are:
- What happens to the running instance of the VM on Node 1?
- Will there ever be any overlap of the VM instance running on Node 1, or Nodes 2/3?
- What actually happens here?
 
- What happens to the running instance of the VM on Node 1?
The node will reboot - it will "fence" itself. The VM will not be started again on this node. Node 1 can never regain quorum without the network connectivity to reach Nodes 2/3.
- Will there ever be any overlap of the VM instance running on Node 1, or Nodes 2/3?
No.
Overlap could only happen if the VM on Node 2/3 started immediately and/or Node 1 waited too long to shut down. I have not tested this scenario myself, but I am fairly sure the Proxmox developers were aware of it and tuned the timing accordingly.
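That timing argument boils down to an ordering of two timers: the isolated node's watchdog must be guaranteed to have fired before the surviving quorum recovers the resource. A toy sketch with assumed, illustrative values (the real timeouts live inside ha-manager and the watchdog, and are not these exact numbers):

```python
# Assumed values for illustration only - not the actual Proxmox timeouts.
WATCHDOG_TIMEOUT = 60   # seconds until the isolated node self-fences (reboots)
RECOVERY_DELAY = 120    # seconds the quorate side waits before stealing the resource

def safe_to_recover(seconds_since_partition: float) -> bool:
    """The surviving nodes may only start the VM once the partitioned
    node is certain to have been fenced, so the recovery delay must
    exceed the watchdog timeout and must have fully elapsed."""
    return (RECOVERY_DELAY > WATCHDOG_TIMEOUT
            and seconds_since_partition >= RECOVERY_DELAY)

print(safe_to_recover(90))   # False - still inside the recovery delay
print(safe_to_recover(130))  # True  - the old instance is certainly gone
```

As long as the recovery delay strictly exceeds the fencing timeout, there is no window in which both instances can run at once.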
 