Hi,
We have a 3-node Proxmox cluster connected to our Ceph storage cluster. We're still testing all the options and stability, and our last test didn't give us the high availability we expected. Let me start by saying we're very satisfied with Proxmox; everything seems very solid and stable. Good work!
The cluster is set up and all nodes can see each other: quorum is OK, rgmanager is running, fencing is configured and tested, and everything works as expected. All VMs are HA-enabled.
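For reference, the kind of checks we run to verify this are along these lines (the node name below is just an example, not necessarily our real hostname):

  # show cluster membership and quorum state
  pvecm status
  # show cluster nodes and the rgmanager-managed pvevm:<vmid> services
  clustat
  # manually trigger fencing for a node to confirm the fence device works
  fence_node proxmox2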
- When we reboot a node, the VMs on that node are shut down, migrated to the other nodes, and started, all automatically. Great!
- When we shut down the (cluster) network interface on a node (ifconfig vmbr199 down), the other nodes decide the node is dead and fence it, the VMs are migrated and started, the dead node gets rebooted by the fencing action, and the cluster state becomes OK again automatically.
So far so good. But...
When we cut the power hard (pull the plugs), the web interface shows that node in red (offline), but the VMs are not started on the other nodes; we waited 10 minutes! Within that time we also tried to start an HA-enabled VM that had been stopped beforehand, but it wouldn't start and just kept loading ("starting VM 103..." or something like that). Only after we powered the node back on could that VM start, and the VMs from the 'failed' node were then migrated to the other nodes as it came back online.
This should work for HA-enabled VMs, right? When a node crashes because of a severe hardware failure (power supplies, motherboard, etc.), the other nodes should start the VMs that were running on it. And yes, we have redundant power supplies, and fencing is configured against the Dell DRAC cards.
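To illustrate the setup, here is roughly what the DRAC fencing part of /etc/pve/cluster.conf looks like in a configuration like ours (IPs, credentials and names below are placeholders, not our literal values; we use the fence_drac5 agent):

  <clusternodes>
    <clusternode name="proxmox1" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="drac-node1"/>
        </method>
      </fence>
    </clusternode>
    <!-- proxmox2 and proxmox3 are defined the same way -->
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_drac5" name="drac-node1" ipaddr="192.168.1.11" login="root" passwd="secret" secure="1"/>
  </fencedevices>
  <rm>
    <!-- HA-managed VMs appear here -->
    <pvevm autostart="1" vmid="103"/>
  </rm>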
If you need additional information, please let me know. Thanks in advance.