Hello all,
I have a small three-node cluster using Ceph with a single NVMe OSD in each node. No issues with storage at all. The cluster network is connected directly through a pair of dedicated 2.5Gb NICs on each node (I can't recall the precise configuration, but I don't think it's relevant here). The vmbr/LAN side is a bond of the motherboard gigabit NIC and a USB ethernet adapter on each node.
The problem I'm having is that a failure of "eth0" on a node doesn't produce a failure condition that forces the VMs and LXCs to migrate to a different node. The cluster still reports itself as healthy (which, from its limited perspective, is correct since the cluster communication runs over a different network), but I lose the ability to trigger the migration remotely.
I have added some USB ethernet NICs to help offset the pre-existing unreliability of my motherboard NICs, but that's just a minor mitigation, not a solution.
I didn't consider that this situation could occur when I planned the cluster. I have looked at a few possible solutions:
1. A local health check that confirms access to the LAN; if it fails, put the node into maintenance mode and then reboot it (rough sketch below).
2. Complex routing logic to carry LAN traffic across the mesh network as well (e.g. eth0 on Node1 could technically provide connectivity to Node2's vmbr via the existing mesh).
3. Combining the cluster/Corosync and LAN traffic into one highly fault-tolerant bond (the two 2.5Gb NICs, the motherboard gigabit, and the USB ethernet adapter).
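For option 1, this is roughly the kind of check I'm imagining, run on each node (as a systemd service or cron job). It's only a sketch: the gateway address is a placeholder for whatever LAN address makes sense to probe, and I'm assuming "ha-manager crm-command node-maintenance enable" is available on the installed PVE version (7.3+); if not, the action could just be a plain reboot and let HA fencing take over.

#!/usr/bin/env python3
# Sketch of option 1: probe the LAN gateway; after several consecutive
# failures, put this node into HA maintenance mode (so guests get moved
# over the still-working cluster network) and then reboot.
import socket
import subprocess
import time

GATEWAY = "192.168.1.1"      # placeholder: LAN address to probe
FAILS_BEFORE_ACTION = 3      # consecutive failed probes before acting
CHECK_INTERVAL = 10          # seconds between probes

def lan_ok() -> bool:
    # Single ICMP probe; non-zero exit code means no reply within 2s.
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", GATEWAY],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def main() -> None:
    failures = 0
    while True:
        failures = 0 if lan_ok() else failures + 1
        if failures >= FAILS_BEFORE_ACTION:
            node = socket.gethostname()
            # Assumes PVE 7.3+; enabling maintenance mode asks HA to
            # relocate resources before we take the node down.
            subprocess.run(["ha-manager", "crm-command",
                            "node-maintenance", "enable", node])
            subprocess.run(["systemctl", "reboot"])
            return
        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    main()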
Any suggestions on how to proceed?
Thanks,
Corey