Is it possible for HA to simply monitor a network link?

ulysse31

Oct 29, 2021
Hi,

It may be a silly question, but I'll ask it anyway...
I've built a 3-node cluster. Each node has two bonds of two physical interfaces each: one bond is a VLAN trunk for VM network assignment (via a VLAN-aware bridge on top of the bond), the other bond is for the Proxmox interface and management (one management VLAN).
The cluster was set up to use both links (the trunk bond has an IP address on the native VLAN of the LACP bond).
If the "VM bond" goes down (let's say a cable failure on both fibers), the VMs (which were added to HA) are not migrated to another cluster node.
The node itself sees that the bond is down (a message declaring the bond down appears in dmesg).
But in the web GUI, the interfaces (the bond as well as its physical interfaces) are shown as "LINK - Yes" in the network section of the impacted node.
Is it possible to set up HA (or something on each node, I don't know) to stop all running VMs if the "VM bond" link goes down, and let HA start them on another node where the "VM bond" is up?
Thanks for the kind answers and help.

Regards,

--
Ulysse31
 
Not from the PVE side.

You need to implement something like STONITH (locally or remotely), for example shutting down the affected node.
 
Hello All,

I've been searching the documentation to solve this issue, and may have found something with watchdog and softdog.
It seems that on all Proxmox nodes in an HA cluster, softdog is enabled by default. That means a service (watchdog-mux) opens /dev/watchdog and writes data to it regularly, at intervals shorter than 60 seconds; otherwise, the node reboots.
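To illustrate the keepalive mechanism described above, here is a minimal Python sketch of how a process "pets" a Linux watchdog device. It is not the watchdog-mux implementation; the function name `pet_watchdog` and its parameters are made up for illustration. It relies on the generic Linux watchdog behaviour: any write resets the countdown, and writing the character `V` just before closing asks the driver to stop the timer (if the driver supports this "magic close"). On a real PVE node /dev/watchdog is already held by watchdog-mux, so this should only be tried against a scratch file or a test machine.

```python
import time


def pet_watchdog(device_path, beats, interval=10.0, sleep=time.sleep):
    """Write `beats` keepalives to device_path, then request a clean disarm.

    device_path is a parameter so the logic can be exercised against a
    regular file instead of a real watchdog device.
    """
    # Unbuffered binary mode so every write reaches the device immediately.
    with open(device_path, "wb", buffering=0) as dev:
        for _ in range(beats):
            dev.write(b"\0")  # any write resets the watchdog countdown
            sleep(interval)
        # "Magic close": tells the driver the close is intentional.
        dev.write(b"V")
```

If the process dies without writing `V`, the keepalives stop and the watchdog timer expires, which is why killing the keepalive process forces a reboot.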
I've built a little Perl daemon that monitors the link on the bond0 interface (a crucial interface for VM hosting; without that bond, the node is useless).
The daemon works in two consecutive phases:
- arming phase: it polls the interface and checks that the link is up and steady. The time interval and number of checks are configurable; for now, I set it to check 3 times, every 20 seconds, which means the link must be steady for 60 seconds. Once this step is validated, it moves to the watching phase.
- watching phase: it polls the interface and checks that the link is up, for now every 10 seconds. If the link is down more than 5 consecutive times (50 seconds total), it kills the watchdog-mux process, forcing the machine to reboot.
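The two phases above can be sketched roughly as follows. This is a Python illustration, not the Perl daemon itself, and all function names (`link_is_up`, `arm`, `watch`) are hypothetical. It assumes the link state is read from the kernel's `/sys/class/net/<iface>/carrier` file, and it treats the fifth consecutive failed poll as the trigger so that the timing matches the 50-second figure. What to do once the link is considered lost (kill watchdog-mux, reboot, ...) is deliberately left to the caller.

```python
import time


def link_is_up(iface="bond0"):
    """Return True if the kernel reports carrier on iface (assumed sysfs path)."""
    try:
        with open(f"/sys/class/net/{iface}/carrier") as f:
            return f.read().strip() == "1"
    except OSError:
        # Missing or administratively down interface: treat as link down.
        return False


def arm(check, checks=3, interval=20.0, sleep=time.sleep):
    """Arming phase: return once `checks` consecutive polls saw the link up."""
    good = 0
    while good < checks:
        # Any failed poll resets the streak, so the link must be steady.
        good = good + 1 if check() else 0
        sleep(interval)


def watch(check, max_failures=5, interval=10.0, sleep=time.sleep):
    """Watching phase: return once `max_failures` consecutive polls failed."""
    bad = 0
    while bad < max_failures:
        sleep(interval)
        # A successful poll resets the failure counter.
        bad = 0 if check() else bad + 1
```

A daemon built on this would call `arm(link_is_up)`, then `watch(link_is_up)`, and run its recovery action when `watch` returns. Passing `check` and `sleep` as parameters keeps the timing logic testable without a real interface.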

Since this daemon starts at boot, if the link stays down, the daemon will wait in the arming phase until the link comes up and is steady.
I do not like killing watchdog-mux and would have preferred something cleaner / softer. If someone has an idea or suggestion, I would be really thankful ^^'

Thanks for your help,

--
Ulysse31
 
UPDATE:

It seems I could also set the "Default maintenance policy" of HA to "Migrate" and send a "reboot" command instead of killing the watchdog-mux process...
That seems cleaner, no?
Thanks for the help


--
Ulysse31
 
