[TUTORIAL] [High Availability] Watchdog reboots

I'm more than happy to do the test again. Are you able to provide the commands/logs you want me to analyse?
Yeah, if you can do another test, I am interested in the pvecm status output of node pve/Node1 a few seconds after you disconnect pve1/Node2, but before it eventually fences itself (if something is wrong).

I think that info is not yet in the previous posts, unless I missed it.

Because if, for some reason, it only reports one vote, then the fence is expected, and you should further investigate why it apparently can't get the additional vote from the qdevice.
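For reference, a quick way to verify whether the node actually sees the QDevice vote (assuming the standard corosync-qdevice / corosync-qnetd packages are in use):
Code:
# on the PVE node: membership and vote summary
pvecm status

# on the PVE node: state of the qdevice daemon and its connection to the qnetd server
corosync-qdevice-tool -sv

# on the QDevice host (the Pi): list the connected cluster nodes
corosync-qnetd-tool -l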
 
@aaron here you go

12:16:58 ping lost from my pc to PVE1
12:17:00 PiServer Quorum lost PVE1
12:18:07 pvecm status showed lost membership status
12:18:xx reboot
12:18:19 PiServer Quorum lost PVE

PVE journalctl grep on "watchdog". Edit: both pve journalctl logs from that time frame are attached.
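A filter along these lines should reproduce the snippet below (the time window is just an example):
Code:
journalctl --since "2025-08-22 12:10" --until "2025-08-22 12:25" | grep -i watchdog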


Code:
Aug 22 12:14:06 pve pve-ha-crm[1201]: watchdog active
Aug 22 12:17:47 pve watchdog-mux[747]: client watchdog is about to expire
Aug 22 12:17:57 pve watchdog-mux[747]: client watchdog expired - disable watchdog updates
Aug 22 12:17:58 pve watchdog-mux[747]: exit watchdog-mux with active connections
Aug 22 12:17:58 pve kernel: watchdog: watchdog0: watchdog did not stop!
Aug 22 12:18:31 pve kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
Aug 22 12:18:33 pve watchdog-mux[758]: Watchdog driver 'Software Watchdog', version 0
Aug 22 12:18:34 pve nut-monitor[995]: upsnotify: logged the systemd watchdog situation once, will not spam more about it
Aug 22 12:18:35 pve corosync[1151]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog augeas systemd xmlconf vqsim nozzle snmp pie relro bindnow
Aug 22 12:18:35 pve corosync[1151]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Aug 22 12:20:36 pve pve-ha-crm[1216]: watchdog active
Aug 22 12:21:41 pve pve-ha-lrm[1300]: watchdog active

Code:
Aug 22 12:17:00 piserver corosync-qnetd[1617370]: Client ::ffff:172.16.1.21:39638 doesn't sent any message during 12000ms. Disconnecting
Aug 22 12:18:19 piserver corosync-qnetd[1617370]: Client ::ffff:172.16.1.20:40332 doesn't sent any message during 12000ms. Disconnecting


Code:
root@pve:~# while true; do \
    date +"%Y-%m-%d %H:%M:%S" | tee -a /root/pvecm_status.log; \
    pvecm status | tee -a /root/pvecm_status.log; \
    echo "----------------------------------------" | tee -a /root/pvecm_status.log; \
    sleep 2; \
done

Date:             Fri Aug 22 12:17:00 2025

<full data in attachment>

Quorum information
------------------
Date:             Fri Aug 22 12:17:59 2025
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.2a5
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 172.16.1.20 (local)
0x00000002          1    A,V,NMW 172.16.1.21
0x00000000          1            Qdevice
----------------------------------------
2025-08-22 12:18:01
Cluster information
-------------------
Name:             Home
Config Version:   11
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Aug 22 12:18:02 2025
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.2a5
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 172.16.1.20 (local)
0x00000002          1    A,V,NMW 172.16.1.21
0x00000000          1            Qdevice
----------------------------------------
2025-08-22 12:18:04
Cluster information
-------------------
Name:             Home
Config Version:   11
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Aug 22 12:18:07 2025
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1.2a9
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 172.16.1.20 (local)
0x00000000          1            Qdevice

lost due to reboot
----------------------------------------
 

Attachments

Can you please post your /etc/pve/corosync.conf file? And make sure that the /etc/pve/corosync.conf and /etc/corosync/corosync.conf files are the same.
 
Can you please post your /etc/pve/corosync.conf file? And make sure that the /etc/pve/corosync.conf and /etc/corosync/corosync.conf files are the same.
Sure, see attached. In my initial search for the reboots I increased the totem values to survive a switch upgrade of 3 minutes. I checked the files and compared them; they all look the same to me.
 

Attachments

Well, those long timeouts are most likely the explanation. If corosync takes too long to form a new quorum with just the QDevice, it might take longer than the 60s timeout of the LRM!

Please set it back to defaults, from one of my test clusters:
Code:
quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: aarontest
  config_version: 4
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
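For reference, the totem timings corosync is actually running with can be read from the cmap at runtime, roughly like this (key names assume the default cmap layout):
Code:
# effective token timeout and related timers
corosync-cmapctl | grep -i totem.token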

I increased the totem values to survive a switch upgrade of 3 minutes
That's where multiple redundant networks and Corosync links come in handy. :)
If you can't have that in a smaller setup, you can disarm HA by setting all HA resources to "ignored" and waiting for ~10 min until the LRMs are inactive.
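A minimal sketch of how that could look on the CLI (vm:102 is just the resource from this thread, used as an example):
Code:
# stop HA from managing the resource while you do network maintenance
ha-manager set vm:102 --state ignored

# check that the LRMs have gone idle before pulling any cables
ha-manager status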

If the hosts have enough NICs, you could also add a direct cable between them as a second Corosync link.
 
That's a great catch! I did not recall it until you asked for the files, otherwise I would have shared this fact earlier. Thanks for the great support!

You mean two corosync paths, one back to back and one via the switches to the Quorum Pi?
 
You mean two corosync paths, one back to back and one via the switches to the Quorum Pi?
Yeah. I assume you have one interface for everything on the hosts, that goes to the switch, right?
The single point of failure there is the switch.

If you can add a direct cable between the hosts, without a switch in between, you can configure a second IP subnet on it and add it as second Corosync link. This way, the nodes can still talk to each other if the switch is down. And they still have quorum, even if the QDevice is unreachable. In such a scenario, no node is allowed to fail though!

https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy explains how to add another Corosync link to an already existing cluster.

Corosync will switch between the configured networks if the currently used one becomes unusable (down, high latency, ...)
 
@aaron this is way better, even perfect ;-) You can even see pve1 rebooting because the direct link goes down; I had just patched that link but not configured it yet!

Many thanks for the support, I appreciate it.

Code:
Aug 22 17:14:29 pve corosync[133461]:   [KNET  ] link: host: 2 link: 0 is down
Aug 22 17:14:29 pve corosync[133461]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 22 17:14:29 pve corosync[133461]:   [KNET  ] host: host: 2 has no active links
Aug 22 17:14:30 pve corosync[133461]:   [TOTEM ] Token has not been received in 2250 ms
Aug 22 17:14:30 pve corosync[133461]:   [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensus.
Aug 22 17:14:34 pve corosync[133461]:   [QUORUM] Sync members[1]: 1
Aug 22 17:14:34 pve corosync[133461]:   [QUORUM] Sync left[1]: 2
Aug 22 17:14:34 pve corosync[133461]:   [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Aug 22 17:14:34 pve corosync[133461]:   [TOTEM ] A new membership (1.2cc) was formed. Members left: 2
Aug 22 17:14:34 pve corosync[133461]:   [TOTEM ] Failed to receive the leave message. failed: 2
Aug 22 17:14:34 pve pmxcfs[134732]: [dcdb] notice: members: 1/134732
Aug 22 17:14:34 pve pmxcfs[134732]: [status] notice: members: 1/134732
Aug 22 17:14:34 pve corosync[133461]:   [QUORUM] Members[1]: 1
Aug 22 17:14:34 pve corosync[133461]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 22 17:14:45 pve pvedaemon[1209]: <root@pam> successful auth for user 'root@pam'
Aug 22 17:15:30 pve kernel: igc 0000:04:00.0 enp4s0: NIC Link is Down
Aug 22 17:15:30 pve kernel: vmbr10: port 1(enp4s0) entered disabled state
Aug 22 17:15:32 pve kernel: igc 0000:04:00.0 enp4s0: NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Aug 22 17:15:32 pve kernel: vmbr10: port 1(enp4s0) entered blocking state
Aug 22 17:15:32 pve kernel: vmbr10: port 1(enp4s0) entered forwarding state
Aug 22 17:15:37 pve kernel: igc 0000:04:00.0 enp4s0: NIC Link is Down
Aug 22 17:15:37 pve kernel: vmbr10: port 1(enp4s0) entered disabled state
Aug 22 17:15:39 pve kernel: igc 0000:04:00.0 enp4s0: NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
Aug 22 17:15:39 pve kernel: vmbr10: port 1(enp4s0) entered blocking state
Aug 22 17:15:39 pve kernel: vmbr10: port 1(enp4s0) entered forwarding state
Aug 22 17:15:50 pve kernel: igc 0000:04:00.0 enp4s0: NIC Link is Down
Aug 22 17:15:50 pve kernel: vmbr10: port 1(enp4s0) entered disabled state
Aug 22 17:15:59 pve kernel: igc 0000:04:00.0 enp4s0: NIC Link is Up 2500 Mbps Full Duplex, Flow Control: RX/TX
Aug 22 17:15:59 pve kernel: vmbr10: port 1(enp4s0) entered blocking state
Aug 22 17:15:59 pve kernel: vmbr10: port 1(enp4s0) entered forwarding state
Aug 22 17:16:29 pve pve-ha-crm[1216]: successfully acquired lock 'ha_manager_lock'
Aug 22 17:16:29 pve pve-ha-crm[1216]: watchdog active
Aug 22 17:16:29 pve pve-ha-crm[1216]: status change slave => master
Aug 22 17:16:29 pve pve-ha-crm[1216]: node 'pve1': state changed from 'online' => 'unknown'
Aug 22 17:17:29 pve pve-ha-crm[1216]: service 'vm:102': state changed from 'started' to 'fence'
Aug 22 17:17:29 pve pve-ha-crm[1216]: node 'pve1': state changed from 'unknown' => 'fence'
Aug 22 17:17:29 pve pve-ha-crm[1216]: lost lock 'ha_agent_pve1_lock - can't get cfs lock
Aug 22 17:17:29 pve pve-ha-crm[1216]: successfully acquired lock 'ha_agent_pve1_lock'
Aug 22 17:17:29 pve pve-ha-crm[1216]: fencing: acknowledged - got agent lock for node 'pve1'
Aug 22 17:17:29 pve pve-ha-crm[1216]: node 'pve1': state changed from 'fence' => 'unknown'
Aug 22 17:17:29 pve pve-ha-crm[1216]: service 'vm:102': state changed from 'fence' to 'recovery'
Aug 22 17:17:29 pve pve-ha-crm[1216]: recover service 'vm:102' from fenced node 'pve1' to node 'pve'
Aug 22 17:17:29 pve pve-ha-crm[1216]: service 'vm:102': state changed from 'recovery' to 'started'  (node = pve)
Aug 22 17:17:30 pve pve-ha-lrm[1300]: watchdog active
Aug 22 17:17:30 pve pve-ha-lrm[1300]: status change wait_for_agent_lock => active
Aug 22 17:17:30 pve pve-ha-lrm[150772]: starting service vm:102
Aug 22 17:17:30 pve pve-ha-lrm[150772]: <root@pam> starting task UPID:pve:00024CF5:001B6044:68A88A0A:qmstart:102:root@pam:
Aug 22 17:17:30 pve pve-ha-lrm[150773]: start VM 102: UPID:pve:00024CF5:001B6044:68A88A0A:qmstart:102:root@pam:
Aug 22 17:17:30 pve pve-ha-lrm[150772]: <root@pam> end task UPID:pve:00024CF5:001B6044:68A88A0A:qmstart:102:root@pam: OK
Aug 22 17:17:30 pve pve-ha-lrm[150772]: service status vm:102 started
Aug 22 17:18:04 pve pvedaemon[1211]: <root@pam> successful auth for user 'root@pam'
Aug 22 17:18:21 pve pveproxy[128993]: worker exit
Aug 22 17:18:21 pve pveproxy[1281]: worker 128993 finished
Aug 22 17:18:21 pve pveproxy[1281]: starting 1 worker(s)
Aug 22 17:18:21 pve pveproxy[1281]: worker 151312 started
 
Code:
root@pve:~# journalctl -b -f -u corosync
Aug 22 17:32:46 pve corosync[133461]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Aug 22 17:36:27 pve corosync[133461]:   [KNET  ] link: host: 2 link: 0 is down
Aug 22 17:36:27 pve corosync[133461]:   [KNET  ] host: host: 2 (passive) best link: 1 (pri: 1)
Aug 22 17:43:08 pve corosync[133461]:   [CFG   ] Config reload requested by node 1
Aug 22 17:43:08 pve corosync[133461]:   [TOTEM ] Configuring link 0
Aug 22 17:43:08 pve corosync[133461]:   [TOTEM ] Configured link number 0: local addr: 172.16.1.20, port=5405
Aug 22 17:43:08 pve corosync[133461]:   [TOTEM ] Configuring link 1
Aug 22 17:43:08 pve corosync[133461]:   [TOTEM ] Configured link number 1: local addr: 10.10.10.20, port=5406
Aug 22 17:43:08 pve corosync[133461]:   [KNET  ] pmtud: MTU manually set to: 0
Aug 22 17:43:08 pve corosync[133461]:   [KNET  ] host: host: 2 (passive) best link: 1 (pri: 10)
^C
root@pve:~# corosync-cfgtool -s
Local node ID 1, transport knet
LINK ID 0 udp
    addr    = 172.16.1.20
    status:
        nodeid:          1:    localhost
        nodeid:          2:    disconnected
LINK ID 1 udp
    addr    = 10.10.10.20
    status:
        nodeid:          1:    localhost
        nodeid:          2:    connected
root@pve:~# corosync-quorumtool -s
Quorum information
------------------
Date:             Fri Aug 22 17:47:52 2025
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          1.2d5
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1    A,V,NMW pve (local)
         2          1   A,NV,NMW pve1
         0          1            Qdevice

And for others who want the direct link too:

Take care to read the instructions on how to edit the corosync file(s).
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.16.1.20
    ring1_addr: 10.10.10.20
  }
  node {
    name: pve1
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 172.16.1.21
    ring1_addr: 10.10.10.21
  }
}

quorum {
  device {
    model: net
    net {
      algorithm: ffsplit
      host: 172.16.1.252
      tls: on
    }
    votes: 1
  }
  provider: corosync_votequorum
}

totem {
  cluster_name: Home
  config_version: 16
  interface {
    linknumber: 0
    knet_link_priority: 1
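    # link 0: existing LAN via the switch (172.16.1.x)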
  }
  interface {
    linknumber: 1
    knet_link_priority: 10
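    # link 1: direct cable (10.10.10.x); with link_mode: passive the highest priority link is preferred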
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
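The matching piece in /etc/network/interfaces for the direct link would look roughly like this (a sketch only; the NIC name is a placeholder, the subnet matches the ring1 addresses above):
Code:
# on pve: spare NIC cabled directly to pve1, no gateway needed
auto enpXs0
iface enpXs0 inet static
        address 10.10.10.20/24

# on pve1 the same block, with address 10.10.10.21/24
# apply with: ifreload -a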
 
Yeah. I assume you have one interface for everything on the hosts, that goes to the switch, right?
The single point of failure there is the switch.

If you can add a direct cable between the hosts, without a switch in between, you can configure a second IP subnet on it and add it as second Corosync link. This way, the nodes can still talk to each other if the switch is down. And they still have quorum, even if the QDevice is unreachable. In such a scenario, no node is allowed to fail though!

https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy explains how to add another Corosync link to an already existing cluster.

Corosync will switch between the configured networks if the currently used one becomes unusable (down, high latency, ...)
Wouldn't it be better to just have a second switch and enable STP? Or have it on a separate parallel subnet, that could also be another possibility I guess.
 
Wouldn't it be better to just have a second switch and enable STP? Or have it on a separate parallel subnet, that could also be another possibility I guess.
Well, I was thinking about how I would design this for customers who move away from VMware. In VMware you just enable HA and 90% of the redundancy is magically in place.

My own vision of a good design over two datacenters would be NFS storage, as we sell that a lot; Ceph is a bridge too far for now. 2 servers per DC makes 4 of them, and connect them all with a bond to two separate switches which run independently during upgrades, like Cisco Nexus. Create a cluster of 4 servers over the two sites. Add a second link to the quorum config and use the NFS storage NIC as an extra NIC for syncing. That's what I have learned from this topic; if I missed something please add.
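A rough sketch of such a bond in /etc/network/interfaces (interface names, mode and addresses are assumptions; LACP across two independent switches only works if they support MLAG/vPC, otherwise active-backup is the safe default):
Code:
auto bond0
iface bond0 inet manual
        bond-slaves enp1s0 enp2s0
        bond-miimon 100
        bond-mode active-backup

auto vmbr0
iface vmbr0 inet static
        address 172.16.1.20/24
        gateway 172.16.1.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0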
 
Hi,

Good point so 3 per site is the best fit.
Only if you create 2 clusters (1 per site); otherwise the problem is still the same: for 6 nodes to have quorum you need 4 nodes reachable (quorum = floor(votes/2) + 1).

If you really want to have 1 cluster, with 2 sites and 2 nodes per site, here is some advice:
1 - Make sure you have low latency between the 2 sites (less than 5 ms, see: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_cluster_network_requirements)
2 - use at least 1 dedicated NIC/network for corosync
3 - use an external vote with a qdevice (on a third site or at an external provider) if you want to have 2+2 nodes

Unlike corosync itself, a QDevice connects to the cluster over TCP/IP. The daemon can also run outside the LAN of the cluster and isn't limited to the low latency requirements of corosync.
See: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_corosync_external_vote_support
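For completeness, the QDevice part boils down to roughly this (package names and the pvecm command are from the admin guide; the IP is a placeholder):
Code:
# on the external host (third site / external provider):
apt install corosync-qnetd

# on all cluster nodes:
apt install corosync-qdevice

# on one cluster node, register the external vote:
pvecm qdevice setup <QDEVICE-IP>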

Whole cluster doc : https://pve.proxmox.com/pve-docs/pve-admin-guide.html#chapter_pvecm

Best regards,
 