PVE watchdog reboots node even though quorum has re-formed

rustine22

New Member
Jun 9, 2024
Hello,
I have tried to simulate a network unplug on a PVE node for 25 seconds.

When the network is plugged in again, quorum re-forms correctly and the node is ready. But 60 seconds after the initial unplug, the watchdog reboots the node. If the network is unplugged for only 19 seconds, the node isn't rebooted.

It seems that once the watchdog has timed out (10 seconds), it is too late to cancel it and prevent the node from rebooting: https://github.com/ThomasLamprecht/pve-ha-manager/blob/master/src/watchdog-mux.c

Could this behavior be improved to avoid rebooting the node when quorum is OK?

2025-03-30T17:30:56.128390+02:00 pve01 kernel: [ 369.553859] vmxnet3 0000:13:00.0 ens224: NIC Link is Down
2025-03-30T17:30:56.128413+02:00 pve01 kernel: [ 369.553951] vmbr0: port 1(ens224) entered disabled state
2025-03-30T17:30:58.536691+02:00 pve01 corosync[1156]: [TOTEM ] Token has not been received in 2737 ms
2025-03-30T17:30:59.449464+02:00 pve01 corosync[1156]: [TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus.
2025-03-30T17:31:01.009380+02:00 pve01 corosync[1156]: [KNET ] link: host: 1 link: 0 is down
2025-03-30T17:31:01.009654+02:00 pve01 corosync[1156]: [KNET ] link: host: 1 link: 1 is down
2025-03-30T17:31:01.009723+02:00 pve01 corosync[1156]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
2025-03-30T17:31:01.009836+02:00 pve01 corosync[1156]: [KNET ] host: host: 1 has no active links
2025-03-30T17:31:01.009984+02:00 pve01 corosync[1156]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
2025-03-30T17:31:01.010083+02:00 pve01 corosync[1156]: [KNET ] host: host: 1 has no active links
2025-03-30T17:31:01.209172+02:00 pve01 corosync[1156]: [KNET ] link: host: 3 link: 0 is down
2025-03-30T17:31:01.209421+02:00 pve01 corosync[1156]: [KNET ] link: host: 3 link: 1 is down
2025-03-30T17:31:01.209489+02:00 pve01 corosync[1156]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
2025-03-30T17:31:01.209559+02:00 pve01 corosync[1156]: [KNET ] host: host: 3 has no active links
2025-03-30T17:31:01.209614+02:00 pve01 corosync[1156]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
2025-03-30T17:31:01.209683+02:00 pve01 corosync[1156]: [KNET ] host: host: 3 has no active links
2025-03-30T17:31:03.832601+02:00 pve01 corosync[1156]: [QUORUM] Sync members[1]: 2
2025-03-30T17:31:03.834270+02:00 pve01 corosync[1156]: [QUORUM] Sync left[2]: 1 3
2025-03-30T17:31:03.834609+02:00 pve01 corosync[1156]: [TOTEM ] A new membership (2.bd) was formed. Members left: 1 3
2025-03-30T17:31:03.834687+02:00 pve01 corosync[1156]: [TOTEM ] Failed to receive the leave message. failed: 1 3
2025-03-30T17:31:03.834754+02:00 pve01 corosync[1156]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
2025-03-30T17:31:03.834815+02:00 pve01 corosync[1156]: [QUORUM] Members[1]: 2
2025-03-30T17:31:03.834884+02:00 pve01 corosync[1156]: [MAIN ] Completed service synchronization, ready to provide service.
2025-03-30T17:31:03.835262+02:00 pve01 pmxcfs[1094]: [status] notice: node lost quorum
2025-03-30T17:31:03.835349+02:00 pve01 pmxcfs[1094]: [dcdb] notice: members: 2/1094
2025-03-30T17:31:03.835428+02:00 pve01 pmxcfs[1094]: [status] notice: members: 2/1094
2025-03-30T17:31:03.835482+02:00 pve01 pmxcfs[1094]: [dcdb] crit: received write while not quorate - trigger resync
2025-03-30T17:31:03.835535+02:00 pve01 pmxcfs[1094]: [dcdb] crit: leaving CPG group
2025-03-30T17:31:04.493105+02:00 pve01 pve-ha-lrm[1252]: lost lock 'ha_agent_pve01_lock - cfs lock update failed - Permission denied
2025-03-30T17:31:04.507506+02:00 pve01 pmxcfs[1094]: [dcdb] notice: start cluster connection
2025-03-30T17:31:04.507705+02:00 pve01 pmxcfs[1094]: [dcdb] crit: cpg_join failed: 14
2025-03-30T17:31:04.509742+02:00 pve01 pve-ha-crm[1240]: status change slave => wait_for_quorum
2025-03-30T17:31:04.510280+02:00 pve01 pmxcfs[1094]: [dcdb] crit: can't initialize service
2025-03-30T17:31:09.495491+02:00 pve01 pve-ha-lrm[1252]: status change active => lost_agent_lock
2025-03-30T17:31:09.511192+02:00 pve01 pvescheduler[3340]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
2025-03-30T17:31:09.514214+02:00 pve01 pvescheduler[3341]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
2025-03-30T17:31:10.521174+02:00 pve01 pmxcfs[1094]: [dcdb] notice: members: 2/1094
2025-03-30T17:31:10.521316+02:00 pve01 pmxcfs[1094]: [dcdb] notice: all data is up to date
2025-03-30T17:31:16.637945+02:00 pve01 kernel: [ 390.064102] vmxnet3 0000:13:00.0 ens224: NIC Link is Up 10000 Mbps
2025-03-30T17:31:16.637978+02:00 pve01 kernel: [ 390.064160] vmbr0: port 1(ens224) entered blocking state
2025-03-30T17:31:16.637981+02:00 pve01 kernel: [ 390.064168] vmbr0: port 1(ens224) entered forwarding state
2025-03-30T17:31:17.226431+02:00 pve01 corosync[1156]: [KNET ] rx: host: 1 link: 0 is up
2025-03-30T17:31:17.226571+02:00 pve01 corosync[1156]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
2025-03-30T17:31:17.226635+02:00 pve01 corosync[1156]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
2025-03-30T17:31:17.228209+02:00 pve01 corosync[1156]: [KNET ] rx: host: 3 link: 0 is up
2025-03-30T17:31:17.228519+02:00 pve01 corosync[1156]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
2025-03-30T17:31:17.228634+02:00 pve01 corosync[1156]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
2025-03-30T17:31:17.240964+02:00 pve01 corosync[1156]: [QUORUM] Sync members[3]: 1 2 3
2025-03-30T17:31:17.241067+02:00 pve01 corosync[1156]: [QUORUM] Sync joined[2]: 1 3
2025-03-30T17:31:17.241127+02:00 pve01 corosync[1156]: [TOTEM ] A new membership (1.c1) was formed. Members joined: 1 3
2025-03-30T17:31:17.256129+02:00 pve01 pmxcfs[1094]: [dcdb] notice: members: 1/81643, 2/1094, 3/387980
2025-03-30T17:31:17.256233+02:00 pve01 pmxcfs[1094]: [dcdb] notice: starting data syncronisation
2025-03-30T17:31:17.256293+02:00 pve01 pmxcfs[1094]: [status] notice: members: 1/81643, 2/1094, 3/387980
2025-03-30T17:31:17.256370+02:00 pve01 pmxcfs[1094]: [status] notice: starting data syncronisation
2025-03-30T17:31:17.266258+02:00 pve01 corosync[1156]: [QUORUM] This node is within the primary component and will provide service.
2025-03-30T17:31:17.267982+02:00 pve01 corosync[1156]: [QUORUM] Members[3]: 1 2 3
2025-03-30T17:31:17.268060+02:00 pve01 corosync[1156]: [MAIN ] Completed service synchronization, ready to provide service.
2025-03-30T17:31:17.268110+02:00 ste-pve200-mon pmxcfs[1094]: [status] notice: node has quorum
2025-03-30T17:31:17.320822+02:00 ste-pve200-mon corosync[1156]: [KNET ] pmtud: Global data MTU changed to: 1397
2025-03-30T17:31:17.358730+02:00 ste-pve200-mon pmxcfs[1094]: [dcdb] notice: received sync request (epoch 1/81643/00000011)
2025-03-30T17:31:17.359605+02:00 ste-pve200-mon pmxcfs[1094]: [status] notice: received sync request (epoch 1/81643/00000011)
2025-03-30T17:31:17.387932+02:00 ste-pve200-mon pmxcfs[1094]: [dcdb] notice: received all states
2025-03-30T17:31:17.388048+02:00 ste-pve200-mon pmxcfs[1094]: [dcdb] notice: leader is 1/81643
2025-03-30T17:31:17.388106+02:00 ste-pve200-mon pmxcfs[1094]: [dcdb] notice: synced members: 1/81643, 3/387980
2025-03-30T17:31:17.388161+02:00 ste-pve200-mon pmxcfs[1094]: [dcdb] notice: waiting for updates from leader
2025-03-30T17:31:17.395914+02:00 ste-pve200-mon pmxcfs[1094]: [status] notice: received all states
2025-03-30T17:31:17.396668+02:00 ste-pve200-mon pmxcfs[1094]: [status] notice: all data is up to date
2025-03-30T17:31:17.396750+02:00 ste-pve200-mon pmxcfs[1094]: [status] notice: dfsm_deliver_queue: queue length 2
2025-03-30T17:31:17.396927+02:00 ste-pve200-mon pmxcfs[1094]: [dcdb] notice: update complete - trying to commit (got 8 inode updates)
2025-03-30T17:31:17.397759+02:00 ste-pve200-mon pmxcfs[1094]: [dcdb] notice: all data is up to date
2025-03-30T17:31:17.656826+02:00 ste-pve200-mon corosync[1156]: [KNET ] rx: host: 1 link: 1 is up
2025-03-30T17:31:17.657080+02:00 ste-pve200-mon corosync[1156]: [KNET ] link: Resetting MTU for link 1 because host 1 joined
2025-03-30T17:31:17.657196+02:00 ste-pve200-mon corosync[1156]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
2025-03-30T17:31:17.725501+02:00 ste-pve200-mon corosync[1156]: [KNET ] pmtud: Global data MTU changed to: 1397
2025-03-30T17:31:18.492392+02:00 ste-pve200-mon corosync[1156]: [KNET ] rx: host: 3 link: 1 is up
2025-03-30T17:31:18.492602+02:00 ste-pve200-mon corosync[1156]: [KNET ] link: Resetting MTU for link 1 because host 3 joined
2025-03-30T17:31:18.492716+02:00 ste-pve200-mon corosync[1156]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
2025-03-30T17:31:18.533026+02:00 ste-pve200-mon corosync[1156]: [KNET ] pmtud: Global data MTU changed to: 1397
2025-03-30T17:31:55.496622+02:00 ste-pve200-mon watchdog-mux[682]: client watchdog expired - disable watchdog updates
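
To make the timing in the log above easier to read, here is a minimal Python sketch that computes how long after the link went down each key event happened. The timestamps are copied from the log lines; the event names (`link_down`, `quorum_lost`, and so on) are labels made up for this illustration and are not PVE or corosync identifiers.

```python
from datetime import datetime

# Timestamps copied verbatim from the log lines above.
# The keys are illustrative labels, not names used by PVE or corosync.
EVENTS = {
    "link_down": "2025-03-30T17:30:56.128390+02:00",         # NIC Link is Down
    "quorum_lost": "2025-03-30T17:31:03.835262+02:00",       # pmxcfs: node lost quorum
    "lrm_lock_lost": "2025-03-30T17:31:04.493105+02:00",     # pve-ha-lrm: lost lock
    "link_up": "2025-03-30T17:31:16.637945+02:00",           # NIC Link is Up
    "quorum_regained": "2025-03-30T17:31:17.268110+02:00",   # pmxcfs: node has quorum
    "watchdog_expired": "2025-03-30T17:31:55.496622+02:00",  # watchdog-mux: client watchdog expired
}


def seconds_since(start, events=EVENTS):
    """Return each event's offset in seconds relative to the `start` event."""
    t0 = datetime.fromisoformat(events[start])
    return {
        name: (datetime.fromisoformat(ts) - t0).total_seconds()
        for name, ts in events.items()
    }


offsets = seconds_since("link_down")

# Print the events in chronological order with their offset from link down.
for name, offset in sorted(offsets.items(), key=lambda item: item[1]):
    print(f"{name:17s} +{offset:7.3f} s")
```

Going by these timestamps, the link was down for about 20.5 seconds, quorum was reported again about 21.1 seconds after the link went down, and the watchdog-mux message was logged about 59.4 seconds after the link went down, which is roughly 38 seconds after quorum had come back.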
 