Hello,
I have tried simulating a network unplug on a PVE node for 25 seconds.
When the network is plugged back in, quorum re-forms correctly and the node is ready. But 60 seconds after the initial unplug, the watchdog reboots the node. If the network is unplugged for only 19 seconds, the node isn't rebooted.
It seems that once the watchdog has timed out (10 seconds), it is too late to cancel it and prevent the node from rebooting: https://github.com/ThomasLamprecht/pve-ha-manager/blob/master/src/watchdog-mux.c
Could this behavior be improved so that the node is not rebooted when quorum is OK?
2025-03-30T17:30:56.128390+02:00 pve01 kernel: [ 369.553859] vmxnet3 0000:13:00.0 ens224: NIC Link is Down
2025-03-30T17:30:56.128413+02:00 pve01 kernel: [ 369.553951] vmbr0: port 1(ens224) entered disabled state
2025-03-30T17:30:58.536691+02:00 pve01 corosync[1156]: [TOTEM ] Token has not been received in 2737 ms
2025-03-30T17:30:59.449464+02:00 pve01 corosync[1156]: [TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus.
2025-03-30T17:31:01.009380+02:00 pve01 corosync[1156]: [KNET ] link: host: 1 link: 0 is down
2025-03-30T17:31:01.009654+02:00 pve01 corosync[1156]: [KNET ] link: host: 1 link: 1 is down
2025-03-30T17:31:01.009723+02:00 pve01 corosync[1156]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
2025-03-30T17:31:01.009836+02:00 pve01 corosync[1156]: [KNET ] host: host: 1 has no active links
2025-03-30T17:31:01.009984+02:00 pve01 corosync[1156]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
2025-03-30T17:31:01.010083+02:00 pve01 corosync[1156]: [KNET ] host: host: 1 has no active links
2025-03-30T17:31:01.209172+02:00 pve01 corosync[1156]: [KNET ] link: host: 3 link: 0 is down
2025-03-30T17:31:01.209421+02:00 pve01 corosync[1156]: [KNET ] link: host: 3 link: 1 is down
2025-03-30T17:31:01.209489+02:00 pve01 corosync[1156]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
2025-03-30T17:31:01.209559+02:00 pve01 corosync[1156]: [KNET ] host: host: 3 has no active links
2025-03-30T17:31:01.209614+02:00 pve01 corosync[1156]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
2025-03-30T17:31:01.209683+02:00 pve01 corosync[1156]: [KNET ] host: host: 3 has no active links
2025-03-30T17:31:03.832601+02:00 pve01 corosync[1156]: [QUORUM] Sync members[1]: 2
2025-03-30T17:31:03.834270+02:00 pve01 corosync[1156]: [QUORUM] Sync left[2]: 1 3
2025-03-30T17:31:03.834609+02:00 pve01 corosync[1156]: [TOTEM ] A new membership (2.bd) was formed. Members left: 1 3
2025-03-30T17:31:03.834687+02:00 pve01 corosync[1156]: [TOTEM ] Failed to receive the leave message. failed: 1 3
2025-03-30T17:31:03.834754+02:00 pve01 corosync[1156]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
2025-03-30T17:31:03.834815+02:00 pve01 corosync[1156]: [QUORUM] Members[1]: 2
2025-03-30T17:31:03.834884+02:00 pve01 corosync[1156]: [MAIN ] Completed service synchronization, ready to provide service.
2025-03-30T17:31:03.835262+02:00 pve01 pmxcfs[1094]: [status] notice: node lost quorum
2025-03-30T17:31:03.835349+02:00 pve01 pmxcfs[1094]: [dcdb] notice: members: 2/1094
2025-03-30T17:31:03.835428+02:00 pve01 pmxcfs[1094]: [status] notice: members: 2/1094
2025-03-30T17:31:03.835482+02:00 pve01 pmxcfs[1094]: [dcdb] crit: received write while not quorate - trigger resync
2025-03-30T17:31:03.835535+02:00 pve01 pmxcfs[1094]: [dcdb] crit: leaving CPG group
2025-03-30T17:31:04.493105+02:00 pve01 pve-ha-lrm[1252]: lost lock 'ha_agent_pve01_lock - cfs lock update failed - Permission denied
2025-03-30T17:31:04.507506+02:00 pve01 pmxcfs[1094]: [dcdb] notice: start cluster connection
2025-03-30T17:31:04.507705+02:00 pve01 pmxcfs[1094]: [dcdb] crit: cpg_join failed: 14
2025-03-30T17:31:04.509742+02:00 pve01 pve-ha-crm[1240]: status change slave => wait_for_quorum
2025-03-30T17:31:04.510280+02:00 pve01 pmxcfs[1094]: [dcdb] crit: can't initialize service
2025-03-30T17:31:09.495491+02:00 pve01 pve-ha-lrm[1252]: status change active => lost_agent_lock
2025-03-30T17:31:09.511192+02:00 pve01 pvescheduler[3340]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
2025-03-30T17:31:09.514214+02:00 pve01 pvescheduler[3341]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
2025-03-30T17:31:10.521174+02:00 pve01 pmxcfs[1094]: [dcdb] notice: members: 2/1094
2025-03-30T17:31:10.521316+02:00 pve01 pmxcfs[1094]: [dcdb] notice: all data is up to date
2025-03-30T17:31:16.637945+02:00 pve01 kernel: [ 390.064102] vmxnet3 0000:13:00.0 ens224: NIC Link is Up 10000 Mbps
2025-03-30T17:31:16.637978+02:00 pve01 kernel: [ 390.064160] vmbr0: port 1(ens224) entered blocking state
2025-03-30T17:31:16.637981+02:00 pve01 kernel: [ 390.064168] vmbr0: port 1(ens224) entered forwarding state
2025-03-30T17:31:17.226431+02:00 pve01 corosync[1156]: [KNET ] rx: host: 1 link: 0 is up
2025-03-30T17:31:17.226571+02:00 pve01 corosync[1156]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
2025-03-30T17:31:17.226635+02:00 pve01 corosync[1156]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
2025-03-30T17:31:17.228209+02:00 pve01 corosync[1156]: [KNET ] rx: host: 3 link: 0 is up
2025-03-30T17:31:17.228519+02:00 pve01 corosync[1156]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
2025-03-30T17:31:17.228634+02:00 pve01 corosync[1156]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
2025-03-30T17:31:17.240964+02:00 pve01 corosync[1156]: [QUORUM] Sync members[3]: 1 2 3
2025-03-30T17:31:17.241067+02:00 pve01 corosync[1156]: [QUORUM] Sync joined[2]: 1 3
2025-03-30T17:31:17.241127+02:00 pve01 corosync[1156]: [TOTEM ] A new membership (1.c1) was formed. Members joined: 1 3
2025-03-30T17:31:17.256129+02:00 pve01 pmxcfs[1094]: [dcdb] notice: members: 1/81643, 2/1094, 3/387980
2025-03-30T17:31:17.256233+02:00 pve01 pmxcfs[1094]: [dcdb] notice: starting data syncronisation
2025-03-30T17:31:17.256293+02:00 pve01 pmxcfs[1094]: [status] notice: members: 1/81643, 2/1094, 3/387980
2025-03-30T17:31:17.256370+02:00 pve01 pmxcfs[1094]: [status] notice: starting data syncronisation
2025-03-30T17:31:17.266258+02:00 pve01 corosync[1156]: [QUORUM] This node is within the primary component and will provide service.
2025-03-30T17:31:17.267982+02:00 pve01 corosync[1156]: [QUORUM] Members[3]: 1 2 3
2025-03-30T17:31:17.268060+02:00 pve01 corosync[1156]: [MAIN ] Completed service synchronization, ready to provide service.
2025-03-30T17:31:17.268110+02:00 ste-pve200-mon pmxcfs[1094]: [status] notice: node has quorum
2025-03-30T17:31:17.320822+02:00 ste-pve200-mon corosync[1156]: [KNET ] pmtud: Global data MTU changed to: 1397
2025-03-30T17:31:17.358730+02:00 ste-pve200-mon pmxcfs[1094]: [dcdb] notice: received sync request (epoch 1/81643/00000011)
2025-03-30T17:31:17.359605+02:00 ste-pve200-mon pmxcfs[1094]: [status] notice: received sync request (epoch 1/81643/00000011)
2025-03-30T17:31:17.387932+02:00 ste-pve200-mon pmxcfs[1094]: [dcdb] notice: received all states
2025-03-30T17:31:17.388048+02:00 ste-pve200-mon pmxcfs[1094]: [dcdb] notice: leader is 1/81643
2025-03-30T17:31:17.388106+02:00 ste-pve200-mon pmxcfs[1094]: [dcdb] notice: synced members: 1/81643, 3/387980
2025-03-30T17:31:17.388161+02:00 ste-pve200-mon pmxcfs[1094]: [dcdb] notice: waiting for updates from leader
2025-03-30T17:31:17.395914+02:00 ste-pve200-mon pmxcfs[1094]: [status] notice: received all states
2025-03-30T17:31:17.396668+02:00 ste-pve200-mon pmxcfs[1094]: [status] notice: all data is up to date
2025-03-30T17:31:17.396750+02:00 ste-pve200-mon pmxcfs[1094]: [status] notice: dfsm_deliver_queue: queue length 2
2025-03-30T17:31:17.396927+02:00 ste-pve200-mon pmxcfs[1094]: [dcdb] notice: update complete - trying to commit (got 8 inode updates)
2025-03-30T17:31:17.397759+02:00 ste-pve200-mon pmxcfs[1094]: [dcdb] notice: all data is up to date
2025-03-30T17:31:17.656826+02:00 ste-pve200-mon corosync[1156]: [KNET ] rx: host: 1 link: 1 is up
2025-03-30T17:31:17.657080+02:00 ste-pve200-mon corosync[1156]: [KNET ] link: Resetting MTU for link 1 because host 1 joined
2025-03-30T17:31:17.657196+02:00 ste-pve200-mon corosync[1156]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
2025-03-30T17:31:17.725501+02:00 ste-pve200-mon corosync[1156]: [KNET ] pmtud: Global data MTU changed to: 1397
2025-03-30T17:31:18.492392+02:00 ste-pve200-mon corosync[1156]: [KNET ] rx: host: 3 link: 1 is up
2025-03-30T17:31:18.492602+02:00 ste-pve200-mon corosync[1156]: [KNET ] link: Resetting MTU for link 1 because host 3 joined
2025-03-30T17:31:18.492716+02:00 ste-pve200-mon corosync[1156]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
2025-03-30T17:31:18.533026+02:00 ste-pve200-mon corosync[1156]: [KNET ] pmtud: Global data MTU changed to: 1397
2025-03-30T17:31:55.496622+02:00 ste-pve200-mon watchdog-mux[682]: client watchdog expired - disable watchdog updates
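
To make the timing in the log above easier to follow, here is a small Python sketch that prints how many seconds after the link-down event each key event was logged. The timestamps are copied from the log; the event labels and the `offsets` helper are names chosen for this illustration only and are not part of any PVE tool.

```python
from datetime import datetime

# Timestamps copied verbatim from the log above; the labels are my own shorthand.
EVENTS = [
    ("NIC link down", "2025-03-30T17:30:56.128390+02:00"),
    ("node lost quorum", "2025-03-30T17:31:03.835262+02:00"),
    ("pve-ha-lrm: active => lost_agent_lock", "2025-03-30T17:31:09.495491+02:00"),
    ("NIC link up", "2025-03-30T17:31:16.637945+02:00"),
    ("node has quorum", "2025-03-30T17:31:17.268110+02:00"),
    ("client watchdog expired", "2025-03-30T17:31:55.496622+02:00"),
]


def offsets(events):
    """Return (label, seconds since the first event) for each event."""
    start = datetime.fromisoformat(events[0][1])
    return [
        (label, (datetime.fromisoformat(ts) - start).total_seconds())
        for label, ts in events
    ]


for label, seconds in offsets(EVENTS):
    print(f"{seconds:6.2f} s  {label}")
```

According to these timestamps, the link came back about 20.5 seconds after it went down, quorum was reported again at about 21.1 seconds, and the "client watchdog expired" message was logged at about 59.4 seconds, roughly 38 seconds after quorum had returned.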