I have some questions about fencing and forced node reboots, which recently caused an outage on my cluster.
I have a 3-node cluster (call the nodes 1, 2, and 3), all on the same version, 8.0.3.
Node 3 had some issues, so it was shut down. No VM was running on it.
Node 2 has been out for maintenance since a few days ago.
Node 1 was running. Unfortunately, it rebooted even though quorum was restored (expected votes set to 1) in less than 60 seconds.
Here is the syslog from node 1:
Start timestamp
2023-07-27T22:28:11.728476+08:00 pve1 corosync[2835]: [CFG ] Node 3 was shut down by sysadmin
2023-07-27T22:28:11.743625+08:00 pve1 corosync[2835]: [QUORUM] Sync members[1]: 1
2023-07-27T22:28:11.743988+08:00 pve1 corosync[2835]: [QUORUM] Sync left[1]: 3
2023-07-27T22:28:11.744208+08:00 pve1 corosync[2835]: [TOTEM ] A new membership (1.ef2) was formed. Members left: 3
2023-07-27T22:28:11.744943+08:00 pve1 corosync[2835]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
2023-07-27T22:28:11.745266+08:00 pve1 corosync[2835]: [QUORUM] Members[1]: 1
2023-07-27T22:28:11.746199+08:00 pve1 corosync[2835]: [MAIN ] Completed service synchronization, ready to provide service.
2023-07-27T22:28:12.810077+08:00 pve1 corosync[2835]: [KNET ] link: host: 3 link: 0 is down
2023-07-27T22:28:12.810318+08:00 pve1 corosync[2835]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
2023-07-27T22:28:12.810489+08:00 pve1 corosync[2835]: [KNET ] host: host: 3 has no active links
(At this point I ran 'pvecm e 1' on pve1.)
2023-07-27T22:29:05.553133+08:00 pve1 corosync[2835]: [QUORUM] This node is within the primary component and will provide service.
2023-07-27T22:29:05.553542+08:00 pve1 corosync[2835]: [QUORUM] Members[1]: 1
[Thu Jul 27 22:29:53 2023] pve1 rebooted by fencing
2023-07-27T22:30:53.161464+08:00 pve1 systemd[1]: Starting corosync.service - Corosync Cluster Engine...
2023-07-27T22:30:53.193733+08:00 pve1 corosync[3424]: [MAIN ] Corosync Cluster Engine starting up
My question is why pve1 still rebooted even though it was quorate again (Quorate: Yes).
My understanding is that a node is fenced only if quorum has been lost for 60 seconds.
From 22:28:11 to 22:29:05 is only about 54 seconds.
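To double-check the timing, here is a small Python sketch that computes the intervals from the timestamps in the log above. The variable names are my own, and I am assuming the reboot marker (which only has one-second resolution) is in the same +08:00 time zone as the syslog lines.

```python
from datetime import datetime

# Timestamps copied from the syslog excerpt above.
quorum_lost = datetime.fromisoformat("2023-07-27T22:28:11.744943+08:00")
quorum_regained = datetime.fromisoformat("2023-07-27T22:29:05.553133+08:00")
# Assumption: the reboot marker is in the same +08:00 zone (1 s resolution only).
reboot = datetime.fromisoformat("2023-07-27T22:29:53+08:00")

lost_to_regained = (quorum_regained - quorum_lost).total_seconds()
regained_to_reboot = (reboot - quorum_regained).total_seconds()
lost_to_reboot = (reboot - quorum_lost).total_seconds()

print(f"quorum lost -> regained: {lost_to_regained:.1f} s")    # 53.8 s
print(f"quorum regained -> reboot: {regained_to_reboot:.1f} s")  # 47.4 s
print(f"quorum lost -> reboot: {lost_to_reboot:.1f} s")        # 101.3 s
```

So the node was without quorum for roughly 54 seconds, and the reboot happened about 47 seconds after quorum came back, roughly 101 seconds after it was first lost.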