Primary Corosync Link Failure, Redundant Link Not Taking Over

Oct 16, 2021

Hello all,

Before I begin, I want to thank you all for your great posts/replies here. We've been using Proxmox for over a year and have never made a post, as we've always found the answers here by searching (and that may be the case here too, but I'm currently on vacation and my wife is gonna kill me if I spend 8+ hours on a computer :)). We're a large web host based in the US and are currently building out our own datacenter, using Proxmox to power our virtualization. I cannot say enough good things about the software and the community behind it.

With that out of the way: about 4 hours ago Zabbix notified us of a failed interface/link on one of our switches, immediately followed by alerts that we were unable to reach one of our Proxmox nodes in a 4-node cluster. Initially I did a graceful reboot of the failed node to see if that made any difference (it didn't). Upon further investigation it appears to be a failed optic or line cable, which we are in the process of getting replaced. In the meantime, though, we do have a redundant cluster link (second corosync ring) in place, but the failover link(s) don't appear to be taking over:

Here's the syslog on the failed node at the time of the failure (see the line at Oct 16 10:33:21; hostname replaced with "failed-node", which is nodeid 2 in the cluster):

root@failed-node:/etc/pve# grep "Oct 16 10:3" /var/log/syslog

Code:
Oct 16 10:31:06 failed-node pmxcfs[2900]: [status] notice: received log
Oct 16 10:32:00 failed-node systemd[1]: Starting Proxmox VE replication runner...
Oct 16 10:32:00 failed-node systemd[1]: pvesr.service: Succeeded.
Oct 16 10:32:00 failed-node systemd[1]: Started Proxmox VE replication runner.
Oct 16 10:33:00 failed-node systemd[1]: Starting Proxmox VE replication runner...
Oct 16 10:33:00 failed-node systemd[1]: pvesr.service: Succeeded.
Oct 16 10:33:00 failed-node systemd[1]: Started Proxmox VE replication runner.
Oct 16 10:33:21 failed-node kernel: [2493654.770594] i40e 0000:25:00.0 ens2f0: NIC Link is Down
Oct 16 10:33:23 failed-node corosync[3013]:   [KNET  ] link: host: 1 link: 0 is down
Oct 16 10:33:23 failed-node corosync[3013]:   [KNET  ] link: host: 4 link: 0 is down
Oct 16 10:33:23 failed-node corosync[3013]:   [KNET  ] link: host: 3 link: 0 is down
Oct 16 10:33:23 failed-node corosync[3013]:   [KNET  ] host: host: 1 (passive) best link: 1 (pri: 1)
Oct 16 10:33:23 failed-node corosync[3013]:   [KNET  ] host: host: 4 (passive) best link: 1 (pri: 1)
Oct 16 10:33:23 failed-node corosync[3013]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 16 10:33:23 failed-node corosync[3013]:   [KNET  ] host: host: 3 has no active links
Oct 16 10:33:29 failed-node corosync[3013]:   [TOTEM ] Token has not been received in 5514 ms
Oct 16 10:33:33 failed-node corosync[3013]:   [TOTEM ] Token has not been received in 9815 ms
Oct 16 10:33:39 failed-node corosync[3013]:   [QUORUM] Sync members[1]: 2
Oct 16 10:33:39 failed-node corosync[3013]:   [QUORUM] Sync left[3]: 1 3 4
Oct 16 10:33:39 failed-node corosync[3013]:   [TOTEM ] A new membership (2.298) was formed. Members left: 1 3 4
Oct 16 10:33:39 failed-node corosync[3013]:   [TOTEM ] Failed to receive the leave message. failed: 1 3 4
Oct 16 10:33:39 failed-node pmxcfs[2900]: [dcdb] notice: members: 2/2900
Oct 16 10:33:39 failed-node pmxcfs[2900]: [status] notice: members: 2/2900
Oct 16 10:33:39 failed-node corosync[3013]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Oct 16 10:33:39 failed-node corosync[3013]:   [QUORUM] Members[1]: 2
Oct 16 10:33:39 failed-node corosync[3013]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 16 10:33:39 failed-node pmxcfs[2900]: [status] notice: node lost quorum
Oct 16 10:33:39 failed-node pmxcfs[2900]: [dcdb] crit: received write while not quorate - trigger resync
Oct 16 10:33:39 failed-node pmxcfs[2900]: [dcdb] crit: leaving CPG group
Oct 16 10:33:39 failed-node pve-ha-lrm[3106]: unable to write lrm status file - closing file '/etc/pve/nodes/failed-node/lrm_status.tmp.3106' failed - Operation not permitted
Oct 16 10:33:42 failed-node corosync[3013]:   [TOTEM ] Token has not been received in 3226 ms
Oct 16 10:33:47 failed-node corosync[3013]:   [TOTEM ] Token has not been received in 7527 ms
Oct 16 10:33:53 failed-node corosync[3013]:   [QUORUM] Sync members[1]: 2
Oct 16 10:33:53 failed-node corosync[3013]:   [TOTEM ] A new membership (2.2a4) was formed. Members
Oct 16 10:33:53 failed-node corosync[3013]:   [QUORUM] Members[1]: 2
Oct 16 10:33:53 failed-node corosync[3013]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 16 10:33:53 failed-node pmxcfs[2900]: [dcdb] notice: start cluster connection
Oct 16 10:33:53 failed-node pmxcfs[2900]: [dcdb] crit: cpg_join failed: 14
Oct 16 10:33:53 failed-node pmxcfs[2900]: [dcdb] crit: can't initialize service
Oct 16 10:33:53 failed-node pve-ha-crm[3096]: lost lock 'ha_manager_lock - cfs lock update failed - Device or resource busy
Oct 16 10:33:53 failed-node pve-ha-crm[3096]: status change master => lost_manager_lock
Oct 16 10:33:53 failed-node pve-ha-crm[3096]: watchdog closed (disabled)
Oct 16 10:33:53 failed-node pve-ha-crm[3096]: status change lost_manager_lock => wait_for_quorum
Oct 16 10:33:56 failed-node corosync[3013]:   [TOTEM ] Token has not been received in 3226 ms
Oct 16 10:34:00 failed-node systemd[1]: Starting Proxmox VE replication runner...
Oct 16 10:34:00 failed-node pvesr[51807]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 16 10:34:01 failed-node corosync[3013]:   [TOTEM ] Token has not been received in 7526 ms
Oct 16 10:34:01 failed-node pvesr[51807]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 16 10:34:02 failed-node pvesr[51807]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 16 10:34:03 failed-node pvesr[51807]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 16 10:34:04 failed-node pvesr[51807]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 16 10:34:05 failed-node pvesr[51807]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 16 10:34:06 failed-node pvesr[51807]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 16 10:34:07 failed-node corosync[3013]:   [QUORUM] Sync members[1]: 2
Oct 16 10:34:07 failed-node corosync[3013]:   [TOTEM ] A new membership (2.2b0) was formed. Members
Oct 16 10:34:07 failed-node corosync[3013]:   [QUORUM] Members[1]: 2
Oct 16 10:34:07 failed-node corosync[3013]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 16 10:34:07 failed-node pmxcfs[2900]: [dcdb] notice: members: 2/2900
Oct 16 10:34:07 failed-node pmxcfs[2900]: [dcdb] notice: all data is up to date
Oct 16 10:34:07 failed-node pvesr[51807]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 16 10:34:08 failed-node pvesr[51807]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 16 10:34:09 failed-node pvesr[51807]: cfs-lock 'file-replication_cfg' error: no quorum!
Oct 16 10:34:09 failed-node systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Oct 16 10:34:09 failed-node systemd[1]: pvesr.service: Failed with result 'exit-code'.
Oct 16 10:34:09 failed-node systemd[1]: Failed to start Proxmox VE replication runner.
Oct 16 10:34:10 failed-node corosync[3013]:   [TOTEM ] Token has not been received in 3226 ms
Oct 16 10:34:14 failed-node corosync[3013]:   [TOTEM ] Token has not been received in 7527 ms
Oct 16 10:34:21 failed-node corosync[3013]:   [QUORUM] Sync members[1]: 2
Oct 16 10:34:21 failed-node corosync[3013]:   [TOTEM ] A new membership (2.2bc) was formed. Members
Oct 16 10:34:21 failed-node corosync[3013]:   [QUORUM] Members[1]: 2
Oct 16 10:34:21 failed-node corosync[3013]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 16 10:34:24 failed-node corosync[3013]:   [TOTEM ] Token has not been received in 3226 ms
Oct 16 10:34:28 failed-node corosync[3013]:   [TOTEM ] Token has not been received in 7527 ms
Oct 16 10:34:34 failed-node corosync[3013]:   [QUORUM] Sync members[1]: 2
Oct 16 10:34:34 failed-node corosync[3013]:   [TOTEM ] A new membership (2.2c8) was formed. Members
Oct 16 10:34:34 failed-node corosync[3013]:   [QUORUM] Members[1]: 2
Oct 16 10:34:34 failed-node corosync[3013]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 16 10:34:38 failed-node corosync[3013]:   [TOTEM ] Token has not been received in 3226 ms
Oct 16 10:34:42 failed-node corosync[3013]:   [TOTEM ] Token has not been received in 7527 ms
Oct 16 10:34:48 failed-node corosync[3013]:   [QUORUM] Sync members[1]: 2

root@failed-node:/etc/pve# systemctl status pve-cluster corosync

Code:
● pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
   Active: active (running) since Sat 2021-10-16 13:46:13 CDT; 1h 30min ago
  Process: 7423 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
 Main PID: 7433 (pmxcfs)
    Tasks: 7 (limit: 9830)
   Memory: 60.1M
   CGroup: /system.slice/pve-cluster.service
           └─7433 /usr/bin/pmxcfs


Oct 16 15:16:30 failed-node pmxcfs[7433]: [status] notice: cpg_send_message retried 100 times
Oct 16 15:16:30 failed-node pmxcfs[7433]: [status] crit: cpg_send_message failed: 6
Oct 16 15:16:31 failed-node pmxcfs[7433]: [status] notice: cpg_send_message retry 10
Oct 16 15:16:31 failed-node pmxcfs[7433]: [status] notice: cpg_send_message retried 14 times
Oct 16 15:16:31 failed-node pmxcfs[7433]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/102: -1
Oct 16 15:16:31 failed-node pmxcfs[7433]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/106: -1
Oct 16 15:16:31 failed-node pmxcfs[7433]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/failed-node/local-lvm: -1
Oct 16 15:16:31 failed-node pmxcfs[7433]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/failed-node/nvme-raid-10: -1
Oct 16 15:16:31 failed-node pmxcfs[7433]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/failed-node/local: -1
Oct 16 15:16:31 failed-node pmxcfs[7433]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/failed-node/local: /var/lib/r


● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: active (running) since Sat 2021-10-16 13:46:14 CDT; 1h 30min ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
 Main PID: 7582 (corosync)
    Tasks: 9 (limit: 9830)
   Memory: 197.2M
   CGroup: /system.slice/corosync.service
           └─7582 /usr/sbin/corosync -f


Oct 16 15:16:45 failed-node corosync[7582]:   [TOTEM ] A new membership (2.3d99) was formed. Members
Oct 16 15:16:45 failed-node corosync[7582]:   [QUORUM] Members[1]: 2
Oct 16 15:16:45 failed-node corosync[7582]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 16 15:16:48 failed-node corosync[7582]:   [TOTEM ] Token has not been received in 3226 ms
Oct 16 15:16:52 failed-node corosync[7582]:   [TOTEM ] Token has not been received in 7527 ms
Oct 16 15:16:58 failed-node corosync[7582]:   [QUORUM] Sync members[1]: 2
Oct 16 15:16:58 failed-node corosync[7582]:   [TOTEM ] A new membership (2.3da5) was formed. Members
Oct 16 15:16:58 failed-node corosync[7582]:   [QUORUM] Members[1]: 2
Oct 16 15:16:58 failed-node corosync[7582]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 16 15:17:02 failed-node corosync[7582]:   [TOTEM ] Token has not been received in 3226 ms

root@failed-node:/etc/pve# cat /etc/corosync/corosync.conf

Code:
logging {
  debug: off
  to_syslog: yes
}


nodelist {
  node {
    name: healthy-node-20
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.1.20
    ring1_addr: 10.10.2.20
  }
  node {
    name: healthy-node-21
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.10.1.21
    ring1_addr: 10.10.2.21
  }
  node {
    name: healthy-node-22
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.1.22
    ring1_addr: 10.10.2.22
  }
  node {
    name: failed-node
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.1.23
    ring1_addr: 10.10.2.23
  }
}


quorum {
  provider: corosync_votequorum
}


totem {
  cluster_name: Cloud-300
  config_version: 9
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

root@healthy-node-22:~# cat /etc/pve/corosync.conf

Code:
logging {
  debug: off
  to_syslog: yes
}


nodelist {
  node {
    name: healthy-node-20
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.1.20
    ring1_addr: 10.10.2.20
  }
  node {
    name: healthy-node-21
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.10.1.21
    ring1_addr: 10.10.2.21
  }
  node {
    name: healthy-node-22
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.1.22
    ring1_addr: 10.10.2.22
  }
  node {
    name: failed-node
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.1.23
    ring1_addr: 10.10.2.23
  }
}


quorum {
  provider: corosync_votequorum
}


totem {
  cluster_name: Cloud-300
  config_version: 9
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
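
Both copies of the config match, by the way. As far as I understand corosync.conf(5), with link_mode: passive knet should automatically move traffic to another configured link when the active one goes down, and since we never set knet_link_priority both links should be on equal footing. If we wanted to explicitly prefer the 10.10.2.x ring, I believe (untested sketch on our side; config_version bumped since PVE requires that when editing /etc/pve/corosync.conf) the totem section would look something like this:

Code:
totem {
  cluster_name: Cloud-300
  config_version: 10
  interface {
    linknumber: 0
    # assumption: lower priority number = fallback link
    knet_link_priority: 1
  }
  interface {
    linknumber: 1
    # assumption (per my reading of the man page): the higher
    # knet_link_priority wins in passive mode while it is healthy
    knet_link_priority: 2
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}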

root@failed-node:/etc/pve# pvecm status

Code:
Cluster information
-------------------
Name:             Cloud-300
Config Version:   9
Transport:        knet
Secure auth:      on


Quorum information
------------------
Date:             Sat Oct 16 15:23:51 2021
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000002
Ring ID:          2.3f15
Quorate:          No


Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      1
Quorum:           3 Activity blocked
Flags:          


Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 10.10.1.23 (local)

Any help/feedback/suggestions are greatly appreciated. And if this has been answered here before, feel free to flame me. :)
 
root@failed-node:/etc/pve# journalctl -b -u corosync


Code:
-- Logs begin at Sat 2021-10-16 13:46:09 CDT, end at Sat 2021-10-16 15:29:44 CDT. --
Oct 16 13:46:13 failed-node systemd[1]: Starting Corosync Cluster Engine...
Oct 16 13:46:13 failed-node corosync[7582]:   [MAIN  ] Corosync Cluster Engine 3.1.2 starting up
Oct 16 13:46:13 failed-node corosync[7582]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie rel
Oct 16 13:46:13 failed-node corosync[7582]:   [TOTEM ] Initializing transport (Kronosnet).
Oct 16 13:46:14 failed-node corosync[7582]:   [TOTEM ] totemknet initialized
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/cr
Oct 16 13:46:14 failed-node corosync[7582]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Oct 16 13:46:14 failed-node corosync[7582]:   [QB    ] server name: cmap
Oct 16 13:46:14 failed-node corosync[7582]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Oct 16 13:46:14 failed-node corosync[7582]:   [QB    ] server name: cfg
Oct 16 13:46:14 failed-node corosync[7582]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Oct 16 13:46:14 failed-node corosync[7582]:   [QB    ] server name: cpg
Oct 16 13:46:14 failed-node corosync[7582]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Oct 16 13:46:14 failed-node corosync[7582]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Oct 16 13:46:14 failed-node corosync[7582]:   [WD    ] Watchdog not enabled by configuration
Oct 16 13:46:14 failed-node corosync[7582]:   [WD    ] resource load_15min missing a recovery key.
Oct 16 13:46:14 failed-node corosync[7582]:   [WD    ] resource memory_used missing a recovery key.
Oct 16 13:46:14 failed-node corosync[7582]:   [WD    ] no resources configured.
Oct 16 13:46:14 failed-node corosync[7582]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Oct 16 13:46:14 failed-node corosync[7582]:   [QUORUM] Using quorum provider corosync_votequorum
Oct 16 13:46:14 failed-node corosync[7582]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Oct 16 13:46:14 failed-node corosync[7582]:   [QB    ] server name: votequorum
Oct 16 13:46:14 failed-node corosync[7582]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Oct 16 13:46:14 failed-node corosync[7582]:   [QB    ] server name: quorum
Oct 16 13:46:14 failed-node corosync[7582]:   [TOTEM ] Configuring link 0
Oct 16 13:46:14 failed-node corosync[7582]:   [TOTEM ] Configured link number 0: local addr: 10.10.1.23, port=5405
Oct 16 13:46:14 failed-node corosync[7582]:   [TOTEM ] Configuring link 1
Oct 16 13:46:14 failed-node corosync[7582]:   [TOTEM ] Configured link number 1: local addr: 10.10.2.23, port=5406
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 3 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 3 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 3 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 4 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [QUORUM] Sync members[1]: 2
Oct 16 13:46:14 failed-node corosync[7582]:   [QUORUM] Sync joined[1]: 2
Oct 16 13:46:14 failed-node corosync[7582]:   [TOTEM ] A new membership (2.2a75) was formed. Members joined: 2
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 4 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 4 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 1 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 1 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 1 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 0)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 2 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [QUORUM] Members[1]: 2
Oct 16 13:46:14 failed-node corosync[7582]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 3 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 3 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 3 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 4 has no active links
Oct 16 13:46:14 failed-node systemd[1]: Started Corosync Cluster Engine.
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 4 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 4 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 1 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 1 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 1 has no active links
Oct 16 13:46:16 failed-node corosync[7582]:   [KNET  ] rx: host: 1 link: 1 is up
Oct 16 13:46:16 failed-node corosync[7582]:   [KNET  ] host: host: 1 (passive) best link: 1 (pri: 1)
Oct 16 13:46:16 failed-node corosync[7582]:   [KNET  ] rx: host: 4 link: 1 is up
Oct 16 13:46:16 failed-node corosync[7582]:   [KNET  ] host: host: 4 (passive) best link: 1 (pri: 1)
Oct 16 13:46:16 failed-node corosync[7582]:   [KNET  ] pmtud: Global data MTU changed to: 469
Oct 16 13:46:20 failed-node corosync[7582]:   [TOTEM ] Token has not been received in 4072 ms
Oct 16 13:46:25 failed-node corosync[7582]:   [TOTEM ] Token has not been received in 8373 ms
Oct 16 13:46:31 failed-node corosync[7582]:   [QUORUM] Sync members[1]: 2
Oct 16 13:46:31 failed-node corosync[7582]:   [TOTEM ] A new membership (2.2a81) was formed. Members
Oct 16 13:46:31 failed-node corosync[7582]:   [QUORUM] Members[1]: 2
Oct 16 13:46:31 failed-node corosync[7582]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 16 13:46:34 failed-node corosync[7582]:   [TOTEM ] Token has not been received in 3226 ms
Oct 16 13:46:34 failed-node corosync[7582]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 1 from 469 to 1397
Oct 16 13:46:34 failed-node corosync[7582]:   [KNET  ] pmtud: PMTUD link change for host: 4 link: 1 from 469 to 1397
Oct 16 13:46:34 failed-node corosync[7582]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Oct 16 13:46:38 failed-node corosync[7582]:   [TOTEM ] Token has not been received in 7527 ms
Oct 16 13:46:45 failed-node corosync[7582]:   [QUORUM] Sync members[1]: 2
Oct 16 13:46:45 failed-node corosync[7582]:   [TOTEM ] A new membership (2.2a8d) was formed. Members
Oct 16 13:46:45 failed-node corosync[7582]:   [QUORUM] Members[1]: 2
Oct 16 13:46:45 failed-node corosync[7582]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 16 13:46:48 failed-node corosync[7582]:   [TOTEM ] Token has not been received in 3226 ms
Oct 16 13:46:53 failed-node corosync[7582]:   [TOTEM ] Token has not been received in 7527 ms
Oct 16 13:46:59 failed-node corosync[7582]:   [QUORUM] Sync members[1]: 2
Oct 16 13:46:59 failed-node corosync[7582]:   [TOTEM ] A new membership (2.2a99) was formed. Members
(output clipped)
 
Package versions (on all nodes):

Code:
proxmox-ve: 6.4-1 (running kernel: 5.4.128-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-5.4: 6.4-5
pve-kernel-helper: 6.4-5
pve-kernel-5.4.128-1-pve: 5.4.128-2
pve-kernel-5.4.124-1-pve: 5.4.124-2
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: not correctly installed
ifupdown2: 3.0.0-1+pve4~bpo10
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-network-perl: 0.6.0
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.2-4
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.5-pve1~bpo10+1
 
The logs indicate that your node 3 didn't have the second link set up properly.
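
A quick way to confirm would be to check on that node (healthy-node-20) whether the ring1 address is configured at all and whether it can reach the others over that subnet, for example (addresses assumed from your corosync.conf):

Code:
# on healthy-node-20: is the 10.10.2.20 ring1 address actually configured and up?
ip -br addr | grep 10.10.2.20
# can it reach the failed node over the second ring?
ping -c 3 10.10.2.23
# what knet itself reports about its links
corosync-cfgtool -s
# once the network side is fixed, restarting corosync on that node should bring link 1 up
systemctl restart corosync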