Primary Corosync Link Failure, Redundant Link Not Taking Over

Oct 16, 2021
Hello all,

Before I begin, I want to thank you all for your great posts/replies here. We've been using Proxmox for over a year and have never made a post, as we've always found the answers by searching (that may be the case here too, but I'm currently on vacation and my wife will kill me if I spend 8+ hours on a computer :)). We're a large web host based in the US, currently building out our own datacenter with Proxmox powering our virtualization. I can't say enough good things about the software and the community behind it.

That out of the way: about 4 hours ago Zabbix notified us of a failed interface/link on one of our switches, immediately followed by alerts that one of the Proxmox nodes in our 4-node cluster was unreachable. I initially did a graceful reboot of the failed node to see if that made any difference (it didn't). On further investigation it appears to be a failed optic or a bad cable, which we are in the process of getting replaced. In the meantime, we do have a redundant corosync link in place, but the failover link doesn't appear to be kicking in:
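
In case it's useful context, here's how I've been checking the knet link state from the failed node. corosync-cfgtool ships with corosync 3.x; -s shows the local node's links and -n shows the per-node knet link status:

Code:
# local link status as corosync sees it
corosync-cfgtool -s

# per-node view, including which knet links are connected
corosync-cfgtool -n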

Here's the syslog on the failed node at the time of the failure (see the line at Oct 16 10:33:21; the hostname is replaced with "failed-node", which is node 2 in the cluster):

root@failed-node:/etc/pve# grep "Oct 16 10:3" /var/log/syslog

Code:
Oct 16 10:31:06 failed-node pmxcfs[2900]: [status] notice: received log
Oct 16 10:32:00 failed-node systemd[1]: Starting Proxmox VE replication runner...
Oct 16 10:32:00 failed-node systemd[1]: pvesr.service: Succeeded.
Oct 16 10:32:00 failed-node systemd[1]: Started Proxmox VE replication runner.
Oct 16 10:33:00 failed-node systemd[1]: Starting Proxmox VE replication runner...
Oct 16 10:33:00 failed-node systemd[1]: pvesr.service: Succeeded.
Oct 16 10:33:00 failed-node systemd[1]: Started Proxmox VE replication runner.
Oct 16 10:33:21 failed-node kernel: [2493654.770594] i40e 0000:25:00.0 ens2f0: NIC Link is Down
Oct 16 10:33:23 failed-node corosync[3013]:   [KNET  ] link: host: 1 link: 0 is down
Oct 16 10:33:23 failed-node corosync[3013]:   [KNET  ] link: host: 4 link: 0 is down
Oct 16 10:33:23 failed-node corosync[3013]:   [KNET  ] link: host: 3 link: 0 is down
Oct 16 10:33:23 failed-node corosync[3013]:   [KNET  ] host: host: 1 (passive) best link: 1 (pri: 1)
Oct 16 10:33:23 failed-node corosync[3013]:   [KNET  ] host: host: 4 (passive) best link: 1 (pri: 1)
Oct 16 10:33:23 failed-node corosync[3013]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 16 10:33:23 failed-node corosync[3013]:   [KNET  ] host: host: 3 has no active links
Oct 16 10:33:29 failed-node corosync[3013]:   [TOTEM ] Token has not been received in 5514 ms
Oct 16 10:33:33 failed-node corosync[3013]:   [TOTEM ] Token has not been received in 9815 ms
Oct 16 10:33:39 failed-node corosync[3013]:   [QUORUM] Sync members[1]: 2
Oct 16 10:33:39 failed-node corosync[3013]:   [QUORUM] Sync left[3]: 1 3 4
Oct 16 10:33:39 failed-node corosync[3013]:   [TOTEM ] A new membership (2.298) was formed. Members left: 1 3 4
Oct 16 10:33:39 failed-node corosync[3013]:   [TOTEM ] Failed to receive the leave message. failed: 1 3 4
Oct 16 10:33:39 failed-node pmxcfs[2900]: [dcdb] notice: members: 2/2900
Oct 16 10:33:39 failed-node pmxcfs[2900]: [status] notice: members: 2/2900
Oct 16 10:33:39 failed-node corosync[3013]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Oct 16 10:33:39 failed-node corosync[3013]:   [QUORUM] Members[1]: 2
Oct 16 10:33:39 failed-node corosync[3013]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 16 10:33:39 failed-node pmxcfs[2900]: [status] notice: node lost quorum
Oct 16 10:33:39 failed-node pmxcfs[2900]: [dcdb] crit: received write while not quorate - trigger resync
Oct 16 10:33:39 failed-node pmxcfs[2900]: [dcdb] crit: leaving CPG group
Oct 16 10:33:39 failed-node pve-ha-lrm[3106]: unable to write lrm status file - closing file '/etc/pve/nodes/failed-node/lrm_status.tmp.3106' failed - Operation not permitted
Oct 16 10:33:42 failed-node corosync[3013]:   [TOTEM ] Token has not been received in 3226 ms
Oct 16 10:33:47 failed-node corosync[3013]:   [TOTEM ] Token has not been received in 7527 ms
Oct 16 10:33:53 failed-node corosync[3013]:   [QUORUM] Sync members[1]: 2
Oct 16 10:33:53 failed-node corosync[3013]:   [TOTEM ] A new membership (2.2a4) was formed. Members
Oct 16 10:33:53 failed-node corosync[3013]:   [QUORUM] Members[1]: 2
Oct 16 10:33:53 failed-node corosync[3013]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 16 10:33:53 failed-node pmxcfs[2900]: [dcdb] notice: start cluster connection
Oct 16 10:33:53 failed-node pmxcfs[2900]: [dcdb] crit: cpg_join failed: 14
Oct 16 10:33:53 failed-node pmxcfs[2900]: [dcdb] crit: can't initialize service
Oct 16 10:33:53 failed-node pve-ha-crm[3096]: lost lock 'ha_manager_lock - cfs lock update failed - Device or resource busy
Oct 16 10:33:53 failed-node pve-ha-crm[3096]: status change master => lost_manager_lock
Oct 16 10:33:53 failed-node pve-ha-crm[3096]: watchdog closed (disabled)
Oct 16 10:33:53 failed-node pve-ha-crm[3096]: status change lost_manager_lock => wait_for_quorum
Oct 16 10:33:56 failed-node corosync[3013]:   [TOTEM ] Token has not been received in 3226 ms
Oct 16 10:34:00 failed-node systemd[1]: Starting Proxmox VE replication runner...
Oct 16 10:34:00 failed-node pvesr[51807]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 16 10:34:01 failed-node corosync[3013]:   [TOTEM ] Token has not been received in 7526 ms
Oct 16 10:34:01 failed-node pvesr[51807]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 16 10:34:02 failed-node pvesr[51807]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 16 10:34:03 failed-node pvesr[51807]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 16 10:34:04 failed-node pvesr[51807]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 16 10:34:05 failed-node pvesr[51807]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 16 10:34:06 failed-node pvesr[51807]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 16 10:34:07 failed-node corosync[3013]:   [QUORUM] Sync members[1]: 2
Oct 16 10:34:07 failed-node corosync[3013]:   [TOTEM ] A new membership (2.2b0) was formed. Members
Oct 16 10:34:07 failed-node corosync[3013]:   [QUORUM] Members[1]: 2
Oct 16 10:34:07 failed-node corosync[3013]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 16 10:34:07 failed-node pmxcfs[2900]: [dcdb] notice: members: 2/2900
Oct 16 10:34:07 failed-node pmxcfs[2900]: [dcdb] notice: all data is up to date
Oct 16 10:34:07 failed-node pvesr[51807]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 16 10:34:08 failed-node pvesr[51807]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 16 10:34:09 failed-node pvesr[51807]: cfs-lock 'file-replication_cfg' error: no quorum!
Oct 16 10:34:09 failed-node systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Oct 16 10:34:09 failed-node systemd[1]: pvesr.service: Failed with result 'exit-code'.
Oct 16 10:34:09 failed-node systemd[1]: Failed to start Proxmox VE replication runner.
Oct 16 10:34:10 failed-node corosync[3013]:   [TOTEM ] Token has not been received in 3226 ms
Oct 16 10:34:14 failed-node corosync[3013]:   [TOTEM ] Token has not been received in 7527 ms
Oct 16 10:34:21 failed-node corosync[3013]:   [QUORUM] Sync members[1]: 2
Oct 16 10:34:21 failed-node corosync[3013]:   [TOTEM ] A new membership (2.2bc) was formed. Members
Oct 16 10:34:21 failed-node corosync[3013]:   [QUORUM] Members[1]: 2
Oct 16 10:34:21 failed-node corosync[3013]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 16 10:34:24 failed-node corosync[3013]:   [TOTEM ] Token has not been received in 3226 ms
Oct 16 10:34:28 failed-node corosync[3013]:   [TOTEM ] Token has not been received in 7527 ms
Oct 16 10:34:34 failed-node corosync[3013]:   [QUORUM] Sync members[1]: 2
Oct 16 10:34:34 failed-node corosync[3013]:   [TOTEM ] A new membership (2.2c8) was formed. Members
Oct 16 10:34:34 failed-node corosync[3013]:   [QUORUM] Members[1]: 2
Oct 16 10:34:34 failed-node corosync[3013]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 16 10:34:38 failed-node corosync[3013]:   [TOTEM ] Token has not been received in 3226 ms
Oct 16 10:34:42 failed-node corosync[3013]:   [TOTEM ] Token has not been received in 7527 ms
Oct 16 10:34:48 failed-node corosync[3013]:   [QUORUM] Sync members[1]: 2
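
What stands out to me in the KNET lines: hosts 1 and 4 fail over to link 1 as expected, but host 3 never does and ends up with "no active links". To pull just the knet link events out of the journal I've been running something like this (assuming default journald retention):

Code:
journalctl -u corosync --since '2021-10-16 10:30' | grep KNET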

root@failed-node:/etc/pve# systemctl status pve-cluster corosync

Code:
● pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
   Active: active (running) since Sat 2021-10-16 13:46:13 CDT; 1h 30min ago
  Process: 7423 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
 Main PID: 7433 (pmxcfs)
    Tasks: 7 (limit: 9830)
   Memory: 60.1M
   CGroup: /system.slice/pve-cluster.service
           └─7433 /usr/bin/pmxcfs


Oct 16 15:16:30 failed-node pmxcfs[7433]: [status] notice: cpg_send_message retried 100 times
Oct 16 15:16:30 failed-node pmxcfs[7433]: [status] crit: cpg_send_message failed: 6
Oct 16 15:16:31 failed-node pmxcfs[7433]: [status] notice: cpg_send_message retry 10
Oct 16 15:16:31 failed-node pmxcfs[7433]: [status] notice: cpg_send_message retried 14 times
Oct 16 15:16:31 failed-node pmxcfs[7433]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/102: -1
Oct 16 15:16:31 failed-node pmxcfs[7433]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/106: -1
Oct 16 15:16:31 failed-node pmxcfs[7433]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/failed-node/local-lvm: -1
Oct 16 15:16:31 failed-node pmxcfs[7433]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/failed-node/nvme-raid-10: -1
Oct 16 15:16:31 failed-node pmxcfs[7433]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/failed-node/local: -1
Oct 16 15:16:31 failed-node pmxcfs[7433]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/failed-node/local: /var/lib/r


● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: active (running) since Sat 2021-10-16 13:46:14 CDT; 1h 30min ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
 Main PID: 7582 (corosync)
    Tasks: 9 (limit: 9830)
   Memory: 197.2M
   CGroup: /system.slice/corosync.service
           └─7582 /usr/sbin/corosync -f


Oct 16 15:16:45 failed-node corosync[7582]:   [TOTEM ] A new membership (2.3d99) was formed. Members
Oct 16 15:16:45 failed-node corosync[7582]:   [QUORUM] Members[1]: 2
Oct 16 15:16:45 failed-node corosync[7582]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 16 15:16:48 failed-node corosync[7582]:   [TOTEM ] Token has not been received in 3226 ms
Oct 16 15:16:52 failed-node corosync[7582]:   [TOTEM ] Token has not been received in 7527 ms
Oct 16 15:16:58 failed-node corosync[7582]:   [QUORUM] Sync members[1]: 2
Oct 16 15:16:58 failed-node corosync[7582]:   [TOTEM ] A new membership (2.3da5) was formed. Members
Oct 16 15:16:58 failed-node corosync[7582]:   [QUORUM] Members[1]: 2
Oct 16 15:16:58 failed-node corosync[7582]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 16 15:17:02 failed-node corosync[7582]:   [TOTEM ] Token has not been received in 3226 ms

root@failed-node:/etc/pve# cat /etc/corosync/corosync.conf

Code:
logging {
  debug: off
  to_syslog: yes
}


nodelist {
  node {
    name: healthy-node-20
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.1.20
    ring1_addr: 10.10.2.20
  }
  node {
    name: healthy-node-21
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.10.1.21
    ring1_addr: 10.10.2.21
  }
  node {
    name: healthy-node-22
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.1.22
    ring1_addr: 10.10.2.22
  }
  node {
    name: failed-node
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.1.23
    ring1_addr: 10.10.2.23
  }
}


quorum {
  provider: corosync_votequorum
}


totem {
  cluster_name: Cloud-300
  config_version: 9
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

root@healthy-node-22:~# cat /etc/pve/corosync.conf

Code:
logging {
  debug: off
  to_syslog: yes
}


nodelist {
  node {
    name: healthy-node-20
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.1.20
    ring1_addr: 10.10.2.20
  }
  node {
    name: healthy-node-21
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.10.1.21
    ring1_addr: 10.10.2.21
  }
  node {
    name: healthy-node-22
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.1.22
    ring1_addr: 10.10.2.22
  }
  node {
    name: failed-node
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.1.23
    ring1_addr: 10.10.2.23
  }
}


quorum {
  provider: corosync_votequorum
}


totem {
  cluster_name: Cloud-300
  config_version: 9
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
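
For reference, since link_mode is passive, knet uses only one link at a time and fails over between links by priority; we haven't set explicit priorities, so everything runs at the default (the "pri: 1" in the logs). If we wanted to pin link 0 as preferred with link 1 as backup, my understanding from the corosync.conf(5) man page is it would look roughly like this (the priority values are illustrative; in passive mode the higher number should win, if I'm reading it right):

Code:
totem {
  ...
  interface {
    linknumber: 0
    knet_link_priority: 2
  }
  interface {
    linknumber: 1
    knet_link_priority: 1
  }
}

On Proxmox that edit belongs in /etc/pve/corosync.conf with config_version bumped, so it propagates to /etc/corosync/corosync.conf on every node.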

root@failed-node:/etc/pve# pvecm status

Code:
Cluster information
-------------------
Name:             Cloud-300
Config Version:   9
Transport:        knet
Secure auth:      on


Quorum information
------------------
Date:             Sat Oct 16 15:23:51 2021
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000002
Ring ID:          2.3f15
Quorate:          No


Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      1
Quorum:           3 Activity blocked
Flags:          


Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 10.10.1.23 (local)
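
Doing the votequorum math on that output: 4 nodes means expected votes 4 and quorum 3, so this node's single vote keeps it blocked until it can see at least two peers again. One thing I can still test from here is whether the backup (ring1) network is reachable at all, by pinging the peers' ring1 addresses from corosync.conf:

Code:
# ring1 peers (10.10.2.x) as listed in corosync.conf
for ip in 10.10.2.20 10.10.2.21 10.10.2.22; do
  ping -c 3 -W 1 "$ip"
done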

Any help/feedback/suggestions are greatly appreciated. And if this has been answered here before, feel free to flame me.
 
root@failed-node:/etc/pve# journalctl -b -u corosync


Code:
-- Logs begin at Sat 2021-10-16 13:46:09 CDT, end at Sat 2021-10-16 15:29:44 CDT. --
Oct 16 13:46:13 failed-node systemd[1]: Starting Corosync Cluster Engine...
Oct 16 13:46:13 failed-node corosync[7582]:   [MAIN  ] Corosync Cluster Engine 3.1.2 starting up
Oct 16 13:46:13 failed-node corosync[7582]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie rel
Oct 16 13:46:13 failed-node corosync[7582]:   [TOTEM ] Initializing transport (Kronosnet).
Oct 16 13:46:14 failed-node corosync[7582]:   [TOTEM ] totemknet initialized
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/cr
Oct 16 13:46:14 failed-node corosync[7582]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Oct 16 13:46:14 failed-node corosync[7582]:   [QB    ] server name: cmap
Oct 16 13:46:14 failed-node corosync[7582]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Oct 16 13:46:14 failed-node corosync[7582]:   [QB    ] server name: cfg
Oct 16 13:46:14 failed-node corosync[7582]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Oct 16 13:46:14 failed-node corosync[7582]:   [QB    ] server name: cpg
Oct 16 13:46:14 failed-node corosync[7582]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Oct 16 13:46:14 failed-node corosync[7582]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Oct 16 13:46:14 failed-node corosync[7582]:   [WD    ] Watchdog not enabled by configuration
Oct 16 13:46:14 failed-node corosync[7582]:   [WD    ] resource load_15min missing a recovery key.
Oct 16 13:46:14 failed-node corosync[7582]:   [WD    ] resource memory_used missing a recovery key.
Oct 16 13:46:14 failed-node corosync[7582]:   [WD    ] no resources configured.
Oct 16 13:46:14 failed-node corosync[7582]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Oct 16 13:46:14 failed-node corosync[7582]:   [QUORUM] Using quorum provider corosync_votequorum
Oct 16 13:46:14 failed-node corosync[7582]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Oct 16 13:46:14 failed-node corosync[7582]:   [QB    ] server name: votequorum
Oct 16 13:46:14 failed-node corosync[7582]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Oct 16 13:46:14 failed-node corosync[7582]:   [QB    ] server name: quorum
Oct 16 13:46:14 failed-node corosync[7582]:   [TOTEM ] Configuring link 0
Oct 16 13:46:14 failed-node corosync[7582]:   [TOTEM ] Configured link number 0: local addr: 10.10.1.23, port=5405
Oct 16 13:46:14 failed-node corosync[7582]:   [TOTEM ] Configuring link 1
Oct 16 13:46:14 failed-node corosync[7582]:   [TOTEM ] Configured link number 1: local addr: 10.10.2.23, port=5406
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 3 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 3 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 3 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 4 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [QUORUM] Sync members[1]: 2
Oct 16 13:46:14 failed-node corosync[7582]:   [QUORUM] Sync joined[1]: 2
Oct 16 13:46:14 failed-node corosync[7582]:   [TOTEM ] A new membership (2.2a75) was formed. Members joined: 2
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 4 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 4 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 1 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 1 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 1 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 0)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 2 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [QUORUM] Members[1]: 2
Oct 16 13:46:14 failed-node corosync[7582]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 3 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 3 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 3 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 4 has no active links
Oct 16 13:46:14 failed-node systemd[1]: Started Corosync Cluster Engine.
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 4 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 4 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 1 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 1 has no active links
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 16 13:46:14 failed-node corosync[7582]:   [KNET  ] host: host: 1 has no active links
Oct 16 13:46:16 failed-node corosync[7582]:   [KNET  ] rx: host: 1 link: 1 is up
Oct 16 13:46:16 failed-node corosync[7582]:   [KNET  ] host: host: 1 (passive) best link: 1 (pri: 1)
Oct 16 13:46:16 failed-node corosync[7582]:   [KNET  ] rx: host: 4 link: 1 is up
Oct 16 13:46:16 failed-node corosync[7582]:   [KNET  ] host: host: 4 (passive) best link: 1 (pri: 1)
Oct 16 13:46:16 failed-node corosync[7582]:   [KNET  ] pmtud: Global data MTU changed to: 469
Oct 16 13:46:20 failed-node corosync[7582]:   [TOTEM ] Token has not been received in 4072 ms
Oct 16 13:46:25 failed-node corosync[7582]:   [TOTEM ] Token has not been received in 8373 ms
Oct 16 13:46:31 failed-node corosync[7582]:   [QUORUM] Sync members[1]: 2
Oct 16 13:46:31 failed-node corosync[7582]:   [TOTEM ] A new membership (2.2a81) was formed. Members
Oct 16 13:46:31 failed-node corosync[7582]:   [QUORUM] Members[1]: 2
Oct 16 13:46:31 failed-node corosync[7582]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 16 13:46:34 failed-node corosync[7582]:   [TOTEM ] Token has not been received in 3226 ms
Oct 16 13:46:34 failed-node corosync[7582]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 1 from 469 to 1397
Oct 16 13:46:34 failed-node corosync[7582]:   [KNET  ] pmtud: PMTUD link change for host: 4 link: 1 from 469 to 1397
Oct 16 13:46:34 failed-node corosync[7582]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Oct 16 13:46:38 failed-node corosync[7582]:   [TOTEM ] Token has not been received in 7527 ms
Oct 16 13:46:45 failed-node corosync[7582]:   [QUORUM] Sync members[1]: 2
Oct 16 13:46:45 failed-node corosync[7582]:   [TOTEM ] A new membership (2.2a8d) was formed. Members
Oct 16 13:46:45 failed-node corosync[7582]:   [QUORUM] Members[1]: 2
Oct 16 13:46:45 failed-node corosync[7582]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 16 13:46:48 failed-node corosync[7582]:   [TOTEM ] Token has not been received in 3226 ms
Oct 16 13:46:53 failed-node corosync[7582]:   [TOTEM ] Token has not been received in 7527 ms
Oct 16 13:46:59 failed-node corosync[7582]:   [QUORUM] Sync members[1]: 2
Oct 16 13:46:59 failed-node corosync[7582]:   [TOTEM ] A new membership (2.2a99) was formed. Members
(output clipped)
 
Package versions (on all nodes):

Code:
proxmox-ve: 6.4-1 (running kernel: 5.4.128-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-5.4: 6.4-5
pve-kernel-helper: 6.4-5
pve-kernel-5.4.128-1-pve: 5.4.128-2
pve-kernel-5.4.124-1-pve: 5.4.124-2
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: not correctly installed
ifupdown2: 3.0.0-1+pve4~bpo10
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-network-perl: 0.6.0
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.2-4
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.5-pve1~bpo10+1
 
The logs indicate that your node 3 didn't have the second link set up properly.
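
A quick way to confirm that on node 3 itself; these are standard iproute2/iputils tools, and the addresses come from your corosync.conf:

Code:
# is the ring1 address actually configured on healthy-node-20?
ip -br addr | grep 10.10.2.20

# can it reach a peer on the ring1 network?
ping -c 3 10.10.2.23

# what does corosync on node 3 think of its own links?
corosync-cfgtool -s

If the 10.10.2.20 address or its interface is missing there, fixing the network config on node 3 and restarting corosync on that node should let link 1 come up cluster-wide.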
 
