Hello,
I am running a two-node cluster (7.2-4) with the PBS machine providing the third quorum vote. The first node runs without problems, but the second node "disconnects" about every 15 minutes for no obvious reason. It does not reboot, and after a short time it is available again without any intervention. I have no idea why this is happening, since both nodes have the same hardware configuration and each uses a three-NIC bond (LACP, 802.3ad) on the same gigabit switch. All I have to show is the log below, but I do not understand what is causing the issue or where to look next. Any ideas?
Code:
Jun 17 08:46:02 srv2 pvestatd[1917]: status update time (10.242 seconds)
Jun 17 08:50:33 srv2 corosync[1896]: [KNET ] link: host: 1 link: 0 is down
Jun 17 08:50:33 srv2 corosync[1896]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 17 08:50:33 srv2 corosync[1896]: [KNET ] host: host: 1 has no active links
Jun 17 08:50:34 srv2 corosync[1896]: [TOTEM ] Token has not been received in 2250 ms
Jun 17 08:50:35 srv2 corosync[1896]: [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensus.
Jun 17 08:50:38 srv2 kernel: r8169 0000:03:00.0 enp3s0: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
Jun 17 08:50:39 srv2 corosync[1896]: [QUORUM] Sync members[1]: 2
Jun 17 08:50:39 srv2 corosync[1896]: [QUORUM] Sync left[1]: 1
Jun 17 08:50:39 srv2 corosync[1896]: [TOTEM ] A new membership (2.3788) was formed. Members left: 1
Jun 17 08:50:39 srv2 corosync[1896]: [TOTEM ] Failed to receive the leave message. failed: 1
Jun 17 08:50:39 srv2 pmxcfs[1800]: [dcdb] notice: members: 2/1800
Jun 17 08:50:39 srv2 pmxcfs[1800]: [status] notice: members: 2/1800
Jun 17 08:50:39 srv2 corosync[1896]: [VOTEQ ] Unable to determine origin of the qdevice register call!
Jun 17 08:50:39 srv2 corosync[1896]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jun 17 08:50:39 srv2 corosync[1896]: [QUORUM] Members[1]: 2
Jun 17 08:50:39 srv2 corosync[1896]: [MAIN ] Completed service synchronization, ready to provide service.
Jun 17 08:50:39 srv2 pmxcfs[1800]: [status] notice: node lost quorum
Jun 17 08:50:39 srv2 pmxcfs[1800]: [dcdb] crit: received write while not quorate - trigger resync
Jun 17 08:50:39 srv2 pmxcfs[1800]: [dcdb] crit: leaving CPG group
Jun 17 08:50:39 srv2 pve-ha-lrm[623348]: unable to write lrm status file - unable to open file '/etc/pve/nodes/srv2/lrm_status.tmp.623348' - Permission denied
Jun 17 08:50:40 srv2 pmxcfs[1800]: [dcdb] notice: start cluster connection
Jun 17 08:50:40 srv2 pmxcfs[1800]: [dcdb] crit: cpg_join failed: 14
Jun 17 08:50:40 srv2 pmxcfs[1800]: [dcdb] crit: can't initialize service
Jun 17 08:50:46 srv2 pmxcfs[1800]: [dcdb] notice: members: 2/1800
Jun 17 08:50:46 srv2 pmxcfs[1800]: [dcdb] notice: all data is up to date
Jun 17 08:50:52 srv2 pvestatd[1917]: Backup: error fetching datastores - 500 Can't connect to backup.kmpr.local:8007 (Temporary failure in name resolution)
Jun 17 08:50:52 srv2 pvestatd[1917]: status update time (20.232 seconds)
Jun 17 08:51:04 srv2 pvestatd[1917]: Backup: error fetching datastores - 500 Can't connect to backup.kmpr.local:8007 (Connection timed out)
Jun 17 08:51:04 srv2 pvestatd[1917]: status update time (12.241 seconds)
Jun 17 08:51:10 srv2 pvescheduler[2900497]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Jun 17 08:51:10 srv2 pvescheduler[2900496]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Jun 17 08:51:24 srv2 pvestatd[1917]: Backup: error fetching datastores - 500 Can't connect to backup.kmpr.local:8007 (Temporary failure in name resolution)
Jun 17 08:51:24 srv2 pvestatd[1917]: status update time (20.265 seconds)
Jun 17 08:51:49 srv2 pvestatd[1917]: proxmox-backup-client failed: Error: error trying to connect: error connecting to https://backup.kmpr.local:8007/ - dns error: failed to lookup address information: Temporary failure in name resolution
Jun 17 08:51:50 srv2 pvestatd[1917]: status update time (25.259 seconds)
Jun 17 08:52:07 srv2 pvestatd[1917]: Backup: error fetching datastores - 500 Can't connect to backup.kmpr.local:8007 (Connection timed out)
Jun 17 08:52:07 srv2 pvestatd[1917]: status update time (17.254 seconds)
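To check whether the drops really recur about every 15 minutes, the corosync "link ... is down" lines could be pulled out of a longer log and the gaps between them measured. The following is only a minimal sketch: the function names and the `year` parameter are my own assumptions (syslog lines carry no year), and it assumes the same line format as in the excerpt above.

```python
import re
from datetime import datetime

# Matches corosync KNET link-down lines in the syslog format shown above,
# e.g. "Jun 17 08:50:33 srv2 corosync[1896]: [KNET ] link: host: 1 link: 0 is down"
LINK_DOWN = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ corosync\[\d+\]: "
    r"\[KNET\s*\] link: host: (?P<host>\d+) link: (?P<link>\d+) is down"
)


def link_down_times(lines, year=2022):
    """Return the timestamps of all KNET link-down events in `lines`.

    Syslog timestamps have no year, so `year` is an assumed value that is
    only needed to build complete datetime objects.
    """
    times = []
    for line in lines:
        match = LINK_DOWN.match(line)
        if match:
            # Collapse double spaces such as "Jun  7" before parsing.
            stamp = " ".join(match.group("ts").split())
            times.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    return times


def gaps_in_minutes(times):
    """Return the gap in minutes between each pair of consecutive events."""
    return [(later - earlier).total_seconds() / 60
            for earlier, later in zip(times, times[1:])]
```

With the syslog saved to a file, `gaps_in_minutes(link_down_times(open(path)))` would list the intervals. Roughly constant values would suggest something periodic, while irregular gaps would point elsewhere.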