[SOLVED] Node disconnects without any obvious reason

manuelkamp

New Member
Hello,

I am running a two-node cluster (7.2-4) with the PBS machine providing the third quorum vote (QDevice). The first node runs without problems, but the second node "disconnects" for no obvious reason, roughly every 15 minutes. It does not reboot; after a short time it is available again without any interaction. I have no idea why this happens, since both nodes are configured identically on the hardware side and each uses a three-NIC LACP (802.3ad) bond on the same gigabit switch. The only thing I can show is the log below, but I do not understand what is causing the issue or where to look next. Any ideas?
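
For context, a three-NIC LACP bond on Proxmox typically looks roughly like this in /etc/network/interfaces (interface names and addresses here are placeholders, not my exact configuration):

Code:
auto bond0
iface bond0 inet manual
        bond-slaves enp1s0 enp2s0 enp3s0
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3

auto vmbr0
iface vmbr0 inet static
        address 192.168.1.12/24
        gateway 192.168.1.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0

And here is the log from around one of the disconnects: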

Code:
Jun 17 08:46:02 srv2 pvestatd[1917]: status update time (10.242 seconds)
Jun 17 08:50:33 srv2 corosync[1896]:   [KNET  ] link: host: 1 link: 0 is down
Jun 17 08:50:33 srv2 corosync[1896]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 17 08:50:33 srv2 corosync[1896]:   [KNET  ] host: host: 1 has no active links
Jun 17 08:50:34 srv2 corosync[1896]:   [TOTEM ] Token has not been received in 2250 ms
Jun 17 08:50:35 srv2 corosync[1896]:   [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensus.
Jun 17 08:50:38 srv2 kernel: r8169 0000:03:00.0 enp3s0: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
Jun 17 08:50:39 srv2 corosync[1896]:   [QUORUM] Sync members[1]: 2
Jun 17 08:50:39 srv2 corosync[1896]:   [QUORUM] Sync left[1]: 1
Jun 17 08:50:39 srv2 corosync[1896]:   [TOTEM ] A new membership (2.3788) was formed. Members left: 1
Jun 17 08:50:39 srv2 corosync[1896]:   [TOTEM ] Failed to receive the leave message. failed: 1
Jun 17 08:50:39 srv2 pmxcfs[1800]: [dcdb] notice: members: 2/1800
Jun 17 08:50:39 srv2 pmxcfs[1800]: [status] notice: members: 2/1800
Jun 17 08:50:39 srv2 corosync[1896]:   [VOTEQ ] Unable to determine origin of the qdevice register call!
Jun 17 08:50:39 srv2 corosync[1896]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jun 17 08:50:39 srv2 corosync[1896]:   [QUORUM] Members[1]: 2
Jun 17 08:50:39 srv2 corosync[1896]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun 17 08:50:39 srv2 pmxcfs[1800]: [status] notice: node lost quorum
Jun 17 08:50:39 srv2 pmxcfs[1800]: [dcdb] crit: received write while not quorate - trigger resync
Jun 17 08:50:39 srv2 pmxcfs[1800]: [dcdb] crit: leaving CPG group
Jun 17 08:50:39 srv2 pve-ha-lrm[623348]: unable to write lrm status file - unable to open file '/etc/pve/nodes/srv2/lrm_status.tmp.623348' - Permission denied
Jun 17 08:50:40 srv2 pmxcfs[1800]: [dcdb] notice: start cluster connection
Jun 17 08:50:40 srv2 pmxcfs[1800]: [dcdb] crit: cpg_join failed: 14
Jun 17 08:50:40 srv2 pmxcfs[1800]: [dcdb] crit: can't initialize service
Jun 17 08:50:46 srv2 pmxcfs[1800]: [dcdb] notice: members: 2/1800
Jun 17 08:50:46 srv2 pmxcfs[1800]: [dcdb] notice: all data is up to date
Jun 17 08:50:52 srv2 pvestatd[1917]: Backup: error fetching datastores - 500 Can't connect to backup.kmpr.local:8007 (Temporary failure in name resolution)
Jun 17 08:50:52 srv2 pvestatd[1917]: status update time (20.232 seconds)
Jun 17 08:51:04 srv2 pvestatd[1917]: Backup: error fetching datastores - 500 Can't connect to backup.kmpr.local:8007 (Connection timed out)
Jun 17 08:51:04 srv2 pvestatd[1917]: status update time (12.241 seconds)
Jun 17 08:51:10 srv2 pvescheduler[2900497]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Jun 17 08:51:10 srv2 pvescheduler[2900496]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Jun 17 08:51:24 srv2 pvestatd[1917]: Backup: error fetching datastores - 500 Can't connect to backup.kmpr.local:8007 (Temporary failure in name resolution)
Jun 17 08:51:24 srv2 pvestatd[1917]: status update time (20.265 seconds)
Jun 17 08:51:49 srv2 pvestatd[1917]: proxmox-backup-client failed: Error: error trying to connect: error connecting to https://backup.kmpr.local:8007/ - dns error: failed to lookup address information: Temporary failure in name resolution
Jun 17 08:51:50 srv2 pvestatd[1917]: status update time (25.259 seconds)
Jun 17 08:52:07 srv2 pvestatd[1917]: Backup: error fetching datastores - 500 Can't connect to backup.kmpr.local:8007 (Connection timed out)
Jun 17 08:52:07 srv2 pvestatd[1917]: status update time (17.254 seconds)
 
Hello,
Is there anything else in the logs around that time (journalctl -S '2022-06-17 08:45:00' -U '2022-06-17 09:15:00'), either on the problematic node or on the other nodes?
 
Hi, I am providing the full logs, because I do not want to cut out something that may be relevant:

node2 (the one which goes off):
Code:
-- Journal begins at Sat 2022-04-30 13:03:52 CEST, ends at Fri 2022-06-17 13:09:12 CEST. --
Jun 17 08:45:10 srv2 pvescheduler[2896060]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Jun 17 08:45:10 srv2 pvescheduler[2896059]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Jun 17 08:45:11 srv2 pvestatd[1917]: Backup: error fetching datastores - 500 Can't connect to backup.kmpr.local:8007 (Temporary failure in name resolution)
Jun 17 08:45:11 srv2 pvestatd[1917]: status update time (20.270 seconds)
Jun 17 08:45:31 srv2 pvestatd[1917]: Backup: error fetching datastores - 500 Can't connect to backup.kmpr.local:8007 (Temporary failure in name resolution)
Jun 17 08:45:31 srv2 pvestatd[1917]: status update time (20.282 seconds)
Jun 17 08:45:51 srv2 pvestatd[1917]: Backup: error fetching datastores - 500 Can't connect to backup.kmpr.local:8007 (Temporary failure in name resolution)
Jun 17 08:45:51 srv2 pvestatd[1917]: status update time (20.260 seconds)
Jun 17 08:46:00 srv2 corosync[1896]:   [KNET  ] rx: host: 1 link: 0 is up
Jun 17 08:46:00 srv2 corosync[1896]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 17 08:46:00 srv2 corosync[1896]:   [QUORUM] Sync members[2]: 1 2
Jun 17 08:46:00 srv2 corosync[1896]:   [QUORUM] Sync joined[1]: 1
Jun 17 08:46:00 srv2 corosync[1896]:   [TOTEM ] A new membership (1.3784) was formed. Members joined: 1
Jun 17 08:46:00 srv2 pmxcfs[1800]: [dcdb] notice: members: 1/3074, 2/1800
Jun 17 08:46:00 srv2 pmxcfs[1800]: [dcdb] notice: starting data syncronisation
Jun 17 08:46:00 srv2 pmxcfs[1800]: [status] notice: members: 1/3074, 2/1800
Jun 17 08:46:00 srv2 pmxcfs[1800]: [status] notice: starting data syncronisation
Jun 17 08:46:00 srv2 corosync[1896]:   [VOTEQ ] Unable to determine origin of the qdevice register call!
Jun 17 08:46:00 srv2 corosync[1896]:   [QUORUM] This node is within the primary component and will provide service.
Jun 17 08:46:00 srv2 corosync[1896]:   [QUORUM] Members[2]: 1 2
Jun 17 08:46:00 srv2 corosync[1896]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun 17 08:46:00 srv2 pmxcfs[1800]: [status] notice: node has quorum
Jun 17 08:46:00 srv2 pmxcfs[1800]: [dcdb] notice: received sync request (epoch 1/3074/00000A24)
Jun 17 08:46:00 srv2 pmxcfs[1800]: [status] notice: received sync request (epoch 1/3074/00000A24)
Jun 17 08:46:00 srv2 pmxcfs[1800]: [dcdb] notice: received all states
Jun 17 08:46:00 srv2 pmxcfs[1800]: [dcdb] notice: leader is 1/3074
Jun 17 08:46:00 srv2 pmxcfs[1800]: [dcdb] notice: synced members: 1/3074
Jun 17 08:46:00 srv2 pmxcfs[1800]: [dcdb] notice: waiting for updates from leader
Jun 17 08:46:00 srv2 pmxcfs[1800]: [status] notice: received all states
Jun 17 08:46:00 srv2 pmxcfs[1800]: [status] notice: all data is up to date
Jun 17 08:46:00 srv2 pmxcfs[1800]: [dcdb] notice: update complete - trying to commit (got 2 inode updates)
Jun 17 08:46:00 srv2 pmxcfs[1800]: [dcdb] notice: all data is up to date
Jun 17 08:46:01 srv2 pvestatd[1917]: Backup: error fetching datastores - 500 Can't connect to backup.kmpr.local:8007 (Temporary failure in name resolution)
Jun 17 08:46:02 srv2 pvestatd[1917]: status update time (10.242 seconds)
Jun 17 08:50:33 srv2 corosync[1896]:   [KNET  ] link: host: 1 link: 0 is down
Jun 17 08:50:33 srv2 corosync[1896]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 17 08:50:33 srv2 corosync[1896]:   [KNET  ] host: host: 1 has no active links
Jun 17 08:50:34 srv2 corosync[1896]:   [TOTEM ] Token has not been received in 2250 ms
Jun 17 08:50:35 srv2 corosync[1896]:   [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensus.
Jun 17 08:50:38 srv2 kernel: r8169 0000:03:00.0 enp3s0: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
Jun 17 08:50:39 srv2 corosync[1896]:   [QUORUM] Sync members[1]: 2
Jun 17 08:50:39 srv2 corosync[1896]:   [QUORUM] Sync left[1]: 1
Jun 17 08:50:39 srv2 corosync[1896]:   [TOTEM ] A new membership (2.3788) was formed. Members left: 1
Jun 17 08:50:39 srv2 corosync[1896]:   [TOTEM ] Failed to receive the leave message. failed: 1
Jun 17 08:50:39 srv2 pmxcfs[1800]: [dcdb] notice: members: 2/1800
Jun 17 08:50:39 srv2 pmxcfs[1800]: [status] notice: members: 2/1800
Jun 17 08:50:39 srv2 corosync[1896]:   [VOTEQ ] Unable to determine origin of the qdevice register call!
Jun 17 08:50:39 srv2 corosync[1896]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jun 17 08:50:39 srv2 corosync[1896]:   [QUORUM] Members[1]: 2
Jun 17 08:50:39 srv2 corosync[1896]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun 17 08:50:39 srv2 pmxcfs[1800]: [status] notice: node lost quorum
Jun 17 08:50:39 srv2 pmxcfs[1800]: [dcdb] crit: received write while not quorate - trigger resync
Jun 17 08:50:39 srv2 pmxcfs[1800]: [dcdb] crit: leaving CPG group
Jun 17 08:50:39 srv2 pve-ha-lrm[623348]: unable to write lrm status file - unable to open file '/etc/pve/nodes/srv2/lrm_status.tmp.623348' - Permission denied
Jun 17 08:50:40 srv2 pmxcfs[1800]: [dcdb] notice: start cluster connection
Jun 17 08:50:40 srv2 pmxcfs[1800]: [dcdb] crit: cpg_join failed: 14
Jun 17 08:50:40 srv2 pmxcfs[1800]: [dcdb] crit: can't initialize service
Jun 17 08:50:46 srv2 pmxcfs[1800]: [dcdb] notice: members: 2/1800
Jun 17 08:50:46 srv2 pmxcfs[1800]: [dcdb] notice: all data is up to date
Jun 17 08:50:52 srv2 pvestatd[1917]: Backup: error fetching datastores - 500 Can't connect to backup.kmpr.local:8007 (Temporary failure in name resolution)
Jun 17 08:50:52 srv2 pvestatd[1917]: status update time (20.232 seconds)
Jun 17 08:51:04 srv2 pvestatd[1917]: Backup: error fetching datastores - 500 Can't connect to backup.kmpr.local:8007 (Connection timed out)
Jun 17 08:51:04 srv2 pvestatd[1917]: status update time (12.241 seconds)
Jun 17 08:51:10 srv2 pvescheduler[2900497]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Jun 17 08:51:10 srv2 pvescheduler[2900496]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Jun 17 08:51:24 srv2 pvestatd[1917]: Backup: error fetching datastores - 500 Can't connect to backup.kmpr.local:8007 (Temporary failure in name resolution)
Jun 17 08:51:24 srv2 pvestatd[1917]: status update time (20.265 seconds)
Jun 17 08:51:49 srv2 pvestatd[1917]: proxmox-backup-client failed: Error: error trying to connect: error connecting to https://backup.kmpr.local:8007/ - dns error: failed to lookup address information: Temporary failure in name resolution
Jun 17 08:51:50 srv2 pvestatd[1917]: status update time (25.259 seconds)
Jun 17 08:52:07 srv2 pvestatd[1917]: Backup: error fetching datastores - 500 Can't connect to backup.kmpr.local:8007 (Connection timed out)
Jun 17 08:52:07 srv2 pvestatd[1917]: status update time (17.254 seconds)
Jun 17 08:52:09 srv2 corosync[1896]:   [KNET  ] rx: host: 1 link: 0 is up
Jun 17 08:52:09 srv2 corosync[1896]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 17 08:52:09 srv2 corosync[1896]:   [QUORUM] Sync members[2]: 1 2
Jun 17 08:52:09 srv2 corosync[1896]:   [QUORUM] Sync joined[1]: 1
Jun 17 08:52:09 srv2 corosync[1896]:   [TOTEM ] A new membership (1.378c) was formed. Members joined: 1
Jun 17 08:52:09 srv2 pmxcfs[1800]: [dcdb] notice: members: 1/3074, 2/1800
Jun 17 08:52:09 srv2 pmxcfs[1800]: [dcdb] notice: starting data syncronisation
Jun 17 08:52:09 srv2 pmxcfs[1800]: [status] notice: members: 1/3074, 2/1800
Jun 17 08:52:09 srv2 pmxcfs[1800]: [status] notice: starting data syncronisation
Jun 17 08:52:09 srv2 corosync[1896]:   [VOTEQ ] Unable to determine origin of the qdevice register call!
Jun 17 08:52:09 srv2 corosync[1896]:   [QUORUM] This node is within the primary component and will provide service.
Jun 17 08:52:09 srv2 corosync[1896]:   [QUORUM] Members[2]: 1 2
Jun 17 08:52:09 srv2 corosync[1896]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun 17 08:52:09 srv2 pmxcfs[1800]: [status] notice: node has quorum
Jun 17 08:52:09 srv2 pmxcfs[1800]: [dcdb] notice: received sync request (epoch 1/3074/00000A26)
Jun 17 08:52:09 srv2 pmxcfs[1800]: [status] notice: received sync request (epoch 1/3074/00000A26)
Jun 17 08:52:09 srv2 pmxcfs[1800]: [dcdb] notice: received all states
Jun 17 08:52:09 srv2 pmxcfs[1800]: [dcdb] notice: leader is 1/3074
Jun 17 08:52:09 srv2 pmxcfs[1800]: [dcdb] notice: synced members: 1/3074
Jun 17 08:52:09 srv2 pmxcfs[1800]: [dcdb] notice: waiting for updates from leader
Jun 17 08:52:10 srv2 pmxcfs[1800]: [status] notice: received all states
Jun 17 08:52:10 srv2 pmxcfs[1800]: [status] notice: all data is up to date
Jun 17 08:52:10 srv2 pmxcfs[1800]: [dcdb] notice: update complete - trying to commit (got 2 inode updates)
Jun 17 08:52:10 srv2 pmxcfs[1800]: [dcdb] notice: all data is up to date
Jun 17 08:52:12 srv2 pvestatd[1917]: status update time (5.370 seconds)
Jun 17 09:00:44 srv2 smartd[1429]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 62 to 64
Jun 17 09:00:44 srv2 smartd[1429]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 194 Temperature_Celsius changed from 98 to 100
Jun 17 09:09:18 srv2 audit[2915166]: AVC apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-116_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=2915166 comm="(ionclean)" srcname="/" flags="rw, rbind"
Jun 17 09:09:18 srv2 kernel: audit: type=1400 audit(1655449758.469:92): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-116_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=2915166 comm="(ionclean)" srcname="/" flags="rw, rbind"

node1 (running without issues):
Code:
-- Journal begins at Thu 2022-02-03 19:22:46 CET, ends at Fri 2022-06-17 13:11:25 CEST. --
Jun 17 08:45:59 srv corosync[3182]:   [KNET  ] rx: host: 2 link: 0 is up
Jun 17 08:45:59 srv corosync[3182]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 17 08:46:00 srv corosync[3182]:   [QUORUM] Sync members[2]: 1 2
Jun 17 08:46:00 srv corosync[3182]:   [QUORUM] Sync joined[1]: 2
Jun 17 08:46:00 srv corosync[3182]:   [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Jun 17 08:46:00 srv corosync[3182]:   [TOTEM ] A new membership (1.3784) was formed. Members joined: 2
Jun 17 08:46:00 srv pmxcfs[3074]: [dcdb] notice: members: 1/3074, 2/1800
Jun 17 08:46:00 srv pmxcfs[3074]: [dcdb] notice: starting data syncronisation
Jun 17 08:46:00 srv corosync[3182]:   [QUORUM] Members[2]: 1 2
Jun 17 08:46:00 srv corosync[3182]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun 17 08:46:00 srv pmxcfs[3074]: [dcdb] notice: cpg_send_message retried 1 times
Jun 17 08:46:00 srv pmxcfs[3074]: [status] notice: members: 1/3074, 2/1800
Jun 17 08:46:00 srv pmxcfs[3074]: [status] notice: starting data syncronisation
Jun 17 08:46:00 srv pmxcfs[3074]: [dcdb] notice: received sync request (epoch 1/3074/00000A24)
Jun 17 08:46:00 srv pmxcfs[3074]: [status] notice: received sync request (epoch 1/3074/00000A24)
Jun 17 08:46:00 srv pmxcfs[3074]: [dcdb] notice: received all states
Jun 17 08:46:00 srv pmxcfs[3074]: [dcdb] notice: leader is 1/3074
Jun 17 08:46:00 srv pmxcfs[3074]: [dcdb] notice: synced members: 1/3074
Jun 17 08:46:00 srv pmxcfs[3074]: [dcdb] notice: start sending inode updates
Jun 17 08:46:00 srv pmxcfs[3074]: [dcdb] notice: sent all (2) updates
Jun 17 08:46:00 srv pmxcfs[3074]: [dcdb] notice: all data is up to date
Jun 17 08:46:00 srv pmxcfs[3074]: [status] notice: received all states
Jun 17 08:46:00 srv pmxcfs[3074]: [status] notice: all data is up to date
Jun 17 08:50:33 srv corosync[3182]:   [KNET  ] link: host: 2 link: 0 is down
Jun 17 08:50:33 srv corosync[3182]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 17 08:50:33 srv corosync[3182]:   [KNET  ] host: host: 2 has no active links
Jun 17 08:50:34 srv corosync[3182]:   [TOTEM ] Token has not been received in 2250 ms
Jun 17 08:50:35 srv corosync[3182]:   [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensus.
Jun 17 08:50:38 srv corosync[3182]:   [QUORUM] Sync members[1]: 1
Jun 17 08:50:38 srv corosync[3182]:   [QUORUM] Sync left[1]: 2
Jun 17 08:50:38 srv corosync[3182]:   [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Jun 17 08:50:38 srv corosync[3182]:   [TOTEM ] A new membership (1.3788) was formed. Members left: 2
Jun 17 08:50:38 srv corosync[3182]:   [TOTEM ] Failed to receive the leave message. failed: 2
Jun 17 08:50:38 srv pmxcfs[3074]: [dcdb] notice: members: 1/3074
Jun 17 08:50:38 srv pmxcfs[3074]: [status] notice: members: 1/3074
Jun 17 08:50:39 srv corosync[3182]:   [QUORUM] Members[1]: 1
Jun 17 08:50:39 srv corosync[3182]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun 17 08:52:08 srv corosync[3182]:   [KNET  ] rx: host: 2 link: 0 is up
Jun 17 08:52:08 srv corosync[3182]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 17 08:52:09 srv corosync[3182]:   [QUORUM] Sync members[2]: 1 2
Jun 17 08:52:09 srv corosync[3182]:   [QUORUM] Sync joined[1]: 2
Jun 17 08:52:09 srv corosync[3182]:   [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Jun 17 08:52:09 srv corosync[3182]:   [TOTEM ] A new membership (1.378c) was formed. Members joined: 2
Jun 17 08:52:09 srv pmxcfs[3074]: [dcdb] notice: members: 1/3074, 2/1800
Jun 17 08:52:09 srv pmxcfs[3074]: [dcdb] notice: starting data syncronisation
Jun 17 08:52:09 srv corosync[3182]:   [QUORUM] Members[2]: 1 2
Jun 17 08:52:09 srv corosync[3182]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun 17 08:52:09 srv pmxcfs[3074]: [dcdb] notice: cpg_send_message retried 1 times
Jun 17 08:52:09 srv pmxcfs[3074]: [status] notice: members: 1/3074, 2/1800
Jun 17 08:52:09 srv pmxcfs[3074]: [status] notice: starting data syncronisation
Jun 17 08:52:09 srv pmxcfs[3074]: [dcdb] notice: received sync request (epoch 1/3074/00000A26)
Jun 17 08:52:09 srv pmxcfs[3074]: [status] notice: received sync request (epoch 1/3074/00000A26)
Jun 17 08:52:09 srv pmxcfs[3074]: [dcdb] notice: received all states
Jun 17 08:52:09 srv pmxcfs[3074]: [dcdb] notice: leader is 1/3074
Jun 17 08:52:09 srv pmxcfs[3074]: [dcdb] notice: synced members: 1/3074
Jun 17 08:52:09 srv pmxcfs[3074]: [dcdb] notice: start sending inode updates
Jun 17 08:52:09 srv pmxcfs[3074]: [dcdb] notice: sent all (2) updates
Jun 17 08:52:09 srv pmxcfs[3074]: [dcdb] notice: all data is up to date
Jun 17 08:52:10 srv pmxcfs[3074]: [status] notice: received all states
Jun 17 08:52:10 srv pmxcfs[3074]: [status] notice: all data is up to date
Jun 17 09:09:01 srv audit[1397248]: AVC apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-103_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=1397248 comm="(ionclean)" srcname="/" flags="rw, rbind"
Jun 17 09:09:01 srv kernel: audit: type=1400 audit(1655449741.450:4183): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-103_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=1397248 comm="(ionclean)" srcname="/" flags="rw, rbind"
Jun 17 09:09:36 srv audit[1399691]: AVC apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-112_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=1399691 comm="(ionclean)" srcname="/" flags="rw, rbind"
Jun 17 09:09:36 srv kernel: audit: type=1400 audit(1655449776.210:4184): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-112_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=1399691 comm="(ionclean)" srcname="/" flags="rw, rbind"

backup PBS (quorum):
Code:
-- Journal begins at Fri 2022-02-11 17:50:06 CET, ends at Fri 2022-06-17 13:14:35 CEST. --
Jun 17 09:12:15 backup proxmox-backup-proxy[4563]: write rrd data back to disk
Jun 17 09:12:15 backup proxmox-backup-proxy[4563]: starting rrd data sync
Jun 17 09:12:15 backup proxmox-backup-proxy[4563]: rrd journal successfully committed (23 files in 0.008 seconds)

Thank you!
 
Sorry for the double post; the previous one hit the maximum character limit:

The line that catches my attention is this one:
Code:
kernel: r8169 0000:03:00.0 enp3s0: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
I do not know what it means yet, but I will investigate further. If you already have an idea (or can rule it out as the cause), please let me know.
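
In the meantime, these are the diagnostics I am running to see which driver/firmware the interface actually uses and whether the resets repeat (interface name taken from the log above, nothing more than a quick check):
Code:
ethtool -i enp3s0                    # driver in use (r8169) and firmware version
dmesg | grep -iE 'r8169|rtl_rxtx'    # repeated NIC reset messages?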

edit:
As far as I found out, it is a firmware issue with the NIC. I wonder why it does not occur on the other node or on the PBS machine, since they have the same NIC inside...
Anyway, if you agree that this might be the cause, I could swap the NICs, since I have spares at hand. Are there any quick points to watch out for when doing that with Proxmox regarding bonding etc.? Or, if there is an easy fix available (a setting, new firmware, ...), I would try that first.

edit2: I installed the new driver and will monitor it over the next hours and report back. For anyone else hitting this issue, these were my steps so far:

I downloaded the new driver (supports kernels up to 5.17):
https://www.realtek.com/en/componen...0-1000m-gigabit-ethernet-pci-express-software

installed build-essential and the matching kernel headers (on Proxmox VE the headers package is pve-headers, not linux-headers):
Code:
apt install build-essential pve-headers-$(uname -r)

blacklisted the r8169 driver:
Code:
sh -c 'echo blacklist r8169 >> /etc/modprobe.d/blacklist.conf'
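
Since the module can also be loaded from the initramfs, I regenerated it as well so the blacklist takes effect at early boot (I believe this step is needed, but treat it as my assumption):
Code:
update-initramfs -u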

unpacked the downloaded driver, changed into the extracted directory, and ran autorun.sh:
Code:
tar xfvj r8168-8.050.03.tar.bz2
cd r8168-8.050.03
./autorun.sh

checked that the new module is loaded (r8168 should be listed, r8169 should not):
Code:
lsmod | grep r8

then rebooted.
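
After the reboot I also sanity-checked that the bond picked its interfaces back up and that the new driver is in use (bond/interface names as on my node):
Code:
cat /proc/net/bonding/bond0    # every slave should report "MII Status: up"
ethtool -i enp3s0              # driver should now be r8168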
 
Exactly. I am just wondering why only one of the three machines with this NIC is affected... Anyway, I hope the new driver resolves it; I will come back later today to post an update.

edit: should I put the new driver on the other two working machines too, or keep the old one there (never change a running system)?
 
If you run into problems on the others, you can still change the drivers later ;).
 
Sure, that sounds better :D These steps have to be repeated after every kernel update, so this patching might become routine work ;)
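
(Untested idea: Debian's non-free repository also ships an r8168-dkms package, which should rebuild the module automatically after every kernel update. That assumes non-free is enabled and the packaged version supports this chip, so take it as a suggestion only:)
Code:
apt install dkms r8168-dkms
update-initramfs -u
reboot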
 
OK, I found out why the others might work. The controller on the problematic node is:

Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0c)

and on the others (working machines) it is:

Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
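
(For anyone comparing their own hardware: the revision shows up in the plain lspci output; assuming the device address 03:00.0 from the kernel message above:)
Code:
lspci -s 03:00.0       # human-readable description including "(rev ..)"
lspci -s 03:00.0 -n    # numeric vendor:device IDs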
 
Good afternoon,

As promised, I monitored this node for 3 hours (within that timespan the issue usually occurred at least once). So far the logs look fine, no issues. I am now going to enable HA again, as I consider this node stable again. Solved!
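
(Before re-enabling HA I checked the cluster state from this node with the usual commands:)
Code:
pvecm status           # membership and expected votes, including the QDevice
corosync-cfgtool -s    # knet link status towards the other node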
 