Network disconnects almost daily

giuppy

Member since Dec 10, 2020
Hi,
I have an ongoing issue on one host.
The network gets disconnected almost daily, and on that node all VMs are shown with a question mark. After restarting the networking service I can see them again, but no guest is able to communicate over the internet, and all of them must be rebooted one by one.
I found the logs from the moment it happens:
Code:
Aug 25 07:56:38 pve2 kernel: bnx2x 0000:01:00.0 eno1: NIC Link is Down
Aug 25 07:56:38 pve2 kernel: vmbr2: port 1(eno1) entered disabled state
Aug 25 07:56:39 pve2 corosync[6950]:   [KNET  ] link: host: 3 link: 0 is down
Aug 25 07:56:39 pve2 corosync[6950]:   [KNET  ] link: host: 2 link: 0 is down
Aug 25 07:56:39 pve2 corosync[6950]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 25 07:56:39 pve2 corosync[6950]:   [KNET  ] host: host: 3 has no active links
Aug 25 07:56:39 pve2 corosync[6950]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 25 07:56:39 pve2 corosync[6950]:   [KNET  ] host: host: 2 has no active links
Aug 25 07:56:40 pve2 corosync[6950]:   [TOTEM ] Token has not been received in 2737 ms
Aug 25 07:56:40 pve2 corosync[6950]:   [TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus.
Aug 25 07:56:45 pve2 corosync[6950]:   [QUORUM] Sync members[1]: 1
Aug 25 07:56:45 pve2 corosync[6950]:   [QUORUM] Sync left[2]: 2 3
Aug 25 07:56:45 pve2 corosync[6950]:   [TOTEM ] A new membership (1.1c37) was formed. Members left: 2 3
Aug 25 07:56:45 pve2 corosync[6950]:   [TOTEM ] Failed to receive the leave message. failed: 2 3
Aug 25 07:56:45 pve2 pmxcfs[6868]: [dcdb] notice: members: 1/6868
Aug 25 07:56:45 pve2 corosync[6950]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Aug 25 07:56:45 pve2 corosync[6950]:   [QUORUM] Members[1]: 1
Aug 25 07:56:45 pve2 corosync[6950]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 25 07:56:45 pve2 pmxcfs[6868]: [status] notice: node lost quorum
Aug 25 07:56:45 pve2 pmxcfs[6868]: [status] notice: members: 1/6868
Aug 25 07:56:45 pve2 pmxcfs[6868]: [dcdb] crit: received write while not quorate - trigger resync
Aug 25 07:56:45 pve2 pmxcfs[6868]: [dcdb] crit: leaving CPG group
Aug 25 07:56:45 pve2 pve-ha-lrm[7055]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve2/lrm_status.tmp.7055' - Permission denied
Aug 25 07:56:45 pve2 pmxcfs[6868]: [dcdb] notice: start cluster connection
Aug 25 07:56:45 pve2 pmxcfs[6868]: [dcdb] crit: cpg_join failed: 14
Aug 25 07:56:45 pve2 pmxcfs[6868]: [dcdb] crit: can't initialize service
Aug 25 07:56:51 pve2 kernel: bnx2x 0000:01:00.0 eno1: NIC Link is Up, 1000 Mbps full duplex, Flow control: none
Aug 25 07:56:51 pve2 kernel: vmbr2: port 1(eno1) entered blocking state
Aug 25 07:56:51 pve2 kernel: vmbr2: port 1(eno1) entered forwarding state
Aug 25 07:56:51 pve2 pmxcfs[6868]: [dcdb] notice: members: 1/6868
Aug 25 07:56:51 pve2 pmxcfs[6868]: [dcdb] notice: all data is up to date
Aug 25 07:56:53 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (Connection timed out)
Aug 25 07:56:54 pve2 pvestatd[6974]: status update time (7.828 seconds)
Aug 25 07:57:03 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (Connection timed out)
Aug 25 07:57:03 pve2 pvestatd[6974]: status update time (7.816 seconds)
Aug 25 07:57:10 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:57:12 pve2 pvescheduler[1192434]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Aug 25 07:57:12 pve2 pvescheduler[1192433]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Aug 25 07:57:20 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:57:29 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:57:40 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:57:49 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:57:59 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:58:09 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:58:12 pve2 pvescheduler[1196462]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Aug 25 07:58:12 pve2 pvescheduler[1196461]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Aug 25 07:58:20 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:58:29 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:58:33 pve2 pveproxy[1146877]: worker exit
Aug 25 07:58:33 pve2 pveproxy[7042]: worker 1146877 finished
Aug 25 07:58:33 pve2 pveproxy[7042]: starting 1 worker(s)
Aug 25 07:58:33 pve2 pveproxy[7042]: worker 1198335 started
Aug 25 07:58:40 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:58:50 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:58:59 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:59:09 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:59:12 pve2 pvescheduler[1200283]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Aug 25 07:59:12 pve2 pvescheduler[1200282]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Aug 25 07:59:20 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:59:30 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:59:36 pve2 pveproxy[1156755]: worker exit
Aug 25 07:59:36 pve2 pveproxy[7042]: worker 1156755 finished
Aug 25 07:59:36 pve2 pveproxy[7042]: starting 1 worker(s)
Aug 25 07:59:36 pve2 pveproxy[7042]: worker 1202705 started
Aug 25 07:59:39 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:59:49 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:59:59 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 08:00:09 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 08:00:12 pve2 pvescheduler[1204846]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Aug 25 08:00:12 pve2 pvescheduler[1204845]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Aug 25 08:00:19 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 08:00:30 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 08:00:40 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 08:00:49 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 08:01:00 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 08:01:09 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 08:01:12 pve2 pvescheduler[1208736]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Aug 25 08:01:12 pve2 pvescheduler[1208735]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Aug 25 08:01:19 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 08:01:30 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 08:01:40 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 08:01:49 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 08:02:00 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 08:02:09 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 08:02:12 pve2 pvescheduler[1212467]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!

Do you have any idea what it could be and how to fix it?
Thanks, Giuseppe
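The quorum messages in the log (`Sync left[2]: 2 3`, `Members[1]: 1`, `This node is within the non-primary component`) follow from corosync's majority rule: a partition is quorate only if it holds more than half of the expected votes. Below is a minimal sketch of that arithmetic. It assumes three nodes with one vote each and no QDevice. The node count is inferred from the IDs 1, 2 and 3 in the log and is not stated in the post.

```python
def quorum_threshold(expected_votes: int) -> int:
    # Majority rule: strictly more than half of the expected votes.
    return expected_votes // 2 + 1


def is_quorate(votes_present: int, expected_votes: int) -> bool:
    return votes_present >= quorum_threshold(expected_votes)


# Assumption: three nodes (IDs 1, 2, 3), one vote each, no QDevice.
EXPECTED_VOTES = 3

# Before the link drop, pve2 (node ID 1) sees all three nodes.
print(is_quorate(3, EXPECTED_VOTES))  # True
# After eno1 goes down, pve2 only sees itself.
print(is_quorate(1, EXPECTED_VOTES))  # False
```

Once pve2 is non-quorate, pmxcfs makes /etc/pve read-only on that node. That is consistent with the `Permission denied` from pve-ha-lrm and the repeated `no quorum!` errors from pvescheduler.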
 
Aug 25 07:56:38 pve2 kernel: bnx2x 0000:01:00.0 eno1: NIC Link is Down
If the NIC itself reports the link as down, this looks like a hardware issue. I would check the cables and maybe try another switch port, in case that port is faulty.
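To see whether the link drops follow a pattern before and after swapping cables or switch ports, you can count the `NIC Link is Down` / `NIC Link is Up` kernel messages and measure each outage. This is a minimal sketch under a few assumptions. The kernel log is exported as plain text in the format shown above, for example with `journalctl -k`. The interface is `eno1`. All entries fall in the same year, because syslog-style timestamps omit it. The function name `link_outages` is hypothetical.

```python
import re
from datetime import datetime

# Matches lines like:
# "Aug 25 07:56:38 pve2 kernel: bnx2x 0000:01:00.0 eno1: NIC Link is Down"
LINK_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ kernel: .*?"
    r"(?P<iface>\S+): NIC Link is (?P<state>Up|Down)"
)


def link_outages(lines, iface="eno1", year=2000):
    """Return (down_time, duration) pairs for each Down -> Up transition.

    Syslog-style timestamps carry no year; `year` is an assumed value used
    only so timestamps can be subtracted (2000 is a leap year, so Feb 29 parses).
    """
    outages = []
    down_at = None
    for line in lines:
        m = LINK_RE.search(line)
        if not m or m.group("iface") != iface:
            continue
        ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
        if m.group("state") == "Down":
            down_at = ts
        elif down_at is not None:
            outages.append((down_at, ts - down_at))
            down_at = None
    return outages


sample = [
    "Aug 25 07:56:38 pve2 kernel: bnx2x 0000:01:00.0 eno1: NIC Link is Down",
    "Aug 25 07:56:51 pve2 kernel: bnx2x 0000:01:00.0 eno1: NIC Link is Up, "
    "1000 Mbps full duplex, Flow control: none",
]
for down_at, duration in link_outages(sample):
    print(down_at.strftime("%b %d %H:%M:%S"), duration)  # Aug 25 07:56:38 0:00:13
```

Passing an open file object instead of `sample` works the same way, since the function only iterates over lines. If the outages continue after a cable and port swap, the NIC itself becomes the more likely suspect.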
 
