Hi,
I have an ongoing issue on one host.
The network gets disconnected almost daily, and on the node all VMs get marked with a question mark. After restarting the networking service I can see them again, but none of the guests is able to communicate over the internet, and they all have to be rebooted one by one.
I found some logs from the moment it happens.
Code:
Aug 25 07:56:38 pve2 kernel: bnx2x 0000:01:00.0 eno1: NIC Link is Down
Aug 25 07:56:38 pve2 kernel: vmbr2: port 1(eno1) entered disabled state
Aug 25 07:56:39 pve2 corosync[6950]: [KNET ] link: host: 3 link: 0 is down
Aug 25 07:56:39 pve2 corosync[6950]: [KNET ] link: host: 2 link: 0 is down
Aug 25 07:56:39 pve2 corosync[6950]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 25 07:56:39 pve2 corosync[6950]: [KNET ] host: host: 3 has no active links
Aug 25 07:56:39 pve2 corosync[6950]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 25 07:56:39 pve2 corosync[6950]: [KNET ] host: host: 2 has no active links
Aug 25 07:56:40 pve2 corosync[6950]: [TOTEM ] Token has not been received in 2737 ms
Aug 25 07:56:40 pve2 corosync[6950]: [TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus.
Aug 25 07:56:45 pve2 corosync[6950]: [QUORUM] Sync members[1]: 1
Aug 25 07:56:45 pve2 corosync[6950]: [QUORUM] Sync left[2]: 2 3
Aug 25 07:56:45 pve2 corosync[6950]: [TOTEM ] A new membership (1.1c37) was formed. Members left: 2 3
Aug 25 07:56:45 pve2 corosync[6950]: [TOTEM ] Failed to receive the leave message. failed: 2 3
Aug 25 07:56:45 pve2 pmxcfs[6868]: [dcdb] notice: members: 1/6868
Aug 25 07:56:45 pve2 corosync[6950]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Aug 25 07:56:45 pve2 corosync[6950]: [QUORUM] Members[1]: 1
Aug 25 07:56:45 pve2 corosync[6950]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 25 07:56:45 pve2 pmxcfs[6868]: [status] notice: node lost quorum
Aug 25 07:56:45 pve2 pmxcfs[6868]: [status] notice: members: 1/6868
Aug 25 07:56:45 pve2 pmxcfs[6868]: [dcdb] crit: received write while not quorate - trigger resync
Aug 25 07:56:45 pve2 pmxcfs[6868]: [dcdb] crit: leaving CPG group
Aug 25 07:56:45 pve2 pve-ha-lrm[7055]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve2/lrm_status.tmp.7055' - Permission denied
Aug 25 07:56:45 pve2 pmxcfs[6868]: [dcdb] notice: start cluster connection
Aug 25 07:56:45 pve2 pmxcfs[6868]: [dcdb] crit: cpg_join failed: 14
Aug 25 07:56:45 pve2 pmxcfs[6868]: [dcdb] crit: can't initialize service
Aug 25 07:56:51 pve2 kernel: bnx2x 0000:01:00.0 eno1: NIC Link is Up, 1000 Mbps full duplex, Flow control: none
Aug 25 07:56:51 pve2 kernel: vmbr2: port 1(eno1) entered blocking state
Aug 25 07:56:51 pve2 kernel: vmbr2: port 1(eno1) entered forwarding state
Aug 25 07:56:51 pve2 pmxcfs[6868]: [dcdb] notice: members: 1/6868
Aug 25 07:56:51 pve2 pmxcfs[6868]: [dcdb] notice: all data is up to date
Aug 25 07:56:53 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (Connection timed out)
Aug 25 07:56:54 pve2 pvestatd[6974]: status update time (7.828 seconds)
Aug 25 07:57:03 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (Connection timed out)
Aug 25 07:57:03 pve2 pvestatd[6974]: status update time (7.816 seconds)
Aug 25 07:57:10 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:57:12 pve2 pvescheduler[1192434]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Aug 25 07:57:12 pve2 pvescheduler[1192433]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Aug 25 07:57:20 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:57:29 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:57:40 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:57:49 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:57:59 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:58:09 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:58:12 pve2 pvescheduler[1196462]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Aug 25 07:58:12 pve2 pvescheduler[1196461]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Aug 25 07:58:20 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:58:29 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:58:33 pve2 pveproxy[1146877]: worker exit
Aug 25 07:58:33 pve2 pveproxy[7042]: worker 1146877 finished
Aug 25 07:58:33 pve2 pveproxy[7042]: starting 1 worker(s)
Aug 25 07:58:33 pve2 pveproxy[7042]: worker 1198335 started
Aug 25 07:58:40 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:58:50 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:58:59 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:59:09 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:59:12 pve2 pvescheduler[1200283]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Aug 25 07:59:12 pve2 pvescheduler[1200282]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Aug 25 07:59:20 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:59:30 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:59:36 pve2 pveproxy[1156755]: worker exit
Aug 25 07:59:36 pve2 pveproxy[7042]: worker 1156755 finished
Aug 25 07:59:36 pve2 pveproxy[7042]: starting 1 worker(s)
Aug 25 07:59:36 pve2 pveproxy[7042]: worker 1202705 started
Aug 25 07:59:39 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:59:49 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 07:59:59 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 08:00:09 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 08:00:12 pve2 pvescheduler[1204846]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Aug 25 08:00:12 pve2 pvescheduler[1204845]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Aug 25 08:00:19 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 08:00:30 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 08:00:40 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 08:00:49 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 08:01:00 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 08:01:09 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 08:01:12 pve2 pvescheduler[1208736]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Aug 25 08:01:12 pve2 pvescheduler[1208735]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Aug 25 08:01:19 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 08:01:30 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 08:01:40 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 08:01:49 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 08:02:00 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 08:02:09 pve2 pvestatd[6974]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Aug 25 08:02:12 pve2 pvescheduler[1212467]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
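To see how often and for how long the link drops, the relevant events can be pulled out of the journal with a short script. This is a minimal sketch, assuming the plain syslog format shown above. LOG holds a few lines copied from that log, and parse_time, link_outages and quorum_events are hypothetical helper names, not Proxmox tools.

```python
import re
from datetime import datetime

# A few lines copied verbatim from the log above. To analyse a longer period,
# replace LOG with the contents of a saved journal dump (assumption: the dump
# uses the same "Mon DD HH:MM:SS host ..." syslog prefix).
LOG = """\
Aug 25 07:56:38 pve2 kernel: bnx2x 0000:01:00.0 eno1: NIC Link is Down
Aug 25 07:56:38 pve2 kernel: vmbr2: port 1(eno1) entered disabled state
Aug 25 07:56:45 pve2 pmxcfs[6868]: [status] notice: node lost quorum
Aug 25 07:56:51 pve2 kernel: bnx2x 0000:01:00.0 eno1: NIC Link is Up, 1000 Mbps full duplex, Flow control: none
"""

STAMP = re.compile(r"^(\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) ")
LINK = re.compile(r"(\S+): NIC Link is (Up|Down)")


def parse_time(line, year=2000):
    """Parse the syslog timestamp at the start of a line; None if absent."""
    m = STAMP.match(line)
    if m is None:
        return None
    # Syslog omits the year, so a fixed placeholder year is used; only
    # time differences within the same log matter here.
    return datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")


def link_outages(lines):
    """Yield (interface, down_time, up_time) for each Down...Up pair."""
    down_since = {}
    for line in lines:
        m = LINK.search(line)
        ts = parse_time(line)
        if m is None or ts is None:
            continue
        iface, state = m.groups()
        if state == "Down":
            # Keep the first Down if several arrive before the next Up.
            down_since.setdefault(iface, ts)
        elif iface in down_since:
            yield iface, down_since.pop(iface), ts


def quorum_events(lines):
    """Yield (time, message) for pmxcfs lines that mention quorum."""
    for line in lines:
        if "pmxcfs" in line and "quorum" in line:
            yield parse_time(line), line.split(": ", 1)[1]


if __name__ == "__main__":
    lines = LOG.splitlines()
    for iface, down, up in link_outages(lines):
        secs = (up - down).total_seconds()
        print(f"{iface}: down {down:%H:%M:%S}, up {up:%H:%M:%S} ({secs:.0f} s)")
    for ts, msg in quorum_events(lines):
        print(f"{ts:%H:%M:%S} {msg}")
```

On the sample lines this reports that eno1 was down for 13 s (07:56:38 to 07:56:51) and that the node lost quorum at 07:56:45, while the link was still down. Running it over several days of journalctl output would show whether every disconnect starts with a physical link drop on eno1.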
Do you have any idea what it could be and how to fix it?
Thanks, Giuseppe