Network problem on cluster-node if I start a vm or ct

mjg · Jul 24, 2021

Hello all,

I have a 6-node cluster with Proxmox 6.4-13.
Five nodes run perfectly, but the newest node has network problems whenever I start a VM or CT with a network interface.
While the network is hanging, the syslog of the node shows the message "cfs-lock 'file-replication_cfg' error: no quorum!".
The node and the VM or CT are then unreachable over the network for 3-4 minutes. After that, both are reachable again for another 3-4 minutes, and so on.
If I don't start any VM or CT, the node is always reachable over the network. The same is true if I start a VM without a network interface.

Any ideas?

Thank you for helping
Martin

ph0x · Jul 24, 2021

Any duplicate addresses in any of your networks?

mjg · Jul 24, 2021

Hi, thank you for your answer.
I don't think so. The problem also occurs if I start the VM from a live ISO image without any network configuration, and DHCP is not running in this network.

ph0x · Jul 24, 2021

Could you share /etc/network/interfaces of your cluster? Ideally from one working node and from the problematic one.

mjg · Jul 24, 2021

Yes, of course.
The working node:
auto lo
iface lo inet loopback

iface enp3s0 inet manual

auto vmbr0
iface vmbr0 inet static
        address 10.14.33.162/20
        gateway 10.14.47.254
        bridge_ports enp3s0
        bridge_stp off
        bridge_fd 0

iface enp4s0 inet manual

The problem node:
auto lo
iface lo inet loopback

iface eno1 inet manual

auto vmbr0
iface vmbr0 inet static
        address 10.14.33.166/20
        gateway 10.14.47.254
        bridge_ports eno1
        bridge_stp off
        bridge_fd 0

iface eno2 inet manual

ph0x · Jul 24, 2021

And that's it? No dedicated network for Corosync?

mjg · Jul 24, 2021

No, actually not.

ph0x · Jul 24, 2021

Then I would consider this an expected side effect. Latency on the sixth node is not low enough after firing up a guest with network access.
You should definitely define a separate Corosync network on a separate interface.

mjg · Jul 24, 2021

OK, thank you for your help. Then I will define a dedicated network and report back on whether this fixes the problem.

mjg · Jul 26, 2021

Hi,
I now have a dedicated Corosync network. The error message "cfs-lock 'file-replication_cfg' error: no quorum!" is gone, but the problem with the network connection still exists.
I have seen that the interface eno1 goes down periodically, but only when I start a VM on this machine.

Do you have any idea why this machine is causing this trouble?

Best

ph0x · Jul 26, 2021

I'm more the logical network guy, not the physical one. Are there any hints in the kernel logs or dmesg?
What is the output of the networking service at the time of an outage?
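
For anyone following along, here is a rough Python sketch of how link up/down events for one interface could be pulled out of dmesg-style output to see how regular the outages are. It assumes the default bracketed seconds-since-boot timestamps and the common "Link is Up" / "Link is Down" wording, which varies by NIC driver. The sample lines and the e1000e driver name are invented for illustration and are not taken from this cluster.

```python
import re

# Common driver wording; the exact text differs between NIC drivers (assumption).
LINK_RE = re.compile(r"\bLink is (Up|Down)\b", re.IGNORECASE)
# Default dmesg timestamp, e.g. "[ 1000.000000]" (seconds since boot).
TS_RE = re.compile(r"^\[\s*(\d+\.\d+)\]")


def link_events(lines, iface):
    """Return (timestamp, 'up' | 'down') tuples for the given interface."""
    iface_re = re.compile(rf"\b{re.escape(iface)}\b")
    events = []
    for line in lines:
        if not iface_re.search(line):
            continue
        state = LINK_RE.search(line)
        ts = TS_RE.match(line)
        if state and ts:
            events.append((float(ts.group(1)), state.group(1).lower()))
    return events


# Hypothetical sample lines, only to show the expected input shape.
SAMPLE = [
    "[ 1000.000000] e1000e: eno1 NIC Link is Down",
    "[ 1210.500000] e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex",
    "[ 1211.000000] vmbr0: port 1(eno1) entered forwarding state",
    "[ 1430.250000] e1000e: eno1 NIC Link is Down",
]

events = link_events(SAMPLE, "eno1")
for ts, state in events:
    print(f"{ts:10.2f}s  eno1 link {state}")
```

On a real node the input lines could come from `subprocess.run(["dmesg"], capture_output=True, text=True).stdout.splitlines()`. Evenly spaced down/up pairs would point to something outside the host cutting the link on a timer.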

mjg · Jul 26, 2021

I have been thinking about this periodic link-down.
I think the problem comes from our network side: the switch detects a second MAC address and cuts the link, after a while the link comes back, and so on. I will ask our network guys tomorrow.
Thank you for your help.
In any case, it was a good occasion to set up the dedicated Corosync network.
Thank you very much.
Best

mjg · Jul 31, 2021

Just for information:
This was exactly the problem. The switch detected multiple MAC addresses on the port and took the link down.
Thank you
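
As a footnote for readers who hit the same symptom: with a bridged guest, the switch port behind eno1 sees the host's MAC address plus one MAC address per running guest NIC. A switch that limits MAC addresses per port could react to that, which would fit the behaviour described in this thread. The Python sketch below lists the guest-side MAC addresses a bridge has learned. It assumes the usual `bridge fdb show` line layout (`MAC dev DEV ... master BRIDGE [permanent]`). The sample text and all MAC addresses in it are made up.

```python
def guest_macs(fdb_text, bridge, uplink):
    """Unicast MACs learned on ports of `bridge` other than `uplink`."""
    macs = set()
    for line in fdb_text.splitlines():
        parts = line.split()
        if len(parts) < 3 or parts[1] != "dev":
            continue
        mac, dev, fields = parts[0].lower(), parts[2], parts[3:]
        # Keep only entries that belong to the requested bridge.
        if "master" not in fields:
            continue
        idx = fields.index("master")
        if idx + 1 >= len(fields) or fields[idx + 1] != bridge:
            continue
        # Skip the uplink itself and the ports' own (permanent) addresses.
        if dev == uplink or "permanent" in fields:
            continue
        # Skip multicast addresses (lowest bit of the first octet is set).
        if int(mac.split(":")[0], 16) & 1:
            continue
        macs.add(mac)
    return macs


# Hypothetical output of `bridge fdb show`, for illustration only.
SAMPLE = """\
0c:c4:7a:00:00:01 dev eno1 master vmbr0 permanent
33:33:00:00:00:01 dev eno1 self permanent
fe:54:00:aa:bb:01 dev tap100i0 master vmbr0 permanent
52:54:00:aa:bb:01 dev tap100i0 master vmbr0
9c:8e:99:00:00:02 dev eno1 master vmbr0
"""

guests = guest_macs(SAMPLE, "vmbr0", "eno1")
# The host's own MAC counts as one more address on the switch port.
print(f"switch port would see about {len(guests) + 1} source MACs")
```

If the count goes above what the switch allows on that port, the fix is on the switch side (raising the limit or changing the port configuration), not on the Proxmox node.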
