[SOLVED] Networking issues

DrillSgtErnst · Nov 18, 2021

Hi,
I have a 5-node Ceph cluster and one node is acting up.

Right now it's not in production use, still testing.

I installed everything yesterday:
tested it and everything seemed stable, just some random stuttering here and there. I didn't think much of it.
updated packages
set up a local time server
created bonds for networking
created the cluster
installed Ceph
Everything is up and running.

Today my first node started being weird.

On the management network I have packet loss of about 17-25%. The bond is balance-alb. I disabled the bond and tested the adapters individually, but I get the same error on both cards. (These are the random stutters in the console I saw earlier. They always seem to happen when some packets are not getting through.)
I also have a Broadcom dual-port 25 Gbit/s adapter installed. One port does not work anymore. I can ping it locally (so the OS seems to know about the adapter), but I can't reach any other device via ping. The other port works just fine.
Swapping the cables makes the other network unavailable, so it is certainly the adapter itself that is causing that problem.
But for the network issues in general, I think that's more of a software problem.
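
A minimal sketch for putting a number on this kind of loss, assuming the iputils `ping` that ships with Debian/Proxmox and its usual summary line. The helper name `loss_percent` and the address `192.0.2.10` are placeholders for illustration, not anything from this setup.

```shell
#!/bin/sh
# loss_percent: read ping output on stdin and print only the loss percentage.
# Assumes the iputils summary line, e.g.
#   "100 packets transmitted, 83 received, 17% packet loss, time 99123ms"
loss_percent() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Example call; 192.0.2.10 is a placeholder address, not one from this thread:
#   ping -c 100 -q 192.0.2.10 | loss_percent
```

Running the same call against each node, once per adapter, gives comparable figures instead of eyeballing the console.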

root@pve1:~# pveversion -v
proxmox-ve: 7.1-1 (running kernel: 5.13.19-1-pve)
pve-manager: 7.1-5 (running version: 7.1-5/6fe299a0)
pve-kernel-5.13: 7.1-4
pve-kernel-helper: 7.1-4
pve-kernel-5.11: 7.0-10
pve-kernel-5.13.19-1-pve: 5.13.19-2
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-4-pve: 5.11.22-9
ceph: 16.2.6-pve2
ceph-fuse: 16.2.6-pve2
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-3
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.14-1
proxmox-backup-file-restore: 2.0.14-1
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.4-2
pve-cluster: 7.1-2
pve-container: 4.1-2
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-1
pve-qemu-kvm: 6.1.0-2
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-3
smartmontools: 7.2-1
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3

I would have to purge Ceph, delete the node, reinstall it, rejoin the cluster and set up Ceph again.
I just don't want to reinstall everything only to find out whether it is the OS or not.
What can I do?

itNGO · Nov 18, 2021

Hi,
Maybe you can swap in a Broadcom NIC from another node and see whether the problem moves with the card or stays with the host.
If it moves with the card, replace that NIC.

DrillSgtErnst · Nov 18, 2021

Sorry, it seems there is a misunderstanding.
The normal 10 Gbit NIC is the one with the packet loss; the Broadcom is not responding at all. I have an RMA for the Broadcom by tomorrow.

So if it is the card, I am good. But the normal 10 Gbit network should not have 25% packet loss, which sets my cluster to degraded every now and then.

itNGO · Nov 18, 2021

OK, thanks for the clarification. Is there a switch between the nodes? Did you try separating two nodes and connecting them directly to see whether ping is stable then?

DrillSgtErnst · Nov 18, 2021

There is a switch, but all other hosts are working fine.
The problem occurs on either network card. I first connected only card A and had 27% loss, then only card B with 17% packet loss.
Both together also gave around 10% packet loss.

I can do a direct connection, but I doubt it is the switch, since the other 4 hosts can ping each other just fine.

So I tested it nevertheless, but the problem persists:
packet drops on direct attach as well.

I also tried another switch.
After some more testing I found that the network error appears between all nodes.

Dug deeper:
Tried another switch.
Separated the network adapters of each host onto one switch per adapter: all fine.
Disabled one of the two adapters per Linux bond (balance-alb): no more packet loss.
The log shows: vmbr0: received packet on bond0 with own address as source address (addr:12:7f:ba:...., vlan 0)
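
That log line typically means the bridge is seeing its own frames come back in, for example when a switch floods a broadcast sent out on one bond member back to the other member. A rough sketch for counting how often it occurs, assuming the kernel log is piped in as text; the helper name `count_own_address_warnings` is made up for illustration.

```shell
#!/bin/sh
# count_own_address_warnings: read kernel log text on stdin and print how many
# lines report a frame arriving with the bridge's own MAC as source.
count_own_address_warnings() {
    # grep exits non-zero when the count is 0; "|| true" keeps the helper
    # usable in scripts that run with "set -e".
    grep -c 'received packet on .* with own address as source address' || true
}

# Example calls on a node (bond0 is the bond name from the log line above):
#   journalctl -k | count_own_address_warnings
#   cat /proc/net/bonding/bond0    # shows bonding mode and per-slave link state
```

If the count stops growing once one bond member is unplugged, that points at the bond mode and switch interaction rather than at a faulty NIC.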

So I guess it is even more likely something in the OS now.
AFAIK balance-alb should not require any configuration on the switch.
I will try other bond modes.

Set them all to ~~balance-tlb~~ 802.3ad and configured the switch accordingly.
Looks fine so far: 0% packet loss, with everything else unchanged.
I'm going to swap the Broadcom card tomorrow and will let you know.
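
To double-check that a bond really came up in the intended mode after such a change, the kernel's view can be read from `/proc/net/bonding/`. A small sketch, assuming the bond is named `bond0`; the helper name `bond_mode` is hypothetical.

```shell
#!/bin/sh
# bond_mode: read the contents of /proc/net/bonding/<bond> on stdin and print
# the value of its "Bonding Mode:" line.
bond_mode() {
    sed -n 's/^Bonding Mode: //p'
}

# Example call on a node; bond0 is assumed to be the bond's name:
#   bond_mode < /proc/net/bonding/bond0
# For an LACP bond the kernel reports "IEEE 802.3ad Dynamic link aggregation".
```

The same file also lists each member interface with its link status, which helps spot a port that never joined the aggregate.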

Thank you @itNGO for the help.

itNGO · Nov 18, 2021

Hi,
even though balance-tlb and balance-alb are switch-independent, there are switches out in the world that do not work well with them.
Broadcast storm filters can cause huge problems if configured. However, if you can use 802.3ad, that should be the way to go.

Glad it is working better now.

Regards....

DrillSgtErnst · Dec 2, 2021

So, last things last.
One active optical cable broke overnight.

After using LACP for management and replacing one cable, everything works fine now.
