No ping on the second network bridge

mindio

New Member
Jul 9, 2023
Dear All,

I have been stuck for several days, so I am seeking some wisdom. Long story short: I had a 2-node cluster, then removed the second node and upgraded the NIC on the primary node (HP DL380 Gen9). I am now ready to add the new node back to the cluster, but the secondary network link on the primary node, intended for corosync, is not working.
I have VLAN 10 on a MikroTik CRS309 switch for the corosync network, and on the second node both links, for the main untagged subnet and for VLAN 10, work perfectly fine. But I can't ping the VLAN 10 IP on the primary node.
My network settings on the primary HP node:
Code:
auto vmbr0
iface vmbr0 inet static
        address 192.168.88.8/24
        gateway 192.168.88.1
        bridge-ports eno49np0
        bridge-stp off
        bridge-fd 0
#bridge-private

auto vmbr1
iface vmbr1 inet static
        address 10.10.10.8/24
        bridge-ports eno50np1
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094
#vlan-proxmox

All the links are UP:
1715191387208.png

1715191464377.png

From the switch side, all the links are fine: perfectly negotiated at 10 Gbit, ports recognized on the VLAN bridge. I tried swapping ports, cables, SFP modules and DACs, but still nothing.

ethtool:
Code:
Settings for eno50np1:
        Supported ports: [ FIBRE         Backplane ]
        Supported link modes:   1000baseKX/Full
                                10000baseKR/Full
                                25000baseCR/Full
                                25000baseKR/Full
                                25000baseSR/Full
        Supported pause frame use: Symmetric
        Supports auto-negotiation: Yes
        Supported FEC modes: None        RS      BASER
        Advertised link modes:  1000baseKX/Full
                                10000baseKR/Full
                                25000baseCR/Full
                                25000baseKR/Full
                                25000baseSR/Full
        Advertised pause frame use: Symmetric
        Advertised auto-negotiation: Yes
        Advertised FEC modes: None
        Speed: 10000Mb/s
        Duplex: Full
        Auto-negotiation: on
        Port: FIBRE
        PHYAD: 0
        Transceiver: internal
        Supports Wake-on: g
        Wake-on: d
        Link detected: yes

lshw:
Code:
  *-network:1
       description: Ethernet interface
       product: MT27710 Family [ConnectX-4 Lx]
       vendor: Mellanox Technologies
       physical id: 0.1
       bus info: pci@0000:04:00.1
       logical name: eno50np1
       version: 00
       serial: 04:09:73:dd:62:b1
       size: 10Gbit/s
       capacity: 25Gbit/s
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress vpd msix pm bus_master cap_list rom ethernet physical fibre 1000bt-fd 10000bt-fd 25000bt-fd autonegotiation
       configuration: autonegotiation=on broadcast=yes driver=mlx5_core driverversion=6.8.4-2-pve duplex=full firmware=14.18.2030 (HP_2690110034) latency=0 link=yes multicast=yes port=fibre speed=10Gbit/s

I doubt this is network hardware related, but I may be running in circles and missing something obvious. So please help me debug this, clear my mind and get ping on 10.10.10.8 :)

Thanks.
 
Maybe a conflict with the IP 10.10.10.8, since it was used before on different HW/NIC?
I imagine you tried a complete reboot of the router(s)/switch(es) etc.?
 
Yes, sure. I also checked and cleared the ARP table, and the MTUs are at the default of 1500 on both ends.
 
But I can't ping vlan10 IP on primary node.
I imagine you tried that from the secondary PVE node. Can you try it from a different device on the 10.10.10.0 VLAN? Then also try pinging the secondary PVE node from a different device on that VLAN.

How about pinging from the primary PVE node to another node or device on the 10.10.10.0 VLAN?
 
I can ping both NIC ports of the secondary PVE node (192.168.88.7 and 10.10.10.7) from my PC, which is on the 192.168.88.0 network, and also from the router. Currently I don't have any more devices on the 10.10.10.0 network (except both PVE nodes), but I will configure something on the weekend.

I also just connected the second NIC ports of both PVE nodes directly with a DAC to eliminate the switch. There is a link but no ping. It might need a reboot, but I currently can't shut down the VMs, so I will try tomorrow.
 
So I tried connecting the second NIC ports on both machines directly, but the result is the same: there is a link, but no ping.

Now, after some more debugging, I have arrived at this weird situation. If I run tcpdump on the 10.10.10.8 port, then ping from 10.10.10.7 immediately starts working. But as soon as I kill tcpdump, ping also dies. WTF?

Code:
root@galadriel:~# tcpdump -i eno50np1
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eno50np1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
11:59:06.333134 STP 802.1w, Rapid STP, Flags [Proposal, Learn, Forward], bridge-id 8000.08:55:31:fb:f5:d8.8008, length 36
11:59:06.519528 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 62:81:d4:b8:96:bb (oui Unknown), length 300
11:59:07.521967 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 62:81:d4:b8:96:bb (oui Unknown), length 300
11:59:08.335515 STP 802.1w, Rapid STP, Flags [Proposal, Learn, Forward], bridge-id 8000.08:55:31:fb:f5:d8.8008, length 36
11:59:08.525072 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 62:81:d4:b8:96:bb (oui Unknown), length 300
11:59:09.523939 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 62:81:d4:b8:96:bb (oui Unknown), length 300
11:59:10.338023 STP 802.1w, Rapid STP, Flags [Proposal, Learn, Forward], bridge-id 8000.08:55:31:fb:f5:d8.8008, length 36
11:59:10.518294 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 62:81:d4:b8:96:bb (oui Unknown), length 300
11:59:11.517465 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 62:81:d4:b8:96:bb (oui Unknown), length 300
11:59:12.340534 STP 802.1w, Rapid STP, Flags [Proposal, Learn, Forward], bridge-id 8000.08:55:31:fb:f5:d8.8008, length 36
11:59:12.517277 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 62:81:d4:b8:96:bb (oui Unknown), length 300
11:59:12.677822 IP 10.10.10.7 > 10.10.10.8: ICMP echo request, id 6022, seq 1, length 64
11:59:12.677912 IP 10.10.10.8 > 10.10.10.7: ICMP echo reply, id 6022, seq 1, length 64
11:59:13.517399 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 62:81:d4:b8:96:bb (oui Unknown), length 300
11:59:13.702866 IP 10.10.10.7 > 10.10.10.8: ICMP echo request, id 6022, seq 2, length 64
11:59:13.702907 IP 10.10.10.8 > 10.10.10.7: ICMP echo reply, id 6022, seq 2, length 64
11:59:14.343045 STP 802.1w, Rapid STP, Flags [Proposal, Learn, Forward], bridge-id 8000.08:55:31:fb:f5:d8.8008, length 36
11:59:14.525971 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 62:81:d4:b8:96:bb (oui Unknown), length 300
11:59:14.726903 IP 10.10.10.7 > 10.10.10.8: ICMP echo request, id 6022, seq 3, length 64
11:59:14.726951 IP 10.10.10.8 > 10.10.10.7: ICMP echo reply, id 6022, seq 3, length 64
11:59:15.528842 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 62:81:d4:b8:96:bb (oui Unknown), length 300
11:59:15.750858 IP 10.10.10.7 > 10.10.10.8: ICMP echo request, id 6022, seq 4, length 64
11:59:15.750890 IP 10.10.10.8 > 10.10.10.7: ICMP echo reply, id 6022, seq 4, length 64
 
I remember once reading about a similar situation. As far as I can theorize, it probably has to do with the interface not being in promiscuous mode, so the pings fail. However, tcpdump AFAIK turns promiscuous mode on by default, so your ping packets reach their destination while it is running.
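A minimal sketch for checking this theory without running tcpdump, assuming a Linux host where the interface flags are exposed under /sys/class/net. The helper names has_promisc_flag and promisc_state are made up for illustration; the only facts relied on are that IFF_PROMISC is bit 0x100 of the interface flags and that sysfs prints the flags as a hex value.

```shell
#!/bin/sh
# IFF_PROMISC is bit 0x100 of the interface flags (see linux/if.h).

# Succeeds (exit status 0) if the hex flags value passed as $1 has IFF_PROMISC set.
# Hypothetical helper, not part of any tool.
has_promisc_flag() {
    [ $(( $1 & 0x100 )) -ne 0 ]
}

# Prints the promiscuous state of interface $1 as read from sysfs.
# Hypothetical helper; returns 2 if the interface does not exist.
promisc_state() {
    flags=$(cat "/sys/class/net/$1/flags") || return 2
    if has_promisc_flag "$flags"; then
        echo "$1: promiscuous"
    else
        echo "$1: not promiscuous"
    fi
}
```

Calling promisc_state eno50np1 once while tcpdump is running and once after it has exited should show whether the flag really toggles. ip -d link show eno50np1 reports the same thing as a "promiscuity" counter.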
 
This theory looks promising:
1715947883605.png

What's the best way to enable this?
 
So I ran ip link set eno50np1 promisc on and bingo! :) Now, how do I make this persistent?
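One possible way to keep the setting across reboots, sketched under the assumption that the node uses the stock /etc/network/interfaces setup with ifupdown2: add a post-up line to the vmbr1 stanza. Only the post-up line is new; the rest repeats the stanza posted above. This is a workaround sketch, and it does not establish why the port needs promiscuous mode in the first place.

```
# Sketch only: the post-up line is an assumption and does not come from this thread.
auto vmbr1
iface vmbr1 inet static
        address 10.10.10.8/24
        bridge-ports eno50np1
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094
        post-up ip link set dev eno50np1 promisc on
```

With ifupdown2 the change can be applied with ifreload -a, ideally from a console session in case the link drops.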
 
What you are doing is not normal. Promiscuous mode means that the card will accept all traffic, even traffic that is not destined for it. This mode is normally used for sniffers.
It is also used for bridges, as the NIC has to accept traffic destined for VMs. The PVE journal is filled with interfaces entering/leaving promiscuous mode during normal operations.

I'd recommend that you remove the vmbr config and test with just the bare hardware NIC. I would start with a direct connection.

Good luck

I'd examine the network trace in detail with tcpdump or Wireshark, paying particular attention to MAC addresses.
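As a rough sketch of that MAC check: tcpdump prints link-level headers with -e, and -p stops it from putting the interface into promiscuous mode, so the capture itself does not mask the problem. The helper dst_mac below is hypothetical and assumes the usual "-e" line shape of "time src-mac > dst-mac, ethertype ...".

```shell
#!/bin/sh
# Extracts the destination MAC from one line of `tcpdump -e -n` output.
# Assumed line shape: "<time> <src-mac> > <dst-mac>, ethertype ..."
# Hypothetical helper for illustration only.
dst_mac() {
    printf '%s\n' "$1" | awk '{ sub(/,$/, "", $4); print tolower($4) }'
}
```

One way to use it: capture with tcpdump -p -e -n -i eno50np1 -c 20 'icmp or arp', pass a captured line to dst_mac, and compare the result with the contents of /sys/class/net/eno50np1/address and /sys/class/net/vmbr1/address. If the frames are addressed to a MAC that differs from the one on the physical port, that would at least be consistent with the promiscuous-mode behaviour seen above.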

