[SOLVED] VM sees different machines than the Proxmox host on the same network

rokyo401

New Member
Mar 12, 2024
Hi there,

I am currently configuring access to a separate storage network (with its own switch) for a VM running on Proxmox, and the issue is that the VM can only ping some of the machines on that storage network, but not others. The VM can reach and ping machines on the regular Proxmox management network just fine.

The Proxmox host has a 2-port PCIe 10Gb NIC that I configured as an active-backup bond in Proxmox and then set as the port of a bridge called vmbr1. This bridge is given to the VM as its second interface (the first being the normal Proxmox management bridge vmbr0).
Inside the VM, this bridge is visible as network interface ens19 and is configured there via /etc/network/interfaces like this:

Code:
auto ens19
iface ens19 inet static
        mtu 9000
        address 192.168.0.122
        netmask 255.255.255.0
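
For comparison, the host-side setup described above corresponds to roughly this in the Proxmox host's /etc/network/interfaces (just a sketch; the slave NIC names enp65s0f0/enp65s0f1 are placeholders, not the actual ones on my host, and no MTU is set on the host side at this point):

Code:
auto bond0
iface bond0 inet manual
        bond-slaves enp65s0f0 enp65s0f1
        bond-mode active-backup
        bond-miimon 100

auto vmbr1
iface vmbr1 inet manual
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0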

After a reboot, the VM shows that it applied both the jumbo frames and the IP (via ip a):

Code:
3: ens19: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc fq_codel state UP group default qlen 1000
    link/ether bc:24:11:f1:07:cc brd ff:ff:ff:ff:ff:ff
    altname enp0s19
    inet 192.168.0.122/24 brd 192.168.0.255 scope global ens19
       valid_lft forever preferred_lft forever
    inet6 fe80::be24:cb78:fef1:36c/64 scope link
       valid_lft forever preferred_lft forever

Now, strangely, when I try to ping other machines on this storage network, the ping only goes through for some of them but not others. The machines where the ping works can also ping the VM back; the others cannot. The other machines can all ping each other perfectly fine.

The config on all machines (pingable and non-pingable) is the same (except for the assigned IP, of course) including the jumbo frames. There is no apparent reason why some should be reachable and some not. They all connect physically to the same switch. There is no firewall in play between them.
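
(As a side note: a quick way to sanity-check that 9000-byte frames actually pass end to end is a do-not-fragment ping at full size; 8972 bytes of ICMP payload + 8 bytes ICMP header + 20 bytes IP header = 9000. The target IP here is just one of the storage hosts as an example.)

Code:
ping -M do -s 8972 192.168.0.11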

Running ip a on the other machines in this network shows the same parameters as inside the VM, except that the other machines show mq instead of fq_codel as the queueing discipline (qdisc) for their interface.

Example of ip a from one of the non-pingable machines:

Code:
3: enp1s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
    link/ether 90:e2:ba:7d:77:fd brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.101/24 brd 192.168.0.255 scope global enp1s0f1
       valid_lft forever preferred_lft forever
    inet6 fe80::92e2:7cff:fe7d:2fd/64 scope link
       valid_lft forever preferred_lft forever

Example of ip a from one of the pingable machines:

Code:
4: enp129s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
    link/ether 3c:ec:ef:38:f4:ca brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.102/24 brd 192.168.0.255 scope global enp129s0f0
       valid_lft forever preferred_lft forever
    inet6 fe80::3eec:efff:fefe:84ba/64 scope link
       valid_lft forever preferred_lft forever

Not sure what is happening here... any suggestions would be helpful!


-----------------------
Solution:

TL;DR: The switch ports the Proxmox host was connected to were configured for LACP on the switch side, but the bond0 I created in Proxmox was set to "active-backup". This mismatch, for some reason, split the network in two for the Proxmox host and its VM.


If anybody can explain to me WHY this misconfiguration causes this behaviour, I'm more than happy to hear it! ;-)
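
For anyone finding this later: the fix corresponds to roughly this bond stanza in the host's /etc/network/interfaces (a sketch; the slave NIC names and the hash policy are placeholders/common defaults, not copied from my actual config):

Code:
auto bond0
iface bond0 inet manual
        bond-slaves enp65s0f0 enp65s0f1
        bond-mode 802.3ad
        bond-miimon 100
        bond-xmit-hash-policy layer2+3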
 
Hmm okay, I thought that maybe the differing queueing disciplines on the VM (fq_codel) and the physical machines on the network (mq) could be the problem, but the qdiscs are exactly the same on all physical machines, the pingable ones and the non-pingable ones. So why do some work and some not?
 
Okay, I got a little more info with nmap:
Code:
root@VM:/home/user# nmap -sn 192.168.0.0/24
Starting Nmap 7.93 ( https://nmap.org ) at 2024-03-19 11:44 CET
Nmap scan report for 192.168.0.10
Host is up (0.00047s latency).
MAC Address: 3C:EC:EF:38:AD:71 (Super Micro Computer)
Nmap scan report for 192.168.0.11
Host is up (0.00055s latency).
MAC Address: AC:1F:6B:CC:64:56 (Super Micro Computer)
Nmap scan report for 192.168.0.20
Host is up (0.00062s latency).
MAC Address: 3C:EC:EF:38:56:DC (Super Micro Computer)
Nmap scan report for 192.168.0.31
Host is up (0.00030s latency).
MAC Address: 52:34:9A:5A:E5:A7 (Unknown)
Nmap scan report for 192.168.0.41
Host is up (0.00028s latency).
MAC Address: C2:8A:A1:04:4F:7B (Unknown)
Nmap scan report for 192.168.0.102
Host is up (0.00029s latency).
MAC Address: 3C:EC:EF:FF:84:7A (Super Micro Computer)
Nmap scan report for 192.168.0.122
Host is up.
Nmap done: 256 IP addresses (7 hosts up) scanned in 1.98 seconds
root@VM:/home/user# ping 192.168.0.10
PING 192.168.0.10 (192.168.0.10) 56(84) bytes of data.
^C
--- 192.168.0.10 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2045ms

root@VM:/home/user# ping 192.168.0.11
PING 192.168.0.11 (192.168.0.11) 56(84) bytes of data.
64 bytes from 192.168.0.11: icmp_seq=1 ttl=64 time=0.297 ms
64 bytes from 192.168.0.11: icmp_seq=2 ttl=64 time=0.312 ms
^C
--- 192.168.0.11 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1003ms
rtt min/avg/max/mdev = 0.297/0.304/0.312/0.007 ms
root@VM:/home/user# ping 192.168.0.20
PING 192.168.0.20 (192.168.0.20) 56(84) bytes of data.
^C
--- 192.168.0.20 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2044ms

root@VM:/home/user# ping 192.168.0.31
PING 192.168.0.31 (192.168.0.31) 56(84) bytes of data.
64 bytes from 192.168.0.31: icmp_seq=1 ttl=64 time=0.591 ms
64 bytes from 192.168.0.31: icmp_seq=2 ttl=64 time=0.305 ms
^C
--- 192.168.0.31 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1007ms
rtt min/avg/max/mdev = 0.305/0.448/0.591/0.143 ms
root@VM:/home/user# ping 192.168.0.41
PING 192.168.0.41 (192.168.0.41) 56(84) bytes of data.
^C
--- 192.168.0.41 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1019ms

root@VM:/home/user# ping 192.168.0.102
PING 192.168.0.102 (192.168.0.102) 56(84) bytes of data.
64 bytes from 192.168.0.102: icmp_seq=1 ttl=64 time=0.351 ms
64 bytes from 192.168.0.102: icmp_seq=2 ttl=64 time=0.327 ms
^C
--- 192.168.0.102 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1014ms
rtt min/avg/max/mdev = 0.327/0.339/0.351/0.012 ms
root@VM:/home/user#

So the VM CAN see the physical hosts, just not ping them all (nmap's host discovery on a local subnet uses ARP rather than ICMP). It shows 7 hosts (including itself) as up in the 192.168.0.0/24 network:

.10 / .11 / .20 / .31 / .41 / .102 / .122

but only .11, .31 and .102 can be pinged (.122 is the VM itself).

Now, .10 is a storage server from which the VM needs to import NFS shares. That specific server cannot be pinged and also does not respond when I run showmount -e 192.168.0.10 from the VM, so importing the NFS shares is not possible like this. Running showmount like this makes the shell hang for about a minute and it takes multiple Ctrl-C presses to stop it.

Any ideas why this is the case? Why does the VM detect all these machines with nmap but can only ping some of them? There is no firewall anywhere on or near the switch that serves this subnet...
 
Running ethtool for the interface shows nothing configured:

Code:
root@VM:/home/user# /usr/sbin/ethtool ens19
Settings for ens19:
Supported ports: [  ]
Supported link modes:   Not reported
Supported pause frame use: No
Supports auto-negotiation: No
Supported FEC modes: Not reported
Advertised link modes:  Not reported
Advertised pause frame use: No
Advertised auto-negotiation: No
Advertised FEC modes: Not reported
Speed: Unknown!
Duplex: Unknown! (255)
Auto-negotiation: off
Port: Other
PHYAD: 0
Transceiver: internal
Link detected: yes

But I think this may be normal, since the other interface (the Proxmox management interface) shows exactly the same, and so do all my other Proxmox VMs and containers...
 
Does ethtool simply not work on bonds and bridges? Because if I run it on the physical machines' bonds, it shows only a few parameters, but if I run it on the individual NICs behind the bond, it shows everything correctly:

bond:
Code:
root@phys02:~# ethtool bond0
Settings for bond0:
Supported ports: [  ]
Supported link modes:   Not reported
Supported pause frame use: No
Supports auto-negotiation: No
Supported FEC modes: Not reported
Advertised link modes:  Not reported
Advertised pause frame use: No
Advertised auto-negotiation: No
Advertised FEC modes: Not reported
Speed: 20000Mb/s
Duplex: Full
Auto-negotiation: off
Port: Other
PHYAD: 0
Transceiver: internal
Link detected: yes

NIC:
Code:
root@phys02:~# ethtool enp129s0f0
Settings for enp129s0f0:
Supported ports: [ FIBRE ]
Supported link modes:   1000baseT/Full
10000baseT/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Supported FEC modes: Not reported
Advertised link modes:  1000baseT/Full
10000baseT/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: Not reported
Speed: 10000Mb/s
Duplex: Full
Auto-negotiation: on
Port: FIBRE
PHYAD: 0
Transceiver: internal
Supports Wake-on: d
Wake-on: d
Current message level: 0x00000007 (7)
drv probe link
Link detected: yes
 
Another weird thing about the nmap scan of the 192.168.0.0/24 network:

It doesn't show the physical machine with the IP 192.168.0.101 in the scan, but when I check its ip a, it shows that it has 192.168.0.101, AND it can ping all other physical machines in the 192.168.0.0/24 network perfectly fine, and the other physical machines can ping it... !?

Could this be a config issue with the ports on the switch? Because nothing else makes sense to me here... :(
 
Checking ip neighbor show on the VM shows all of the physical machines that were discovered by nmap earlier as "STALE".
When I ping them, they change to "DELAY" and a second later to "REACHABLE", even the ones where the ping fails.
A few seconds later, they go back to "STALE".

The one machine that wasn't discovered by nmap shows as "FAILED".
Then when I ping it, it changes to "INCOMPLETE" and a second later back to "FAILED".

The physical machines which cannot ping the VM DO show its correct MAC address in lladdr when I run ip neighbor show on them, and the VM shows up as "REACHABLE" there after a failed ping.
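
(For anyone wanting to reproduce this: the state transitions can be watched live with iproute2 while pinging, e.g.:)

Code:
# one-shot view of the neighbour table for the storage interface
ip -4 neigh show dev ens19
# or stream the state changes as they happen
ip monitor neigh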
 
Alright, now I changed the configuration so that the vmbr1 bridge also gets an IP on the Proxmox host (192.168.0.122) AND on the VM (192.168.0.123), and set the MTU to 9000 on the following (a config sketch follows the list):

- the bond0 on the Proxmox host via the web GUI
- the vmbr1 on the Proxmox host via the web GUI
- (on the two NICs making up bond0 the MTU setting is grayed out in the web GUI, I suppose because they're bonded?)
- the network interface (ens19) on the VM via /etc/network/interfaces
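
The host side of that change corresponds to roughly this (a sketch showing only the vmbr1 stanza; the bond0 stanza stays as before apart from the added MTU):

Code:
auto vmbr1
iface vmbr1 inet static
        address 192.168.0.122/24
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        mtu 9000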

Now, on the VM I can:
- still only ping the same hosts as before
- only find the same hosts as before with nmap (.10/11/20/31/41/102/122/123)
- BUT: showmount -e 192.168.0.10 now shows the NFS shares on the storage server!

On the Proxmox host:
- showmount times out (just like it did previously on the VM)
- nmap shows completely different hosts than the nmap on the VM does: (.3/14/35/42/101/122/123/231/232/254) !???
- I can only ping the VM and some of the hosts found by nmap

How is this possible??? There is no VLAN tagging going on in Proxmox, and none on the Debian installed in the VM either. The bond, the vmbr, and the individual NICs are all set to "VLAN aware = no"...

Could this be some VLAN setting on the switch ports? But why, then, can all the physical machines reach each other perfectly fine? Is maybe just the port where the Proxmox host is connected assigned to some VLAN? Or is it maybe the only port NOT assigned to a VLAN? Which configuration would explain this behaviour?

I find it especially strange that the VM and the Proxmox host see completely different machines in nmap even though they're obviously connected via the same physical port on the switch!?

I'll be on premises tomorrow, so I can finally check the switch itself (it's a separate storage network, so I can't access the switch management from anywhere outside but have to go there with a cable)! Really weird stuff going on here! :oops:
 
So, one restart of the Proxmox host (and the VM with it) later, and now showmount -e 192.168.0.10 times out on BOTH machines again... wth? The machines detected by nmap are still the same. Pinging behaves the same too.

I changed the title of the thread since it now better reflects what I have found out about the problem since creating it.

I'll go over to the server room in a minute and check the switch. Hopefully, this will finally give me more insight into this weird issue!
 
Okay, this is new:

I monitored the incoming packets with bmon on one of the servers that doesn't respond to ping, while pinging it from the VM/PM host, and monitored the outgoing packets on the VM/PM host as well.

The RX packet counter on the pinged host increases with every TX packet from the VM/PM host, and so does the TX packet counter on the pinged host. Still, ping says "Destination Host Unreachable" on both the VM and the PM host. So the pinged hosts are actually receiving and responding to the ping, but the VM/PM host is not getting the response!
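
(The same thing can be confirmed per packet with tcpdump on the pinged host, assuming tcpdump is installed there; the interface name is the one from the ip a output earlier in the thread:)

Code:
tcpdump -ni enp1s0f1 icmp and host 192.168.0.122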
 
Soooo, checking the ARP tables of the servers revealed something new, again:

The ARP table of the VM (192.168.0.122) shows "(incomplete)" for all the hosts that it didn't detect with nmap (and the correct MACs for the ones it DID detect):

(some of the IPs only showed up in this table after I pinged them [with no response])

Code:
root@VM:~# /usr/sbin/arp | grep ens19
192.168.0.3                     (incomplete)                              ens19
192.168.0.10            ether   3c:ec:ef:38:ad:73   C                     ens19
192.168.0.11            ether   ac:1f:6b:f5:64:c4   C                     ens19
192.168.0.12            ether   ac:1f:6b:75:34:b7   C                     ens19
192.168.0.14                    (incomplete)                              ens19
192.168.0.20            ether   3c:ec:ef:38:56:bb   C                     ens19
192.168.0.31            ether   52:34:9a:fd:e5:12   C                     ens19
192.168.0.35                    (incomplete)                              ens19
192.168.0.41            ether   c2:8a:a1:04:4f:6b   C                     ens19
192.168.0.42                    (incomplete)                              ens19
192.168.0.101                   (incomplete)                              ens19
192.168.0.102           ether   3c:ec:ef:38:84:44   C                     ens19
192.168.0.231                   (incomplete)                              ens19
192.168.0.232                   (incomplete)                              ens19
192.168.0.254                   (incomplete)                              ens19


The ARP table of the Proxmox host (192.168.0.12) shows "(incomplete)" for all the hosts that it didn't detect with nmap (and the correct MACs for the ones it DID detect):

(some of the IPs only showed up in this table after I pinged them [with no response])

Code:
root@proxmox:~# arp | grep vmbr1
192.168.0.3             ether   9e:08:95:f5:3c:1a   C                     vmbr1
192.168.0.10                    (incomplete)                              vmbr1
192.168.0.11                    (incomplete)                              vmbr1
192.168.0.14            ether   b2:5a:49:af:b3:d3   C                     vmbr1
192.168.0.20            ether   3c:ec:ef:38:56:bb   C                     vmbr1
192.168.0.31                    (incomplete)                              vmbr1
192.168.0.35            ether   36:81:1a:a3:c4:c4   C                     vmbr1
192.168.0.41                    (incomplete)                              vmbr1
192.168.0.42            ether   c6:43:93:b2:72:3d   C                     vmbr1
192.168.0.101           ether   90:e2:ba:7d:02:48   C                     vmbr1
192.168.0.102                   (incomplete)                              vmbr1
192.168.0.122           ether   bc:24:11:f1:03:2a   C                     vmbr1
192.168.0.231           ether   3c:ec:ef:38:79:77   C                     vmbr1
192.168.0.232           ether   3c:ec:ef:38:51:92   C                     vmbr1
192.168.0.254           ether   b8:d4:e7:7d:fb:c1   C                     vmbr1

except for 192.168.0.20! This server has a complete (and correct!) MAC address on both the PM host and the VM, even though pinging it returns "Destination Host Unreachable" on both and it only shows up in nmap on the VM.

The ARP table of 192.168.0.20 shows the correct MAC address for both the PM host (.12) and the VM (.122):

Code:
root@dot-twenty:~# arp | grep bond-storage
192.168.0.3             ether   9e:08:95:f5:3c:1a   C                     bond-storage
192.168.0.10            ether   3c:ec:ef:38:ad:73   C                     bond-storage
192.168.0.12            ether   ac:1f:6b:75:34:b7   C                     bond-storage
192.168.0.31            ether   52:34:9a:fd:e5:12   C                     bond-storage
192.168.0.41            ether   c2:8a:a1:04:4f:6b   C                     bond-storage
192.168.0.101           ether   90:e2:ba:7d:02:48   C                     bond-storage
192.168.0.102           ether   3c:ec:ef:38:84:44   C                     bond-storage
192.168.0.122           ether   bc:24:11:f1:03:2a   C                     bond-storage
192.168.0.231           ether   3c:ec:ef:38:79:77   C                     bond-storage

Why just this one host, though?
 
After checking on the switch, things get even stranger:

- the switch is an Aruba 3810M with 16 SFP+ ports
- it has 2 VLANs: "DEFAULT" (1) and "STORAGE" (2) under "Switch Configuration -> VLAN -> VLAN Names"
- primary VLAN is set to "DEFAULT" under "Switch Configuration -> VLAN -> VLAN Support"
- all 16 ports are assigned to the "STORAGE" VLAN as "untagged" under "Switch Configuration -> VLAN -> VLAN Port Assignment" (for "DEFAULT" VLAN they all show "No")
- the "DEFAULT" VLAN has IP range 10.0.10.0/24 and "STORAGE" has 192.168.0.0/24 under "Switch Configuration -> Internet (IP) Service"
- all 16 ports show as "enabled", "mode -> auto" and "Flow Ctrl -> Disable" under "Switch Configuration -> Port/Trunk Settings"

So far, so normal, I think?


Now, two things on the switch could have something to do with the problem (as far as I can tell with my limited networking knowledge):

1. under "Switch Configuration -> System Information" it shows "Jumbo Max Frame Size" as 9216 and "Jumbo IP MTU" as 9198

In Proxmox and on all the hosts, MTU is set to 9000 for the interfaces going into this switch. Is it a problem that this is not exactly 9198?


2. under "Switch Configuration -> Port/Trunk Settings" it shows that ports 1-8 are assigned to "Groups" and their "Type" is "LACP", while ports 9-16 show nothing under "Group" and "Type"

As far as I understand, LACP aggregates multiple ports into one logical link so that they can transmit with their combined bandwidth (?).

This would make sense in this case, because two ports are in each group here and these two each go to one server:
- ports 1 + 2 are in Group "Trk1" and go to physical server with IP ending in .10
- ports 3 + 4 are in Group "Trk2" and go to physical server with IP ending in .20
- ports 5 + 6 are in Group "Trk3" and go to physical server with IP ending in .11
- ports 7 + 8 are in Group "Trk4" and go to physical server with IP ending in .12 (this is the Proxmox host in question here)

The other ports (which are not in groups) go to the following servers:
- ports 9 + 10 go to physical server with IP ending in .232
- port 11 goes to physical server with IP ending in .101
- port 12 goes to physical server with IP ending in .102
- ports 13 + 14 go to physical server with IP ending in .231 (this server is another PVE host which hosts the VMs with IPs .3/14/31/35/41/42)
- ports 15 + 16 have no cable attached

Now, none of this information helps me with the problem; it only makes it stranger.

As seen in the nmap scan above, the problematic PVE host (.12) and its VM (.122) each detect different VMs running on the same other PVE host (.231), which uses the same ports on the switch (13 + 14).

So now their view of the network differs even though the senders of the pings/nmap-scans (.12 and .122) are on the same physical ports (7 + 8) AND the receivers (for example: VMs .41 [detected only by .122] and .42 [detected only by .12]) are also on the same host (.231) which uses physical ports (13 + 14).


I thought this would likely be a VLAN issue, but after checking the switch, this seems not to be the case. I have no idea why the PVE host and its VM have different views of the network. Is there any other place (except VLAN and physical ports) where traffic could be separated? There is no firewall in this network.
 
Hold up!

Could the problem be that the switch ports that the Proxmox host is connected to are set to LACP, but the bond0 of the two interfaces on the Proxmox host is set to "active-backup"????
 
Lord have mercy on my soul, that was it!!

After setting the bond0 to LACP and rebooting, everything works!!!
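
(If it's useful for anyone: whether the LACP negotiation actually came up can be checked on the host afterwards without any extra tools:)

Code:
cat /proc/net/bonding/bond0
# should now report "Bonding Mode: IEEE 802.3ad Dynamic link aggregation"
# and list both slaves with a matching Aggregator ID / partner MAC from the switch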

Moral of the story: check your switch and know your bond modes! o_O
 
I still don't understand why this split the network in half for the Proxmox host and the VM, though. In the bond0 there was only one interface set as "primary"... wouldn't both have used that one? Or did they somehow "split" the two interfaces between each other??
 
