Network traffic round trip fails across some VMs

aaron2

I have a rather complicated home setup, so I'll try to give enough detail up front.

The problem is with vm105:
100 (cloud) -- runs Caddy as a reverse proxy in Docker and forwards traffic to other Docker containers on itself, and to other VMs and LXC containers across vlan15 and vlan20
105 (haos) -- Home Assistant; Caddy on vm100 gets no response from it, although from vm100 I can ping vm105 and I see ARP and multicast traffic coming from it

All VMs sit on vmbr0, which is VLAN-aware with all of my VLANs enabled; each VLAN terminates on the Proxmox network device of the individual VM.
I also have vmbr1, which some VMs use to communicate without relying on a physical network device.

PVE VM configs:
vm100:
root@bedrock:~# cat /etc/pve/qemu-server/100.conf
name: cloud
net0: virtio=BC:24:11:3A:C7:69,bridge=vmbr0,tag=20
net1: virtio=BC:24:11:32:00:07,bridge=vmbr0,tag=15

vm105:
root@bedrock:~# cat /etc/pve/qemu-server/105.conf
name: haos
net0: virtio=02:8C:44:C6:3F:9C,bridge=vmbr0,tag=20

PVE network config:
root@bedrock:~# cat /etc/network/interfaces
...
iface vmbr0 inet manual
        bridge-ports enp3s0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 10 15 20 25 30 50
#VLAN Bridge for Guests

auto vmbr1
iface vmbr1 inet static
        address 192.168.1.1/24
        bridge-ports none
        bridge-stp off
        bridge-fd 0

PVE bridge setup:
root@bedrock:~# brctl show
bridge name     bridge id               STP enabled     interfaces
...
vmbr0           8000.408d5c780643       no              enp3s0
                                                        fwpr101p0
                                                        fwpr103p0
                                                        fwpr104p0
                                                        fwpr104p1
                                                        fwpr200p0
                                                        fwpr200p1
                                                        fwpr200p2
                                                        fwpr201p0
                                                        fwpr202p0
                                                        fwpr204p0
                                                        tap100i0
                                                        tap100i1
                                                        tap105i0
...

PVE vlans:
root@bedrock:~# bridge vlan show
port              vlan-id
enp3s0            1 PVID Egress Untagged
                  10
                  15
                  20
                  25
                  30
                  50
vmbr0             1 PVID Egress Untagged
vmbr1             1 PVID Egress Untagged
...
tap100i0          20 PVID Egress Untagged
tap100i1          15 PVID Egress Untagged
...
tap105i0          20 PVID Egress Untagged

I usually have the firewall enabled for all guests, but I disabled it on vm100 and vm105 for testing.
vm100 has IPs 10.15.1.220 and 10.20.1.220
vm105 has IP 10.20.1.228

Ping works:
root@cloud:~# tcpdump -nnvvS -i enp6s18 host 10.20.1.228
tcpdump: listening on enp6s18, link-type EN10MB (Ethernet), snapshot length 262144 bytes
09:10:03.687792 IP (tos 0x0, ttl 64, id 38461, offset 0, flags [DF], proto ICMP (1), length 84)
10.20.1.220 > 10.20.1.228: ICMP echo request, id 4, seq 1, length 64
09:10:03.688111 IP (tos 0x0, ttl 63, id 57101, offset 0, flags [none], proto ICMP (1), length 84)
10.20.1.228 > 10.20.1.220: ICMP echo reply, id 4, seq 1, length 64

But if I run "wget http://10.20.1.228:8123" from a separate terminal on vm100, I get no response.
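Since "no response" could mean either a timeout or an active refusal, here is a rough sketch of a probe that tells the two apart. The function name `probe_tcp` and the 3-second default timeout are my own choices for illustration, nothing Proxmox or Home Assistant specific.

```python
import socket

def probe_tcp(host, port, timeout=3.0):
    """Try one TCP connect and report how it ended.

    Returns 'open', 'refused', 'timeout', or 'error: <reason>'.
    """
    try:
        # create_connection performs the full three-way handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer (or something in between) answered with a reset.
        return "refused"
    except socket.timeout:
        # No usable answer arrived before the timeout expired.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. no route to host.
        return f"error: {exc}"
```

Running `probe_tcp("10.20.1.228", 8123)` on vm100 would be the equivalent of the wget test above; a "timeout" result would fit with the reply packets being lost somewhere on the way back, whereas "refused" would point at the service itself.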

I know the traffic leaves vm100 and reaches vm105, and that the response comes back, because all of it is visible on vmbr0:
root@bedrock:~# tcpdump -i vmbr0 host 10.20.1.228 and host 10.20.1.220
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on vmbr0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
09:17:56.342721 IP 10.20.1.220.47944 > 10.20.1.228.8123: Flags [S], seq 3301836575, win 64240, options [mss 1460,sackOK,TS val 3035091219 ecr 0,nop,wscale 7], length 0
09:17:56.342867 IP 10.20.1.228.8123 > 10.20.1.220.47944: Flags [S.], seq 269643130, ack 3301836576, win 65160, options [mss 1460,sackOK,TS val 2921087751 ecr 3035091219,nop,wscale 7], length 0
09:17:57.359655 IP 10.20.1.228.8123 > 10.20.1.220.47944: Flags [S.], seq 269643130, ack 3301836576, win 65160, options [mss 1460,sackOK,TS val 2921088768 ecr 3035091219,nop,wscale 7], length 0
09:17:57.374254 IP 10.20.1.220.47944 > 10.20.1.228.8123: Flags [S], seq 3301836575, win 64240, options [mss 1460,sackOK,TS val 3035092251 ecr 0,nop,wscale 7], length 0
09:17:57.374390 IP 10.20.1.228.8123 > 10.20.1.220.47944: Flags [S.], seq 269643130, ack 3301836576, win 65160, options [mss 1460,sackOK,TS val 2921088782 ecr 3035091219,nop,wscale 7], length 0
09:17:59.407710 IP 10.20.1.228.8123 > 10.20.1.220.47944: Flags [S.], seq 269643130, ack 3301836576, win 65160, options [mss 1460,sackOK,TS val 2921090816 ecr 3035091219,nop,wscale 7], length 0
09:18:01.374173 ARP, Request who-has 10.20.1.228 tell 10.20.1.220, length 28
09:18:01.374313 ARP, Reply 10.20.1.228 is-at 02:8c:44:c6:3f:9c (oui Unknown), length 28
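The capture above looks like a stalled three-way handshake: the SYN from vm100 and the SYN-ACK from vm105 are both retransmitted, and no final ACK from vm100 ever shows up. Here is a minimal sketch for tallying that from tcpdump's default text output. The name `summarize_handshake` and the regular expression are only illustrative, and the regex assumes the line format seen in this capture.

```python
import re
from collections import Counter

# Matches the "IP src > dst: Flags [X]" part of a default tcpdump text line.
FLAG_RE = re.compile(r"IP (\S+) > (\S+): Flags \[([^\]]+)\]")

def summarize_handshake(lines):
    """Count (src, dst, flags) combinations in tcpdump text lines."""
    counts = Counter()
    for line in lines:
        match = FLAG_RE.search(line)
        if match:  # non-TCP lines such as ARP are skipped
            counts[match.groups()] += 1
    return counts

# The six TCP lines from the vmbr0 capture, copied as-is.
CAPTURE = [
    "09:17:56.342721 IP 10.20.1.220.47944 > 10.20.1.228.8123: Flags [S], seq 3301836575, win 64240, options [mss 1460,sackOK,TS val 3035091219 ecr 0,nop,wscale 7], length 0",
    "09:17:56.342867 IP 10.20.1.228.8123 > 10.20.1.220.47944: Flags [S.], seq 269643130, ack 3301836576, win 65160, options [mss 1460,sackOK,TS val 2921087751 ecr 3035091219,nop,wscale 7], length 0",
    "09:17:57.359655 IP 10.20.1.228.8123 > 10.20.1.220.47944: Flags [S.], seq 269643130, ack 3301836576, win 65160, options [mss 1460,sackOK,TS val 2921088768 ecr 3035091219,nop,wscale 7], length 0",
    "09:17:57.374254 IP 10.20.1.220.47944 > 10.20.1.228.8123: Flags [S], seq 3301836575, win 64240, options [mss 1460,sackOK,TS val 3035092251 ecr 0,nop,wscale 7], length 0",
    "09:17:57.374390 IP 10.20.1.228.8123 > 10.20.1.220.47944: Flags [S.], seq 269643130, ack 3301836576, win 65160, options [mss 1460,sackOK,TS val 2921088782 ecr 3035091219,nop,wscale 7], length 0",
    "09:17:59.407710 IP 10.20.1.228.8123 > 10.20.1.220.47944: Flags [S.], seq 269643130, ack 3301836576, win 65160, options [mss 1460,sackOK,TS val 2921090816 ecr 3035091219,nop,wscale 7], length 0",
]

for (src, dst, flags), n in sorted(summarize_handshake(CAPTURE).items()):
    print(f"{src} -> {dst} [{flags}] x{n}")
```

For these lines it counts two SYNs from 10.20.1.220 and four SYN-ACKs from 10.20.1.228, with no bare ACK in either direction, which is what I'd expect if vm100 never receives the SYN-ACK.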

But the return traffic never seems to reach vm100's tap device:
root@bedrock:~# tcpdump -i tap100i0 src 10.20.1.228
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on tap100i0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
09:18:01.374310 ARP, Reply 10.20.1.228 is-at 02:8c:44:c6:3f:9c (oui Unknown), length 28
09:18:23.075582 IP 10.20.1.228.33981 > 239.255.255.250.1900: UDP, length 324
The only traffic that shows up is some ARP and multicast coming from vm105.


As a test, I did the same thing in reverse, and I believe the result is the same.
From vm105, running "curl http://10.20.1.220":
root@bedrock:~# tcpdump -i vmbr0 host 10.20.1.228 and host 10.20.1.220
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on vmbr0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
09:27:51.495515 IP 10.20.1.228.54682 > 10.20.1.220.http: Flags [S], seq 3539920993, win 64240, options [mss 1460,sackOK,TS val 2921682903 ecr 0,nop,wscale 7], length 0
09:27:51.495701 IP 10.20.1.228.54682 > 10.20.1.220.http: Flags [S], seq 3539920993, win 64240, options [mss 1460,sackOK,TS val 2921682903 ecr 0,nop,wscale 7], length 0
09:27:51.495895 IP 10.20.1.220.http > 10.20.1.228.54682: Flags [S.], seq 3076438070, ack 3539920994, win 65160, options [mss 1460,sackOK,TS val 1513592998 ecr 2921682903,nop,wscale 7], length 0
09:27:52.495805 IP 10.20.1.228.54682 > 10.20.1.220.http: Flags [S], seq 3539920993, win 64240, options [mss 1460,sackOK,TS val 2921683904 ecr 0,nop,wscale 7], length 0
Nothing ever arrives on tap105i0.

I also have vlan10, where my PCs reside. I can open http://10.20.1.228:8123 from 10.10.1.1 and it works fine.
So my conclusion is that *something* is blocking return traffic between the bridge and some of the tap devices.
 
I've tested using vmbr1 to communicate instead of vmbr0, and confusingly it works.

Updated PVE VM configs:
vm100:
name: cloud
net0: virtio=BC:24:11:3A:C7:69,bridge=vmbr0,tag=20
net1: virtio=BC:24:11:32:00:07,bridge=vmbr0,tag=15
net2: virtio=BC:24:11:6E:3F:C7,bridge=vmbr1
net2 has IP 192.168.1.17

vm105:
name: haos
net0: virtio=02:8C:44:C6:3F:9C,bridge=vmbr0,tag=20
net1: virtio=BC:24:11:42:F1:50,bridge=vmbr1
net1 has IP 192.168.1.18

Updated PVE bridge setup:
root@bedrock:~# brctl show
...
vmbr1           8000.622663bf0b5e       no              fwpr103p1
                                                        tap100i2
                                                        tap105i1

Success!
root@cloud:~# wget http://192.168.1.18:8123
--2025-04-09 09:48:46-- http://192.168.1.18:8123/
Connecting to 192.168.1.18:8123... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5478 (5.3K) [text/html]
Saving to: ‘index.html’
index.html 100%[===============================>] 5.35K --.-KB/s in 0s
2025-04-09 09:48:46 (569 MB/s) - ‘index.html’ saved [5478/5478]

The only relevant difference I can see is that vmbr1 doesn't use VLANs.

Here's my full bridge vlan output. As you can see, many other guests use these same VLANs.
root@bedrock:~# bridge vlan show
port              vlan-id
enp3s0            1 PVID Egress Untagged
                  10
                  15
                  20
                  25
                  30
                  50
vmbr0             1 PVID Egress Untagged
vmbr1             1 PVID Egress Untagged
veth200i0         1 PVID Egress Untagged
fwbr200i0         1 PVID Egress Untagged
fwpr200p0         10 PVID Egress Untagged
fwln200i0         1 PVID Egress Untagged
veth200i1         1 PVID Egress Untagged
fwbr200i1         1 PVID Egress Untagged
fwpr200p1         20 PVID Egress Untagged
fwln200i1         1 PVID Egress Untagged
veth200i2         1 PVID Egress Untagged
fwbr200i2         1 PVID Egress Untagged
fwpr200p2         50 PVID Egress Untagged
fwln200i2         1 PVID Egress Untagged
tap100i0          20 PVID Egress Untagged
tap100i1          15 PVID Egress Untagged
tap101i0          1 PVID Egress Untagged
fwbr101i0         1 PVID Egress Untagged
fwpr101p0         20 PVID Egress Untagged
fwln101i0         1 PVID Egress Untagged
tap103i0          1 PVID Egress Untagged
fwbr103i0         1 PVID Egress Untagged
fwpr103p0         20 PVID Egress Untagged
fwln103i0         1 PVID Egress Untagged
tap103i1          1 PVID Egress Untagged
fwbr103i1         1 PVID Egress Untagged
fwpr103p1         1 PVID Egress Untagged
fwln103i1         1 PVID Egress Untagged
tap104i0          1 PVID Egress Untagged
fwbr104i0         1 PVID Egress Untagged
fwpr104p0         15 PVID Egress Untagged
fwln104i0         1 PVID Egress Untagged
tap104i1          1 PVID Egress Untagged
fwbr104i1         1 PVID Egress Untagged
fwpr104p1         25 PVID Egress Untagged
fwln104i1         1 PVID Egress Untagged
veth201i0         1 PVID Egress Untagged
fwbr201i0         1 PVID Egress Untagged
fwpr201p0         20 PVID Egress Untagged
fwln201i0         1 PVID Egress Untagged
veth202i0         1 PVID Egress Untagged
fwbr202i0         1 PVID Egress Untagged
fwpr202p0         15 PVID Egress Untagged
fwln202i0         1 PVID Egress Untagged
veth204i0         1 PVID Egress Untagged
fwbr204i0         1 PVID Egress Untagged
fwpr204p0         20 PVID Egress Untagged
fwln204i0         1 PVID Egress Untagged
tap105i0          20 PVID Egress Untagged
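To compare which ports share a PVID without reading the whole listing by eye, here is a small sketch that groups ports by PVID. It assumes the JSON form of the same command (`bridge -j vlan show`) yields a list of objects with `ifname` and `vlans` keys, which is my understanding of the iproute2 output and should be checked against the installed version. The name `ports_by_pvid` is made up, and the sample is hand-written from a few ports in the listing above, not real command output.

```python
import json
from collections import defaultdict

def ports_by_pvid(bridge_json):
    """Group port names by their PVID.

    Expects JSON text shaped like (assumed) `bridge -j vlan show` output:
    [{"ifname": ..., "vlans": [{"vlan": N, "flags": [...]}, ...]}, ...]
    """
    grouped = defaultdict(list)
    for port in json.loads(bridge_json):
        for vlan in port.get("vlans", []):
            # Only the entry flagged PVID is the port's untagged VLAN.
            if "PVID" in vlan.get("flags", []):
                grouped[vlan["vlan"]].append(port["ifname"])
    return dict(grouped)

# Hand-written sample mirroring a few ports from the listing above.
SAMPLE = """
[
  {"ifname": "enp3s0", "vlans": [
    {"vlan": 1, "flags": ["PVID", "Egress Untagged"]},
    {"vlan": 10}, {"vlan": 15}, {"vlan": 20},
    {"vlan": 25}, {"vlan": 30}, {"vlan": 50}]},
  {"ifname": "tap100i0", "vlans": [{"vlan": 20, "flags": ["PVID", "Egress Untagged"]}]},
  {"ifname": "tap100i1", "vlans": [{"vlan": 15, "flags": ["PVID", "Egress Untagged"]}]},
  {"ifname": "fwpr101p0", "vlans": [{"vlan": 20, "flags": ["PVID", "Egress Untagged"]}]},
  {"ifname": "tap105i0", "vlans": [{"vlan": 20, "flags": ["PVID", "Egress Untagged"]}]}
]
"""

for pvid, ports in sorted(ports_by_pvid(SAMPLE).items()):
    print(pvid, ports)
```

In the sample, tap100i0, fwpr101p0 and tap105i0 all end up under PVID 20, matching what the text listing shows: both affected tap devices are configured the same way as the ports of guests that work.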
 
Could anyone please help with this? The problem seems to be related to vlan20, but why only this VM is affected is beyond me.