Hello,
I'm facing some strange, random networking issues where LXCs on my PVE cluster are unable to communicate.
For instance, sometimes 10.0.10.51, which is an LXC, cannot reach 10.0.1.23, which is one of my switches.
When this occurs, I see no traffic at all coming in on the gateway (OPNsense; I made a packet capture and nothing shows up there), meaning the traffic is either not leaving the LXC or not leaving the network bridge. I believe I also tried a packet capture on the PVE host of the LXC and did not see any traffic on vmbr10 either...
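For what it's worth, the capture on the PVE host was roughly along these lines (run while the drop was active; 10.0.10.51 is the LXC, 10.0.1.23 the switch):
Bash:
# on the PVE host: watch the bridge the LXC sits on
tcpdump -ni vmbr10 host 10.0.10.51 and host 10.0.1.23
# and one layer down, on the VLAN interface feeding that bridge
tcpdump -ni vmbr1000.10 host 10.0.10.51 and host 10.0.1.23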
I notice this often thanks to my uptime-kuma instance running on this LXC, and I can't really understand why: there is a timeout (60 s) during which Uptime Kuma is unable to either ping or curl (HTTP) the switch, and without me doing anything it starts working again a few minutes later...
The LXC in question is an Ubuntu jammy container attached with a static IP to vmbr10; the PVE host is running v8.1.3 on kernel 6.5.11-7-pve.
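For completeness, the container's network definition is essentially of this shape (container ID and MAC omitted, so treat the exact line as illustrative):
Bash:
# /etc/pve/lxc/<CTID>.conf (excerpt)
net0: name=eth0,bridge=vmbr10,gw=10.0.10.1,ip=10.0.10.51/24,type=veth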
While this is occurring, I can reproduce it over SSH inside the LXC and confirm communication is indeed down. During the same window I was able to SSH onto my OPNsense gateway and confirm it can ping or curl the switch with no problem, so if OPNsense were receiving the packets from the LXC, it would pass them along correctly...
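The kind of thing I check from inside the LXC while it is down is roughly this:
Bash:
# inside the LXC during an outage
ping -c 3 10.0.10.1        # can I still reach the gateway on the same bridge?
ping -c 3 10.0.1.23        # the switch behind the gateway
ip neigh show 10.0.10.1    # is the gateway's ARP/neighbour entry still REACHABLE?
ip route get 10.0.1.23     # which interface and next hop are actually used?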
Uptime Kuma is running inside Docker inside the LXC, and I believe I have similar issues within Docker networking itself (some requests time out between my traefik instance and the gitea container, for example...), but that seems unrelated since it stays within Docker itself...
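On that side I have only done basic checks so far, something like this (assuming the containers are literally named traefik and gitea and share a user-defined network):
Bash:
docker network ls
docker inspect -f '{{json .NetworkSettings.Networks}}' traefik
docker inspect -f '{{json .NetworkSettings.Networks}}' gitea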
The host is an 8365U, so powerful enough; it sits around 30% CPU usage, with no swapping thanks to the 32 GB of RAM I added. It is quite busy running around 100 containers total, some in LXCs, some in VMs, but overall there is no slowness or anything besides these random network dropouts.
I recently tried to increase ulimit -n 99999 (it was 1024 everywhere), but it doesn't seem to make any difference... Any idea?
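What I changed looks roughly like this (the lxc.prlimit line is my attempt at raising the limit for the container itself, and <CTID> is a placeholder; not sure it is the right knob):
Bash:
# on the PVE host, for the container (placeholder CTID)
echo 'lxc.prlimit.nofile: 99999' >> /etc/pve/lxc/<CTID>.conf
# inside the LXC, for anything going through pam_limits
echo '* soft nofile 99999' >> /etc/security/limits.conf
echo '* hard nofile 99999' >> /etc/security/limits.conf
# verify after a restart / re-login
ulimit -n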
Here is my /etc/network/interfaces:
Bash:
auto lo
iface lo inet loopback

auto enp1s0
iface enp1s0 inet manual
    mtu 9000
#eth0

auto enp2s0
iface enp2s0 inet manual
    mtu 9000
#eth1

auto enp3s0
iface enp3s0 inet manual
    mtu 9000
#eth2

auto enp4s0
iface enp4s0 inet manual
    mtu 9000
#eth3

auto enp5s0
iface enp5s0 inet manual
    mtu 9000
#eth4

auto enp6s0
iface enp6s0 inet manual
    mtu 9000
#eth5

iface enx00e04c534458 inet manual

auto bond1
iface bond1 inet manual
    bond-slaves enp5s0 enp6s0
    bond-miimon 100
    bond-mode balance-xor
    bond-xmit-hash-policy layer3+4
    mtu 9000
#LAGG_WAN

auto bond0
iface bond0 inet manual
    bond-slaves enp1s0 enp2s0 enp3s0 enp4s0
    bond-miimon 100
    bond-mode balance-xor
    bond-xmit-hash-policy layer3+4
    mtu 9000
#LAGG_Switch

auto vmbr1000
iface vmbr1000 inet manual
    bridge-ports bond0
    bridge-stp on
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 1-4094
    mtu 9000
#Bridge All VLANs to SWITCH

auto vmbr2000
iface vmbr2000 inet manual
    bridge-ports bond1
    bridge-stp on
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 1-4094
    mtu 9000
#Bridge WAN

auto vmbr1000.10
iface vmbr1000.10 inet manual
    mtu 9000
#VMs

auto vmbr1000.99
iface vmbr1000.99 inet manual
    mtu 9000
#VMs

auto vmbr10
iface vmbr10 inet static
    address 10.0.10.9/24
    gateway 10.0.10.1
    bridge-ports vmbr1000.10
    bridge-stp off
    bridge-fd 0
    post-up ip rule add from 10.0.10.0/24 table 10Server prio 1
    post-up ip route add default via 10.0.10.1 dev vmbr10 table 10Server
    post-up ip route add 10.0.10.0/24 dev vmbr10 table 10Server
    mtu 9000

auto vmbr99
iface vmbr99 inet static
    address 10.0.99.9/24
    gateway 10.0.99.1
    bridge-ports vmbr1000.99
    bridge-stp off
    bridge-fd 0
    post-up ip rule add from 10.0.99.0/24 table 99Test prio 1
    post-up ip route add default via 10.0.99.1 dev vmbr99 table 99Test
    post-up ip route add 10.0.99.0/24 dev vmbr99 table 99Test
    mtu 9000
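While a drop is happening I also look at the bond and bridge state directly on the host, roughly like this:
Bash:
# bond health: are all slaves up, any recent link flaps?
cat /proc/net/bonding/bond0
# is the LXC's veth still enslaved to vmbr10, and does the bridge know the MACs?
bridge link show | grep vmbr10
bridge fdb show br vmbr10 | head
# VLAN membership on the vlan-aware bridge side
bridge vlan show dev bond0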
I do have the proper tables created, I believe:
Bash:
root@pve:~ # cat /etc/iproute2/rt_tables.d/200_10Server.conf
200 10Server
root@pve:~ # cat /etc/iproute2/rt_tables.d/204_99Test.conf
204 99Test
root@pve:~ #
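And as far as I can tell the rules and routes that end up active match what the post-up lines should create; this is roughly how I check them:
Bash:
ip rule show                    # expect the prio 1 rules for 10.0.10.0/24 and 10.0.99.0/24
ip route show table 10Server    # expect: default via 10.0.10.1 dev vmbr10, plus the connected route
ip route show table 99Test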