[SOLVED] Strange network behaviour with LXC container and SDN on PVE

voriaz · Aug 6, 2024

Hi,

I'm experiencing strange behaviour on my PVE cluster with an LXC container.

Context: I have a PVE cluster (version 8.1.3) running on bare metal with SDN networking in place.
I created an LXC container (Ubuntu 22.04) on one host and I'm trying to reach the cluster API using Proxmoxer.
The issue: the PVE API is not reachable on any host, even with curl. I get an SSL timeout. IP and TCP connectivity (ping and netcat) is OK, but TLS is not. Below is example curl output showing that L3/L4 connectivity works but the TLS handshake hangs:

Bash:

root@Demo-CT:~# curl -v https://10.41.80.5:8006
*   Trying 10.41.80.5:8006...
* Connected to 10.41.80.5 (10.41.80.5) port 8006 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.0 (OUT), TLS header, Certificate Status (22):
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
^C

I then looked at tcpdump captures to understand what happens. Captures taken inside the container (source) and on the PVE host (destination) show that the TLS "Server Hello" packet is not forwarded back to the container and gets lost somewhere.
Note: The node hosting the LXC container is different from the node whose API I'm trying to reach, but the issue is the same if I try the local node's API.
Below is a comparison of the pcap captures from both source and destination:

IP 10.41.81.55 is the container
IP 10.41.80.5 is the PVE node with API
We clearly see that the TCP handshake (SYN, SYN-ACK, ACK) is OK and that the "Client Hello" is sent and received, but the "Server Hello" is sent by the PVE node and never received on the container side.

I investigated to find where the packet is lost, and it seems to be "dropped" after passing the bridge "vmbr0v3312" on the host where the container is running.
I made a small diagram below of the actual PVE node configuration (from what I observed) to better show where it's "dropped" (red arrow). I found this by running tcpdump on each of these interfaces one by one to see where the packet disappears.
My cluster config uses PVE SDN with a zone called "external" of type VLAN on bridge "vmbr0". My container is on a VNet called "OAM" tagged with ID 3312, and my container ID is 240.
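
For anyone who wants to repeat the interface-by-interface capture described above, here is a minimal sketch. It only prints one tcpdump command per interface so that they can be run one at a time. The function names capture_cmd and capture_plan are made up for illustration, and the interface list in the usage line is an assumption: vmbr0v3312 comes from this thread, veth240i0 follows the usual PVE naming for the first NIC of container 240, and the other hops on a given node may be named differently.

```shell
# Print a tcpdump command that captures traffic to/from one peer on one
# TCP port, for the given interface.
#   $1 = interface name, $2 = peer IP, $3 = TCP port
capture_cmd() {
    printf 'tcpdump -ni %s host %s and tcp port %s\n' "$1" "$2" "$3"
}

# Print one capture command per interface, in the order given.
#   $1 = peer IP, $2 = TCP port, remaining arguments = interface names
capture_plan() {
    peer=$1
    port=$2
    shift 2
    for iface in "$@"; do
        capture_cmd "$iface" "$peer" "$port"
    done
}

# Usage (interface list is an assumption, adjust it to what `ip link` shows):
#   capture_plan 10.41.80.5 8006 veth240i0 vmbr0v3312
```

Run each printed command in turn while repeating the curl request. The first interface on which the reply from 10.41.80.5 no longer appears is the hop to look at. On interfaces that carry tagged frames, the filter may need a leading `vlan 3312 and` to match.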

I'm now lost on where to investigate to find out why the "Server Hello" packet is dropped, causing the timeout.
Can anyone help me investigate this?
Don't hesitate to ask questions or to tell me if I forgot to give some important details/logs.

I tried the same type of request with a VM on the same node and it works, although it uses (almost) the same path. This seems to be related to LXC containers only. Probably an issue with the kernel and/or the TLS library?
I also tried with another container (Rocky Linux 9) on another node and the result is the same.

BR,

A.

voriaz · Aug 30, 2024

Hi,

After some time, the issue is still here, and it seems to randomly affect some destinations and not others... It's not related to the PVE API only.
I ran some more tests on another cluster I have with the exact same PVE version but without SDN, and it works as expected. It seems the issue is related to SDN.

Any help on how to debug this?

voriaz · Aug 30, 2024

After testing again, I realized that the dropped packets were bigger than the traditional default MTU of 1500.
In fact, the MTU of the veth interfaces ln_<net> and pr_<net> was set to 1500, although all the other interfaces were set to 9000.
This is a known bug, referenced here: https://bugzilla.proxmox.com/show_bug.cgi?id=5324
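
A quick way to spot this kind of mismatch on a node, and to confirm it from inside the container, is sketched below. The function names (list_mtus, mtu_mismatches, icmp_payload_for_mtu) are invented for illustration, and 9000 is simply the MTU used on this cluster. The payload arithmetic assumes IPv4 without options: 20 bytes of IP header plus 8 bytes of ICMP header.

```shell
# Print "<interface> <mtu>" for every network interface on this host,
# read from sysfs.
list_mtus() {
    for dev in /sys/class/net/*; do
        if [ -r "$dev/mtu" ]; then
            printf '%s %s\n' "${dev##*/}" "$(cat "$dev/mtu")"
        fi
    done
}

# Read "<interface> <mtu>" pairs on stdin and print only those whose MTU
# differs from the expected value passed as $1.
mtu_mismatches() {
    awk -v want="$1" '$2 != want { print $1, $2 }'
}

# Largest ICMP echo payload that fits in one IPv4 packet of MTU $1
# (20-byte IPv4 header without options + 8-byte ICMP header).
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Usage on the node hosting the container (lo is listed too, ignore it):
#   list_mtus | mtu_mismatches 9000
# Usage inside the container, with "don't fragment" set (iputils ping):
#   ping -M do -c 3 -s "$(icmp_payload_for_mtu 1500)" 10.41.80.5
#   ping -M do -c 3 -s "$(icmp_payload_for_mtu 9000)" 10.41.80.5
```

If the 1472-byte probe gets replies while the 8972-byte probe does not, some hop on the path is most likely limited to an MTU below 9000. That would match the symptom in this thread: the small handshake packets got through, but the larger "Server Hello" did not.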
