[SOLVED] Strange network behaviour with LXC container and SDN on PVE

voriaz

Member
Mar 2, 2022
Hi,

I'm experiencing a strange behaviour on my PVE cluster with an LXC container.

Context: I have a PVE cluster running on bare metal, version 8.1.3, with SDN networking in place.
I created an LXC container (Ubuntu 22.04) on one host and I'm trying to reach the cluster API using Proxmoxer.
The issue: the PVE API is not reachable on any host, even with curl. I get an SSL timeout. IP and TCP connectivity (ping and netcat) is OK, but TLS is not. Below is a curl example output showing that L3/L4 connectivity works but the TLS handshake hangs:
Bash:
root@Demo-CT:~# curl -v https://10.41.80.5:8006
*   Trying 10.41.80.5:8006...
* Connected to 10.41.80.5 (10.41.80.5) port 8006 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.0 (OUT), TLS header, Certificate Status (22):
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
^C
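To make the "TCP works, TLS hangs" observation repeatable, the two layers can be probed separately with timeouts. This is only a rough sketch: the function names (tcp_check, tls_check, classify, probe) are made up for illustration, and it assumes bash, GNU timeout and the openssl CLI are installed in the container.

```shell
#!/usr/bin/env bash
# Sketch: probe TCP and TLS separately so a hang can be pinned to one layer.
# All function names here are hypothetical helpers, not part of any tool.

# TCP check: bash's /dev/tcp pseudo-device opens a plain TCP connection.
tcp_check() {
    timeout 5 bash -c 'exec 3<>/dev/tcp/$1/$2' _ "$1" "$2" 2>/dev/null
}

# TLS check: openssl s_client runs the handshake; stdin is /dev/null so it
# exits right after the handshake. GNU timeout returns 124 if it had to kill it.
tls_check() {
    timeout 5 openssl s_client -connect "$1:$2" </dev/null >/dev/null 2>&1
}

# Turn the two exit codes into a one-line verdict.
classify() {
    local tcp_rc="$1" tls_rc="$2"
    if [ "$tcp_rc" -ne 0 ]; then
        echo "TCP connect failed"
    elif [ "$tls_rc" -eq 124 ]; then
        echo "TCP ok, TLS handshake timed out"
    elif [ "$tls_rc" -ne 0 ]; then
        echo "TCP ok, TLS handshake failed"
    else
        echo "TCP and TLS ok"
    fi
}

# Run both checks against host/port and print the verdict.
probe() {
    local host="$1" port="$2" tcp_rc=0 tls_rc=0
    tcp_check "$host" "$port" || tcp_rc=$?
    if [ "$tcp_rc" -eq 0 ]; then
        tls_check "$host" "$port" || tls_rc=$?
    fi
    classify "$tcp_rc" "$tls_rc"
}

# Example, from inside the container:
#   probe 10.41.80.5 8006
```

If the behaviour matches the curl run above, the probe would be expected to report that TCP is fine and the TLS handshake timed out. Running it against several destinations gives a quick list of which ones are affected.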

I then looked at tcpdump captures to understand what happens. Captures taken inside the container (source) and on the PVE host (destination) show that the TLS "Server Hello" packet is not forwarded back to the container and gets lost somewhere.
Note: the node hosting the LXC is different from the node whose API I'm trying to reach, but the issue is the same if I try the local node's API.
Below is a comparison of the pcap captures from both source and destination:
qUflNAscQa.png
IP 10.41.81.55 is the container
IP 10.41.80.5 is the PVE node with API
We clearly see that the TCP handshake (SYN, SYN-ACK, ACK) is OK, and then the "Client Hello" is sent and received, but the "Server Hello" is sent by the PVE node and never received on the container side.


I did some investigation to find where the packet is lost, and it seems to be "dropped" after passing the bridge "vmbr0v3312" on the host where the container is running.
I made a small diagram below of the actual PVE node configuration (from what I observed) to better show where it's "dropped" (red arrow). I found this by running tcpdump on each of these interfaces one by one to see where the packet disappears.
My cluster config uses PVE SDN with a zone called "external" of type VLAN on bridge "vmbr0". My container is on a VNet called "OAM" tagged with VLAN ID 3312, and my container ID is 240.
Network_Diagram_pve10.drawio.png
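For anyone who wants to repeat the hop-by-hop capture, here is roughly how it can be scripted. Treat it as a sketch with assumptions: build_filter and capture_each are hypothetical helper names, and the interface "veth240i0" is my guess at the host-side name of the container's first NIC (check with "ip link"); only "vmbr0v3312" is confirmed from my setup. It needs root on the node hosting the container.

```shell
#!/usr/bin/env bash
# Sketch: capture the same flow on several interfaces, one after another,
# to see at which hop the packet stops showing up.

# Build a tcpdump filter for one TCP flow between two hosts.
build_filter() {
    local src="$1" dst="$2" port="$3"
    echo "host $src and host $dst and tcp port $port"
}

# Capture up to 20 packets (or 15 seconds) on each interface given.
# -n: no name resolution, -e: print link-level headers (shows VLAN tags).
capture_each() {
    local filter="$1"; shift
    local iface
    for iface in "$@"; do
        echo "=== $iface ==="
        timeout 15 tcpdump -n -e -i "$iface" -c 20 "$filter"
    done
}

# Example (as root on the node, while repeating the curl in the container).
# veth240i0 is an assumed interface name, adjust to what "ip link" shows:
#   capture_each "$(build_filter 10.41.81.55 10.41.80.5 8006)" vmbr0v3312 veth240i0
```

Since the interfaces are captured one after another, the curl request has to be repeated for each of them. The first interface in the return path where the "Server Hello" no longer appears marks the hop to look at.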
I'm now lost on where to look next to find out why the "Server Hello" packet is dropped, causing the timeout.
Can anyone help me investigate this?
Don't hesitate to ask questions, or to tell me if I forgot some important details or logs.

I tried the same type of request from a VM on the same node and it works, although it uses (almost) the same path. This seems to be related to LXC containers only. Could it be an issue with the kernel and/or the TLS library?
I also tried with another container (Rocky Linux 9) on another node and the result is the same.


BR,

A.
 
Hi,

After some time, the issue is still here, and it seems to randomly affect some destinations and not others. It's not related to the PVE API only.
I ran some more tests on another cluster I have, with the exact same PVE version but without SDN, and it works as expected there. So the issue seems to be related to SDN.

Any help on how to debug this?
 
