Hi,
I'm experiencing strange behaviour on my PVE cluster with an LXC container.
Context: I have a PVE cluster running on bare metal, version 8.1.3, with SDN networking in place.
I created an LXC container (Ubuntu 22.04) on one host and I'm trying to reach the cluster API using Proxmoxer.
The issue: the PVE API is not reachable on any host, even with curl. I get an SSL timeout. IP and TCP connectivity (ping and netcat) is OK, but TLS is not. Below is an example curl output showing that L3/L4 connectivity works but the TLS handshake hangs:
Bash:
root@Demo-CT:~# curl -v https://10.41.80.5:8006
* Trying 10.41.80.5:8006...
* Connected to 10.41.80.5 (10.41.80.5) port 8006 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* CAfile: /etc/ssl/certs/ca-certificates.crt
* CApath: /etc/ssl/certs
* TLSv1.0 (OUT), TLS header, Certificate Status (22):
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
^C
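For anyone who wants to reproduce this without curl, here is a minimal Python sketch (standard library only) that separates the TCP connect from the TLS handshake. The function name `probe`, the return labels and the 5-second timeout are my own assumptions for illustration; nothing here comes from Proxmoxer. Certificate verification is switched off on purpose, since only the handshake itself is being tested.

```python
import socket
import ssl


def probe(host, port, timeout=5.0):
    """Report which stage fails: 'tcp_failed', 'tls_timeout', 'tls_error' or 'ok'."""
    # Stage 1: plain TCP connect (equivalent to the netcat check).
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "tcp_failed"

    # Stage 2: TLS handshake on the already connected socket.
    # Verification is disabled deliberately: we only care whether the
    # server's handshake messages make it back to us.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with ctx.wrap_socket(sock, server_hostname=host):
            return "ok"
    except socket.timeout:
        # Client Hello was sent but no reply arrived within the timeout.
        return "tls_timeout"
    except OSError:
        # Covers ssl.SSLError and connection resets during the handshake.
        return "tls_error"
    finally:
        sock.close()


if __name__ == "__main__":
    # Address and port taken from the curl example above.
    print(probe("10.41.80.5", 8006))
```

Given the curl output above, I would expect this to report `tls_timeout` when run from the container, but treat that as an expectation rather than a captured result.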
I then looked at tcpdump captures to understand what happens. Captures taken inside the container (source) and on the PVE host (destination) show that the TLS "Server Hello" packet is not forwarded back to the container and gets lost somewhere.
Note: the node hosting the LXC container is different from the node whose API I'm trying to reach, but I get the same issue when I try the local node's API.
Below is a comparison of the pcap captures from source and destination:
IP 10.41.81.55 is the container
IP 10.41.80.5 is the PVE node with API
We can clearly see that the TCP handshake (SYN, SYN-ACK, ACK) is OK and that the "Client Hello" is sent and received, but the "Server Hello" sent by the PVE node is never received on the container side.
I investigated where the packet is lost, and it seems to be dropped after passing the bridge "vmbr0v3312" on the host where the container is running.
I made a small diagram below of the actual PVE node configuration (from what I observed) to show where the packet is dropped (red arrow). I found this by running tcpdump on each of these interfaces in turn until the packet disappeared.
My cluster uses PVE SDN with a zone called "external" of type VLAN on bridge "vmbr0". My container is on a VNet called "OAM" tagged with ID 3312, and my container ID is 240.
I'm now lost as to where to investigate to find out why the "Server Hello" packet is dropped, causing the timeout.
Can anyone help me investigate this?
Don't hesitate to ask questions or to tell me if I forgot some important details or logs.
I tried the same type of request from a VM on the same node and it works, although it uses (almost) the same path. So this seems to be related to LXC containers only. Could it be an issue with the kernel and/or the TLS library?
I also tried with another container (Rocky Linux 9) on another node and the result is the same.
BR,
A.