I am looking into a move from VMware to Proxmox and am stumped by what I believe to be a bug that completely breaks my automation tooling when VLANs are configured for the management interface.
When the Proxmox host has management IPs on VLAN interfaces (either VLAN-aware bridge or traditional VLAN interfaces), certain applications fail while others work normally:
FAILS:
* Ansible (Python module execution)
* Proxmox API (HTTPS requests timeout)
* Terraform (relies on Proxmox API)
WORKS:
* SFTP/SCP file transfers
* SSH (authentication, simple commands, even manual piping)
* Ping
* General network connectivity including web UI
The connection establishes successfully (SSH auth works, TLS handshake completes), but then specific operations time out:
Ansible:
Code:
fatal: [host]: UNREACHABLE! =>
msg: 'Data could not be sent to remote host'
For comparison:
Code:
ansible host -m raw -a "echo test" # SUCCESS
ansible host -m ping # FAILS (requires Python module transfer)
Proxmox API: Hangs/times out after TLS handshake completes
Code:
curl -k https://10.83.2.40:8006/api2/json/version
But a manual SCP transfer works fine:
Code:
scp file.txt root@10.83.2.40:/tmp/
SSH Debug Output (from Ansible):
Code:
debug2: channel 2: read failed rfd 6 maxlen 32768: Broken pipe
Read from remote host 10.83.2.40: Operation timed out
client_loop: send disconnect: Broken pipe
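For anyone who wants to reproduce this without Ansible, my read of the capture further down is that any command forcing the host to send a larger reply over SSH should trigger it (a rough sketch, not something I've exhaustively verified):
Code:
# small reply from the host - works
ssh root@10.83.2.40 'echo ok'

# force the host to send a larger payload back over the same path -
# this is the kind of transfer that appears to stall in my setup
ssh root@10.83.2.40 'head -c 200000 /dev/urandom | base64' | wc -c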
Configurations Tested (ALL FAIL with Ansible/API)
VLAN-aware bridge:
Code:
auto vmbr0
iface vmbr0 inet manual
    bridge-ports nic1
    bridge-vlan-aware yes
    bridge-vids 2-4094

auto vmbr0.2
iface vmbr0.2 inet static
    address 10.83.2.40/24
Traditional VLAN interfaces:
Code:
auto nic1.2
iface nic1.2 inet static
    address 10.83.2.40/24

auto vmbr0
iface vmbr0 inet manual
    bridge-ports nic1
Forum-recommended config with custom names:
Code:
auto mgmt
iface mgmt inet static
    address 10.83.2.40/24
    vlan-id 2
    vlan-raw-device vmbr0
All produce the same failures.
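For anyone wanting to double-check the VLAN plumbing itself, it can be inspected on the host like this (vmbr0.2 or nic1.2 depending on which of the configs above is active):
Code:
# VLAN membership on the VLAN-aware bridge
bridge vlan show

# details of the management sub-interface
ip -d link show vmbr0.2    # VLAN-aware bridge config
ip -d link show nic1.2     # traditional VLAN interface config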
Simple bridge (NO VLANs) - WORKS:
Code:
auto vmbr0
iface vmbr0 inet static
    address 10.83.2.40/24
    bridge-ports nic1
Note: the above is still on a VLAN at the switch side; it's just on an access port/native VLAN.
Testing:
* **Kernel 6.17.9-1-pve:** Bug present
* **Kernel 6.17.2-1-pve:** Bug present
* **Hardware offloading:** Disabled TSO/GSO/GRO (ethtool sketch below) - no change
* **Multiple hosts:** Reproduced on different Proxmox hosts with different NICs - same result
* **Manual SSH piping:** Works fine (can pipe Python scripts and execute with sudo)
* **SFTP transfers:** Work perfectly
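The offloads were toggled roughly like this, for completeness (nic1 being the physical NIC carrying the VLANs):
Code:
# disable TCP segmentation, generic segmentation and generic receive offload
ethtool -K nic1 tso off gso off gro off

# confirm the current state
ethtool -k nic1 | grep -E 'tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload'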
The only workaround so far is to avoid VLAN trunking for host management altogether, which really shouldn't be necessary.
I found the following threads, which seemed to describe the same issue, but unfortunately they did not resolve it for me:
* VLAN-aware configuration kills TCP handshake
* Connections to PVE in VLAN timeout
Packet Capture
Packet captures comparing the working and non-working setups, using the same Ansible commands and the same IP; both captures were taken on the PVE host itself.
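For anyone wanting to repeat the capture, something along these lines on the host shows the same picture (the interface depends on which config is active):
Code:
# no-VLAN (simple bridge) case
tcpdump -ni vmbr0 'tcp port 22'

# VLAN case
tcpdump -ni vmbr0.2 'tcp port 22'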
Working configuration (no VLANs):
Code:
10.83.2.40.22 > client: Flags [P.], seq [...], length 132 < Server sends data
client > 10.83.2.40.22: Flags [.], ack 132 < Client ACKs data
Failing configuration (with VLANs):
Code:
client > 10.83.2.40.22: Flags [.], ack 1, length 0 < Only ACKs, no data
client > 10.83.2.40.22: Flags [.], ack 1, length 0
client > 10.83.2.40.22: Flags [.], ack 1, length 0
client > 10.83.2.40.22: Flags [.], ack 1, length 0
With VLANs configured, the Proxmox host never seems to send any outbound data packets. The TCP connection establishes successfully (SYN/ACK works) and ACK packets flow in both directions, but the actual data payloads from the server appear to be dropped or blocked by the kernel.
This would probably explain why (a quick size check is sketched below):
* Small-packet operations work (SSH auth, ping)
* Large data transfers fail (Ansible module transfer, API responses)
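The size theory can be sanity-checked from the client side with something like this (10.83.2.40 being the host's management IP on the VLAN):
Code:
# small payload with DF set
ping -c 3 -M do -s 100 10.83.2.40

# near-MTU payload with DF set - shows whether full-size frames make it across the VLAN path
ping -c 3 -M do -s 1472 10.83.2.40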
Environment:
Test Host 1:
Code:
Proxmox VE: 9.1.5
Kernel: 6.17.2-1-pve and 6.17.9-1-pve
NIC: Intel I226-V (2.5GbE) - igc driver
Test Host 2:
Code:
Proxmox VE: 9.1.5
Kernel: 6.17.2-1-pve and 6.17.9-1-pve
NIC: Intel x710 (10GbE)
I'm pretty lost now, so I'm hoping someone here knows more than I do and can point me in the right direction.
At the moment this looks like some kind of kernel bug where outbound TCP data packets are silently dropped on VLAN interfaces (both VLAN-aware bridges and traditional VLAN interfaces), while connection-management packets (SYN, ACK, FIN) pass through normally, though I would be very happy to hear I am just being an idiot.