noVNC/SPICE console disconnects with "SSL routines::record layer failure" on PVE 9.2.3 / OpenSSL 3.5.6

atomic-dz

Jun 30, 2026

Summary


VM (QEMU) consoles via noVNC reliably disconnect after a few seconds, on every QEMU VM, while LXC console (termproxy) works fine. This happens even when connecting from a Windows VM running on the same host, ruling out client-side network/proxy/firewall issues. pveproxy logs show SSL routines::record layer failure at the exact moment of disconnect, and packet capture confirms the server sends the first FIN, with the underlying TLS session failing mid-stream.


Environment


pve-manager: 9.2.3 (running version: 9.2.3/d0fde103346cf89a)
libpve-http-server-perl: 6.0.5
novnc-pve: 1.7.0-1
libnet-ssleay-perl: 1.94-3
libssl3t64: 3.5.6-1~deb13u2
openssl: OpenSSL 3.5.6 7 Apr 2026
machine: pc-i440fx-11.0 (also reproduced on other machine types)

Single standalone node. No reverse proxy, no Cloudflare, no firewall/iptables rules affecting port 8006 (confirmed clean conntrack, no DROP/REJECT rules). Confirmed not MTU/fragmentation related (DF-ping at 1472 and 1400 bytes both succeed cleanly to all tested clients).
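
For anyone repeating the DF-ping check, the arithmetic behind the 1472-byte figure can be sketched as below. This assumes the sizes quoted above are ICMP payload sizes (as passed to ping -s), plain IPv4 without options, and a 1500-byte path MTU; the constant and function names are illustrative only.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options (assumption)
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def ipv4_packet_size(icmp_payload_bytes: int) -> int:
    """Total IPv4 packet size produced by a ping with the given payload size."""
    return icmp_payload_bytes + ICMP_HEADER_BYTES + IPV4_HEADER_BYTES


def fits_without_fragmentation(icmp_payload_bytes: int, mtu: int = 1500) -> bool:
    """True if a DF-flagged ping with this payload fits in one packet at this MTU."""
    return ipv4_packet_size(icmp_payload_bytes) <= mtu
```

A 1472-byte payload gives exactly 1500 bytes on the wire, so a successful DF-ping at that size suggests the full 1500-byte MTU is usable on that path.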


Steps to reproduce


  1. Open any QEMU VM's console (noVNC) from the web UI.
  2. Console connects successfully (HTTP/1.1 101 Switching Protocols, WebSocket upgrade succeeds, VNC handshake completes).
  3. After roughly 1–10 seconds of normal operation, the connection drops with the noVNC error Failed when connecting: Connection closed (code: 1006).
  4. Browser console shows a burst of net::ERR_SSL_PROTOCOL_ERROR across all concurrent requests to port 8006 (API calls, static assets, the websocket itself) — not just the VNC socket.
  5. LXC console (termproxy) on the same host does not exhibit this issue and stays connected normally for the same duration of testing.

Reproduced with:


  • External client over public internet (different ISPs/source IPs)
  • A client browser running on a Windows VM hosted on the same Proxmox node (rules out external network entirely)
  • Multiple different QEMU VMs (different VMIDs, different vga settings tested: default, std, cirrus)

Server-side logs at the moment of disconnect


pveproxy[...]: problem with client ::ffff:<client-ip>; error:0A000139:SSL routines::record layer failure

pveproxy -debug output around the failure shows a clean WebSocket upgrade followed shortly after by client_do_disconnect, with no error printed by the AnyEvent/http-server layer itself — the failure originates lower, in OpenSSL's record layer, not in PVE::APIServer::AnyEvent.
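
To see how often each client hits this error, the log lines can be tallied with a small script. This is only a sketch: it assumes the message has the shape shown above ("problem with client <addr>; error:<code>:<library>:<function>:<reason>"), and the function names are made up for illustration. Any syslog prefix before "problem with client" is ignored.

```python
import ipaddress
import re
from collections import Counter

# Matches the part of the pveproxy message shown above; the prefix is ignored.
FAILURE_RE = re.compile(
    r"problem with client (?P<client>[^;\s]+); "
    r"error:(?P<code>[0-9A-Fa-f]+):(?P<lib>[^:]*):(?P<func>[^:]*):(?P<reason>.*)$"
)


def normalize_client(addr: str) -> str:
    """Turn an IPv4-mapped IPv6 address (::ffff:a.b.c.d) into plain IPv4."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        return addr  # leave anything unparsable untouched
    mapped = getattr(ip, "ipv4_mapped", None)
    return str(mapped) if mapped else str(ip)


def count_ssl_failures(lines):
    """Count (client, reason) pairs across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.search(line.rstrip())
        if match:
            key = (normalize_client(match["client"]), match["reason"].strip())
            counts[key] += 1
    return counts
```

Feeding it the pveproxy journal or syslog lines and comparing the counts per client against the times the console dropped should show whether every disconnect has a matching record layer failure entry.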


Packet capture findings


Captured with tcpdump on the host (tap*, fwln*, fwpr*, vmbr0 chain for the relevant VM) while reproducing from a same-host Windows VM client (65.109.121.94), filtering only that client's traffic (excluding background bot/scanner traffic on the public interface):




server.8006 > client.<port>: Flags [F.] <- server sends FIN first
client.<port> > server.8006: Flags [R.] <- client RSTs in response

This pattern repeats across multiple concurrent TCP connections almost simultaneously (4 separate sessions, ports 54896/54897/54899/54900, all closed by the server within ~1.5 seconds of each other), correlating exactly with the record layer failure log lines. This strongly suggests the failure is not connection-specific but tied to a shared resource/state in the pveproxy worker process or the underlying OpenSSL context at that moment.
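
The "who closed first" check can also be automated over a text capture. The sketch below assumes numeric tcpdump output (tcpdump -nn) with lines of the form "... IP <src>.<port> > <dst>.<port>: Flags [F.], ..." and that the server side is whichever endpoint uses port 8006; the function name and return shape are illustrative, not taken from any tool.

```python
import re

# Matches the address and flag fields of a tcpdump -nn text line for TCP.
PACKET_RE = re.compile(
    r"\bIP6? (?P<src>\S+) > (?P<dst>\S+): Flags \[(?P<flags>[^\]]*)\]"
)


def first_closers(lines, server_port=8006):
    """Map each client endpoint to (side, flags) of the first FIN/RST seen.

    side is "server" or "client"; flags is the raw tcpdump flag string.
    """
    result = {}
    for line in lines:
        match = PACKET_RE.search(line)
        if not match:
            continue
        flags = match["flags"]
        if "F" not in flags and "R" not in flags:
            continue  # not a connection teardown packet
        src_port = match["src"].rsplit(".", 1)[-1]
        from_server = src_port == str(server_port)
        client = match["dst"] if from_server else match["src"]
        # setdefault keeps only the first teardown packet per client endpoint
        result.setdefault(client, ("server" if from_server else "client", flags))
    return result
```

For the capture described above, every one of the four client ports would be expected to map to ("server", "F."), which is what points at the server side initiating the close.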


What has been ruled out


  • Client-side network, VPN/proxy, browser, OS — reproduced from a VM on the same physical host
  • Firewall rules / iptables / Docker conntrack interference — clean conntrack table (376/262144), no relevant DROP/REJECT rules, no NAT rules touching port 8006
  • MTU/fragmentation — DF-ping tests at 1400 and 1472 bytes both succeed without fragmentation
  • pveproxy worker resource exhaustion — file descriptor count (14) and process limits (1024 soft / 524288 hard) far from any limit; memory usage normal
  • Certificate issues — certs valid, pvecm updatecerts --force and service restart did not change behavior
  • VM-specific config (vga type, machine type, cpu type) — reproduced across different VM configurations

Hypothesis


The failure is isolated to noVNC/vncwebsocket sessions (LXC termproxy and plain HTTPS API calls tested with openssl s_client are unaffected) and correlates with the larger, bursty TLS record traffic typical of VNC framebuffer updates. This looks like a possible regression in how libpve-http-server-perl 6.0.5 / Net::SSLeay 1.94-3 interacts with OpenSSL 3.5.6 when handling sustained WebSocket traffic with larger TLS records, possibly related to TLS session ticket rotation or buffer handling under this specific combination of versions.
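
One way to narrow the session-ticket part of this hypothesis would be to compare behaviour with the client pinned to TLS 1.2 versus TLS 1.3, since the two versions deliver session tickets differently. The sketch below is an untested idea rather than a confirmed reproducer: the function names, the 15-second hold time and the once-per-second GET / are all assumptions, and because plain HTTPS requests did not trigger the failure in my tests, this may only serve as a per-version baseline. Certificate verification is disabled because the node uses its own certificate.

```python
import socket
import ssl
import time


def make_pinned_context(version: ssl.TLSVersion) -> ssl.SSLContext:
    """Client context that negotiates exactly one TLS version, without verification."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.check_hostname = False       # must be disabled before CERT_NONE
    ctx.verify_mode = ssl.CERT_NONE  # test-only: the node's cert is not verified
    ctx.minimum_version = version
    ctx.maximum_version = version
    return ctx


def hold_connection(host, port, version, seconds=15.0):
    """Keep one TLS connection busy and report (negotiated_version, outcome)."""
    ctx = make_pinned_context(version)
    with socket.create_connection((host, port), timeout=5) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            negotiated = tls.version()
            request = (
                b"GET / HTTP/1.1\r\nHost: " + host.encode() +
                b"\r\nConnection: keep-alive\r\n\r\n"
            )
            deadline = time.monotonic() + seconds
            try:
                while time.monotonic() < deadline:
                    tls.sendall(request)
                    if not tls.recv(65536):
                        return negotiated, "closed by peer"
                    time.sleep(1)
            except (ssl.SSLError, OSError) as exc:
                return negotiated, f"error: {exc}"
    return negotiated, "ok"
```

Calling hold_connection with the node's address, port 8006, and ssl.TLSVersion.TLSv1_2 and then ssl.TLSVersion.TLSv1_3 while a noVNC session is open in the browser would show whether the simultaneous drops also hit a connection pinned to one protocol version.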


Question for the Proxmox team


  • Is anyone else seeing SSL routines::record layer failure specifically on noVNC/vncwebsocket sessions on PVE 9.2.x with OpenSSL 3.5.6?
  • Is there a known interaction between libpve-http-server-perl 6.0.5 and recent OpenSSL 3.5.x point releases affecting long-lived WebSocket TLS sessions?
  • Any recommended TLS cipher/protocol restriction in /etc/default/pveproxy known to work around this?

Happy to provide full pveproxy -debug output, complete pcap, or run further tests as needed.
 
Additional findings and workaround

After further testing, I can confirm that the SSL routines::record layer failure error and the subsequent noVNC console disconnects are triggered only by direct external TLS connections to pveproxy. The bug completely disappears when an nginx reverse proxy is placed in front of pveproxy and configured to terminate external TLS, while proxying to the local pveproxy over HTTPS on the loopback interface.

Workaround details:

  1. Configure pveproxy to listen only on localhost by setting LISTEN_IP="127.0.0.1" in /etc/default/pveproxy and restarting the service.
  2. Install nginx and use it to handle all external TLS traffic, forwarding requests to https://127.0.0.1:8006 with proxy_ssl_verify off.

Example nginx configuration:

Code:
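server {
    # "server" stands for this node's external hostname or IP.
    # pveproxy must be bound to 127.0.0.1 only (step 1) so nginx can use port 8006 here.
    listen server:8006 ssl;
    server_name server;

    # Reuse the node's existing PVE certificate for the external TLS endpoint.
    ssl_certificate /etc/pve/local/pve-ssl.pem;
    ssl_certificate_key /etc/pve/local/pve-ssl.key;

    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_prefer_server_ciphers on;
    ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384;

    # Long timeouts so quiet console sessions are not cut off by nginx.
    proxy_read_timeout 3600s;
    proxy_send_timeout 3600s;
    proxy_connect_timeout 30s;

    location / {
        proxy_pass https://127.0.0.1:8006;
        # HTTP/1.1 plus Upgrade/Connection headers are needed for the WebSocket upgrade.
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        # pveproxy's certificate is not verified on the loopback hop.
        proxy_ssl_verify off;
    }
}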

With this setup, all noVNC and SPICE console sessions remain stable indefinitely, and no record layer failure errors are logged. External clients connect to nginx (port 8006), which decrypts TLS and then establishes a new local TLS connection to pveproxy via the loopback interface.
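
A quick sanity check after applying the workaround is to confirm which addresses are actually listening on port 8006. The helper below is a rough sketch that parses the text output of ss -tln (State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port columns); the function names are illustrative and the parsing assumes that column layout.

```python
import ipaddress


def listeners_on_port(ss_output: str, port: int):
    """Return the local addresses in LISTEN state on the given port."""
    addrs = []
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] != "LISTEN":
            continue  # skips the header line and anything malformed
        addr, _, local_port = fields[3].rpartition(":")
        if local_port == str(port):
            addrs.append(addr.strip("[]"))  # drop brackets around IPv6
    return addrs


def is_loopback(addr: str) -> bool:
    """True only for literal loopback addresses; wildcards count as exposed."""
    try:
        return ipaddress.ip_address(addr.split("%")[0]).is_loopback
    except ValueError:
        return False  # e.g. "*" means all interfaces


def exposed_listeners(ss_output: str, port: int):
    """Listening addresses on the port that are not loopback."""
    return [a for a in listeners_on_port(ss_output, port) if not is_loopback(a)]
```

With the workaround in place, the output of ss -tln passed to listeners_on_port should show 127.0.0.1 (pveproxy) plus the external address nginx is bound to, and exposed_listeners should list only the nginx address. A wildcard entry there would mean pveproxy is still reachable directly.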

Interpretation:
The fact that the error never occurs over loopback — even when pveproxy is still processing TLS (just locally) — strongly suggests that the OpenSSL/Net::SSLeay bug depends on certain characteristics of the underlying TCP connection that differ between external network paths and the loopback interface (e.g., MSS, segmentation/fragmentation behaviour, timing of TLS record delivery). External connections, including those from a VM on the same host traversing a bridge, reproduce the bug; localhost connections do not.

This workaround eliminates the problem entirely until a proper fix is available in pveproxy or its dependencies. It also provides a reliable way to isolate the issue during further debugging.
 
Good catch filing the bug. Small note for anyone copying the workaround: the example needs a semicolon after `proxy_pass https://127.0.0.1:8006`. Also, if this node is reachable from the internet, I would still put an allowlist/VPN in front of nginx rather than just moving the public TLS endpoint from pveproxy to nginx. That keeps the workaround from accidentally becoming a broader exposure of the management UI while the record-layer issue is being tracked.
 
Thank you, edited, added semicolon there.