Subject: Intermittent backup errors and repeated TLS handshake failures

Jly4

New Member
Aug 21, 2025
I am running PBS as a VM on the same Proxmox host where I back up LXC containers and VMs.
Backups usually work, but sometimes they fail with errors like:

Bash:
catalog upload error - pipelined request failed: connection closed because of a broken pipe
Error: error at "usr/include/xercesc/framework/psvi"
Caused by: sending on a closed channel
--or--
catalog upload error - channel closed
Error: pipelined request failed: connection closed because of a broken pipe
This happens intermittently and usually only for certain containers, while other containers back up successfully before and after.


Additionally, PBS is accessed through an Nginx reverse proxy running in a separate LXC container on the same Proxmox host.
In journalctl -f -b I constantly see the following messages:

C-like:
Aug 21 17:31:58 pbs proxmox-backup-proxy[826]: [[::ffff:172.16.100.65]:54132] failed to check for TLS handshake: couldn't peek into incoming TCP stream
Aug 21 17:32:28 pbs proxmox-backup-proxy[826]: [[::ffff:172.16.100.65]:33634] failed to check for TLS handshake: couldn't peek into incoming TCP stream
Aug 21 17:32:58 pbs proxmox-backup-proxy[826]: [[::ffff:172.16.100.65]:43834] failed to check for TLS handshake: couldn't peek into incoming TCP stream
Aug 21 17:33:28 pbs proxmox-backup-proxy[826]: [[::ffff:172.16.100.65]:34628] failed to check for TLS handshake: couldn't peek into incoming TCP stream
Aug 21 17:33:58 pbs proxmox-backup-proxy[826]: [[::ffff:172.16.100.65]:57200] failed to check for TLS handshake: couldn't peek into incoming TCP stream


Nginx config:
NGINX:
server {
    listen 80;
    listen [::]:80;
    server_name domain.com;
    rewrite ^(.*) https://$host$1 permanent;
}
 
server {
    listen 443 ssl;
    listen [::]:443 ssl;
    server_name domain.com;
    ssl_certificate /etc/letsencrypt/live/domain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/domain.com/privkey.pem;
    proxy_redirect off;
    location / {
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_pass https://172.16.100.62:8007;
        proxy_buffering off;
        client_max_body_size 0;
        proxy_connect_timeout  3600s;
        proxy_read_timeout  3600s;
        proxy_send_timeout  3600s;
        send_timeout  3600s;
    }
}


Environment:
  • PBS runs inside a VM on the Proxmox host
  • Reverse proxy: Nginx in a separate LXC container
  • 172.16.100.62 - pbs
  • 172.16.100.65 - lxc with nginx
Bash:
proxmox-backup                     4.0.0        running kernel: 6.14.8-2-pve
proxmox-backup-server              4.0.11-2     running version: 4.0.11
proxmox-kernel-helper              9.0.3
proxmox-kernel-6.14.8-2-pve-signed 6.14.8-2
proxmox-kernel-6.14                6.14.8-2
ifupdown2                          3.3.0-1+pmx9
libjs-extjs                        7.0.0-5
proxmox-backup-docs                4.0.11-2
proxmox-backup-client              4.0.11-1
proxmox-mail-forward               1.0.2
proxmox-mini-journalreader         1.6
proxmox-offline-mirror-helper      0.7.0
proxmox-widget-toolkit             5.0.5
pve-xtermjs                        5.5.0-2
smartmontools                      7.4-pve1
zfsutils-linux                     2.3.3-pve1


Questions:
  1. Could the repeated TLS handshake errors in the PBS journal be caused by my Nginx reverse proxy setup?
  2. Are the intermittent pxar backup failures (broken pipe, catalog upload error) related to the proxy, or are they more likely caused by PBS itself running inside the VM?
  3. What steps can I take to troubleshoot or handle these errors?

Thanks for any help
 
At a first glance, your reverse-proxy setup seems to be fine.

Some basic troubleshooting things:
  1. Does connecting to the web interface via both http://yourdomain.tld and https://yourdomain.tld work?
    (I assume it does, otherwise you'd have already noticed. Right? ;P)

  2. Do the errors go away if you use https://172.16.100.62:8007 directly in your backup config and other places?

  3. Do you have any health monitoring service running or something similar that just checks whether your PBS VM is reachable? Perhaps by just establishing a TCP connection and then sending weird data? Because that would be enough to trigger the errors (see the sketch after this list).

  4. What's the overall resource usage during backups? CPU, RAM, IOPS, etc. of your PVE host(s) and of your CTs and VMs as well. If there are any hangs or longer hiccups, it might explain why you're running into that specific code path.

  5. You mentioned that this happens "usually only for certain containers, while other containers back up successfully before and after"—so for which containers / (and VMs?) does this happen? For which does it not happen? What kind of applications are running in those containers?

  6. Since your Nginx instance runs inside a container, are you also backing up that container itself to your PBS VM? What might happen is that a short interruption of the container causes the connection to hang / get messed up somehow, but that should be very unlikely.
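
To illustrate point 3: a probe that merely opens a TCP connection to port 8007 and drops it again, without ever starting a TLS handshake, should be enough to produce exactly that journal line, while a probe that actually completes a TLS handshake should not. A rough sketch, using the PBS address from your post:

Bash:
# mimics a naive "is the port alive?" health check: connect, send nothing, close.
# this should show up in the PBS journal as
# "failed to check for TLS handshake: couldn't peek into incoming TCP stream"
nc -z 172.16.100.62 8007

# a TLS-aware probe completes the handshake first and should not trigger the message
openssl s_client -connect 172.16.100.62:8007 </dev/null >/dev/null

# or request the login page over HTTPS (-k in case PBS still uses its self-signed certificate)
curl -sk https://172.16.100.62:8007/ >/dev/null
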
That's all I can think of off the top of my head right now. Thanks for reporting this here by the way! (:
 
Thank you for your response.

  1. Yes, it does, it works fine.
  2. The problem is that backup errors do not appear consistently — roughly 1 out of 30 backups fails. Because of this, it is quite hard to confirm if the issue is resolved.
  3. Yes, I found the cause of the TLS handshake errors. It turned out to be the built-in health check in the Nginx UI, which is the web panel for managing Nginx.
  4. The load is usually <25% CPU, <50% I/O delay, <50% RAM on PVE, and <50% CPU, <60% RAM on PBS. I/O load is often around 40%, with occasional peaks. My PVE is not very big — most LXC containers do not exceed 10 GB, and due to incremental and daily backups, the backup size is often under 1GB.
  5. Once the issue happened with the PVE backup script, and a few times with this community LXC script. It is not very important for me, so today I just deleted this LXC. What bothers me more are the recurring TCP handshake errors.
  6. That is an interesting point, but backups of the LXC with Nginx always succeed.
 
Yes, I found the cause of the TLS handshake errors. It turned out to be the built-in health check in the Nginx UI, which is the web panel for managing Nginx.

Okay, that's good!
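
For anyone else who runs into this: such probes are usually easy to spot on the PBS side, since they arrive from the proxy container at the same interval as the journal messages (every 30 seconds here). A quick sketch with tcpdump:

Bash:
# run on the PBS VM: show connections from the nginx container (172.16.100.65) to the PBS port;
# a bare health-check probe appears as a connect followed almost immediately by a close,
# with no payload in between
tcpdump -ni any 'host 172.16.100.65 and tcp port 8007'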

The load is usually <25% CPU, <50% I/O delay, <50% RAM on PVE, and <50% CPU, <60% RAM on PBS. I/O load is often around 40%, with occasional peaks. My PVE is not very big — most LXC containers do not exceed 10 GB, and due to incremental and daily backups, the backup size is often under 1GB.

What do you mean by <50% I/O delay? Do you mean <50ms I/O delay, perhaps?

Once the issue happened with the PVE backup script, and a few times with this community LXC script. It is not very important for me, so today I just deleted this LXC. What bothers me more are the recurring TCP handshake errors.

What PVE backup script are you referring to?


Regarding the TLS handshake errors: Those shouldn't be linked to your problem of backups sporadically failing, but given that they're rather annoying if one has to rely on a monitoring service with that kind of healthcheck mechanism, I'll see if I can improve the overall logic in that code path a little.
 
What do you mean by <50% I/O delay? Do you mean <50ms I/O delay, perhaps?
I've checked there

(screenshot attached: chrome_CaFacUOeNO.png)

What PVE backup script are you referring to?
Script for proxmox-backup-client
Bash:
proxmox-backup-client backup $BACKUP_NAME.$BACKUP_TYPE:$BACKUP_PATH --repository $USER@$REPOSITORY:$PORT:$DATASTORE $ADDITIONAL
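
With placeholder values filled in, that boils down to something like this (repository user, datastore and paths here are just hypothetical examples):

Bash:
# placeholder values: "backup@pbs" is the PBS user (including the realm),
# "datastore1" the target datastore, "root.pxar:/" one archive-name/source-path pair
proxmox-backup-client backup root.pxar:/ \
    --repository backup@pbs@172.16.100.62:8007:datastore1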

Regarding the TLS handshake errors: Those shouldn't be linked to your problem of backups sporadically failing, but given that they're rather annoying if one has to rely on a monitoring service with that kind of healthcheck mechanism, I'll see if I can improve the overall logic in that code path a little.
Yeah, and what's interesting is that the Proxmox host is also managed via the Nginx UI, but there aren’t any TLS handshake errors.
 
I've checked there

(screenshot attached: chrome_CaFacUOeNO.png)

Oh right! My bad. I usually look at different stats.

... So, wait. Your IO delay stays below 50%? How high does it usually spike?

From man 1 iostat:
%iowait
Show the percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.

So, if your IO delay percentage is at around 40% for example, it means that 40% of your CPUs' time was spent waiting on your disk(s). That could very well be the reason why some backups might intermittently fail. The next time a backup fails / hangs / aborts, check if you find anything in journalctl -x and also see what the IO delay was at that time.
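
Something along these lines should be enough to capture both the next time it happens (the time window is just an example; iostat comes with the sysstat package):

Bash:
# journal entries around the time of the failed backup (adjust the window)
journalctl -x --since "2025-08-21 17:00" --until "2025-08-21 18:00"

# sample CPU usage (including %iowait) and extended per-device I/O statistics every 5 seconds
iostat -x 5

# on recent kernels, pressure stall information gives a quick view of I/O stalls
cat /proc/pressure/io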

What kind of hardware are you using for your storage?

Yeah, and what's interesting is that the Proxmox host is also managed via the Nginx UI, but there aren’t any TLS handshake errors.

Yeah, PVE and PBS use different HTTP servers under the hood and the HTTP-to-HTTPS redirects have subtle differences in how they work.

Either way, I'll see if I can improve the error handling we have on the PBS side, since it's clearly distracting / worrying.
 
So, if your IO delay percentage is at around 40% for example, it means that 40% of your CPUs' time was spent waiting on your disk(s). That could very well be the reason why some backups might intermittently fail. The next time a backup fails / hangs / aborts, check if you find anything in journalctl -x and also see what the IO delay was at that time.

What kind of hardware are you using for your storage?
I don’t want to take up too much of your time — looking at it now, it might just be an issue with my hardware. If so, I don’t think it makes sense to pursue this further.

Still, in case it’s useful for you, I’ve made a short video with some details. Google Drive
Yeah, PVE and PBS use different HTTP servers under the hood and the HTTP-to-HTTPS redirects have subtle differences in how they work.

Either way, I'll see if I can improve the error handling we have on the PBS side, since it's clearly distracting / worrying.
I see, thank you, I really appreciate your work!
 