Crashing API PBS 4.0.11

kitsune242

Member
Oct 27, 2022
Hi,
I’m running into a rather strange and specific issue.
I have the latest version of PBS 4.0.11 installed on bare metal. It’s connected to an S3 endpoint as the main storage and an NFS share used as cache. The whole setup is configured correctly and works perfectly fine for single backups.
However, when I start running dozens of backups in parallel, after anywhere from a few minutes up to ~3 hours, the system essentially becomes unavailable. More precisely:
  • The PBS web interface on port 8007 stops responding.
  • According to the logs, the proxmox-backup-proxy service is still running normally (no errors or crashes are shown).
  • Backups stop progressing.
If I run:
systemctl restart proxmox-backup-proxy.service
everything immediately goes back to normal, and backups work again — but only for a while, until the issue repeats.
All traffic runs over 10Gb NICs. I’m not sure where exactly to look for the root cause — whether it’s PBS itself, the proxy service, or something related to network/storage performance.

Any pointers on how to debug this would be greatly appreciated.
 
What are the specs of your hardware? Have you monitored I/O pressure, memory & CPU usage, etc.?
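If it helps, here is a minimal sketch for sampling I/O and memory pressure while the parallel backups run. It assumes a kernel with PSI enabled (any recent Debian kernel); the 30-sample count and 10-second interval are arbitrary choices:

```shell
#!/bin/sh
# Sample I/O and memory pressure plus memory usage every 10 seconds.
# PSI values are the percentage of time tasks stalled waiting on that
# resource; sustained high "full" numbers point at saturation.
for i in $(seq 1 30); do
    date
    grep -H '' /proc/pressure/io /proc/pressure/memory
    free -m | sed -n '1,2p'
    sleep 10
done
```

Running this in a second terminal while the backups hang should show whether the stall correlates with disk or memory pressure.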
 
As a coworker just pointed out to me (thanks Chris!), you might be running into a bug that was (very) recently fixed.

If possible, could you activate the pbs-test repository, install proxmox-backup-server version 4.0.16-1 or greater, and then see whether the issue persists?
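A rough sketch of how that could look. The suite name below assumes PBS 4.x on Debian 13 "Trixie"; double-check it against your existing entries in /etc/apt/sources.list.d/ before applying:

```shell
# Add the pbs-test repository (verify the suite name matches your install)
echo 'deb http://download.proxmox.com/debian/pbs trixie pbs-test' \
    > /etc/apt/sources.list.d/pbs-test.list

apt update
apt full-upgrade                 # should pull in proxmox-backup-server >= 4.0.16-1

proxmox-backup-manager version   # confirm the installed version afterwards
```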
 
I've been running into this same thing, with pretty much the exact same setup, except I'm only using S3 as a sync endpoint. I'm running the latest version, 4.1.4. I have monit running in the background, and it picks up that the API is not responding and eventually restarts it, but a bunch of backups fail in the meantime.
 
Please share an excerpt of the systemd journal from around the time the issue appears, and include the PBS and PVE task logs for the backup jobs that fail. Further note that placing the S3 cache on an NFS share is not a good idea; the cache should be a fast local filesystem.
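For reference, something like the following would capture the relevant window (the timestamps are placeholders to adjust to when the hang actually occurs):

```shell
# Capture the journal around the failure window for both PBS daemons
journalctl -u proxmox-backup-proxy.service -u proxmox-backup.service \
    --since "2026-03-26 23:50" --until "2026-03-27 00:15" \
    > pbs-journal-excerpt.log
```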

Edit: Also, are you able to identify which operation triggers this behavior? Does it correlate with a sync job, GC job, verify job, ...?
 
It starts out with a backup job at midnight:
root@pbs:/var/log# journalctl --since=today
Mar 27 00:00:00 pbs proxmox-backup-proxy[111567]: unable to start datastore verification job v-9206d00d-24cb - no such datastore 'backup_ceph'
Mar 27 00:00:00 pbs proxmox-backup-proxy[111567]: rotated access log, telling daemons to re-open log file
Mar 27 00:00:00 pbs proxmox-backup-api[3280144]: re-opening access-log file
Mar 27 00:00:00 pbs proxmox-backup-proxy[111567]: re-opening access-log file
Mar 27 00:03:45 pbs systemd[1]: Starting dpkg-db-backup.service - Daily dpkg database backup service...
Mar 27 00:03:45 pbs systemd[1]: systemd-journald.service: Main process exited, code=killed, status=6/ABRT
Mar 27 00:03:45 pbs systemd[1]: systemd-journald.service: Failed with result 'watchdog'.
Mar 27 00:03:45 pbs systemd[1]: systemd-journald.service: Consumed 17min 14.299s CPU time, 1.9G memory peak, 416K memory swap peak.
Mar 27 00:03:45 pbs systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 5.
Mar 27 00:03:45 pbs systemd[1]: Starting systemd-journald.service - Journal Service...
Mar 27 00:03:45 pbs systemd[1]: proxmox-backup.service: Deactivated successfully.
Mar 27 00:03:45 pbs systemd[1]: proxmox-backup.service: Consumed 1d 14h 10min 43.908s CPU time, 179.4M memory peak, 2.9M memory swap peak.
Mar 27 00:03:45 pbs systemd-journald[954095]: Collecting audit messages is disabled.
Mar 27 00:03:45 pbs systemd-journald[954095]: Journal started
Mar 27 00:03:45 pbs systemd-journald[954095]: System Journal (/var/log/journal/9f533500588b419f99b01763f891755d) is 3.9G, max 4G, 3.7M free.
Mar 27 00:03:44 pbs systemd[1]: systemd-journald.service: Watchdog timeout (limit 3min)!
Mar 27 00:01:44 pbs sshd-session[954064]: Connection closed by 127.0.0.1 port 60334 [preauth]
Mar 27 00:03:45 pbs proxmox-backup-proxy[111567]: unable to start datastore verification job v-9206d00d-24cb - no such datastore 'backup_ceph'
Mar 27 00:03:44 pbs systemd[1]: systemd-journald.service: Killing process 1037865 (systemd-journal) with signal SIGABRT.
Mar 27 00:02:44 pbs sshd-session[954073]: Connection closed by 127.0.0.1 port 51310 [preauth]
Mar 27 00:03:44 pbs proxmox-backup-api[3280144]: server error: Transport endpoint is not connected (os error 107)
Mar 27 00:03:44 pbs sshd-session[954088]: Connection closed by 127.0.0.1 port 38280 [preauth]
Mar 27 00:03:45 pbs systemd[1]: Started systemd-journald.service - Journal Service.
Mar 27 00:03:44 pbs proxmox-backup-api[3280144]: request_shutdown
Mar 27 00:03:44 pbs proxmox-backup-api[3280144]: daemon shutting down...
Mar 27 00:03:44 pbs proxmox-backup-api[3280144]: daemon shut down.
Mar 27 00:03:44 pbs proxmox-backup-api[3280144]: server shutting down, waiting for active workers to complete
Mar 27 00:03:44 pbs proxmox-backup-api[3280144]: done - exit server
Mar 27 00:03:44 pbs proxmox-backup-proxy[111567]: POST //api2/json/access/ticket: 400 Bad Request: [client [::ffff:10.200.2.2]:55668] client error (SendRequest)
Mar 27 00:03:44 pbs proxmox-backup-proxy[111567]: POST /api2/json/access/ticket: 400 Bad Request: [client [::ffff:10.200.1.103]:33254] client error (SendRequest)
Mar 27 00:03:44 pbs proxmox-backup-proxy[111567]: POST /api2/json/access/ticket: 400 Bad Request: [client [::ffff:10.200.2.15]:37366] client error (SendRequest)

Looking at this, I'll have to find the job still pointing at backup_ceph and remove it.

The S3 cache is local.



Backup log from PVE:
INFO: starting new backup job: vzdump 201 209 203 204 205 211 208 210 212 202 --quiet 1 --notes-template '{{guestname}}' --storage pbs_qnap --mailnotification always --mode snapshot --fleecing 0 --node pve1 --notification-mode notification-system --prune-backups 'keep-daily=7,keep-monthly=1,keep-weekly=2'
INFO: Starting Backup of VM 201 (qemu)
INFO: Backup started at 2026-03-27 00:00:05
INFO: status = running
INFO: VM Name: FoxPro
INFO: include disk 'scsi0' 'QNAP-PVE2-LVM:vm-201-disk-0' 100G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/201/2026-03-27T04:00:05Z'
INFO: enabling encryption
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 201 qmp command 'backup' failed - backup connect failed: command error: http request timed out
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 201 failed - VM 201 qmp command 'backup' failed - backup connect failed: command error: http request timed out
INFO: Failed at 2026-03-27 00:02:22
INFO: Starting Backup of VM 202 (qemu)
INFO: Backup started at 2026-03-27 00:02:22
INFO: status = running
INFO: VM Name: EDI
INFO: include disk 'scsi0' 'local_lvm:vm-202-disk-0' 100G
INFO: include disk 'scsi1' 'local_lvm:vm-202-disk-1' 512G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: skip unused drive 'QNAP-PVE2-LVM:vm-202-disk-0' (not included into backup)
INFO: skip unused drive 'QNAP-PVE2-LVM:vm-202-disk-1' (not included into backup)
INFO: creating Proxmox Backup Server archive 'vm/202/2026-03-27T04:02:22Z'
INFO: enabling encryption
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 202 qmp command 'backup' failed - backup connect failed: command error: client error (SendRequest)
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 202 failed - VM 202 qmp command 'backup' failed - backup connect failed: command error: client error (SendRequest)
INFO: Failed at 2026-03-27 00:03:44
INFO: Starting Backup of VM 203 (qemu)
INFO: Backup started at 2026-03-27 00:03:44
INFO: status = running
INFO: VM Name: RAS1
INFO: include disk 'scsi0' 'QNAP-PVE2-LVM:vm-203-disk-0' 350G
INFO: include disk 'scsi1' 'QNAP-PVE2-LVM:vm-203-disk-1' 500G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/203/2026-03-27T04:03:44Z'
INFO: enabling encryption
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 203 qmp command 'backup' failed - backup connect failed: command error: client error (Connect)
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 203 failed - VM 203 qmp command 'backup' failed - backup connect failed: command error: client error (Connect)
INFO: Failed at 2026-03-27 00:04:43
INFO: Starting Backup of VM 204 (qemu)
INFO: Backup started at 2026-03-27 00:04:43
INFO: status = running
INFO: VM Name: Webserver
INFO: include disk 'scsi0' 'QNAP-PVE2-LVM:vm-204-disk-0' 500G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/204/2026-03-27T04:04:43Z'
INFO: enabling encryption
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 204 qmp command 'backup' failed - backup connect failed: command error: client error (Connect)
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 204 failed - VM 204 qmp command 'backup' failed - backup connect failed: command error: client error (Connect)
INFO: Failed at 2026-03-27 00:05:30
INFO: Starting Backup of VM 205 (qemu)
INFO: Backup started at 2026-03-27 00:05:30
INFO: status = running
INFO: VM Name: SFTP
INFO: include disk 'scsi0' 'QNAP-PVE2-LVM:vm-205-disk-0' 200G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/205/2026-03-27T04:05:30Z'
INFO: enabling encryption
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 205 qmp command 'backup' failed - backup connect failed: command error: client error (Connect)
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 205 failed - VM 205 qmp command 'backup' failed - backup connect failed: command error: client error (Connect)
INFO: Failed at 2026-03-27 00:05:36
INFO: Starting Backup of VM 208 (qemu)
INFO: Backup started at 2026-03-27 00:05:36
INFO: status = running
INFO: VM Name: ToolsA
INFO: include disk 'scsi0' 'QNAP-PVE2-LVM:vm-208-disk-0' 200G
INFO: include disk 'scsi1' 'QNAP-PVE2-LVM:vm-208-disk-1' 1000G
INFO: include disk 'scsi2' 'local_lvm:vm-208-disk-0' 350G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/208/2026-03-27T04:05:36Z'
INFO: enabling encryption
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 208 qmp command 'backup' failed - backup connect failed: command error: client error (Connect)
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 208 failed - VM 208 qmp command 'backup' failed - backup connect failed: command error: client error (Connect)
INFO: Failed at 2026-03-27 00:06:28

**** Here is when monit restarted the proxmox-backup service ****

INFO: Starting Backup of VM 209 (qemu)
INFO: Backup started at 2026-03-27 00:06:28
INFO: status = running
INFO: VM Name: Test01
INFO: include disk 'scsi0' 'QNAP-PVE2-LVM:vm-209-disk-0' 200G
INFO: include disk 'scsi1' 'QNAP-PVE2-LVM:vm-209-disk-1' 500G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/209/2026-03-27T04:06:28Z'
INFO: enabling encryption
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '2cad1ab1-e57d-447d-98b1-75672581147b'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: OK (19.3 GiB of 200.0 GiB dirty)
INFO: scsi1: dirty-bitmap status: OK (332.0 MiB of 500.0 GiB dirty)
INFO: using fast incremental mode (dirty-bitmap), 19.7 GiB dirty of 700.0 GiB total
INFO: 1% (324.0 MiB of 19.7 GiB) in 3s, read: 108.0 MiB/s, write: 108.0 MiB/s
INFO: 3% (796.0 MiB of 19.7 GiB) in 6s, read: 157.3 MiB/s, write: 156.0 MiB/s
INFO: 5% (1.1 GiB of 19.7 GiB) in 9s, read: 108.0 MiB/s, write: 108.0 MiB/s
INFO: 7% (1.5 GiB of 19.7 GiB) in 12s, read: 133.3 MiB/s, write: 128.0 MiB/s
INFO: 9% (1.9 GiB of 19.7 GiB) in 15s, read: 125.3 MiB/s, write: 125.3 MiB/s
INFO: 10% (2.1 GiB of 19.7 GiB) in 18s, read: 77.3 MiB/s, write: 77.3 MiB/s
INFO: 13% (2.6 GiB of 19.7 GiB) in 21s, read: 165.3 MiB/s, write: 164.0 MiB/s
INFO: 14% (2.9 GiB of 19.7 GiB) in 25s, read: 77.0 MiB/s, write: 76.0 MiB/s


I was also getting errors about /etc/proxmox-backup/.datastore.cfg being locked, but those seem to have gone away.

Monit checks:
# Verify the proxy is serving on port 8007 (bare TLS check, no HTTP)
# PBS uses a self-signed cert so we disable certificate verification
check host proxmox-backup-proxy-port with address localhost
if failed port 8007 type tcpssl
with ssl options {verify: disable}
then alert

###############################################################################
## Datastore Lock Check
## Alerts if .datastore.lck is held for more than 15 minutes (15 cycles @ 60s)
###############################################################################

check program datastore_lock
with path "/bin/bash -c 'test $(lsof /etc/proxmox-backup/.datastore.lck 2>/dev/null | wc -l) -gt 0'"
if status == 0 for 15 cycles then alert



[2026-03-23T14:37:21+0000] info : 'pbs' Monit reloaded
[2026-03-25T03:20:05+0000] warning : 'datastore_lock' status failed (0) -- no output
[2026-03-27T04:04:45+0000] warning : 'proxmox-backup-api' process is not running
[2026-03-27T04:05:45+0000] error : 'proxmox-backup-api' process is not running
[2026-03-27T04:06:28+0000] info : 'proxmox-backup-api' trying to restart
[2026-03-27T04:06:28+0000] info : 'proxmox-backup-api' start: '/bin/systemctl start proxmox-backup'
[2026-03-27T04:08:25+0000] info : 'proxmox-backup-api' process is running with pid 954154
root@pbs:/var/log#

There is a prune job starting at 00:00:02 which completed successfully, and then I don't see any of the backup tasks in my logs until after the proxmox-backup service started at 00:08:25.
 
Also, the datastore backups_ceph does not exist anymore; I've removed the verify job. Given your question about the trigger event, this is a good suspect!
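For the record, the stale job can also be listed and removed from the CLI; a sketch, using the job id that shows up in the journal excerpt:

```shell
# List configured verify jobs and remove the one still referencing the
# deleted datastore (the job id comes from the journal messages)
proxmox-backup-manager verify-job list
proxmox-backup-manager verify-job remove v-9206d00d-24cb
```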
 
Mar 27 00:03:45 pbs systemd[1]: Starting dpkg-db-backup.service - Daily dpkg database backup service...
Mar 27 00:03:45 pbs systemd[1]: systemd-journald.service: Main process exited, code=killed, status=6/ABRT
Mar 27 00:03:45 pbs systemd[1]: systemd-journald.service: Failed with result 'watchdog'.
It seems like your system I/O might be completely overloaded at that point, not just the Proxmox Backup Server services. Do you see significant I/O delay then? Is the S3 cache located on the same disk/filesystem as the PBS installation?

Edit: see also https://github.com/systemd/systemd/issues/14478
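One quick way to check, as a sketch (the cache path below is a placeholder for wherever the datastore cache actually lives; iostat comes from the sysstat package):

```shell
# Check whether the S3 cache and the root filesystem share a block device
findmnt -T /mnt/datastore-cache -o TARGET,SOURCE,FSTYPE   # placeholder path
findmnt -T / -o TARGET,SOURCE,FSTYPE

# Watch per-device utilization during a backup window;
# %util pinned near 100 means that disk is saturated
iostat -x 5 3
```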
 
Thanks! I have metrics from the PVE machines; I'm going to throw those into the mix and see what they show. I'm also going to space out my jobs and see what that does...