S3 Upload Fails with Large VMs (Cache Issue?)

Oachkatze
Hi everyone,


Since upgrading to Proxmox Backup Server 4.0, I’ve started using the new S3 target support, and it has been great so far. However, I’m running into an issue with larger VMs:
My upload bandwidth is limited to around 10 Mbit/s, and for some larger backups the upload process fails partway through. From what I understand, PBS uses a local cache for S3, which defaults to 25 GB.

Smaller CTs run fine with the S3 bucket.

My suspicion is that the cache fills up before the data can be uploaded to the S3 storage, causing the backup to abort.


My questions:

  • Is there a way to increase or configure the S3 cache size in PBS?
  • Has anyone else experienced similar issues when uploading large backups with limited upstream bandwidth?
Any help or pointers would be greatly appreciated!
Thanks in advance,
Simon

 
Hi,
please post the contents of the systemd journal from around the time the chunk upload error happens. This should give more error context.
PBS uses a local cache for S3, which defaults to 25 GB.
No, the cache capacity is dynamically determined when first accessing the datastore. Did you follow the recommendation to use a dedicated device, partition or ZFS dataset with a quota for the local cache? See https://pbs.proxmox.com/docs/storage.html#datastores-with-s3-backend
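For reference, a minimal sketch of such a setup (the pool name, dataset name, quota and mountpoint below are example values only, adjust them to your system; the linked docs are authoritative):

Code:
# create a dedicated ZFS dataset with a quota for the local S3 cache
# ("rpool", the 100G quota and the mountpoint are example values)
zfs create -o quota=100G -o mountpoint=/mnt/s3-cache rpool/s3-cache

The datastore's local cache path is then pointed at that mountpoint when creating the S3-backed datastore, as described in the linked docs.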

My suspicion is that the cache fills up before the data can be uploaded to the S3 storage, causing the backup to abort.
If your usable storage changed since the last time the cache capacity was determined, then this can be the cause, yes. Also, there is a known issue with the cache not being rewarmed correctly when the PBS services are restarted, the host is rebooted or the datastore is set into maintenance mode. This is actively being worked on, see https://lore.proxmox.com/pbs-devel/20250801141024.626365-1-c.ebner@proxmox.com/T/

For the time being, manual intervention is required to reclaim this space for the cases mentioned above.
 


I was able to quickly grab a screenshot – I’ll take a closer look at the other things a bit later. But thank you already for your help!


Best regards,
Simon
 
Sorry, I was not precise enough. The interesting part is to be found in the systemd journal of the PBS host.
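For example (the time window is a placeholder, match it to your failed task):

Code:
# on the PBS host, show the journal around the failed backup,
# e.g. everything from the last hour:
journalctl --since "-1h"
# or follow the backup proxy service live during a test run:
journalctl -f -u proxmox-backup-proxy.service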
 
I think I have the same issue.

Small containers back up just fine; bigger ones, or VMs, end up with an error.

The relevant parts of journalctl are below.

It's between two reboots; the first one is with a VM, the second one with a CT.
Both were manual backup tasks for testing.

journalctl on PBS
Code:
Aug 09 02:05:34 pbs proxmox-backup-proxy[719]: TASK ERROR: backup ended but finished flag is not set.


-- Boot 0a0d5fa66f1d4c6ea1f85e17ad4223f0 --
Aug 09 02:19:07 pbs systemd[1]: Starting proxmox-backup-proxy.service - Proxmox Backup API Proxy Server...
Aug 09 02:19:08 pbs proxmox-backup-proxy[650]: applied rrd journal (2 entries in 0.005 seconds)
Aug 09 02:19:08 pbs proxmox-backup-proxy[650]: rrd journal successfully committed (0 files in 0.000 seconds)
Aug 09 02:19:08 pbs proxmox-backup-proxy[650]: catching shutdown signal
Aug 09 02:19:08 pbs proxmox-backup-proxy[650]: catching reload signal
Aug 09 02:19:08 pbs systemd[1]: Started proxmox-backup-proxy.service - Proxmox Backup API Proxy Server.
Aug 09 02:19:08 pbs proxmox-backup-proxy[650]: Using datastore cache with capacity 7499 for store s3-backup
Aug 09 02:21:11 pbs proxmox-backup-proxy[650]: TASK ERROR: backup ended but finished flag is not set.



And after that, the corresponding logs from the PVE:
Code:
INFO: starting new backup job: vzdump 400 --storage pbs --node pve-i9 --notification-mode notification-system --mode snapshot --remove 0 --notes-template '{{guestname}}'
INFO: Starting Backup of VM 400 (qemu)
INFO: Backup started at 2025-08-09 02:04:54
INFO: status = stopped
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: VM Name: win10
INFO: include disk 'scsi0' 'vmstorage:vm-400-disk-1' 100G
INFO: include disk 'efidisk0' 'vmstorage:vm-400-disk-0' 1M
INFO: include disk 'tpmstate0' 'vmstorage:vm-400-disk-2' 4M
INFO: creating Proxmox Backup Server archive 'vm/400/2025-08-09T00:04:54Z'
INFO: starting kvm to execute backup task
swtpm_setup: Not overwriting existing state file.
INFO: attaching TPM drive to QEMU for backup
INFO: started backup task '0f2d4dfb-8b11-4bd0-b6f4-7a14634f98f6'
INFO: efidisk0: dirty-bitmap status: created new
INFO: scsi0: dirty-bitmap status: created new
INFO: tpmstate0-backup: dirty-bitmap status: created new
INFO:   0% (432.0 MiB of 100.0 GiB) in 3s, read: 144.0 MiB/s, write: 17.3 MiB/s
INFO:   1% (1.0 GiB of 100.0 GiB) in 6s, read: 208.0 MiB/s, write: 133.3 MiB/s
INFO:   1% (1.1 GiB of 100.0 GiB) in 37s, read: 1.4 MiB/s, write: 1.4 MiB/s
ERROR: backup write data failed: command error: write_data upload error: pipelined request failed: failed to upload chunk to s3 backend
INFO: aborting backup job
INFO: stopping kvm after backup task
ERROR: Backup of VM 400 failed - backup write data failed: command error: write_data upload error: pipelined request failed: failed to upload chunk to s3 backend
INFO: Failed at 2025-08-09 02:05:34
INFO: Backup job finished with errors
INFO: notified via target `smtp-mail`
TASK ERROR: job errors




Code:
INFO: starting new backup job: vzdump 206 --notification-mode notification-system --storage pbs --node pve-i9 --remove 0 --notes-template '{{guestname}}' --mode snapshot
INFO: Starting Backup of VM 206 (lxc)
INFO: Backup started at 2025-08-09 02:19:55
INFO: status = running
INFO: CT Name: immich.******.**
INFO: including mount point rootfs ('/') in backup
INFO: excluding bind mount point mp0 ('/mnt/fotos') from backup (not a volume)
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
INFO: creating Proxmox Backup Server archive 'ct/206/2025-08-09T00:19:55Z'
INFO: set max number of entries in memory for file-based backups to 1048576
INFO: run: lxc-usernsexec -m u:0:100000:65536 -m g:0:100000:65536 -- /usr/bin/proxmox-backup-client backup --crypt-mode=none pct.conf:/var/tmp/vzdumptmp1980205_206/etc/vzdump/pct.conf root.pxar:/mnt/vzsnap0 --include-dev /mnt/vzsnap0/./ --skip-lost-and-found --exclude=/tmp/?* --exclude=/var/tmp/?* --exclude=/var/run/?*.pid --backup-type ct --backup-id 206 --backup-time 1754698795 --entries-max 1048576 --repository root@pam@192.168.179.26:s3-backup --ns pve-i9
INFO: Starting backup: [pve-i9]:ct/206/2025-08-09T00:19:55Z  
INFO: Client name: pve-i9  
INFO: Starting backup protocol: Sat Aug  9 02:19:55 2025  
INFO: No previous manifest available.  
INFO: Upload config file '/var/tmp/vzdumptmp1980205_206/etc/vzdump/pct.conf' to 'root@pam@192.168.179.26:8007:s3-backup' as pct.conf.blob  
INFO: Upload directory '/mnt/vzsnap0' to 'root@pam@192.168.179.26:8007:s3-backup' as root.pxar.didx  
INFO: processed 900.984 MiB in 1m, uploaded 602.117 MiB
INFO: unclosed encoder dropped
INFO: closed encoder dropped with state
INFO: unfinished encoder state dropped
INFO: finished encoder state with errors
INFO: catalog upload error - channel closed  
INFO: Error: failed to upload chunk to s3 backend
INFO: cleanup temporary 'vzdump' snapshot
ERROR: Backup of VM 206 failed - command 'lxc-usernsexec -m u:0:100000:65536 -m g:0:100000:65536 -- /usr/bin/proxmox-backup-client backup '--crypt-mode=none' pct.conf:/var/tmp/vzdumptmp1980205_206/etc/vzdump/pct.conf root.pxar:/mnt/vzsnap0 --include-dev /mnt/vzsnap0/./ --skip-lost-and-found '--exclude=/tmp/?*' '--exclude=/var/tmp/?*' '--exclude=/var/run/?*.pid' --backup-type ct --backup-id 206 --backup-time 1754698795 --entries-max 1048576 --repository root@pam@192.168.179.26:s3-backup --ns pve-i9' failed: exit code 255
INFO: Failed at 2025-08-09 02:21:12
INFO: Backup job finished with errors
INFO: notified via target `smtp-mail`
TASK ERROR: job errors
 
backup write data failed: command error: write_data upload error: pipelined request failed: failed to upload chunk to s3 backend
Double-checked the code on this one: the error points at the upload failing, not at an issue with the cache. Since there are no further errors in the systemd journal, the requests to the API seem to be fine, which points more towards errors while sending the requests, running into a request timeout, or concurrent uploads of the same chunk object hindering each other.

Is this the only backup running to this datastore at the time? Can you rule out network dropouts, e.g. by continuously pinging the S3 endpoint from the PBS host during the backup run?

Also, in order to reproduce, what provider are you using?
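For example (the endpoint hostname is a placeholder, use your bucket's actual endpoint):

Code:
# timestamped pings for the whole backup run, logged to a file
# (replace s3.example.com with your actual S3 endpoint)
ping -D -i 1 s3.example.com | tee /tmp/s3-ping.log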
 

Hi Chris,

I just saw that crz has already sent the log file – yeah, it’s really strange...
The only thing I can imagine is that, since my upload speed is only 10 Mbit/s and the cache shows just 25 GB, the cache might somehow play a role here. With smaller CTs it works perfectly, since the cache doesn’t fill up.

The S3 provider I’m using is Hetzner.
The connection setup itself went without any problems.
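As a back-of-envelope check of that idea (pure arithmetic, ignoring deduplication and compression):

Code:
# time to drain a 25 GB cache at 10 Mbit/s:
# 25 GB * 8 bit/byte = 200,000 Mbit, at 10 Mbit/s:
echo $(( 25 * 8 * 1000 / 10 ))   # 20000 seconds, i.e. roughly 5.5 hours

So a fast local read could fill the cache far faster than a link like mine can drain it.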
 
Right now it's from my homelab to Hetzner's S3, for evaluation purposes.
And yes, I'm only manually backing up one VM/CT at a time.


I have 50 Mbit/s upload to it.
And I have set up an extra 128 GB cache disk as ext4, as described in the docs
(https://pbs.proxmox.com/docs/storage.html#datastore-s3-backend)
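Since the cache is on its own mount, one way to see whether it actually fills up during a run (the mountpoint below is a placeholder for wherever the 128 GB disk is mounted):

Code:
# refresh disk usage of the dedicated cache mount every 5 seconds
# (/mnt/s3-cache is a placeholder for the actual mountpoint)
watch -n 5 df -h /mnt/s3-cache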

Is there a way to set a maximum number of concurrent connections?
That way I could test whether this is the problem.

Best regards,
crz
 