S3 backup and Tuxis daDup

p-user

Member
Jan 26, 2024
I have set up the Tuxis daDup S3 storage on my PBS 4 machine. I created a cache directory /cache, and since there is about 160 GB free on the filesystem, I assumed that would be enough.

At first it did not work, but Tuxis pointed me to the fix of adding the line "put-rate-limit 10" to the s3.cfg file, and that seemed to work. I was able to back up and restore from the S3 datastore, and for a file restore I could even browse the filesystem, great!
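For reference, a minimal sketch of where that line ends up in /etc/proxmox-backup/s3.cfg (the full, working config is at the end of this thread; "Tuxis" is just the name I gave the endpoint):

s3-endpoint: Tuxis
endpoint nl.dadup.eu
put-rate-limit 10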

There are a few things that keep it from being 100% perfect; these are my findings:

- Automatically verifying after the backup is not a good idea on S3 storage; it takes a long time. It is better to do it with a separate verify job at another time (see the sketch just below this list).
- I have two cluster nodes, and backing up all VMs causes problems. Because two backups can run concurrently with two nodes, some fail and some don't. Probably a network overload issue or something similar.
- When I ran a backup job for all machines on one node (which runs them in succession, one after the other), the system (PBS) more or less froze after three backups. Since I had to reboot the system there was not much to find, only the message "backup close image failed: command error: stream closed because of a broken pipe".
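For the verify point, a minimal sketch of what a separately scheduled verify job could look like on the PBS (syntax from memory, so double-check it against the docs; the job id and schedule are only examples):

# verify the daDup datastore once a week, outside the backup window
proxmox-backup-manager verify-job create verify-dadup --store daDup --schedule 'sun 03:00'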

Here are some ideas, maybe already implemented:

- Flush the backup at the end, and then start a new one
- Flush the cache at the end of each backup, or maybe at the start.
- Allow setting a maximum amount of space the cache may use, to prevent the system from locking up when the cache runs out of space. Small systems don't have the option of adding an extra disk for cache space.
- When backing up from different nodes, make sure only one backup runs at a time, maybe round-robin style (a rough workaround sketch follows below).
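Until something like that exists, a possible workaround is simply staggering the per-node jobs so they never overlap. A rough sketch (the schedule times and --all are only examples):

# job scheduled on pve1, e.g. at 21:00
vzdump --all --node pve1 --storage Backup_daDup --mode snapshot
# job scheduled on pve2 a couple of hours later, e.g. at 23:00, when pve1 is normally done
vzdump --all --node pve2 --storage Backup_daDup --mode snapshot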

Regards,
Albert
 
I think it's the maximum number of transactions per second. I assume the S3 backend needs time to process the transferred data.
 
Everything works as long as you don't have two concurrent backups running (my cluster has two nodes, both can start a backup).
As soon as I start backups on both nodes it fails; here is the output of both backup failure reports:

First one:
100: 2025-09-05 11:43:21 INFO: Starting Backup of VM 100 (qemu)
100: 2025-09-05 11:43:21 INFO: status = stopped
100: 2025-09-05 11:43:21 INFO: backup mode: stop
100: 2025-09-05 11:43:21 INFO: ionice priority: 7
100: 2025-09-05 11:43:21 INFO: VM Name: docky
100: 2025-09-05 11:43:21 INFO: include disk 'scsi0' 'Storage_NFS:100/vm-100-disk-0.qcow2' 32G
100: 2025-09-05 11:43:21 INFO: creating Proxmox Backup Server archive 'vm/100/2025-09-05T09:43:21Z'
100: 2025-09-05 11:43:21 INFO: starting kvm to execute backup task
100: 2025-09-05 11:43:23 INFO: started backup task '2e6e22bf-0e1d-45db-9f11-49e2ab375ea3'
100: 2025-09-05 11:43:23 INFO: scsi0: dirty-bitmap status: created new
100: 2025-09-05 11:43:26 INFO: 0% (324.0 MiB of 32.0 GiB) in 3s, read: 108.0 MiB/s, write: 102.7 MiB/s
100: 2025-09-05 11:43:29 INFO: 1% (596.0 MiB of 32.0 GiB) in 6s, read: 90.7 MiB/s, write: 90.7 MiB/s
100: 2025-09-05 11:43:32 INFO: 2% (692.0 MiB of 32.0 GiB) in 9s, read: 32.0 MiB/s, write: 32.0 MiB/s
100: 2025-09-05 11:43:39 INFO: 3% (1004.0 MiB of 32.0 GiB) in 16s, read: 44.6 MiB/s, write: 44.6 MiB/s
100: 2025-09-05 11:43:46 INFO: 4% (1.3 GiB of 32.0 GiB) in 23s, read: 46.3 MiB/s, write: 46.3 MiB/s
100: 2025-09-05 11:44:12 INFO: 4% (1.4 GiB of 32.0 GiB) in 49s, read: 5.4 MiB/s, write: 5.4 MiB/s
100: 2025-09-05 11:44:12 ERROR: backup write data failed: command error: write_data upload error: pipelined request failed: failed to upload chunk to s3 backend
100: 2025-09-05 11:44:12 INFO: aborting backup job
100: 2025-09-05 11:44:12 INFO: stopping kvm after backup task
100: 2025-09-05 11:44:12 ERROR: Backup of VM 100 failed - backup write data failed: command error: write_data upload error: pipelined request failed: failed to upload chunk to s3 backend

Second one:
106: 2025-09-05 11:43:43 INFO: Starting Backup of VM 106 (qemu)
106: 2025-09-05 11:43:43 INFO: status = stopped
106: 2025-09-05 11:43:43 INFO: backup mode: stop
106: 2025-09-05 11:43:43 INFO: ionice priority: 7
106: 2025-09-05 11:43:43 INFO: VM Name: freebsd14
106: 2025-09-05 11:43:43 INFO: include disk 'ide0' 'Storage_NFS:106/vm-106-disk-1.qcow2' 20G
106: 2025-09-05 11:43:43 INFO: creating Proxmox Backup Server archive 'vm/106/2025-09-05T09:43:43Z'
106: 2025-09-05 11:43:43 INFO: starting kvm to execute backup task
106: 2025-09-05 11:43:45 INFO: started backup task '49f3c06f-2807-470c-88b3-d321617f9e1b'
106: 2025-09-05 11:43:45 INFO: ide0: dirty-bitmap status: created new
106: 2025-09-05 11:43:48 INFO: 12% (2.4 GiB of 20.0 GiB) in 3s, read: 834.7 MiB/s, write: 153.3 MiB/s
106: 2025-09-05 11:44:09 INFO: 12% (2.5 GiB of 20.0 GiB) in 24s, read: 2.5 MiB/s, write: 2.5 MiB/s
106: 2025-09-05 11:44:09 ERROR: backup write data failed: command error: write_data upload error: pipelined request failed: failed to upload chunk to s3 backend
106: 2025-09-05 11:44:09 INFO: aborting backup job
106: 2025-09-05 11:44:09 INFO: stopping kvm after backup task
106: 2025-09-05 11:44:09 ERROR: Backup of VM 106 failed - backup write data failed: command error: write_data upload error: pipelined request failed: failed to upload chunk to s3 backend
 
Ok, addition here. I'm using the Tuxis daDup S3 storage.
For the S3 configuration I set Provider Quirks to "Skip If-None-Match header", and now I can back up a 64 GB Windows VM; before that, only smaller backups worked fine. I will investigate further with multiple backups from multiple nodes and see if this fixes that too.
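For reference, this is how that quirk ends up in the s3-endpoint section of /etc/proxmox-backup/s3.cfg (the other properties are omitted here; the full config is further down in this thread):

s3-endpoint: Tuxis
provider-quirks skip-if-none-match-header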
 
Ok, I am unable to back up a 64 GB VM (in this case a Windows 11 VM) after all, and it is reproducible.

I had the task viewer open on both the PVE node and the PBS. The total cache use stayed well below the available space, so that wasn't the issue.
I also had an SSH session open to the PBS, showing htop.
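From that session I simply kept an eye on the cache with standard tools, roughly like this (the path is the datastore cache path from my datastore.cfg at the end of the thread):

# refresh every 30 seconds: free space on the cache filesystem and total size of the cache directory
watch -n 30 'df -h /mnt/datastore/cache; du -sh /mnt/datastore/cache'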

The backup runs on the PVE node until it reaches 100%; then the backup processes on the PBS suddenly disappear and the PBS dashboard becomes unresponsive.

Here is the last part of the output in the PVE task viewer window:
Taskviewer pve.jpg

And here is the last bit of the task viewer on the PBS:

Taskviewer pbs.jpg

It all looks OK, but at the end of the backup something happens that kills the proxmox-backup processes, and the backup never finishes. A reboot of the PBS is necessary to regain control. After that, the ghost backup entry on the PVE node can be deleted, and on the PBS a garbage collection is run to clean up the mess. Here is the last part of the task viewer output of that garbage collection:

Taskviewer garbage collect.jpg
 
That does indeed sound like the same problem. I tried lowering the rate limit from 10 to 3, and that made no difference: exactly the same behaviour as mentioned above. Hopefully it will be fixed. So far I can use it for my mail backup, about 6 GB using the local backup client on the mail server. That works fine, encryption and all, and I can successfully restore as well. It looks like it's something between the cache and the actual upload to S3; my /cache fills to between 20 and 50 GB, and there's plenty of space available.
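For completeness, the mail backup is just the regular proxmox-backup-client run on the mail server, roughly like this with an explicit encryption key (the host name, key path and backed-up directory are only placeholders):

# one-time: create an encryption key for the client
proxmox-backup-client key create /root/pbs-enc.key
# back up the mail directory as a pxar archive to the daDup datastore on the PBS
export PBS_REPOSITORY='root@pam@pbs.example.lan:daDup'
proxmox-backup-client backup mail.pxar:/var/vmail --keyfile /root/pbs-enc.key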

Hopefully it will be found and fixed soon.
 
Here is some more info.
I started a backup of a VM with a 32 GB disk on node pve1, which normally runs fine. Then I started a new backup on pve2, which also started and finished fine; however, the job on pve1 failed with the following log:

INFO: starting new backup job: vzdump 101 --storage Backup_daDup --remove 0 --node pve1 --notes-template '{{guestname}}' --mode snapshot --notification-mode notification-system
INFO: Starting Backup of VM 101 (qemu)
INFO: Backup started at 2025-09-10 11:06:23
INFO: status = stopped
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: VM Name: ubuntu24.04
INFO: include disk 'scsi0' 'Storage_NFS:101/vm-101-disk-0.qcow2' 32G
INFO: creating Proxmox Backup Server archive 'vm/101/2025-09-10T09:06:23Z'
INFO: starting kvm to execute backup task
INFO: started backup task '55c18753-0c4e-468d-b12d-a0b10ed5e7b1'
INFO: scsi0: dirty-bitmap status: created new
INFO: 1% (524.0 MiB of 32.0 GiB) in 3s, read: 174.7 MiB/s, write: 169.3 MiB/s
INFO: 2% (656.0 MiB of 32.0 GiB) in 7s, read: 33.0 MiB/s, write: 33.0 MiB/s
INFO: 3% (988.0 MiB of 32.0 GiB) in 17s, read: 33.2 MiB/s, write: 33.2 MiB/s
INFO: 3% (1.0 GiB of 32.0 GiB) in 35s, read: 2.4 MiB/s, write: 2.4 MiB/s
ERROR: backup write data failed: command error: write_data upload error: pipelined request failed: failed to upload chunk to s3 backend
INFO: aborting backup job
INFO: stopping kvm after backup task
ERROR: Backup of VM 101 failed - backup write data failed: command error: write_data upload error: pipelined request failed: failed to upload chunk to s3 backend
INFO: Failed at 2025-09-10 11:07:00
INFO: Backup job finished with errors
INFO: notified via target `mail-to-root`
TASK ERROR: job errors


Running that same job again after the second job had finished, the first one ran successfully.
 
I hope something can be done here. I started a "large" backup of a 64 GB disk, and in the task viewer it ran all the way to 100%. The moment it reached 100%, the task viewer on the PBS (showing the chunks) stopped immediately, and the PBS web page became unresponsive (connection error). On the PBS command line I can see that no proxmox-backup processes are running.
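In case it helps anyone hitting the same thing: before rebooting, it may be worth checking whether the PBS daemons can simply be restarted. A sketch with the standard service names (I have not verified that a restart is enough to recover from this state):

# check whether the PBS API daemon and proxy are still running
systemctl status proxmox-backup.service proxmox-backup-proxy.service
ps aux | grep -E 'proxmox-backup(-proxy)?' | grep -v grep
# try restarting them before resorting to a full reboot
systemctl restart proxmox-backup.service proxmox-backup-proxy.service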
 
I've updated to 4.04.15 through the test repository. It has been working fine ever since, at least for the last 5 days.
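For anyone who wants to try the same: if I remember the repository line correctly, the test repository for PBS 4 on Debian 13 "trixie" looks something like this (double-check against the Proxmox docs before adding it):

# e.g. in /etc/apt/sources.list.d/pbs-test.list
deb http://download.proxmox.com/debian/pbs trixie pbs-test

# then update and upgrade
apt update && apt full-upgrade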

Here are my settings (obviously the access-key and secret-key are obfuscated):

root@pbs:/etc/proxmox-backup# cat s3.cfg
s3-endpoint: Tuxis
access-key ####################
endpoint nl.dadup.eu
path-style true
port 443
provider-quirks skip-if-none-match-header
put-rate-limit 10
region default
secret-key ########################################

root@pbs:~# cat /etc/proxmox-backup/datastore.cfg
datastore: daDup
backend bucket=backup,client=Tuxis,type=s3
comment Tuxis daDup Datatstore
gc-schedule sat 18:15
notification-mode legacy-sendmail
notify gc=error,prune=error,sync=error,verify=error
path /mnt/datastore/cache