S3 backup and Tuxis daDup

I have set up Tuxis' daDup S3 storage on my PBS 4 machine. I created a cache directory /cache, and since there is about 160 GB free on that filesystem, I assumed that would be enough.

At first it did not work; Tuxis pointed me to the fix of putting the line "put-rate-limit 10" in the s3.cfg file, and that seemed to work. I was able to back up to and restore from the S3 datastore, and for a file restore I could browse the filesystem. Great!
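
For reference, the endpoint entry in /etc/proxmox-backup/s3.cfg now looks roughly like the sketch below. The endpoint ID, host and keys are placeholders, and the exact key names may differ on other setups; the only line I actually added is put-rate-limit:

s3-endpoint: dadup
    endpoint s3.example.com
    access-key XXXXXXXXXXXX
    secret-key XXXXXXXXXXXX
    put-rate-limit 10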

There are a few things that keep it from being 100% perfect; these are my findings:

- Automatically verifying right after the backup is not a good idea on S3 storage, it takes a long time. It is better to do it with a separate verify job at another time (see the sketch after this list).
- I have two cluster nodes, and backing up all VMs gives problems. Because with two nodes two backups can run concurrently, some fail and some don't. Probably a network overload issue or something similar.
- When I ran a backup job for all machines on one node (so the backups run in succession, one after the other), the PBS more or less froze after three backups. Since I had to reboot the system there was not much to find, only the message "backup close image failed: command error: stream closed because of a broken pipe".
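
As a sketch of the verify-job idea mentioned above: instead of verifying right after each backup, a scheduled verify job can be created on the PBS (via Datastore -> Verify Jobs in the GUI, or on the command line). The job ID, store name and schedule below are just examples:

proxmox-backup-manager verify-job create verify-dadup \
    --store dadup-s3 \
    --schedule "sat 03:00"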

Here are some ideas, maybe already implemented:

- Flush the backup at the end, and then start a new one
- Flush the cache at the end of each backup, or maybe at the start.
- The ability to set a maximum amount of space for the cache, to prevent the system from locking up when the cache fills up. Small systems don't have the option of adding an extra disk for cache space.
- When backing up from different nodes, make sure only one backup runs at a time, maybe round-robin style (a staggered-schedule workaround is sketched after this list).
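
As a workaround for that last point, the backup jobs on the two PVE nodes can be given non-overlapping schedules, so only one node talks to the S3 datastore at a time. I set this up via the GUI; the sketch below is only a rough idea of how two such jobs could look in /etc/pve/jobs.cfg, with made-up IDs, node names, storage name and times:

vzdump: backup-node1
    schedule 21:00
    node pve1
    all 1
    storage pbs-dadup
    enabled 1

vzdump: backup-node2
    schedule 01:00
    node pve2
    all 1
    storage pbs-dadup
    enabled 1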

Regards,
Albert
 
I think it's the maximum number of transactions per second; I assume the S3 backend needs time to process the transferred data.
 
Everything works, as long as you don't have two concurrent backups running (my cluster has two nodes, and both can start a backup).
As soon as I start backups on both nodes it fails; here is the output of both backup failure reports:

First one:
100: 2025-09-05 11:43:21 INFO: Starting Backup of VM 100 (qemu)
100: 2025-09-05 11:43:21 INFO: status = stopped
100: 2025-09-05 11:43:21 INFO: backup mode: stop
100: 2025-09-05 11:43:21 INFO: ionice priority: 7
100: 2025-09-05 11:43:21 INFO: VM Name: docky
100: 2025-09-05 11:43:21 INFO: include disk 'scsi0' 'Storage_NFS:100/vm-100-disk-0.qcow2' 32G
100: 2025-09-05 11:43:21 INFO: creating Proxmox Backup Server archive 'vm/100/2025-09-05T09:43:21Z'
100: 2025-09-05 11:43:21 INFO: starting kvm to execute backup task
100: 2025-09-05 11:43:23 INFO: started backup task '2e6e22bf-0e1d-45db-9f11-49e2ab375ea3'
100: 2025-09-05 11:43:23 INFO: scsi0: dirty-bitmap status: created new
100: 2025-09-05 11:43:26 INFO: 0% (324.0 MiB of 32.0 GiB) in 3s, read: 108.0 MiB/s, write: 102.7 MiB/s
100: 2025-09-05 11:43:29 INFO: 1% (596.0 MiB of 32.0 GiB) in 6s, read: 90.7 MiB/s, write: 90.7 MiB/s
100: 2025-09-05 11:43:32 INFO: 2% (692.0 MiB of 32.0 GiB) in 9s, read: 32.0 MiB/s, write: 32.0 MiB/s
100: 2025-09-05 11:43:39 INFO: 3% (1004.0 MiB of 32.0 GiB) in 16s, read: 44.6 MiB/s, write: 44.6 MiB/s
100: 2025-09-05 11:43:46 INFO: 4% (1.3 GiB of 32.0 GiB) in 23s, read: 46.3 MiB/s, write: 46.3 MiB/s
100: 2025-09-05 11:44:12 INFO: 4% (1.4 GiB of 32.0 GiB) in 49s, read: 5.4 MiB/s, write: 5.4 MiB/s
100: 2025-09-05 11:44:12 ERROR: backup write data failed: command error: write_data upload error: pipelined request failed: failed to upload chunk to s3 backend
100: 2025-09-05 11:44:12 INFO: aborting backup job
100: 2025-09-05 11:44:12 INFO: stopping kvm after backup task
100: 2025-09-05 11:44:12 ERROR: Backup of VM 100 failed - backup write data failed: command error: write_data upload error: pipelined request failed: failed to upload chunk to s3 backend

Second one:
106: 2025-09-05 11:43:43 INFO: Starting Backup of VM 106 (qemu)
106: 2025-09-05 11:43:43 INFO: status = stopped
106: 2025-09-05 11:43:43 INFO: backup mode: stop
106: 2025-09-05 11:43:43 INFO: ionice priority: 7
106: 2025-09-05 11:43:43 INFO: VM Name: freebsd14
106: 2025-09-05 11:43:43 INFO: include disk 'ide0' 'Storage_NFS:106/vm-106-disk-1.qcow2' 20G
106: 2025-09-05 11:43:43 INFO: creating Proxmox Backup Server archive 'vm/106/2025-09-05T09:43:43Z'
106: 2025-09-05 11:43:43 INFO: starting kvm to execute backup task
106: 2025-09-05 11:43:45 INFO: started backup task '49f3c06f-2807-470c-88b3-d321617f9e1b'
106: 2025-09-05 11:43:45 INFO: ide0: dirty-bitmap status: created new
106: 2025-09-05 11:43:48 INFO: 12% (2.4 GiB of 20.0 GiB) in 3s, read: 834.7 MiB/s, write: 153.3 MiB/s
106: 2025-09-05 11:44:09 INFO: 12% (2.5 GiB of 20.0 GiB) in 24s, read: 2.5 MiB/s, write: 2.5 MiB/s
106: 2025-09-05 11:44:09 ERROR: backup write data failed: command error: write_data upload error: pipelined request failed: failed to upload chunk to s3 backend
106: 2025-09-05 11:44:09 INFO: aborting backup job
106: 2025-09-05 11:44:09 INFO: stopping kvm after backup task
106: 2025-09-05 11:44:09 ERROR: Backup of VM 106 failed - backup write data failed: command error: write_data upload error: pipelined request failed: failed to upload chunk to s3 backend
 
OK, an addition here: I'm using the Tuxis daDup S3 storage.
For the S3 configuration I set Provider Quirks to "Skip If-None-Match header", and now I can back up a 64 GB Windows VM; before that only smaller backups worked fine. I will investigate further with multiple backups from multiple nodes and see if this fixes that too.
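
I assume this GUI setting maps to an extra provider-quirks line in the endpoint entry in /etc/proxmox-backup/s3.cfg; the exact key name is my guess from the GUI label, and the other values are placeholders as before:

s3-endpoint: dadup
    endpoint s3.example.com
    access-key XXXXXXXXXXXX
    secret-key XXXXXXXXXXXX
    put-rate-limit 10
    provider-quirks skip-if-none-match-header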
 
OK, so I am unable to back up a 64 GB VM (in this case a Windows 11 VM), and it is reproducible.

I had the task viewer open on both the PVE node and the PBS. Total cache use was well below the available space, so that wasn't the issue.
I also had an SSH session open to the PBS, showing htop.

The backup runs on the PVE node until it reaches 100%, then the backup processes on the PBS suddenly disappear and the PBS dashboard becomes unresponsive.

Here is the last part of the output in the PVE task viewer window:
[Screenshot: Taskviewer pve.jpg]

And here is the last bit of the task viewer on the PBS:

[Screenshot: Taskviewer pbs.jpg]

It all looks OK, but at the end of the backup something happens that kills the proxmox-backup processes, and the backup never finishes. A reboot of the PBS is necessary to regain control. After that, the ghost backup entry on the PVE node can be deleted, and a garbage collection is run on the PBS to clean up the mess. Here is the last part of the garbage collection output in the task viewer:

[Screenshot: Taskviewer garbage collect.jpg]