S3 backup status

I see many people having lots of problems with the S3 option for backup. Obviously, on forums like this it is usually the less successful stories that get posted.
So, how well is it working, and are there any caveats to look out for?

I'm looking into using this kind of backup soon. I want to use European storage (as I live in the Netherlands); any recommendations? I was thinking of Hetzner.

Also, is it better to back up to a local disk and use a remote sync to the S3 storage (if that is possible)?

I also read that you need local cache storage for S3; any recommendations on that? Does it depend on the maximum file size to be transferred?

Regards,
Albert
 
I've managed to set up a connection to "Intercolo", but it's not reliable. The initial backups of my VMs seem to work fine, but the subsequent incremental backups fail for my largest VM. The backup starts fine, all the way up to 100%, but then my PBS server fails. The backup job never finishes, the PBS web page doesn't respond, and systemctl tells me I have a degraded server. I haven't been able to spot anything in the logs. After a reboot, the PBS server seems to work again.

The cache for me is just the local directory /opt/s3-cache.
 
I've been using it for two weeks now, with an S3 connection to Tuxis daDup as a datastore on my PBS:

- It works, but only for VMs smaller than roughly 32 GB; VMs of 64 GB fail.
- The failed backups run up to 100%, then the backup job fails on the PVE node, PBS freezes immediately, and the web interface of PBS becomes inaccessible.
- Using the proxmox-backup-client works, also with encryption enabled, but only for backups not exceeding 32 GB in size (see above; a sketch of the invocation follows below this list).
- Restore works. When a backup has been made successfully, it can be restored successfully.
- "Verify after backup" (an option) is not a good idea, as verification takes quite some time.
- Garbage collection on the S3 datastore works fine. I needed it after the failed large backups; space is recovered with it.
- Use of the cache is OK up to about 60 GB. A separate partition or disk is recommended. However, it is not easy to replace the cache disk if it fails. You might have to clear the S3 datastore and start fresh to get rid of the errors when the cache starts empty on a new disk. A procedure to swap the cache disk is needed.
- Synchronizing a local datastore with an S3 datastore on the same PBS does not work (yet; I will investigate). Do I need to use the actual IP address of the PBS, or can I use localhost when setting up the sync? Using root@pam as the sync user fails: it sees the "remote" datastore and namespaces, but the actual sync fails because the user does not have access rights.
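For reference, roughly the kind of client invocation I'm using (the repository, user, key path and source path here are placeholders, not my real values):

# create an encryption key once, stored on the client
proxmox-backup-client key create /root/pbs-encryption.key
# back up a directory as a .pxar archive to the S3-backed datastore, encrypted with that key
proxmox-backup-client backup data.pxar:/srv/data --repository backupuser@pbs@pbs.example.lan:s3store --keyfile /root/pbs-encryption.key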

I know it is a technology preview and should currently be treated as such, as it is not stable yet.
The freeze mentioned above has been reported by several people.
Hopefully fixes will come soon.

Regards,
Albert
 
- It works, but only for VMs smaller than roughly 32 GB; VMs of 64 GB fail.
Most likely you are running into the timeout caused by the network congestion issue; a fix for this has already been applied in git and will be included in proxmox-backup version 4.0.15. https://git.proxmox.com/?p=proxmox.git;a=commit;h=1c33b8bcdab1e0415084e058670d149b1015143e Also, a shared rate limiter implementation is in the works, see https://lore.proxmox.com/pbs-devel/20250828102604.463662-1-c.ebner@proxmox.com/T/
- The failed backups run up to 100%, then the backup job fails on the PVE node, PBS freezes immediately, and the web interface of PBS becomes inaccessible.
Fixed in git, will be included in proxmox-backup version 4.0.15 as well, see https://git.proxmox.com/?p=proxmox-backup.git;a=commit;h=dbaa4b6992e504007c0ce9b26c34c6f52f996dac
- Using the proxmox-backup-client works, also with encryption enabled, but only for backups not exceeding 32 GB in size (see above).
Same request timeout issue as above, I guess.
- Restore works. When a backup has been made successfully, it can be restored successfully.
- "Verify after backup" (an option) is not a good idea, as verification takes quite some time.
- Garbage collection on the S3 datastore works fine. I needed it after the failed large backups; space is recovered with it.
- Use of the cache is OK up to about 60 GB. A separate partition or disk is recommended. However, it is not easy to replace the cache disk if it fails. You might have to clear the S3 datastore and start fresh to get rid of the errors when the cache starts empty on a new disk. A procedure to swap the cache disk is needed.
You can set the datastore into the "offline" maintenance mode and then manually adapt the path in the datastore config. After that you clear the maintenance mode and do an S3 refresh.
- Synchronizing a local datastore with an S3 datastore on the same PBS does not work (yet; I will investigate). Do I need to use the actual IP address of the PBS, or can I use localhost when setting up the sync? Using root@pam as the sync user fails: it sees the "remote" datastore and namespaces, but the actual sync fails because the user does not have access rights.
This does work without issues; you can either use a local pull sync job or just set up the PBS instance as its own remote using localhost (a sketch follows below).
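As an example of the localhost-remote variant (the names, user, password, fingerprint and schedule below are placeholders, and the exact option names should be checked against proxmox-backup-manager --help; treat this as a sketch only):

# register this PBS instance as a remote that points at itself
proxmox-backup-manager remote create local-pbs --host localhost --auth-id sync@pbs --password 'xxxxx' --fingerprint <certificate-fingerprint>
# pull sync job on the S3-backed datastore, pulling from the local datastore
proxmox-backup-manager sync-job create local-to-s3 --store s3store --remote local-pbs --remote-store localstore --schedule daily

A local pull sync job (without a remote) avoids the localhost round trip; both variants can also be set up in the web interface.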
 
Thanks!

That's great news. I never realised I could put the datastore into maintenance mode, but I will try it out. Easy solution!

Regards
Albert
 
Hi Chris,

I tried it out, but it is not quite what it should be yet:

- Put the S3 datastore in maintenance
- Edited /etc/proxmox-backup/datastores.cfg to point to a new (empty) cache in /cache
- Created the empty directory /cache/.chunks (I needed .chunks because I got an error without it)
- Went to the S3 datastore's Content tab and found the refresh option under the More button (maybe better to make it a direct button, like Reload)
- Took the S3 datastore out of maintenance, went to Content, and got this error:
Error2.png
- Did an S3 refresh, which worked, but got the following error, even after a Reload:
Error1.png
- Checking the /cache/.chunks directory shows no files
- Set the owner, group, and permissions the same as the original cache:
chmod 755 /cache, chmod 750 /cache/.chunks, chown -R backup:backup /cache
- The errors are gone and I can do an S3 refresh, but the contents of the S3 datastore stay empty, even after a reload
- Check /cache/.chunks, still empty

I might be missing something here.

Regards,
Albert
P.S. The same procedure in reverse (going back to the original cache) works well, except for one (empty) namespace that just disappears after the procedure.
 
- Put the S3 datastore in maintenance
- Edited /etc/proxmox-backup/datastores.cfg to point to a new (empty) cache in /cache
- Created the empty directory /cache/.chunks (I needed .chunks because I got an error without it)
- Went to the S3 datastore's Content tab and found the refresh option under the More button (maybe better to make it a direct button, like Reload)
- Took the S3 datastore out of maintenance, went to Content, and got this error:
Error2.png

- Did an S3 refresh, which worked, but got the following error, even after a Reload:
Error1.png

- Checking the /cache/.chunks directory shows no files
- Set the owner, group, and permissions the same as the original cache:
chmod 755 /cache, chmod 750 /cache/.chunks, chown -R backup:backup /cache
Well, PBS expects the cache folder to contain the required folder layout and permissions, so you will either have to create that yourself (for example by creating a regular datastore there and then removing it without destroying its content; see the sketch after this reply), or you will have to move the existing folder structure over. There is currently no automated option to move/recreate the cache, so manual adaptation is required.

- The errors are gone and I can do an S3 refresh, but the contents of the S3 datastore stay empty, even after a reload
- Check /cache/.chunks, still empty
What does the S3 refresh task log show? Are there any errors? The chunks are only fetched on demand, so it is not surprising that these are not present; the backup namespace, group and snapshot files should, however, be fetched.
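A rough sketch of that approach (the temporary datastore name and the path are examples; removing a datastore entry from the config does not delete the data on disk):

# let PBS create the expected layout and permissions at the new cache location
proxmox-backup-manager datastore create tmp-cache /cache
# drop only the config entry again; the directory structure in /cache stays behind
proxmox-backup-manager datastore remove tmp-cache
# then, with the S3 datastore in maintenance mode, point its cache path at /cache in the datastore config, clear maintenance mode and trigger an S3 refresh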
 
I understand that the actual chunks are fetched on demand; it's a cache. The refresh task shows content being retrieved, but nothing ends up in the new cache. I see no namespaces and no data. Does something need to be restarted in order to pick up the new cache? Maybe the refresh still uses the original location.
I will try again, this time rebooting the PBS to pick up the new cache.

OK, I also had to create an ns directory, so the layout is:

root@pbs:/# ls -laR /cache
.:
total 12
drwxr-xr-x 3 root root 4096 Sep 16 09:57 .
drwx------ 7 root root 4096 Sep 16 09:57 ..
drwxr-xr-x 4 backup backup 4096 Sep 16 09:57 cache

/cache:
total 16
drwxr-xr-x 4 backup backup 4096 Sep 16 09:57 .
drwxr-xr-x 3 root root 4096 Sep 16 09:57 ..
drwxr-x--- 2 backup backup 4096 Sep 16 09:57 .chunks
drwxr-xr-x 2 backup backup 4096 Sep 16 09:57 ns

/cache/.chunks:
total 8
drwxr-x--- 2 backup backup 4096 Sep 16 09:57 .
drwxr-xr-x 4 backup backup 4096 Sep 16 09:57 ..

/cache/ns:
total 8
drwxr-xr-x 2 backup backup 4096 Sep 16 09:57 .
drwxr-xr-x 4 backup backup 4096 Sep 16 09:57 ..

Where backup is my backup user.

No need for a reboot, although it can take a minute or so for the backups to show up under the namespaces. As for the namespace that disappeared (as mentioned before): I hadn't made a backup to it yet, so it probably does not exist in the S3 bucket, which explains it.

A few recommendations/requests:

- Make the "Refresh contents from S3 bucket" a standalone button, e.g. "S3 Refresh" or something, and not hidden under More, that would be clearer.
- Cache button, so you can see the location, clear the cache, initialize the cache (i.e. create the directory structure), that would be helpful. This could actually be hidden under the More button

But the procedure above works fine. Thanks for the help.

Kind regards,
Albert
 
proxmox-backup-server version 4.0.15-1 is available in the pbs-test repository. You might want to test if the included bugfixes solve your issue, thanks. To activate the test repo, please see https://pbs.proxmox.com/docs/installation.html#proxmox-backup-test-repository
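For reference, activating the test repository comes down to something like this (assuming a PBS 4.x install on Debian Trixie; the linked documentation has the authoritative repository line):

# add the pbs-test repository
echo 'deb http://download.proxmox.com/debian/pbs trixie pbs-test' > /etc/apt/sources.list.d/pbs-test.list
# refresh and pull in the test build
apt update
apt full-upgrade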
 
I started two jobs at the same time: a backup-client job of 140 GB and a backup of a Windows 11 VM with 64 GB.

The Windows 11 backup reached 100% and then nothing. The system did not freeze, but there is no further progress.
What's more, the client backup doesn't progress anymore either. Here is the last output of the backup client; as you can see, nothing is being backed up anymore:

processed 50.504 GiB in 14m, uploaded 50.347 GiB
processed 53.766 GiB in 15m, uploaded 53.598 GiB
processed 57.187 GiB in 16m, uploaded 57.008 GiB
processed 60.259 GiB in 17m, uploaded 60.086 GiB
processed 63.095 GiB in 18m, uploaded 62.912 GiB
processed 66.685 GiB in 19m, uploaded 66.512 GiB
processed 69.986 GiB in 20m, uploaded 69.815 GiB
processed 73.421 GiB in 21m, uploaded 73.25 GiB
processed 75.701 GiB in 22m, uploaded 75.311 GiB
processed 75.701 GiB in 23m, uploaded 75.311 GiB
processed 75.701 GiB in 24m, uploaded 75.311 GiB
processed 75.701 GiB in 25m, uploaded 75.311 GiB
processed 75.701 GiB in 26m, uploaded 75.311 GiB


And here the relevant bit from the VM backup:

vm backup.jpg


The web interface is still operable; when I check the status of the VM backup it says "running":

vm status.jpg

When I look in the Task Viewer on the PBS I see that no chunks are added; this is the last bit, and nothing is added after it:

chunks.jpg
 
FYI, I was confused and thought that the web interface was still working; only when I closed the task viewer window and tried to open it again did the web interface time out. Also, the moment I rebooted the PBS, the proxmox-backup-client (which was still running without progress) exited with an error.
Looks like a deadlock somewhere.
 
Anyway, I just did a single backup of a PVE VM (Windows 11, 64 GB) and it freezes at 100%, so the fact that I had two backups running at the same time (one VM and one backup-client) made no difference.
 
Ok, I'll eat my words, partly.

This morning I encountered a problem backing up to the S3 storage, but managed to fix it with an S3 Refresh *). So I thought, let's try again with the proxmox-backup-client, backing up (mounted) SMB shares, one of 47 GB, the other of 130 GB. And lo and behold, they succeeded. Success!

Then I backed up my 64 GB Windows 11 VM, first to the (local) storage without an issue.
Then I backed it up to my S3 storage, obviously with lots of messages that the chunks are already in the cache, which is exactly what you would expect.
However, this backup freezes again after reaching 100%, locking up the PBS web interface.

So, in short: yes, the test-repository version works fine for backup-client backups, but not for PVE VM backups.

Hope this helps,

Regards,
Albert

*) I had made a script that mounts the SMB share and then starts the backup. For one share it worked; for the other I got the message "proxmox backup client Error: fetching owner failed". The access settings were fine, so I couldn't find the issue. As a last resort I did an S3 Refresh, and that fixed it. It must have been leftovers from all the earlier failures, I guess.
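The script is essentially just a mount followed by the client run, something like this (the share, mount point, credentials file and repository are placeholders for my real values):

#!/bin/bash
set -e
# mount the SMB share read-only for the duration of the backup
mount -t cifs //nas.example.lan/share /mnt/share -o ro,credentials=/root/.smbcredentials
# back it up as a .pxar archive to the S3-backed datastore
proxmox-backup-client backup share.pxar:/mnt/share --repository backupuser@pbs@pbs.example.lan:s3store
umount /mnt/share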
 
Then I backed up my 64 GB Windows 11 VM, first to the (local) storage without an issue.
Then I backed it up to my S3 storage, obviously with lots of messages that the chunks are already in the cache, which is exactly what you would expect.
However, this backup freezes again after reaching 100%, locking up the PBS web interface.
Can you please share:
  • the VM config for that particular VM, qm config <VMID> --current
  • The backup task log on the PBS as well as on the PVE host
  • The systemd journal of the PBS host from the full backup run timespan, journalctl --since <DATETIME> --until <DATETIME> > journal.txt
  • Are there other tasks being run while the backup is ongoing? E.g. prune, gc, ...
Just did a backup of a Windows 11 VM with a 150 GB disk without issues, so other factors might play a role here.
 
Hi Chris,
Here's the info requested, the config:

root@pve1:~# qm config 103 --current
agent: 1
bios: ovmf
boot: order=scsi0;ide0;net0
cores: 2
cpu: x86-64-v2-AES
efidisk0: Storage_NFS:103/vm-103-disk-2.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
ide0: none,media=cdrom
lock: backup
machine: pc-q35-8.1
memory: 8192
meta: creation-qemu=8.1.5,ctime=1716035235
name: Windows11
net0: virtio=BC:24:11:C4:CC:93,bridge=vmbr0,firewall=1
numa: 0
ostype: win11
scsi0: Storage_NFS:103/vm-103-disk-3.qcow2,discard=on,iothread=1,size=64G
scsihw: virtio-scsi-single
smbios1: uuid=8231ea81-9d77-40e4-bb26-d3be1e1c7723
sockets: 1
tpmstate0: Storage_NFS:103/vm-103-disk-1.raw,size=4M,version=v2.0
vga: virtio
vmgenid: 3c399401-fec8-4d8f-b7d3-6942bc680005


The logs on the PBS:

root@pbs:~# journalctl --since "30 min ago" > journal.txt
Sep 18 09:49:45 pbs proxmox-backup-proxy[874]: rrd journal successfully committed (33 files in 0.009 seconds)
Sep 18 10:12:28 pbs sshd-session[12006]: Accepted password for root from 192.168.2.110 port 58217 ssh2
Sep 18 10:12:28 pbs sshd-session[12006]: pam_unix(sshd:session): session opened for user root(uid=0) by root(uid=0)
Sep 18 10:12:28 pbs systemd-logind[682]: New session 2 of user root.
Sep 18 10:12:28 pbs systemd[1]: Created slice user-0.slice - User Slice of UID 0.
Sep 18 10:12:28 pbs systemd[1]: Starting user-runtime-dir@0.service - User Runtime Directory /run/user/0...
Sep 18 10:12:28 pbs systemd[1]: Finished user-runtime-dir@0.service - User Runtime Directory /run/user/0.
Sep 18 10:12:28 pbs systemd[1]: Starting user@0.service - User Manager for UID 0...
Sep 18 10:12:28 pbs (systemd)[12012]: pam_unix(systemd-user:session): session opened for user root(uid=0) by root(uid=0)
Sep 18 10:12:28 pbs systemd-logind[682]: New session 3 of user root.
Sep 18 10:12:28 pbs systemd[12012]: Queued start job for default target default.target.
Sep 18 10:12:28 pbs systemd[12012]: Created slice app.slice - User Application Slice.
Sep 18 10:12:28 pbs systemd[12012]: Reached target paths.target - Paths.
Sep 18 10:12:28 pbs systemd[12012]: Reached target timers.target - Timers.
Sep 18 10:12:28 pbs systemd[12012]: Listening on dirmngr.socket - GnuPG network certificate management daemon.
Sep 18 10:12:28 pbs systemd[12012]: Listening on gpg-agent-browser.socket - GnuPG cryptographic agent and passphrase cache (access for web browsers).
Sep 18 10:12:28 pbs systemd[12012]: Listening on gpg-agent-extra.socket - GnuPG cryptographic agent and passphrase cache (restricted).
Sep 18 10:12:28 pbs systemd[12012]: Starting gpg-agent-ssh.socket - GnuPG cryptographic agent (ssh-agent emulation)...
Sep 18 10:12:28 pbs systemd[12012]: Starting gpg-agent.socket - GnuPG cryptographic agent and passphrase cache...
Sep 18 10:12:28 pbs systemd[12012]: Listening on keyboxd.socket - GnuPG public key management service.
Sep 18 10:12:28 pbs systemd[12012]: Starting ssh-agent.socket - OpenSSH Agent socket...
Sep 18 10:12:28 pbs systemd[12012]: Listening on gpg-agent.socket - GnuPG cryptographic agent and passphrase cache.
Sep 18 10:12:28 pbs systemd[12012]: Listening on ssh-agent.socket - OpenSSH Agent socket.
Sep 18 10:12:28 pbs systemd[12012]: Listening on gpg-agent-ssh.socket - GnuPG cryptographic agent (ssh-agent emulation).
Sep 18 10:12:28 pbs systemd[12012]: Reached target sockets.target - Sockets.
Sep 18 10:12:28 pbs systemd[12012]: Reached target basic.target - Basic System.
Sep 18 10:12:28 pbs systemd[12012]: Reached target default.target - Main User Target.
Sep 18 10:12:28 pbs systemd[12012]: Startup finished in 182ms.
Sep 18 10:12:28 pbs systemd[1]: Started user@0.service - User Manager for UID 0.
Sep 18 10:12:28 pbs systemd[1]: Started session-2.scope - Session 2 of User root.

As for the backup task log on the PBS: since the system freezes, I cannot download it. A screenshot would only show the chunks, and then it stops; there are no other messages to be seen. I have to reboot the PBS to regain control.
There are no other tasks running.

If there are more files I can check, please let me know.
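One way to still capture the log from the run in which it hung (assuming journald is configured with persistent storage, so the journal survives the reboot) is to read the previous boot afterwards:

# after the forced reboot, dump the journal of the previous boot
journalctl -b -1 > journal-previous-boot.txt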

I tried an Ubuntu VM with a 64 GB disk and that worked fine, so it looks like backing up the Windows 11 64 GB VM specifically gives a problem. Just a thought: could the small TPM disk be a problem here?
 
Ok Chris,

I got a bit further:

- The Windows 11 VM is the only VM I have with an EFI disk.
- The Windows 11 VM has a TPM disk as well.

So I cloned the Windows 11 VM to a new machine, deleted the TPM disk, started a backup to the S3 datastore, and bingo, it works!
Then I added the TPM disk back to this VM, and the backup still works.
So I cloned the Windows 11 VM again to another new machine, didn't remove the TPM or anything, tried the backup of that, and it works.
Now I backed up the original Windows 11 VM, and lo and behold, it works now too.

So, whatever it was, it seems to work now (it didn't this morning); I have no idea why.

Usually I would say "hope this helps", but I don't think that's appropriate here.

Regards,
Albert
 
Did something else change on the initial VM? Are the disks still backed up in the same order? It could of course just be a slightly different access pattern or timing that no longer triggers the hang.

Thanks for your testing and sharing your findings!
 
Hi Chris,
no, nothing changed on the VM. The only change on the PVE nodes was yesterday's update; I think there was a QEMU update as well, but it didn't seem to have any relation to the problem. Since it is a (non-registered) Windows 11 VM, just for testing, I will delete it, restore it from the local backup (which always worked fine), and see if I can back that up to the S3 datastore. I have no idea why it suddenly worked, but I'm happy that it is all starting to work.

Kind regards,
Albert
P.S. After the PBS updates are officially released (i.e. no longer only in testing), I assume I can disable the test repository and roll along with the normal updates?
 
P.S. After the PBS updates are officially released (i.e. no longer only in testing), I assume I can disable the test repository and roll along with the normal updates?
Yes, disabling the pbs-test repo is enough.
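Assuming the repository was added as a plain sources.list entry like in the sketch above, disabling it is just removing (or commenting out) that entry and refreshing:

# drop the test repository entry again and refresh the package index
rm /etc/apt/sources.list.d/pbs-test.list
apt update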