[SOLVED] PBS 3.3.1: Backup tasks hang after concluding uploads

Hi Chris,
thanks for the speedy reply.
When will this patch be available?
Do you mean downgrading from 3.3.2 to 3.2.11? If so, could you tell me how?

TIA
 
When will this patch be available?
I cannot give a precise ETA, but a new version will most likely be available within the next 2 weeks if no issues are found during internal testing and QA.

Do you mean downgrading from 3.3.2 to 3.2.11? If so, could you tell me how?
Downgrading is only an option if you do not already rely on any of the features introduced in versions higher than 3.2.11-1.

Most notably:
  • the sync jobs in push direction (as the config was changed with 3.2.13-1)
  • the removable datastores
  • notification webhooks
  • resync of corrupt snapshots
If such features are already in use or configured, a downgrade will fail unless they are removed first. Otherwise, a downgrade can be performed via apt install proxmox-backup-server=3.2.11-1. It is, however, strongly recommended to test this in a testing environment with your configuration first, to avoid issues.
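
For reference, a minimal dry-run sketch of that downgrade. Only the apt install line comes from this post; the apt-cache policy check and the apt-mark hold step are assumptions added for illustration, as are the run and downgrade_plan helper names. Nothing is executed unless APPLY=1 is set.

```shell
#!/bin/sh
# Dry-run sketch of the downgrade described above.
# Commands are only printed unless APPLY=1 is set in the environment.
PKG="proxmox-backup-server"
TARGET="3.2.11-1"   # version named in the reply above

run() {
    # Print the command; execute it only when APPLY=1.
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    fi
}

downgrade_plan() {
    # Assumption: first confirm the target version is still offered by a
    # configured repository, then run the downgrade command from the reply.
    # The final hold (also an assumption, not from the thread) keeps a
    # routine upgrade from pulling the newer version back in.
    run apt-cache policy "$PKG" &&
    run apt install "$PKG=$TARGET" &&
    run apt-mark hold "$PKG"
}

downgrade_plan
```

Run it once as-is to review the printed commands, and only then with APPLY=1 as root on a test system. The hold would need to be released with apt-mark unhold before upgrading to a fixed version later.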

Another option would be to build a Debian package directly from the sources, by following the Build section of the readme found at https://git.proxmox.com/?p=proxmox-...29a761867d98247097f4fdb3f44c00d58fe90;hb=HEAD
 
The proxmox-backup-server package version including the reverting commit has been bumped to 3.3.3-1 (see the changelog [0]) and is available in the pbstest repo at the time of writing.

So if you do not want to wait until the package is moved over to the no-subscription and enterprise repositories after a broader testing phase, you can upgrade to this version by enabling the pbstest repo as described in the docs [1]. Please do not forget to revert back to your regular repository after upgrading.

[0] https://git.proxmox.com/?p=proxmox-...0;hb=d986714201591c167e5e23cb1a293557679d4ec7
[1] https://pbs.proxmox.com/docs/installation.html#proxmox-backup-test-repository
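
A rough sketch of the enable, upgrade, revert cycle described above. It assumes PBS 3.x on Debian Bookworm; the file name pbstest.list and the function names are made up for illustration, and the repository line should be checked against the docs [1] before use.

```shell
#!/bin/sh
# Sketch: temporarily enable the pbstest repository, upgrade, then revert.
LIST="${LIST:-/etc/apt/sources.list.d/pbstest.list}"  # assumed file name
CODENAME="${CODENAME:-bookworm}"                       # assumption: PBS 3.x on Debian 12

pbstest_line() {
    # Repository line for the given Debian codename; verify against docs [1].
    printf 'deb http://download.proxmox.com/debian/pbs %s pbstest\n' "$1"
}

upgrade_via_pbstest() {
    # Must run as root. Enables the test repo and upgrades only the server
    # package; the test repo is removed again even if the upgrade fails.
    pbstest_line "$CODENAME" > "$LIST"
    apt update && apt install proxmox-backup-server
    rm -f "$LIST"
    apt update
}

# By default only show the line that would be written;
# call upgrade_via_pbstest explicitly to apply it.
pbstest_line "$CODENAME"
```

Installing only proxmox-backup-server keeps the number of packages pulled from the test repository small; whether that is sufficient for a given setup is not confirmed in this thread.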
 
The patch reverting the additional check has already been applied [0] and will be packaged with the next version of proxmox-backup-server. Alternatively, downgrading to 3.2.11-1 is an option if none of the features like removable datastores or sync jobs in push direction are configured.

[0] https://git.proxmox.com/?p=proxmox-backup.git;a=commit;h=b72bdf4156b694bc5404bf18f7e1e59dc1195c86

Edit: Included link to patch for reference.
We have a problem with our current backup setup (PBS 3.3.3-1).

We used to back up to two different Proxmox Backup Servers.
This worked fine despite the dirty bitmap.

After updating to 3.3 with the feature (backup: stat known chunks on backup finish) the problems started.
Everything took so long that we could no longer do it.
While waiting for the fix, we increased the intervals between backups to avoid timeouts (qmp command 'backup' failed - got timeout). We also switched to a sync job for the other backup.
With this configuration everything should work better, unfortunately it doesn't.
Backups are not as fast as before.
There are several timeouts, especially at the beginning of the week (we tried to reduce the number of workers, and got a slight improvement).
When the backup problems occur, the sync takes so long that it does not finish in time (a direct backup with a corrupted bitmap seems faster).
We would like to downgrade to the previous version (3.2.11-1). What should we look out for?

1) Sync does not work, should we go back to direct backup?
2) We need to downgrade the two Backup servers. Is there a special command?
3) Do we also need to downgrade the Backup clients on PVE?

Thanks
 
After updating to 3.3 with the feature (backup: stat known chunks on backup finish) the problems started.
Everything took so long that we could no longer do it.
This should however not be the case anymore with the latest version, as the patch introducing the stating of known chunks has been reverted.

While waiting for the fix, we increased the intervals between backups to avoid timeouts (qmp command 'backup' failed - got timeout). We also switched to a sync job for the other backup.
Can you share the configuration of the backup job and the sync job? Are their schedules overlapping?

Backups are not as fast as before.
There are several timeouts, especially at the beginning of the week (we tried to reduce the number of workers, and got a slight improvement).
When the backup problems occur, the sync takes so long that it does not finish in time (a direct backup with a corrupted bitmap seems faster).
Please share some more information here; the backup task log and the sync job task log would be of interest.

We would like to downgrade to the previous version (3.2.11-1). What should we look out for?
It would make sense to investigate the issue rather than downgrade, so that it can be fixed if there is a bug or misconfiguration.
 
This should however not be the case anymore with the latest version, as the patch introducing the stating of known chunks has been reverted.


Can you share the configuration of the backup job and the sync job? Are their schedules overlapping?
(Attached screenshot: 1742903740753.png)

Previously the backups started every 10 minutes and that always worked fine.
The synchronisation starts once the backups have completed (at about 02:30).
Please share some more information here; the backup task log and the sync job task log would be of interest.
Logs are attached. We stopped the synchronisation because it takes too long: a single snapshot took from 02:54 to 11:11.

The synchronised backup was done on Sunday and it took long because the garbage collection was running.

Hardware
Site 1
PBS -> NFS -> 1 Synology (HDD)

Site 2
PBS -> NFS -> 1 Synology (HDD) NAS

As I said, it worked with 3.2 and 2 backups with a daily dirty bitmap.
It would make sense to investigate the issue rather than downgrade, so that it can be fixed if there is a bug or misconfiguration.
I understand, but can I still get my question answered? What steps do I need to take?
If you need more information, please let me know.





EDIT:
I also noticed that the host swap is 96.97%, we have swappiness at 20.
Before the update, swap was never high.
 

I also noticed that the host swap is 96.97%, we have swappiness at 20.

So I assume that your PBS host might be suffering from high memory pressure then? This could explain why you see both slow syncs and timeout errors during backup. Check which process is taking up your memory and post your proxmox-backup-manager version --verbose. Do you see the same for different kernel versions? Are you using ZFS?
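
One possible way to gather the memory information asked for here, as a sketch. It reads /proc/meminfo and uses the procps ps; the function names swap_used_pct and top_mem are hypothetical helpers, not Proxmox tooling.

```shell
#!/bin/sh
# Sketch: report swap usage and the largest memory consumers on a host.

swap_used_pct() {
    # Reads /proc/meminfo-style text on stdin and prints used swap in percent,
    # or "N/A" when no swap is configured.
    awk '/^SwapTotal:/ { t = $2 }
         /^SwapFree:/  { f = $2 }
         END {
             if (t > 0) { printf "%.2f\n", (t - f) * 100 / t }
             else       { print "N/A" }
         }'
}

top_mem() {
    # Ten processes with the largest resident set size (RSS, in KiB),
    # plus the header line.
    ps -eo pid,rss,comm --sort=-rss | head -n 11
}

if [ -r /proc/meminfo ]; then
    echo "swap used: $(swap_used_pct < /proc/meminfo)%"
    top_mem
fi
```

Running this on the PBS host rather than on the PVE nodes shows whether the backup server itself is under memory pressure.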

1) Sync does not work, should we go back to direct backup?
I assume this is a sync job in push direction? If so, the sync source acts just like a regular client. What is your PBS version on the sync target? Check the task logs on the target host, they could tell more about issues.
2) We need to downgrade the two Backup servers. Is there a special command?
See my previous response, although I do not recommend downgrading, as that is not a tested procedure: https://forum.proxmox.com/threads/p...g-after-concluding-uploads.158812/post-747163
3) Do we also need to downgrade the Backup clients on PVE?
No, the clients are compatible with these server versions.
 
So I assume that your PBS host might be suffering from high memory pressure then? This could explain why you see both slow syncs and timeout errors during backup. Check which process is taking up your memory and post your proxmox-backup-manager version --verbose. Do you see the same for different kernel versions? Are you using ZFS?
Sorry, the swap usage I mentioned is on the PVE hosts.
PVE 1 -> RAM usage 58.96% (148.20 GiB of 251.35 GiB), SWAP usage 0.01% (512.00 KiB of 8.00 GiB)
PVE 2 -> RAM usage 58.69% (147.51 GiB of 251.35 GiB), SWAP usage 10.98% (899.50 MiB of 8.00 GiB)
PVE 4 -> RAM usage 70.50% (88.41 GiB of 125.40 GiB), SWAP usage N/A
PVE 5 -> RAM usage 62.74% (78.67 GiB of 125.40 GiB), SWAP usage 99.45% (7.96 GiB of 8.00 GiB)
PVE 6 -> RAM usage 53.01% (133.16 GiB of 251.20 GiB), SWAP usage 100.00% (8.00 GiB of 8.00 GiB)
PVE 7 -> RAM usage 29.33% (73.71 GiB of 251.31 GiB), SWAP usage 100.00% (8.00 GiB of 8.00 GiB)

I assume this is a sync job in push direction? If so, the sync source acts just like a regular client. What is your PBS version on the sync target? Check the task logs on the target host, they could tell more about issues.
Same Version
proxmox-backup: 3.3.0 (running kernel: 6.8.12-8-pve)
proxmox-backup-server: 3.3.3-1 (running version: 3.3.3)

Sync logs are attached.

See my previous response, although I do not recommend downgrading, as that is not a tested procedure: https://forum.proxmox.com/threads/p...g-after-concluding-uploads.158812/post-747163

No, the clients are compatible with these server versions.
Thanks.

If you need more information or tests, please let me know.
 


There is nothing wrong with the PVE host swapping out less frequently used pages, so that is not related to the PBS sync job at all.

What does the network between the PVE host and the PBS host look like? Are you sure you are not congesting it during backup? What about the network between the 2 PBS instances, what network speed are you expecting there?

Also, since both of your datastores are located on network-attached storage, you will have to take into account that the reading, sending and writing all have to go over the network, not just the sending. That adds latency. I suggest you start out by testing the sync speeds from local storage to local storage for both PBS instances. The same goes for the PVE backups. Can you rule out that the NFS storage is overloaded during backup? Check the metrics there...

Also check your I/O delay during sync...
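
A small sketch for sampling I/O wait from the command line while a sync is running. It assumes that the iowait counter in /proc/stat is a reasonable stand-in for the I/O delay meant here; the function name iowait_pct and the INTERVAL variable are made up for illustration.

```shell
#!/bin/sh
# Sketch: share of CPU time spent waiting on I/O between two samples of the
# aggregate "cpu" line in /proc/stat.

iowait_pct() {
    # $1 and $2: first and second "cpu ..." line from /proc/stat.
    # Fields 2..9 are user, nice, system, idle, iowait, irq, softirq, steal;
    # field 6 is the iowait counter.
    printf '%s\n%s\n' "$1" "$2" | awk '
        { tot = 0; for (i = 2; i <= 9 && i <= NF; i++) tot += $i }
        NR == 1 { t0 = tot; w0 = $6 }
        NR == 2 {
            dt = tot - t0
            if (dt > 0) { printf "%.1f\n", ($6 - w0) * 100 / dt }
            else        { print "0.0" }
        }'
}

if [ -r /proc/stat ]; then
    first=$(head -n 1 /proc/stat)
    sleep "${INTERVAL:-2}"
    second=$(head -n 1 /proc/stat)
    echo "iowait over ${INTERVAL:-2}s: $(iowait_pct "$first" "$second")%"
fi
```

Sampling this on both PBS hosts during a sync, and comparing it with an idle period, gives a first indication of whether the NFS-backed datastore is the bottleneck.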