Backup hangs on "Waiting for server to finish backup validation..."

romanm

Member
Nov 26, 2022
Hello,

we are investigating "hanging" backup jobs in our environment. We have 7 PVE nodes that back up to a single PBS server, and backup jobs sometimes hang at "Waiting for server to finish backup validation..."

Configuration of PBS:
- AMD EPYC 7313P
- MDADM RAID5 built from 12x Samsung PM9A3 15.36 TB (NVMe)
- stripe_cache_size = 8192
- group_thread_cnt = 8
- using raw LVM
- mounted 75TB logical volume with ext4
- server is in same network as PVE, no firewall

Datastore size: 720 groups, 7 days retention, on-disk usage 41.895 TiB, on-disk chunks 19,077,285.
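
As a rough sanity check on these figures, the average on-disk chunk size can be estimated by dividing the reported usage by the reported chunk count. This is only a sketch: it assumes the on-disk usage is dominated by chunk data, which the numbers above do not confirm.

```python
TIB = 1024 ** 4
MIB = 1024 ** 2

on_disk_bytes = 41.895 * TIB   # "On-Disk usage" from the post
chunk_count = 19_077_285       # "On-Disk chunks" from the post

# Assumption: index files and other metadata are negligible next to chunk data.
avg_chunk_mib = on_disk_bytes / chunk_count / MIB
print(f"average on-disk chunk size: {avg_chunk_mib:.2f} MiB")
```

That works out to roughly 2.3 MiB per chunk on disk.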

Code:
Part of backup log:
INFO:  43% (5.4 GiB of 12.4 GiB) in 6s, read: 945.3 MiB/s, write: 150.7 MiB/s
INFO:  65% (8.2 GiB of 12.4 GiB) in 9s, read: 953.3 MiB/s, write: 132.0 MiB/s
INFO:  91% (11.4 GiB of 12.4 GiB) in 12s, read: 1.1 GiB/s, write: 89.3 MiB/s
INFO: 100% (12.4 GiB of 12.4 GiB) in 15s, read: 361.3 MiB/s, write: 53.3 MiB/s
INFO: Waiting for server to finish backup validation...
INFO: backup is sparse: 1.25 GiB (10%) total zero data
INFO: backup was done incrementally, reused 28.34 GiB (94%)
INFO: transferred 12.45 GiB in 123 seconds (103.6 MiB/s)

As you can see, the problem does not seem to be in the "write" phase or the network, but in the "confirmation" of the backup?
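
To put a number on that, the two relevant log lines can be compared directly. The sketch below assumes that the "123 seconds" in the summary line is the duration of the whole task and that everything after the 100% mark is spent waiting for the server; neither assumption is confirmed by the log itself.

```python
import re

# Two lines copied verbatim from the backup log above.
log = """\
INFO: 100% (12.4 GiB of 12.4 GiB) in 15s, read: 361.3 MiB/s, write: 53.3 MiB/s
INFO: transferred 12.45 GiB in 123 seconds (103.6 MiB/s)
"""

# Seconds elapsed when the progress counter reached 100%.
upload_done_s = int(re.search(r"100% \(.*?\) in (\d+)s", log).group(1))

# Total size and duration reported in the final summary line.
m = re.search(r"transferred ([\d.]+) GiB in (\d+) seconds", log)
total_gib, total_s = float(m.group(1)), int(m.group(2))

# Assumption: the gap between the two timestamps is the wait for validation.
waiting_s = total_s - upload_done_s
avg_mib_s = total_gib * 1024 / total_s

print(f"upload finished after {upload_done_s}s, task took {total_s}s")
print(f"time after upload: {waiting_s}s ({waiting_s / total_s:.0%} of the task)")
print(f"average rate over the whole task: {avg_mib_s:.1f} MiB/s")
```

Under those assumptions about 108 of the 123 seconds pass after the data has already been read, which is why the reported average rate is so much lower than the read rates in the progress lines.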

At the moment I'm looking at the sync level and chunk order settings. When we set the sync level to none, the next few backups are fast, but within a few minutes we are back where we started (probably the disk cache?). What about chunk order? I read about it in the docs, but I don't understand it well.

Does anyone have ideas on how to tune the storage server?
 
Hello,

no, at the moment there are only about 6-7 backup tasks running. "Verify new snapshots" is not checked.

This looks like a filesystem problem, but I have spent many hours looking into it without result. This thread is meant for brainstorming and sharing ideas on how to operate a "big" PBS.
 
Yep, I read this thread and, as I wrote, I tried these settings, but they only change the point at which the storage hangs. I will try changing chunk-order...
 
dmesg is clear, there are no reported problems.

The second idea was about the number of chunks, over 19 million... maybe split them across more datastores? But I'm just guessing.
 
The second idea was about the number of chunks, over 19 million... maybe split them across more datastores? But I'm just guessing.
That does not really matter for the backup job itself, as the backup job is only concerned with the chunks it references. Since your VM disk has about 12 GiB, that should be around 3000 chunks (not taking into account zero chunks and other deduplicated chunks).

The total size of the datastore is however relevant for verify jobs and garbage collection runtime.
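
The chunk estimate above can be reproduced with a short calculation. This is a sketch that assumes fixed-size 4 MiB chunks for VM disk images and takes the 12.4 GiB figure from the backup log; it ignores zero chunks and deduplication, as noted.

```python
import math

GIB = 1024 ** 3
MIB = 1024 ** 2

# Assumption: fixed-size 4 MiB chunks for VM disk images.
CHUNK_SIZE = 4 * MIB

disk_bytes = 12.4 * GIB  # disk size taken from the backup log
chunks = math.ceil(disk_bytes / CHUNK_SIZE)
print(chunks)  # 3175
```

That is a tiny fraction of the roughly 19 million chunks in the datastore, which is why the total chunk count matters for verify and garbage collection rather than for a single backup job.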
 
Hi,
what is your PBS version? Please post the output of proxmox-backup-manager version --verbose. Maybe you are running into this issue [0]; in that case upgrading to the latest version should help.

[0] https://forum.proxmox.com/threads/pbs-3-3-1-backup-tasks-hang-after-concluding-uploads.158812/

Edit: If that's not it, then you might want to inspect what the PBS is doing while it seems stuck by running strace -fp $(pidof proxmox-backup-proxy)
Hello, sorry for late reply.

Code:
proxmox-backup-manager version --verbose
proxmox-backup                2.4-1        running kernel: 5.15.158-2-pve
proxmox-backup-server         2.4.7-1      running version: 2.4.7       
pve-kernel-5.15               7.4-15                                     
pve-kernel-5.15.158-2-pve     5.15.158-2                                 
pve-kernel-5.15.102-1-pve     5.15.102-1                                 
ifupdown2                     3.1.0-1+pmx4                               
libjs-extjs                   7.0.0-1                                   
proxmox-backup-docs           2.4.7-1                                   
proxmox-backup-client         2.4.7-1                                   
proxmox-mail-forward          0.1.1-1                                   
proxmox-mini-journalreader    1.2-1                                     
proxmox-offline-mirror-helper unknown                                   
proxmox-widget-toolkit        3.7.4                                     
pve-xtermjs                   4.16.0-2                                   
smartmontools                 7.2-pve3                                   
zfsutils-linux                2.1.15-pve1

We are running one old PVE cluster on version 6, but we see the slow speed on PVE 8 too.

The biggest issue is that we run a second PBS in the same setup with SSD drives, and its speed is the same, sometimes slightly faster.

Thank you
 
Hi,
Hello, sorry for late reply.

Code:
proxmox-backup-manager version --verbose
proxmox-backup                2.4-1        running kernel: 5.15.158-2-pve
proxmox-backup-server         2.4.7-1      running version: 2.4.7      

This version of Proxmox Backup Server has been EOL for some time now. You should upgrade to the latest version to get all security updates, bug fixes and new features. Please follow the steps described in https://pbs.proxmox.com/wiki/index.php/Upgrade_from_2_to_3

We are running one old PVE cluster on version 6, but we see the slow speed on PVE 8 too.

So what exactly are you trying to investigate? In your initial comment you wrote about

we are investigating "hanging" backup jobs in our environment

and now you are talking about backup speed

The biggest issue is that we run a second PBS in the same setup with SSD drives, and its speed is the same, sometimes slightly faster.

So please specify what exactly the issue is and maybe share some more information, e.g. the full backup task log. Also, if the task is "hanging" as you described, the strace output might help to investigate further.

But before any of that, please upgrade both PVE and PBS to the latest versions.
 
Ah, okay, sorry. The problem is the backup speed: both the write throughput in MB/s and the wait for the backup to complete (the "hanging" mentioned earlier).

I found the thread https://forum.proxmox.com/threads/h...store-benchmark-tool.72750/page-4#post-634136 where a backup speed limit is described. It is probably a limit of PBS itself, and the single/multi-thread behavior described there matches what we are observing. We have two PBS servers, one with SSD disks and the other with NVMe disks, and the backup speed is identical; sometimes the SSD one is even faster.

I will try to perform strace, but the problem is that when 6-7 backups are running at the same time, it will be difficult to find "something" in it.

Have you tested what the maximum backup speeds to PBS are? Are we perhaps already hitting a limit of the software rather than of the hardware?

Regarding updates, it is currently not in our power to upgrade the old cluster. Anyway, as I wrote, we have both a new and an old cluster, and we have the speed problem in both environments. That is also why I keep going in circles: we have two environments and two PBS servers, one of which has NVMe disks (and should therefore be faster), but in terms of backup speed and the "Waiting for server to finish backup validation..." problem they behave the same.
 
Have you tested what the maximum backup speeds to PBS are? Are we perhaps already hitting a limit of the software rather than of the hardware?
Did you run the proxmox-backup-client benchmark using your PBS datastore as repository (when no other backups are running)? That might help identify possible bottlenecks.
I will try to perform strace, but the problem is that when 6-7 backups are running at the same time, it will be difficult to find "something" in it.
Did you check your I/O delay on the PBS while the backups are running?
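
If the graphs are not at hand, one way to approximate this on the PBS host is to sample the kernel's CPU counters twice and look at the share of time spent in iowait. The sketch below is only an illustration: the helper names are made up for this example, it reads the standard Linux /proc/stat file, and it assumes that iowait is a reasonable stand-in for the "IO delay" value shown in the GUI.

```python
import time


def read_cpu_times(path="/proc/stat"):
    """Return the aggregate CPU tick counters from the first line of /proc/stat."""
    with open(path) as f:
        fields = f.readline().split()
    # fields[0] is the literal "cpu"; the rest are cumulative tick counters.
    return [int(x) for x in fields[1:]]


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two samples, in percent."""
    deltas = [b - a for a, b in zip(before, after)]
    # Sum only user..steal (first 8 counters); guest time is already part of user/nice.
    total = sum(deltas[:8])
    # Counter order: user nice system idle iowait irq softirq steal, so index 4 is iowait.
    return 100.0 * deltas[4] / total if total else 0.0


def sample_iowait(interval_s=5.0):
    """Hypothetical helper: measure iowait over interval_s seconds on this host."""
    first = read_cpu_times()
    time.sleep(interval_s)
    return iowait_percent(first, read_cpu_times())
```

Calling `sample_iowait(5)` a few times while the backups sit at the validation step, and again while the server is idle, would show whether the wait coincides with the disks being the bottleneck.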