PBS backup stopped working - please help!

chudak

Hello all,

I have a PBS backup server running on a standalone NUC, and it was working fine for several months.
Today I found my scheduled backup process hung.

I tried running a backup manually (after rebooting both PVE and PBS) for a single small VM, and on the PVE side I see this task log:


Task viewer: VM/CT 105 - Backup
INFO: starting new backup job: vzdump 105 --remove 0 --node pve --storage PBS --mode stop
INFO: Starting Backup of VM 105 (qemu)
INFO: Backup started at 2021-04-08 09:14:59
INFO: status = stopped
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: VM Name: rancher
INFO: include disk 'scsi0' 'VMs:vm-105-disk-0' 60G
INFO: creating Proxmox Backup Server archive 'vm/105/2021-04-08T16:14:59Z'
INFO: starting kvm to execute backup task
INFO: started backup task '6503eafd-0d91-42eb-874f-0e8d8d6842a0'
INFO: scsi0: dirty-bitmap status: created new
INFO: 0% (404.0 MiB of 60.0 GiB) in 3s, read: 134.7 MiB/s, write: 9.3 MiB/s

and on the PBS side this task log:


Task viewer: Datastore PBS_ZFS Backup vm/105

2021-04-08T09:15:00-07:00: starting new backup on datastore 'PBS_ZFS': "vm/105/2021-04-08T16:14:59Z"
2021-04-08T09:15:00-07:00: download 'index.json.blob' from previous backup.
2021-04-08T09:15:00-07:00: register chunks in 'drive-scsi0.img.fidx' from previous backup.
2021-04-08T09:15:00-07:00: download 'drive-scsi0.img.fidx' from previous backup.
2021-04-08T09:15:00-07:00: created new fixed index 1 ("vm/105/2021-04-08T16:14:59Z/drive-scsi0.img.fidx")
2021-04-08T09:15:00-07:00: add blob "/mnt/datastore/PBS_ZFS/vm/105/2021-04-08T16:14:59Z/qemu-server.conf.blob" (345 bytes, comp: 345)

And that's it - no progress.
How can I fix this?

Thanks in advance!
 
Maybe it's just too slow?

INFO: starting new backup job: vzdump 105 --remove 0 --storage PBS --node pve --mode stop
INFO: Starting Backup of VM 105 (qemu)
INFO: Backup started at 2021-04-08 14:51:20
INFO: status = stopped
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: VM Name: rancher
INFO: include disk 'scsi0' 'VMs:vm-105-disk-0' 60G
INFO: creating Proxmox Backup Server archive 'vm/105/2021-04-08T21:51:20Z'
INFO: starting kvm to execute backup task
INFO: started backup task 'ddb97396-fa75-4c14-baf5-d5eaee3e6a1d'
INFO: scsi0: dirty-bitmap status: created new
INFO: 0% (448.0 MiB of 60.0 GiB) in 3s, read: 149.3 MiB/s, write: 9.3 MiB/s
INFO: 1% (800.0 MiB of 60.0 GiB) in 1m 13s, read: 5.0 MiB/s, write: 643.7 KiB/s
INFO: 2% (1.4 GiB of 60.0 GiB) in 11m 49s, read: 1.0 MiB/s, write: 0 B/s
INFO: 3% (2.1 GiB of 60.0 GiB) in 11m 52s, read: 246.7 MiB/s, write: 88.0 MiB/s
INFO: 4% (2.5 GiB of 60.0 GiB) in 43m 35s, read: 219.5 KiB/s, write: 219.5 KiB/s

I don't know how to approach this ...
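
One thing I might try is the built-in PBS benchmark, to see what the server side can sustain at all. A sketch (the repository string is my guess at the right user/host/datastore combination; adjust to your setup):

# Run from the PVE node: measures TLS upload speed to the PBS host,
# plus local SHA256 / compression / AES speed.
proxmox-backup-client benchmark --repository root@pam@<pbs-ip>:PBS_ZFS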
 
what kind of storage is the source and backup storage? how's the load on either end?
 
what kind of storage is the source and backup storage? how's the load on either end?

PVE - LVM on SSD (Samsung SSD 970 EVO Plus 1TB)
PBS - ZFS (WD Red disk)

No load spikes on either end

(Note: I have a parallel backup to CIFS and it still works fine)

root@pbs:~# zpool status -v PBS_ZFS
  pool: PBS_ZFS
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 02:30:09 with 0 errors on Sun Mar 14 03:54:11 2021
config:

        NAME     STATE     READ WRITE CKSUM
        PBS_ZFS  ONLINE       0     0     0
          sda    ONLINE       0     0     0

errors: No known data errors
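
If it helps, I can also watch the pool itself while a backup runs; as far as I understand, zpool's own iostat view shows per-vdev load (a sketch, 2-second interval):

root@pbs:~# zpool iostat -v PBS_ZFS 2    # per-vdev ops and bandwidth
root@pbs:~# zpool iostat -l PBS_ZFS 2    # adds latency columns (needs a recent ZFS)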
 
what kind of WD Red?
 
what kind of WD Red?

Western Digital 1TB WD Red Plus NAS Internal Hard Drive - 5400 RPM Class, SATA 6 Gb/s, CMR, 16 MB Cache, 2.5" - WD10JFCX

Also, I tested network performance to this box:

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.09 GBytes   934 Mbits/sec    0   sender
[  5]   0.00-10.00  sec  1.09 GBytes   932 Mbits/sec        receiver

so the network is fine.
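
For reference, the commands were along these lines (iperf3 server on the PBS box, client on PVE; <pbs-ip> stands in for my PBS address):

root@pbs:~# iperf3 -s                # server side, on PBS
root@pve:~# iperf3 -c <pbs-ip>       # PVE -> PBS
root@pve:~# iperf3 -c <pbs-ip> -R    # reverse mode: PBS -> PVE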
 
@fabian

In general I don't mind reformatting the HD if you say so (although I'm not sure about all the steps to reinitialize the drive for PBS), but it would be good to find the root cause first. Maybe it's PBS software related, who knows.

What do you think?
 
Well, a single spinning disk like that won't manage a lot of random I/O. Your network seems to be Gbit, so more than ~100 MB/s is not possible on that front. I am not sure why the performance fluctuates that much, but watching with atop on both ends might give you a hint as to whether you are hitting a bottleneck.
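
Something along these lines on both machines while a backup is running (a sketch; atop and iostat come from the atop and sysstat packages respectively):

apt install atop sysstat   # if not already present
atop 2                     # DSK lines: busy% near 100 means the disk is saturated
iostat -x 2                # high %util and await on the datastore disk point the same way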
 
Actually, about the network:

iperf3 PVE to PBS - normal:

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.01 GBytes   870 Mbits/sec   19   sender
[  5]   0.00-10.00  sec  1.01 GBytes   869 Mbits/sec        receiver

but from PBS to PVE:

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  4.63 MBytes  3.88 Mbits/sec  434   sender
[  5]   0.00-10.00  sec  4.48 MBytes  3.76 Mbits/sec        receiver

WTH?!

What do you think?
 
Well, a single spinning disk like that won't manage a lot of random I/O. Your network seems to be Gbit, so more than ~100 MB/s is not possible on that front. I am not sure why the performance fluctuates that much, but watching with atop on both ends might give you a hint as to whether you are hitting a bottleneck.


I am not expecting top performance from that PBS box, but it has been working fine, so something is different now.
How would you address this?
 
Actually, about the network:

iperf3 PVE to PBS - normal:

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.01 GBytes   870 Mbits/sec   19   sender
[  5]   0.00-10.00  sec  1.01 GBytes   869 Mbits/sec        receiver

but from PBS to PVE:

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  4.63 MBytes  3.88 Mbits/sec  434   sender
[  5]   0.00-10.00  sec  4.48 MBytes  3.76 Mbits/sec        receiver

WTH?!

What do you think?
Well, that is outside of our control - check your switch, NICs, ...?
 
Well, that is outside of our control - check your switch, NICs, ...?

Fixed the network (changed the cable); now I get 1 Gbit both ways.

I think it was an unrelated issue, and the connection from PVE to PBS was still running at high speed.

The problem still remains.
When I run GC, it practically hangs:

Task viewer: Datastore PBS_ZFS - Garbage Collect

2021-04-09T07:34:52-07:00: starting garbage collection on store PBS_ZFS
2021-04-09T07:34:52-07:00: Start GC phase1 (mark used chunks)

Any more clues?

How can the drive be reformatted/reinitialized?

Thanks for helping, @fabian!
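
From reading the docs, I would guess the steps are roughly the following, but I'd like confirmation before running anything destructive (assuming both pool and datastore are named PBS_ZFS, on /dev/sda):

proxmox-backup-manager datastore remove PBS_ZFS   # removes only the datastore config
zpool destroy PBS_ZFS                             # DESTROYS all backup data on the pool
zpool create -m /mnt/datastore/PBS_ZFS PBS_ZFS sda
proxmox-backup-manager datastore create PBS_ZFS /mnt/datastore/PBS_ZFS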
 
@fabian

Update:

I solved the problem by removing my PBS storage from PVE and adding it back.

Why did that happen? Kill me!

If you want to look at some logs to try to find the root cause, please let me know and I'll be happy to provide them.
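
I did it in the GUI; the CLI equivalent would be roughly this (placeholders for my server and certificate fingerprint, plus whatever credentials pvesm expects):

pvesm remove PBS
pvesm add pbs PBS --server <pbs-ip> --datastore PBS_ZFS --username root@pam --fingerprint <pbs-fingerprint>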

Meanwhile, I'm testing a backup of one VM and it runs reasonably well:


INFO: 0% (588.0 MiB of 60.0 GiB) in 3s, read: 196.0 MiB/s, write: 161.3 MiB/s
INFO: 1% (964.0 MiB of 60.0 GiB) in 6s, read: 125.3 MiB/s, write: 125.3 MiB/s
INFO: 2% (1.2 GiB of 60.0 GiB) in 10s, read: 76.0 MiB/s, write: 76.0 MiB/s
 
