PBS backup stopped working - please help!

chudak

Hello all,

I have a PBS backup server running on a standalone NUC, and it was working fine for several months.
Today I found my scheduled backup process hung.

I tried running a backup manually (after rebooting both PVE and PBS) for a single small VM, and on the PVE side I see this task log:


Task viewer: VM/CT 105 - Backup
INFO: starting new backup job: vzdump 105 --remove 0 --node pve --storage PBS --mode stop
INFO: Starting Backup of VM 105 (qemu)
INFO: Backup started at 2021-04-08 09:14:59
INFO: status = stopped
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: VM Name: rancher
INFO: include disk 'scsi0' 'VMs:vm-105-disk-0' 60G
INFO: creating Proxmox Backup Server archive 'vm/105/2021-04-08T16:14:59Z'
INFO: starting kvm to execute backup task
INFO: started backup task '6503eafd-0d91-42eb-874f-0e8d8d6842a0'
INFO: scsi0: dirty-bitmap status: created new
INFO: 0% (404.0 MiB of 60.0 GiB) in 3s, read: 134.7 MiB/s, write: 9.3 MiB/s

and on the PBS side this task log:


Task viewer: Datastore PBS_ZFS Backup vm/105

2021-04-08T09:15:00-07:00: starting new backup on datastore 'PBS_ZFS': "vm/105/2021-04-08T16:14:59Z"
2021-04-08T09:15:00-07:00: download 'index.json.blob' from previous backup.
2021-04-08T09:15:00-07:00: register chunks in 'drive-scsi0.img.fidx' from previous backup.
2021-04-08T09:15:00-07:00: download 'drive-scsi0.img.fidx' from previous backup.
2021-04-08T09:15:00-07:00: created new fixed index 1 ("vm/105/2021-04-08T16:14:59Z/drive-scsi0.img.fidx")
2021-04-08T09:15:00-07:00: add blob "/mnt/datastore/PBS_ZFS/vm/105/2021-04-08T16:14:59Z/qemu-server.conf.blob" (345 bytes, comp: 345)

And that's it - no progress.
How can I fix this?

Thanks in advance!
 
Maybe it's just too slow?

INFO: starting new backup job: vzdump 105 --remove 0 --storage PBS --node pve --mode stop
INFO: Starting Backup of VM 105 (qemu)
INFO: Backup started at 2021-04-08 14:51:20
INFO: status = stopped
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: VM Name: rancher
INFO: include disk 'scsi0' 'VMs:vm-105-disk-0' 60G
INFO: creating Proxmox Backup Server archive 'vm/105/2021-04-08T21:51:20Z'
INFO: starting kvm to execute backup task
INFO: started backup task 'ddb97396-fa75-4c14-baf5-d5eaee3e6a1d'
INFO: scsi0: dirty-bitmap status: created new
INFO: 0% (448.0 MiB of 60.0 GiB) in 3s, read: 149.3 MiB/s, write: 9.3 MiB/s
INFO: 1% (800.0 MiB of 60.0 GiB) in 1m 13s, read: 5.0 MiB/s, write: 643.7 KiB/s
INFO: 2% (1.4 GiB of 60.0 GiB) in 11m 49s, read: 1.0 MiB/s, write: 0 B/s
INFO: 3% (2.1 GiB of 60.0 GiB) in 11m 52s, read: 246.7 MiB/s, write: 88.0 MiB/s
INFO: 4% (2.5 GiB of 60.0 GiB) in 43m 35s, read: 219.5 KiB/s, write: 219.5 KiB/s

I don't know how to approach this ...
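
One thing I might try is the built-in PBS benchmark, to see what the server side can sustain at all. A sketch (the repository string is my guess at the right user/host/datastore combination; adjust to your setup):

# Run from the PVE node: measures TLS upload speed to the PBS host,
# plus local SHA256 / compression / AES speed.
proxmox-backup-client benchmark --repository root@pam@<pbs-ip>:PBS_ZFS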
 
what kind of storage is the source and backup storage? how's the load on either end?
 
what kind of storage is the source and backup storage? how's the load on either end?

PVE - LVM on SSD (Samsung SSD 970 EVO Plus 1TB)
PBS - ZFS (WD Red disk)

No load spikes on either end

(Note: I have a parallel backup to CIFS and it still works fine)

root@pbs:~# zpool status -v PBS_ZFS
  pool: PBS_ZFS
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 02:30:09 with 0 errors on Sun Mar 14 03:54:11 2021
config:

        NAME     STATE     READ WRITE CKSUM
        PBS_ZFS  ONLINE       0     0     0
          sda    ONLINE       0     0     0

errors: No known data errors
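
If it helps, I can also watch the pool itself while a backup runs; as far as I understand, zpool's own iostat view shows per-vdev load (a sketch, 2-second interval):

root@pbs:~# zpool iostat -v PBS_ZFS 2    # per-vdev ops and bandwidth
root@pbs:~# zpool iostat -l PBS_ZFS 2    # adds latency columns (needs a recent ZFS)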
 
what kind of WD Red?
 
what kind of WD Red?

Western Digital 1TB WD Red Plus NAS Internal Hard Drive - 5400 RPM Class, SATA 6 Gb/s, CMR, 16 MB Cache, 2.5" - WD10JFCX

Also, I tested network performance to this box:

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.09 GBytes   934 Mbits/sec    0   sender
[  5]   0.00-10.00  sec  1.09 GBytes   932 Mbits/sec        receiver

so the network is fine.
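
For reference, the commands were along these lines (iperf3 server on the PBS box, client on PVE; <pbs-ip> stands in for my PBS address):

root@pbs:~# iperf3 -s                # server side, on PBS
root@pve:~# iperf3 -c <pbs-ip>       # PVE -> PBS
root@pve:~# iperf3 -c <pbs-ip> -R    # reverse mode: PBS -> PVE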
 
@fabian

In general I don't mind reformatting the HD if you say so (although I'm not sure about all the steps to reinitialize the drive for PBS), but it would be good to find the root cause first. Maybe it's PBS software related, who knows.

What do you think?
 
Well, a single spinning disk like that won't manage a lot of random I/O. Your network seems to be Gbit, so more than ~100 MB/s is not possible on that front. I am not sure why the performance fluctuates that much, but watching with atop on both ends might give you a hint as to whether you are hitting a bottleneck.
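
Something along these lines on both machines while a backup is running (a sketch; atop and iostat come from the atop and sysstat packages respectively):

apt install atop sysstat   # if not already present
atop 2                     # DSK lines: busy% near 100 means the disk is saturated
iostat -x 2                # high %util and await on the datastore disk point the same way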
 
Actually, about the network:

iperf3 PVE to PBS - normal:

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.01 GBytes   870 Mbits/sec   19   sender
[  5]   0.00-10.00  sec  1.01 GBytes   869 Mbits/sec        receiver

but from PBS to PVE:

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  4.63 MBytes  3.88 Mbits/sec  434   sender
[  5]   0.00-10.00  sec  4.48 MBytes  3.76 Mbits/sec        receiver

WTH?!

What do you think?
 
Well, a single spinning disk like that won't manage a lot of random I/O. Your network seems to be Gbit, so more than ~100 MB/s is not possible on that front. I am not sure why the performance fluctuates that much, but watching with atop on both ends might give you a hint as to whether you are hitting a bottleneck.


I am not expecting top performance from that PBS box, but it has been working fine, so something is different now.
How would you address this?
 
Actually, about the network:

iperf3 PVE to PBS - normal:

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.01 GBytes   870 Mbits/sec   19   sender
[  5]   0.00-10.00  sec  1.01 GBytes   869 Mbits/sec        receiver

but from PBS to PVE:

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  4.63 MBytes  3.88 Mbits/sec  434   sender
[  5]   0.00-10.00  sec  4.48 MBytes  3.76 Mbits/sec        receiver

WTH?!

What do you think?
Well, that is outside of our control - check your switch, NICs, ...?
 
Well, that is outside of our control - check your switch, NICs, ...?

Fixed the network (changed the cable); now I get 1 Gbit both ways.

I think it was an unrelated issue, and the connection from PVE to PBS was still running at high speed.

The problem still remains.
When I run GC, it practically hangs:

Task viewer: Datastore PBS_ZFS - Garbage Collect

2021-04-09T07:34:52-07:00: starting garbage collection on store PBS_ZFS
2021-04-09T07:34:52-07:00: Start GC phase1 (mark used chunks)

Any more clues?

How can the drive be reformatted/reinitialized?

Thanks for helping, @fabian!
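
From reading the docs, I would guess the steps are roughly the following, but I'd like confirmation before running anything destructive (assuming both pool and datastore are named PBS_ZFS, on /dev/sda):

proxmox-backup-manager datastore remove PBS_ZFS   # removes only the datastore config
zpool destroy PBS_ZFS                             # DESTROYS all backup data on the pool
zpool create -m /mnt/datastore/PBS_ZFS PBS_ZFS sda
proxmox-backup-manager datastore create PBS_ZFS /mnt/datastore/PBS_ZFS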
 
@fabian

Update:

I solved the problem by removing my PBS storage from PVE and adding it back.

Why did that happen? Kill me!

If you want to look at some logs to try to find the root cause, please let me know and I'll be happy to provide them.
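
I did it in the GUI; the CLI equivalent would be roughly this (placeholders for my server and certificate fingerprint, plus whatever credentials pvesm expects):

pvesm remove PBS
pvesm add pbs PBS --server <pbs-ip> --datastore PBS_ZFS --username root@pam --fingerprint <pbs-fingerprint>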

Meanwhile, I'm testing a backup of one VM and it runs reasonably well:


INFO: 0% (588.0 MiB of 60.0 GiB) in 3s, read: 196.0 MiB/s, write: 161.3 MiB/s
INFO: 1% (964.0 MiB of 60.0 GiB) in 6s, read: 125.3 MiB/s, write: 125.3 MiB/s
INFO: 2% (1.2 GiB of 60.0 GiB) in 10s, read: 76.0 MiB/s, write: 76.0 MiB/s
 
