Persistent Backup Problems on Consumer Hardware

aKoR

Hello everyone,

I'm reaching out because I haven't been able to resolve my backup issues for a few months now.

Most of the time my backups run fine, but I keep getting hiccups.

First my Setup:
- PVE: HP ProDesk 400 G9, i5-12500T, 64 GB DDR4, NVMe SSD as local datastore with LVM
- PBS: Intel N100 mini PC with 8 GB RAM and a SATA SSD
- Offsite PBS: VPS with 1 core and 1 GB RAM
- TrueNAS Scale: HP EliteDesk 800 G3, i5-6500, 8 GB RAM, RAIDZ1 over 3 SATA SSDs with a log device (SLOG) on a SATA SSD

PVE has local-lvm plus an iSCSI connection to the NAS for another datastore. Most of the data is on the iSCSI storage.

The backup runs every evening at 8 pm onto the PBS's local SSD. It's about 700 GB in total, with incrementals usually only 100 MB to 1 GB.
Then there's a push sync job to the offsite PBS at 10 pm.

I'm fully aware that this is consumer hardware and that the lack of ECC RAM most likely causes my problems. But I'm still hoping for some tweaking tips :)

Now to my Problem:
Most of the time a backup takes under 5 minutes. But about once a week, or after a restart, it takes about an hour and causes high load on the NAS disks. Since a reboot makes it take long at least once, it must be some kind of cache.
Where does it get cached here? When the backups are created quickly everything is fine; when they take long I often get bad chunks in the verify process.

This leads to failed backups, for example:

backup write data failed: command error: write_data upload error: pipelined request failed: detected chunk with wrong digest.

The long-running backups lead to a big sync of data to the offsite PBS. This month it happened a lot and I ran into my bandwidth quota twice. Usually 1 TB of bandwidth is sufficient; this month I'm already at 2.5 TB and keep having to buy more.

As this is rather vague, I didn't know which logs to upload initially. I'll upload them if someone requests them.

Thanks for any help in advance!

Regards

Alex
 
Hi,
Most of the time a backup takes under 5 minutes. But about once a week, or after a restart, it takes about an hour and causes high load on the NAS disks. Since a reboot makes it take long at least once, it must be some kind of cache.
Where does it get cached here? When the backups are created quickly everything is fine; when they take long I often get bad chunks in the verify process.
PVE uses a dirty bitmap to keep track of changed blocks that have to be included in a new backup snapshot, reusing already existing chunks and avoiding their re-upload [0]. The bitmap is, however, lost if you change the backup target or if you stop/restart the VM between backups, as consistency cannot be guaranteed anymore in those cases.
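
You can check whether the bitmap was actually reused by looking at the per-disk "dirty-bitmap status" lines in the backup task log. A minimal check on the PVE host could look like this (the log path assumes a scheduled vzdump job and VM 131 is just an example; the task log in the GUI contains the same lines):

Bash:
# show the per-disk dirty-bitmap status lines from the last scheduled run of VM 131
grep 'dirty-bitmap status' /var/log/vzdump/qemu-131.log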

backup write data failed: command error: write_data upload error: pipelined request failed: detected chunk with wrong digest.
This does look like the chunk got corrupted, either in transit or by bad memory, yes.
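
If you want to rule out bad RAM, a full offline memtest86+ run is the thorough option; as a quick, less thorough sketch from the running system you could use the memtester package (size and pass count are arbitrary examples):

Bash:
# userspace RAM test: lock 2 GiB and run 3 passes; only catches errors in the tested region
apt install memtester
memtester 2G 3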

The long-running backups lead to a big sync of data to the offsite PBS. This month it happened a lot and I ran into my bandwidth quota twice. Usually 1 TB of bandwidth is sufficient; this month I'm already at 2.5 TB and keep having to buy more.
Are you maybe able to set up a pull sync instead? That might be more efficient when it comes to incremental syncs, as it has a different implementation.
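
Roughly, that would mean defining your home PBS as a remote on the offsite PBS and creating a sync job that pulls from it. A sketch with placeholder names, address, fingerprint and schedule (adjust everything to your setup):

Bash:
# on the offsite PBS: register the home PBS as a remote (all values are placeholders)
proxmox-backup-manager remote create home-pbs \
    --host 10.0.0.2 --auth-id 'sync@pbs' --password 'SECRET' \
    --fingerprint '64:d3:ff:...:62:08'
# pull from the remote datastore into the local one (names and schedule are placeholders)
proxmox-backup-manager sync-job create pull-home \
    --store offsite-store --remote home-pbs --remote-store homestore --schedule '22:00'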

[0] https://pbs.proxmox.com/docs/technical-overview.html#fixed-sized-chunks
 
PVE uses a dirty bitmap to keep track of changed blocks that have to be included in a new backup snapshot, reusing already existing chunks and avoiding their re-upload [0]. The bitmap is, however, lost if you change the backup target or if you stop/restart the VM between backups, as consistency cannot be guaranteed anymore in those cases.

That makes total sense. I will keep that in mind and do some testing around it.

Are you maybe able to set up a pull sync instead? That might be more efficient when it comes to incremental syncs, as it has a different implementation.
I can do that; it's running over a WireGuard tunnel either way, so that should be fine.
I'll update this post as soon as I've got new results.

Thanks a lot for clearing things up!
 
Yesterday I had a long-running backup again.

In the log I see these lines:

Code:
131: 2025-05-28 20:02:21 INFO: scsi0: dirty-bitmap status: existing bitmap was invalid and has been cleared
131: 2025-05-28 20:02:21 INFO: scsi1: dirty-bitmap status: existing bitmap was invalid and has been cleared
131: 2025-05-28 20:02:21 INFO: scsi3: dirty-bitmap status: existing bitmap was invalid and has been cleared
131: 2025-05-28 20:02:21 INFO: scsi4: dirty-bitmap status: existing bitmap was invalid and has been cleared
131: 2025-05-28 20:02:21 INFO: scsi6: dirty-bitmap status: existing bitmap was invalid and has been cleared

The VM didn't get restarted though.

07:02:17 up 9 days, 18:24, 1 user, load average: 0.07, 0.13, 0.14

What other reasons are there for the bitmap getting cleared?
 
Yesterday I had a long-running backup again.

In the log I see these lines:

Code:
131: 2025-05-28 20:02:21 INFO: scsi0: dirty-bitmap status: existing bitmap was invalid and has been cleared
131: 2025-05-28 20:02:21 INFO: scsi1: dirty-bitmap status: existing bitmap was invalid and has been cleared
131: 2025-05-28 20:02:21 INFO: scsi3: dirty-bitmap status: existing bitmap was invalid and has been cleared
131: 2025-05-28 20:02:21 INFO: scsi4: dirty-bitmap status: existing bitmap was invalid and has been cleared
131: 2025-05-28 20:02:21 INFO: scsi6: dirty-bitmap status: existing bitmap was invalid and has been cleared

The VM didn't get restarted though.

07:02:17 up 9 days, 18:24, 1 user, load average: 0.07, 0.13, 0.14

What other reasons are there for the bitmap getting cleared?
Most likely you are backing up the same VM to a different PBS target or namespace in between?

You can find some more cases in which the bitmap gets invalidated here: https://forum.proxmox.com/threads/existing-bitmap-was-invalid-and-has-been-cleared.80445/post-647272
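
To rule that out, you can check which scheduled backup jobs reference the VM on the PVE side, for example:

Bash:
# scheduled backup jobs are stored in /etc/pve/jobs.cfg on current PVE versions;
# check that VM 131 only appears in a single job/target
cat /etc/pve/jobs.cfg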
 
The target always remains the same, and I didn't change anything on the VM.
What's special about the VM is that it has some raw device disks on an iSCSI LUN.


Bash:
ako@vmh03:/etc/pve/qemu-server$ sudo cat 131.conf
agent: 1
balloon: 0
boot: order=scsi0;net0;ide2
cores: 4
cpu: host
ide2: none,media=cdrom
memory: 16384
meta: creation-qemu=9.0.2,ctime=1741413401
name: akodocker03
net0: virtio=BC:24:11:B0:BC:17,bridge=vmbr0,firewall=1,tag=1
numa: 0
onboot: 1
ostype: l26
scsi0: local-lvm:vm-131-disk-0,iothread=1,size=50G
scsi1: nas01_01_thinpool01:vm-131-disk-0,iothread=1,size=20G
scsi3: nas01_01_thinpool01:vm-131-disk-2,iothread=1,size=300G
scsi4: nas01_01_thinpool01:vm-131-disk-3,iothread=1,size=200G
scsi6: nas01_01_thinpool01:vm-131-disk-5,iothread=1,size=50G
scsihw: virtio-scsi-single
smbios1: uuid=c187c1db-eb94-4a15-a55f-a230069d7b5e
sockets: 1
startup: order=5
vmgenid: dfb68341-09b8-4b45-88e2-20bfc71e52d5
ako@akovmh03:/etc/pve/qemu-server$

None of the involved systems were updated or restarted between the backup with the cleared bitmap and the one before it.

I did have problems with the verification of the backup though, as described earlier.

But both the quick-running backup and the slow one had a predecessor with a failed verification.
 
But both the quick-running backup and the slow one had a predecessor with a failed verification.
That too will invalidate the bitmap, as the previous snapshot cannot be used as a reference if its verification failed (otherwise you would reuse parts of a corrupt snapshot).
 
Makes sense.

Is there anything I can do to make my backups persistent from the software side? Like throttling the backup speed?
 
Makes sense.

Is there anything I can do to make my backups persistent from the software side? Like throttling the backup speed?
Not sure if I understand your question: your backups are persisted to your local datastore. And if you did not change the datastore tuning parameters, for each uploaded chunk it is ensured that its content is correct and synced to disk before the backup finishes.
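
Regarding throttling: if you suspect the load itself is part of the problem, you can limit the backup read speed on the PVE side, for example via the vzdump bandwidth limit. A rough sketch; the value (in KiB/s, here ~50 MiB/s) is just an example, and the limit can also be set per backup job instead of globally:

Bash:
# cap vzdump's read rate globally at ~50 MiB/s (bwlimit is given in KiB/s)
echo 'bwlimit: 51200' >> /etc/vzdump.conf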

backup write data failed: command error: write_data upload error: pipelined request failed: detected chunk with wrong digest.
Are you saying that the new, full backups also fail upload verification with the above message? Did you perform a full datastore verification?
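
If not, you can start one for the whole datastore from the GUI or on the PBS CLI, for example (the datastore name is a placeholder):

Bash:
# verify every snapshot in the datastore 'store1' (replace with your datastore name)
proxmox-backup-manager verify store1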
 
What I meant:

Is there anything I can do to prevent bad chunks in my backups from the software side?
 
PBS already ensures that uploaded chunks are verified and persisted with the expected contents; this does, however, not protect against silent data corruption afterwards. Using filesystems with additional redundancy and checksumming features such as ZFS, together with ECC memory, will help to avoid such corruption.
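
As a rough sketch of what that could look like on a small PBS box (pool name, datastore name and disk IDs are placeholders, not a sizing recommendation):

Bash:
# two-disk ZFS mirror with checksumming and self-healing (disk IDs are placeholders)
zpool create -o ashift=12 -O compression=on -O atime=off backup mirror \
    /dev/disk/by-id/nvme-DISK1 /dev/disk/by-id/nvme-DISK2
# create the PBS datastore on top of the new pool
proxmox-backup-manager datastore create store1 /backup/pbs-store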

In general, verify jobs were introduced to detect such corruption, see https://pbs.proxmox.com/docs/maintenance.html#verification.
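
A scheduled verify job can be created in the GUI or on the CLI, for example (job ID, datastore name and schedule are placeholders):

Bash:
# verify the datastore 'store1' once a day
proxmox-backup-manager verify-job create daily-verify --store store1 --schedule daily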
 
I understand. I will rebuild my PBS storage with ZFS; maybe that helps in this case. Thanks a lot for your quick and helpful comments!
 
Already did.

I will go with a tiny PC with 2 NVMe drives in a ZFS RAID-Z1 as a datastore.

Worst case it doesn't help and I can add it to my PVE cluster instead ;)
 
I will go with a tiny PC with 2 NVMe drives in a ZFS RAID-Z1 as a datastore.
RAID-Z1 requires at least 3 disks though? And RAID 10 will give you better performance.
 
Yeah, you're right, it's not called RAID-Z1 in that case; I got the wording wrong. But AFAIK I can use a mirror vdev, which provides the self-healing functionality.
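
From what I've read, the self-healing only repairs blocks that actually get read, so I plan to run a regular scrub and check the pool afterwards, roughly like this (pool name is a placeholder):

Bash:
# read and checksum-verify every block, then check for repaired or corrupted data
zpool scrub backup
zpool status -v backup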

Performance is not my concern; I only need reliable backups for my home infrastructure.

I will give that a shot :)
 
Quick Update for other people with this problem:

I got the new server running with a ZFS mirror of 2 NVMe SSDs.

Everything is working perfectly fine, no more bad chunks so far.

This seems to be the solution; I will update if anything changes.