CRC checksum errors on remote-sync

LandRave

New Member
Aug 21, 2020
I'm fairly new to Proxmox and still in the evaluation stage. At the moment I'm experimenting with the Backup Server. While the backups themselves work fine, remote sync doesn't work for me. The remote Backup Server is connected via VPN to the primary Backup Server, and on every test sync I run I get CRC checksum errors.

2020-08-23T12:31:44+02:00: Starting datastore sync job 'PBS1'
2020-08-23T12:31:44+02:00: Sync datastore 'Backup' from '192.168.22.65/Test2'
2020-08-23T12:31:44+02:00: re-sync snapshot "ct/100/2020-08-23T09:39:50Z"
2020-08-23T12:31:44+02:00: sync archive root.pxar.didx
2020-08-23T12:32:14+02:00: sync group ct/100 failed - Data blob has wrong CRC checksum.
2020-08-23T12:32:14+02:00: sync snapshot "vm/102/2020-08-23T09:40:28Z"
2020-08-23T12:32:15+02:00: sync archive qemu-server.conf.blob
2020-08-23T12:32:15+02:00: sync archive drive-scsi0.img.fidx
2020-08-23T12:32:41+02:00: sync group vm/102 failed - Data blob has wrong CRC checksum.
2020-08-23T12:32:41+02:00: TASK ERROR: sync failed with some errors.

Any ideas?
 
Hi,

The remote Backup Server is connectet via VPN to the primary Backup Server
Why do you use a VPN?
The Backup Server traffic is already encrypted.

Does your VPN use compression?
 
I have the same problem. I replicate a datastore through an IPsec tunnel, and when I try to restore a virtual machine in the replicated environment I get this error:
restore failed: Data blob has wrong CRC checksum.
TASK ERROR: command '/usr/bin/pbs-restore --repository root@pam@1xxx.xxx.xxx.xxx:pbs-local vm/102 drive-virtio0.img.fidx /dev/zvol/ssd1/vm-401-disk-0 --verbose --format raw --keyfile /etc/pve/priv/storage/pbs-local.enc --skip-zero' failed: exit code 255
 
what does a verify of that snapshot say on the server side?
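
In case the GUI is not handy, a verify can also be started from the PBS command line; a minimal sketch, assuming the datastore is called 'Backup' as in the log above:

Code:
# verify all snapshots in the datastore (runs as a task; output appears in the task log)
proxmox-backup-manager verify Backup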
 
This is the output of verify task:

2021-03-08T12:24:09+01:00: verify pbs-local-px3-BACKUP:vm/106/2021-01-23T23:02:03Z
2021-03-08T12:24:09+01:00: check qemu-server.conf.blob
2021-03-08T12:24:09+01:00: check drive-virtio1.img.fidx
2021-03-08T12:34:06+01:00: can't verify chunk, load failed - store 'pbs-local-px3-BACKUP', unable to load chunk '86c862c12eae95ef891c24cf2936e4bed245cd1c3166393ddb9556b27208e726' - Data blob has wrong CRC checksum.
2021-03-08T12:34:06+01:00: corrupted chunk renamed to "/mnt/local-px3-BACKUP/BACKUP/.chunks/86c8/86c862c12eae95ef891c24cf2936e4bed245cd1c3166393ddb9556b27208e726.0.bad"
2021-03-08T12:34:13+01:00: can't verify chunk, load failed - store 'pbs-local-px3-BACKUP', unable to load chunk '90dfe074300c6b160b45f0315cb3d00829eca0dc2b591a6e00ee7baf1ad429da' - Data blob has wrong CRC checksum.
2021-03-08T12:34:13+01:00: corrupted chunk renamed to "/mnt/local-px3-BACKUP/BACKUP/.chunks/90df/90dfe074300c6b160b45f0315cb3d00829eca0dc2b591a6e00ee7baf1ad429da.0.bad"
2021-03-08T12:34:41+01:00: can't verify chunk, load failed - store 'pbs-local-px3-BACKUP', unable to load chunk 'cc612edfe167361ec5846638362aaf2d16f0d7470d9002f4aa7e75a1955cee43' - Data blob has wrong CRC checksum.
2021-03-08T12:34:41+01:00: corrupted chunk renamed to "/mnt/local-px3-BACKUP/BACKUP/.chunks/cc61/cc612edfe167361ec5846638362aaf2d16f0d7470d9002f4aa7e75a1955cee43.0.bad"
2021-03-08T12:41:16+01:00: verified 61996.71/125196.00 MiB in 1026.85 seconds, speed 60.38/121.92 MiB/s (3 errors)
2021-03-08T12:41:16+01:00: verify pbs-local-px3-BACKUP:vm/106/2021-01-23T23:02:03Z/drive-virtio1.img.fidx failed: chunks could not be verified
2021-03-08T12:41:16+01:00: check drive-virtio0.img.fidx
2021-03-08T12:46:50+01:00: verified 18796.04/32708.00 MiB in 334.25 seconds, speed 56.23/97.85 MiB/s (0 errors)
2021-03-08T12:46:50+01:00: Failed to verify the following snapshots/groups:
2021-03-08T12:46:50+01:00: vm/106/2021-01-23T23:02:03Z
2021-03-08T12:46:50+01:00: TASK ERROR: verification failed - please check the log for details

Thanks
 
what storage are you using? did you experience any corruption issues on snapshots that are not synced, but created directly via backups? do you have a verification job set up? if yes, did it run since those chunks were synced?
 
I am using NFS-mounted storage.
I have a Proxmox environment with PBS as the backup system, and another environment at a different site with a replica of the backups on a second PBS, synced through an IPsec tunnel.
In the initial environment, where I have the original backups, the VMs restore correctly; it is through the sync that they get corrupted.
 
no, when pulling the chunk we actually verify the CRC (and the full digest if it's an unencrypted chunk) before writing it to disk - the corruption has to come from your storage...
 
I'm having the same issue...

Installed PBS to test, added an NFS share, and added PBS to Proxmox as a storage target. Initiated a guest backup from within Proxmox. Switched to PBS and ran a verify:

Code:
2021-05-02T00:18:41+01:00: Starting datastore verify job 'zee-nfs:v-21eb29c4-30c6'
2021-05-02T00:18:41+01:00: verify datastore zee-nfs
2021-05-02T00:18:41+01:00: found 2 groups
2021-05-02T00:18:41+01:00: verify group zee-nfs:host/docker (0 snapshots)
2021-05-02T00:18:41+01:00: verify group zee-nfs:vm/104 (1 snapshots)
2021-05-02T00:18:41+01:00: verify zee-nfs:vm/104/2021-05-01T23:12:46Z
2021-05-02T00:18:41+01:00:   check qemu-server.conf.blob
2021-05-02T00:18:41+01:00:   check drive-scsi0.img.fidx
2021-05-02T00:19:26+01:00: can't verify chunk, load failed - store 'zee-nfs', unable to load chunk 'bcbebcce3b85a392929a2b3e7e47c7cced81ad69e5c7d61f4a447e647af97b66' - Data blob has wrong CRC checksum.
2021-05-02T00:19:26+01:00: corrupted chunk renamed to "/mnt/zee-nfs/.chunks/bcbe/bcbebcce3b85a392929a2b3e7e47c7cced81ad69e5c7d61f4a447e647af97b66.0.bad"
2021-05-02T00:19:35+01:00:   verified 826.09/2768.00 MiB in 53.21 seconds, speed 15.53/52.02 MiB/s (1 errors)
2021-05-02T00:19:35+01:00: verify zee-nfs:vm/104/2021-05-01T23:12:46Z/drive-scsi0.img.fidx failed: chunks could not be verified
2021-05-02T00:19:35+01:00: percentage done: 100.00% (1 of 2 groups, 1 of 1 group snapshots)
2021-05-02T00:19:35+01:00: Failed to verify the following snapshots/groups:
2021-05-02T00:19:35+01:00:     vm/104/2021-05-01T23:12:46Z
2021-05-02T00:19:35+01:00: TASK ERROR: verification failed - please check the log for details

My NAS where the NFS mount sits is running on RAID5, and has never thrown or shown any signs of storage errors...

Uhm...if I run the verify job again it completes:

Code:
Task viewer: Verify Job zee-nfs:v-21eb29c4-30c6 - Scheduled Verification
2021-05-02T00:23:37+01:00: Starting datastore verify job 'zee-nfs:v-21eb29c4-30c6'
2021-05-02T00:23:37+01:00: verify datastore zee-nfs
2021-05-02T00:23:37+01:00: found 2 groups
2021-05-02T00:23:37+01:00: verify group zee-nfs:host/docker (0 snapshots)
2021-05-02T00:23:37+01:00: verify group zee-nfs:vm/104 (1 snapshots)
2021-05-02T00:23:37+01:00: SKIPPED: verify zee-nfs:vm/104/2021-05-01T23:12:46Z (recently verified)
2021-05-02T00:23:37+01:00: percentage done: 100.00% (1 of 2 groups, 1 of 1 group snapshots)
2021-05-02T00:23:37+01:00: TASK OK
 
2021-05-02T00:23:37+01:00: SKIPPED: verify zee-nfs:vm/104/2021-05-01T23:12:46Z (recently verified)
the relevant snapshot was skipped because it was only recently verified.

Installed PBS to test, added a NFS share, added the PBS to Proxmox as a storage target. Initiated a guest backup from within Proxmox. Switched to PBS and ran a verify:
i would strongly suggest checking your underlying storage for errors (fsck/disks/SMART/whatever you have)

My NAS where the NFS mount sits is running on RAID5, and has never thrown or shown any signs of storage errors...
do you scrub your raid at regular intervals?
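
For reference, a rough sketch of what such checks could look like on the storage host (device and array names are only examples; run fsck only on a filesystem that is not mounted read-write):

Code:
# SMART health and self-test history for each member disk
smartctl -H /dev/sda
smartctl -l selftest /dev/sda

# trigger a consistency check of an md software RAID (common on NAS boxes)
echo check > /sys/block/md0/md/sync_action
cat /proc/mdstat

# read-only filesystem check
fsck -n /dev/md0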
 
Thanks @dcsapak, that's a good idea - to be honest my QNAP reports SMART as good, and an fsck doesn't show anything. I can't seem to find a RAID5 scrubbing option; it's an old QNAP, but it should still be there (will report back!).

After running fsck I managed to get one backup stored and verified successfully, but now they've all stopped working again. No backups seem to run - I've tried multiple VMs from within PVE and they all throw something like:

Code:
INFO: starting new backup job: vzdump 107 --mode snapshot --storage pbs --remove 0 --node proxmox2
INFO: Starting Backup of VM 107 (qemu)
INFO: Backup started at 2021-05-03 14:21:57
INFO: status = running
INFO: VM Name: pihole-1
INFO: include disk 'scsi0' 'local-zfs2-dir:107/vm-107-disk-0.raw' 8G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/107/2021-05-03T13:21:57Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 107 qmp command 'backup' failed - backup register image failed: command error: unable to get shared lock - EBADF: Bad file number
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 107 failed - VM 107 qmp command 'backup' failed - backup register image failed: command error: unable to get shared lock - EBADF: Bad file number
INFO: Failed at 2021-05-03 14:21:57
INFO: Backup job finished with errors
TASK ERROR: job errors
 
I've read this might be to do with locking, but I've tried turning oplocks on/off on my NAS, and I've confirmed lockd is running on the NAS host. I've also rebooted the NAS and unmounted/remounted the NFS share on the PBS VM, just in case. Still no dice.
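
In case it helps anyone else chasing the locking angle, a few checks of this kind (the NAS address, export path and mount point below are placeholders, not values from this thread):

Code:
# confirm the NFS lock manager (nlockmgr) and status daemons are registered on the NAS
rpcinfo -p <NAS-IP> | grep -E 'nlockmgr|status'

# check the options the datastore is currently mounted with on the PBS host
mount | grep zee-nfs

# as a test only: remount with local locking to see whether the lock manager is the problem
umount /mnt/zee-nfs
mount -t nfs -o vers=3,nolock <NAS-IP>:/export/backups /mnt/zee-nfs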
 
So I'm still not sure why every backup was corrupt when my NAS reports 0 filestore errors. But I fixed it - I had to remove the datastore config and then re-add it.

For others doing this, I came across these issues:

a) you cannot add storage to PBS with an existing PBS backup on it using the GUI or the command-line tool (you get a "'.chunks' - EEXIST: File exists" error). You can do it by manually editing /etc/proxmox-backup/datastore.cfg (a minimal example of that file follows after this list).

b) there is an issue when manually editing datastore.cfg: if you copy-paste an existing example and a space does not get put after 'comment', you'll get "parsing '/etc/proxmox-backup/datastore.cfg' failed: line 8 - syntax error (expected section properties)".
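
For anyone trying the manual edit, a minimal sketch of what an entry in /etc/proxmox-backup/datastore.cfg looks like (the name, path and comment are just examples; properties are indented lines under the section header):

Code:
datastore: zee-nfs
	comment NFS-backed datastore, re-added manually
	path /mnt/zee-nfs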


Bumpy ride but got there eventually! Thanks :)
 
Trying to find the root cause of this. My QNAP doesn't run ZFS so I cannot scrub it, but all the built-in RAID(5) tools, fsck and SMART disk checks report no issues. I've also been using it as a filestore and VM backup target for a month or two without any issues (I've restored VMs from it multiple times).

Out of a 1.8TB backup (although sparse data means it's about 800GB) I think about 1.5GB is corrupt (from looking around, chunks seem to be about 4 MB?):

Code:
[/share/backups/.chunks] # find . -name "*.bad" | wc -l
    356

Quite often the .bad chunks are a few MB smaller than other chunks in the same dir, but then again I've found chunks not flagged as bad that are only 500 KB, so I guess that is a coincidence.

I'm running a GC now and will then rerun the backup - from my understanding this should fill in the corrupt chunks?

Is there any way I can investigate these chunks in more detail? E.g. what mechanism does PBS use for copying across? I could try copying some files manually to my NFS mount and compare the checksums of source and destination (does it use sha256sum to perform the check)?
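
One rough way to dig into the renamed chunks (a sketch, assuming the re-run backup has rewritten the missing chunks so a good copy sits next to each .bad file):

Code:
# compare every renamed bad chunk byte-by-byte with its freshly rewritten good counterpart
cd /share/backups/.chunks
find . -name "*.0.bad" | while read bad; do
    good="${bad%.0.bad}"
    [ -e "$good" ] && { echo "== $good"; cmp -l "$good" "$bad" | head -n 5; }
done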

Thanks!
 
Hi,

I would try to test the NFS, without PBS, like this:

- create a big file on any PMX host, and then create an md5sum / sha256sum of this file
- check after 1-2 hours if the checksum is still OK
- copy this file (from the PMX host) to your NFS server and verify after that whether the checksum is OK or not (on the NFS server)
- check again after 1-2 hours on the NFS server whether the checksum is OK or not (on the NFS server)

If all these tests are OK, you can assume that your NFS server is OK.

- repeat the same two tests using the PBS server -> NFS (a rough shell sketch of this follows below)

I can only speculate that you have a disk problem somewhere and / or that the corruption happens somewhere in RAM (PBS, NFS).
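
A rough shell sketch of the test described above (the file size, paths and NFS mount point are only examples):

Code:
# on the PMX / PBS host: create a 5 GiB test file and record its checksum
dd if=/dev/urandom of=/root/nfs-test.bin bs=1M count=5120
sha256sum /root/nfs-test.bin | tee /root/nfs-test.sha256

# re-check on the same host after 1-2 hours
sha256sum -c /root/nfs-test.sha256

# copy the file to the NFS mount and verify the copy
cp /root/nfs-test.bin /mnt/zee-nfs/nfs-test.bin
sha256sum /mnt/zee-nfs/nfs-test.bin

# re-check the copy after 1-2 hours (ideally also directly on the NFS server)
sha256sum /mnt/zee-nfs/nfs-test.bin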


Good luck / Bafta!
 
Thanks @guletz! That's exactly the sort of process I was thinking of doing - helpful to have it written out!

So far with a 5GB file the checksum is fine/identical on both sides; I'll wait a few hours and check again. Although I've had checksum issues with smaller PBS backups (~8GB) previously, those seem to have gone away. The errors now happen with the larger 800GB disk set. It'll take me a while to test something of that size (on 1Gb LAN) - might have to limit my tests to 200GB!
 
Well, I just dropped £600 on a new 2-bay NAS. I couldn't handle the r/w speeds of my 10-year-old QNAP running on an ARM chip less powerful than my phone. At best it was doing 20 MB/s transfers -_-. I'm still not seeing any corruption on it, but I will test with the new NAS once it's up and running.
 
