One corrupted chunk keeps appearing in only one backup...

AddiCgn

New Member
May 22, 2023
Hi,

We back up 4 machines to the latest PBS. The verification job on the following day finds exactly one chunk with a wrong digest:

Code:
2023-06-13T12:54:38+02:00:   check drive-virtio1.img.fidx
2023-06-13T14:31:13+02:00: detected chunk with wrong digest.
2023-06-13T14:31:13+02:00: corrupted chunk renamed to ".../.chunks/d2a6/d2a6c5962829143602db19d5e3c8467ac32b25295923d53c34d1ef5c125b23d9.0.bad"

We deleted all snapshots for this one VM and made another backup. The next day, verification again found one bad chunk on exactly the same source disk. The chunk name/digest was different, but again there was exactly one bad chunk in total (confirmed by running find on the .chunks directory).
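
A search of the .chunks directory like the one mentioned above could look roughly like the following Python sketch. It assumes that quarantined chunks carry a ".bad" suffix, as in the log line above. The function name find_bad_chunks and the datastore path passed to it are placeholders for illustration, not something PBS provides.

```python
from pathlib import Path


def find_bad_chunks(datastore_root):
    """Return sorted paths of chunk files that verification renamed to *.bad.

    Assumption: bad chunks live somewhere below <datastore_root>/.chunks/
    and end in ".bad", matching the rename shown in the verification log.
    """
    chunk_dir = Path(datastore_root) / ".chunks"
    if not chunk_dir.is_dir():
        # No chunk store at this location, so nothing to report.
        return []
    # rglob walks every prefix subdirectory (e.g. "d2a6") recursively.
    return sorted(p for p in chunk_dir.rglob("*.bad") if p.is_file())
```

Calling find_bad_chunks with the datastore's root directory returns a list of paths; an empty list means no chunk has been renamed to ".bad" in that datastore.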

Any hints as to why, of all 4 backups to the same location, only this one has the issue, and why it even reappears? Is there anything we can reset in addition to deleting all snapshots on PBS?

Thanks,
 
As an update: I created a completely new datastore (local drive) and started the procedure again:

Same result: verification reported a wrong digest for one chunk of the same VM:
Code:
2023-06-14T07:22:41+02:00:   check drive-virtio1.img.fidx
2023-06-14T07:53:51+02:00: detected chunk with wrong digest.

Q: Why does this VM show the issue, with exactly one chunk repeatedly getting a "wrong digest" on different datastores, while all other VM backups verify OK? I assume the VM itself has an issue then?

Thanks for any pointers on how to debug this...
 
I'd run a memory test on the PBS side and also check the underlying storage for errors.
 
Thanks for the reply. As I changed to a different datastore and got the same issue, I doubt the datastore is the cause.

Regarding the "memory test on the PBS side": are you suggesting a physical memory test? Wouldn't bad memory on the PBS also cause other issues, such as kernel crashes?

Q: About this "digest calculation": are the digest values for the backup and for the later verification both calculated on the PBS? Or could different CPU archs, between the PVE host that runs the VM and starts the backup and the PBS that later does the verification, produce different digest values?

Thanks.
 
Regarding the "memory test on the PBS side": are you suggesting a physical memory test? Wouldn't bad memory on the PBS also cause other issues, such as kernel crashes?
Bad memory can show all sorts of behaviour.

Q: About this "digest calculation": are the digest values for the backup and for the later verification both calculated on the PBS? Or could different CPU archs, between the PVE host that runs the VM and starts the backup and the PBS that later does the verification, produce different digest values?
When a chunk is uploaded, the hash is first calculated by the client (not the PBS) and then uploaded along with the chunk. For non-encrypted chunks, the PBS re-verifies that hash before the chunk is written to disk (for encrypted chunks the PBS cannot do this, as it does not have the encryption key).
Then, on verify, it reads the chunk, recalculates the hash, and compares it.

If a PBS accepts an (unencrypted) chunk and a later verify fails, one of the following things happened:
* the memory is bad, causing the verify or the initial check to be faulty
* the storage sometimes returns bad data
* the chunk was modified on disk
* the chunk suffered silent data corruption
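
The flow described above can be illustrated with a small Python sketch. It is deliberately simplified: it hashes raw in-memory bytes with SHA-256 (the digests in the log are 64 hex characters, consistent with SHA-256) and ignores the actual on-disk chunk format, compression, and encryption. All function names are illustrative and not part of any PBS API.

```python
import hashlib


def chunk_digest(data: bytes) -> str:
    """Hex SHA-256 digest of a chunk's raw payload."""
    return hashlib.sha256(data).hexdigest()


def server_accepts(data: bytes, client_digest: str) -> bool:
    """Upload-time check: recompute the hash and compare with the client's."""
    return chunk_digest(data) == client_digest


def verify_stored(stored: bytes, expected_digest: str) -> bool:
    """Verify-time check: re-read the chunk and recompute its hash."""
    return chunk_digest(stored) == expected_digest


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate a single-bit error, e.g. from bad RAM or storage."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


payload = b"example chunk payload" * 1000
digest = chunk_digest(payload)  # computed on the client side

accepted = server_accepts(payload, digest)   # upload check passes
ok_later = verify_stored(payload, digest)    # unchanged data verifies
bad_later = verify_stored(flip_bit(payload, 12345), digest)  # one flipped bit fails
```

A single flipped bit anywhere between the upload check and the verify read is enough to change the digest. The hash function itself is deterministic, so the same bytes produce the same digest regardless of CPU architecture.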
 
Hey,

to close this thread: the memory test on the system did not show any issues. But after moving PBS to a different system (with the same storage but different CPU arch), the issue no longer appeared.

So, either the (memory) issue was not detectable by the memory/hardware test, or I hit some software/hardware incompatibility (sadly, there were no kernel or other log messages).

Recap: Solved by moving to different hardware (with 100% the same software config).

Thanks for the help.
 
Great that you fixed it, but what do you mean by:
(with the same storage but different CPU arch)
Did you change from Intel to AMD, for example? Which processor did you use before and after?
 
We changed from AMD GX-420 to Intel i7-8559U. But of course this also changed all components, motherboard, RAM etc. Only the hard drive stayed the same and was moved to the Intel system.

So, we cannot really say if the AMD system had some hardware "issues" (possible, but at least the memory test did not show anything) or if we had some software incompatibilities (without log messages)
 
