Data corruption in Windows Server VM

Andreas_H
Jan 6, 2021
Hello,

I ran into a strange data corruption issue which I need help with. The VM in question is Windows Server 2016 Standard, used primarily as a file server.

On this VM, I found files that can no longer be read or copied, not even on the VM itself. All I get is "the file is unreadable or damaged", something I would expect from a defective physical disk. Since these files had been downloaded, I downloaded them again and replaced the defective ones. That worked that day, but a few days later I found the same files defective again!
I began to take a closer look and ran a PowerShell script that tries to read all files on that disk. It revealed ~300 defective files, scattered across all directories. I will run "chkdsk /r" tonight and see what it reports.
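
A read test of this kind can be sketched as follows. This is a minimal sketch in Python, not the PowerShell script actually used (which is not shown in this thread); the function name `find_unreadable_files` and the chunk size are assumptions made for illustration.

```python
import os


def find_unreadable_files(root, chunk_size=1024 * 1024):
    """Walk `root` and return (path, error) pairs for files that cannot be read in full.

    Hypothetical helper for illustration; not the script from the original post.
    """
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    # Read to end of file so that every block of the file is touched.
                    while handle.read(chunk_size):
                        pass
            except OSError as exc:
                # Covers I/O errors reported by the OS as well as missing or locked files.
                failures.append((path, str(exc)))
    return failures
```

Calling `find_unreadable_files("D:\\")` inside the guest would list every file that raises an error while being read. It only detects files the OS refuses to read; files that read without error but contain garbage need a checksum comparison instead.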

Even worse, I looked at the backups on our PBS server. In addition to the defective files, I found more files in the backups that can be downloaded using file restore but contain only garbage, while the original on the server is still readable and OK. The SHA256 sums of original and backup differ. I have additional backups on a USB disk, but since these cannot restore individual files, I would have to restore them completely to another VM to take a look. I will try that tonight too.
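
The comparison of original and restored file can be sketched like this. It is a minimal Python sketch, assuming both copies are reachable as local paths; `sha256_of_file` and `files_match` are hypothetical helper names, not part of any PBS tooling.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_match(original_path, restored_path):
    """True if the original and the restored copy have identical contents."""
    return sha256_of_file(original_path) == sha256_of_file(restored_path)
```

A `False` result only says that the two copies differ. It does not say which of them is the damaged one.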

Both problems are restricted to this VM disk: I ran the PowerShell script on other VMs and checked their backups, and everything is fine. This is the second disk of this VM; the first one, containing the OS, is also fine.

The disk resides on a ZFS mirror made up of two NVMe SSDs. 'zpool status' says everything is OK, and 'zpool scrub' also finds no problems. Backup verification on the PBS is on and also says everything is OK. The VM is configured with VirtIO SCSI (not single), the disk is VirtIO block with discard on. VirtIO driver version is 100.74.104.14100 (I guess that is release 0.1.141?).

What freaks me out most is the backup corruption. Until now I thought that verification on the PBS made sure that everything matches the source, but obviously that is not always true.

Any hints really welcome!

Thanks,
Andreas
 
ECC RAM is already in place. Running memtest that long is difficult, since this is a single-node production server that can't easily be taken offline. The same applies to replacing a disk.

I am still looking for an explanation of why the ZFS scrub didn't catch this, if this is an actual hardware issue.
 
What filesystem are you using on that Windows VM?

Generally, ZFS can only catch such mistakes if they occur after writing to the disk; if the data already arrives there corrupted, ZFS cannot tell whether any corruption occurred. Are you using ECC RAM on both servers (PBS/PVE)?
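
To illustrate the point, here is a toy model in Python of a store that keeps a checksum next to each block. It is only a sketch of the general idea and not ZFS code; the class `ChecksummedStore` and its methods are made up for this example, and SHA-256 is a stand-in that says nothing about the checksum configured on the pool.

```python
import hashlib


class ChecksummedStore:
    """Toy block store that records a checksum at write time (not ZFS itself)."""

    def __init__(self):
        self._blocks = {}

    def write(self, key, data):
        # The checksum is computed over whatever bytes arrive here.
        self._blocks[key] = (bytes(data), hashlib.sha256(data).digest())

    def verify(self, key):
        # A scrub-like check: recompute the checksum and compare with the stored one.
        data, checksum = self._blocks[key]
        return hashlib.sha256(data).digest() == checksum

    def corrupt_at_rest(self, key):
        # Simulate damage after the write: flip the first byte, keep the old checksum.
        data, checksum = self._blocks[key]
        self._blocks[key] = (bytes([data[0] ^ 0xFF]) + data[1:], checksum)


store = ChecksummedStore()

# Data damaged before it reaches the store: the checksum matches the damaged bytes.
store.write("damaged-in-flight", b"Xmportant file contents")
print(store.verify("damaged-in-flight"))  # True

# Data damaged after it was stored: the checksum no longer matches.
store.write("damaged-at-rest", b"important file contents")
store.corrupt_at_rest("damaged-at-rest")
print(store.verify("damaged-at-rest"))  # False
```

In this model, a clean verification result only shows that the stored bytes are the bytes that were handed to the store, not that they are the bytes the application meant to write.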
 
Yes, ECC RAM on both servers - PVE is a ProLiant ML350 Gen10, PBS is an ML350 Gen6.

Moving the virtual disk - I will give that a try, thanks!

Generally, ZFS can only catch such mistakes if they occur after writing to the disk; if the data already arrives there corrupted, ZFS cannot tell whether any corruption occurred.
I understand that, but if garbage gets written to disk I should at least be able to read that garbage back instead of not being able to read it at all, right?
 
Since there is an abstraction layer in between, the corrupted data that was written could be important data for the filesystem in your VM. If metadata is corrupted, for instance, the file can become completely unreadable.

What filesystem are you using inside the Windows VM?
Is this the only disk on that zpool, or are there other disks without issues? If there are other disks, what guest OS / filesystem are they using?
 
Okay, I get what you mean, and that could well be an explanation. The disk is using NTFS, and there are 4 other Windows VMs with NTFS disks on the same zpool that don't have any problems. There are also 5 LXC Containers with ext3-formatted volumes on that zpool, also without problems.
 
4 other Windows VMs
How does the one affected VM compare with the 4 other VMs? First from the Proxmox point of view (disk drivers etc.), then from within Windows. Finally, are the VMs doing similar things within Windows?

Basically compare & contrast.

My hunch is that corruption is taking place from within Windows not Proxmox.
 
I would run an offline chkdsk on the next boot (if a moderate amount of downtime is acceptable). If you’re lucky, Windows will detect corruption in the USN journal, MFT or volume bitmap. I would also check whether the virtual 2016 server tried to run defrag in the past (unlikely but not impossible).
 
If the backup to PBS is too slow, corruption can occur within the guest.
There is a fleecing option to work around this.
 
How does the one affected VM compare with the 4 other VMs? First from the Proxmox point of view (disk drivers etc.), then from within Windows. Finally, are the VMs doing similar things within Windows?

Basically compare & contrast.

My hunch is that corruption is taking place from within Windows not Proxmox.

Exactly my thought - if there's only one VM causing issues out of multiple VMs that are located on that storage then it's likely to come from within the VM. Does the Windows Event Log contain anything interesting?
 