NTFS data corruption in windows VMs

antubis · Aug 1, 2025

Hello,

we are experiencing strange NTFS data corruption problem within some of our VMs.
The corruption only occours occasionally on the "pool disks" of our archiving application VMs. At some point Windows refuses to use the disks until a chkdsk repair is done. Although this is working in general some files are mostly damaged afterwards which is not acceptable permanently.

The disks are about 3T and (mostly) under some kind of load.
The application works like this:
* receiving and saving the customer data to pool disk
* reading the data again and archiving/writing it to tape
* deleting the data from disk

Our configuration:
* 3 nodes with Proxmox 8.4
* Ceph 18.2.4 with ~150T total disk space (nvme)
* EC pool with ~90T usable space
* affected VMs:
** all Windows Server 2022
** all disks are RBD images and accessed via the "VirtIO SCSI single" controller
** 500G system disk (no problems so far)
** 2,5T or 3T pool/data disk (problems occour irregularly)

As the problems occour only irregularly it's really hard to debug them. As a first step we deactivated the writeback cache on the pooldisks which seems to at least have reduced the frequency of the problems - but this could also only be a conincidence.
On one node we tried changing the file system to ReFS and so far the disk had no more problems, which of course could only be a coincidence. As my colleagues had heavy other problems with ReFS in the past we are still very reluctant to fully change to it right now.

The ceph itself is running fine and neither the system disks of affected VMs nor any other VMs (Linux and Windows) seem to have these problems.

*Did anybody run into similar problems and has a good solution or at least a workaround for them?
* Has anybody some experience with ReFS and would you recommend to use it?

leesteken · Aug 1, 2025

Someone else also noticed that: https://forum.proxmox.com/threads/disk-cache-none-safe-or-not-safe.168841/#post-787678

jdancer said:
Using anything else other than none/no cache with Ceph cause VM migration corruption issues.

_gabriel · Aug 1, 2025

Proxmox Backup Server used ?
VirtIO scsi Controller Windows driver version ?
double check version from Windows Device Manager, has to be v266+

PwrBank · Aug 1, 2025

Anecdotal, but we've had that same issue with ESXi as well. Occasionally we'll get a notification about a disk being full, go to the file structures and it's no where near being that full. Run chkdsk on it and bam, reports the correct storage. Figured it was some issue with the disk driver in ESXi, but maybe it's something else.

This will happen on disks that are literally never touched for any reason, ones where a tech long ago made an "E" drive that's 100GB of something, but never actually used. We'll get an alert that the drive is filled up, go and look, totally empty... Run chkdsk, and bam, it's back to 0% use.

NTFS and ReFS do this.

antubis · Aug 4, 2025

_gabriel said:
Proxmox Backup Server used ?
VirtIO scsi Controller Windows driver version ?
double check version from Windows Device Manager, has to be v266+

- yeah. we use a PBS for our backups
- the drivers were recently updated to 271

these VMs aren't migrated between hosts as they have their fixed FC-HBA (pci) for archiving the data to tape.
however, with my former employer i managed a way bigger cluster with writeback cache on and never fell into this kind of problems even with regurlarly migrating some of the VMs.

_gabriel · Aug 4, 2025

antubis said:
The corruption only occours occasionally

is after a backup to Proxmox Backup Server ?
is Proxmox Backup Server located on LAN ?

antubis · Aug 4, 2025

_gabriel said:
is after a backup to Proxmox Backup Server ?
is Proxmox Backup Server located on LAN ?

maybe. timing is not really predictable. the corruption starts with some single entries in the eventlog and continues until windows disables the disk completely. i'll try to investigate more detailed...

PBS is in the same lan segment as the PVE cluster, currently connected with 2x10G LACP.

UPDATE:
we had the issue again. as for timing:
- the backup via PBS started yesterday evening at 20h (and took ~8min) - at this time there were no ntfs errors found in the eventlog, only a volsnap error:

- the nfts errors started today at 9:23, so I'm not sure this is related to backup

_gabriel · Aug 4, 2025

Try "Fleecing" PVE option backup https://pve.proxmox.com/pve-docs/chapter-vzdump.html#_vm_backup_fleecing
or Try disabling VM backup to PBS for a day.

antubis · Aug 20, 2025

Fleecing actually looks good so far, backup runs feel smoother.
However, we decided to migrate the disks to ReFS beforehand which already helped with the corruption issue.

Search

Search

NTFS data corruption in windows VMs

antubis

Renowned Member

leesteken

Distinguished Member

_gabriel

Famous Member

PwrBank

Active Member

antubis

Renowned Member

_gabriel

Famous Member

antubis

Renowned Member

_gabriel

Famous Member

antubis

Renowned Member

We value your privacy