I have been seeing VM disks go read-only at irregular intervals, roughly every 5-7 days. It only seems to happen to VMs with very large disks (4TB or more). I have a fairly large deployment with over 50 hypervisors.
The VM disks are on local RAID 1/RAID 10 storage, and I have seen it happen on both HDD and SSD. I have tried changing the SCSI controller and the device type (SATA/SCSI/VirtIO Block), to no avail. I have also tried the different cache settings (none, default, writeback, directsync), and the problem still occurs. Changing the async IO setting doesn't seem to help either.
This does not appear to be a host hardware problem: I have had it occur on Xeon Silvers, old i7s, etc. The host filesystem doesn't go read-only, the host runs fine, and there are no relevant logs on the host or in the guest. In most cases, just rebooting the server triggers an fsck and the problem is resolved automatically. In some cases, it boots into busybox and a manual fsck is required.
It "seems" that the problem started after kernel 6.5, but I am not certain about this. Any help or tips would be much appreciated.
Unfortunately, I can't test whether all VMs are affected, since in most cases these are single-VM PVE installations. However, in two cases the VM with the largest disk (6TB) went read-only while other VMs on the same storage and hypervisor continued working as if nothing had happened. Simply rebooting and running fsck on the affected VM fixed it.
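Since the guest leaves no useful logs, one way to at least pin down when a disk flips to read-only is a small check run periodically inside the guest. The following is only a minimal sketch, assuming a Linux guest where /proc/mounts is readable. The function names and the list of filesystem types are my own placeholders, not anything from PVE.

```python
from typing import List

# Assumption: filesystem types that normally live on a virtual disk.
# Extend this set to match what the guests actually use.
DISK_FS_TYPES = {"ext3", "ext4", "xfs", "btrfs"}


def readonly_mounts(mounts_text: str) -> List[str]:
    """Return mount points of disk-backed filesystems mounted read-only.

    mounts_text is the content of /proc/mounts: one mount per line,
    whitespace-separated fields (device, mount point, type, options, ...).
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        # Match the exact "ro" flag, not substrings such as "errors=remount-ro".
        if fs_type in DISK_FS_TYPES and "ro" in options.split(","):
            hits.append(mount_point)
    return hits


def check_guest(path: str = "/proc/mounts") -> List[str]:
    """Hypothetical helper: read the mount table from the running guest."""
    with open(path) as f:
        return readonly_mounts(f.read())
```

Calling check_guest() from a cron job or systemd timer and writing the result with a timestamp to a remote syslog target would give a rough time of failure, even if the local disk can no longer be written to. It does not explain the cause, it only narrows down when it happens.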