Hi everyone,
Since migrating from ESX last year, I've been having a performance issue with virtio-scsi: a VM becomes unresponsive and sometimes causes other VMs using virtio-scsi to hang too.
Previously I only saw the issue with Windows VMs during Windows updates, usually the large monthly cumulative OS updates. I note there are many threads on the forums related to this, e.g. this one. Those seemed to focus mostly on driver installation, which wasn't a problem for me on the affected VMs at the time.
Digging deeper, I began to suspect io_uring, so I tried aio=native, both with and without iothreads, but to no avail. FWIW, I ended up settling on aio=io_uring and iothread=1.
Anyway, I still had to update one system at a time, and often had to shut down other systems using virtio-scsi so that they wouldn't become unresponsive. Fast forward to today: I had a VM crash during an update, and the disk check on reboot could not finish before it crashed again. This caused several other virtio-scsi systems to hang too.
So I took it upon myself to sort it out. I upgraded all virtio drivers and clients to the latest stable release (virtio-win-0.1.229). Two legacy 2012r2 systems, which previously only had the netkvm drivers installed manually via Device Manager, were also successfully upgraded to the full package, including the guest agent, after enabling TESTSIGNING and installing the Red Hat cert (and deselecting the qemupciserial driver in the installer). This also moved them from VMware PVSCSI to virtio-scsi (to be clear, I migrated them to virtio-scsi explicitly).
This "addition" of two more VMs using virtio-scsi seems to have been a tipping point in reliability. Any significant I/O, e.g. an update or a disk check, results in all of the virtio-scsi VMs hanging. If the I/O is prolonged, all VMs hang as well, even those using IDE, SATA and PVSCSI. I also got "failed to convert unwritten extents to written extents -- potential data loss!" on the console / syslog and had to reset the server to recover.
During my earlier research, I came across the many threads here and elsewhere regarding detect-zeroes=unmap, and I understand from @fiona's posts in those threads that a patch was released regarding the BDRV_REQ_REGISTERED_BUF flag. I had been waiting for the patched pve-qemu-kvm to be released to see if it helped at all, but it doesn't appear to have had any effect on my issue.
From running
ps aux | grep kvm
I see that all my VMs are using detect-zeroes=on. I only use RAW devices. I would like to try detect-zeroes=off to see what happens. Is this possible by passing some args:? If so, what would they look like?
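To make that check easier to repeat, here is a rough sketch that prints only the image file and detect-zeroes value for each drive, rather than the full command lines. The function name list_detect_zeroes is made up for illustration, and it assumes image paths contain no spaces and that the VM processes are named kvm, as in the ps output above.

```shell
# Print "<image file><TAB><detect-zeroes value>" for every -drive argument
# found in QEMU command lines read from stdin.
# Assumption: image paths contain no spaces (arguments are split on spaces).
list_detect_zeroes() {
    tr ' ' '\n' | grep 'detect-zeroes=' | while IFS= read -r drive; do
        # Each -drive argument is a comma-separated list of key=value options.
        file=$(printf '%s\n' "$drive" | tr ',' '\n' | sed -n 's/^file=//p')
        dz=$(printf '%s\n' "$drive" | tr ',' '\n' | sed -n 's/^detect-zeroes=//p')
        printf '%s\t%s\n' "$file" "$dz"
    done
}

# Usage on the host (assumes the VM processes are named "kvm"):
# ps -eo args | grep '[k]vm' | list_detect_zeroes
```

Any line that does not end in "on" is a drive running with a different setting.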
The relevant part of the running config is as follows:
Code:
-device virtio-scsi-pci,id=virtioscsi0,bus=pci.3,addr=0x1,iothread=iothread-virtioscsi0
-drive file=/mnt/mirror/images/300/vm-300-disk0.raw,if=none,id=drive-scsi0,aio=io_uring,format=raw,cache=none,detect-zeroes=on
-device scsi-hd,bus=virtioscsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100
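To be explicit about what I'm hoping to end up with: the same -drive line with only that one option changed. The sketch below is a text-only substitution to show the target line. It does not modify any VM or config file, the function name set_detect_zeroes_off is made up for illustration, and I don't know whether args: can actually produce this result.

```shell
# Rewrite detect-zeroes=on (or =unmap) to detect-zeroes=off in -drive
# arguments read from stdin. Text-only illustration; nothing is applied.
set_detect_zeroes_off() {
    sed -E 's/(^|,)detect-zeroes=(on|unmap)(,|$)/\1detect-zeroes=off\3/'
}

# The -drive line from the running config above, with the option flipped.
printf '%s\n' '-drive file=/mnt/mirror/images/300/vm-300-disk0.raw,if=none,id=drive-scsi0,aio=io_uring,format=raw,cache=none,detect-zeroes=on' | set_detect_zeroes_off
```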
All my disk images are in RAW format, sitting on an ext4-formatted LVM volume group. The hardware is an LSI MegaRAID SAS 9260-8i controller with 512 MB cache.
I'm running the latest PVE from the no-subscription repository.
At this point, any advice would be appreciated.
Thanks in advance,
Ben