High virtio-scsi I/O in one VM crashes all VMs using virtio-scsi

benyamin

New Member
Oct 3, 2022
Hi everyone,

I've been having a performance issue with virtio-scsi for some time after migrating from ESX last year, with the VM becoming unresponsive and sometimes causing other VMs using virtio-scsi to hang too.

I previously only saw the issue with Windows VMs during Windows updates, usually the large monthly cumulative OS updates. I note many threads here on the forums related to this, e.g. this one. These seemed to be more of a distraction around installation of the drivers, which wasn't an issue for me on the VMs affected at the time.

Digging deeper, I began to suspect io_uring, so tried native, and also with and without iothreads but to no avail. FWIW, I ended up settling on aio=io_uring and iothread=1.

Anyway, I still had to update one system at a time, and often shut down other systems using virtio-scsi so that they wouldn't become unresponsive. Fast forward to today: I had a VM crash during an update, and the disk check on reboot could not finish before the VM crashed again. This caused several other virtio-scsi systems to hang too.

So, I took it upon myself to sort it out. I ended up upgrading all virtio drivers and clients to the latest/stable (virtio-win-0.1.229). Two legacy 2012r2 systems which previously only had netkvm drivers manually installed via device manager were also successfully upgraded to the full package including the guest agent after enabling TESTSIGNING and installing the RedHat cert (and deselecting the qemupciserial driver in the installer). This also effectively migrated them from VMware PVSCSI to virtio-scsi (to be clear, I explicitly migrated to virtio-scsi).

This "addition" of two more VMs to use virtio-scsi seems to be a tipping point in reliability. Any significant I/O, e.g. an update, or a disk check results in all of the virtio-scsi VMs hanging. If the I/O is prolonged, all VMs, even those using IDE, SATA and PVSCSI hang as well. I also had "failed to convert unwritten extents to written extents -- potential data loss!" on the console / syslog and had to reset the server to recover.

During my earlier research, I came across the many threads here and abroad regarding detect-zeroes=unmap and I understand that a patch was released regarding the BDRV_REQ_REGISTERED_BUF flag per the posts by @fiona in those places. I had been waiting for the patched pve-qemu-kvm to be released to see if that helped at all, but it doesn't appear to have had any effect on my issue.

From running ps aux | grep kvm I see that all my VMs are using detect-zeroes=on.
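For anyone wanting to repeat that check across several VMs, here is a minimal Python sketch that pulls the detect-zeroes value for each -drive out of a kvm command line. The function name and the sample command line are illustrative assumptions (the sample is modelled on the config further down), and the parsing is naive: it does not handle QEMU's ",," escaping for commas inside values.

```python
import shlex


def drive_detect_zeroes(cmdline: str) -> dict[str, str]:
    """Map each -drive id to its detect-zeroes value (hypothetical helper)."""
    args = shlex.split(cmdline)
    result = {}
    # Walk adjacent argument pairs and pick out "-drive <options>".
    for flag, value in zip(args, args[1:]):
        if flag != "-drive":
            continue
        # Options are comma-separated key=value pairs; skip bare flags.
        opts = dict(item.split("=", 1) for item in value.split(",") if "=" in item)
        result[opts.get("id", "?")] = opts.get("detect-zeroes", "(not set)")
    return result


# Illustrative command line, not a full real one.
sample = (
    "/usr/bin/kvm -id 300 "
    "-drive file=/mnt/mirror/images/300/vm-300-disk0.raw,if=none,id=drive-scsi0,"
    "aio=io_uring,format=raw,cache=none,detect-zeroes=on"
)
print(drive_detect_zeroes(sample))  # {'drive-scsi0': 'on'}
```

On a live host the command line could instead be read from /proc/<pid>/cmdline, where the arguments are NUL-separated rather than space-separated, so the splitting step would need adjusting.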

I only use RAW devices. I would like to try detect-zeroes=off to see what happens. Is this possible by passing some args:, if so what would they look like?

The relevant part of the running config is as follows:
Code:
-device virtio-scsi-pci,id=virtioscsi0,bus=pci.3,addr=0x1,iothread=iothread-virtioscsi0
-drive file=/mnt/mirror/images/300/vm-300-disk0.raw,if=none,id=drive-scsi0,aio=io_uring,format=raw,cache=none,detect-zeroes=on
-device scsi-hd,bus=virtioscsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100
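To make the intended change concrete, the following Python sketch rewrites one key in a -drive option string, here flipping detect-zeroes from on to off. The helper name set_drive_option is hypothetical and the code only manipulates the string. It does not apply anything to a VM, and whether an args: line can override the -drive that Proxmox generates is exactly the open question above.

```python
def set_drive_option(drive_opts: str, key: str, value: str) -> str:
    """Return drive_opts with key set to value (hypothetical helper).

    Replaces an existing key=... entry in place, or appends one if absent.
    Naive: does not handle QEMU's ',,' escaping inside values.
    """
    out = []
    replaced = False
    for part in drive_opts.split(","):
        if part.split("=", 1)[0] == key:
            out.append(f"{key}={value}")
            replaced = True
        else:
            out.append(part)
    if not replaced:
        out.append(f"{key}={value}")
    return ",".join(out)


current = (
    "file=/mnt/mirror/images/300/vm-300-disk0.raw,if=none,id=drive-scsi0,"
    "aio=io_uring,format=raw,cache=none,detect-zeroes=on"
)
print(set_drive_option(current, "detect-zeroes", "off"))
```

As far as I understand, args: appends extra arguments rather than replacing the generated ones, so it may be simpler to look for a per-drive setting. I believe Proxmox drive definitions accept a detect_zeroes boolean, but treat that as unverified and check the qm documentation first.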

All my disk images are in RAW format sitting on an ext4 formatted LVM volume group. Hardware is an LSI MegaRAID SAS 9260-8i controller with 512MB cache.

I'm running the latest pve no-subscription.

At this point, any advice would be appreciated.

Thanks in advance,
Ben
 
Hi there, this has also caused me hours of frustration. I was doing drive-by-drive passthrough to a TrueNAS Core installation; following the update to 7.4, the VM would crash after about 10 seconds. I finally traced it to the VirtIO controller. Switching to the VMware PVSCSI controller in the VM's hardware settings has solved the problem for the moment and given me a working VM.
 
I haven't had a chance to experiment further yet, but patch Tuesday approaches...

I have four VMs still using virtio-scsi (one windows, one debian and two HardenedBSD), two run PVSCSI and the rest (8 to 15 odd at times) are all IDE for now. I'd like to sort this out and move what I can to virtio-scsi if possible.

I'm wondering if this has anything to do with using monolithic flat disks rather than sparse disks...

In my case, I imported from ESX (where I used monolithic flat disks), and in doing so I:
  1. Converted the vmdk images to the raw image format using qemu-img convert with the -S 0 option (SPARSE_SIZE = 0); &
  2. Copied the raw images using cp --sparse=never.
Both of these ensure the image remains a fully allocated monolithic flat disk.
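One way to sanity-check that an image really is fully allocated is to compare its apparent size with the blocks the filesystem reports as allocated. This is a minimal Python sketch. The function names are my own, and it assumes Linux, where st_blocks counts 512-byte units. Filesystems with compression or deduplication may report fewer blocks even for fully written files.

```python
import os


def allocated_fraction(size_bytes: int, blocks_512: int) -> float:
    """Fraction of the apparent size backed by allocated blocks (hypothetical helper)."""
    if size_bytes == 0:
        return 1.0  # an empty file has nothing left to allocate
    return (blocks_512 * 512) / size_bytes


def is_fully_allocated(path: str) -> bool:
    """True if the file at path has no holes, judged by block count alone."""
    st = os.stat(path)
    # Assumption: st_blocks is in 512-byte units, as on Linux.
    return allocated_fraction(st.st_size, st.st_blocks) >= 1.0
```

Calling is_fully_allocated() on an image path should return True for a flat image and False for a sparse one; the same comparison can be done by eye with ls -ls or du versus du --apparent-size.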

This is one reason why I would like to try using a drive device with detect-zeroes=off.

@otymm, I'm curious if your passthrough device would be treated the same way, i.e. as fully allocated. Any chance you could revert to virtio and execute ps aux | grep kvm to confirm whether the relevant drive device has detect-zeroes=on? Maybe you already did...

I've noticed the Proxmox default appears to be a sparse format, e.g. if I grow an image it does not physically allocate the disk space, it "converts" the disk to a monolithic sparse format.

Anyone else care to weigh in?

EDIT: mixed up my detect-zeroes=on. This post is good.
 
I’ll have a look when I get home.

As a quick update, it ended up crashing with PVSCSI too (though significantly later, and only when I started conducting heavy IO). Since then, I went back to virtio and installed the 6.2 edge kernel, and it’s been okay.
 
@otymm, just curious, what CPUs are you running? Still running well?
 
