Windows Server 2022 VM and vioscsi device resets

Sébastien Riccio · Jan 18, 2024

Hello,

I'm trying to troubleshoot an issue we have with a Windows VM which periodically has trouble accessing its disks.

In the Windows system log, I can see that some attached disks (not always the same ones) start being reset by the vioscsi driver.

The result is that the mounted disk becomes inaccessible inside the VM, and any attempt to access it results in Explorer freezing, etc. The only solution is to kill the VM (hard stop) and start it again.

Proxmox version is the latest.

The VM runs Windows Server 2022 and Exchange Server 2019. It has a system disk and 8 other disks attached for the Exchange databases and their transaction logs, formatted in ReFS as suggested by MS. The VM has the latest Windows virtio guest tools installed.

The disks are qcow2 files hosted on an NFS share on a quite powerful Huawei SAN/NAS. I have not noticed any performance issue with the VM image file storage.

The full VM config is here:

Code:

agent: 1
balloon: 0
bios: ovmf
boot: order=scsi0
cores: 8
cpu: SandyBridge
efidisk0: Huawei_NFS:404/vm-404-disk-0.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
machine: pc-q35-8.1
memory: 32768
meta: creation-qemu=8.0.2,ctime=1692075358
name: exchange22019
net0: virtio=9E:4A:1F:20:27:A1,bridge=vmbr0,firewall=1,tag=12
numa: 0
onboot: 1
ostype: win11
protection: 1
scsi0: Huawei_NFS:404/vm-404-disk-1.qcow2,cache=writeback,discard=on,iothread=1,size=150G,ssd=1
scsi1: Huawei_NFS:404/vm-404-disk-4.qcow2,cache=writeback,discard=on,iothread=1,size=500G,ssd=1
scsi2: Huawei_NFS:404/vm-404-disk-5.qcow2,cache=writeback,discard=on,iothread=1,size=500G,ssd=1
scsi3: Huawei_NFS:404/vm-404-disk-6.qcow2,cache=writeback,discard=on,iothread=1,size=500G,ssd=1
scsi4: Huawei_NFS:404/vm-404-disk-7.qcow2,cache=writeback,discard=on,iothread=1,size=500G,ssd=1
scsi5: Huawei_NFS:404/vm-404-disk-8.qcow2,cache=writeback,discard=on,iothread=1,size=20G,ssd=1
scsi6: Huawei_NFS:404/vm-404-disk-9.qcow2,cache=writeback,discard=on,iothread=1,size=20G,ssd=1
scsi7: Huawei_NFS:404/vm-404-disk-10.qcow2,cache=writeback,discard=on,iothread=1,size=20G,ssd=1
scsi8: Huawei_NFS:404/vm-404-disk-11.qcow2,cache=writeback,discard=on,iothread=1,size=20G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=25dc92bb-9161-4df1-bb54-a9d87674366c
sockets: 1
vmgenid: 8587649c-8b96-49bf-837e-878349ff8421

I also noticed that the vioscsi resets begin to occur a bit after the scheduled Proxmox backups to a PBS run. The backups run every night, but the issue does not happen every night.

I had read that backups can be heavy on I/O and that lowering the backup's I/O priority was suggested, so I went ahead and set these settings (performance and ionice):

Code:

root@pm:~# cat /etc/vzdump.conf
# vzdump default settings

performance: max-workers=1
ionice: 8
#tmpdir: DIR
#dumpdir: DIR
#storage: STORAGE_ID
#mode: snapshot|suspend|stop
#bwlimit: KBPS
#performance: [max-workers=N][,pbs-entries-max=N]
#lockwait: MINUTES
#stopwait: MINUTES
#stdexcludes: BOOLEAN
#mailto: ADDRESSLIST
#prune-backups: keep-INTERVAL=N[,...]
#script: FILENAME
#exclude-path: PATHLIST
#pigz: N
#notes-template: {{guestname}}

Still, the issue remains, and we have no choice but to kill the VM, as a normal shutdown gets stuck waiting for disk access...
We then lose CBT for the VM, and the next backup needs a full read of the disks...

Any suggestions or ideas about what else I could try in order to find the origin of the issue are welcome.

Thanks a lot!

sb-jw · Jan 18, 2024

Have you already read this? -> https://github.com/virtio-win/kvm-guest-drivers-windows/issues/623

Sébastien Riccio · Jan 18, 2024

sb-jw said:
Have you already read this? -> https://github.com/virtio-win/kvm-guest-drivers-windows/issues/623

Oh, thank you. A very interesting read.

Especially this recent part about backups to PBS:
https://github.com/virtio-win/kvm-guest-drivers-windows/issues/623#issuecomment-1880928878

I also read there that this could be a workaround:

I am aware that there is the possibility with QEMU to do image fleecing nowadays and also abort a backup if a copy-before-write operation hits a given timeout (cbw-timeout+on-cbw-error=break-snapshot options). Those should also work around the issue, but we haven't gotten around to implementing those yet.

Any idea if these options can be activated somehow? Otherwise, would we have no option other than to disable backups?

Our PBS server storage is based on a ZFS pool of around 10 mirror vdevs (kind of like RAID 10, I guess) and I haven't seen any throughput issue with it, but it's still probably related, maybe due to latency from the mechanical drives.

Kind regards.

Sébastien Riccio · Jan 18, 2024

It seems this happens to almost every VM we back up this way, but the others are Linux VMs and they seem to be a bit more robust about it.
I do see hung tasks in those VMs' dmesg output, but in the end they recover and continue operating.

What I don't understand is that we have some VMs using disk images on iSCSI that are backed up the same way, yet they don't seem to be affected. How is that possible?

How can the backup process even initiate a snapshot for the VMs with iSCSI-based disk images, when these VMs can't be snapshotted (there is no snapshot support for images on iSCSI)?

Hmm, if I understand what I read correctly: during the backup, a copy-before-write technique is used, and this requires the backup server to ACK the copied data before the new data can be written to the VM storage, which leads to timeouts if the PBS is slow to ACK.
But what if the PBS crashes, for example, while a backup is running? Will that render the VM unusable?
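
To picture that understanding, here is a toy model only, not QEMU code. It assumes that a guest write to a block that has not been backed up yet first has to wait for the old block to reach the backup target. The function name and the latency numbers are made up for illustration.

```python
# Toy model of copy-before-write (CBW); not QEMU code.
# Assumption: a guest write to a block that is not yet backed up must first
# wait until the old content of that block has reached the backup target.

def guest_write_latency_ms(block, backed_up, storage_ms, target_ms):
    """Return the latency the guest would see for one write to `block`."""
    latency = 0.0
    if block not in backed_up:
        # The old data has to be copied out before it may be overwritten.
        latency += target_ms
        backed_up.add(block)
    # The actual write to the VM storage.
    latency += storage_ms
    return latency


backed_up = set()
# Hypothetical numbers: 2 ms for the VM storage, 500 ms for a slow backup target.
first = guest_write_latency_ms(7, backed_up, storage_ms=2, target_ms=500)
second = guest_write_latency_ms(7, backed_up, storage_ms=2, target_ms=500)
print(first, second)
```

In this model, only the first write to a block pays for the backup target's latency. If the target never answers, that first write never completes, which would match the stuck disks seen from inside the guest.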

I can see this being quite a big problem: the backup itself becomes a possible source of trouble for the running VM.
Am I right about this? If so, isn't there another way to do backups to PBS without taking this risk?

Thanks a lot for your comments.

sb-jw · Jan 18, 2024

Since you mention that it's always after a backup and you have iothread enabled, you may have caught the other bug too. More info: https://forum.proxmox.com/threads/vms-hung-after-backup.137286/

The thread in short: disable iothread and see if it occurs again.
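
For a VM with many disks, such as the config posted above, the change amounts to flipping `iothread=1` to `iothread=0` on each `scsiN:` line. A minimal sketch of that rewrite on a config string follows; `disable_iothread` is a hypothetical helper, not a Proxmox tool, and it only transforms text rather than touching a real VM config.

```python
import re


def disable_iothread(config_text):
    """Return config_text with iothread=1 changed to iothread=0 on scsiN disk lines."""
    out = []
    for line in config_text.splitlines():
        # Only touch disk lines such as "scsi0:", not "scsihw:".
        if re.match(r"scsi\d+:", line):
            line = re.sub(r"\biothread=1\b", "iothread=0", line)
        out.append(line)
    return "\n".join(out)


# Two lines taken from the VM config posted earlier in the thread.
sample = (
    "scsi0: Huawei_NFS:404/vm-404-disk-1.qcow2,cache=writeback,discard=on,"
    "iothread=1,size=150G,ssd=1\n"
    "scsihw: virtio-scsi-single"
)
result = disable_iothread(sample)
print(result)
```

The same change can also be made per disk in the GUI, which is probably the safer route on a production VM.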

Sébastien Riccio · Jan 18, 2024

Once again, thank you.

It starts happening during the backup, so I would guess it's more likely linked to the first issue you suggested. But since some VMs sometimes seem unable to recover after the backup, it might be worth trying to disable iothread, yes.

We use iothread for all of our VMs as it seemed to be the recommended setting.

Sébastien Riccio · Jan 19, 2024

I can see that many people are facing this issue and some are suggesting to use fleecing, supported by qemu.

https://www.mail-archive.com/pve-devel@lists.proxmox.com/msg09294.html
https://bugzilla.proxmox.com/show_bug.cgi?id=3086

Is that something on the roadmap?

Really, in the current state, running a backup is dangerous: the VM's performance and health during a backup depend mainly on the backup server's performance, and the backup server suddenly becoming unavailable in the middle of a backup will even freeze a VM... :/

fiona · Jan 19, 2024

Hi,

Sébastien Riccio said:
I can see that many people are facing this issue and some are suggesting to use fleecing, supported by qemu.

https://www.mail-archive.com/pve-devel@lists.proxmox.com/msg09294.html
https://bugzilla.proxmox.com/show_bug.cgi?id=3086

Is that something on the roadmap?

It's not actually fully supported in upstream QEMU yet, but it is currently being evaluated, see https://bugzilla.proxmox.com/show_bug.cgi?id=4136#c7

Sébastien Riccio · Jan 19, 2024

Hello Fiona,

Thanks for the feedback and for the bugzilla post.

OK, I understand that there are things missing upstream before it can be implemented in Proxmox. Hopefully it will be added.

I have a question about on-cbw-error and cbw-timeout:
https://www.mail-archive.com/qemu-devel@nongnu.org/msg880571.html

Is that something we can enable (even by manually editing config files)?
It looks to me like it would at least preserve the VMs, at the cost of aborted backups?

Thanks again!

fiona · Jan 19, 2024

Sébastien Riccio said:
Hello Fiona,

Thanks for the feedback and for the bugzilla post.

OK, I understand that there are things missing upstream before it can be implemented in Proxmox. Hopefully it will be added.

I have a question about on-cbw-error and cbw-timeout:
https://www.mail-archive.com/qemu-devel@nongnu.org/msg880571.html

That would need to be changed inside QEMU, in the implementation of the backup command, which sets up a backup job. Unfortunately, the cbw settings are designed for fleecing backups and apply only to writes to the fleecing image (i.e. they help when there are issues with the fleecing storage), not to writes to the backup target storage itself. Covering that case would need code changes to the backup job implementation in QEMU.
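
As a rough illustration of what cbw-timeout and on-cbw-error are meant to decide for a copy to the fleecing image, here is a toy model. It is not QEMU code: `handle_cbw` and the numbers are hypothetical, a timeout of 0 is assumed to mean "no limit", and the two policy names are assumed to mirror QEMU's on-cbw-error values.

```python
# Toy model of the cbw-timeout / on-cbw-error decision; not QEMU code.
# Assumptions: a timeout of 0 means "no limit", and the policy names mirror
# QEMU's on-cbw-error values.

def handle_cbw(copy_seconds, cbw_timeout, on_cbw_error):
    """Return (guest_write_ok, snapshot_valid) for one copy-before-write."""
    timed_out = cbw_timeout > 0 and copy_seconds > cbw_timeout
    if not timed_out:
        # The copy finished in time: guest write and backup both proceed.
        return True, True
    if on_cbw_error == "break-snapshot":
        # The guest keeps running, but the backup has to be aborted.
        return True, False
    if on_cbw_error == "break-guest-write":
        # The guest write fails, but the backup stays consistent.
        return False, True
    raise ValueError(f"unknown policy: {on_cbw_error}")


# Hypothetical numbers: a 60 s copy against a 45 s timeout.
slow = handle_cbw(copy_seconds=60, cbw_timeout=45, on_cbw_error="break-snapshot")
fast = handle_cbw(copy_seconds=1, cbw_timeout=45, on_cbw_error="break-snapshot")
print(slow, fast)
```

With break-snapshot, the slow case keeps the guest alive at the cost of the backup, which is the trade-off asked about. Again, this only covers the copy to the fleecing image, not the transfer to the backup target.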
