Windows Server 2022 VM and vioscsi device resets

Hello,

I'm trying to troubleshoot an issue we have with a Windows VM which periodically has trouble accessing its disks.

In the Windows System log, I can see that some of the attached disks (not always the same ones) start being reset by the vioscsi driver.

(attached screenshot: Windows System log showing the vioscsi reset events)

The result is that the mounted disk becomes inaccessible inside the VM, and any attempt to access it freezes Explorer, etc. The only solution is to kill the VM (hard stop) and start it again.

Proxmox version is the latest.

The VM runs Windows Server 2022 with Exchange Server 2019 and has a system disk plus 8 other disks attached for the Exchange databases and their transaction logs, formatted with ReFS as suggested by MS. The VM has the latest Windows virtio guest tools installed.

The disks are qcow2 files hosted on an NFS share on a quite powerful Huawei SAN/NAS. I have not noticed any performance issues with the VM image storage.
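
If it helps, I can also collect NFS latency numbers from the host during a backup window, e.g. with nfsiostat (which I believe ships with the nfs-common package on Debian); this is just a sanity check on my side, not something from the Proxmox docs:

Code:
# show per-mount NFS op rates, RTT and average execution time, refreshed every 5 seconds
nfsiostat 5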

The full VM config is here:

Code:
agent: 1
balloon: 0
bios: ovmf
boot: order=scsi0
cores: 8
cpu: SandyBridge
efidisk0: Huawei_NFS:404/vm-404-disk-0.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
machine: pc-q35-8.1
memory: 32768
meta: creation-qemu=8.0.2,ctime=1692075358
name: exchange22019
net0: virtio=9E:4A:1F:20:27:A1,bridge=vmbr0,firewall=1,tag=12
numa: 0
onboot: 1
ostype: win11
protection: 1
scsi0: Huawei_NFS:404/vm-404-disk-1.qcow2,cache=writeback,discard=on,iothread=1,size=150G,ssd=1
scsi1: Huawei_NFS:404/vm-404-disk-4.qcow2,cache=writeback,discard=on,iothread=1,size=500G,ssd=1
scsi2: Huawei_NFS:404/vm-404-disk-5.qcow2,cache=writeback,discard=on,iothread=1,size=500G,ssd=1
scsi3: Huawei_NFS:404/vm-404-disk-6.qcow2,cache=writeback,discard=on,iothread=1,size=500G,ssd=1
scsi4: Huawei_NFS:404/vm-404-disk-7.qcow2,cache=writeback,discard=on,iothread=1,size=500G,ssd=1
scsi5: Huawei_NFS:404/vm-404-disk-8.qcow2,cache=writeback,discard=on,iothread=1,size=20G,ssd=1
scsi6: Huawei_NFS:404/vm-404-disk-9.qcow2,cache=writeback,discard=on,iothread=1,size=20G,ssd=1
scsi7: Huawei_NFS:404/vm-404-disk-10.qcow2,cache=writeback,discard=on,iothread=1,size=20G,ssd=1
scsi8: Huawei_NFS:404/vm-404-disk-11.qcow2,cache=writeback,discard=on,iothread=1,size=20G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=25dc92bb-9161-4df1-bb54-a9d87674366c
sockets: 1
vmgenid: 8587649c-8b96-49bf-837e-878349ff8421


I also noticed that the vioscsi resets begin shortly after the scheduled Proxmox backups to a PBS start, although the backups run every night and the issue does not happen every night.

I had read that backups can be heavy on I/O, though, and there were suggestions to lower the I/O aggressiveness, so I went ahead and set these settings (performance and ionice):

Code:
root@pm:~# cat /etc/vzdump.conf
# vzdump default settings

performance: max-workers=1
ionice: 8
#tmpdir: DIR
#dumpdir: DIR
#storage: STORAGE_ID
#mode: snapshot|suspend|stop
#bwlimit: KBPS
#performance: [max-workers=N][,pbs-entries-max=N]
#lockwait: MINUTES
#stopwait: MINUTES
#stdexcludes: BOOLEAN
#mailto: ADDRESSLIST
#prune-backups: keep-INTERVAL=N[,...]
#script: FILENAME
#exclude-path: PATHLIST
#pigz: N
#notes-template: {{guestname}}

Still, the issue remains and we have no choice but to kill the VM, as a normal shutdown gets stuck waiting for disk access...
Then we lose CBT for the VM and the next backup needs a full read of the disks...
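
One more thing I could still try is a bandwidth limit for vzdump; if I read the config format correctly, it would be something like this in /etc/vzdump.conf (the value is just an example):

Code:
# cap the backup read rate at roughly 100 MiB/s (bwlimit is given in KiB/s)
bwlimit: 102400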

Any suggestions or ideas about what else I could try in order to find the origin of the issue are welcome.

Thanks a lot !
 

Oh, thank you. A very interesting read.

Especially this recent part about backups to PBS
https://github.com/virtio-win/kvm-guest-drivers-windows/issues/623#issuecomment-1880928878

Also reading that this could be a workaround:

I am aware that there is the possibility with QEMU to do image fleecing nowadays and also abort a backup if a copy-before-write operation hits a given timeout (cbw-timeout+on-cbw-error=break-snapshot options). Those should also work around the issue, but we haven't gotten around to implementing those yet.

Any idea if these options can be activated somehow? Otherwise, would we have no other option than to disable backups?
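
For reference, and so I'm sure we're talking about the same thing: from what I can tell, these options belong to QEMU's copy-before-write block filter and would be set when that filter node is created, roughly like this on the QEMU command line. The node names below are made up, and as far as I can see this is not something the PVE backup integration exposes, so it is only a sketch:

Code:
-blockdev driver=copy-before-write,node-name=cbw0,file=drive-scsi0,target=fleecing0,on-cbw-error=break-snapshot,cbw-timeout=45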

Our PBS server storage is based on a ZFS pool of around 10 mirror vdevs (similar to RAID 10) and I haven't seen any throughput issues with it, but it's probably still related, maybe due to latency from the mechanical drives.
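
To check the latency theory, I guess I can watch per-vdev latencies on the PBS pool during a backup window with something like this (the pool name is a placeholder):

Code:
# per-vdev I/O statistics including average wait/latency columns, every 5 seconds
zpool iostat -v -l backup-pool 5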

Kind regards.
 
It seems this happens to almost every VM we back up this way, but the others are Linux VMs and they seem a bit more robust about it.
I do see hung tasks in the VMs' dmesg output, but in the end they recover and continue operating.
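
For the Linux guests, what I see during the backup window are the usual hung-task warnings, which I collect with something like:

Code:
dmesg -T | grep -i "blocked for more than"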

What I don't understand is that we have some VMs using disk images on iSCSI which are backed up the same way, yet they don't seem to be affected. How is that possible?

How can the backup process take a snapshot of the iSCSI-backed VMs in the first place, when these VMs can't be snapshotted (there is no snapshot support for images on iSCSI)?

Hmm, if I understand what I read correctly: during the backup a copy-before-write technique is used, and this requires the backup server to ACK a written block before it can be written to the VM storage, which leads to timeouts if the PBS is slow to ACK.
But what if the PBS, for example, crashes while a backup is running? Will that render the VM unusable?

I can see this being quite a huge problem: an issue with the backup becomes a possible source of problems for the running VM.
Am I right about this? If so, isn't there another way to do the backups to PBS without taking this risk?

Thanks a lot for your comments
 
Once again, thank you.

It starts happening during the backup, so I would guess it's more likely linked to the first issue you suggested, but as some VMs sometimes seem unable to recover after the backup, it might indeed be worth trying to disable iothreads.

We use iothread for all of our VMs as it seemed to be the recommended setting.
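
If we do test without iothreads, I assume the way to do it is to rewrite each disk line with iothread=0 (and power-cycle the VM afterwards for it to take effect), e.g. for one of the disks:

Code:
qm set 404 --scsi1 Huawei_NFS:404/vm-404-disk-4.qcow2,cache=writeback,discard=on,iothread=0,size=500G,ssd=1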
 
I can see that many people are facing this issue, and some suggest using fleecing, which is supported by QEMU.

https://www.mail-archive.com/pve-devel@lists.proxmox.com/msg09294.html
https://bugzilla.proxmox.com/show_bug.cgi?id=3086

Is that something on the roadmap ?

Really, in the current state, running a backup is dangerous, as VM performance and health during backups depend mainly on the backup server's performance, and even the backup server suddenly becoming unavailable in the middle of a backup will freeze a VM... :/
 
Hello Fiona,

Thanks for the feedback and for the bugzilla post.

OK, I understand that there are things missing upstream before it can be implemented in Proxmox. Hopefully it will be added eventually :)

I have a question about on-cbw-error and cbw-timeout:
https://www.mail-archive.com/qemu-devel@nongnu.org/msg880571.html

Is that something we can enable (even by manually editing config files)?
It looks to me like it would at least preserve the VMs, at the cost of aborted backups?

Thanks again!
 
I have a question about on-cbw-error and cbw-timeout:
https://www.mail-archive.com/qemu-devel@nongnu.org/msg880571.html
That would need to be changed inside QEMU, in the implementation of the backup command which sets up the backup job. Unfortunately, the cbw settings are designed for fleecing only, i.e. for writes to the fleecing image (so they will help when there are issues with the fleecing storage), and not for writes to the backup target storage itself. Supporting that would require code changes to the backup job implementation in QEMU.
 
