Some VMs fail to start after upgrading to 5.15.74-1-pve

lrollins

New Member
Nov 20, 2022
After upgrading PVE and rebooting the host, all but one of the VMs crashed during OS start.

Attempted:
  • Reboot host
  • Repair the drive inside the VM
  • Restore backup
  • OS Recovery
  • Pin 5.15.64-1-pve kernel
All of the above failed.

Noticed the log contained "blk_set_enable_write_cache: Assertion `qemu_in_main_thread()' failed" and assumed this was expected, because caching was disabled for all drives.
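(For anyone searching later: one place it may show up is the host's journal - a quick, generic way to look, nothing Proxmox-specific assumed:)

Code:
# search the current boot's journal for the QEMU assertion
journalctl -b | grep -i blk_set_enable_write_cache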

On a whim, we decided to enable writethrough caching and then the VM was able to start. This was only needed on the boot volume. Additional drives connected to the VM remain without any cache. Obviously, this is a temporary fix just to get up and running again...
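For reference, the same change can be made from the CLI by re-specifying the drive line with qm set. A rough sketch only - the VM ID and volume name below are placeholders, and as far as I can tell the whole drive option string gets replaced, so repeat any options you want to keep:

Code:
# Placeholders: VM ID 100 and volume local-zfs:vm-100-disk-0 -- take yours from "qm config <VMID>"
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=writethrough,discard=on,ssd=1
# verify the resulting drive line
qm config 100 | grep ^scsi0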

pveversion:
Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.74-1-pve)
pve-manager: 7.2-14 (running version: 7.2-14/65898fbc)
pve-kernel-5.15: 7.2-14
pve-kernel-helper: 7.2-14
pve-kernel-5.15.74-1-pve: 5.15.74-1
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 15.2.16-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-7
libpve-guest-common-perl: 4.2-2
libpve-http-server-perl: 4.1-5
libpve-storage-perl: 7.2-12
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.7-1
proxmox-backup-file-restore: 2.2.7-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.0-1
proxmox-widget-toolkit: 3.5.2
pve-cluster: 7.2-3
pve-container: 4.3-5
pve-docs: 7.2-3
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-7
pve-firmware: 3.5-6
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.1.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-10
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+2
vncterm: 1.7-1
zfsutils-linux: 2.1.6-pve1
 
Drives are all virtio SCSI zfs volumes
All VMs have identical configuration (even the 1 that still works without enabling write cache)
 
Do you have iothread enabled for the drive?
In my case, removing iothread=1 allowed the VM to boot without having to change the caching mode.
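If you'd rather do that on the CLI than in the GUI, roughly like this (placeholder VM ID and volume - copy your current line from qm config and re-set it with everything except iothread=1):

Code:
# Placeholder VM ID/volume -- re-specify the drive without iothread=1, keeping the other options
qm set 100 --scsi0 local-zfs:vm-100-disk-0,discard=on,size=50G,ssd=1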
 
Had to do the same thing to get my Windows Domain Controllers to boot (Active Directory turns off caching on disk in the OS). Using VirtIO single with iothread checked. Had to uncheck to boot as
Code:
blk_set_enable_write_cache: Assertion `qemu_in_main_thread()' failed
was logging when trying to boot the VM.
Drives are scsi (attached to Virtio single) writing raw disk image files to btrfs.
 
Can you please post the VM config(s)? E.g., qm config VMID
This one doesn't boot / it gets the Windows "failed to start" repair dialog, etc.
Code:
root@hv01:~# qm config 7001
agent: 1
balloon: 1024
bios: ovmf
boot: order=scsi0;net0;ide0
cores: 2
cpu: host
efidisk0: ssd-tank:7001/vm-7001-disk-0.raw,size=128K
machine: pc-i440fx-6.0
memory: 4096
name: WC-DC01
net0: virtio=92:A9:D7:25:23:98,bridge=vmbr0
numa: 0
onboot: 1
ostype: win10
rng0: source=/dev/urandom
scsi0: ssd-tank:7001/vm-7001-disk-1.raw,discard=on,iothread=1,size=50G,ssd=1
scsi1: ssd-tank:7001/vm-7001-disk-2.raw,discard=on,iothread=1,size=10G,ssd=1
scsi2: ssd-tank:7001/vm-7001-disk-3.raw,discard=on,iothread=1,size=10G,ssd=1
scsi3: ssd-tank:7001/vm-7001-disk-4.raw,discard=on,iothread=1,size=10G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=a20f23ae-7fea-4d90-ba5d-7c27f43e128e
sockets: 1
startup: order=2,up=600,down=1200
vmgenid: dd275ad6-e8fd-4049-8434-f32fc4d6124f

This boots

Code:
root@hv01:~# qm config 7002
agent: 1
balloon: 1024
bios: ovmf
boot: order=scsi0;net0;ide0
cores: 2
cpu: host
efidisk0: ssd-tank:7002/vm-7002-disk-4.raw,size=128K
machine: pc-i440fx-6.0
memory: 4096
name: WC-DC02
net0: virtio=22:8C:E9:E7:50:A9,bridge=vmbr0
numa: 0
onboot: 1
ostype: win10
rng0: source=/dev/urandom
scsi0: ssd-tank:7002/vm-7002-disk-0.raw,discard=on,size=50G,ssd=1
scsi1: ssd-tank:7002/vm-7002-disk-1.raw,discard=on,size=10G,ssd=1
scsi2: ssd-tank:7002/vm-7002-disk-2.raw,discard=on,size=10G,ssd=1
scsi3: ssd-tank:7002/vm-7002-disk-3.raw,discard=on,size=10G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=286fc56c-21ab-4c43-b15a-b40fd27ba43e
sockets: 1
startup: order=3,up=600,down=1200
vmgenid: e1d09655-6414-4aad-a250-6741d568ddd5
 
mhmm.. could not reproduce here... is there anything else in the logs besides the qemu error?
could be some race condition when setting the cache of disks with iothread
 
Had to do the same thing to get my Windows Domain Controllers to boot (Active Directory turns off caching on disk in the OS). Using VirtIO single with iothread checked. Had to uncheck to boot as
Code:
blk_set_enable_write_cache: Assertion `qemu_in_main_thread()' failed
was logging when trying to boot the VM.
Drives are scsi (attached to Virtio single) writing raw disk image files to btrfs.
I just set the cache back to none and tried this (unchecking iothread) and the VMs booted up as expected
 
mhmm.. could not reproduce here... is there anything else in the logs besides the qemu error?
could be some race condition when setting the cache of disks with iothread
The QEMU assertion was the only error I could find in the logs (one for each failing VM).
 
Ack, thanks for the confirmation.
That said, it also makes this a bit stranger, as we have many such VMs here in our testing and production setups (with both one and multiple disks) that have been running fine with the latest QEMU for over a month. If it were over iSCSI I'd be less surprised, as that definitely gets less test exposure than ZFS or BTRFS.

Maybe this depends on some other additional factor in the setup. Just to get the basics: what's your CPU model and platform/mainboard model/vendor?

Also, could you test whether the start also fails with a new test/dummy VM with IO-Threads on and default caching? Maybe test with one disk first, then add a second - something like the sketch below. It would be good to know how we can trigger this more reliably.
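A minimal sketch of what I mean (VM ID 999 and the local-zfs storage are just placeholders, adjust to your setup):

Code:
# create a throwaway VM with one disk, IO thread on, default (no) caching
qm create 999 --name iothread-test --memory 2048 --cores 2 \
  --scsihw virtio-scsi-single \
  --scsi0 local-zfs:10,iothread=1
qm start 999
# if that starts fine, add a second disk and start it again
qm stop 999
qm set 999 --scsi1 local-zfs:10,iothread=1
qm start 999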
 
We have a hunch and are currently building a new pve-qemu-kvm package; apparently this triggers only if the guest OS itself sends paging commands over the SCSI protocol. I'll update this thread shortly, once the package is on pvetest.
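For context, from inside a Linux guest you can send that kind of mode-page change yourself, e.g. with sdparm toggling the write-cache bit of the SCSI caching page. Just a sketch - the device path is guest-specific and I haven't verified that this alone reproduces the assertion:

Code:
# run inside the guest, not on the host; /dev/sda is a placeholder for the virtio-scsi disk
sdparm --clear WCE /dev/sda   # disable the write cache via MODE SELECT (caching mode page)
sdparm --get WCE /dev/sda     # read the bit back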
 
OK, pve-qemu-kvm version 7.1.0-4 is now out on the pvetest repository (you can add that via the GUI: Node -> Repositories). Then refresh for updates (GUI, or on the CLI with apt update) and upgrade (GUI or apt full-upgrade; or, to really just pull in the new QEMU, apt install pve-qemu-kvm works too).
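In short, the CLI route (the GUI works just as well):

Code:
# after enabling the pvetest repository
apt update
apt full-upgrade
# or, to pull in only the new QEMU:
apt install pve-qemu-kvm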

Testing feedback would be hugely appreciated.
 
That worked, this VM config boots now:

Code:
root@hv01:~# qm config 7001
agent: 1
balloon: 1024
bios: ovmf
boot: order=scsi0;net0;ide0
cores: 2
cpu: host
efidisk0: ssd-tank:7001/vm-7001-disk-0.raw,size=128K
machine: pc-i440fx-6.0
memory: 4096
name: WC-DC01
net0: virtio=92:A9:D7:25:23:98,bridge=vmbr0
numa: 0
onboot: 1
ostype: win10
rng0: source=/dev/urandom
scsi0: ssd-tank:7001/vm-7001-disk-1.raw,discard=on,iothread=1,size=50G,ssd=1
scsi1: ssd-tank:7001/vm-7001-disk-2.raw,discard=on,iothread=1,size=10G,ssd=1
scsi2: ssd-tank:7001/vm-7001-disk-3.raw,discard=on,iothread=1,size=10G,ssd=1
scsi3: ssd-tank:7001/vm-7001-disk-4.raw,discard=on,iothread=1,size=10G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=a20f23ae-7fea-4d90-ba5d-7c27f43e128e
sockets: 1
startup: order=2,up=600,down=1200
vmgenid: dd275ad6-e8fd-4049-8434-f32fc4d6124f

To try to recreate this on a Windows VM, head into Device Manager and uncheck this box on the disk:

(screenshot of the disk's write-cache setting in Device Manager)

It will/should recreate the same behavior a Domain Controller applies to the disk, as shown below:
(screenshot of the resulting setting in Device Manager)
 
