Blue screen with 5.1

pve-kernel-4.10.17-5 works well.

FWIW, the problem was MOST pronounced on ZFS drives with the 4.13 kernel. As in, Windows was seldom able to boot without BSODs.

The error occurred on my XFS drive as well, but only a handful of times, and performance seemed to be the same. On ZFS drives with the new kernel, there was a severe degradation in VM I/O performance.
 
Unfortunately I am a Linux noob. As you can see here:
"GRUB_CMDLINE_LINUX_DEFAULT="quiet" put scsi_mod.use_blk_mq=n

root@pve:~# update-grub
/usr/sbin/grub-mkconfig: 9: /etc/default/grub: put: not found"
What am I missing?
 
It should be:
GRUB_CMDLINE_LINUX_DEFAULT="quiet scsi_mod.use_blk_mq=n"
But it didn't work, at least not for me; the bug still exists with the 4.13 kernel.
And don't forget to run
update-grub
before rebooting.
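For anyone following along, the full sequence looks like this (a sketch; note that scsi_mod.use_blk_mq=n only disables multi-queue for SCSI devices, and as said above it did not cure the BSODs for me):

# In /etc/default/grub, keep the parameter inside the quotes:
GRUB_CMDLINE_LINUX_DEFAULT="quiet scsi_mod.use_blk_mq=n"

# Then regenerate the grub config and reboot:
update-grub
reboot

# After the reboot, verify the parameter is active:
cat /proc/cmdline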
 
Does anyone have a good test case to reproduce the Windows blue screen?
On my machine it can take up to 24 hours for Windows to crash,
and debugging is hard under these conditions.
 
I am also experiencing the CPU-flag-related issue with kernel 4.13, not the VirtIO-related one.
  • Two VMs, both 2012R2 fully updated as of 11/9/17, no VirtIO drivers installed in either, they are more or less identical as they are AD DCs. I have ensured that VM configurations are identical, but that doesn't matter anyway because it is always the one on the Xeons that crashes. I've tried host, kvm64, and qemu64 CPU types, no difference between them related to crashes.
    balloon: 0
    boot: dcn
    bootdisk: ide0
    cores: 4
    cpu: qemu64
    ide0: HDDs:vm-113-disk-1,size=127G
    ide2: none,media=cdrom
    memory: 1024
    name: NETSERV2
    net0: e1000=00:15:5D:01:87:02,bridge=vmbr0
    numa: 0
    onboot: 1
    ostype: win8
    smbios1: uuid=73e9a13f-9e97-48d5-8ef0-443d0b16c3df
    sockets: 1
    startup: order=2
  • I have two hosts, one with 2x Opteron 6220, the other with 2x Xeon L5420. On the Xeon system, either VM will BSOD with Critical_Structure_Corruption after a few minutes up to a few hours. On the Opteron system, both VMs are stable. Both Proxmox systems were fully updated on 10/24 and are running:
    proxmox-ve: 5.1-25 (running kernel: 4.13.4-1-pve)
    pve-manager: 5.1-35 (running version: 5.1-35/722cc488)
    pve-kernel-4.13.4-1-pve: 4.13.4-25
    libpve-http-server-perl: 2.0-6
    lvm2: 2.02.168-pve6
    corosync: 2.4.2-pve3
    libqb0: 1.0.1-1
    pve-cluster: 5.0-15
    qemu-server: 5.0-17
    pve-firmware: 2.0-3
    libpve-common-perl: 5.0-20
    libpve-guest-common-perl: 2.0-13
    libpve-access-control: 5.0-7
    libpve-storage-perl: 5.0-16
    pve-libspice-server1: 0.12.8-3
    vncterm: 1.5-2
    pve-docs: 5.1-12
    pve-qemu-kvm: 2.9.1-2
    pve-container: 2.0-17
    pve-firewall: 3.0-3
    pve-ha-manager: 2.0-3
    ksm-control-daemon: 1.2-2
    glusterfs-client: 3.8.8-1
    lxc-pve: 2.1.0-2
    lxcfs: 2.0.7-pve4
    criu: 2.11.1-1~bpo90
    novnc-pve: 0.6-4
    smartmontools: 6.5+svn4324-1
    zfsutils-linux: 0.7.2-pve1~bpo90
I have just updated the Xeon system to the pve-kernel-4.10.17-5-pve_4.10.17-25_amd64.deb package (and everything zfs related to 0.7.3) and am about to reboot it to see if that resolves the issue. I have not tried updating the microcode, and would like more details about that before I try it.
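For reference, the downgrade itself is just a package install plus a reboot (a sketch, assuming the .deb has already been downloaded from the Proxmox repository):

dpkg -i pve-kernel-4.10.17-5-pve_4.10.17-25_amd64.deb
update-grub   # usually run by the package itself, but harmless to repeat
reboot
uname -r      # after the reboot, should print 4.10.17-5-pve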
 
@wolfgang: I still have the issue with the 4.13 kernel and normally Windows crashes within one hour. How can I assist you to track this down?
 
Thanks. For an apples-to-apples comparison, I have done the following:
  • on the Xeon system, installed intel-microcode, and all updates except the kernel
  • on the Opteron system, installed amd64-microcode, and all updates including the kernel
Xeon L5420 microcode was 0xa0b, and the Opteron 6220 microcode was 0x600063d; neither changed after the install and reboot. However, per the earlier discussion in this thread, I don't expect this to make a significant difference even if there had been an update. Also, the BIOS is the latest for each board, so maybe that's why the microcode was already up to date. At least now I can offer a direct comparison between 4.10.17-5 and 4.13.4-1. If I don't post again, you can assume that the VM hasn't crashed with the Critical_Structure_Corruption BSOD; otherwise I'll report it. I'll be keeping an eye on this thread either way.

Edit: I confirmed with dmesg that the microcode update driver did indeed run during boot on both systems.
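If anyone else wants to verify the same on their host, these generic commands (nothing Proxmox-specific) show what the kernel loaded:

dmesg | grep -i microcode          # shows whether an update was applied during boot
grep -m1 microcode /proc/cpuinfo   # the microcode revision the kernel currently sees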
 
@wolfgang: is there some command we could run to inventory CPU features, to help determine the least common denominator for this issue?
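Not an official tool, but a crude way to compare the two hosts with standard utilities (a sketch; the file names are just examples):

# On each host, dump the CPU flags one per line, sorted:
grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2 | tr ' ' '\n' | sort -u > /tmp/flags-$(hostname).txt

# Copy both files to one machine, then list the flags unique to each host:
comm -3 /tmp/flags-xeon.txt /tmp/flags-opteron.txt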
 
Based on some things I've tried on my W10 VM and one of the Proxmox hosts, here is a short report from my side.
  1. Upgraded to virtio-win-0.1.141 --> blue screen appears again.
  2. Upgraded the Intel microcode to 0x20 --> blue screen appears again.
  3. Downloaded, installed, and booted the 4.10.17-5-pve kernel --> blue screen does not appear anymore... fingers crossed.
The Windows upgrade from 1703 to 1709 was not possible at steps 1 and 2 because of repeated blue screens.
Only after step 3 was I able to upgrade Windows from version 1703 to 1709. And it is still stable.

@wolfgang: regarding the provided special kernel, what do you think, can we expect a solution (maybe in the near future)? Then we would be able to use the standard apt-get upgrade process again, with all the standard components from the pve-no-subscription repository.

Many thanks for your effort.

cheers
 
For me, running that 4.10 kernel is the only solution with proven stability. I've achieved a week of uptime now, instead of 2 or 3 blue screens every day.
 
I encountered this on a fresh 5.1 install with a Windows Server 2012 R2 VM, and with a Windows Server 2012 R2 VM on an upgraded system. I am currently downgrading both hosts to kernel 4.10.

New system:
  • Dual Xeon E5-2620V4s
  • RAID backed storage on an Adaptec 8805 HBA
  • SuperMicro X10-DRW-i mainboard
Old system, upgraded:
  • Dell Poweredge R520
  • RAID backed storage on a PERC H710 HBA
  • Dual Xeon E5-2430 V0s
Fortunately the SMC system isn't in production yet. It definitely BSOD'd at least once under heavy I/O load. If there's any more information I can provide, please let me know.

Edit: Looking through the dmesg output, I noticed a ton of messages regarding linux_edac scroll by that don't appear on 4.10: a bunch of PCI IDs, and then a complaint about not being able to find a Broadcom device. Unfortunately I don't have a full capture of the output.
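If it happens again, something like this should capture the messages for posting (a sketch; journalctl can only reach the previous boot if persistent journaling is enabled):

dmesg | grep -iE 'edac|broadcom' > /tmp/edac.txt   # current boot
journalctl -k -b -1 | grep -iE 'edac|broadcom'     # kernel log from the previous boot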
 