Blue screen with 5.1

pve-kernel-4.10.17-5 works well.

FWIW, the problem was MOST pronounced on ZFS drives with the 4.13 kernel. As in, Windows was seldom able to boot without BSODs.

The error occurred on my XFS drive as well, but only a handful of times, and performance seemed to be the same. On ZFS drives with the new kernel, there was a severe degradation in VM I/O performance.
 
Unfortunately I am a Linux noob. As you can see here:
"GRUB_CMDLINE_LINUX_DEFAULT="quiet" put scsi_mod.use_blk_mq=n

root@pve:~# update-grub
/usr/sbin/grub-mkconfig: 9: /etc/default/grub: put: not found"
What am I missing?
 
It should be:
GRUB_CMDLINE_LINUX_DEFAULT="quiet scsi_mod.use_blk_mq=n"
But it didn't work, at least not for me; the bug still exists with the 4.13 kernel.
And don't forget to run
update-grub
before rebooting.
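For anyone following along, the full sequence looks like this (a sketch; note that scsi_mod.use_blk_mq=n only disables multi-queue for SCSI devices, and as said above it did not cure the BSODs for me):

# In /etc/default/grub, keep the parameter inside the quotes:
GRUB_CMDLINE_LINUX_DEFAULT="quiet scsi_mod.use_blk_mq=n"

# Then regenerate the grub config and reboot:
update-grub
reboot

# After the reboot, verify the parameter is active:
cat /proc/cmdline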
 
Does anyone have a good test case to reproduce the Windows blue screen?
On my machine it can take up to 24 hours for Windows to crash,
and debugging is hard under these conditions.
 
I am also experiencing the CPU-flag-related issue with kernel 4.13, not the VirtIO-related one.
  • Two VMs, both 2012R2 fully updated as of 11/9/17, no VirtIO drivers installed in either, they are more or less identical as they are AD DCs. I have ensured that VM configurations are identical, but that doesn't matter anyway because it is always the one on the Xeons that crashes. I've tried host, kvm64, and qemu64 CPU types, no difference between them related to crashes.
    balloon: 0
    boot: dcn
    bootdisk: ide0
    cores: 4
    cpu: qemu64
    ide0: HDDs:vm-113-disk-1,size=127G
    ide2: none,media=cdrom
    memory: 1024
    name: NETSERV2
    net0: e1000=00:15:5D:01:87:02,bridge=vmbr0
    numa: 0
    onboot: 1
    ostype: win8
    smbios1: uuid=73e9a13f-9e97-48d5-8ef0-443d0b16c3df
    sockets: 1
    startup: order=2
  • I have two hosts, one with 2x Opteron 6220, the other with 2x Xeon L5420. On the Xeon system, either VM will BSOD with Critical_Structure_Corruption after a few minutes up to a few hours. On the Opteron system, both VMs are stable. Both Proxmox systems were fully updated on 10/24 and are running:
    proxmox-ve: 5.1-25 (running kernel: 4.13.4-1-pve)
    pve-manager: 5.1-35 (running version: 5.1-35/722cc488)
    pve-kernel-4.13.4-1-pve: 4.13.4-25
    libpve-http-server-perl: 2.0-6
    lvm2: 2.02.168-pve6
    corosync: 2.4.2-pve3
    libqb0: 1.0.1-1
    pve-cluster: 5.0-15
    qemu-server: 5.0-17
    pve-firmware: 2.0-3
    libpve-common-perl: 5.0-20
    libpve-guest-common-perl: 2.0-13
    libpve-access-control: 5.0-7
    libpve-storage-perl: 5.0-16
    pve-libspice-server1: 0.12.8-3
    vncterm: 1.5-2
    pve-docs: 5.1-12
    pve-qemu-kvm: 2.9.1-2
    pve-container: 2.0-17
    pve-firewall: 3.0-3
    pve-ha-manager: 2.0-3
    ksm-control-daemon: 1.2-2
    glusterfs-client: 3.8.8-1
    lxc-pve: 2.1.0-2
    lxcfs: 2.0.7-pve4
    criu: 2.11.1-1~bpo90
    novnc-pve: 0.6-4
    smartmontools: 6.5+svn4324-1
    zfsutils-linux: 0.7.2-pve1~bpo90
I have just updated the Xeon system to the pve-kernel-4.10.17-5-pve_4.10.17-25_amd64.deb package (and everything zfs related to 0.7.3) and am about to reboot it to see if that resolves the issue. I have not tried updating the microcode, and would like more details about that before I try it.
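For reference, the downgrade itself is just a package install plus a reboot (a sketch, assuming the .deb has already been downloaded from the Proxmox repository):

dpkg -i pve-kernel-4.10.17-5-pve_4.10.17-25_amd64.deb
update-grub   # usually run by the package itself, but harmless to repeat
reboot
uname -r      # after the reboot, should print 4.10.17-5-pve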
 
@wolfgang: I still have the issue with the 4.13 kernel and normally Windows crashes within one hour. How can I assist you to track this down?
 
Thanks. For an apples-to-apples comparison, I have done the following:
  • on the Xeon system, installed intel-microcode, and all updates except the kernel
  • on the Opteron system, installed amd64-microcode, and all updates including the kernel
Xeon L5420 microcode was 0xa0b, and the Opteron 6220 microcode was 0x600063d; neither changed after the install and reboot. However, per the earlier discussion in this thread, I don't expect this to make a significant difference even if there had been an update. Also, the BIOS is the latest for each board, so maybe that's why the microcode was already up to date. At least now I can offer a direct comparison between 4.10.17-5 and 4.13.4-1. If I don't post again, you can assume that the VM hasn't crashed with the Critical_Structure_Corruption BSOD; otherwise I'll report it. I'll be keeping an eye on this thread either way.

Edit: I confirmed with dmesg that the microcode update driver did indeed run during boot on both systems.
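If anyone else wants to verify the same on their host, these generic commands (nothing Proxmox-specific) show what the kernel loaded:

dmesg | grep -i microcode          # shows whether an update was applied during boot
grep -m1 microcode /proc/cpuinfo   # the microcode revision the kernel currently sees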
 
@wolfgang: is there some command we could run to inventory CPU features, to help determine the least common denominator for this issue?
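Not an official tool, but a crude way to compare the two hosts with standard utilities (a sketch; the file names are just examples):

# On each host, dump the CPU flags one per line, sorted:
grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2 | tr ' ' '\n' | sort -u > /tmp/flags-$(hostname).txt

# Copy both files to one machine, then list the flags unique to each host:
comm -3 /tmp/flags-xeon.txt /tmp/flags-opteron.txt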
 
Based on some things I've tried on my W10 VM and one of the Proxmox hosts, here is a short report from my side.
  1. Upgraded to virtio-win-0.1.141 --> blue screen appears again.
  2. Upgraded the Intel microcode to 0x20 --> blue screen appears again.
  3. Downloaded, installed, and booted the 4.10.17-5-pve kernel --> blue screen does not appear anymore... fingers crossed.
The Windows upgrade from 1703 to 1709 was not possible at steps 1 and 2 because of repeated blue screens.
Only after step 3 was I able to upgrade Windows from version 1703 to 1709. And it is still stable.

@wolfgang: regarding the provided special kernel, what do you think, can we expect a solution (maybe in the near future)? Then we would be able to use the standard apt-get upgrade process again, with all the standard components from the pve-no-subscription repository.

Many thanks for your effort.

cheers
 
For me, running that 4.10 kernel is the only solution with proven stability. I've achieved a week of uptime now, instead of 2 or 3 blue screens every day.
 
I encountered this on a fresh 5.1 install with a Windows Server 2012 R2 VM, and with a Windows Server 2012 R2 VM on an upgraded system. I am currently downgrading both hosts to kernel 4.10.

New system:
  • Dual Xeon E5-2620V4s
  • RAID backed storage on an Adaptec 8805 HBA
  • SuperMicro X10-DRW-i mainboard
Old system, upgraded:
  • Dell Poweredge R520
  • RAID backed storage on a PERC H710 HBA
  • Dual Xeon E5-2430 V0s
Fortunately the SMC system isn't in production yet. It definitely BSOD'd at least once under heavy I/O load. If there's any more information I can provide, please let me know.

Edit: Looking through the dmesg output, I noticed a ton of messages regarding linux_edac scroll by that don't appear on 4.10: a bunch of PCI IDs, and then a complaint about not being able to find a Broadcom device. Unfortunately I don't have a full capture of the output.
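If it happens again, something like this should capture the messages for posting (a sketch; journalctl can only reach the previous boot if persistent journaling is enabled):

dmesg | grep -iE 'edac|broadcom' > /tmp/edac.txt   # current boot
journalctl -k -b -1 | grep -iE 'edac|broadcom'     # kernel log from the previous boot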
 