Blue screen with 5.1

cybermcm

Hi,

I'm running a small lab at home (old PC hardware, no special server hardware). Until yesterday I used version 5.0; today I did an in-place upgrade to version 5.1. Since version 5.1 my Server 2016 VMs are getting blue screens (Server 2016 Core and 2016 with GUI). With 5.0 everything ran stable.
The blue screen is caused by ntoskrnl.exe, bugcheck code 0x00000109.
The host itself seems to run stable.
The guest systems have the virtio-win-0.1.141 drivers installed.

Any idea where to look for a solution?
 

Can you try booting the previous kernel from the GRUB boot menu? It should be a 4.10 kernel, while PVE 5.1 uses version 4.13.
If that solves it, there may be a regression in a kernel module, possibly KVM.

Was your 5.0 installation also kept updated, or was it on an older state, e.g. the one from the ISO installer? I'm trying to rule things out here.
 
I applied updates regularly and was on kernel 4.10 before the upgrade; it was stable with 4.10.
Is there perhaps a log file which would help track down the issue?
 
Hmm, look into the journal (journalctl) or dmesg and see whether there is anything resembling a kernel error or stack trace from around the time the VM bluescreens.

Otherwise, it could also be a bad coincidence and a memory or storage (hardware) problem...
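
For example, something along these lines should narrow it down (the time window below is only a placeholder):

# kernel messages from the current boot, limited to a window around the crash
journalctl -k -b --since "2017-11-01 01:00" --until "2017-11-01 03:00"
# or the timestamped tail of the kernel ring buffer
dmesg -T | tail -n 100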
 
I didn't find anything in the journalctl log. My dmesg is attached; I'm not really sure whether a problem is visible there, maybe you can take a look at it. If this doesn't help, I'll revert to the old kernel to see if the system is stable again.
 

Attachments

  • dmesg.txt (56.7 KB)
I did memory testing today and it seems fine. The SMART values of the hard disks are OK ->
I'm now trying the old kernel (Linux host04 4.10.17-3-pve #1 SMP PVE 4.10.17-23 (Tue, 19 Sep 2017 09:43:50 +0200) x86_64 GNU/Linux) again via the advanced GRUB startup menu. Let's see if it is stable again.
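
(In case someone wants to repeat those checks, commands along these lines do the job; the device name is just an example:)

smartctl -H /dev/sda   # overall SMART health verdict
smartctl -A /dev/sda   # individual SMART attributes (reallocated sectors etc.)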
 
Update: stable again with the old kernel (no blue screen during the night). Is there anything I can do on my side to track down the issue?
Another question: how can I modify GRUB to start 4.10? Currently 4.13 starts by default, which isn't useful at the moment.
I tried
GRUB_DEFAULT=saved
GRUB_SAVEDEFAULT=true
and update-grub, but this doesn't work.
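
One thing I still want to try is pointing GRUB_DEFAULT at the exact menu entry instead of "saved"; the entry titles below are only examples and have to match what is actually listed in grub.cfg:

# list the exact (sub)menu entry titles
grep -E "menuentry '|submenu '" /boot/grub/grub.cfg | cut -d"'" -f2

# /etc/default/grub - use "submenu title>entry title", then run update-grub and reboot
GRUB_DEFAULT="Advanced options for Proxmox Virtual Environment GNU/Linux>Proxmox Virtual Environment GNU/Linux, with Linux 4.10.17-3-pve"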
 
Hi, fresh clean install of 5.1.
New Dell T330 hardware.
A Windows 2012 R2 VM restored from a 5.0 backup.

Continuous BSOD: CRITICAL_STRUCTURE_CORRUPTION.

I tested other VMs and they work well with the 5.0 version.

proxmox-ve: 5.1-25 (running kernel: 4.13.4-1-pve)
pve-manager: 5.1-35 (running version: 5.1-35/722cc488)
pve-kernel-4.13.4-1-pve: 4.13.4-25
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-15
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-20
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-16
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-2
pve-container: 2.0-17
pve-firewall: 3.0-3
pve-ha-manager: 2.0-3
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.0-2
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.2-pve1~bpo90

Any clue?

Regards.
 
Same situation here on a lab installation: a mix of Debian and CentOS VMs and one Windows 10 VM. While the Linux VMs run stable, the Windows 10 VM gets a regular CRITICAL_STRUCTURE_CORRUPTION blue screen error (0x00000109) after a few hours in operation. I was running PVE 4 before and upgraded to PVE 5; the blue screen started shortly after starting the VM.
The VMs run on an SSD, PVE 5.1 itself on a regular HDD.
The Windows VM was running with an older virtio-win driver; upgrading to the virtio-win-0.1.141 drivers did not help.

Important detail:
the unstable situation happens while running the VM on a single SSD. After I moved the VM to a regular HDD within the PVE 5.1 node, everything was stable as expected.
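
(The disk move itself can be done online with qm move_disk; the VM ID, disk and storage names below are just examples:)

# move the VM's disk to another storage without shutting the guest down
qm move_disk 101 scsi0 local-hdd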
 
My servers are using kvm64 as well.
Host CPU:
root@host04:~# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 23
Model name: Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz
Stepping: 7
CPU MHz: 2504.885
CPU max MHz: 2499.0000
CPU min MHz: 2003.0000
BogoMIPS: 5009.77
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 3072K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm tpr_shadow vnmi flexpriority dtherm
 
Interesting, with a Core 2.

I found this note for CentOS:

"
Limited CPU support for Windows 10 and Windows Server 2016 guests

On a Red Hat Enterprise 6 host, Windows 10 and Windows Server 2016 guests can only be created when using the following CPU models:

* the Intel Xeon E series
* the Intel Xeon E7 family
* Intel Xeon v2, v3, and v4
* Opteron G2, G3, G4, G5, and G6

For these CPU models, also make sure to set the CPU model of the guest to match the CPU model detected by running the "virsh capabilities" command on the host. Using the application default or hypervisor default prevents the guests from booting properly.

To be able to use Windows 10 guests on Legacy Intel Core 2 processors (also known as Penryn) or Intel Xeon 55xx and 75xx processor families (also known as Nehalem), add the following flag to the Domain XML file, with either Penryn or Nehalem as MODELNAME:

<cpu mode='custom' match='exact'>
  <model>MODELNAME</model>
  <feature name='erms' policy='require'/>
</cpu>

Other CPU models are not supported, and both Windows 10 guests and Windows Server 2016 guests created on them are likely to become unresponsive during the boot process.
"

I need to dig a little bit more
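
In Proxmox the rough equivalent of that libvirt snippet would be setting the guest CPU model, e.g. (VM ID 101 is just an example, and I haven't verified yet whether the erms part is needed or even possible on this host):

# set the guest CPU model to Penryn via the Proxmox CLI
qm set 101 --cpu Penryn
# to also require the erms flag as in the note, raw QEMU args could be added
# to /etc/pve/qemu-server/101.conf:
#   args: -cpu Penryn,+erms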
 
I tried to run my VMs with the CPU in host mode, but same result -> blue screen.
Back with kernel 4.10, no problems...
 
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 158
Model name: Intel(R) Xeon(R) CPU E3-1225 v6 @ 3.30GHz
Stepping: 9
CPU MHz: 3300.000
CPU max MHz: 3700.0000
CPU min MHz: 800.0000
BogoMIPS: 6624.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp


Yesterday I did some tests; with "CPU HOST" there were no problems.

Last night I restored the VM again with "CPU KVM64" and we had a BSOD at first boot.

I'm going to do the same but with "CPU HOST" and I'll post the results.

Thanks to all.
 
BSOD again, CRITICAL_STRUCTURE_CORRUPTION, this time using "CPU HOST".

CPU usage on the VM is at 100%, and besides I got:

TASK ERROR: VM quit/powerdown failed - got timeout when I tried to shut down the machine.

With "STOP" I got: trying to acquire lock...TASK ERROR: can't lock file '/var/lock/qemu-server/lock-101.conf' - got timeout

Then I restarted the pve-cluster service and finally it stopped.
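
(For the record, another way to get rid of such a hung VM should be killing its KVM process directly; VM ID 101 is taken from the error above, not verified here:)

# find the QEMU/KVM process belonging to VM 101
ps aux | grep "[k]vm -id 101"
# last resort - terminate it; equivalent to pulling the plug on the guest
kill -9 <PID>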

Perhaps I could try the old kernel, but where can I download it from, coming from a clean fresh 5.1 PVE installation?

Thanks.
 

The pve-no-subscription repository also contains all the old packages:
http://download.proxmox.com/debian/dists/stretch/pve-no-subscription/binary-amd64/
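
For example, to go back to the kernel mentioned earlier in this thread (the exact package version has to be one that is actually listed in that repository):

apt-get update
apt-get install pve-kernel-4.10.17-3-pve
# then select it in the GRUB boot menu (or pin it via GRUB_DEFAULT) and reboot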
 
