Blue screen with 5.1

Does anyone have a good test case to reproduce the Windows bluescreen?
On my machine it can take up to 24 hours for Windows to crash.
Debugging is hard under these conditions.
I don't know if I'm facing the same issue, but on 5.1 I have a "dormant" Win10 VM, with SPICE and virtio 141, created pre-5.1. If I start the VM, I get a sort of blue screen after the Windows logo splash and some time of the spinning-circle animation, before the login.
If I take a snapshot of the VM first, it boots fine instead!
Code:
root@proxmox:/# pveversion -v
proxmox-ve: 5.1-26 (running kernel: 4.13.4-1-pve)
pve-manager: 5.1-36 (running version: 5.1-36/131401db)
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.10.17-2-pve: 4.10.17-20
pve-kernel-4.10.17-3-pve: 4.10.17-23
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-15
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-20
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-16
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-2
pve-container: 2.0-17
pve-firewall: 3.0-3
pve-ha-manager: 2.0-3
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.0-2
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9

and the VM config is
Code:
 qm config 272
agent: 1
boot: cd
bootdisk: scsi0
cores: 4
description: Win10
ide2: none,media=cdrom
memory: 4096
name: Win10BASESpice
net0: virtio=B2:61:AF:1B:96:BB,bridge=vmbr0
net1: virtio=F2:B3:A9:3B:CD:56,bridge=vmbr0,link_down=1
net2: virtio=8A:0F:71:9C:67:05,bridge=vmbr0,link_down=1
net3: virtio=7E:CD:1A:F8:D8:B8,bridge=vmbr0,link_down=1
numa: 0
ostype: win8
parent: preagg
scsi0: local:272/vm-272-disk-1.qcow2,size=50G
scsihw: virtio-scsi-pci
smbios1: uuid=101e9c8b-c335-4f04-840a-62fcde6909b1
sockets: 1
vga: qxl

The CPU is
Code:
root@proxmox:/# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                6
On-line CPU(s) list:   0-5
Thread(s) per core:    2
Core(s) per socket:    3
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            21
Model:                 2
Model name:            AMD FX(tm)-6300 Six-Core Processor
Stepping:              0
CPU MHz:               3515.842
BogoMIPS:              7031.68
Virtualization:        AMD-V
L1d cache:             16K
L1i cache:             64K
L2 cache:              2048K
L3 cache:              8192K
NUMA node0 CPU(s):     0-5
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb cpb hw_pstate vmmcall bmi1 arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold
root@proxmox:/#
 
I have the same issue (blue screen - CRITICAL_STRUCTURE_CORRUPTION) on a fresh install of Proxmox 5.1.
Sometimes it takes days to crash; a few hours of uptime is also possible.
Intel(R) Xeon(R) CPU E3-1275 v3 @ 3.50GHz
proxmox-ve: 5.1-25 (running kernel: 4.13.4-1-pve)
Filesystem: ZFS on hardware RAID10 (Adaptec 6405)
virtio: virtio-win-0.1.141
Windows 2012 R2
CPU type: kvm64

Is a fix coming soon?
 
The downgrade is easy; just follow this procedure:
Code:
wget http://download.proxmox.com/debian/dists/stretch/pvetest/binary-amd64/pve-kernel-4.10.17-5-pve_4.10.17-25_amd64.deb
dpkg -i pve-kernel-4.10.17-5-pve_4.10.17-25_amd64.deb
Afterwards, reboot into the good kernel.
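If the machine keeps booting back into 4.13 by default, the older kernel can be pinned in /etc/default/grub. This is only a sketch: the menu entry title below is an example, and the exact title must be copied verbatim from your own grub.cfg.

```shell
# /etc/default/grub -- pin the 4.10 kernel by its full menu title.
# Find the exact titles with:  grep -E "menuentry|submenu" /boot/grub/grub.cfg
# The "submenu>entry" syntax selects an entry inside "Advanced options".
GRUB_DEFAULT="Advanced options for Proxmox VE GNU/Linux>Proxmox VE GNU/Linux, with Linux 4.10.17-5-pve"

# Then regenerate the boot menu and reboot:
#   update-grub && reboot
```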
 
I've made some progress, but it's more complicated than I thought.

It looks like it is KSM and swap together that trigger the BSOD.

To everyone affected: can you confirm whether KSM is merging pages (in the GUI, "KSM sharing" > 0) and swap is in use at the moment the Windows VM gets the BSOD?
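For anyone who wants to check, the two values asked about above can be read directly from the kernel. A sketch, assuming the standard sysfs KSM interface (the GUI's "KSM sharing" is `pages_shared` times the page size):

```shell
#!/bin/sh
# Print the current KSM and swap state in one line.
sample() {
    # pages currently merged by KSM; falls back to 0 if KSM is absent
    ksm=$(cat /sys/kernel/mm/ksm/pages_shared 2>/dev/null || echo 0)
    # swap in use (kB), derived from /proc/meminfo
    swap=$(awk '/^SwapTotal/ {t=$2} /^SwapFree/ {f=$2} END {print t-f}' /proc/meminfo)
    echo "ksm_pages=$ksm swap_used_kb=$swap"
}
sample
```

Logging this once a minute (cron or a loop) would show whether "KSM sharing > 0" and "swap in use" coincide with a BSOD.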
 
Last night I had a BSOD (CRITICAL_STRUCTURE_CORRUPTION) on a Windows 2012r2 VM. On this node I have 4 VMs: 1 pfSense, 1 CentOS, 1 Windows 2012r2 (the one with the BSOD) and 1 old Windows 2003. Given this OS difference, KSM is not merging pages (indeed, the GUI shows KSM sharing 0 B). This morning I had to restart (stop, then start) the Windows 2012r2 VM and it's working again without problems. I don't know whether KSM was merging pages at the moment of the BSOD; I have pvestatd sending status information to an InfluxDB, but I don't see a metric related to KSM sharing.
Swap is used; in the GUI: SWAP usage 16.92% (2.71 GiB of 16.00 GiB)
I don't know why swap is used, since I have free memory:
Code:
$ free -h
              total        used        free      shared  buff/cache   available
Mem:            39G         24G        9.7G        113M        4.9G         16G
Swap:           15G        2.7G         13G
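Swap in use despite plenty of free RAM is expected with the kernel default: `vm.swappiness=60` (the Debian default) lets the kernel proactively swap out long-idle pages. A quick check:

```shell
# Show how readily the kernel swaps (higher = more eager; Debian ships 60).
# A high value plus long-idle VM memory explains swap use alongside free RAM.
cat /proc/sys/vm/swappiness
```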
 
I've made some progress, but it's more complicated than I thought.

It looks like it is KSM and swap together that trigger the BSOD.

To everyone affected: can you confirm whether KSM is merging pages (in the GUI, "KSM sharing" > 0) and swap is in use at the moment the Windows VM gets the BSOD?

Hi, I'm running a 9-node cluster, and this issue happens primarily on two of the nodes. I had been trying to debug it myself until I found this thread; it happens more during vzdump backups in snapshot mode. It's not when the backup begins but after it has finished: more often than not, some of the machines are left with that STRUCTURE_CORRUPTION, or "The guest has not initialized the display yet", or half of a corrupted BSOD at the top of the screen when viewed from the console (it looks like a compressed BSOD filling about 20% of the noVNC screen).

One node has only two Windows Server 2016 machines (not crashed right now):
Code:
CPU usage
0.75% of 32 CPU(s)
IO delay
0.14%
Load average
0.34,0.27,0.14

RAM usage
72.38% (68.27 GiB of 94.32 GiB)
KSM sharing
12.32 GiB
HD space(root)
0.06% (1.83 GiB of 2.99 TiB)
SWAP usage
100.00% (8.00 GiB of 8.00 GiB)

CPU(s)
32 x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (2 Sockets)
Kernel Version
Linux 4.13.4-1-pve #1 SMP PVE 4.13.4-26 (Mon, 6 Nov 2017 11:23:55 +0100)
PVE Manager Version
pve-manager/5.1-36/131401db

And the other one, captured before stop-restarting one of the Windows 10 machines (they fail randomly). This node has 7 Windows 10 machines (all of them have crashed at one time or another) and some Windows 7 and Linux machines which haven't seen a crash.

Code:
CPU usage
11.24% of 16 CPU(s)
IO delay
2.16%
Load average
1.87,2.31,2.86

RAM usage
36.92% (49.39 GiB of 133.78 GiB)
KSM sharing
0 B
HD space(root)
0.06% (1.06 GiB of 1.74 TiB)
SWAP usage
30.32% (2.43 GiB of 8.00 GiB)

CPU(s)
16 x Intel(R) Xeon(R) CPU E5620 @ 2.40GHz (2 Sockets)
Kernel Version
Linux 4.13.4-1-pve #1 SMP PVE 4.13.4-26 (Mon, 6 Nov 2017 11:23:55 +0100)
PVE Manager Version

Hope it helps!
 
Hi to all,

I have had the same problem on my configuration with my Windows 10 Pro VM. The BSODs are random, sometimes frequent after a few hours, sometimes only after 2-3 days.

I have changed the swappiness so, @wolfgang, PVE is not using swap and, at least for as long as it has been up (for now, after the last reboot, an uptime of 5 days and some hours), it only uses KSM sharing with some other Linux VMs (1 Ubuntu, 1 pfSense, 1 Debian 8), and I have had 2 BSODs so far on my Windows 10 VM.
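For reference, a swappiness change like the one described above is typically done as follows. This is a sketch; 10 is an example value, the post does not say which value was used.

```shell
# Lower vm.swappiness so the kernel avoids swapping out VM memory
# (10 is an example value, not taken from the post above)
sysctl vm.swappiness=10                        # apply immediately
echo 'vm.swappiness = 10' >> /etc/sysctl.conf  # persist across reboots
swapoff -a && swapon -a                        # optional: move already-swapped pages back to RAM
```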

At my first setup of the Windows VM I used SPICE and QXL drivers, but the BSODs were more frequent (3-4 times per day), so I changed to the default display, although the problem persists.

This is the actual Summary:

CPU usage
23.97% of 4 CPU(s)

IO delay
1.12%
Load average
0.55,0.62,0.64
RAM usage
58.11% (13.69 GiB of 23.57 GiB)

KSM sharing
2.93 GiB
HD space(root)
0.62% (1.28 GiB of 206.55 GiB)

SWAP usage
0.00% (0 B of 8.00 GiB)
CPU(s)
4 x Intel(R) Xeon(R) CPU E3-1220 v6 @ 3.00GHz (1 Socket)
Kernel Version
Linux 4.13.4-1-pve #1 SMP PVE 4.13.4-26 (Mon, 6 Nov 2017 11:23:55 +0100)
PVE Manager Version
pve-manager/5.1-36/131401db

Hope this info helps.
Thank you.
 


I finally got far enough to run:

edit /etc/default/grub

but once inside, the keyboard acts very strangely and I can't do anything: the Up arrow produces the letter B and the Right arrow produces an A. How do I do step 3 of the GRUB guide? I did eventually manage to edit the file, but how do I save it? When I just exit, nothing happens.

What should I do?
 
@AndyDK: You have to use a text editor; try
nano /etc/default/grub
You can save the file when you leave nano; then run update-grub and reboot.
 
Hi,

after running some time with this kernel:

Linux pve-1 4.10.17-5-pve #1 SMP PVE 4.10.17-25 (Mon, 6 Nov 2017 12:37:43 +0100) x86_64 GNU/Linux

we have had a better experience, but there are still some problems. The BSODs are gone, but now the whole system crashes, and the only way to recover is a hard reset...

The only log I can see on the console is:

"NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [kvm]"

Is there any place where I can get some more logs?

Thx
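One way to keep kernel messages across a hard reset (a sketch, assuming systemd-journald as shipped with PVE 5.x: with its default Storage=auto, the journal only persists if /var/log/journal exists):

```shell
# Make the systemd journal persistent so kernel messages survive a reset
mkdir -p /var/log/journal
systemctl restart systemd-journald
# After the next lockup and hard reset, read the previous boot's kernel log:
#   journalctl -k -b -1
```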
 
@AndyDK: You have to use a text editor; try
nano /etc/default/grub
You can save the file when you leave nano; then run update-grub and reboot.

Thanks, now I could edit it!

But when I save it and run update-grub, it says it found the 4.13 and 4.10 versions; yet when I then run uname -a, it says 4.13 is running, even though I followed the guide point by point... So what should I do? Can I remove version 4.13 somehow?
 
@vankooch: it seems that you hit another bug, described here: https://forum.proxmox.com/threads/pve-5-1-kvm-broken-on-old-cpus.37666/
solution: upgrade to 4.13 :-(
@AndyDK: can you check the grub start menu on boot? under advanced options you should see the kernel 4.10 line. Can you manually select this line and boot?
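For the "can I remove 4.13 somehow" part, a hedged sketch (the package name is taken from the pveversion output earlier in the thread; manually selecting 4.10 in GRUB, as suggested above, is the safer route, since a future update may reinstall or depend on the newer kernel):

```shell
dpkg -l 'pve-kernel-*' | grep ^ii        # list installed kernel packages
# apt remove pve-kernel-4.13.4-1-pve     # optionally remove the 4.13 image
# update-grub                            # regenerate the boot menu afterwards
```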
I just rebooted and now I'm running 4.10, thanks!! Hope Windows Server 2016 doesn't crash! But... can I share a hard drive from another Windows Server 2016 I'm running, so I don't have to delete it and then add a new storage?
 
