Windows 2016 crashes whole pve 5.2 system

djzort · Jun 14, 2018

I'm running a single Windows Server 2016 Essentials VM, but every few hours (2 days at most) the whole host crashes and reboots.

I have captured kernel dumps, but there doesn't seem to be a debug package to help analyze them.

pveversion --verbose

Code:

proxmox-ve: 5.2-2 (running kernel: 4.15.17-2-pve)
pve-manager: 5.2-1 (running version: 5.2-1/0fcd7879)
pve-kernel-4.15: 5.2-2
pve-kernel-4.15.17-2-pve: 4.15.17-10
pve-kernel-4.15.17-1-pve: 4.15.17-9
pve-kernel-4.15.15-1-pve: 4.15.15-6
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: not correctly installed
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-4
libpve-common-perl: 5.0-32
libpve-guest-common-perl: 2.0-16
libpve-http-server-perl: 2.0-9
libpve-storage-perl: 5.0-23
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-3
lxcfs: 3.0.0-1
novnc-pve: 0.6-4
proxmox-widget-toolkit: 1.0-18
pve-cluster: 5.0-27
pve-container: 2.0-23
pve-docs: 5.2-4
pve-firewall: 3.0-9
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-5
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.1-5
pve-xtermjs: 1.0-5
qemu-server: 5.0-26
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3

I'm just running on plain LVM thin-pool storage.

lscpu

Code:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 44
Model name:            Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz
Stepping:              2
CPU MHz:               2293.475
BogoMIPS:              5333.99
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              12288K
NUMA node0 CPU(s):     0-23
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm pti tpr_shadow vnmi flexpriority ept vpid ibpb ibrs stibp dtherm arat

cat 100.conf

Code:

agent: 1
bootdisk: virtio0
cores: 4
ide2: local:iso/virtio-win-0.1.149.iso,media=cdrom,size=310276K
memory: 65536
name: sdp01
net0: virtio=00:de:ad:be:ef:00,bridge=vmbr1
numa: 1
onboot: 1
ostype: win10
scsihw: virtio-scsi-pci
smbios1: uuid=244f5178-d35c-45ea-b336-3f209d0f2139
sockets: 2
virtio0: local-lvm:vm-100-disk-1,size=300G
virtio1: local-lvm:vm-100-disk-2,size=100G

Disabling C-states didn't seem to help at all.

Code:

cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.15.17-2-pve root=/dev/mapper/pve-pve--root ro quiet nmi_watchdog=0 crashkernel=256M

This didn't help at all either, but it is in place:
cat kvm.conf

Code:

# Win2016 bsod install workaround - see https://gist.github.com/jorritfolmer/d01194a00f440ad257bd56d51baddc2d
options kvm ignore_msrs=1
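An option in kvm.conf is only picked up when the kvm module is loaded, so it can be worth confirming the live value on the host. A minimal Python sketch, assuming the parameter is exposed at /sys/module/kvm/parameters/ignore_msrs (the helper names are made up for illustration):

```python
from pathlib import Path
from typing import Optional

# Assumption: the loaded kvm module exposes its parameters under sysfs here.
KVM_IGNORE_MSRS = Path("/sys/module/kvm/parameters/ignore_msrs")


def parse_kvm_bool(raw: str) -> bool:
    """Boolean module parameters read back as 'Y' or 'N'; '1' is accepted too."""
    return raw.strip() in ("Y", "1")


def ignore_msrs_enabled(param_file: Path = KVM_IGNORE_MSRS) -> Optional[bool]:
    """Return the live setting, or None if the file is missing (kvm not loaded)."""
    try:
        return parse_kvm_bool(param_file.read_text())
    except OSError:
        return None
```

Calling `ignore_msrs_enabled()` on the host should return True once the option is active. A False result would suggest the module was loaded before the option was written.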

Any thoughts? I'm worried the CPU might be too old.

Alwin · Jun 15, 2018

Did you re-install the Win2016 after setting the ignore_msrs?

djzort · Jun 15, 2018

Alwin said:
Did you re-install the Win2016 after setting the ignore_msrs?

The installation completed without this setting. The crash happens at irregular intervals (hours to days) whether it is on or off.

Does reinstalling with the setting enabled change the install?

I would like to look at the vmcore, but there doesn't seem to be a debug package. That would be extremely helpful. Is there any reason it isn't available?

I have changed the CPU type from kvm64 to Westmere (to match the X5650). 24 hours later it's still up, but it's still too early to tell.
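The 100.conf posted earlier has no cpu line, which is consistent with the VM running on the default kvm64 model before this change. A small Python sketch for reading the configured model out of a qemu-server style config; the function name is hypothetical, and the kvm64 default is an assumption taken from this thread:

```python
def vm_cpu_type(conf_text: str, default: str = "kvm64") -> str:
    """Return the CPU model from a VM config, or the default if none is set."""
    for line in conf_text.splitlines():
        if line.startswith("["):
            # Snapshot sections follow the live config; stop before them.
            break
        key, sep, value = line.partition(":")
        if sep and key.strip() == "cpu":
            first = value.strip().split(",")[0]
            # Accept both "cpu: host" and "cpu: cputype=host,...".
            if first.startswith("cputype="):
                return first.split("=", 1)[1]
            return first
    return default


# Excerpt of the config from the first post (no cpu line present).
ORIGINAL_CONF = "agent: 1\ncores: 4\nmemory: 65536\nnuma: 1\nsockets: 2\n"
print(vm_cpu_type(ORIGINAL_CONF))
print(vm_cpu_type(ORIGINAL_CONF + "cpu: Westmere\n"))
```

With the excerpt above, the first call falls back to the default and the second reports the model that was just configured.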

mailinglists · Jun 15, 2018

If you want to match the host exactly, there is also the host option (scroll down to the bottom).

djzort · Jun 15, 2018

mailinglists said:
If you want to match the host exactly, there is also the host option (scroll down to the bottom).

Excellent tip, much appreciated. That will be my next adjustment if the current configuration doesn't last longer than 48 hours.

djzort · Jun 21, 2018

The system lasted nearly a week before crashing again, which is obviously a huge improvement.

I've now set the CPU to 'host'.

Is it possible to get the debug package for the kernel so that the vmcore can be analyzed?

djzort · Jun 24, 2018

I suspect that the 'fsgsbase' CPU flag is really needed.

Red Hat almost says as much here: https://bugzilla.redhat.com/show_bug.cgi?id=1346153

I suspect that presenting it gets Windows going, but sooner or later it's actually called and everything crashes.

The onapp.com people explicitly say it's needed: https://docs.onapp.com/display/53AG/Create+Virtual+Server

None of my other Proxmox systems running Windows 10 and Server 2016 have had any such problems...
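Whether the host CPU has the flag at all can be read from the flags line of /proc/cpuinfo or lscpu. A rough Python sketch (the function name is hypothetical), run here against the lscpu Flags line from the first post:

```python
from typing import Set


def cpu_flags(text: str) -> Set[str]:
    """Collect CPU flags from lscpu ('Flags:') or /proc/cpuinfo ('flags :') text."""
    flags: Set[str] = set()
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "flags":
            flags.update(value.split())
    return flags


# Flags line copied from the lscpu output of the X5650 host in this thread.
X5650_LSCPU = (
    "Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat "
    "pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb "
    "rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology "
    "nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx "
    "est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm pti "
    "tpr_shadow vnmi flexpriority ept vpid ibpb ibrs stibp dtherm arat"
)

print("fsgsbase" in cpu_flags(X5650_LSCPU))
```

For the flags posted above, fsgsbase is not in the set, so the 'host' CPU type cannot pass it through on this machine. On a live system the same function can be fed the contents of /proc/cpuinfo.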

djzort · Jun 26, 2018

So, after setting up the console (it's super annoying that there is no kernel debug package, please make one), this is what I captured:

Code:

Jun 26 07:09:37 xxxx.bytefoundry.com.au [121030.583459] mce: [Hardware Error]: TSC 125df25667e3c
Jun 26 07:09:37 xxxx.bytefoundry.com.au [121030.583490] mce: [Hardware Error]: PROCESSOR 0:206c2 TIME 1529960977 SOCKET 1 APIC 30 microcode 1e
Jun 26 07:09:37 xxxx.bytefoundry.com.au [121030.583537] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
Jun 26 07:09:37 xxxx.bytefoundry.com.au [121030.583576] mce: [Hardware Error]: Machine check: Processor context corrupt
Jun 26 07:09:37 xxxx.bytefoundry.com.au [121030.583615] Kernel panic - not syncing: Fatal machine check
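The PROCESSOR line can be cross-checked against the lscpu output from the first post. The sketch below is Python with hypothetical function names. It decodes the hex value after the colon as a standard x86 CPUID signature and the TIME field as a Unix timestamp. Both interpretations are assumptions about the log format rather than something confirmed in this thread.

```python
import re
from datetime import datetime, timezone


def decode_cpuid_signature(sig: int) -> dict:
    """Split a CPUID signature (leaf 1 EAX layout) into family/model/stepping."""
    stepping = sig & 0xF
    model = (sig >> 4) & 0xF
    family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF
    if family in (6, 15):
        # Extended model bits only count for base family 6 and 15.
        model |= ext_model << 4
    if family == 15:
        family += ext_family
    return {"family": family, "model": model, "stepping": stepping}


MCE_PROCESSOR = re.compile(r"PROCESSOR (\d+):([0-9a-f]+) TIME (\d+)")


def parse_mce_processor_line(line: str) -> dict:
    """Extract vendor id, decoded CPU signature and UTC time from an mce line."""
    match = MCE_PROCESSOR.search(line)
    if match is None:
        raise ValueError("not an mce PROCESSOR line")
    vendor, sig, ts = match.groups()
    info = decode_cpuid_signature(int(sig, 16))
    info["vendor"] = int(vendor)
    info["time_utc"] = datetime.fromtimestamp(int(ts), tz=timezone.utc)
    return info


LINE = ("mce: [Hardware Error]: PROCESSOR 0:206c2 TIME 1529960977 "
        "SOCKET 1 APIC 30 microcode 1e")
print(parse_mce_processor_line(LINE))
```

Under those assumptions, 206c2 decodes to family 6, model 44, stepping 2, which matches the lscpu output for the X5650. The timestamp is 2018-06-25 21:09:37 UTC, which lines up with the 07:09:37 syslog time at UTC+10.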

Stratos Zolotas · Oct 12, 2018

Hello, I have the exact same problem on a fully updated PVE 5.2 system running a single Windows 2016 Standard guest. The whole system reboots when I try to copy a big file from one folder to another inside the VM. No logs or anything. I haven't tried to attach a console yet. It is always reproducible and happens 1-2 minutes after starting a 9GB file copy inside the VM.

djzort · Oct 12, 2018

The bad news is that my solution was to get new hardware. I can't say whether my problem was a genuine software bug or a hardware fault.

Stratos Zolotas · Oct 12, 2018

I reinstalled Windows 2016 with a SATA disk and no virtio drivers (only the serial driver for the qemu-agent), and the issue seems to be gone. I have managed to copy files of over 10GB from one folder to another (inside the VM), multiple times. I was using the latest kernel on PVE 5.2 (4.15.18-7-pve) and virtio driver v0.1.141. My PVE installation is on ZFS RAID1, and the VM was using local-zfs with raw disk format. The issue appeared 1-2 minutes after starting a big local copy inside the VM.

Stratos Zolotas · Oct 12, 2018

djzort said:
The bad news is that my solution was to get new hardware. I can't say whether my problem was a genuine software bug or a hardware fault.

I am seeing this issue on a brand new Dell T130 server with an Intel(R) Xeon(R) CPU E3-1220 v6 @ 3.00GHz. It seems more like a software bug, and virtio probably has something to do with it, but this is the first time I have seen PVE reboot. I can understand a crashing VM, but crashing the whole host seems serious.

djzort · Oct 12, 2018

virtio-blk or virtio-scsi?

It's worth noting that the latest virtio network drivers have issues with Windows Server 2016, see https://forum.proxmox.com/threads/virtio-driver-crashing-high-load.45609/#post-217865

That may well be the case with the disk drivers too.

djzort · Oct 12, 2018

Ah, I read your comment again and see that you are using 141, which AFAIK is still "Stable".

Is the fsgsbase flag present on your CPU?
Also, have you set ignore_msrs?

Stratos Zolotas · Oct 12, 2018

djzort said:
Ah, I read your comment again and see that you are using 141, which AFAIK is still "Stable".

Is the fsgsbase flag present on your CPU?
Also, have you set ignore_msrs?

Nope, I just found out about these settings... either I missed them or they are not covered in the training video, which is the only one I found regarding Proxmox and Windows 2016. I was using virtio-scsi, by the way.

The very strange thing is that the VM was working as expected, doing updates, configuring and installing software with a lot of disk I/O, and it stayed up for days without issues (although not serving anything), but it breaks almost instantly when you just try to copy a big file inside the VM. The network (again with virtio drivers) was also performing nicely, and copying big files to and from the VM worked as expected. Only the big local copy crashed the host node completely.

djzort · Oct 13, 2018

Maybe play with those settings and try the bleeding-edge virtio-scsi drivers?
