Windows 2016 crashes whole pve 5.2 system

djzort

I'm running a single Windows Server 2016 Essentials VM, but every few hours (two days at most) the whole host crashes and reboots.

I have captured kernel dumps, but there doesn't seem to be a debug package to assist in analyzing them.

pveversion --verbose
Code:
proxmox-ve: 5.2-2 (running kernel: 4.15.17-2-pve)
pve-manager: 5.2-1 (running version: 5.2-1/0fcd7879)
pve-kernel-4.15: 5.2-2
pve-kernel-4.15.17-2-pve: 4.15.17-10
pve-kernel-4.15.17-1-pve: 4.15.17-9
pve-kernel-4.15.15-1-pve: 4.15.15-6
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: not correctly installed
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-4
libpve-common-perl: 5.0-32
libpve-guest-common-perl: 2.0-16
libpve-http-server-perl: 2.0-9
libpve-storage-perl: 5.0-23
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-3
lxcfs: 3.0.0-1
novnc-pve: 0.6-4
proxmox-widget-toolkit: 1.0-18
pve-cluster: 5.0-27
pve-container: 2.0-23
pve-docs: 5.2-4
pve-firewall: 3.0-9
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-5
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.1-5
pve-xtermjs: 1.0-5
qemu-server: 5.0-26
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3

I'm just running on plain LVM thin-pool storage.

lscpu
Code:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 44
Model name:            Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz
Stepping:              2
CPU MHz:               2293.475
BogoMIPS:              5333.99
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              12288K
NUMA node0 CPU(s):     0-23
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm pti tpr_shadow vnmi flexpriority ept vpid ibpb ibrs stibp dtherm arat

cat 100.conf
Code:
agent: 1
bootdisk: virtio0
cores: 4
ide2: local:iso/virtio-win-0.1.149.iso,media=cdrom,size=310276K
memory: 65536
name: sdp01
net0: virtio=00:de:ad:be:ef:00,bridge=vmbr1
numa: 1
onboot: 1
ostype: win10
scsihw: virtio-scsi-pci
smbios1: uuid=244f5178-d35c-45ea-b336-3f209d0f2139
sockets: 2
virtio0: local-lvm:vm-100-disk-1,size=300G
virtio1: local-lvm:vm-100-disk-2,size=100G

Disabling C-states didn't seem to help at all.

Code:
cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.15.17-2-pve root=/dev/mapper/pve-pve--root ro quiet nmi_watchdog=0 crashkernel=256M
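
To double-check which options the running kernel actually booted with, the command line can be split into a dictionary. A minimal Python sketch, using the line from this post as sample input; `parse_cmdline` is a name made up for illustration. `intel_idle.max_cstate` is only one common kernel-side way of limiting C-states, and it is an assumption that it is relevant here (the C-states may just as well have been disabled in the BIOS):

```python
import shlex

def parse_cmdline(text):
    """Split a kernel command line into {name: value}; bare flags map to True."""
    params = {}
    for token in shlex.split(text):
        name, sep, value = token.partition("=")
        params[name] = value if sep else True
    return params

# Sample input copied from the post; on a live host read /proc/cmdline instead.
sample = ("BOOT_IMAGE=/vmlinuz-4.15.17-2-pve root=/dev/mapper/pve-pve--root "
          "ro quiet nmi_watchdog=0 crashkernel=256M")
params = parse_cmdline(sample)

print("crashkernel:", params.get("crashkernel"))
print("nmi_watchdog:", params.get("nmi_watchdog"))
# Absent here, so C-states were not limited through this particular option.
print("intel_idle.max_cstate:", params.get("intel_idle.max_cstate"))
```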

This didn't help either, but it is in place:
cat kvm.conf
Code:
# Win2016 bsod install workaround - see https://gist.github.com/jorritfolmer/d01194a00f440ad257bd56d51baddc2d
options kvm ignore_msrs=1
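
A modprobe option file only takes effect once the kvm module has been reloaded (or the host rebooted), so it can be worth confirming the live value. A minimal Python sketch, assuming the usual sysfs location for module parameters (`/sys/module/kvm/parameters/ignore_msrs`); `read_bool_param` is a hypothetical helper name, not part of any Proxmox tooling:

```python
from pathlib import Path

def read_bool_param(path):
    """Return True/False for a boolean module parameter, or None if unreadable."""
    try:
        raw = Path(path).read_text().strip()
    except OSError:
        return None  # module not loaded, or not running on a Linux host
    # Boolean module parameters are normally reported as Y/N; accept 1/0 too.
    if raw in ("Y", "y", "1"):
        return True
    if raw in ("N", "n", "0"):
        return False
    return None

state = read_bool_param("/sys/module/kvm/parameters/ignore_msrs")
print("kvm ignore_msrs:", {True: "on", False: "off", None: "unknown"}[state])
```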


Any thoughts? I'm worried the CPU might be too old.
 
Did you re-install the Win2016 after setting the ignore_msrs?
 
The installation completed without this setting. The crash seems to happen quite infrequently (every few hours to days) with it on or off.

Does reinstalling with the setting enabled change the resulting install?

I would like to look at the vmcore, but there doesn't seem to be a debug package. It would be extremely helpful; is there any reason it isn't available?

I have changed the CPU type from kvm64 to Westmere (to match the X5650). 24 hours later it's still up, but it is still too early to tell.
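
For reference, the CPU model lives in the `cpu:` line of the VM config, and a config without that line (like the 100.conf above) falls back to the default, which was kvm64 here. A small Python sketch that reads those values out of a config; the sample text is abridged from this thread, and `parse_vm_conf` is an illustrative name, not a Proxmox API:

```python
def parse_vm_conf(text):
    """Parse 'key: value' lines of a qemu-server style config into a dict."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so values like storage:volume survive.
        key, sep, value = line.partition(":")
        if sep:
            conf[key.strip()] = value.strip()
    return conf

sample = """\
cores: 4
memory: 65536
numa: 1
sockets: 2
virtio0: local-lvm:vm-100-disk-1,size=300G
"""
conf = parse_vm_conf(sample)

# No 'cpu' key means the default model applies (kvm64 in this thread).
cpu_model = conf.get("cpu", "kvm64")
vcpus = int(conf.get("sockets", 1)) * int(conf.get("cores", 1))
print(f"cpu model: {cpu_model}, vCPUs: {vcpus}, memory: {conf['memory']}")
```

With the posted config this reports kvm64 and 8 vCPUs (2 sockets x 4 cores).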
 
The system lasted nearly a week before crashing again, which is obviously a huge improvement.

I've now set the CPU to 'host'.

Is it possible to get the debug package for the kernel so that the vmcore can be analyzed?
 
So, after setting up the console (it's super annoying that there is no kernel debug package - please make one), this is what it showed:

Code:
Jun 26 07:09:37 xxxx.bytefoundry.com.au [121030.583459] mce: [Hardware Error]: TSC 125df25667e3c
Jun 26 07:09:37 xxxx.bytefoundry.com.au [121030.583490] mce: [Hardware Error]: PROCESSOR 0:206c2 TIME 1529960977 SOCKET 1 APIC 30 microcode 1e
Jun 26 07:09:37 xxxx.bytefoundry.com.au [121030.583537] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
Jun 26 07:09:37 xxxx.bytefoundry.com.au [121030.583576] mce: [Hardware Error]: Machine check: Processor context corrupt
Jun 26 07:09:37 xxxx.bytefoundry.com.au [121030.583615] Kernel panic - not syncing: Fatal machine check
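
The PROCESSOR field in that log has the form vendor:CPUID-signature, and the signature can be decoded to confirm which CPU reported the error. A rough Python sketch, assuming the standard Intel CPUID family/model/stepping bit layout; the regular expression is written only against the lines shown here, and the function names are made up for this example:

```python
import re
from datetime import datetime, timezone

def decode_signature(sig):
    """Decode a CPUID signature into (family, model, stepping)."""
    stepping = sig & 0xF
    model = (sig >> 4) & 0xF
    family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF
    if family in (6, 15):
        model |= ext_model << 4   # extended model only applies to these families
    if family == 15:
        family += ext_family
    return family, model, stepping

LINE = re.compile(
    r"PROCESSOR (?P<vendor>\d+):(?P<sig>[0-9a-f]+) TIME (?P<time>\d+) "
    r"SOCKET (?P<socket>\d+) APIC (?P<apic>[0-9a-f]+) microcode (?P<ucode>[0-9a-f]+)"
)

def parse_mce_processor(line):
    """Extract the fields of an 'mce: ... PROCESSOR' log line, or None."""
    m = LINE.search(line)
    if m is None:
        return None
    family, model, stepping = decode_signature(int(m["sig"], 16))
    return {
        "family": family, "model": model, "stepping": stepping,
        "socket": int(m["socket"]),
        "when_utc": datetime.fromtimestamp(int(m["time"]), tz=timezone.utc),
    }

sample = ("mce: [Hardware Error]: PROCESSOR 0:206c2 TIME 1529960977 "
          "SOCKET 1 APIC 30 microcode 1e")
print(parse_mce_processor(sample))
```

For the line above this gives family 6, model 44, stepping 2 on socket 1, which matches the lscpu output for the X5650 at the top of the thread.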
 
Hello, I have the exact same problem on a fully updated PVE 5.2 system running a single Windows 2016 Standard guest. The whole system reboots when I try to copy a big file from one folder to another inside the VM. No logs or anything. I haven't tried to attach a console yet. It is always reproducible and happens 1-2 minutes after starting a 9GB file copy inside the VM.
 
The bad news is that my solution was to get new hardware. I can't say whether my problem was a genuine software bug or a hardware fault.
 
I reinstalled Windows 2016 with a SATA disk and no virtio drivers (only the serial driver for the qemu-agent) and the issue seems to have gone. I have managed to copy over 10GB of files from one folder to another (inside the VM), multiple times. I was using the latest kernel on PVE 5.2 (4.15.18-7-pve) and virtio driver v0.1.141. My PVE installation is on ZFS RAID1 and the VM was using local-zfs with raw disk format; the issue appeared within 1-2 minutes of starting a big local copy inside the VM.
 
I experience this issue on a brand new Dell T130 server with an Intel(R) Xeon(R) CPU E3-1220 v6 @ 3.00GHz... it seems more like a software bug, and virtio probably has something to do with it, but it is the first time I've seen PVE reboot... I can understand a crashing VM, but crashing the whole host seems serious.
 
Ah, I read your comment again and see that you are using 141, which afaik is still "Stable".

Is the fsgsbase flag present on your CPU?
Also, have you set ignore_msrs?
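
The flag question can be answered quickly from a flags line. As a sketch in Python, checking for a flag as a whole word; the sample is the Flags line from the lscpu output earlier in this thread (abridged), and `has_flag` is just an illustrative helper:

```python
def has_flag(flags_text, flag):
    """Return True if `flag` appears as a whole word in a CPU flags line."""
    # Drop an optional 'Flags:' / 'flags :' label before splitting on whitespace.
    _, sep, rest = flags_text.partition(":")
    words = (rest if sep else flags_text).split()
    return flag in words

# Abridged from the X5650 lscpu output posted above.
sample = ("Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov "
          "pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx "
          "pdpe1gb rdtscp lm constant_tsc vmx ept vpid ibpb ibrs stibp dtherm arat")

for flag in ("fsgsbase", "vmx", "ept"):
    print(flag, "present" if has_flag(sample, flag) else "absent")
```

On a live host the same function can be fed the `flags` line of /proc/cpuinfo. The X5650 flags listed earlier do not include fsgsbase.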
 
Nope, I only just found out about these settings... either I missed them or they are not covered in the training video, which is the only one I found regarding Proxmox and Windows 2016. I was using virtio-scsi, by the way.

The very strange thing is that the VM was working as expected: it did updates, configuration and software installs with a lot of disk I/O, and stayed up for days without issues (although not serving anything), yet it breaks almost instantly when you just try to copy a big file inside the VM. The network (again with virtio drivers) also performed nicely, and copying big files to and from the VM worked as expected; only the big local copy crashed the host node completely...
 