Windows 2016 crashes whole pve 5.2 system

Discussion in 'Proxmox VE: Installation and configuration' started by djzort, Jun 14, 2018.

  1. djzort

    djzort New Member

    Joined:
    Aug 8, 2013
    Messages:
    26
    Likes Received:
    1
    I'm running a single Windows Server 2016 Essentials VM, but every few hours (2 days at most) the whole system crashes and reboots.

    I have captured kernel dumps, but there doesn't seem to be a debug package to assist in analyzing them.

    pveversion --verbose
    Code:
    proxmox-ve: 5.2-2 (running kernel: 4.15.17-2-pve)
    pve-manager: 5.2-1 (running version: 5.2-1/0fcd7879)
    pve-kernel-4.15: 5.2-2
    pve-kernel-4.15.17-2-pve: 4.15.17-10
    pve-kernel-4.15.17-1-pve: 4.15.17-9
    pve-kernel-4.15.15-1-pve: 4.15.15-6
    corosync: 2.4.2-pve5
    criu: 2.11.1-1~bpo90
    glusterfs-client: 3.8.8-1
    ksm-control-daemon: not correctly installed
    libjs-extjs: 6.0.1-2
    libpve-access-control: 5.0-8
    libpve-apiclient-perl: 2.0-4
    libpve-common-perl: 5.0-32
    libpve-guest-common-perl: 2.0-16
    libpve-http-server-perl: 2.0-9
    libpve-storage-perl: 5.0-23
    libqb0: 1.0.1-1
    lvm2: 2.02.168-pve6
    lxc-pve: 3.0.0-3
    lxcfs: 3.0.0-1
    novnc-pve: 0.6-4
    proxmox-widget-toolkit: 1.0-18
    pve-cluster: 5.0-27
    pve-container: 2.0-23
    pve-docs: 5.2-4
    pve-firewall: 3.0-9
    pve-firmware: 2.0-4
    pve-ha-manager: 2.0-5
    pve-i18n: 1.0-5
    pve-libspice-server1: 0.12.8-3
    pve-qemu-kvm: 2.11.1-5
    pve-xtermjs: 1.0-5
    qemu-server: 5.0-26
    smartmontools: 6.5+svn4324-1
    spiceterm: 3.0-5
    vncterm: 1.5-3
    
    I'm just running on plain LVM thin-pool storage.

    lscpu
    Code:
    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                24
    On-line CPU(s) list:   0-23
    Thread(s) per core:    2
    Core(s) per socket:    6
    Socket(s):             2
    NUMA node(s):          1
    Vendor ID:             GenuineIntel
    CPU family:            6
    Model:                 44
    Model name:            Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz
    Stepping:              2
    CPU MHz:               2293.475
    BogoMIPS:              5333.99
    Virtualization:        VT-x
    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              256K
    L3 cache:              12288K
    NUMA node0 CPU(s):     0-23
    Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm pti tpr_shadow vnmi flexpriority ept vpid ibpb ibrs stibp dtherm arat
    
    cat 100.conf
    Code:
    agent: 1
    bootdisk: virtio0
    cores: 4
    ide2: local:iso/virtio-win-0.1.149.iso,media=cdrom,size=310276K
    memory: 65536
    name: sdp01
    net0: virtio=00:de:ad:be:ef:00,bridge=vmbr1
    numa: 1
    onboot: 1
    ostype: win10
    scsihw: virtio-scsi-pci
    smbios1: uuid=244f5178-d35c-45ea-b336-3f209d0f2139
    sockets: 2
    virtio0: local-lvm:vm-100-disk-1,size=300G
    virtio1: local-lvm:vm-100-disk-2,size=100G
    
    Disabling C-states didn't seem to help at all.

    Code:
    cat /proc/cmdline
    BOOT_IMAGE=/vmlinuz-4.15.17-2-pve root=/dev/mapper/pve-pve--root ro quiet nmi_watchdog=0 crashkernel=256M
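
A minimal sketch for checking what the posted command line does and does not set. It splits the line into parameters and looks for the two kernel parameters commonly used to cap C-states (`intel_idle.max_cstate`, `processor.max_cstate`). The helper name `parse_cmdline` is made up for this example. None of those parameters appear here, so the C-states were presumably disabled somewhere else (for instance in firmware); that is an inference, not something the post states.

```python
def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {parameter: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


# Command line copied from the post above.
posted = ("BOOT_IMAGE=/vmlinuz-4.15.17-2-pve root=/dev/mapper/pve-pve--root "
          "ro quiet nmi_watchdog=0 crashkernel=256M")
params = parse_cmdline(posted)

# Kernel parameters that limit C-states when given on the command line.
cstate_keys = ("intel_idle.max_cstate", "processor.max_cstate")
cstate_limits = {key: params[key] for key in cstate_keys if key in params}

print("crashkernel:", params.get("crashkernel"))
print("C-state limits on cmdline:", cstate_limits or "none")
```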
    
    This didn't help at all, but it is in place:
    cat kvm.conf
    Code:
    # Win2016 bsod install workaround - see https://gist.github.com/jorritfolmer/d01194a00f440ad257bd56d51baddc2d
    options kvm ignore_msrs=1
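
An option file like this (presumably placed under /etc/modprobe.d/) only takes effect once the kvm module is reloaded or the host is rebooted, so it can be worth confirming the live value. A small sketch, assuming the standard sysfs layout for module parameters; the function name `ignore_msrs_enabled` is made up for this example.

```python
from pathlib import Path

# Standard sysfs location for the parameters of a loaded module.
IGNORE_MSRS = Path("/sys/module/kvm/parameters/ignore_msrs")


def ignore_msrs_enabled(path: Path = IGNORE_MSRS):
    """Return True/False for the live setting, or None if it cannot be read."""
    try:
        value = path.read_text().strip()
    except OSError:
        # kvm is not loaded, or this is not a Linux host.
        return None
    # Boolean module parameters read back as 'Y' or 'N'.
    return value in ("Y", "1")


if __name__ == "__main__":
    print("kvm ignore_msrs active:", ignore_msrs_enabled())
```

A result of `None` means the file could not be read at all, which is different from the option being off.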
    

    Any thoughts? I'm worried the CPU might be too old.
     
  2. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,647
    Likes Received:
    141
    Did you re-install the Win2016 after setting the ignore_msrs?
     
  3. djzort

    djzort New Member

    Joined:
    Aug 8, 2013
    Messages:
    26
    Likes Received:
    1
    Installation completed without this setting. The crash happens quite infrequently (hours to days apart) with it on or off.

    Does reinstalling with the setting on affect the install?

    I would like to look at the vmcore, but there doesn't seem to be a debug package. That would be extremely helpful; is there any reason it isn't available?

    I have changed the CPU type from kvm64 to Westmere (to match the X5650). 24 hours later it's still up, but it's too early to tell.
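
For reference, the 100.conf posted earlier has no `cpu:` line, which is why the VM was running with the default model. A rough sketch that reads such a config and reports the effective values; `parse_vm_config` is a made-up helper, and the kvm64 fallback is an assumption based on the default mentioned in this post.

```python
def parse_vm_config(text: str) -> dict:
    """Parse the 'key: value' lines of a VM config into a dict."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so values may contain colons.
        key, _, value = line.partition(":")
        config[key.strip()] = value.strip()
    return config


# Subset of the 100.conf posted above; it has no 'cpu:' line.
posted = """\
cores: 4
memory: 65536
numa: 1
sockets: 2
"""
config = parse_vm_config(posted)

# Assumption: without a 'cpu:' line the VM gets the default model (kvm64 here).
cpu_type = config.get("cpu", "kvm64")
vcpus = int(config["sockets"]) * int(config["cores"])
memory_gib = int(config["memory"]) // 1024

print(f"cpu type: {cpu_type}, vCPUs: {vcpus}, memory: {memory_gib} GiB")
```

After the change described above, the same file would carry a line such as `cpu: Westmere`, and the sketch would report that value instead of the fallback.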
     
  4. mailinglists

    mailinglists Member

    Joined:
    Mar 14, 2012
    Messages:
    206
    Likes Received:
    18
    If you want to match the host exactly, there is also the host option (scroll down to the bottom).
     
  5. djzort

    djzort New Member

    Joined:
    Aug 8, 2013
    Messages:
    26
    Likes Received:
    1
    Excellent tip, much appreciated. That will be my next adjustment if the current configuration doesn't last longer than 48 hours.
     
  6. djzort

    djzort New Member

    Joined:
    Aug 8, 2013
    Messages:
    26
    Likes Received:
    1
    The system lasted nearly a week before crashing again, which is obviously a huge improvement.

    I've now set the CPU to 'host'.

    Is it possible to get the debug package for the kernel so that the vmcore can be analyzed?
     
  7. djzort

    djzort New Member

    Joined:
    Aug 8, 2013
    Messages:
    26
    Likes Received:
    1
  8. djzort

    djzort New Member

    Joined:
    Aug 8, 2013
    Messages:
    26
    Likes Received:
    1
    So, after setting up the console (it's super annoying that there is no kernel debug package, please provide one), this is what was logged:

    Code:
    Jun 26 07:09:37 xxxx.bytefoundry.com.au [121030.583459] mce: [Hardware Error]: TSC 125df25667e3c
    Jun 26 07:09:37 xxxx.bytefoundry.com.au [121030.583490] mce: [Hardware Error]: PROCESSOR 0:206c2 TIME 1529960977 SOCKET 1 APIC 30 microcode 1e
    Jun 26 07:09:37 xxxx.bytefoundry.com.au [121030.583537] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
    Jun 26 07:09:37 xxxx.bytefoundry.com.au [121030.583576] mce: [Hardware Error]: Machine check: Processor context corrupt
    Jun 26 07:09:37 xxxx.bytefoundry.com.au [121030.583615] Kernel panic - not syncing: Fatal machine check
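
The `PROCESSOR` and `TIME` fields in that log can be decoded by hand. A small sketch, assuming the usual x86 CPUID signature layout (the number after the colon in `0:206c2`) and that `TIME` is a Unix timestamp; the function name `decode_cpuid_signature` is made up for this example. The excerpt contains no bank or status line, so it does not show which unit reported the error.

```python
from datetime import datetime, timezone


def decode_cpuid_signature(signature: int) -> dict:
    """Split an x86 CPUID signature into display family, model and stepping."""
    stepping = signature & 0xF
    model = (signature >> 4) & 0xF
    family = (signature >> 8) & 0xF
    ext_model = (signature >> 16) & 0xF
    ext_family = (signature >> 20) & 0xFF
    # The extended model bits only count for family 6 and family 15.
    if family in (0x6, 0xF):
        model |= ext_model << 4
    if family == 0xF:
        family += ext_family
    return {"family": family, "model": model, "stepping": stepping}


# Values copied from the 'PROCESSOR 0:206c2 TIME 1529960977' log line.
cpu = decode_cpuid_signature(0x206C2)
when = datetime.fromtimestamp(1529960977, tz=timezone.utc)

print(cpu)               # {'family': 6, 'model': 44, 'stepping': 2}
print(when.isoformat())  # 2018-06-25T21:09:37+00:00
```

Family 6, model 44, stepping 2 matches the lscpu output in the first post, and the timestamp is 07:09:37 on Jun 26 at UTC+10, which lines up with the syslog prefix. So the machine check was raised by one of the X5650s themselves, on socket 1.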
    
     
  9. Stratos Zolotas

    Stratos Zolotas New Member

    Joined:
    Oct 12, 2018
    Messages:
    4
    Likes Received:
    0
    Hello, I have exactly the same problem on a fully updated PVE 5.2 system running a single Windows 2016 Standard guest. The whole system reboots when I try to copy a big file from one folder to another inside the VM. No logs or anything. I haven't tried to attach a console yet. It is always reproducible and happens 1-2 minutes after starting a 9GB file copy inside the VM.
     
  10. djzort

    djzort New Member

    Joined:
    Aug 8, 2013
    Messages:
    26
    Likes Received:
    1
    The bad news is that my solution was to get new hardware. I can't say whether my problem was a genuine software bug or a hardware fault.
     
  11. Stratos Zolotas

    Stratos Zolotas New Member

    Joined:
    Oct 12, 2018
    Messages:
    4
    Likes Received:
    0
    I reinstalled Windows 2016 with a SATA disk and no virtio drivers (only the serial driver for the qemu-agent) and the issue seems to be gone. I have managed to copy files of over 10GB from one folder to another (inside the VM) multiple times. I was using the latest kernel on PVE 5.2 (4.15.18-7-pve) and virtio driver v0.1.141. My PVE installation is on ZFS RAID1 and the VM was using local-zfs with raw disk format; the issue appeared 1-2 minutes after starting a large local copy inside the VM.
     
  12. Stratos Zolotas

    Stratos Zolotas New Member

    Joined:
    Oct 12, 2018
    Messages:
    4
    Likes Received:
    0
    I am seeing this issue on a brand new Dell T130 server with an Intel(R) Xeon(R) CPU E3-1220 v6 @ 3.00GHz, so it seems more like a software bug, and virtio probably has something to do with it. It is the first time I have seen PVE reboot. I can understand a crashing VM, but crashing the whole host seems serious.
     
  13. djzort

    djzort New Member

    Joined:
    Aug 8, 2013
    Messages:
    26
    Likes Received:
    1
  14. djzort

    djzort New Member

    Joined:
    Aug 8, 2013
    Messages:
    26
    Likes Received:
    1
    Ah, I read your comment again and see that you are using 141, which AFAIK is still "Stable".

    Is the fsgsbase flag present on your CPU?
    Also, have you set ignore_msrs?
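
One way to answer the flag question is to look at the `flags` line in /proc/cpuinfo. A minimal sketch; `cpu_flags` and `host_has_flag` are made-up helper names. As sample input it uses the flags line from the lscpu output in the first post (the X5650 host), because no flags were posted for the E3-1220 v6 system.

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Collect the flags from the first 'flags' line of cpuinfo/lscpu-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "flags":
            return set(value.split())
    return set()


def host_has_flag(flag: str, path: str = "/proc/cpuinfo") -> bool:
    """Check the running host; call this on the Proxmox node itself."""
    with open(path) as handle:
        return flag in cpu_flags(handle.read())


# Flags line copied from the lscpu output in the first post.
x5650 = ("Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov "
         "pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx "
         "pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl "
         "xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor "
         "ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 "
         "popcnt aes lahf_lm pti tpr_shadow vnmi flexpriority ept vpid ibpb "
         "ibrs stibp dtherm arat")
flags = cpu_flags(x5650)

print("fsgsbase on the X5650 host:", "fsgsbase" in flags)
print("vmx on the X5650 host:", "vmx" in flags)
```

On the X5650 host the flag is absent. Running `host_has_flag("fsgsbase")` on the other node would answer the question for that CPU.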
     
  15. Stratos Zolotas

    Stratos Zolotas New Member

    Joined:
    Oct 12, 2018
    Messages:
    4
    Likes Received:
    0
    Nope, I just found out about these settings. Either I missed them or they are not covered in the training video, which is the only one I found regarding Proxmox and Windows 2016. I was using virtio-scsi, by the way.

    The very strange thing is that the VM was working as expected: it ran updates, configuration and software installs with heavy disk I/O and stayed up for days without issues (although not serving anything), yet it breaks almost instantly when you just copy a big file inside the VM. Networking (again with virtio drivers) also performed well, and copying big files to and from the VM worked as expected. Only the large local copy crashed the host node completely.
     
    #15 Stratos Zolotas, Oct 12, 2018
    Last edited: Oct 12, 2018
  16. djzort

    djzort New Member

    Joined:
    Aug 8, 2013
    Messages:
    26
    Likes Received:
    1
    Maybe play with those settings and try the bleeding-edge virtio-scsi drivers?
     