Windows 7 x64 VMs crashing randomly during process termination

MarcelRoelofs
Feb 17, 2014
We have a cluster of five Proxmox nodes hosting a mix of Linux and Windows 7 x64 VMs, which we use as our automated software build and test environment. When build and test jobs are started from our Jenkins server, the jobs on the Windows 7 VMs fail very regularly because the VMs crash at random points during the job. This is quite annoying.

The Windows VMs use virtio drivers for both disk and network; the disks are stored on the local drive with writeback caching. I've tried two versions of the virtio drivers (0.1-74 and 0.1-59) but haven't seen any difference, and I've disabled memory ballooning on all VMs.
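
For reference, the relevant part of the config of one of the affected VMs looks roughly like this (the VM ID, memory size, and MAC address are illustrative; the virtio, writeback, and ballooning settings match what I described above):

Code:
# /etc/pve/qemu-server/101.conf (illustrative excerpt)
balloon: 0
cores: 2
memory: 4096
net0: virtio=DE:AD:BE:EF:00:01,bridge=vmbr0
ostype: win7
virtio0: local:101/vm-101-disk-1.qcow2,cache=writeback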

pveversion -v produces:
Code:
proxmox-ve-2.6.32: 3.2-121 (running kernel: 2.6.32-27-pve)
pve-manager: 3.2-1 (running version: 3.2-1/1933730b)
pve-kernel-2.6.32-27-pve: 2.6.32-121
pve-kernel-2.6.32-26-pve: 2.6.32-114
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.5-1
pve-cluster: 3.0-12
qemu-server: 3.1-15
pve-firmware: 1.1-2
libpve-common-perl: 3.0-14
libpve-access-control: 3.0-11
libpve-storage-perl: 3.0-19
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-6
vzctl: 4.0-1pve4
vzprocps: not correctly installed
vzquota: 3.1-2
pve-qemu-kvm: 1.7-4
ksm-control-daemon: 1.1-1
glusterfs-client: 3.4.2-1

I've used WinDbg to examine a number of the MEMORY.DMP files produced during the VM crashes; running '!analyze -v' on them all produces output similar to the listing below. The general pattern seems to be that a PAGE_FAULT_IN_NONPAGED_AREA bugcheck occurs while a process is being terminated. In the various dump files I've seen this happen to various executables that are used as part of our build and test jobs.
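
For anyone who wants to repeat the analysis: with the Debugging Tools for Windows installed, an invocation along these lines reproduces the output (the dump path assumes the default Windows dump location, and the symbol path points at the public Microsoft symbol server):

Code:
kd -z C:\Windows\MEMORY.DMP -y "srv*C:\symbols*http://msdl.microsoft.com/download/symbols" -c "!analyze -v; q"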

If there is some kind of race condition going on, that would explain why our build and test jobs are good candidates for triggering it: during each job a huge number of processes are started and terminated.

Code:
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

PAGE_FAULT_IN_NONPAGED_AREA (50)
Invalid system memory was referenced.  This cannot be protected by try-except,
it must be protected by a Probe.  Typically the address is just plain bad or it
is pointing at freed memory.
Arguments:
Arg1: fffff680003f7db8, memory referenced.
Arg2: 0000000000000000, value 0 = read operation, 1 = write operation.
Arg3: fffff800026fcdbc, If non-zero, the instruction address which referenced the bad memory
    address.
Arg4: 0000000000000002, (reserved)

Debugging Details:
------------------

READ_ADDRESS:  fffff680003f7db8

FAULTING_IP:
nt!MiDeletePageTableHierarchy+9c
fffff800`026fcdbc 498b06          mov     rax,qword ptr [r14]

MM_INTERNAL_CODE:  2

DEFAULT_BUCKET_ID:  WIN7_DRIVER_FAULT

BUGCHECK_STR:  0x50

PROCESS_NAME:  grep.exe

CURRENT_IRQL:  0

ANALYSIS_VERSION: 6.3.9600.17029 (debuggers(dbg).140219-1702) amd64fre

TRAP_FRAME:  fffff88005378f00 -- (.trap 0xfffff88005378f00)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=000000fdf6e00000 rbx=0000000000000000 rcx=0000000fffffffff
rdx=0000058000000000 rsi=0000000000000000 rdi=0000000000000000
rip=fffff800026fcdbc rsp=fffff88005379090 rbp=fffffa80058b1200
 r8=0000007ffffffff8  r9=0000098000000000 r10=fffffa8003601b90
r11=fffff88005379170 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei ng nz na po cy
nt!MiDeletePageTableHierarchy+0x9c:
fffff800`026fcdbc 498b06          mov     rax,qword ptr [r14] ds:00000000`00000000=????????????????
Resetting default scope

LAST_CONTROL_TRANSFER:  from fffff800027465e4 to fffff800026c9bc0

STACK_TEXT:
fffff880`05378d98 fffff800`027465e4 : 00000000`00000050 fffff680`003f7db8 00000000`00000000 fffff880`05378f00 : nt!KeBugCheckEx
fffff880`05378da0 fffff800`026c7cee : 00000000`00000000 fffff680`003f7db8 00000000`0008ed00 00000000`00000000 : nt! ?? ::FNODOBFM::`string'+0x43836
fffff880`05378f00 fffff800`026fcdbc : fffffa80`0299e6b0 00000000`00000001 fffffa80`0302aa80 fffff6fb`40001000 : nt!KiPageFault+0x16e
fffff880`05379090 fffff800`026998b6 : fffff700`01080510 fffffa80`058b1598 fffff700`01080000 fffff8a0`004028e8 : nt!MiDeletePageTableHierarchy+0x9c
fffff880`053791a0 fffff800`0269a892 : fffffa80`058b1200 fffffa80`00000000 fffff8a0`00000025 00000000`00000000 : nt!MiDeleteAddressesInWorkingSet+0x3fb
fffff880`05379a50 fffff800`0299e15a : fffff8a0`0b6cea90 00000000`00000001 00000000`00000000 fffffa80`05621a00 : nt!MmCleanProcessAddressSpace+0x96
fffff880`05379aa0 fffff800`029826b8 : 00000000`c0000005 00000000`00000001 00000000`7efdb000 00000000`00000000 : nt!PspExitThread+0x56a
fffff880`05379ba0 fffff800`026c8e53 : fffffa80`058b1200 00000000`c0000005 fffffa80`05621a00 00000000`7efdf000 : nt!NtTerminateProcess+0x138
fffff880`05379c20 00000000`76ee157a : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x13
00000000`0008f758 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x76ee157a

STACK_COMMAND:  kb

FOLLOWUP_IP:
nt!MiDeletePageTableHierarchy+9c
fffff800`026fcdbc 498b06          mov     rax,qword ptr [r14]

SYMBOL_STACK_INDEX:  3

SYMBOL_NAME:  nt!MiDeletePageTableHierarchy+9c

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: nt

DEBUG_FLR_IMAGE_TIMESTAMP:  521ea035

IMAGE_VERSION:  6.1.7601.18247

IMAGE_NAME:  memory_corruption

FAILURE_BUCKET_ID:  X64_0x50_nt!MiDeletePageTableHierarchy+9c

BUCKET_ID:  X64_0x50_nt!MiDeletePageTableHierarchy+9c

ANALYSIS_SOURCE:  KM

FAILURE_ID_HASH_STRING:  km:x64_0x50_nt!mideletepagetablehierarchy+9c

FAILURE_ID_HASH:  {a5101511-63a3-65ce-1b12-16e97aca479e}

Followup: MachineOwner
---------

I would be most grateful if anyone could shed some light on these annoying crashes, or suggest a configuration change that could help prevent them.

Cheers,
Marcel Roelofs
 
I can give one additional piece of information.

Before moving to Proxmox we had a single Ubuntu+libvirt host running a number of Win7 x64 VMs, running jobs similar to the ones that now crash the Proxmox-hosted VMs. On those VMs we never experienced this type of crash. The latest version of the virtio drivers we use is the same version that runs on the Ubuntu+libvirt host without any problems.

Cheers,
Marcel Roelofs
 
I suggest you also test with the latest qemu 1.7.1, available in our latest packages (pvetest or no-subscription repository).
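
A minimal sketch of pulling that in via the no-subscription repository (the repository line assumes the wheezy-based PVE 3.x; run as root and review the package list before confirming the upgrade):

Code:
# add the pve-no-subscription repository
echo "deb http://download.proxmox.com/debian wheezy pve-no-subscription" \
    > /etc/apt/sources.list.d/pve-no-subscription.list
apt-get update
apt-get dist-upgrade    # should pull in the newer pve-qemu-kvm package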
 
After upgrading qemu to version 1.7.1, I have already seen the first Win7 VM crash again, with a kernel trace similar to the one described above.

Any other suggestions?

Cheers,
Marcel
 
Can you run the VMs on different hardware, to eliminate physical hardware issues (e.g. faulty memory, BIOS)?
 
Not quite sure that will help: all Win7 VMs on all five of our host machines exhibit this behavior on a regular basis, while all of the CentOS and Ubuntu VMs on the same host machines are rock solid.

There does seem to be some kind of pattern to the crashes that occur during our build jobs, though: the Win7 VMs performing builds that produce the largest object files and binaries have the highest probability of crashing. All VMs are configured with 2 cores and, to keep their CPU usage close to 100%, use "make -j 3" during the compilation steps of our build jobs. When I change this to "make -j 6" there is no observable decrease in total compilation time, but it (fingers crossed) seems to lead to fewer VM crashes. Assuming that more parallel jobs spread the disk I/O more evenly over time, this could point to the virtio block driver as a possible culprit, but then again that's only a wild guess from a non-expert.

Cheers,
Marcel
 
I checked the BIOS of one of the nodes, which has an American Megatrends BIOS. Under Advanced > CPU Performance there were a number of values mentioning Energy Efficient, which I changed to Performance. I could not find any other settings that I could relate to the features you're mentioning.

Will also check the other nodes and let you know if this has an effect.

Cheers,
Marcel
 
Yes, I think this should work!
 
Based on the observation that none of the Windows-based build and test jobs executed on our Proxmox cluster during the last day has caused a BSOD, I'm inclined to say that changing the BIOS settings as described above resolved the issue. Thank you very much for pointing this out, spirit!! Apparently the changes in CPU frequency that occur when the load on a single host decreases can cause Windows VMs to BSOD. How, if at all, do other virtualization solutions deal with this?

One remarkable thing to note as well is that with the changed BIOS settings, the CPU load of individual idle Windows VMs with only a Jenkins slave running in the background dropped considerably (from ~16% to <10%). For idle Windows VMs without a Jenkins slave the load is actually much lower (<4%). The overall CPU load of our Proxmox hosts with only idling Jenkins slaves dropped from ~5% to ~2.5%.

The only thing that worries me is that I can't find anything documented about the importance of proper BIOS settings for the stability of Windows VMs under load. I cannot imagine that spirit and I are the only people to have suffered from this, and frankly, it could easily have been a deal breaker for the continued use of Proxmox. It's only because I have a strong belief that any issue can be resolved or worked around (not quite sure which of the two changing the BIOS settings counts as) that we kept going, but I've also heard my colleagues grumbling about the instability of the build process since we moved our Jenkins infrastructure to the Proxmox cluster. At the very least I think this should go into the Windows best-practices area of the wiki, to save other people the experience I had to go through.

Cheers,
Marcel
 
The main problem is that CPU frequency changes and core shutdown can cause clock synchronization problems, and Windows really doesn't like that.

We have made a patch for the next kernel version that sets the governor to maximum performance by default.

But it is always best practice, on any hypervisor, to disable these CPU features.

I'll write a wiki article.
 
Thank you, spirit... how can we set the governor at runtime? echo performance > /sys/devices/......
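
I guess, assuming the standard cpufreq sysfs layout, something like this (as root; not persistent across reboots)?

Code:
# set the performance governor on every core (effective until the next reboot)
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    echo performance > "$cpu/cpufreq/scaling_governor"
done
# verify the active governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor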
 
I used this fix...

In my /etc/default/grub:

GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_idle.max_cstate=0 processor.max_cstate=0"

After running update-grub, every kernel is "patched" :)
 

Note that I'm not sure this is enough for the max-performance governor (it disables C-states).
But I'll also add this to the wiki.
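
A quick way to check both after a reboot (standard sysfs/proc locations; the intel_idle parameter exists only on Intel hosts):

Code:
cat /proc/cmdline                                          # should show the max_cstate options
cat /sys/module/intel_idle/parameters/max_cstate           # 0 when intel_idle C-states are disabled
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor  # governor is separate from the C-state settings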
 
