We have a cluster of 5 Proxmox nodes hosting a mix of Linux and Windows 7 x64 VMs, which we use as our automated software build and test environment. When build and test jobs are started from our Jenkins server, some of the jobs on the Windows 7 VMs fail very regularly because the VMs crash at random points during the job. This is quite annoying.
The Windows VMs use virtio drivers for both disk and network. The disks are stored on the local drive with writeback caching. I've tried two versions of the virtio drivers (0.1-74 and 0.1-59) but haven't seen any difference. I've disabled memory ballooning on all VMs.
pveversion -v produces:
Code:
proxmox-ve-2.6.32: 3.2-121 (running kernel: 2.6.32-27-pve)
pve-manager: 3.2-1 (running version: 3.2-1/1933730b)
pve-kernel-2.6.32-27-pve: 2.6.32-121
pve-kernel-2.6.32-26-pve: 2.6.32-114
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.5-1
pve-cluster: 3.0-12
qemu-server: 3.1-15
pve-firmware: 1.1-2
libpve-common-perl: 3.0-14
libpve-access-control: 3.0-11
libpve-storage-perl: 3.0-19
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-6
vzctl: 4.0-1pve4
vzprocps: not correctly installed
vzquota: 3.1-2
pve-qemu-kvm: 1.7-4
ksm-control-daemon: 1.1-1
glusterfs-client: 3.4.2-1
I've used WinDbg to examine a number of the memory.dmp files produced during the VM crashes. Running '!analyze -v' on them produces similar output each time (an example is shown below): the general pattern is that a PAGE_FAULT_IN_NONPAGED_AREA bugcheck occurs while a process is being terminated. Across the dump files, this happens to a variety of executables used in our build and test jobs.
If a race condition is involved, that would explain why our build and test jobs are good candidates for triggering it: each job starts and terminates a huge number of processes.
Code:
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

PAGE_FAULT_IN_NONPAGED_AREA (50)
Invalid system memory was referenced. This cannot be protected by try-except,
it must be protected by a Probe. Typically the address is just plain bad or it
is pointing at freed memory.
Arguments:
Arg1: fffff680003f7db8, memory referenced.
Arg2: 0000000000000000, value 0 = read operation, 1 = write operation.
Arg3: fffff800026fcdbc, If non-zero, the instruction address which referenced the bad memory
 address.
Arg4: 0000000000000002, (reserved)

Debugging Details:
------------------


READ_ADDRESS: fffff680003f7db8

FAULTING_IP:
nt!MiDeletePageTableHierarchy+9c
fffff800`026fcdbc 498b06 mov rax,qword ptr [r14]

MM_INTERNAL_CODE: 2

DEFAULT_BUCKET_ID: WIN7_DRIVER_FAULT

BUGCHECK_STR: 0x50

PROCESS_NAME: grep.exe

CURRENT_IRQL: 0

ANALYSIS_VERSION: 6.3.9600.17029 (debuggers(dbg).140219-1702) amd64fre

TRAP_FRAME: fffff88005378f00 -- (.trap 0xfffff88005378f00)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=000000fdf6e00000 rbx=0000000000000000 rcx=0000000fffffffff
rdx=0000058000000000 rsi=0000000000000000 rdi=0000000000000000
rip=fffff800026fcdbc rsp=fffff88005379090 rbp=fffffa80058b1200
 r8=0000007ffffffff8 r9=0000098000000000 r10=fffffa8003601b90
r11=fffff88005379170 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0 nv up ei ng nz na po cy
nt!MiDeletePageTableHierarchy+0x9c:
fffff800`026fcdbc 498b06 mov rax,qword ptr [r14] ds:00000000`00000000=????????????????
Resetting default scope

LAST_CONTROL_TRANSFER: from fffff800027465e4 to fffff800026c9bc0

STACK_TEXT:
fffff880`05378d98 fffff800`027465e4 : 00000000`00000050 fffff680`003f7db8 00000000`00000000 fffff880`05378f00 : nt!KeBugCheckEx
fffff880`05378da0 fffff800`026c7cee : 00000000`00000000 fffff680`003f7db8 00000000`0008ed00 00000000`00000000 : nt! ?? ::FNODOBFM::`string'+0x43836
fffff880`05378f00 fffff800`026fcdbc : fffffa80`0299e6b0 00000000`00000001 fffffa80`0302aa80 fffff6fb`40001000 : nt!KiPageFault+0x16e
fffff880`05379090 fffff800`026998b6 : fffff700`01080510 fffffa80`058b1598 fffff700`01080000 fffff8a0`004028e8 : nt!MiDeletePageTableHierarchy+0x9c
fffff880`053791a0 fffff800`0269a892 : fffffa80`058b1200 fffffa80`00000000 fffff8a0`00000025 00000000`00000000 : nt!MiDeleteAddressesInWorkingSet+0x3fb
fffff880`05379a50 fffff800`0299e15a : fffff8a0`0b6cea90 00000000`00000001 00000000`00000000 fffffa80`05621a00 : nt!MmCleanProcessAddressSpace+0x96
fffff880`05379aa0 fffff800`029826b8 : 00000000`c0000005 00000000`00000001 00000000`7efdb000 00000000`00000000 : nt!PspExitThread+0x56a
fffff880`05379ba0 fffff800`026c8e53 : fffffa80`058b1200 00000000`c0000005 fffffa80`05621a00 00000000`7efdf000 : nt!NtTerminateProcess+0x138
fffff880`05379c20 00000000`76ee157a : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x13
00000000`0008f758 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x76ee157a


STACK_COMMAND: kb

FOLLOWUP_IP:
nt!MiDeletePageTableHierarchy+9c
fffff800`026fcdbc 498b06 mov rax,qword ptr [r14]

SYMBOL_STACK_INDEX: 3

SYMBOL_NAME: nt!MiDeletePageTableHierarchy+9c

FOLLOWUP_NAME: MachineOwner

MODULE_NAME: nt

DEBUG_FLR_IMAGE_TIMESTAMP: 521ea035

IMAGE_VERSION: 6.1.7601.18247

IMAGE_NAME: memory_corruption

FAILURE_BUCKET_ID: X64_0x50_nt!MiDeletePageTableHierarchy+9c

BUCKET_ID: X64_0x50_nt!MiDeletePageTableHierarchy+9c

ANALYSIS_SOURCE: KM

FAILURE_ID_HASH_STRING: km:x64_0x50_nt!mideletepagetablehierarchy+9c

FAILURE_ID_HASH: {a5101511-63a3-65ce-1b12-16e97aca479e}

Followup: MachineOwner
---------
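To check whether every dump really falls into the same bucket, the '!analyze -v' output of each dump could be saved as text and tallied. The following is only a rough sketch in Python: the function names `summarize` and `tally` are made up for illustration, and it assumes each output is plain text with one `FIELD: value` pair per line, as in the dump above.

```python
import re
from collections import Counter

# Fields of interest in saved '!analyze -v' text output (assumed layout:
# the field name at the start of a line, followed by a colon and the value).
FIELD_RE = re.compile(
    r"^(PROCESS_NAME|FAILURE_BUCKET_ID|BUGCHECK_STR):\s*(\S+)", re.MULTILINE
)

def summarize(analysis_text):
    """Return the first value seen for each field of interest."""
    fields = {}
    for name, value in FIELD_RE.findall(analysis_text):
        fields.setdefault(name, value)
    return fields

def tally(analyses):
    """Count failure buckets and process names across several outputs."""
    buckets, processes = Counter(), Counter()
    for text in analyses:
        info = summarize(text)
        buckets[info.get("FAILURE_BUCKET_ID", "unknown")] += 1
        processes[info.get("PROCESS_NAME", "unknown")] += 1
    return buckets, processes
```

If the pattern described above holds, `tally` should report a single dominant failure bucket spread over many different process names.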
I would be most grateful if anyone could shed some light on these annoying crashes, or suggest a configuration change that would help prevent them.
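In case it helps anyone trying to reproduce this outside of our Jenkins jobs, here is a minimal sketch of a process-churn script that mimics the start/terminate pattern of a build. It assumes Python is installed in the Windows guest. The name `churn` and the batch sizes are arbitrary, and whether it actually triggers the bugcheck is untested.

```python
import subprocess
import sys

def churn(total=1000, batch=20):
    """Start `total` short-lived child processes, `batch` at a time.

    Each child is a Python interpreter that exits immediately, so the
    load consists almost entirely of process creation and teardown.
    Returns the list of child exit codes.
    """
    exit_codes = []
    for start in range(0, total, batch):
        count = min(batch, total - start)
        # Launch one batch concurrently, then wait for every child to exit.
        children = [
            subprocess.Popen([sys.executable, "-c", "pass"])
            for _ in range(count)
        ]
        exit_codes.extend(child.wait() for child in children)
    return exit_codes
```

Calling `churn()` repeatedly inside one of the affected VMs should show whether plain process creation and termination is enough to provoke the crash, independent of the build tools themselves.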
Cheers,
Marcel Roelofs