Proxmox VE crashes on three identical machines

aszeszo

Guest
Hi All

We are using Proxmox VE on three identical four-socket AMD boxes and are very happy with it; it is a very nice, useful project. We use OpenVZ exclusively. The majority of our containers transcode with ffmpeg and VLC and write data to an NFS share.

Unfortunately, the boxes have crashed at least once a month since we put them into production in July 2012. We have kept the machines up to date with software updates, but none of the newer kernels resolved the problem. In the past, the majority of the panic stack traces mentioned NFS. We thought that the NFS client support inside the containers might be buggy, so we switched to bind mounts from the global zone. Unfortunately, the machine on which we made that change crashed again last night.

Below is some info about the environment. Please let me know what other information I can provide to help diagnose the problem.

Cheers,

Andrzej

A few sample screenshots with kernel stack traces:
http://linux01.everycity.co.uk/~aszeszo/skitched-20130109-175746.png
http://linux01.everycity.co.uk/~aszeszo/skitched-20130109-223555.png
http://linux01.everycity.co.uk/~aszeszo/skitched-20130123-180431.png
http://linux01.everycity.co.uk/~aszeszo/skitched-20130130-160401.png

Power management settings:
http://linux01.everycity.co.uk/~aszeszo/skitched-20130208-120846.png

dmesg:
http://paste.ec/?f401e3dc44f12ece#VkiJMlbKeGWxXCoHaMEfH8+hlHu7QNNscswi7HpKbQA=

Code:
# uname -a
Linux localhost 2.6.32-17-pve #1 SMP Wed Nov 28 07:15:55 CET 2012 x86_64 GNU/Linux
# gzip -dc /usr/share/doc/pve-kernel-2.6.32-17-pve/changelog.Debian.gz  | head -5
pve-kernel-2.6.32 (2.6.32-83) unstable; urgency=low

  * update to vzkernel-2.6.32-042stab065.3.src.rpm

 -- Proxmox Support Team <support@proxmox.com>  Wed, 28 Nov 2012 06:55:15 +0100

# cat /proc/meminfo 
MemTotal:       65954784 kB
MemFree:        52777540 kB
Buffers:          973340 kB
Cached:         10163764 kB
SwapCached:       275288 kB
Active:          3712844 kB
Inactive:        8406784 kB
Active(anon):     691080 kB
Inactive(anon):   329540 kB
Active(file):    3021764 kB
Inactive(file):  8077244 kB
Unevictable:       61540 kB
Mlocked:           61540 kB
SwapTotal:      53477368 kB
SwapFree:       53202080 kB
Dirty:            120452 kB
Writeback:             0 kB
AnonPages:        824104 kB
Mapped:           105288 kB
Shmem:             31212 kB
Slab:             599764 kB
SReclaimable:     499148 kB
SUnreclaim:       100616 kB
KernelStack:       14208 kB
PageTables:        17296 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    86454760 kB
Committed_AS:    2312112 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      347372 kB
VmallocChunk:   34299128716 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        6628 kB
DirectMap2M:     3129344 kB
DirectMap1G:    63963136 kB

# cat /proc/cpuinfo 
processor    : 0
vendor_id    : AuthenticAMD
cpu family    : 16
model        : 9
model name    : AMD Opteron(tm) Processor 6174
stepping    : 1
cpu MHz        : 2200.039
cache size    : 512 KB
physical id    : 0
siblings    : 12
core id        : 0
cpu cores    : 12
apicid        : 0
initial apicid    : 0
fpu        : yes
fpu_exception    : yes
cpuid level    : 5
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid amd_dcm pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt nodeid_msr npt lbrv svm_lock nrip_save pausefilter
bogomips    : 4400.07
TLB size    : 1024 4K pages
clflush size    : 64
cache_alignment    : 64
address sizes    : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate


[snip]


processor    : 47
vendor_id    : AuthenticAMD
cpu family    : 16
model        : 9
model name    : AMD Opteron(tm) Processor 6174
stepping    : 1
cpu MHz        : 2200.039
cache size    : 512 KB
physical id    : 1
siblings    : 12
core id        : 5
cpu cores    : 12
apicid        : 27
initial apicid    : 27
fpu        : yes
fpu_exception    : yes
cpuid level    : 5
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid amd_dcm pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt nodeid_msr npt lbrv svm_lock nrip_save pausefilter
bogomips    : 4400.44
TLB size    : 1024 4K pages
clflush size    : 64
cache_alignment    : 64
address sizes    : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
#
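
For anyone comparing their own hosts against the figures above, here is a minimal sketch that pulls two numbers out of the same kind of output: swap currently in use, and the number of distinct CPU packages the kernel reports. The function names `swap_used_kb` and `socket_count` are hypothetical helpers written for this illustration and are not part of Proxmox VE or OpenVZ. The sketch assumes the usual `/proc/meminfo` and `/proc/cpuinfo` text layout shown in this post.

```shell
#!/bin/sh
# Sketch only: these helper names are hypothetical, not Proxmox or OpenVZ tools.

# Print swap in use (kB), given /proc/meminfo-format text on stdin.
swap_used_kb() {
    awk '$1 == "SwapTotal:" { total = $2 }
         $1 == "SwapFree:"  { unused = $2 }
         END { print total - unused }'
}

# Print the number of distinct "physical id" values (CPU packages),
# given /proc/cpuinfo-format text on stdin.
socket_count() {
    awk -F: '$1 ~ /^physical id/ { gsub(/[ \t]/, "", $2); seen[$2] = 1 }
             END { n = 0; for (id in seen) n++; print n }'
}

# Usage on a live host:
#   swap_used_kb < /proc/meminfo
#   socket_count < /proc/cpuinfo
```

Fed the SwapTotal and SwapFree values from the meminfo output above, `swap_used_kb` prints 275288, so only a small fraction of the swap space is in use. Running `socket_count` needs the full, unsnipped cpuinfo output to give a meaningful answer.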
 
What device is providing your NFS share? If all three boxes are crashing, that points to either the switch, the cabling, or the NFS device.
 
Thanks for your reply hotwired007.

Until recently, we were using a custom-built NFS cluster based on OpenIndiana with the illumos kernel as the NFS server. Hundreds of other Solaris 10 and SmartOS zones use it, as well as a dozen Citrix XenServer hosts, and in general we don't have any problems with it. We recently migrated the shares that the Proxmox containers use to a brand new NetApp box, but the hosts keep crashing.

The switches are fine (two separate Juniper EX4200 stacks). The brand new Intel NICs we put into the boxes are fine as well.

