aszeszo
Guest
Hi All
We are running Proxmox VE on three identical four-socket AMD boxes and are very happy with it; it is a nice, useful project. We use OpenVZ exclusively. The majority of our containers do transcoding with ffmpeg and VLC and write their data to an NFS share.
Unfortunately, the boxes have crashed at least once a month since we put them into production in July 2012. We have kept the machines up to date with software updates, but none of the newer kernels resolved the problem. In the past, the majority of the panic stack traces mentioned NFS. We suspected that NFS client support inside the containers might be buggy, so we switched to bind mounts from the host (the global zone). Unfortunately, the machine on which we made that change crashed again last night.
Below is some info about the environment. Please let me know what other information I can provide to help diagnose the problem.
Cheers,
Andrzej
A few sample screenshots with kernel stack traces:
http://linux01.everycity.co.uk/~aszeszo/skitched-20130109-175746.png
http://linux01.everycity.co.uk/~aszeszo/skitched-20130109-223555.png
http://linux01.everycity.co.uk/~aszeszo/skitched-20130123-180431.png
http://linux01.everycity.co.uk/~aszeszo/skitched-20130130-160401.png
Power management settings:
http://linux01.everycity.co.uk/~aszeszo/skitched-20130208-120846.png
dmesg:
http://paste.ec/?f401e3dc44f12ece#VkiJMlbKeGWxXCoHaMEfH8+hlHu7QNNscswi7HpKbQA=
Code:
# uname -a
Linux localhost 2.6.32-17-pve #1 SMP Wed Nov 28 07:15:55 CET 2012 x86_64 GNU/Linux
# gzip -dc /usr/share/doc/pve-kernel-2.6.32-17-pve/changelog.Debian.gz | head -5
pve-kernel-2.6.32 (2.6.32-83) unstable; urgency=low
* update to vzkernel-2.6.32-042stab065.3.src.rpm
-- Proxmox Support Team <support@proxmox.com> Wed, 28 Nov 2012 06:55:15 +0100
# cat /proc/meminfo
MemTotal: 65954784 kB
MemFree: 52777540 kB
Buffers: 973340 kB
Cached: 10163764 kB
SwapCached: 275288 kB
Active: 3712844 kB
Inactive: 8406784 kB
Active(anon): 691080 kB
Inactive(anon): 329540 kB
Active(file): 3021764 kB
Inactive(file): 8077244 kB
Unevictable: 61540 kB
Mlocked: 61540 kB
SwapTotal: 53477368 kB
SwapFree: 53202080 kB
Dirty: 120452 kB
Writeback: 0 kB
AnonPages: 824104 kB
Mapped: 105288 kB
Shmem: 31212 kB
Slab: 599764 kB
SReclaimable: 499148 kB
SUnreclaim: 100616 kB
KernelStack: 14208 kB
PageTables: 17296 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 86454760 kB
Committed_AS: 2312112 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 347372 kB
VmallocChunk: 34299128716 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 6628 kB
DirectMap2M: 3129344 kB
DirectMap1G: 63963136 kB
# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 16
model : 9
model name : AMD Opteron(tm) Processor 6174
stepping : 1
cpu MHz : 2200.039
cache size : 512 KB
physical id : 0
siblings : 12
core id : 0
cpu cores : 12
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid amd_dcm pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt nodeid_msr npt lbrv svm_lock nrip_save pausefilter
bogomips : 4400.07
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
[snip]
processor : 47
vendor_id : AuthenticAMD
cpu family : 16
model : 9
model name : AMD Opteron(tm) Processor 6174
stepping : 1
cpu MHz : 2200.039
cache size : 512 KB
physical id : 1
siblings : 12
core id : 5
cpu cores : 12
apicid : 27
initial apicid : 27
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid amd_dcm pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt nodeid_msr npt lbrv svm_lock nrip_save pausefilter
bogomips : 4400.44
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
#
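The meminfo dump above is long, so a small helper like the following may make it easier to compare the hosts at a glance. It is a minimal sketch, assuming the standard /proc/meminfo layout; the function name mem_summary and the output labels are made up for illustration.

```shell
#!/bin/sh
# Print free memory and free swap as a percentage of their totals.
# Reads meminfo-formatted text from the file given as $1
# (default /proc/meminfo; "-" reads standard input).
mem_summary() {
    awk '
        $1 == "MemTotal:"  { mt = $2 }
        $1 == "MemFree:"   { mf = $2 }
        $1 == "SwapTotal:" { st = $2 }
        $1 == "SwapFree:"  { sf = $2 }
        END {
            if (mt > 0) printf "mem_free_pct=%.1f\n", 100 * mf / mt
            if (st > 0) printf "swap_free_pct=%.1f\n", 100 * sf / st
        }
    ' "${1:-/proc/meminfo}"
}

# Summarise the live system when /proc/meminfo is available.
if [ -r /proc/meminfo ]; then
    mem_summary
fi
```

Fed the figures from the dump above, it reports about 80% of RAM and over 99% of swap free at the time the snapshot was taken.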