PVE lockup

Dear all,

Sorry for reopening this old thread, but the problem persists, and since nothing has changed except the software versions, I decided not to open a new one.

We have two identical servers (please find the output of pveperf, pveversion and lspci below), both equipped with 32 GB of RAM and a RAID10 set, with the newest BIOS and the newest RAID firmware. Sometimes they just hang. It happens frequently during backup (in both snapshot and suspend/resume mode), but sometimes also without an obvious reason, even when there is not much load on the machine.

Both machines are running several versions of Windows (XP, 2003, 2008) and Linux (Debian, RHEL, CentOS) as fully virtualized (kvm) machines.

When one of the machines hangs, the last message on the console (and in syslog) is usually

Code:
kernel: EXT3-fs: mounted filesystem with ordered data mode.

Hangups sometimes happen several days in a row, sometimes only once in a month. We have absolutely no clue what could be the root cause of the problem.

Still, any help is highly appreciated.

Best regards,
Christoph


Code:
# pveperf
CPU BOGOMIPS:      45335.09
REGEX/SECOND:      946221
HD SIZE:           94.49 GB (/dev/pve/root)
BUFFERED READS:    340.48 MB/sec
AVERAGE SEEK TIME: 10.39 ms
FSYNCS/SECOND:     2220.20
DNS EXT:           39.19 ms
DNS INT:           0.64 ms


# pveversion -v
pve-manager: 1.5-10 (pve-manager/1.5/4822)
running kernel: 2.6.24-9-pve
pve-kernel-2.6.24-9-pve: 2.6.24-18
pve-kernel-2.6.24-8-pve: 2.6.24-16
qemu-server: 1.1-16
pve-firmware: 1.0-5
libpve-storage-perl: 1.0-13
vncterm: 0.9-2
vzctl: 3.0.23-1pve11
vzdump: 1.2-5
vzprocps: 2.0.11-1dso2
vzquota: 3.0.11-1


# lspci
00:00.0 Host bridge: Intel Corporation 5000P Chipset Memory Controller Hub (rev b1)
00:02.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x8 Port 2-3 (rev b1)
00:04.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x8 Port 4-5 (rev b1)
00:06.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x8 Port 6-7 (rev b1)
00:08.0 System peripheral: Intel Corporation 5000 Series Chipset DMA Engine (rev b1)
00:10.0 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev b1)
00:10.1 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev b1)
00:10.2 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev b1)
00:11.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved Registers (rev b1)
00:13.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved Registers (rev b1)
00:15.0 Host bridge: Intel Corporation 5000 Series Chipset FBD Registers (rev b1)
00:16.0 Host bridge: Intel Corporation 5000 Series Chipset FBD Registers (rev b1)
00:1c.0 PCI bridge: Intel Corporation 631xESB/632xESB/3100 Chipset PCI Express Root Port 1 (rev 09)
00:1d.0 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #1 (rev 09)
00:1d.1 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #2 (rev 09)
00:1d.2 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #3 (rev 09)
00:1d.7 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset EHCI USB2 Controller (rev 09)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d9)
00:1f.0 ISA bridge: Intel Corporation 631xESB/632xESB/3100 Chipset LPC Interface Controller (rev 09)
00:1f.2 IDE interface: Intel Corporation 631xESB/632xESB/3100 Chipset SATA IDE Controller (rev 09)
00:1f.3 SMBus: Intel Corporation 631xESB/632xESB/3100 Chipset SMBus Controller (rev 09)
01:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Upstream Port (rev 01)
01:00.3 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express to PCI-X Bridge (rev 01)
02:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Downstream Port E1 (rev 01)
02:02.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Downstream Port E3 (rev 01)
03:00.0 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge A (rev 09)
03:00.2 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge B (rev 09)
06:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
06:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
09:00.0 RAID bus controller: 3ware Inc 9690SA SAS/SATA-II RAID PCIe (rev 01)
0b:01.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)
 
This same kernel bug keeps resurfacing again and again (but, it seems, only on Intel hardware).

The symptoms are quite diverse, showing up with NFS, FTP, SSH, rsync and SMB, but they are always linked to high I/O. With high I/O, you lose the system.

I got my test server at home to work a year ago (I wish I could recall how), but now I'm trying to reinstall with Proxmox 1.5 and different kernels (and also with the latest Ubuntu and OpenVZ) and I'm failing miserably. Every time I try to copy a file larger than 200 MB to the Proxmox server, I get CRC corruption in the files and errors on the filesystem. The bigger the file, the more certain it is that the system freezes. (It's NOT a faulty HD or card.)

I'm going to try the acpi=off kernel option in GRUB. I seem to recall something about that from the last time I did this. (I really should get an AMD test server.)
 
Here is one solution I found that affects NFS.
http://ubuntuforums.org/showthread.php?t=1478413&page=3


Re: large NFS copy locks up/hang client with large files (again) (lucid)


In my case, however, it was the 'sync' option causing all the grief. I followed a couple of NFSv4 guides and decided to use 'sync' rather than 'async' since it sounds safer...

Changed /etc/exports to:

/exports 192.168.x.y(rw,fsid=0,insecure,async,no_subtree_check)
/exports/share0 192.168.x.y(rw,nohide,insecure,async,no_subtree_check)
/exports/temp0 192.168.x.y(rw,nohide,insecure,async,no_subtree_check)
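
For anyone who wants to try the same change on an existing file, here is a minimal sketch. It assumes the exports use the plain `sync` option inside the parentheses; the function name `to_async` is made up for illustration. Keep in mind that with `async` the server acknowledges writes before they reach the disk, so a crash or hang can lose data.

```shell
# to_async: read an exports file on stdin and write it to stdout with the
# standalone "sync" option replaced by "async". Existing "async" options
# are left alone because the match requires "(" or "," right before "sync".
to_async() {
    sed -E 's/([(,])sync([,)])/\1async\2/g'
}

# Usage (review the result before replacing the real file):
#   to_async < /etc/exports > /tmp/exports.new
# After installing the new file, re-export everything with:
#   exportfs -ra
```

The function only filters text, so it can be tried on a copy of the file first.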

AND my NFSv4 WRITE speeds went from 7 MB/s to 80 MB/s - quite an improvement! Did md5sum on the transfers and all is well.

I also tried the /proc/sys/vm/* changes suggested above. Got 80 MB/s WITH the changed values, but also got the same 80 MB/s after a reboot, when the /proc/sys/vm/* values had reverted to their originals. Therefore, async made all the difference for me...

time dd if=hd1c_8x4GB.mpg of=/mnt/nt_tmp/hd1c_8x4GB.mpg
17676766+1 records in
17676766+1 records out
9050504616 bytes (9.1 GB) copied, 113.261 s, 79.9 MB/s
real 1m53.372s
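
The md5sum check mentioned above can be wrapped in a small helper, which is also a quick way to spot the kind of corruption on large copies described earlier in the thread. This is only a sketch: the function name `verify_copy` is hypothetical, and the paths in the usage line are simply the ones from the dd example.

```shell
# verify_copy SRC DST: succeed only if both files have the same MD5 sum.
# SRC and DST are placeholders; the function name is made up for this sketch.
verify_copy() {
    # md5sum prints "<hash>  <file>"; a missing file makes it fail.
    vc_src=$(md5sum "$1") || return 1
    vc_dst=$(md5sum "$2") || return 1
    # Keep only the hash (everything before the first space).
    vc_src=${vc_src%% *}
    vc_dst=${vc_dst%% *}
    if [ "$vc_src" = "$vc_dst" ]; then
        echo "OK: checksums match ($vc_src)"
    else
        echo "MISMATCH: source=$vc_src copy=$vc_dst"
        return 1
    fi
}

# Usage (paths taken from the dd example above):
#   verify_copy hd1c_8x4GB.mpg /mnt/nt_tmp/hd1c_8x4GB.mpg
```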
 
Found it.

apt-get install ethtool
ethtool -K eth0 rx off tx off
Now I can use the three NICs on my test server again without worrying about the server freezing.

.. I put it in a script here...
/etc/network/if-up.d/broadfix

..make it executable..
chmod +x /etc/network/if-up.d/broadfix

..content of the script..
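#!/bin/bash
# Run by ifupdown for each interface that comes up; $IFACE holds its name.
if [[ "$IFACE" == eth[01] ]]; then
    # Turn off RX/TX checksum offloading on eth0 and eth1 only.
    ethtool -K "$IFACE" rx off tx off
fi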
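
To confirm that the setting actually took effect, `ethtool -k` (lowercase k) lists the current offload settings. Below is a rough sketch of a check that parses that output. The function name `checksum_offload_off` is an assumption for illustration, and it relies on the output containing `rx-checksumming:` and `tx-checksumming:` lines, which may differ between ethtool versions.

```shell
# checksum_offload_off: reads `ethtool -k IFACE` output on stdin and
# succeeds only if both rx- and tx-checksumming are reported as "off".
checksum_offload_off() {
    awk '
        /^[[:space:]]*rx-checksumming:/ { rx = $2 }
        /^[[:space:]]*tx-checksumming:/ { tx = $2 }
        END { exit !(rx == "off" && tx == "off") }
    '
}

# Usage:
#   ethtool -k eth0 | checksum_offload_off && echo "checksum offload disabled"
```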
 
Hi SamTzu,

Thanks, I'm going to try this right now. Let's see what happens.

Do you think that the problem is actually related to the Ethernet controller?

Christoph
 
I think my test hardware has two different problems.
I can now copy much bigger files than before, but last night my test copy of a 500 GB KVM image failed with an annoyingly cryptic error message. Anyway, we normally don't use VM images larger than 32 GB because of the time it takes to restore them over the network from backup.
 
Hi SamTzu,

The Ethernet change did not improve the situation; one of my two machines locked up again last night.

SSH brought up the login prompt but froze after I typed in the username. The console was frozen as well: no console switching or login was possible. The machine printed a message when I unplugged the Ethernet cable and plugged it back in, but that only worked once. Pushing the reset button was required, as always in this situation.

I still have no idea...

Christoph
 
Hi SamTzu,

I just upgraded our machines; let's see what happens this week.

Best Regards,
Christoph
 
Hi all,

With the current version, the problem occurs much less frequently. However, it still persists.

Regards,
Christoph

Code:
pve-manager: 1.6-5 (pve-manager/1.6/5261)
running kernel: 2.6.24-9-pve
pve-kernel-2.6.24-9-pve: 2.6.24-18
pve-kernel-2.6.24-8-pve: 2.6.24-16
qemu-server: 1.1-22
pve-firmware: 1.0-9
libpve-storage-perl: 1.0-14
vncterm: 0.9-2
vzctl: 3.0.24-1pve4
vzdump: 1.2-8
vzprocps: 2.0.11-1dso2
vzquota: 3.0.11-1