PVE host not responding: kernel error?

fromport

Renowned Member
Feb 16, 2009
I just started powertop in an SSH session to see what the CPU is doing.
My SSH session froze, so I started a ping to the host to see if it was still alive: yes, no problems.
So I logged in with a second SSH session and found this in dmesg:
Code:
BUG: soft lockup - CPU#4 stuck for 11s! [kstopmachine:32366]
CPU 4:
Modules linked in: kvm_intel kvm vzethdev vznetdev simfs vzrst vzcpt tun vzdquota vzmon vzdev xt_tcpudp xt_length ipt_ttl xt_tcpmss xt_TCPMSS iptable_mangle iptable_filter xt_multiport xt_limit ipt_tos ipt_REJECT ip_tables x_tables ac battery ipv6 bridge joydev psmouse sg serio_raw e1000e evdev pcspkr button sr_mod cdrom xfs dm_mirror dm_snapshot dm_mod raid10 raid1 md_mod ide_generic ide_core sd_mod usbhid hid usb_storage libusual ahci ehci_hcd libata uhci_hcd scsi_mod usbcore thermal processor fan
Pid: 32366, comm: kstopmachine Not tainted 2.6.24-2-pve #1 ovz005
RIP: 0010:[<ffffffff80281e78>]  [<ffffffff80281e78>] stopmachine+0x68/0x100
RSP: 0018:ffff81025edcff30  EFLAGS: 00000202
RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000001
RDX: 0000000000000001 RSI: 0000000000000202 RDI: 0000000000000000
RBP: ffffffff804a6387 R08: ffff81025edce000 R09: ffff81000105f810
R10: ffff810001065ee0 R11: 0000000000000001 R12: 0000000000000004
R13: ffffffff8024a550 R14: 0000000000000000 R15: ffff81025edcfeb0
FS:  0000000000000000(0000) GS:ffff81032f6c3300(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000b7ae4000 CR3: 0000000330551000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

Call Trace:
 [<ffffffff8020d338>] child_rip+0xa/0x12
 [<ffffffff80281e10>] stopmachine+0x0/0x100
 [<ffffffff8020d32e>] child_rip+0x0/0x12
At that point the machine came back to life and powertop started as if nothing had happened. The load had risen to 34.
At that moment 3 KVM guests and 1 OpenVZ guest were running.
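As a side note, a minimal sketch of how one could keep an eye on further lockups and the load from a second SSH session (standard tools only, nothing Proxmox-specific):
Code:
# check the kernel log for further soft-lockup reports
dmesg | grep -i "soft lockup"

# watch the load average while the guests keep running
uptime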

pveversion -v
pve-manager: 1.1-4 (pve-manager/1.1/3746)
qemu-server: 1.0-10
pve-kernel: 2.6.24-5
pve-kvm: 83-1
pve-firmware: 1
vncterm: 0.9-1
vzctl: 3.0.23-1pve1
vzdump: 1.1-1
vzprocps: 2.0.11-1dso2
vzquota: 3.0.11-1dso1
# pveperf
CPU BOGOMIPS: 42564.11
REGEX/SECOND: 219550
HD SIZE: 99.95 GB (/dev/mapper/pve-root)
BUFFERED READS: 105.55 MB/sec
AVERAGE SEEK TIME: 10.78 ms
FSYNCS/SECOND: 3548.39
DNS EXT: 70.03 ms

Long dmesg @ http://dth.net/pve/dmesg_cpu_lockup
And here are the munin stats of this server: http://stats.bitsource.net/munin/la.ow.bitsource.net/vhost2.la.ow.bitsource.net.html
The spike can be clearly seen (when viewed not too long after I post this).

Any ideas/suggestions?
 
Is this a standard setup, or do you use something special (do you run a firewall on the host? Why is the xfs module loaded?)
 

No firewall
Special: yes
Code:
vhost2:~# cat /proc/mdstat
Personalities : [raid1] [raid10] 
md1 : active raid10 sda2[0] sdd2[3] sdc2[2] sdb2[1]
      1952523904 blocks 64K chunks 2 near-copies [4/4] [UUUU]
      
md0 : active raid1 sda1[0] sdb1[1]
      497856 blocks [2/2] [UU]
      
unused devices: <none>
vhost2:~# mount
/dev/mapper/pve-root on / type xfs (rw,noatime)
/dev/mapper/pve-data on /var/lib/vz type xfs (rw,noatime)
/dev/md0 on /boot type ext3 (rw)
vhost2:~# pvdisplay 
  --- Physical volume ---
  PV Name               /dev/md1
  VG Name               pve
  PV Size               1.82 TB / not usable 0
So: software RAID 10 (4x 1 TB), XFS as the main filesystem, except /boot, which is ext3.
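In case it helps with debugging, a quick sketch of commands that could be used to double-check the array and filesystem state (assuming mdadm and xfsprogs are installed, as they appear to be here):
Code:
# detailed health/status of the RAID 10 array behind the pve volume group
mdadm --detail /dev/md1

# geometry and mount options of the XFS filesystems
xfs_info /
xfs_info /var/lib/vz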
 
Both software RAID and XFS are known to be unstable (at least with the kernel we use). That is one reason why we do not support them. Please use a HW RAID controller and ext3 instead.
 

This is the first time I'm reading this.
I was under the impression you don't support it because most sysadmins lack knowledge about the matter and it is therefore harder to support.
I'm really surprised; XFS has been stable for me for a long time, and I'm running it with very old kernels in combination with software RAID.

One of my actual production servers: uname -a
Linux server5 2.6.10-rc3
It has been running with XFS/software RAID since I set the server up in 2001; the kernel is from Dec 2004.
Another machine under my control runs Debian with a Linux-VServer setup, also with software RAID and an XFS filesystem: stable for years.

/me scratches head...
Is this instability specifically related to the combination of OpenVZ with XFS/software RAID, as far as you know?
Do you have any idea if powertop could have contributed to the soft lockup?

Oh, BTW: I previously tried to boot your experimental pve3 kernel, but it won't find the pve LVM volume group, so it times out waiting for a root filesystem. Just wanted to give you that feedback.
 

XFS is not supported by OpenVZ (it needs ext3 for quota support).

Various users reported strange problems with SW RAID. I don't know the reason, but I know that it is impossible to debug.

Do you have any idea if powertop could have contributed to the soft lockup?

Try without it - is it more stable then?

Oh, BTW: I previously tried to boot your experimental pve3 kernel, but it won't find the pve LVM volume group, so it times out waiting for a root filesystem. Just wanted to give you that feedback.

The only change is the Intel gigabit network driver (igb.ko); everything else is the same. So please, can you test again, just to be sure?

- Dietmar
 

According to the OpenVZ FAQ, if you don't need quota you can use XFS (at least that's what I'm reading/want to read ;-) ).
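If that reading is correct, here is a sketch of how the quota dependency could be switched off globally (assuming the stock OpenVZ configuration layout; untested on PVE):
Code:
# /etc/vz/vz.conf - global OpenVZ settings
# disable per-container disk quota so ext3 is not strictly required
DISK_QUOTA=no

# afterwards, restart the OpenVZ service
/etc/init.d/vz restart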

I have removed powertop, and this machine/kernel is still running!
Even after the lockups I haven't rebooted, and everything is still "ok" (for now).
My gig-E is working, so there is no immediate need; I might try it later. Just wanted to let you know it's not working _for me_ (special setup with md?).
It's 6:41 here right now; I might get a few hours of sleep before my first appointment at 9:00 am ;-)
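One hedged guess regarding the md setup and the pve3 kernel: if the new initrd does not contain the array definitions, LVM cannot find the pve volume group at boot. A sketch of how that could be checked and refreshed on a Debian-based install (paths assume the standard mdadm package layout):
Code:
# record the current arrays so the initramfs can assemble them at boot
mdadm --detail --scan >> /etc/mdadm/mdadm.conf

# rebuild the initramfs for all installed kernels
update-initramfs -u -k all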
 
