Kernel Oops in openvz vzctl resume

philten

Hello,

While executing a backup (suspend mode), the resume
at the end of the process failed and left the OpenVZ
VM hung. It was impossible to stop the VM with vzctl;
I had to reboot the host node to recover control over
the VM.

I noticed many processes in state "D" while the VM was hung.

Other VMs on the same host node did not show any
problems after this Oops.
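
For reference, a quick way to list only the truly uninterruptible (D-state) tasks, together with the kernel function each one is sleeping in (a generic procps sketch, nothing OpenVZ-specific):

# Keep the header plus tasks whose STAT column starts with "D";
# the wchan column shows the kernel function each task is blocked in.
ps axo pid,stat,wchan:25,args | awk 'NR==1 || $2 ~ /^D/'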

Oct 26 03:38:08 ns300364 kernel: Unable to handle kernel NULL pointer dereference at 0000000000000808 RIP:
Oct 26 03:38:08 ns300364 kernel: [<ffffffff804c9708>] _spin_lock_irqsave+0x28/0xb0
Oct 26 03:38:08 ns300364 kernel: PGD 114c1b067 PUD 102886067 PMD 0
Oct 26 03:38:08 ns300364 kernel: Oops: 0002 [1] PREEMPT SMP
Oct 26 03:38:08 ns300364 kernel: CPU: 1
Oct 26 03:38:08 ns300364 kernel: Modules linked in: kvm_intel kvm vzethdev vznetdev simfs vzrst vzcpt tun vzdquota vzmon vzdev xt_length ... [module list truncated]
Oct 26 03:38:08 ns300364 kernel: Pid: 7467, comm: vzctl Not tainted 2.6.24-7-pve #1 ovz005
Oct 26 03:38:08 ns300364 kernel: RIP: 0010:[<ffffffff804c9708>] [<ffffffff804c9708>] _spin_lock_irqsave+0x28/0xb0
Oct 26 03:38:08 ns300364 kernel: RSP: 0000:ffff8101091cfdf8 EFLAGS: 00010002
Oct 26 03:38:08 ns300364 kernel: RAX: 0000000000000000 RBX: 0000000000000808 RCX: 00000000c0000100
Oct 26 03:38:08 ns300364 kernel: RDX: 0000000000000202 RSI: ffff81009bd2c8e0 RDI: 0000000000000808
Oct 26 03:38:08 ns300364 kernel: RBP: 0000000000000000 R08: ffff8101091ce000 R09: 0000000000000000
Oct 26 03:38:08 ns300364 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff81010bd44380
Oct 26 03:38:08 ns300364 kernel: R13: ffff81010bd50800 R14: ffff81010bd50998 R15: 0000000000001000
Oct 26 03:38:08 ns300364 kernel: FS: 00007f3bf90e26e0(0000) GS:ffff810117402880(0000) knlGS:0000000000000000
Oct 26 03:38:08 ns300364 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Oct 26 03:38:08 ns300364 kernel: CR2: 0000000000000808 CR3: 000000010bd87000 CR4: 00000000000026e0
Oct 26 03:38:08 ns300364 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 26 03:38:08 ns300364 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Oct 26 03:38:08 ns300364 kernel: Process vzctl (pid: 7467, veid=0, threadinfo ffff8101091ce000, task ffff81009880c8e0)
Oct 26 03:38:08 ns300364 kernel: Stack: ffff81009bd2c8e0 ffff8100b8c3d1c0 ffff81010bd50820 ffffffff884df796
Oct 26 03:38:08 ns300364 kernel: 0000000000000000 0000000000002d08 ffff81010bd50820 ffff81010c18ff00
Oct 26 03:38:08 ns300364 kernel: 0000000000000000 ffff81010bd50800 0000000000000000 ffffffff884dbfed
Oct 26 03:38:08 ns300364 kernel: Call Trace:
Oct 26 03:38:08 ns300364 kernel: [<ffffffff884df796>] :vzcpt:cpt_resume+0xe6/0x210
Oct 26 03:38:08 ns300364 kernel: [<ffffffff884dbfed>] :vzcpt:cpt_ioctl+0xa3d/0xeb0
Oct 26 03:38:08 ns300364 kernel: [<ffffffff804cba66>] do_page_fault+0x176/0x890
Oct 26 03:38:08 ns300364 kernel: [<ffffffff884db5b0>] :vzcpt:cpt_ioctl+0x0/0xeb0
Oct 26 03:38:08 ns300364 kernel: [<ffffffff8031c43e>] proc_reg_unlocked_ioctl+0xee/0x110
Oct 26 03:38:08 ns300364 kernel: [<ffffffff802e18af>] do_ioctl+0x2f/0xb0
Oct 26 03:38:08 ns300364 kernel: [<ffffffff802e1bbb>] vfs_ioctl+0x28b/0x300
Oct 26 03:38:08 ns300364 kernel: [<ffffffff802d2a3c>] vfs_write+0x12c/0x190
Oct 26 03:38:08 ns300364 kernel: [<ffffffff802e1c79>] sys_ioctl+0x49/0x80
Oct 26 03:38:08 ns300364 kernel: [<ffffffff8020c69e>] system_call+0x7e/0x83
Oct 26 03:38:08 ns300364 kernel:
Oct 26 03:38:08 ns300364 kernel:
Oct 26 03:38:08 ns300364 kernel: Code: 87 03 85 c0 7e 19 c7 43 04 00 00 00 00 48 89 d0 48 8b 5c 24
Oct 26 03:38:08 ns300364 kernel: RIP [<ffffffff804c9708>] _spin_lock_irqsave+0x28/0xb0
Oct 26 03:38:08 ns300364 kernel: RSP <ffff8101091cfdf8>
Oct 26 03:38:08 ns300364 kernel: CR2: 0000000000000808
Oct 26 03:38:08 ns300364 kernel: ---[ end trace 61c9cdf94503b360 ]---
Oct 26 03:38:08 ns300364 kernel: note: vzctl[7467] exited with preempt_count 1

root@ns300364:~# pveversion
pve-manager/1.3/4023
root@ns300364:~# pveversion -v
pve-manager: 1.3-1 (pve-manager/1.3/4023)
qemu-server: 1.0-14
pve-kernel: 2.6.24-11
pve-kvm: 86-3
pve-firmware: not correctly installed
vncterm: 0.9-2
vzctl: 3.0.23-1pve3
vzdump: 1.1-2
vzprocps: 2.0.11-1dso2
vzquota: 3.0.11-1dso1

root@ns300364:~# uname -a
Linux ns300364.ovh.net 2.6.24-7-pve #1 SMP PREEMPT Fri Aug 21 09:07:39 CEST 2009 x86_64 GNU/Linux

Any ideas about what could be the cause?

Thanks in advance

Phil Ten
 
I had this problem only once and, unfortunately, I can't reproduce it;
since that one occurrence it has not happened again.

I noticed a strange behavior before the Oops:
during the backup-suspend, the first rsync ran
very slowly (1h20 instead of 12 min!) but succeeded.

The server is not under heavy load, and I couldn't find any
hardware error reported for the hard disk.
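
These are the kinds of checks I mean (smartmontools assumed to be installed; device names are placeholders):

smartctl -a /dev/sda                      # SMART health, error and self-test logs
cat /proc/mdstat                          # md soft-RAID array status
dmesg | grep -i -E 'ata|sd[a-z]' | tail   # recent low-level disk errors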

Phil Ten
 
Hi,
I have the same issue with a kernel Oops (screenshot attached: lastcrash.jpg).


It happens periodically, with no discernible pattern.
I have vzdump with suspend mode executed every night for server backups. Several times I have seen this screen and needed to reboot the server.
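The nightly job is of this general form (the VEID, schedule and dump directory here are placeholders, not my actual values):

# /etc/cron.d entry driving a nightly suspend-mode backup (sketch):
30 2 * * *  root  vzdump --suspend --compress --dumpdir /backup 101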

How can I avoid this?
 
Sorry - I did not specify details; it is running 2.6.18:
uname -a
2.6.18-128.2.1.el5.028stab064.7 #1 SMP Wed Aug 26 15:47:17 MSD 2009 x86_64 x86_64 x86_64 GNU/Linux

vzyum-2.4.0-11
vzctl-3.0.23-1
ovzkernel-2.6.18-128.2.1.el5.028stab064.7
vzctl-lib-3.0.23-1
vzpkg-2.7.0-18
vzdump-1.2-4
vzquota-3.0.12-1
 
That is not a Proxmox VE kernel - we now ship a 2.6.18 kernel too, see
http://www.proxmox.com/forum/showthread.php?t=2853 - so go for it!
 
In case someone has a suggestion for solving this ...

Two more kernel Oopses while running "resume" within a vzdump suspend-mode backup:

pve-manager: 1.3-1 (pve-manager/1.3/4023)
qemu-server: 1.0-14
pve-kernel: 2.6.24-11
pve-kvm: 86-3
pve-firmware: not correctly installed
vncterm: 0.9-2
vzctl: 3.0.23-1pve3
vzdump: 1.1-2
vzprocps: 2.0.11-1dso2
vzquota: 3.0.11-1dso1
root@ns300364:/var/log# uname -a
Linux ns300364.ovh.net 2.6.24-7-pve #1 SMP PREEMPT Fri Aug 21 09:07:39 CEST 2009 x86_64 GNU/Linux
root@ns300364:/var/log#

This server is running soft RAID 1. I know it is not recommended (sorry, in some cases there is no other option).
I should mention that I run several other Proxmox servers, also with soft RAID 1, and only this one triggers the problem.
In fact, it is the only server running vzdump suspend backups; all the others run snapshot vzdump backups.
Also, Georges above seems to run suspend backups as well.
In other words, I am starting to suspect that this issue is related to the suspend/resume command.
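
For context, suspend mode wraps the file copy between a checkpoint freeze and a thaw, and the Oops above fires in the thaw step (cpt_resume). Roughly, and only as a sketch of what vzdump does internally (the VEID and paths are placeholders; the exact vzdump sequence may differ):

vzctl chkpnt 101 --suspend    # freeze VE 101; its tasks enter the "refrigerator"
rsync -a /var/lib/vz/private/101/ /backup/vzdump-101.tmp/    # copy the frozen VE
vzctl chkpnt 101 --resume     # thaw the VE -- the cpt_resume ioctl seen in the trace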

-------------------------------------------------------------------------
First Oops:

Feb 22 02:11:21 ns300364 kernel: Unable to handle kernel NULL pointer dereference at 0000000000000808 RIP:
Feb 22 02:11:21 ns300364 kernel: [<ffffffff804c9708>] _spin_lock_irqsave+0x28/0xb0
Feb 22 02:11:21 ns300364 kernel: PGD 1131dc067 PUD 10e4c9067 PMD 0
Feb 22 02:11:21 ns300364 kernel: Oops: 0002 [1] PREEMPT SMP
Feb 22 02:11:21 ns300364 kernel: CPU: 1
Feb 22 02:11:21 ns300364 kernel: Modules linked in: kvm_intel kvm vzethdev vznetdev simfs vzrst vzcpt tun vzdquota vzmon vzdev xt_length ipt_ttl xt_tcpmss xt_TCPMSS xt_multiport xt_limit ipt_tos ipt_REJECT xt_tcpudp xt_state iptable_filter iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack iptable_raw ip_tables x_tables dummy bridge ipv6 ata_generic pata_acpi ehci_hcd ohci1394 i2c_i801 uhci_hcd ieee1394 pata_marvell pcspkr evdev i2c_core usbcore e1000e button heci intel_agp dm_snapshot thermal processor fan sata_nv via686a ahci mptctl mptsas scsi_transport_sas mptspi mptscsih mptbase dm_crypt raid456 async_xor async_memcpy async_tx xor raid0 raid1 md_mod dm_mirror dm_mod sata_via ata_piix sata_sis pata_sis libata sym53c8xx megaraid aic7xxx scsi_transport_spi sd_mod 3w_xxxx scsi_mod atl1 sky2 skge r8169 e1000 via_rhine sis900 8139too e100 mii
Feb 22 02:11:21 ns300364 kernel: Pid: 7511, comm: vzctl Not tainted 2.6.24-7-pve #1 ovz005
Feb 22 02:11:21 ns300364 kernel: RIP: 0010:[<ffffffff804c9708>] [<ffffffff804c9708>] _spin_lock_irqsave+0x28/0xb0
Feb 22 02:11:21 ns300364 kernel: RSP: 0000:ffff810102d67df8 EFLAGS: 00010002
Feb 22 02:11:21 ns300364 kernel: RAX: 0000000000000000 RBX: 0000000000000808 RCX: 00000000c0000100
Feb 22 02:11:21 ns300364 kernel: RDX: 0000000000000202 RSI: ffff8100831348e0 RDI: 0000000000000808
Feb 22 02:11:21 ns300364 kernel: RBP: 0000000000000000 R08: ffff810102d66000 R09: 0000000000000000
Feb 22 02:11:21 ns300364 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8100d7965f80
Feb 22 02:11:21 ns300364 kernel: R13: ffff81003d5b9000 R14: ffff81003d5b9198 R15: 0000000000001000
Feb 22 02:11:21 ns300364 kernel: FS: 00007f8f0ad876e0(0000) GS:ffff810117402880(0000) knlGS:0000000000000000
Feb 22 02:11:21 ns300364 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb 22 02:11:21 ns300364 kernel: CR2: 0000000000000808 CR3: 0000000108089000 CR4: 00000000000026e0
Feb 22 02:11:21 ns300364 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 22 02:11:21 ns300364 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Feb 22 02:11:21 ns300364 kernel: Process vzctl (pid: 7511, veid=0, threadinfo ffff810102d66000, task ffff810046634000)
Feb 22 02:11:21 ns300364 kernel: Stack: ffff8100831348e0 ffff81001b1bd1c0 ffff81003d5b9020 ffffffff884c8796
Feb 22 02:11:21 ns300364 kernel: 0000000000000000 0000000000002d08 ffff81003d5b9020 ffff8100d7be5200
Feb 22 02:11:21 ns300364 kernel: 0000000000000000 ffff81003d5b9000 0000000000000000 ffffffff884c4fed
Feb 22 02:11:21 ns300364 kernel: Call Trace:
Feb 22 02:11:21 ns300364 kernel: [<ffffffff884c8796>] :vzcpt:cpt_resume+0xe6/0x210
Feb 22 02:11:21 ns300364 kernel: [<ffffffff884c4fed>] :vzcpt:cpt_ioctl+0xa3d/0xeb0
Feb 22 02:11:21 ns300364 kernel: [<ffffffff804cba66>] do_page_fault+0x176/0x890
Feb 22 02:11:21 ns300364 kernel: [<ffffffff884c45b0>] :vzcpt:cpt_ioctl+0x0/0xeb0
Feb 22 02:11:21 ns300364 kernel: [<ffffffff8031c43e>] proc_reg_unlocked_ioctl+0xee/0x110
Feb 22 02:11:21 ns300364 kernel: [<ffffffff802e18af>] do_ioctl+0x2f/0xb0
Feb 22 02:11:21 ns300364 kernel: [<ffffffff802e1bbb>] vfs_ioctl+0x28b/0x300
Feb 22 02:11:21 ns300364 kernel: [<ffffffff802d2a3c>] vfs_write+0x12c/0x190
Feb 22 02:11:21 ns300364 kernel: [<ffffffff802e1c79>] sys_ioctl+0x49/0x80
Feb 22 02:11:21 ns300364 kernel: [<ffffffff8020c69e>] system_call+0x7e/0x83
Feb 22 02:11:21 ns300364 kernel:
Feb 22 02:11:21 ns300364 kernel:
Feb 22 02:11:21 ns300364 kernel: Code: 87 03 85 c0 7e 19 c7 43 04 00 00 00 00 48 89 d0 48 8b 5c 24
Feb 22 02:11:21 ns300364 kernel: RIP [<ffffffff804c9708>] _spin_lock_irqsave+0x28/0xb0
Feb 22 02:11:21 ns300364 kernel: RSP <ffff810102d67df8>
Feb 22 02:11:21 ns300364 kernel: CR2: 0000000000000808
Feb 22 02:11:21 ns300364 kernel: ---[ end trace 71ce71d54ca466b6 ]---
Feb 22 02:11:21 ns300364 kernel: note: vzctl[7511] exited with preempt_count 1

-------------------------------------------------------------------------------------------
Second Oops

(I had to remove the second oops because the post was too large)

-----------------------------------------------------------------------------------------
The result after these Oopses is a few processes locked in state D.

root@ns300364:/var/log# ps axl | grep D
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
4 1000 5967 4324 20 0 57924 32560 futex_ Sl ? 0:05 /usr/bin/mono /var/www/dekiwiki/bin/mindtouch.host.exe apikey jPcYGyFZ86f48kmu5Dq04POIiSBMcpGt script /etc/dekiwiki/mindtouch.deki.startup.xml path-prefix @api http-port 8081 ip localhost connect-limit -5 notty
0 500 13637 4221 20 0 6732 4880 refrig D ? 0:00 /usr/bin/perl /opt/zimbra/libexec/logswatch --config-file=/opt/zimbra/conf/logswatchrc --use-cpan-file-tail --pid-file=/opt/zimbra/log/logswatch.pid --script-dir=/opt/zimbra/data/tmp -t /var/log/zimbra-stats.log
0 500 13639 13637 20 0 8764 6796 refrig D ? 0:00 /usr/bin/perl /opt/zimbra/data/tmp/.swatch_script.5585
0 500 13641 13639 20 0 16692 5852 refrig D ? 2:04 /usr/bin/perl /opt/zimbra/libexec/zmlogger
0 500 13787 4221 20 0 6732 4896 refrig D ? 0:00 /usr/bin/perl /opt/zimbra/libexec/swatch --config-file=/opt/zimbra/conf/swatchrc --use-cpan-file-tail --script-dir=/opt/zimbra/data/tmp -t /var/log/zimbra.log
0 500 13829 13787 20 0 8768 6768 refrig D ? 0:00 /usr/bin/perl /opt/zimbra/data/tmp/.swatch_script.5730
1 500 13886 4221 20 0 9912 7384 refrig D ? 0:00 /usr/bin/perl /opt/zimbra/libexec/zmmtaconfig
1 0 16087 4221 20 0 1656 388 wait Ss ? 0:00 /opt/zimbra/libexec/zmmailboxdmgr start -server -Djava.awt.headless=true -XX:+UseConcMarkSweepGC -XX:NewRatio=2 -XX:PermSize=128m -XX:MaxPermSize=128m -XX:SoftRefLRUPolicyMSPerMB=1 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -Xss256k -Xms1512m -Xmx1512m -Xmn378m
4 500 16088 16087 20 0 1896476 507404 futex_ Sl ? 1:52 /opt/zimbra/java/bin/java -server -Djava.awt.headless=true -XX:+UseConcMarkSweepGC -XX:NewRatio=2 -XX:PermSize=128m -XX:MaxPermSize=128m -XX:SoftRefLRUPolicyMSPerMB=1 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -Xss256k -Xms1512m -Xmx1512m -Xmn378m -Djava.io.tmpdir=/opt/zimbra/mailboxd/work -Djava.library.path=/opt/zimbra/lib -Djava.endorsed.dirs=/opt/zimbra/mailboxd/common/endorsed -Dzimbra.config=/opt/zimbra/conf/localconfig.xml -Djetty.home=/opt/zimbra/mailboxd -DSTART=/opt/zimbra/mailboxd/etc/start.config -jar /opt/zimbra/mailboxd/start.jar /opt/zimbra/mailboxd/etc/jetty.properties /opt/zimbra/mailboxd/etc/jetty-setuid.xml /opt/zimbra/mailboxd/etc/jetty.xml
0 500 17825 4221 20 0 6420 4648 refrig D ? 0:00 /usr/bin/perl -w? /opt/zimbra/libexec/zmstat-io
0 500 17828 4221 20 0 6420 4580 refrig D ? 0:00 /usr/bin/perl -w? /opt/zimbra/libexec/zmstat-fd
0 500 17832 4221 20 0 6972 5188 refrig D ? 1:11 /usr/bin/perl -w /opt/zimbra/libexec/zmstat-allprocs
0 500 17851 4221 20 0 6552 4744 refrig D ? 0:10 /usr/bin/perl -w? /opt/zimbra/libexec/zmstat-mysql
0 500 17875 4221 20 0 6420 4596 refrig D ? 0:02 /usr/bin/perl -w? /opt/zimbra/libexec/zmstat-mtaqueue
0 500 18687 17825 20 0 1692 596 refrig D ? 0:00 /usr/bin/iostat -d -k 30
0 500 26295 26294 20 0 429956 30380 futex_ Sl ? 0:01 /opt/zimbra/java/bin/java -XX:ErrorFile=/opt/zimbra/log -client -Xmx256m -Dzimbra.home=/opt/zimbra -Djava.library.path=/opt/zimbra/lib -Djava.ext.dirs=/opt/zimbra/java/jre/lib/ext:/opt/zimbra/lib/jars:/opt/zimbra/lib/ext-common:/opt/zimbra/lib/ext/clamscanner com.zimbra.common.util.ConsoleRunner com.zimbra.cs.account.ProvUtil -l gs zimbra.identeasy.com
0 500 26330 26196 20 0 430076 28172 futex_ Sl ? 0:01 /opt/zimbra/java/bin/java -XX:ErrorFile=/opt/zimbra/log -client -Xmx256m -Dzimbra.home=/opt/zimbra -Djava.library.path=/opt/zimbra/lib -Djava.ext.dirs=/opt/zimbra/java/jre/lib/ext:/opt/zimbra/lib/jars:/opt/zimbra/lib/ext-common:/opt/zimbra/lib/ext/clamscanner com.zimbra.common.util.ConsoleRunner com.zimbra.cs.account.ProvUtil gas mta
0 0 26766 22473 20 0 8888 796 pipe_w S+ pts/0 0:00 grep D
root@ns300364:/var/log#
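
Note that "ps axl | grep D" also matches lines that merely contain a D somewhere (the futex_/Sl java lines above, and the grep itself); filtering on the STAT column is more precise. The "refrig" wchan is telling: those tasks are still sitting in the checkpoint refrigerator, i.e. they were frozen for the suspend and never thawed by the failed resume. A tighter filter (plain procps, nothing OpenVZ-specific):

# STAT is the 10th column of "ps axl"; keep the header and real D-state rows only.
ps axl | awk 'NR==1 || $10 ~ /^D/'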
 
Thank you for your reply. I think I will keep your suggestion as a last resort,
as I am afraid of triggering other problems by replacing the kernel.
I will first try snapshot backups to see if that helps, as sketched below.
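
The switch is just a mode flag, but snapshot mode needs the VE storage on LVM with spare extents in the volume group for the snapshot (VEID and dump directory are placeholders):

# Snapshot mode: backs up from a short-lived LVM snapshot, so the VE
# keeps running and no suspend/resume (cpt_resume) is involved.
vzdump --snapshot --compress --dumpdir /backup 101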

Phil Ten
 
Not updating will also lead to "other" problems, so I really recommend going for the latest kernel, as there have been a lot of fixes in the last year. (Make sure you have a similar test environment and a working backup/restore plan.)
 
After a long time I got a chance to move my project from CentOS/OpenVZ to Debian/Proxmox.

Everything works great now; no issues at all with the kernel or with vzdump/vzrestore on Debian.
 
I saw a similar freezing problem that seemed to have something to do with Java, and Zimbra uses a lot of Java.

4 500 16088 16087 20 0 1896476 507404 futex_ Sl ? 1:52 /opt/zimbra/java/bin/java -server -Djava.awt.headless=true -

Most of those locked-up processes look to be Zimbra processes.