Kernel Panic migrating an OpenVz Container

superbit

I'm using Proxmox 2.0-29/18400f07 on four servers in my LAN, all of them running in a cluster.
I have a problem with the last server I added to the cluster: when I try to migrate a VE (OpenVZ container) to this server, the process never finishes cleanly; I always get a kernel panic.
I reinstalled Proxmox and added the server to the cluster again, and the same thing happens.
I don't think it's a hardware problem.
Any suggestions? :(

This is the output in my SSH client:
Code:
Message from syslogd@dom at Apr 20 13:22:57 ...
 kernel:Oops: 0000 [#1] SMP

Message from syslogd@dom at Apr 20 13:22:57 ...
 kernel:last sysfs file: /sys/kernel/uevent_seqnum

Message from syslogd@dom at Apr 20 13:22:57 ...
 kernel:Stack:

Message from syslogd@dom at Apr 20 13:22:57 ...
 kernel:Call Trace:

Message from syslogd@dom at Apr 20 13:22:57 ...
 kernel:Code: 03 00 00 65 48 8b 04 25 c8 cb 00 00 48 8b 80 38 e0 ff ff a8 08 0f 85 55 fb ff ff 48 81 c4 88 00 00 00 5b 41 5c 41 5d 41 5e 41 5f <c9> c3 66 0f 1f 44 00 00 48 29 d0 48 63 c9 48 89 c2 48 8b 05 30

Message from syslogd@dom at Apr 20 13:22:57 ...
 kernel:CR2: 0000000000000000
 
We need a more complete log to debug this further. Try using netconsole for that:

http://wiki.openvz.org/Remote_console_setup
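Something along these lines usually works (all addresses, ports, the interface name and the MAC below are placeholders; use the MAC of the receiving box, or of your gateway if the receiver is on another subnet):

Code:
# on the panicking node (sender): load netconsole
# format: <src-port>@<src-ip>/<dev>,<dst-port>@<dst-ip>/<dst-mac>
modprobe netconsole netconsole=6665@172.16.0.11/eth0,6666@172.16.0.10/00:11:22:33:44:55

# on the receiving node: capture the UDP stream to a file
nc -u -l -p 6666 | tee netconsole.log    # or 'nc -u -l 6666', depending on your netcat flavour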

Please also post the migration task output.

Sorry, I was without Internet access for a few days.
I had problems with netconsole: the server I used to receive the data wasn't receiving anything. I fixed it by running dmesg -n 8 on the sender.
Migrating an OpenVZ container while it is stopped works fine; the problem only seems to happen when I use live migration.
Anyway, here are the outputs of netconsole and the migration task.

Netconsole:
Code:
console [netcon0] enabled
netconsole: network logging started
warning: `vzctl' uses 32-bit capabilities (legacy support in use)
CT: 1111: started
CT: 1111: stopped
CT: 1111: started
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff81511ed0>] thread_return+0xb0/0x7d0
PGD 7a589067 PUD 37bb7067 PMD 7a5a3067 PTE 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/kernel/uevent_seqnum
CPU 0
Modules linked in: netconsole vhost_net macvtap macvlan tun kvm_intel kvm vzethdev vznetdev simfs vzrst vzcpt nfs lockd fscache nfs_acl auth_rpcgss sunrpc vzdquota vzmon vzdev ip6t_REJECT ip6table_mangle ip6table_filter ip6_tables xt_owner xt_mac ipt_REDIRECT nf_nat_irc nf_nat_ftp iptable_nat nf_nat xt_helper xt_state xt_conntrack nf_conntrack_irc nf_conntrack_ftp nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 xt_length ipt_LOG xt_hl xt_tcpmss xt_TCPMSS ipt_REJECT xt_DSCP xt_dscp xt_multiport xt_limit iptable_mangle iptable_filter ip_tables dlm configfs vzevent ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi fuse snd_pcsp snd_pcm snd_timer i3000_edac snd shpchp edac_core tpm_tis soundcore tpm snd_page_alloc tpm_bios serio_raw ext3 jbd mbcache sg ata_generic pata_acpi tg3 ata_piix [last unloaded: scsi_wait_scan]

Pid: 2711, comm: sshd veid: 1111 Not tainted 2.6.32-11-pve #1 042stab053_5 HP ProLiant ML110 G4/ML110 G4
RIP: 0010:[<ffffffff81511ed0>]  [<ffffffff81511ed0>] thread_return+0xb0/0x7d0
RSP: 0018:ffff88003727bf18  EFLAGS: 00010296
RAX: 0000000000020004 RBX: 0000000000000000 RCX: 0000000000000068
RDX: ffffffff81f05b80 RSI: 0000000000000282 RDI: 0000000000000282
RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff880002600000(0063) knlGS:00000000b7783ad0
CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000078d68000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process sshd (pid: 2711, veid: 1111, threadinfo ffff88003727a000, task ffff880037a761c0)
Stack:
 0000000000000000 0000000000000000 0000000000000000 0000000000000000
<0> 0000000000000000 0000000000000000 0000000000000000 0000000000000000
<0> 0000000000000000 0000000000000000 0000000000000000 0000000000000000
Call Trace:
Code: 03 00 00 65 48 8b 04 25 c8 cb 00 00 48 8b 80 38 e0 ff ff a8 08 0f 85 55 fb ff ff 48 81 c4 88 00 00 00 5b 41 5c 41 5d 41 5e 41 5f <c9> c3 66 0f 1f 44 00 00 48 29 d0 48 63 c9 48 89 c2 48 8b 05 30
RIP  [<ffffffff81511ed0>] thread_return+0xb0/0x7d0
 RSP <ffff88003727bf18>
CR2: 0000000000000000
---[ end trace b3d85a5b15a5a6c4 ]---
Kernel panic - not syncing: Fatal exception
Pid: 2711, comm: sshd veid: 1111 Tainted: G      D    ----------------   2.6.32-11-pve #1
Call Trace:
 [<ffffffff815116c3>] ? panic+0x78/0x143
 [<ffffffff815159e4>] ? oops_end+0xe4/0x100
 [<ffffffff8104224b>] ? no_context+0xfb/0x260
 [<ffffffff8118ea5a>] ? do_sync_read+0xfa/0x140
 [<ffffffff810424c5>] ? __bad_area_nosemaphore+0x115/0x1e0
 [<ffffffff81060b51>] ? update_curr+0xe1/0x1f0
 [<ffffffff810425fe>] ? bad_area+0x4e/0x60
 [<ffffffff81042d06>] ? __do_page_fault+0x3c6/0x490
 [<ffffffff8113230d>] ? free_hot_page+0x2d/0x60
 [<ffffffff81132620>] ? __free_pages+0x60/0x90
 [<ffffffff8113268e>] ? free_pages+0x3e/0x40
 [<ffffffff8151907c>] ? kprobe_flush_task+0xbc/0xe0
 [<ffffffff815179ae>] ? do_page_fault+0x3e/0xa0
 [<ffffffff81514d15>] ? page_fault+0x25/0x30
 [<ffffffff81511ed0>] ? thread_return+0xb0/0x7d0

Migration task:
Code:
Apr 25 13:19:29 starting migration of CT 1111 to node 'dom' (172.16.0.10)
Apr 25 13:19:29 container is running - using online migration
Apr 25 13:19:29 starting rsync phase 1
Apr 25 13:19:29 # /usr/bin/rsync -aH --delete --numeric-ids --sparse /var/lib/vz/private/1111 root@172.16.0.10:/var/lib/vz/private
Apr 25 13:19:59 start live migration - suspending container
Apr 25 13:19:59 dump container state
Apr 25 13:19:59 copy dump file to target node
Apr 25 13:19:59 starting rsync (2nd pass)
Apr 25 13:19:59 # /usr/bin/rsync -aH --delete --numeric-ids /var/lib/vz/private/1111 root@172.16.0.10:/var/lib/vz/private
Apr 25 13:20:00 dump 2nd level quota
Apr 25 13:20:00 copy 2nd level quota to target node
Apr 25 13:20:01 initialize container on remote node 'dom'
Apr 25 13:20:01 initializing remote quota
Apr 25 13:20:01 turn on remote quota
Apr 25 13:20:01 load 2nd level quota
Apr 25 13:20:01 starting container on remote node 'dom'
Apr 25 13:20:01 restore container state
Apr 25 13:25:27 # /usr/bin/ssh -c blowfish -o 'BatchMode=yes' root@172.16.0.10 vzctl restore 1111 --undump --dumpfile /var/lib/vz/dump/dump.1111 --skip_arpdetect
Apr 25 13:20:02 Restoring container ...
Apr 25 13:20:02 Starting container ...
Apr 25 13:20:02 Container is mounted
Apr 25 13:20:02 undump...
Apr 25 13:20:02 Adding IP address(es): 172.16.0.128
Apr 25 13:20:02 Setting CPU units: 1000
Apr 25 13:20:02 Setting CPUs: 1
Apr 25 13:25:27 vzquota : (warning) Quota is running for id 1111 already
Apr 25 13:25:27 Write failed: Broken pipe
Apr 25 13:25:27 ERROR: online migrate failure - Failed to restore container: exit code 255
Apr 25 13:25:27 removing container files on local node
Apr 25 13:25:28 start final cleanup
Apr 25 13:25:28 ERROR: migration finished with problems (duration 00:05:59)
TASK ERROR: migration problems
 
OK, I updated all my servers to Proxmox VE 2.1-1/f8b0f63a and there is no change!!!
After that, I upgraded my problematic server to the testing version and ... still no change. The problem persists.
 
I am experiencing the exact same issue, with a very similar traceback and panic stack trace.

"Table" (original PVE 2.0 server) is an i7/16G
"Plate" (just added PVE 2.0 server) is a C2Q/4G

The funny thing is I can live migrate just fine from Plate to Table, but Plate freezes when I do it the other way around.

Both were installed on top of a plain Debian squeeze install using your apt repo. There are very slight differences in versions. Keep in mind that Table (the one exhibiting the correct behaviour) is running the older packages.

Package versions from Table:
Code:
gomo@srv-table:~$ dpkg -l | grep pve
ii  clvm                                2.02.88-2pve2                Cluster LVM Daemon for lvm2
ii  corosync-pve                        1.4.1-1                      Standards-based cluster framework (daemon and modules)
ii  dmsetup                             2:1.02.67-2pve2              Linux Kernel Device Mapper userspace library
ii  fence-agents-pve                    3.1.7-2                      fence agents for redhat cluster suite
ii  libcorosync4-pve                    1.4.1-1                      Standards-based cluster framework (libraries)
ii  libdevmapper1.02.1                  2:1.02.67-2pve2              Linux Kernel Device Mapper userspace library
ii  libopenais3-pve                     1.1.4-2                      Standards-based cluster framework (libraries)
ii  libpve-access-control               1.0-17                       Proxmox VE access control library
ii  libpve-common-perl                  1.0-25                       Proxmox VE base library
ii  libpve-storage-perl                 2.0-17                       Proxmox VE storage management library
ii  lvm2                                2.02.88-2pve2                Linux Logical Volume Manager
ii  openais-pve                         1.1.4-2                      Standards-based cluster framework (daemon and modules)
ii  pve-cluster                         1.0-26                       Cluster Infrastructure for Proxmox Virtual Environment
ii  pve-firmware                        1.0-15                       Binary firmware code for the pve-kernel
ii  pve-kernel-2.6.32-11-pve            2.6.32-65                    The Proxmox PVE Kernel Image
ii  pve-manager                         2.0-57                       The Proxmox Virtual Environment
ii  pve-qemu-kvm                        1.0-9                        Full virtualization on x86 hardware
ii  redhat-cluster-pve                  3.1.8-3                      Red Hat cluster suite
ii  resource-agents-pve                 3.9.2-3                      resource agents for redhat cluster suite
ii  vzctl                               3.0.30-2pve2                 OpenVZ - server virtualization solution - control tools
gomo@srv-table:~$ dpkg -l | grep proxmox
ii  proxmox-ve-2.6.32                   2.0-65                       The Proxmox Virtual Environment


Package versions from Plate:
Code:
root@srv-plate:/home/gomo# dpkg -l | grep pve
ii  clvm                                2.02.95-1pve2                Cluster LVM Daemon for lvm2
ii  corosync-pve                        1.4.3-1                      Standards-based cluster framework (daemon and modules)
ii  dmsetup                             2:1.02.74-1pve2              Linux Kernel Device Mapper userspace library
ii  fence-agents-pve                    3.1.7-2                      fence agents for redhat cluster suite
ii  libcorosync4-pve                    1.4.3-1                      Standards-based cluster framework (libraries)
ii  libdevmapper1.02.1                  2:1.02.74-1pve2              Linux Kernel Device Mapper userspace library
ii  libopenais3-pve                     1.1.4-2                      Standards-based cluster framework (libraries)
ii  libpve-access-control               1.0-21                       Proxmox VE access control library
ii  libpve-common-perl                  1.0-27                       Proxmox VE base library
ii  libpve-storage-perl                 2.0-18                       Proxmox VE storage management library
ii  lvm2                                2.02.95-1pve2                Linux Logical Volume Manager
ii  openais-pve                         1.1.4-2                      Standards-based cluster framework (daemon and modules)
ii  pve-cluster                         1.0-26                       Cluster Infrastructure for Proxmox Virtual Environment
ii  pve-firmware                        1.0-15                       Binary firmware code for the pve-kernel
ii  pve-kernel-2.6.32-11-pve            2.6.32-66                    The Proxmox PVE Kernel Image
ii  pve-manager                         2.1-1                        The Proxmox Virtual Environment
ii  pve-qemu-kvm                        1.0-9                        Full virtualization on x86 hardware
ii  redhat-cluster-pve                  3.1.8-3                      Red Hat cluster suite
ii  resource-agents-pve                 3.9.2-3                      resource agents for redhat cluster suite
ii  vzctl                               3.0.30-2pve5                 OpenVZ - server virtualization solution - control tools
root@srv-plate:/home/gomo# dpkg -l | grep proxmox
ii  proxmox-ve-2.6.32                   2.0-66                       The Proxmox Virtual Environment
 
I tried downgrading the kernel on Plate to the exact version the other server is running, and the problem persists. Would you like me to try downgrading a specific package? I don't really know what could cause a kernel panic outside of the kernel :(
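For the record, the downgrade itself was nothing fancy, roughly this (the version string comes from Table's package list; the .deb file name is from memory, so treat it as an example):

Code:
# pin the older kernel version if apt still offers it...
apt-get install pve-kernel-2.6.32-11-pve=2.6.32-65
# ...or install the older .deb directly, then reboot into it
dpkg -i pve-kernel-2.6.32-11-pve_2.6.32-65_amd64.deb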

I also tried using bridge interfaces instead of venet in case it might be network related, but no luck either.

Thanks!
 
I just upgraded both nodes to the latest packages:
Code:
pve-manager: 2.1-1 (pve-manager/2.1/f9b0f63a)
running kernel: 2.6.32-11-pve
proxmox-ve-2.6.32: 2.0-66
pve-kernel-2.6.32-11-pve: 2.6.32-66
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.3-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.8-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.7-2
pve-cluster: 1.0-26
qemu-server: 2.0-39
pve-firmware: 1.0-15
libpve-common-perl: 1.0-27
libpve-access-control: 1.0-21
libpve-storage-perl: 2.0-18
vncterm: 1.0-2
vzctl: 3.0.30-2pve5
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-9
ksm-control-daemon: 1.1-1

With this identical configuration on both nodes, I can still live migrate CTs from Plate to Table but *not* the other way around. The only asymmetry I'm aware of between the servers is the hardware. I just did a diff of the package lists on each server, and the only difference is that I have nginx installed on one and not on the other.
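In case it helps, the comparison was roughly this (file names and hostnames are just my own):

Code:
# on each node: dump installed package names + versions
dpkg-query -W -f='${Package} ${Version}\n' | sort > /tmp/$(hostname).pkgs
# copy both files to one machine, then compare
diff /tmp/srv-table.pkgs /tmp/srv-plate.pkgs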

The most obvious difference in daily usage is the missing VT-x support (the C2Q CPU in Plate doesn't have it), but that shouldn't be an issue for CTs.

This is the old pveversion output from "Table" (it was working fine then, just as it is with the new packages I posted above).

Code:
gomo@srv-table:~$ pveversion -v
pve-manager: 2.0-57 (pve-manager/2.0/ff6cd700)
running kernel: 2.6.32-11-pve
proxmox-ve-2.6.32: 2.0-65
pve-kernel-2.6.32-11-pve: 2.6.32-65
lvm2: 2.02.88-2pve2
clvm: 2.02.88-2pve2
corosync-pve: 1.4.1-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.8-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.7-2
pve-cluster: 1.0-26
qemu-server: 2.0-37
pve-firmware: 1.0-15
libpve-common-perl: 1.0-25
libpve-access-control: 1.0-17
libpve-storage-perl: 2.0-17
vncterm: 1.0-2
vzctl: 3.0.30-2pve2
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-9
ksm-control-daemon: 1.1-1
 
Is there somewhere else I should post, or is there anything you'd like me to do to help track down the error? A kernel panic seems like a big issue.
 
Still broken with the latest kernel, same issues. Looks like a nasty bug to me.
 
Hi Tom,
I upgraded to the latest version of the testing kernel, and ... the panic continues :D.
I will report it to OpenVZ, but I think this is the correct place.
Thanks a lot!!!
 
Superbit, maybe you can post in the Bugzilla ticket as well so we get some attention on the issue.
 
I updated to pve-kernel-2.6.32-13-pve.
I can do some live migrations without a kernel panic, but they still finish with errors.
Eventually "the kernel panic comes back"; it's starting to feel like a horror movie :)
I'm still using the stable kernel version on the other servers.
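In case anyone wants to try the same kernel, this is roughly how I pulled it in (I'm assuming the squeeze pvetest repository line below is still correct, so double-check it against the wiki):

Code:
# add the Proxmox test repository (assumption: PVE 2.x on Debian squeeze)
echo "deb http://download.proxmox.com/debian squeeze pvetest" > /etc/apt/sources.list.d/pvetest.list
apt-get update
apt-get install pve-kernel-2.6.32-13-pve
# reboot into the new kernel afterwards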

Error log:
Code:
Jun 14 09:57:38 starting migration of CT 1222 to node 'cancerbero' (172.16.0.12)
Jun 14 09:57:38 container is running - using online migration
Jun 14 09:57:38 starting rsync phase 1
Jun 14 09:57:38 # /usr/bin/rsync -aH --delete --numeric-ids --sparse /var/lib/vz/private/1222 root@172.16.0.12:/var/lib/vz/private
Jun 14 09:58:08 start live migration - suspending container
Jun 14 09:58:08 dump container state
Jun 14 09:58:08 copy dump file to target node
Jun 14 09:58:09 starting rsync (2nd pass)
Jun 14 09:58:09 # /usr/bin/rsync -aH --delete --numeric-ids /var/lib/vz/private/1222 root@172.16.0.12:/var/lib/vz/private
Jun 14 09:58:09 dump 2nd level quota
Jun 14 09:58:09 copy 2nd level quota to target node
Jun 14 09:58:10 initialize container on remote node 'cancerbero'
Jun 14 09:58:10 initializing remote quota
Jun 14 09:58:11 turn on remote quota
Jun 14 09:58:11 load 2nd level quota
Jun 14 09:58:11 starting container on remote node 'cancerbero'
Jun 14 09:58:11 restore container state
Jun 14 09:58:11 # /usr/bin/ssh -c blowfish -o 'BatchMode=yes' root@172.16.0.12 vzctl restore 1222 --undump --dumpfile /var/lib/vz/dump/dump.1222 --skip_arpdetect
Jun 14 09:58:11 Restoring container ...
Jun 14 09:58:11 Starting container ...
Jun 14 09:58:11 Container is mounted
Jun 14 09:58:11 	undump...
Jun 14 09:58:11 vzquota : (warning) Quota is running for id 1222 already
Jun 14 09:58:11 Error: undump failed: Invalid argument
Jun 14 09:58:11 Container start failed (try to check kernel messages, e.g. "dmesg | tail")
Jun 14 09:58:11 Restoring failed:
Jun 14 09:58:11 Container is unmounted
Jun 14 09:58:11 ERROR: online migrate failure - Failed to restore container: Error: Unknown image version: 801. Can't restore.
Jun 14 09:58:11 removing container files on local node
Jun 14 09:58:12 start final cleanup
Jun 14 09:58:12 ERROR: migration finished with problems (duration 00:00:35)
TASK ERROR: migration problems
 
The bug report says the fix is in 042stab058_9, which is not yet available.
 
