container live migration failure w/mysqld

sweet-t

New Member
Oct 22, 2012
I have searched these forums and found only multiple references to OpenVZ bug #2242 from others with this problem.
https://bugzilla.openvz.org/show_bug.cgi?id=2242
However, now that this bug is marked RESOLVED FIXED, I see no indication anywhere that the fix is available in Proxmox.


Can anyone verify this before I spend more time researching other possible problems and resolutions?
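(One rough way to check a node directly, assuming the fix would be named in the packaged kernel changelog; the search terms here are guesses, not confirmed wording:)

Code:
# search the running pve kernel's packaged changelog for the bug or fix
zgrep -i -e "2242" -e "nfs" /usr/share/doc/pve-kernel-$(uname -r)/changelog.Debian.gz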




Code:
# pveversion -v
pve-manager: 3.0-23 (pve-manager/3.0/957f0862)
running kernel: 2.6.32-22-pve
proxmox-ve-2.6.32: 3.0-107
pve-kernel-2.6.32-20-pve: 2.6.32-100
pve-kernel-2.6.32-22-pve: 2.6.32-107
lvm2: 2.02.95-pve3
clvm: 2.02.95-pve3
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.0-1
pve-cluster: 3.0-4
qemu-server: 3.0-20
pve-firmware: 1.0-23
libpve-common-perl: 3.0-4
libpve-access-control: 3.0-4
libpve-storage-perl: 3.0-8
vncterm: 1.1-4
vzctl: 4.0-1pve3
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.4-13
ksm-control-daemon: 1.1-1


The log from my CentOS 6 container live migration (with mysqld running):
Code:
Oct 25 00:40:44 starting migration of CT 100 to node 'proxprd1' (10.10.0.10)
Oct 25 00:40:44 container is running - using online migration
Oct 25 00:40:44 container data is on shared storage 'proxCT_0'
Oct 25 00:40:44 start live migration - suspending container
Oct 25 00:40:44 dump container state
Oct 25 00:40:45 dump 2nd level quota
Oct 25 00:40:46 initialize container on remote node 'proxprd1'
Oct 25 00:40:46 initializing remote quota
Oct 25 00:41:23 turn on remote quota
Oct 25 00:41:23 load 2nd level quota
Oct 25 00:41:23 starting container on remote node 'proxprd1'
Oct 25 00:41:23 restore container state
Oct 25 00:41:28 # /usr/bin/ssh -o 'BatchMode=yes' root@10.10.0.10 vzctl restore 100 --undump --dumpfile /mnt/pve/proxCT_0/dump/dump.100 --skip_arpdetect
Oct 25 00:41:24 Restoring container ...
Oct 25 00:41:24 Starting container ...
Oct 25 00:41:24 Container is mounted
Oct 25 00:41:24     undump...
Oct 25 00:41:24 Setting CPU units: 1000
Oct 25 00:41:24 Setting CPUs: 1
Oct 25 00:41:24 Configure veth devices: veth100.0 
Oct 25 00:41:24 Adding interface veth100.0 to bridge vmbr0 on CT0 for CT100
Oct 25 00:41:28 vzquota : (warning) Quota is running for id 100 already
Oct 25 00:41:28 Error: undump failed: No such file or directory
Oct 25 00:41:28 Restoring failed:
Oct 25 00:41:28 Error: rst_open_file: failed to lookup path '/tmp/.nfs00000000000a1ccd00000006': -2
Oct 25 00:41:28 Error: can't open file /tmp/.nfs00000000000a1ccd00000006
Oct 25 00:41:28 Error: rst_file: -2 117192
Oct 25 00:41:28 Error: rst_files: -2
Oct 25 00:41:28 Error: make_baby: -2
Oct 25 00:41:28 Error: rst_clone_children
Oct 25 00:41:28 Error: make_baby: -2
Oct 25 00:41:28 Error: rst_clone_children
Oct 25 00:41:28 Error: make_baby: -2
Oct 25 00:41:28 Error: rst_clone_children
Oct 25 00:41:28 Container is unmounted
Oct 25 00:41:28 ERROR: online migrate failure - Failed to restore container: Container start failed
Oct 25 00:41:28 start final cleanup
Oct 25 00:41:28 ERROR: migration finished with problems (duration 00:00:44)
TASK ERROR: migration problems
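(A note for anyone diagnosing the same error: the .nfsXXXX paths are NFS "silly rename" files, created when a process deletes a file it still holds open over NFS. Assuming lsof is installed inside the container, a quick way to spot the processes holding such files before migrating:)

Code:
# on the source node, enter the container and list open-but-deleted files
vzctl enter 100
lsof -nP 2>/dev/null | grep '(deleted)'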
 
Raymond, I hadn't anticipated any failures with offline migrations while the mysql service is stopped. I will test and report back here whether your issue is related or not.

Update: No issues with offline migrations. Raymond, it sounds like your issues are unrelated. If you wish, you can PM me your details and woes, and I can see whether I've encountered similar issues.
 
I've copied the raw unified text of the patch to a file but am unable to figure out how to apply it.
It appears to modify a/kernel/cpt/cpt_dump.c and other files I can't locate. Am I looking at this wrong? Can anyone provide patching instructions?

From your answer I'm assuming this fix is not yet in the Proxmox kernel. Once I know how to apply it, will I also have to reapply the fix each time my hypervisors are updated?
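For reference, the a/ and b/ prefixes in the patch are unified-diff conventions, stripped by patch -p1; the paths are relative to the kernel source tree, which is why the files don't exist on the hypervisor itself. A rough sketch of how one might rebuild the pve kernel with the patch (the repository name and build steps are assumptions based on github.com/proxmox/pve-kernel-2.6.32, not a verified procedure):

Code:
# fetch the proxmox kernel build tree
git clone https://github.com/proxmox/pve-kernel-2.6.32.git
cd pve-kernel-2.6.32

# add the saved diff next to the existing *.patch files and hook it into
# the build the same way they are applied (exact hook is an assumption;
# check how the existing patches are referenced in the Makefile)
cp /path/to/openvz-nfs-migration-fix.patch .

# sanity check, run from inside the unpacked kernel source directory:
# -p1 strips the leading a/ and b/ components so the paths resolve
#   patch -p1 --dry-run < openvz-nfs-migration-fix.patch

# build the kernel .deb and install it on every node, then reboot
make
dpkg -i pve-kernel-2.6.32-*.deb

And on the second question: a hand-applied patch lives only in that locally built package, so it would indeed need to be reapplied after every kernel update until the fix lands upstream.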
 
Sure, I'm about to spin up a container and live-migrate it before and after mysql is installed and running. I'll report back here.
 
I've just finished creating a new container and testing live migrations.


I brought up eth0 and live-migrated the container to proxmox2 successfully, then live-migrated it back to proxmox1 successfully.

Code:
Oct 27 03:38:19 starting migration of CT 107 to node 'proxmox2' (10.10.0.11)
Oct 27 03:38:19 container is running - using online migration
Oct 27 03:38:19 container data is on shared storage 'proxCT_0'
Oct 27 03:38:19 start live migration - suspending container
Oct 27 03:38:19 dump container state
Oct 27 03:38:20 dump 2nd level quota
Oct 27 03:38:21 initialize container on remote node 'proxmox2'
Oct 27 03:38:21 initializing remote quota
Oct 27 03:38:36 turn on remote quota
Oct 27 03:38:36 load 2nd level quota
Oct 27 03:38:36 starting container on remote node 'proxmox2'
Oct 27 03:38:36 restore container state
Oct 27 03:38:41 start final cleanup
Oct 27 03:38:41 migration finished successfuly (duration 00:00:22)
TASK OK

Oct 27 03:39:02 starting migration of CT 107 to node 'proxmox1' (10.10.0.10)
Oct 27 03:39:02 container is running - using online migration
Oct 27 03:39:02 container data is on shared storage 'proxCT_0'
Oct 27 03:39:02 start live migration - suspending container
Oct 27 03:39:02 dump container state
Oct 27 03:39:02 dump 2nd level quota
Oct 27 03:39:03 initialize container on remote node 'proxmox1'
Oct 27 03:39:03 initializing remote quota
Oct 27 03:39:08 turn on remote quota
Oct 27 03:39:08 load 2nd level quota
Oct 27 03:39:08 starting container on remote node 'proxmox1'
Oct 27 03:39:08 restore container state
Oct 27 03:39:10 start final cleanup
Oct 27 03:39:10 migration finished successfuly (duration 00:00:08)
TASK OK

I then installed mysql-server and started the mysqld service. No other commands were run and nothing else was installed.

I am no longer able to live-migrate.

Code:
Oct 27 03:42:06 starting migration of CT 107 to node 'proxmox2' (10.10.0.11)
Oct 27 03:42:06 container is running - using online migration
Oct 27 03:42:07 container data is on shared storage 'proxCT_0'
Oct 27 03:42:07 start live migration - suspending container
Oct 27 03:42:07 dump container state
Oct 27 03:42:07 dump 2nd level quota
Oct 27 03:42:08 initialize container on remote node 'proxmox2'
Oct 27 03:42:08 initializing remote quota
Oct 27 03:42:30 turn on remote quota
Oct 27 03:42:30 load 2nd level quota
Oct 27 03:42:30 starting container on remote node 'proxmox2'
Oct 27 03:42:30 restore container state
Oct 27 03:42:32 # /usr/bin/ssh -o 'BatchMode=yes' root@10.10.0.11 vzctl restore 107 --undump --dumpfile /mnt/pve/proxCT_0/dump/dump.107 --skip_arpdetect
Oct 27 03:42:30 Restoring container ...
Oct 27 03:42:30 Starting container ...
Oct 27 03:42:30 Container is mounted
Oct 27 03:42:30 	undump...
Oct 27 03:42:30 Setting CPU units: 1000
Oct 27 03:42:30 Setting CPUs: 1
Oct 27 03:42:30 Configure veth devices: veth107.0
Oct 27 03:42:30 Adding interface veth107.0 to bridge vmbr0 on CT0 for CT107
Oct 27 03:42:32 vzquota : (warning) Quota is running for id 107 already
Oct 27 03:42:32 Error: undump failed: No such file or directory
Oct 27 03:42:32 Restoring failed:
Oct 27 03:42:32 Error: rst_open_file: failed to lookup path '/usr/lib64/mysql/.nfs000000000010a18400000064': -2
Oct 27 03:42:32 Error: can't open file /usr/lib64/mysql/.nfs000000000010a18400000064
Oct 27 03:42:32 Error: do_rst_vma: rst_file: 93944
Oct 27 03:42:32 Error: do_rst_mm: failed to restore vma: -2
Oct 27 03:42:32 Error: do_rst_mm 1937752
Oct 27 03:42:32 Error: rst_mm: -2
Oct 27 03:42:32 Error: make_baby: -2
Oct 27 03:42:32 Error: rst_clone_children
Oct 27 03:42:32 Container start failed
Oct 27 03:42:32 ERROR: online migrate failure - Failed to restore container: Can't umount /var/lib/vz/root/107: Device or resource busy
Oct 27 03:42:32 start final cleanup
Oct 27 03:42:32 ERROR: migration finished with problems (duration 00:00:26)
TASK ERROR: migration problems

running kernel: 2.6.32-22-pve on both proxmox nodes

What steps should I follow to apply the patch provided by Konstantin Khorenko in OpenVZ bug 2242?
 
Is kernel version 2.6.32-25-pve only available in the enterprise repo? I've updated/upgraded, and the latest available to me is 2.6.32-23-pve.
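(A quick way to see which kernel packages your configured repositories actually offer; these are standard apt commands, and the specific package name is taken from the question above:)

Code:
apt-get update
apt-cache search pve-kernel-2.6.32
# 'policy' shows the candidate version and which repository provides it
apt-cache policy pve-kernel-2.6.32-25-pve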
 
With the latest kernel, I see the same problem.

Code:
root@v4:~# pveversion -v
proxmox-ve-2.6.32: 3.1-114 (running kernel: 2.6.32-26-pve)
pve-manager: 3.1-21 (running version: 3.1-21/93bf03d4)
pve-kernel-2.6.32-20-pve: 2.6.32-100
pve-kernel-2.6.32-22-pve: 2.6.32-107
pve-kernel-2.6.32-26-pve: 2.6.32-114
pve-kernel-2.6.32-23-pve: 2.6.32-109
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.0-2
pve-cluster: 3.0-8
qemu-server: 3.1-8
pve-firmware: 1.0-23
libpve-common-perl: 3.0-8
libpve-access-control: 3.0-7
libpve-storage-perl: 3.0-17
pve-libspice-server1: 0.12.4-2
vncterm: 1.1-4
vzctl: 4.0-1pve4
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.4-17
ksm-control-daemon: 1.1-1
glusterfs-client: 3.4.1-1
 
Can someone provide instructions on how to apply the OpenVZ patch in Proxmox? I'd be happy to try it.

As of January 2013 I see the patch was removed. Is it in the latest/current pve kernel, or in any updated kernels since January?

The naming conventions and current kernel versions cause some confusion: 2.6.32-85, vzkernel-2.6.32-042stab072, 2.6.32-25-pve...

Which Proxmox kernel contains this patch from upstream?

https://github.com/proxmox/pve-kernel-2.6.32/blob/master/changelog.Debian

Code:
pve-kernel-2.6.32 (2.6.32-85) unstable; urgency=low


  * update to vzkernel-2.6.32-042stab072.3.src.rpm


  * remove openvz-nfs-migration-fix.patch (now upstream)
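Given that entry, a rough way to verify whether a given node already runs a package built from 2.6.32-85 or later (a sketch; it assumes the installed kernel package ships this same changelog):

Code:
# check the running kernel and look for the 2.6.32-85 entry in the
# installed package's changelog
uname -r
zgrep -m1 "2.6.32-85" /usr/share/doc/pve-kernel-$(uname -r)/changelog.Debian.gz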
 
Where are we at with this? There have been no replies in a couple of months, and this is a major blocking issue.

Put simply: "cannot live-migrate a typical LAMP server", which is 80% of my installs (and probably of the world's). And it doesn't matter whether mysql is actually writing at the time or has written recently; it merely has to be running to cause the failure. (Yes, I tested it.)

This is a serious bug, currently fixed in OpenVZ but not in Proxmox?

I will try plain OpenVZ (with and without ploop) and see how it goes. I'd like to deploy Proxmox everywhere, but until this is fixed there's no benefit in my environment.
 
Some technical elements brought over from http://forum.proxmox.com/threads/11983-OpenVZ-Ploop-support

The bug is caused by mysql keeping file handles open on unlinked files: https://bugzilla.openvz.org/show_bug.cgi?id=2242

A patch is out for that.

However, I suspect this will not work with rsync migrations and local storage, because the unlinked file won't be copied by rsync in the first place, of course.

A virgin mysql right after startup (well, the 3rd time restarting it; I haven't run a single insert or query on it yet):

lrwx------ 1 root root 64 Feb 9 19:02 11 -> (deleted)/tmp/ibyhkGBP
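(For illustration, a minimal shell sketch of the same pattern mysqld is using here: open a temporary file, unlink it, and keep the descriptor. On NFS the unlink becomes a .nfsXXXX silly-rename file, which is exactly what the restore step later fails to look up.)

Code:
# open fd 3 on a scratch file, then unlink it while keeping the fd open
exec 3<> /tmp/demo.$$
rm /tmp/demo.$$
# the descriptor still works, but the path is gone:
ls -l /proc/$$/fd/3     # ... 3 -> /tmp/demo.NNNN (deleted)
exec 3>&-               # closing the fd releases the file for good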

ploop would solve this (as per the other thread) as a workaround, since it maintains exactly the same data content at the filesystem level, down to the content of unlinked files; a rough sketch of the conversion follows.
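On a stock OpenVZ install (Proxmox's vzctl build had no ploop support at the time), the conversion would, as I understand it, look roughly like this; vzctl convert is the stock tool for moving a simfs private area into a ploop image, and the container must be stopped first (a sketch, not tested here):

Code:
vzctl stop 100
vzctl convert 100      # rewrites the simfs private area as a ploop image
vzctl start 100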

Either way, using either of these solutions voids one's support contract.
 
