Migrations and HA in 2.0

VulcanRidr

May 21, 2010
Apologies for the cross-post...

I'm running into issues with migration. I tried to migrate a container, and it failed. I have two machines in my cluster:

Code:
[root@hornet pve]# pvecm nodes
Node  Sts   Inc   Joined               Name
   1   M      8   2012-04-11 00:13:21  akagi
   2   M      4   2012-04-11 00:13:20  hornet

I upgraded akagi first, as it was my backup cluster node under 1.9. Then I shut down and did a backup of each container on hornet (1.9), copied the backup tarball to akagi (2.0), and restored. Each container came up. First, I tried a live migration from akagi -> hornet, which failed:

Code:
2012-04-11T15:19:00-0400 vzctl : CT 108 : Setting up checkpoint...
2012-04-11T15:19:00-0400 vzctl : CT 108 :     suspend...
2012-04-11T15:19:00-0400 vzctl : CT 108 : Can not suspend container: Device or resource busy

So I shut down the vm and attempted to migrate it to hornet. It completed, with errors ("Aborted with migration problems"). When I tried to start it up on hornet, it would not start, because it could not find the container private area. I couldn't migrate it back via the web interface, so I moved the 108.conf from /etc/pve/nodes/hornet to /etc/pve/nodes/akagi. What is the problem with migration? Is there somewhere that explains the issue deeper within the logs?
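
The "could not find the container private area" symptom can be confirmed by hand on each node. A minimal sketch, assuming the private areas live under /var/lib/vz/private (the path that appears in the rsync command later in this thread); `has_private_area` and the `VZ_PRIVATE_ROOT` override are hypothetical names used only for illustration:

```shell
# Hypothetical helper: report whether a container's private area exists on this node.
# VZ_PRIVATE_ROOT is an illustrative override; /var/lib/vz/private is the assumed default.
has_private_area() {
    ctid=$1
    root=${VZ_PRIVATE_ROOT:-/var/lib/vz/private}
    if [ -d "$root/$ctid" ]; then
        echo "CT $ctid: private area present at $root/$ctid"
    else
        echo "CT $ctid: private area missing under $root"
        return 1
    fi
}
```

Running `has_private_area 108` on both akagi and hornet should show which node actually holds the container's data.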

The other part of this post is setting up shared storage (and eventually HA for containers). I don't have a SAN; can I share the local storage between the nodes? I set the storage on akagi to shared, but it doesn't seem to be working. Ideally, I'd like both to be shared so I can use the entire space allocated among them. Any ideas on why I can't migrate or how to set up shared storage?

Thanks,
--vr
 
Hi,
there are two things:
a) Why doesn't suspend work?
b) It looks like you have already defined the storage on akagi as shared. In that case only the config is transferred, because the data should already be present on the other nodes (it is shared, after all).
But your storage isn't shared; it is fully independent storage. If you don't mark the storage as shared, pve must rsync the data for migration.
Shared storage for OpenVZ can be an NFS share mounted on all nodes.
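
To make the difference concrete, the decision described above can be sketched roughly as follows. This only prints what would have to be moved and is not how pve implements migration; `plan_migration` and its arguments are hypothetical, and the rsync options are simply the ones visible in the migration log elsewhere in this thread.

```shell
# Hypothetical sketch: print what a migration would have to move,
# depending on whether the storage is flagged as shared.
plan_migration() {
    ctid=$1      # container ID, e.g. 108
    target=$2    # address of the target node
    shared=$3    # 1 if the storage is marked shared, 0 otherwise
    if [ "$shared" = "1" ]; then
        # Shared flag set: the data is assumed to be visible on every node already.
        echo "CT $ctid: move config only"
    else
        # Local storage: the private area has to be copied to the target.
        echo "CT $ctid: rsync -aH --delete --numeric-ids --sparse /var/lib/vz/private/$ctid root@$target:/var/lib/vz/private"
    fi
}
```

With the shared flag set on storage that is really local, the first branch applies and nothing is copied, which would explain a missing private area on the target node.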

About a): Is it a normal pve installation? Do you have any bind mounts inside the container? Or services that shouldn't run inside a container, like ntp?

Udo
 
Hi, there are two things:
a) Why doesn't suspend work?

I'm not sure why; I haven't seen a reason in the logs. It just said "Device or resource busy".

b) It looks like you have already defined the storage on akagi as shared. In that case only the config is transferred, because the data should already be present on the other nodes (it is shared, after all). But your storage isn't shared; it is fully independent storage. If you don't mark the storage as shared, pve must rsync the data for migration. Shared storage for OpenVZ can be an NFS share mounted on all nodes.

Okay, so I have turned off the shared option under the Datacenter tab. Unfortunately, even turning off shared did not work and it failed to checkpoint and suspend:

Code:
Apr 13 11:23:48 starting migration of CT 108 to node 'hornet' (192.168.224.13) 
Apr 13 11:23:48 container is running - using online migration 
Apr 13 11:23:48 starting rsync phase 1 
Apr 13 11:23:48 # /usr/bin/rsync -aH --delete --numeric-ids --sparse /var/lib/vz/private/108 root@192.168.224.13:/var/lib/vz/private 
Apr 13 11:28:00 file has vanished: "/var/lib/vz/private/108/var/lib/nagios3/spool/checkresults/cOg2kbR" 
Apr 13 11:28:02 file has vanished: "/var/lib/vz/private/108/var/lib/nagios3/spool/checkresults/cOg2kbR.ok" 
Apr 13 11:28:02 rsync warning: some files vanished before they could be transferred (code 24) at main.c(1060) [sender=3.0.7] 
Apr 13 11:28:02 start live migration - suspending container 
Apr 13 11:28:02 # vzctl --skiplock chkpnt 108 --suspend
Apr 13 11:28:02 Setting up checkpoint...
Apr 13 11:28:02 suspend...
Apr 13 11:28:02 Can not suspend container: Device or resource busy
Apr 13 11:28:02 Error: task 374869/1198(ntpd) uses posix timers 
Apr 13 11:28:02 ERROR: Failed to suspend container: Checkpointing failed 
Apr 13 11:28:02 aborting phase 1 - cleanup resources 
Apr 13 11:28:02 removing copied files on target node 
Apr 13 11:28:04 start final cleanup 
Apr 13 11:28:04 ERROR: migration aborted (duration 00:04:17): Failed to suspend container: Checkpointing failed
TASK ERROR: migration aborted

As for the status of the container, it is still up and running on akagi. I also attempted a non-live migration of container 107, just to see whether that would succeed, and it worked. So I don't know what the problem is with checkpointing...
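
For reference, the task log above does contain the line `Error: task 374869/1198(ntpd) uses posix timers`, which names a process inside the container. A minimal sketch, assuming the task output has been saved to a plain text file, that pulls such process names out of a log; `blocking_tasks` is a hypothetical helper, and the pattern is matched only against the message format seen in this one log:

```shell
# Hypothetical helper: read a migration task log on stdin and print the
# names of processes reported as using POSIX timers (one per line, deduplicated).
blocking_tasks() {
    sed -n 's|.*Error: task [0-9]*/[0-9]*(\([^)]*\)) uses posix timers.*|\1|p' | sort -u
}
```

For example, `blocking_tasks < migrate-108.log` (the file name is made up) would list the processes to stop inside the container before retrying a live migration.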

About a): Is it a normal pve installation? Do you have any bind mounts inside the container? Or services that shouldn't run inside a container, like ntp? Udo

I did not realize that ntp wasn't supposed to run inside a container. I turned off ntp on another container and tried, and it still didn't work... Now how do I need to set up these systems to be able to take advantage of the HA capabilities of proxmox-ve 2.0?

Thanks,
--vr
 