LXC cluster with HA

shafeeks

Renowned Member
Mar 8, 2013
Hi

Recently we installed a new cluster on 3 nodes with high availability activated, running pve-manager/4.1-1/2f9650d4 (kernel: 4.2.6-1-pve). The distributed storage is Ceph RBD. We tested the cluster with both VMs and LXC CTs.
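We added the services to HA with the standard commands, roughly like this (a sketch, exact options omitted; ct:103 matches the log below, vm:100 is just an example ID):

Code:
ha-manager add vm:100
ha-manager add ct:103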

When testing the HA, we brought down 1 node and all the VMs and LXC CTs were migrated evenly to the other nodes. The LXC CTs were created new, not migrated from OpenVZ.

When the node that was down comes back up, all the VMs are migrated back to their native node successfully, but the LXC CTs have problems migrating. The log keeps repeating the following:

Jan 11 14:14:48 node1 pve-ha-crm[1619]: service 'ct:103' - migration failed (exit code 255)
Jan 11 14:14:48 node1 pve-ha-crm[1619]: service 'ct:103': state changed from 'migrate' to 'started' (node = node1)
Jan 11 14:14:48 node1 pve-ha-crm[1619]: migrate service 'ct:103' to node 'node2' (running)
Jan 11 14:14:48 node1 pve-ha-crm[1619]: service 'ct:103': state changed from 'started' to 'migrate' (node = node1, target = node2)
Jan 11 14:14:48 node1 pve-ha-lrm[19513]: Can't locate object method "migrate_vm" via package "PVE::API2::LXC" at /usr/share/perl5/PVE/HA/Resources.pm line 274.
Jan 11 14:14:58 node1 pve-ha-crm[1619]: service 'ct:103' - migration failed (exit code 255)
Jan 11 14:14:58 node1 pve-ha-crm[1619]: service 'ct:103': state changed from 'migrate' to 'started' (node = node1)
Jan 11 14:14:58 node1 pve-ha-crm[1619]: migrate service 'ct:103' to node 'node2' (running)
Jan 11 14:14:58 node1 pve-ha-crm[1619]: service 'ct:103': state changed from 'started' to 'migrate' (node = node1, target = node2)
Jan 11 14:14:58 node1 pve-ha-lrm[19556]: Can't locate object method "migrate_vm" via package "PVE::API2::LXC" at /usr/share/perl5/PVE/HA/Resources.pm line 274.

Thanks for your help

Best regards

Shafeek
 
Hi, are you running the same versions on all nodes?

Can you post your output from
Code:
pveversion -v

I had a similar problem with LXC too (a Perl error) and solved it with an upgrade.
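A quick way to compare the versions across the nodes is something like this (just a sketch, assuming root SSH access between the nodes):

Code:
for n in node1 node2 node3; do
    ssh root@$n pveversion -v > /tmp/pveversion-$n.txt
done
diff /tmp/pveversion-node1.txt /tmp/pveversion-node2.txt
diff /tmp/pveversion-node1.txt /tmp/pveversion-node3.txt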
 
Hi,

Here is the pveversion output from the 3 nodes. They are the same:
root@node1:/etc# pveversion -v
proxmox-ve: 4.1-26 (running kernel: 4.2.6-1-pve)
pve-manager: 4.1-1 (running version: 4.1-1/2f9650d4)
pve-kernel-4.2.6-1-pve: 4.2.6-26
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 0.17.2-1
pve-cluster: 4.0-29
qemu-server: 4.0-41
pve-firmware: 1.1-7
libpve-common-perl: 4.0-41
libpve-access-control: 4.0-10
libpve-storage-perl: 4.0-38
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-17
pve-container: 1.0-32
pve-firewall: 2.0-14
pve-ha-manager: 1.0-14
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-5
lxcfs: 0.13-pve1
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve6~jessie

root@node2:~# pveversion -v
proxmox-ve: 4.1-26 (running kernel: 4.2.6-1-pve)
pve-manager: 4.1-1 (running version: 4.1-1/2f9650d4)
pve-kernel-4.2.6-1-pve: 4.2.6-26
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 0.17.2-1
pve-cluster: 4.0-29
qemu-server: 4.0-41
pve-firmware: 1.1-7
libpve-common-perl: 4.0-41
libpve-access-control: 4.0-10
libpve-storage-perl: 4.0-38
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-17
pve-container: 1.0-32
pve-firewall: 2.0-14
pve-ha-manager: 1.0-14
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-5
lxcfs: 0.13-pve1
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve6~jessie

root@node3:~# pveversion -v
proxmox-ve: 4.1-26 (running kernel: 4.2.6-1-pve)
pve-manager: 4.1-1 (running version: 4.1-1/2f9650d4)
pve-kernel-4.2.6-1-pve: 4.2.6-26
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 0.17.2-1
pve-cluster: 4.0-29
qemu-server: 4.0-41
pve-firmware: 1.1-7
libpve-common-perl: 4.0-41
libpve-access-control: 4.0-10
libpve-storage-perl: 4.0-38
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-17
pve-container: 1.0-32
pve-firewall: 2.0-14
pve-ha-manager: 1.0-14
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-5
lxcfs: 0.13-pve1
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve6~jessie

Thanks

Shafeek

Hi Dietmar,

I just installed the patch on the 3 nodes and re-ran the procedure to test the HA.

I still have the problem when the CT migrates back to its native node, as per below (the VMs work perfectly):

Jan 11 16:38:25 node1 pve-ha-lrm[4919]: <root@pam> end task UPID:node1:00001338:000117E1:5693A241:vzmigrate:103:root@pam: migration aborted
Jan 11 16:38:25 node1 pve-ha-lrm[4919]: service ct:103 not moved (migration error)
Jan 11 16:38:27 node1 pve-ha-crm[1620]: service 'ct:103' - migration failed (exit code 1)
Jan 11 16:38:27 node1 pve-ha-crm[1620]: service 'ct:103': state changed from 'migrate' to 'started' (node = node1)
Jan 11 16:38:27 node1 pve-ha-crm[1620]: migrate service 'ct:103' to node 'node2' (running)
Jan 11 16:38:27 node1 pve-ha-crm[1620]: service 'ct:103': state changed from 'started' to 'migrate' (node = node1, target = node2)
Jan 11 16:38:35 node1 pve-ha-lrm[4951]: <root@pam> starting task UPID:node1:00001358:00011BCE:5693A24B:vzmigrate:103:root@pam:
Jan 11 16:38:35 node1 pve-ha-lrm[4952]: migration aborted
Jan 11 16:38:35 node1 pve-ha-lrm[4951]: <root@pam> end task UPID:node1:00001358:00011BCE:5693A24B:vzmigrate:103:root@pam: migration aborted
Jan 11 16:38:35 node1 pve-ha-lrm[4951]: service ct:103 not moved (migration error)
Jan 11 16:38:37 node1 pve-ha-crm[1620]: service 'ct:103' - migration failed (exit code 1)
Jan 11 16:38:37 node1 pve-ha-crm[1620]: service 'ct:103': state changed from 'migrate' to 'started' (node = node1)
Jan 11 16:38:37 node1 pve-ha-crm[1620]: migrate service 'ct:103' to node 'node2' (running)
Jan 11 16:38:37 node1 pve-ha-crm[1620]: service 'ct:103': state changed from 'started' to 'migrate' (node = node1, target = node2)
Jan 11 16:38:45 node1 pve-ha-lrm[4987]: <root@pam> starting task UPID:node1:0000137C:00011FBC:5693A255:vzmigrate:103:root@pam:
Jan 11 16:38:45 node1 pve-ha-lrm[4988]: migration aborted

I know that live migration of containers is yet to be implemented, as per the PVE roadmap.

But I think that when the node goes down and the CT ends up running on another node, it cannot return to its native node because the HA manager tries to migrate it live. In my view it should be stopped and then migrated back to its native node. Since it is in an HA cluster, the live migration is attempted automatically.
Could you please confirm this?
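For now, when I need to move a CT back by hand, something like this should work, assuming this ha-manager build already has the relocate command (which stops the service, moves it while stopped and starts it on the target):

Code:
# stop the CT, move it while stopped, then start it on the target node
ha-manager relocate ct:103 node1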

Thanks

Shafeek



I assume you use the latest packages? If so, I assume this bug is fixed here:

https://git.proxmox.com/?p=pve-ha-m...ff;h=cb41f0d36b7fe1e34aa072b3c8da33d4e093bda9

I uploaded a new package to the pvetest repository, so you can test this using:

Code:
# wget ftp://download1.proxmox.com/debian/dists/jessie/pvetest/binary-amd64/pve-ha-manager_1.0-17_amd64.deb
# dpkg -i pve-ha-manager_1.0-17_amd64.deb
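If the package does not restart them itself, restarting the HA services afterwards makes sure the LRM/CRM run the new code (should be harmless if they are already current):

Code:
# systemctl restart pve-ha-lrm pve-ha-crm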
 
Hi,

I have tested it further this morning.

When one node is down, the VMs and CTs are migrated online from the down node to the other 2 nodes.
Then I stopped only the migrated CTs and left the VMs running. Then I plugged the network cable back in to bring the node up again, and the quorum was re-established.
I noticed that only the VMs were migrated online back to their native node, not the CTs, since they were stopped. That's ok. Then I just switched all the CTs (which were on another node and not on their native one) back to Start mode.
I noticed that on start, each CT is migrated to its native node first and then started.

So I wanted to confirm whether live migration of CTs (which is not yet supported for LXC in PVE 4.1) under HA has some issues. If yes, is there a patch for it at the moment, or is there a workaround?
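One workaround I am considering is to disable the automatic fail-back on the HA group, if this pve-ha-manager version already supports the nofailback option (the group and node names below are just examples):

Code:
# /etc/pve/ha/groups.cfg
group: prefer_node1
        nodes node1
        nofailback 1

# pin the CT to that group
ha-manager set ct:103 --group prefer_node1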
Thanks

Shafeek
 
Hi Dietmar,
Thanks for your answer.
Kindly note that I have filed a bug in Bugzilla.

Shafeek