[SOLVED] HA with LXC CT not working

brichter-zt

New Member
Mar 17, 2016
Hi,

we have set up a Proxmox Cluster with Ceph (3 nodes), and all tests so far are looking good!

But we have a problem with HA for LXC CTs.

Works:
  • Migration of a stopped LXC CT
  • (Online) migration of KVM VMs - I haven't actually tested offline migration ;)
  • Watchdog - the cluster detects a missing node
  • HA migration of a running KVM VM
Problems start after adding CTs to HA:
1. HA migration does not work. When I tell Proxmox to migrate a HA CT to another node, the CT is shut down and started again, but not migrated.
2. Failover does not work. If I cut the network connection of one node, the CT is not migrated automatically.

I have created a HA group with all three nodes, and all HA-enabled CTs and VMs are assigned to this group.
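For reference, this is roughly how the group and resources were set up (group name and IDs are just examples):
Code:
# HA group spanning all three nodes
ha-manager groupadd ha-all --nodes "node1,node2,node3"

# add a container and a VM as HA resources in that group
ha-manager add ct:100 --group ha-all
ha-manager add vm:101 --group ha-all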

Is this a bug? How can I further debug this?
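Is there anything beyond the obvious - e.g. something along these lines - that I should look at?
Code:
# overall HA view: master, quorum, service states
ha-manager status

# HA-related messages in the syslog
grep -E 'pve-ha-(crm|lrm)' /var/log/syslog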

pveversion -v
proxmox-ve: 4.1-39 (running kernel: 4.2.8-1-pve)
pve-manager: 4.1-15 (running version: 4.1-15/8cd55b52)
pve-kernel-4.2.8-1-pve: 4.2.8-39
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-33
qemu-server: 4.0-62
pve-firmware: 1.1-7
libpve-common-perl: 4.0-49
libpve-access-control: 4.0-11
libpve-storage-perl: 4.0-42
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-9
pve-container: 1.0-46
pve-firewall: 2.0-18
pve-ha-manager: 1.0-24
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve1
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve7~jessie
openvswitch-switch: 2.3.2-2

I hope you can help me out with this.

Benjamin
 
1. HA migration does not work. When I tell Proxmox to migrate a HA CT to another node, the CT is shut down and started again, but not migrated.

That is how it works - there is no live migration with LXC.
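For completeness, both operations are also available on the CLI; a rough sketch (service ID and target node are just examples):
Code:
# ask the HA manager to migrate a service (an offline operation for containers)
ha-manager migrate ct:100 node2

# relocate = stop the service, move it, start it again on the target node
ha-manager relocate ct:100 node2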

2. Failover does not work. If I cut the network connection of one node, the CT is not migrated automatically.

What kind of storage do you use for the container?
 
That is how it works - there is no live migration with LXC.
That makes no sense to me. I was expecting it to shut down the CT, migrate it (offline) to another node, and start it again. But it only did a restart, no migration. And this is only a problem with HA; normal (offline) migration works.

What kind of storage do you use for the container?
Storage is always Ceph.
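In case it matters, the container storage is an RBD entry in /etc/pve/storage.cfg roughly along these lines (names and addresses are placeholders, not my exact config; as far as I understand, krbd is needed for containers):
Code:
cat /etc/pve/storage.cfg

rbd: ceph-ct
        monhost 10.0.0.1;10.0.0.2;10.0.0.3
        pool rbd
        content rootdir,images
        krbd 1
        username admin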
 
Also, I just tested failover again with a KVM VM: it was migrated to another node, but not started. I can see the "Resume" task, but the machine is still stopped.
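In case it helps to narrow this down, this is roughly what I can check and run by hand on the target node (VMID is just an example):
Code:
# see what state the VM ended up in after the failover
qm status 101

# start it if it is stopped, or resume it if it is only paused
qm start 101
qm resume 101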
 
Ah, just found the error message for the CT problem, I believe:
Code:
Mar 17 10:59:05 node2 pve-ha-crm[2997]: got unexpected error - Can't locate object method "config_file" via package "PVE::LXC::Config" (perhaps you forgot to load "PVE::LXC::Config"?) at /usr/share/perl5/PVE/HA/Resources/PVECT.pm line 37.
 
Ah, yes, I'm sorry. I checked the updates on one node and did not notice that the other two were still behind.
Unfortunately I then proceeded to disconnect the up-to-date node, which led to these problems ;)

I updated all nodes, and everything looks perfect. Thank you for the support!
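For anyone else hitting this: bringing the remaining nodes up to date was just the usual procedure, roughly:
Code:
# on every node
apt-get update
apt-get dist-upgrade

# afterwards, compare the output across all nodes
pveversion -v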
 
I have found one remaining problem:
Migration of a HA-enabled running LXC CT does not work.
When I trigger an online migration of a running CT, the HA migration task is started correctly, but the CT is only stopped and restarted, not migrated.
Failover, on the other hand, works fine: if the node fails, the CT is restarted on another node.

I believe this is the relevant log snippet:
Code:
Mar 17 12:57:45 node1 pvedaemon[10611]: <user@pve> starting task UPID:node1:00002CD0:00792A4F:56EA9BB9:hamigrate:100:user@pve:
Mar 17 12:57:54 node1 pve-ha-lrm[11535]: shutdown CT 100: UPID:node1:00002D0F:00792DCA:56EA9BC2:vzshutdown:100:root@pam:
Mar 17 12:57:54 node1 pve-ha-lrm[11534]: <root@pam> starting task UPID:node1:00002D0F:00792DCA:56EA9BC2:vzshutdown:100:root@pam:
Mar 17 12:57:56 node1 kernel: [79447.973838] audit_printk_skb: 3 callbacks suppressed
Mar 17 12:57:56 node1 kernel: [79447.974599] audit: type=1400 audit(1458215876.395:447): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default" name="/" pid=11889 comm="mount" flags="ro, remount, relatime"
Mar 17 12:57:58 node1 kernel: [79450.134462] device veth100i1 left promiscuous mode
Mar 17 12:57:58 node1 kernel: [79450.135481] device veth100i0 left promiscuous mode
Mar 17 12:57:59 node1 pve-ha-lrm[11534]: <root@pam> end task UPID:node1:00002D0F:00792DCA:56EA9BC2:vzshutdown:100:root@pam: OK
Mar 17 12:57:59 node1 pve-ha-lrm[11534]: can't migrate running container without --online
Mar 17 12:58:00 node1 ntpd[3373]: Deleting interface #27 veth100i0, fe80::fc9e:c0ff:fed2:a8ab#123, interface stats: received=0, sent=0, dropped=0, active_time=433 secs
Mar 17 12:58:00 node1 ntpd[3373]: Deleting interface #26 veth100i1, fe80::fc8f:b5ff:feda:436f#123, interface stats: received=0, sent=0, dropped=0, active_time=433 secs
Mar 17 12:58:00 node1 ntpd[3373]: peers refreshed
Mar 17 12:58:01 node1 CRON[11974]: (root) CMD (/usr/local/rtm/bin/rtm 36 > /dev/null 2> /dev/null)
Mar 17 12:58:24 node1 pve-ha-lrm[12090]: starting service ct:100
Mar 17 12:58:24 node1 pve-ha-lrm[12091]: starting CT 100: UPID:node1:00002F3B:00793985:56EA9BE0:vzstart:100:root@pam:
Mar 17 12:58:24 node1 pve-ha-lrm[12090]: <root@pam> starting task UPID:node1:00002F3B:00793985:56EA9BE0:vzstart:100:root@pam:
 
Can you post your current pveversion -v output once again?

From your log it looks like the shutdown command returns before the container has reached the stopped state; we had such a race condition, but that was already fixed in January.
I cannot reproduce this on my test cluster; I'll investigate further.
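Since out-of-sync packages already caused the earlier error, it is also worth making sure nothing was kept back on any node; a quick sketch:
Code:
# a plain "upgrade" does not install packages that need new dependencies
# and lists them as "kept back" instead; dist-upgrade pulls them in too
apt-get update
apt-get upgrade
apt-get dist-upgrade

# then compare the versions on all nodes
pveversion -v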
 
Ok, I updated once more, including the two packages that were held back before - and now, all is perfect!
Thanks again for the help :)