[SOLVED] HA with LXC CT not working

brichter-zt

New Member
Mar 17, 2016
Hi,

we have set up a Proxmox Cluster with Ceph (3 nodes), and all tests so far are looking good!

But we have a problem with HA for LXC CTs.

Works:
  • Migration of a stopped LXC CT
  • (Online) migration of KVM VMs - I haven't actually tested offline migration ;)
  • Watchdog - the cluster detects a missing node
  • HA migration of a running KVM VM
Problems start after adding CTs to HA:
1. HA migration does not work. When I tell Proxmox to migrate a HA CT to another node, the CT is shut down and started again, but not migrated.
2. Failover does not work. If I cut the network connection of one node, the CT is not migrated automatically.

I have created a HA group with all three nodes, and all HA-enabled CTs and VMs are assigned to this group.
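For reference, this is roughly how the group and resources were set up (group name and IDs are just examples):
Code:
# HA group spanning all three nodes
ha-manager groupadd ha-all --nodes "node1,node2,node3"

# add a container and a VM as HA resources in that group
ha-manager add ct:100 --group ha-all
ha-manager add vm:101 --group ha-all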

Is this a bug? How can I further debug this?
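Is there anything beyond the obvious - e.g. something along these lines - that I should look at?
Code:
# overall HA view: master, quorum, service states
ha-manager status

# HA-related messages in the syslog
grep -E 'pve-ha-(crm|lrm)' /var/log/syslog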

pveversion -v
proxmox-ve: 4.1-39 (running kernel: 4.2.8-1-pve)
pve-manager: 4.1-15 (running version: 4.1-15/8cd55b52)
pve-kernel-4.2.8-1-pve: 4.2.8-39
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-33
qemu-server: 4.0-62
pve-firmware: 1.1-7
libpve-common-perl: 4.0-49
libpve-access-control: 4.0-11
libpve-storage-perl: 4.0-42
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-9
pve-container: 1.0-46
pve-firewall: 2.0-18
pve-ha-manager: 1.0-24
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve1
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve7~jessie
openvswitch-switch: 2.3.2-2

I hope you can help me out with this.

Benjamin
 
1. HA migration does not work. When I tell Proxmox to migrate a HA CT to another node, the CT is shut down and started again, but not migrated.

That is how it works - there is no live migration with LXC.
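For completeness, both operations are also available on the CLI; a rough sketch (service ID and target node are just examples):
Code:
# ask the HA manager to migrate a service (an offline operation for containers)
ha-manager migrate ct:100 node2

# relocate = stop the service, move it, start it again on the target node
ha-manager relocate ct:100 node2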

2. Failover does not work. If I cut the network connection of one node, the CT is not migrated automatically.

What kind of storage do you use for the container?
 
That is how it works - there is no live migration with LXC.
That makes no sense to me. I was expecting it to shut down the CT, migrate it (offline) to another node, and start it again. But it only did a restart, no migration. And this is only a problem with HA; normal (offline) migration works.

What kind of storage do you use for the container?
Storage is always Ceph.
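In case it matters, the container storage is an RBD entry in /etc/pve/storage.cfg roughly along these lines (names and addresses are placeholders, not my exact config; as far as I understand, krbd is needed for containers):
Code:
cat /etc/pve/storage.cfg

rbd: ceph-ct
        monhost 10.0.0.1;10.0.0.2;10.0.0.3
        pool rbd
        content rootdir,images
        krbd 1
        username admin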
 
Also, I just tested failover again with a KVM VM: it was migrated to another node, but not started. I can see the "Resume" task, but the machine is still stopped.
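In case it helps to narrow this down, this is roughly what I can check and run by hand on the target node (VMID is just an example):
Code:
# see what state the VM ended up in after the failover
qm status 101

# start it if it is stopped, or resume it if it is only paused
qm start 101
qm resume 101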
 
Ah, just found the error message for the CT problem, I believe:
Code:
Mar 17 10:59:05 node2 pve-ha-crm[2997]: got unexpected error - Can't locate object method "config_file" via package "PVE::LXC::Config" (perhaps you forgot to load "PVE::LXC::Config"?) at /usr/share/perl5/PVE/HA/Resources/PVECT.pm line 37.
 
Ah, yes, I'm sorry. I checked the updates on one node and did not notice that the other two were still behind.
Unfortunately I then proceeded to disconnect the up-to-date node, which led to these problems ;)

I updated all nodes, and everything looks perfect. Thank you for the support!
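For anyone else hitting this: bringing the remaining nodes up to date was just the usual procedure, roughly:
Code:
# on every node
apt-get update
apt-get dist-upgrade

# afterwards, compare the output across all nodes
pveversion -v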
 
I have found one remaining problem:
Migration of a HA-enabled running LXC CT does not work.
When I trigger an online migration of a running CT, the HA migration task is started correctly, but the CT is only stopped and restarted, not migrated.
Failover, on the other hand, works fine: if the node fails, the CT is restarted on another node.

I believe this is the relevant log snippet:
Code:
Mar 17 12:57:45 node1 pvedaemon[10611]: <user@pve> starting task UPID:node1:00002CD0:00792A4F:56EA9BB9:hamigrate:100:user@pve:
Mar 17 12:57:54 node1 pve-ha-lrm[11535]: shutdown CT 100: UPID:node1:00002D0F:00792DCA:56EA9BC2:vzshutdown:100:root@pam:
Mar 17 12:57:54 node1 pve-ha-lrm[11534]: <root@pam> starting task UPID:node1:00002D0F:00792DCA:56EA9BC2:vzshutdown:100:root@pam:
Mar 17 12:57:56 node1 kernel: [79447.973838] audit_printk_skb: 3 callbacks suppressed
Mar 17 12:57:56 node1 kernel: [79447.974599] audit: type=1400 audit(1458215876.395:447): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default" name="/" pid=11889 comm="mount" flags="ro, remount, relatime"
Mar 17 12:57:58 node1 kernel: [79450.134462] device veth100i1 left promiscuous mode
Mar 17 12:57:58 node1 kernel: [79450.135481] device veth100i0 left promiscuous mode
Mar 17 12:57:59 node1 pve-ha-lrm[11534]: <root@pam> end task UPID:node1:00002D0F:00792DCA:56EA9BC2:vzshutdown:100:root@pam: OK
Mar 17 12:57:59 node1 pve-ha-lrm[11534]: can't migrate running container without --online
Mar 17 12:58:00 node1 ntpd[3373]: Deleting interface #27 veth100i0, fe80::fc9e:c0ff:fed2:a8ab#123, interface stats: received=0, sent=0, dropped=0, active_time=433 secs
Mar 17 12:58:00 node1 ntpd[3373]: Deleting interface #26 veth100i1, fe80::fc8f:b5ff:feda:436f#123, interface stats: received=0, sent=0, dropped=0, active_time=433 secs
Mar 17 12:58:00 node1 ntpd[3373]: peers refreshed
Mar 17 12:58:01 node1 CRON[11974]: (root) CMD (/usr/local/rtm/bin/rtm 36 > /dev/null 2> /dev/null)
Mar 17 12:58:24 node1 pve-ha-lrm[12090]: starting service ct:100
Mar 17 12:58:24 node1 pve-ha-lrm[12091]: starting CT 100: UPID:node1:00002F3B:00793985:56EA9BE0:vzstart:100:root@pam:
Mar 17 12:58:24 node1 pve-ha-lrm[12090]: <root@pam> starting task UPID:node1:00002F3B:00793985:56EA9BE0:vzstart:100:root@pam:
 
Can you post your current pveversion -v output once again?

From your log it looks like the shutdown command returns before the container has reached the stopped state; we had such a race condition, but that was already fixed in January.
I cannot reproduce this on my test cluster; I'll investigate further.
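Since out-of-sync packages already caused the earlier error, it is also worth making sure nothing was kept back on any node; a quick sketch:
Code:
# a plain "upgrade" does not install packages that need new dependencies
# and lists them as "kept back" instead; dist-upgrade pulls them in too
apt-get update
apt-get upgrade
apt-get dist-upgrade

# then compare the versions on all nodes
pveversion -v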
 
Ok, I updated once more, including the two packages that were held back before - and now, all is perfect!
Thanks again for the help :)