maurolucc

Member
Dec 1, 2017
Hello,

I am running a videoconferencing application and I want to compare the performance of VMs and LXC, especially when it comes to migrating the server.

For VMs, I have an NFS server where I store the disk of the VM hosting the server. Now I wonder how to configure LXC for live migration. Would I need to store the disk image (raw) on shared storage too? That is what I had to do for the VMs, and I want a fair comparison in terms of metrics, i.e. migration time and downtime.

I would be grateful if someone could clarify how I can achieve live migration of LXC in Proxmox. I know it does not offer this option by default, but I'm fairly sure some of you have tried this before. How did you install CRIU without causing problems for Proxmox?

Thank you in advance
 
there is no working live migration for LXC, and therefore this feature is also not implemented in PVE. we regularly test with newer CRIU versions, but the current state is still far from usable (e.g., open network connections prevent container migration). if you need live migration, use VMs. if you can live with a bit of downtime, you can use containers with "restart migration" and shared storage or replication.
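for reference, a shared NFS storage entry in /etc/pve/storage.cfg looks roughly like this (a sketch - the server address and export path below are placeholders; the 'rootdir' content type is what allows container volumes on that storage):

Code:
nfs: nfs
        server 192.0.2.10
        export /export/pve
        path /mnt/pve/nfs
        content images,rootdir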
 
I can try the latter for now and wait for the improvements once CRIU is ready.

For "restart migration" from the console it as simple as pct migrate <id> <target> right?

Regarding the storage, I can't move the disk from local-lvm to my NFS in the GUI as I did with the VMs. Can I do it from the console without damaging the container? How?

Thanks for your quick response
 
Code:
pct migrate ID TARGETNODE --restart

for moving volumes, you need to back up and restore the container (there are move-volume patches on the devel list, but they are not yet applied / available in packages)
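for reference, a rough sketch of the backup/restore route (the dump directory is just a placeholder, the archive name will contain a timestamp, and --force overwrites the existing CT 200, so make sure the backup succeeded first):

Code:
# stop-mode backup of CT 200 to a directory of your choice
vzdump 200 --mode stop --compress lzo --dumpdir /mnt/pve/nfs/dump
# restore the same CT with its root disk placed on the 'nfs' storage
pct restore 200 /mnt/pve/nfs/dump/vzdump-lxc-200-<timestamp>.tar.lzo --storage nfs --force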
 
Executing the command you suggested, I get a migration time of approximately 50s, even with the root disk attached to shared storage (NFS).

On one occasion the migration time was 4s. I'm confused because I haven't changed anything and I don't know what is causing the jump from 4s to 50s.


Code:
root@kcl-node1:~# pct migrate 200 kcl-node2 --restart
2017-12-14 12:27:26 shutdown CT 200
2017-12-14 12:27:26 # lxc-stop -n 200 --timeout 180
2017-12-14 12:27:26 # lxc-wait -n 200 -t 5 -s STOPPED
2017-12-14 12:27:27 starting migration of CT 200 to node 'kcl-node2' (10.81.59.102)
2017-12-14 12:27:27 volume 'nfs:200/vm-200-disk-1.raw' is on shared storage 'nfs'
2017-12-14 12:27:27 # /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=kcl-node2' root@10.81.59.102 pvesr set-state 200 \''{}'\'
2017-12-14 12:27:28 start final cleanup
2017-12-14 12:27:28 start container on target node
2017-12-14 12:27:28 # /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=kcl-node2' root@10.81.59.102 pct start 200
2017-12-14 12:28:15 migration finished successfully (duration 00:00:50)

It gets "stucked" in: 2017-12-14 12:27:28 # /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=kcl-node2'

Any idea how to get back to the 4s migration? What is going wrong?
 
check the logs on the target node.. how long does starting that container take?
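e.g. something along these lines on the target node (just a sketch - adjust the time range):

Code:
journalctl -u lxc@200.service --since "15 minutes ago"
# or watch the syslog live while migrating:
tail -f /var/log/syslog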
 
These are the logs in the source node:

Code:
Dec 14 12:27:26 kcl-node1 pct[8298]: <root@pam> starting task UPID:kcl-node1:0000206D:0004735E:5A326E2D:vzmigrate:200:root@pam:
Dec 14 12:27:26 kcl-node1 kernel: [ 2917.028122] vmbr0: port 2(veth200i0) entered disabled state
Dec 14 12:27:26 kcl-node1 kernel: [ 2917.028253] device veth200i0 left promiscuous mode
Dec 14 12:27:26 kcl-node1 kernel: [ 2917.028254] vmbr0: port 2(veth200i0) entered disabled state
Dec 14 12:27:27 kcl-node1 pvestatd[1077]: unable to get PID for CT 200 (not running?)
Dec 14 12:27:27 kcl-node1 pvedaemon[1182]: unable to get PID for CT 200 (not running?)
Dec 14 12:27:27 kcl-node1 systemd[1]: lxc@200.service: Main process exited, code=exited, status=1/FAILURE
Dec 14 12:27:27 kcl-node1 systemd[1]: lxc@200.service: Unit entered failed state.
Dec 14 12:27:27 kcl-node1 systemd[1]: lxc@200.service: Failed with result 'exit-code'.
Dec 14 12:27:28 kcl-node1 pmxcfs[1065]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/200: -1
Dec 14 12:27:29 kcl-node1 pmxcfs[1065]: [status] notice: received log
Dec 14 12:27:42 kcl-node1 pmxcfs[1065]: [status] notice: received log
Dec 14 12:28:00 kcl-node1 systemd[1]: Starting Proxmox VE replication runner...
Dec 14 12:28:00 kcl-node1 systemd[1]: Started Proxmox VE replication runner.
Dec 14 12:28:15 kcl-node1 pmxcfs[1065]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/200: -1
Dec 14 12:28:15 kcl-node1 pmxcfs[1065]: [status] notice: received log
Dec 14 12:28:15 kcl-node1 pct[8298]: <root@pam> end task UPID:kcl-node1:0000206D:0004735E:5A326E2D:vzmigrate:200:root@pam: OK

These are the logs in the destination node:

Code:
Dec 14 12:27:23 kcl-node2 pveproxy[4803]: proxy detected vanished client connection
Dec 14 12:27:39 kcl-node2 systemd-udevd[7160]: Could not generate persistent MAC address for vethNL7MYJ: No such fi$
Dec 14 12:27:39 kcl-node2 kernel: [1102648.607924] vmbr0: port 2(veth200i0) entered blocking state
Dec 14 12:27:39 kcl-node2 kernel: [1102648.607925] vmbr0: port 2(veth200i0) entered disabled state
Dec 14 12:27:39 kcl-node2 kernel: [1102648.607978] device veth200i0 entered promiscuous mode
Dec 14 12:27:40 kcl-node2 systemd[1]: Started LXC Container: 200.
Dec 14 12:27:40 kcl-node2 pmxcfs[1062]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/200: -1
Dec 14 12:27:40 kcl-node2 pct[6986]: <root@pam> end task UPID:kcl-node2:00001B5B:0692700A:5A326E0D:vzstart:200:root$
Dec 14 12:27:40 kcl-node2 pmxcfs[1062]: [status] notice: received log
Dec 14 12:27:40 kcl-node2 pvestatd[1079]: status update time (38.362 seconds)
Dec 14 12:27:40 kcl-node2 rrdcached[1058]: queue_thread_main: rrd_update_r (/var/lib/rrdcached/db/pve2-vm/200) fail$
Dec 14 12:27:41 kcl-node2 kernel: [1102649.856560] vmbr0: port 2(veth200i0) entered blocking state
Dec 14 12:27:41 kcl-node2 kernel: [1102649.856562] vmbr0: port 2(veth200i0) entered forwarding state
Dec 14 12:27:41 kcl-node2 pmxcfs[1062]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/kcl-n$
Dec 14 12:27:41 kcl-node2 pmxcfs[1062]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/kcl-n$
Dec 14 12:27:41 kcl-node2 pmxcfs[1062]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/kcl-n$
Dec 14 12:27:46 kcl-node2 kernel: [1102654.836806] TCP: request_sock_TCP: Possible SYN flooding on port 8433. Sendi$
Dec 14 12:28:00 kcl-node2 systemd[1]: Starting Proxmox VE replication runner...
Dec 14 12:28:00 kcl-node2 rrdcached[1058]: queue_thread_main: rrd_update_r (/var/lib/rrdcached/db/pve2-storage/kcl-$
Dec 14 12:28:00 kcl-node2 rrdcached[1058]: queue_thread_main: rrd_update_r (/var/lib/rrdcached/db/pve2-storage/kcl-$
Dec 14 12:28:00 kcl-node2 rrdcached[1058]: queue_thread_main: rrd_update_r (/var/lib/rrdcached/db/pve2-storage/kcl-$
Dec 14 12:28:00 kcl-node2 systemd[1]: Started Proxmox VE replication runner.
Dec 14 12:29:00 kcl-node2 systemd[1]: Starting Proxmox VE replication runner...


Regarding the start time, I don't know how to measure the startup time properly, but it's really quick.
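(I suppose a rough measurement would be to simply time a manual start of the stopped container, something like:)

Code:
time pct start 200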
 
OK... it seems the problem is that the nodes are not synchronized: the time I get when I run date differs between them. I'll post how to sync them as soon as I figure it out.
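For reference, the crudest idea I have so far (untested) is a one-shot copy of node1's clock to node2 over SSH, using node2's address from the migration log; proper NTP would obviously be better:

Code:
# run on kcl-node1: push its current time (as epoch seconds) to node2
ssh root@10.81.59.102 date -s "@$(date +%s)"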
 
your actual downtime is only about 10s, but at least one of your storages is probably overloaded:
Dec 14 12:27:40 kcl-node2 pvestatd[1079]: status update time (38.362 seconds)

you can probably shave off a bit more time by updating to the current packages, e.g. the replication state transfer should only happen if there are replicated volumes, and this has been fixed for some time.
 
OK... I understand that of the 50s, the container is actually up for almost 40s (so the real downtime is about 10s)... however, for some reason the process is not completing correctly...

I keep seeing RRDC update errors and, if I'm not mistaken, this is due to clock synchronization. I don't have internet access in the lab, so I am trying to sort out the problem without an external NTP server.
 
In fact, the logs I posted for the destination node are not correct, since there is a time shift between the nodes.

This is the error that is slowing down the migration:
Code:
illegal attempt to update using time when last update time is (minimum one second step)
 
I managed to synchronize the clocks of the Proxmox nodes via NTP, but this didn't solve the problem. The migration time is still approximately 50s.

However, this work was not in vain. Now I can read the logs correctly, since there is no time shift any more. On the destination node it can clearly be seen that the problem is not the migration itself but starting the container.

Code:
Dec 15 09:06:12 kcl-node1 systemd[1]: Starting LXC Container: 200...
Dec 15 09:06:13 kcl-node1 kernel: [65857.952609] EXT4-fs warning (device loop0): ext4_multi_mount_protect:324: MMP interval 42 higher than expected, please wait.
Dec 15 09:06:13 kcl-node1 kernel: [65857.952609]
Dec 15 09:06:17 kcl-node1 pveproxy[29893]: Clearing outdated entries from certificate cache
Dec 15 09:06:42 kcl-node1 pveproxy[29487]: proxy detected vanished client connection
Dec 15 09:06:58 kcl-node1 kernel: [65902.943463] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
Dec 15 09:06:58 kcl-node1 systemd-udevd[30289]: Could not generate persistent MAC address for vethKJAC1R: No such file or directory
Dec 15 09:06:58 kcl-node1 kernel: [65902.987144] IPv6: ADDRCONF(NETDEV_UP): veth200i0: link is not ready
Dec 15 09:06:58 kcl-node1 kernel: [65903.253127] vmbr0: port 2(veth200i0) entered blocking state
Dec 15 09:06:58 kcl-node1 kernel: [65903.253128] vmbr0: port 2(veth200i0) entered disabled state
Dec 15 09:06:58 kcl-node1 kernel: [65903.253227] device veth200i0 entered promiscuous mode
Dec 15 09:06:58 kcl-node1 kernel: [65903.328085] enp0s31f6: renamed from vethKJAC1R
Dec 15 09:06:58 kcl-node1 systemd[1]: Started LXC Container: 200.
 
The same container, when it's shut down and you start it again:

Code:
Dec 15 10:23:44 kcl-node2 systemd[1]: Starting LXC Container: 200...
Dec 15 10:23:45 kcl-node2 kernel: [61797.616397] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
Dec 15 10:23:45 kcl-node2 systemd-udevd[29038]: Could not generate persistent MAC address for vethIN9EV4: No such file or di$
Dec 15 10:23:45 kcl-node2 kernel: [61797.629532] IPv6: ADDRCONF(NETDEV_UP): veth200i0: link is not ready
Dec 15 10:23:45 kcl-node2 kernel: [61797.892855] vmbr0: port 2(veth200i0) entered blocking state
Dec 15 10:23:45 kcl-node2 kernel: [61797.892856] vmbr0: port 2(veth200i0) entered disabled state
Dec 15 10:23:45 kcl-node2 kernel: [61797.892899] device veth200i0 entered promiscuous mode
Dec 15 10:23:45 kcl-node2 kernel: [61797.965511] enp0s31f6: renamed from vethIN9EV4
Dec 15 10:23:45 kcl-node2 systemd[1]: Started LXC Container: 200.

It's super quick.
 
this means that the container is not shut down correctly, likely related to an old bug regarding waiting for container shutdown. please update to the current package versions - if the problem persists, provide the output of "pveversion -v"
 
Code:
root@kcl-node1:~# pveversion -v
proxmox-ve: 5.0-19 (running kernel: 4.10.17-2-pve)
pve-manager: 5.0-30 (running version: 5.0-30/5ab26bc)
pve-kernel-4.10.17-2-pve: 4.10.17-19
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve3
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-12
qemu-server: 5.0-15
pve-firmware: 2.0-2
libpve-common-perl: 5.0-16
libpve-guest-common-perl: 2.0-11
libpve-access-control: 5.0-6
libpve-storage-perl: 5.0-14
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.0-9
pve-qemu-kvm: 2.9.0-3
pve-container: 2.0-15
pve-firewall: 3.0-2
pve-ha-manager: 2.0-2
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.0.8-3
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.6.5.9-pve16~bpo90
 
It's really strange though... I've seen it take 4s again on some occasions...

Why do you say the problem is on shutdown, when we can see that it gets stuck when trying to start the container on the destination node?

Check these new messages I got on the receiving node:

Code:
Dec 15 14:53:53 kcl-node1 pct[25524]: <root@pam> starting task UPID:kcl-node1:000063BD:000CBC1E:5A33E201:vzstart:200:root@pam:
Dec 15 14:53:53 kcl-node1 pct[25533]: starting CT 200: UPID:kcl-node1:000063BD:000CBC1E:5A33E201:vzstart:200:root@pam:
Dec 15 14:53:53 kcl-node1 systemd[1]: Starting LXC Container: 200...
Dec 15 14:53:54 kcl-node1 kernel: [ 8346.202870] EXT4-fs warning (device loop0): ext4_multi_mount_protect:324: MMP interval 42 higher than expected, please wait.
Dec 15 14:53:54 kcl-node1 kernel: [ 8346.202870]
Dec 15 14:54:00 kcl-node1 systemd[1]: Starting Proxmox VE replication runner...
Dec 15 14:54:00 kcl-node1 systemd[1]: Started Proxmox VE replication runner.
Dec 15 14:54:39 kcl-node1 kernel: [ 8391.561582] EXT4-fs (loop0): warning: mounting fs with errors, running e2fsck is recommended
Dec 15 14:54:39 kcl-node1 kernel: [ 8391.562101] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
Dec 15 14:54:39 kcl-node1 systemd-udevd[25598]: Could not generate persistent MAC address for vethY7UC23: No such file or directory
Dec 15 14:54:39 kcl-node1 kernel: [ 8391.605753] IPv6: ADDRCONF(NETDEV_UP): veth200i0: link is not ready
Dec 15 14:54:39 kcl-node1 kernel: [ 8391.868509] vmbr0: port 2(veth200i0) entered blocking state
Dec 15 14:54:39 kcl-node1 kernel: [ 8391.868510] vmbr0: port 2(veth200i0) entered disabled state
Dec 15 14:54:39 kcl-node1 kernel: [ 8391.868555] device veth200i0 entered promiscuous mode
Dec 15 14:54:39 kcl-node1 kernel: [ 8391.930325] enp0s31f6: renamed from vethY7UC23
Dec 15 14:54:40 kcl-node1 systemd[1]: Started LXC Container: 200.


There's a problem when mounting... It's strange, because when there's no migration and I start/stop the container on the same node, there's no such problem.
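Since the kernel log keeps recommending e2fsck, I guess I could try a filesystem check on the container volume (a sketch, assuming my pct version has the fsck subcommand and the CT is stopped everywhere):

Code:
pct stop 200   # make sure CT 200 is not running on any node
pct fsck 200   # run fsck on the container's root volume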
 
@fabian I captured tail -f /var/log/syslog on the destination node while the migration was running, and it gets stuck at the lines highlighted with (!!!!!!).

Source node:
Code:
root@kcl-node1:~# pct migrate 200 kcl-node2 --restart
2017-12-15 16:54:01 shutdown CT 200
2017-12-15 16:54:01 # lxc-stop -n 200 --timeout 180
2017-12-15 16:54:01 # lxc-wait -n 200 -t 5 -s STOPPED
2017-12-15 16:54:02 starting migration of CT 200 to node 'kcl-node2' (10.81.59.102)
2017-12-15 16:54:03 volume 'nfs:200/vm-200-disk-1.raw' is on shared storage 'nfs'
2017-12-15 16:54:03 # /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=kcl-node2' root@10.81.59.102 pvesr set-state 200 \''{}'\'
2017-12-15 16:54:03 start final cleanup
2017-12-15 16:54:04 start container on target node
!!!!!!!!!!!!!!!! 2017-12-15 16:54:04 # /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=kcl-node2' root@10.81.59.102 pct start 200
2017-12-15 16:54:49 migration finished successfully (duration 00:00:48)

Destination node:
Code:
Dec 15 16:54:01 kcl-node2 pmxcfs[1093]: [status] notice: received log
Dec 15 16:54:01 kcl-node2 systemd[1]: Started Session 54 of user root.
Dec 15 16:54:03 kcl-node2 systemd[1]: Started Session 55 of user root.
Dec 15 16:54:03 kcl-node2 systemd[1]: Started Session 56 of user root.
Dec 15 16:54:04 kcl-node2 systemd[1]: Started Session 57 of user root.
Dec 15 16:54:04 kcl-node2 pct[31141]: <root@pam> starting task UPID:kcl-node2:000079AA:00176D92:5A33FE2C:vzstart:200:root@pam:
Dec 15 16:54:04 kcl-node2 pct[31146]: starting CT 200: UPID:kcl-node2:000079AA:00176D92:5A33FE2C:vzstart:200:root@pam:
Dec 15 16:54:04 kcl-node2 systemd[1]: Starting LXC Container: 200...
!!!!!!!!!!!!!!! Dec 15 16:54:05 kcl-node2 kernel: [15354.000767] EXT4-fs warning (device loop0): ext4_multi_mount_protect:324: MMP interval 42 higher than expected, please wait.
!!!!!!!!!!!!!!! Dec 15 16:54:05 kcl-node2 kernel: [15354.000767]
Dec 15 16:54:36 kcl-node2 pveproxy[1167]: proxy detected vanished client connection
Dec 15 16:54:48 kcl-node2 kernel: [15397.629152] EXT4-fs (loop0): warning: mounting fs with errors, running e2fsck is recommended
Dec 15 16:54:48 kcl-node2 kernel: [15397.629706] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
Dec 15 16:54:48 kcl-node2 systemd-udevd[31206]: Could not generate persistent MAC address for vethM12WXB: No such file or directory
Dec 15 16:54:48 kcl-node2 kernel: [15397.669988] IPv6: ADDRCONF(NETDEV_UP): veth200i0: link is not ready
Dec 15 16:54:49 kcl-node2 kernel: [15397.934290] vmbr0: port 2(veth200i0) entered blocking state
Dec 15 16:54:49 kcl-node2 kernel: [15397.934292] vmbr0: port 2(veth200i0) entered disabled state
Dec 15 16:54:49 kcl-node2 kernel: [15397.934336] device veth200i0 entered promiscuous mode
Dec 15 16:54:49 kcl-node2 kernel: [15397.985754] enp0s31f6: renamed from vethM12WXB
Dec 15 16:54:49 kcl-node2 systemd[1]: Started LXC Container: 200.
Dec 15 16:54:49 kcl-node2 pct[31141]: <root@pam> end task UPID:kcl-node2:000079AA:00176D92:5A33FE2C:vzstart:200:root@pam: OK
Dec 15 16:54:49 kcl-node2 pmxcfs[1093]: [status] notice: received log
Dec 15 16:54:49 kcl-node2 pvestatd[1105]: status update time (44.515 seconds)


I don't know what else I can do...
 
I'm starting to think this is something with the remote execution itself:

Code:
/usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=kcl-node2' root@10.81.59.102 pct start 200

Could I change something in the ssh config that would solve my issue?
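One way to check whether ssh itself (rather than the container start) is the slow part would be to run the same command from the migration log by hand and time it, with CT 200 stopped first:

Code:
time /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=kcl-node2' root@10.81.59.102 pct start 200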
 
The last idea of the day...

What if the problem is something like this?

In the container config I see:

Root Disk | nfs:200/vm-200-disk-1.raw,size=4G

Could it be wrong?
 
please read my last post - you are running outdated packages, please upgrade!
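for reference, on a standard setup the upgrade itself boils down to the following (which repository you pull from - enterprise or no-subscription - depends on your setup):

Code:
# /etc/apt/sources.list.d/pve-no-subscription.list (if you have no subscription):
#   deb http://download.proxmox.com/debian/pve stretch pve-no-subscription
apt-get update
apt-get dist-upgrade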
 
