Hi everybody,
We are currently running a 3-node Proxmox 4.4 cluster with Ceph and some shared storage from a Linux cluster. We recently migrated from an old two-node Proxmox 3 setup. While upgrading the cluster we changed the networking from classic Linux bridging to Open vSwitch. My guess is that this is where the problems come from.
In some cases when I live-migrate a VM, the task fails:
Jun 16 16:02:45 use dedicated network address for sending migration traffic (172.20.242.1)
Jun 16 16:02:46 starting migration of VM 104 to node 'kvm-c01-node01' (172.20.242.1)
Jun 16 16:02:46 copying disk images
Jun 16 16:02:46 starting VM 104 on remote node 'kvm-c01-node01'
Jun 16 16:02:48 starting online/live migration on tcp:172.20.242.1:60000
Jun 16 16:02:48 migrate_set_speed: 8589934592
Jun 16 16:02:48 migrate_set_downtime: 0.1
Jun 16 16:02:48 set migration_caps
Jun 16 16:02:48 set cachesize: 858993459
Jun 16 16:02:48 start migrate command to tcp:172.20.242.1:60000
Jun 16 16:02:50 migration status: active (transferred 1164679278, remaining 7429926912), total 8607571968)
Jun 16 16:02:50 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
Jun 16 16:02:52 migration status: active (transferred 2163188335, remaining 6430842880), total 8607571968)
Jun 16 16:02:52 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
Jun 16 16:02:54 migration status: active (transferred 2858173500, remaining 5734805504), total 8607571968)
Jun 16 16:02:54 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
Jun 16 16:02:56 migration status: active (transferred 3712890610, remaining 4877414400), total 8607571968)
Jun 16 16:02:56 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
Jun 16 16:02:58 migration status: active (transferred 4242548907, remaining 4346986496), total 8607571968)
Jun 16 16:02:58 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
Jun 16 16:03:00 migration status: active (transferred 4824522737, remaining 3763507200), total 8607571968)
Jun 16 16:03:00 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
Jun 16 16:03:02 migration status: active (transferred 5276740515, remaining 3309481984), total 8607571968)
Jun 16 16:03:02 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
Jun 16 16:03:04 migration status: active (transferred 5725033511, remaining 2858123264), total 8607571968)
Jun 16 16:03:04 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
Jun 16 16:03:06 migration status: active (transferred 6199944439, remaining 2378608640), total 8607571968)
Jun 16 16:03:06 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
Jun 16 16:03:08 migration status: active (transferred 6637811664, remaining 1930936320), total 8607571968)
Jun 16 16:03:08 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
Jun 16 16:03:10 migration status: active (transferred 7191819020, remaining 1370603520), total 8607571968)
Jun 16 16:03:10 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
Jun 16 16:03:12 migration status: active (transferred 7642999873, remaining 915116032), total 8607571968)
Jun 16 16:03:12 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
Jun 16 16:03:14 migration status: active (transferred 8053261811, remaining 482758656), total 8607571968)
Jun 16 16:03:14 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
Jun 16 16:03:15 migration status: active (transferred 8115832973, remaining 414023680), total 8607571968)
Jun 16 16:03:15 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
Jun 16 16:03:15 migration status: active (transferred 8174519430, remaining 350216192), total 8607571968)
Jun 16 16:03:15 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
Jun 16 16:03:15 migration status: active (transferred 8232968296, remaining 280850432), total 8607571968)
Jun 16 16:03:15 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
Jun 16 16:03:16 migration status: active (transferred 8308933851, remaining 204828672), total 8607571968)
Jun 16 16:03:16 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
Jun 16 16:03:16 migration status: active (transferred 8370092086, remaining 143613952), total 8607571968)
Jun 16 16:03:16 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
Jun 16 16:03:16 migration status: active (transferred 8430139532, remaining 82874368), total 8607571968)
Jun 16 16:03:16 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
Jun 16 16:03:16 migration status: active (transferred 8497598965, remaining 479371264), total 8607571968)
Jun 16 16:03:16 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 951 overflow 0
Jun 16 16:03:17 migration status: active (transferred 8642721548, remaining 334077952), total 8607571968)
Jun 16 16:03:17 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 36312 overflow 0
Jun 16 16:03:17 migration status: active (transferred 8866168881, remaining 108789760), total 8607571968)
Jun 16 16:03:17 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 90757 overflow 0
Jun 16 16:03:17 migration speed: 282.48 MB/s - downtime 93 ms
Jun 16 16:03:17 migration status: completed
Jun 16 16:03:18 ERROR: VM 104 not running
Jun 16 16:03:18 ERROR: command '/usr/bin/ssh -o 'BatchMode=yes' root@172.20.242.1 qm resume 104 --skiplock --nocheck' failed: exit code 2
Jun 16 16:03:21 ERROR: migration finished with problems (duration 00:00:36)
TASK ERROR: migration problems
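To make task logs like the one above easier to compare between a failed and a successful run, here is a minimal Python sketch that pulls the transferred/remaining/total counters out of the "migration status: active" lines and flags every sample where "remaining" grew again. In the log above that happens at 16:03:16, where it jumps from 82874368 to 479371264, presumably because the guest dirtied memory again during the transfer. The function names and the regular expression are my own and not part of Proxmox or QEMU.

```python
import re

# Matches task-log lines such as:
# "... migration status: active (transferred 1164679278, remaining 7429926912), total 8607571968)"
STATUS_RE = re.compile(
    r"migration status: active \(transferred (\d+), remaining (\d+)\), total (\d+)\)"
)


def parse_migration_status(lines):
    """Return one (transferred, remaining, total) tuple per 'active' status line."""
    samples = []
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            samples.append(tuple(int(group) for group in match.groups()))
    return samples


def remaining_increases(samples):
    """Return the indices of samples whose 'remaining' is larger than the previous one."""
    return [
        i for i in range(1, len(samples))
        if samples[i][1] > samples[i - 1][1]
    ]


if __name__ == "__main__":
    # Excerpt copied from the task log above.
    log = [
        "Jun 16 16:03:16 migration status: active (transferred 8430139532, remaining 82874368), total 8607571968)",
        "Jun 16 16:03:16 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0",
        "Jun 16 16:03:16 migration status: active (transferred 8497598965, remaining 479371264), total 8607571968)",
        "Jun 16 16:03:17 migration status: completed",
    ]
    samples = parse_migration_status(log)
    print(remaining_increases(samples))  # prints [1]
```

Feeding it the full task log of a working and a failing migration should show whether the late increase in "remaining" only appears in the failing runs; it does not by itself explain the "VM 104 not running" error.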
Jun 16 16:02:46 kvm-c01-node01 qm[3039]: <root@pam> starting task UPID:kvm-c01-node01:00000BF8:05346077:5943E506:qmstart:104:root@pam:
Jun 16 16:02:46 kvm-c01-node01 qm[3064]: start VM 104: UPID:kvm-c01-node01:00000BF8:05346077:5943E506:qmstart:104:root@pam:
Jun 16 16:02:47 kvm-c01-node01 systemd[1]: Starting 104.scope.
Jun 16 16:02:47 kvm-c01-node01 systemd[1]: Started 104.scope.
Jun 16 16:02:48 kvm-c01-node01 kernel: [873233.958601] device tap104i0 entered promiscuous mode
Jun 16 16:02:48 kvm-c01-node01 ovs-vsctl: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port tap104i0
Jun 16 16:02:48 kvm-c01-node01 ovs-vsctl: ovs|00002|db_ctl_base|ERR|no port named tap104i0
Jun 16 16:02:48 kvm-c01-node01 ovs-vsctl: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port fwln104i0
Jun 16 16:02:48 kvm-c01-node01 ovs-vsctl: ovs|00002|db_ctl_base|ERR|no port named fwln104i0
Jun 16 16:02:48 kvm-c01-node01 ovs-vsctl: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl add-port vmbr0 tap104i0 tag=10
Jun 16 16:02:48 kvm-c01-node01 qm[3039]: <root@pam> end task UPID:kvm-c01-node01:00000BF8:05346077:5943E506:qmstart:104:root@pam: OK
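Since my suspicion is on Open vSwitch, here is a small Python sketch for filtering a syslog excerpt like the one above down to the ovs-vsctl calls and pairing each ERR line with the command logged directly before it. The helper names and the regular expression are my own assumptions about the line format, based only on the lines shown here.

```python
import re

# Matches the ovs-vsctl syslog payload, e.g.
# "ovs-vsctl: ovs|00002|db_ctl_base|ERR|no port named tap104i0"
OVS_RE = re.compile(r"ovs-vsctl: ovs\|(\d+)\|(\w+)\|(\w+)\|(.*)$")


def ovs_events(lines):
    """Return (level, message) tuples for every ovs-vsctl line in a syslog excerpt."""
    events = []
    for line in lines:
        match = OVS_RE.search(line)
        if match:
            _seq, _module, level, message = match.groups()
            events.append((level, message))
    return events


def failed_commands(events):
    """Pair each ERR event with the 'Called as' command logged immediately before it."""
    failures = []
    last_command = None
    for level, message in events:
        if level == "INFO" and message.startswith("Called as "):
            last_command = message[len("Called as "):]
        elif level == "ERR":
            failures.append((last_command, message))
    return failures


if __name__ == "__main__":
    # Excerpt copied from the syslog above.
    syslog = [
        "Jun 16 16:02:48 kvm-c01-node01 ovs-vsctl: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port tap104i0",
        "Jun 16 16:02:48 kvm-c01-node01 ovs-vsctl: ovs|00002|db_ctl_base|ERR|no port named tap104i0",
        "Jun 16 16:02:48 kvm-c01-node01 ovs-vsctl: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl add-port vmbr0 tap104i0 tag=10",
    ]
    for command, error in failed_commands(ovs_events(syslog)):
        print(command, "->", error)
```

On the excerpt above, the only errors reported are for the two del-port calls that run before add-port; the add-port call itself logs no error. Whether those del-port errors have anything to do with the failed resume is not something I can tell from this log alone.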
I also saved logs from another VM with identical problems and can provide them on request. The error is not reliably reproducible; sometimes live migration just works. I already checked whether it is storage related, but it happens regardless of whether the storage is on Ceph or shared from the backend cluster.
proxmox-ve: 4.4-88 (running kernel: 4.4.62-1-pve)
pve-manager: 4.4-13 (running version: 4.4-13/7ea56165)
pve-kernel-4.4.62-1-pve: 4.4.62-88
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-2~pve4+1
libqb0: 1.0.1-1
pve-cluster: 4.0-50
qemu-server: 4.0-110
pve-firmware: 1.1-11
libpve-common-perl: 4.0-95
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-2
pve-docs: 4.4-4
pve-qemu-kvm: 2.7.1-4
pve-container: 1.0-100
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-4
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-9
smartmontools: 6.5+svn4324-1~pve80
openvswitch-switch: 2.6.0-2
ceph: 10.2.7-1~bpo80+1
Unfortunately I couldn't find anything on the web, so if anybody has an idea, please let me know.
Thanks for reading and have a good day!