[SOLVED] live migration not working after upgrade 5.1->5.3

Knuuut

Hello Community,
after upgrading one node from 5.1 to 5.3 in a 5-node cluster, I can't do live migration of VMs. Now I'm stuck and in a bad situation, because I want to do a node-by-node upgrade of the cluster and I can't free up the other nodes...
Here is the log of a migration from the 5.1 node to the 5.3 node:

Code:
2019-03-13 14:03:50 use dedicated network address for sending migration traffic (x.x.x.x)
2019-03-13 14:03:51 starting migration of VM 256 to node '1901' (x.x.x.x)
2019-03-13 14:03:51 copying disk images
2019-03-13 14:03:51 starting VM 256 on remote node '1901'
2019-03-13 14:03:53 start remote tunnel
2019-03-13 14:03:53 ssh tunnel ver 1
2019-03-13 14:03:53 starting online/live migration on tcp:x.x.x.x:60000
2019-03-13 14:03:53 migrate_set_speed: 8589934592
2019-03-13 14:03:53 migrate_set_downtime: 0.1
2019-03-13 14:03:53 set migration_caps
2019-03-13 14:03:53 set cachesize: 1717986918
2019-03-13 14:03:53 start migrate command to tcp:x.x.x.x:60000
2019-03-13 14:03:54 migration status: active (transferred 750986917, remaining 14287990784), total 17197506560)
2019-03-13 14:03:54 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 0 overflow 0
2019-03-13 14:03:55 migration status: active (transferred 1294769337, remaining 11245088768), total 17197506560)
2019-03-13 14:03:55 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 0 overflow 0
2019-03-13 14:03:56 migration status: active (transferred 2097557584, remaining 7959379968), total 17197506560)
2019-03-13 14:03:56 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 0 overflow 0
2019-03-13 14:03:57 migration status: active (transferred 2858109376, remaining 6582280192), total 17197506560)
2019-03-13 14:03:57 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 0 overflow 0
2019-03-13 14:03:58 migration speed: 3276.80 MB/s - downtime 14 ms
2019-03-13 14:03:58 migration status: completed
2019-03-13 14:03:58 ERROR: tunnel replied 'ERR: resume failed - VM 256 not running' to command 'resume 256'
2019-03-13 14:04:11 ERROR: migration finished with problems (duration 00:00:21)
TASK ERROR: migration problems

The machine is migrated, but ends up in the off state.

Here is the migration log from the 5.3 node back to the 5.1 node:

Code:
2019-03-13 14:28:59 starting migration of VM 256 to node '1801' (x.x.x.x)
2019-03-13 14:28:59 copying disk images
2019-03-13 14:28:59 starting VM 256 on remote node '1801'
2019-03-13 14:29:00 start failed: command '/usr/bin/kvm -id 256 -chardev 'socket,id=qmp,path=/var/run/qemu-server/256.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -pidfile /var/run/qemu-server/256.pid -daemonize -smbios 'type=1,uuid=5c3cb443-3136-4d7c-852b-fb4a9aefde16' -name pmm-1801 -smp '8,sockets=2,cores=4,maxcpus=8' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga std -vnc unix:/var/run/qemu-server/256.vnc,x509,password -cpu host,+kvm_pv_unhalt,+kvm_pv_eoi -m 16384 -object 'memory-backend-ram,id=ram-node0,size=8192M' -numa 'node,nodeid=0,cpus=0-3,memdev=ram-node0' -object 'memory-backend-ram,id=ram-node1,size=8192M' -numa 'node,nodeid=1,cpus=4-7,memdev=ram-node1' -k de -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:e0bda18eac6' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=rbd:Prod660/vm-256-disk-1:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/Prod660.keyring,if=none,id=drive-scsi0,discard=on,format=raw,cache=none,aio=native,detect-zeroes=unmap' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -drive 'file=rbd:Prod660/vm-256-disk-2:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/Prod660.keyring,if=none,id=drive-scsi1,discard=on,format=raw,cache=none,aio=native,detect-zeroes=unmap' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=1,drive=drive-scsi1,id=scsi1' -netdev 'type=tap,id=net0,ifname=tap256i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=DE:AD:0A:1F:1E:0D,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -machine 'type=pc-i440fx-2.12' -incoming unix:/run/qemu-server/256.migrate -S' failed: exit code 1
2019-03-13 14:29:00 ERROR: online migrate failure - command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=1801' root@x.x.x.x qm start 256 --skiplock --migratedfrom pmve-1901 --migration_type secure --stateuri unix --machine pc-i440fx-2.12' failed: exit code 255
2019-03-13 14:29:00 aborting phase 2 - cleanup resources
2019-03-13 14:29:00 migrate_cancel
2019-03-13 14:29:01 ERROR: migration finished with problems (duration 00:00:02)
migration problems

The machine is not migrated and is still running on the source node.

Any help and suggestions are welcome.

Cheers, Knuuut
 
Can you post your VM config?

Of course...

Code:
bootdisk: scsi0
cores: 4
cpu: host
hotplug: disk,network,usb
ide2: none,media=cdrom
memory: 16384
name: pmm-1801
net0: virtio=DE:AD:0A:1F:1E:0D,bridge=vmbr0
numa: 1
ostype: l26
scsi0: bkp-1901:256/vm-256-disk-0.raw,discard=on,size=40G
scsi1: bkp-1901:256/vm-256-disk-1.raw,discard=on,size=400G
scsihw: virtio-scsi-pci
smbios1: uuid=5c3cb443-3136-4d7c-852b-fb4a9aefde16
sockets: 2
 
but different mainboard manufacturers (Intel/Supermicro)...
If they contain different microcode updates, this may very well be the problem. Can you try with a test VM with another CPU type, e.g. kvm64 (ideally otherwise identical to your real VM)?
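A quick way to check both points from the shell; this is only a sketch (the microcode revision is read from /proc/cpuinfo, and VM 256 is simply reused here as the test VM — adjust the ID and target node to whatever you use for testing):

Code:
# on both nodes: compare the loaded CPU microcode revision
grep -m1 microcode /proc/cpuinfo

# switch the test VM to the generic kvm64 model and cold-boot it
qm set 256 --cpu kvm64
qm stop 256
qm start 256

# then retry the online migration to the upgraded node
qm migrate 256 1901 --online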
 
kvm64 works... :eek:

All (>200) VMs have the CPU type set to host.

Does this mean the only option I've got is to alter the CPU type on every machine and do a stop/start cycle?
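To gauge how many VMs would actually need the change, the cluster-wide configs can be grepped; a small sketch, assuming the standard pmxcfs layout under /etc/pve:

Code:
# count all VM configs in the cluster that explicitly set cpu: host
grep -l '^cpu: host' /etc/pve/nodes/*/qemu-server/*.conf | wc -l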
 
Does this mean the only option I've got is to alter the CPU type on every machine and do a stop/start cycle?
You can try to fully upgrade the BIOS/firmware of all mainboards, but yes, to migrate them away initially there is not really another way, I'm afraid.
An alternative to using the GUI is using the CLI tools (e.g. pvesh / qm) to change the CPU type and stop/start them.
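A rough per-node sketch of that CLI route, assuming the standard /etc/pve/qemu-server config path and that every affected guest can take a clean shutdown (test it on a single VM first):

Code:
# run on each node in turn: find the local VMs that use cpu: host
for conf in /etc/pve/qemu-server/*.conf; do
    grep -q '^cpu: host' "$conf" || continue
    vmid=$(basename "$conf" .conf)
    # switch to a generic CPU model (kvm64 here, pick what fits your guests)
    qm set "$vmid" --cpu kvm64
    # the new CPU type only takes effect after a full stop/start cycle
    qm shutdown "$vmid" && qm start "$vmid"
done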