PVE 4 KVM live migration problem

alitvak69

I am testing live migration on a 4-node quorate cluster. It does not fail in 100% of cases, but it is reproducible. I migrate a VM from one node to another and get this:

Code:
task started by HA resource agent
Oct 09 22:04:22 starting migration of VM 100 to node 'virt2n2-la' (38.102.250.229)
Oct 09 22:04:22 copying disk images
Oct 09 22:04:22 starting VM 100 on remote node 'virt2n2-la'
Oct 09 22:04:24 starting ssh migration tunnel
Oct 09 22:04:24 starting online/live migration on localhost:60000
Oct 09 22:04:24 migrate_set_speed: 8589934592
Oct 09 22:04:24 migrate_set_downtime: 0.1
Oct 09 22:04:26 migration status: active (transferred 240337644, remaining 61005824), total 2156601344)
Oct 09 22:04:26 migration xbzrle cachesize: 134217728 transferred 0 pages 0 cachemiss 0 overflow 0
Oct 09 22:04:28 migration speed: 512.00 MB/s - downtime 11 ms
Oct 09 22:04:28 migration status: completed
Oct 09 22:04:29 ERROR: unable to find configuration file for VM 100 - no such machine
Oct 09 22:04:29 ERROR: command '/usr/bin/ssh -o 'BatchMode=yes' root@38.102.250.229 qm resume 100 --skiplock' failed: exit code 2
Oct 09 22:04:31 ERROR: migration finished with problems (duration 00:00:09)
TASK ERROR: migration problems

The VM ends up on the target node, but in the paused state.

When the migration is successful, I see the following sequence:

Code:
virt2n2-la HA 100 Migrate
Executing HA migrate for VM 100 to node virt2n1-la
TASK OK

virt2n1-la  VM 100 Start
Running as unit 100.scope.
libust[43261/43261]: Error: Error cancelling global ust listener thread: No such process (in lttng_ust_exit() at lttng-ust-comm.c:1592)
libust[43261/43261]: Error: Error cancelling local ust listener thread: No such process (in lttng_ust_exit() at lttng-ust-comm.c:1601)
migration listens on tcp:[localhost]:60000
TASK OK

virt2n2-la VM 100 Migrate

task started by HA resource agent
Oct 09 22:03:06 starting migration of VM 100 to node 'virt2n1-la' (38.102.250.228)
Oct 09 22:03:06 copying disk images
Oct 09 22:03:06 starting VM 100 on remote node 'virt2n1-la'
Oct 09 22:03:07 starting ssh migration tunnel
Oct 09 22:03:08 starting online/live migration on localhost:60000
Oct 09 22:03:08 migrate_set_speed: 8589934592
Oct 09 22:03:08 migrate_set_downtime: 0.1
Oct 09 22:03:10 migration status: active (transferred 239552889, remaining 60329984), total 2156601344)
Oct 09 22:03:10 migration xbzrle cachesize: 134217728 transferred 0 pages 0 cachemiss 0 overflow 0
Oct 09 22:03:12 migration speed: 512.00 MB/s - downtime 19 ms
Oct 09 22:03:12 migration status: completed
Oct 09 22:03:15 migration finished successfully (duration 00:00:09)
TASK OK


virt2n1-la VM 100 Resume
TASK OK

When the migration fails to complete, I get:

Code:
virt2n1-la HA 100 Migrate
Executing HA migrate for VM 100 to node virt2n2-la
TASK OK


virt2n1-la VM 100 Migrate
task started by HA resource agent
Oct 09 22:04:22 starting migration of VM 100 to node 'virt2n2-la' (38.102.250.229)
Oct 09 22:04:22 copying disk images
Oct 09 22:04:22 starting VM 100 on remote node 'virt2n2-la'
Oct 09 22:04:24 starting ssh migration tunnel
Oct 09 22:04:24 starting online/live migration on localhost:60000
Oct 09 22:04:24 migrate_set_speed: 8589934592
Oct 09 22:04:24 migrate_set_downtime: 0.1
Oct 09 22:04:26 migration status: active (transferred 240337644, remaining 61005824), total 2156601344)
Oct 09 22:04:26 migration xbzrle cachesize: 134217728 transferred 0 pages 0 cachemiss 0 overflow 0
Oct 09 22:04:28 migration speed: 512.00 MB/s - downtime 11 ms
Oct 09 22:04:28 migration status: completed
Oct 09 22:04:29 ERROR: unable to find configuration file for VM 100 - no such machine
Oct 09 22:04:29 ERROR: command '/usr/bin/ssh -o 'BatchMode=yes' root@38.102.250.229 qm resume 100 --skiplock' failed: exit code 2
Oct 09 22:04:31 ERROR: migration finished with problems (duration 00:00:09)
TASK ERROR: migration problems

virt2n2-la VM 100 Start
Running as unit 100.scope.
libust[41997/41997]: Error: Error cancelling global ust listener thread: No such process (in lttng_ust_exit() at lttng-ust-comm.c:1592)
libust[41997/41997]: Error: Error cancelling local ust listener thread: No such process (in lttng_ust_exit() at lttng-ust-comm.c:1601)
migration listens on tcp:[localhost]:60000
TASK OK

No resume, as you can see. I can run resume manually and the VM starts working again, but that is not an acceptable solution.
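
For reference, this is the manual workaround, using the exact resume command from the failed task log (target IP and VM ID are from my setup):

Code:
# run the resume on the target node via ssh, exactly as the migration task tried to
/usr/bin/ssh -o 'BatchMode=yes' root@38.102.250.229 qm resume 100 --skiplock
# or log in to the target node and resume locally
qm resume 100 --skiplock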

Code:
VM Config
bootdisk: virtio0
cores: 2
cpu: host
hotplug: disk,network
ide2: none,media=cdrom
localtime: 1
memory: 2048
name: monitor1n1-la.siptalk.com
net0: virtio=D2:2B:82:35:92:EC,bridge=vmbr0,firewall=1,queues=8
net1: virtio=B2:72:6B:66:44:E4,bridge=vmbr1,firewall=1
numa: 0
ostype: l26
protection: 1
scsihw: virtio-scsi-pci
smbios1: uuid=f38594bd-b5b0-4253-8a95-f192ec4cfdf7
sockets: 2
virtio0: ceph:vm-100-disk-1,discard=on,backup=no,iothread=on,size=32G

Cluster Config

Code:
logging {
  debug: on
  to_syslog: yes
}


nodelist {
  node {
    nodeid: 4
    quorum_votes: 1
    ring0_addr: virt2n4-la
    ring1_addr: virt2n4-la-int
  }


  node {
    nodeid: 2
    quorum_votes: 1
    ring0_addr: virt2n2-la
    ring1_addr: virt2n2-la-int
  }


  node {
    nodeid: 3
    quorum_votes: 1
    ring0_addr: virt2n3-la
    ring1_addr: virt2n3-la-int
  }


  node {
    nodeid: 1
    quorum_votes: 1
    ring0_addr: virt2n1-la
    ring1_addr: virt2n1-la-int
  }


}


quorum {
  provider: corosync_votequorum
}


totem {
  cluster_name: virt2-jcluster
  config_version: 4
  ip_version: ipv4
  rrp_mode: passive
  secauth: on
  version: 2
  interface {
    mcastaddr: 239.192.170.222
    mcastport: 5405
    ringnumber: 0
    ttl: 1
  }


  interface {
    mcastaddr: 239.192.170.223
    mcastport: 5405
    ringnumber: 1
    ttl: 1
  }


}

pvecm status

Code:
Quorum information
------------------
Date:             Fri Oct  9 22:28:04 2015
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000002
Ring ID:          368
Quorate:          Yes


Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3  
Flags:            Quorate 


Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 38.102.250.228
0x00000002          1 38.102.250.229 (local)
0x00000003          1 38.102.250.231
0x00000004          1 38.102.250.232

Syslog virt2n1-la (source node of this migration)

Code:
Oct  9 22:04:11 virt2n1-la pvedaemon[42548]: <root@pam> starting task UPID:virt2n1-la:0000AAB9:000DB8F1:56189C4B:hamigrate:100:root@pam:
Oct  9 22:04:22 virt2n1-la pve-ha-lrm[43783]: <root@pam> starting task UPID:virt2n1-la:0000AB08:000DBD8B:56189C56:qmigrate:100:root@pam:
Oct  9 22:04:23 virt2n1-la pmxcfs[1569]: [status] notice: received log
Oct  9 22:04:23 virt2n1-la pve-ha-lrm[43783]: Task still active, waiting
Oct  9 22:04:24 virt2n1-la pmxcfs[1569]: [status] notice: received log
Oct  9 22:04:24 virt2n1-la pve-ha-lrm[43783]: Task still active, waiting
Oct  9 22:04:25 virt2n1-la pve-ha-lrm[43783]: Task still active, waiting
Oct  9 22:04:26 virt2n1-la pve-ha-lrm[43783]: Task still active, waiting
Oct  9 22:04:27 virt2n1-la pve-ha-lrm[43783]: Task still active, waiting
Oct  9 22:04:28 virt2n1-la pve-ha-lrm[43783]: Task still active, waiting
Oct  9 22:04:29 virt2n1-la kernel: [ 9018.495736] fwbr100i0: port 2(tap100i0) entered disabled state
Oct  9 22:04:29 virt2n1-la kernel: [ 9018.497498] fwbr100i0: port 1(fwln100i0) entered disabled state
Oct  9 22:04:29 virt2n1-la kernel: [ 9018.497707] vmbr0: port 3(fwpr100p0) entered disabled state
Oct  9 22:04:29 virt2n1-la kernel: [ 9018.497862] device fwln100i0 left promiscuous mode
Oct  9 22:04:29 virt2n1-la kernel: [ 9018.497498] fwbr100i0: port 1(fwln100i0) entered disabled state
Oct  9 22:04:29 virt2n1-la kernel: [ 9018.497707] vmbr0: port 3(fwpr100p0) entered disabled state
Oct  9 22:04:29 virt2n1-la kernel: [ 9018.497862] device fwln100i0 left promiscuous mode
Oct  9 22:04:29 virt2n1-la kernel: [ 9018.497863] fwbr100i0: port 1(fwln100i0) entered disabled state
Oct  9 22:04:29 virt2n1-la kernel: [ 9018.498158] device fwpr100p0 left promiscuous mode
Oct  9 22:04:29 virt2n1-la kernel: [ 9018.498160] vmbr0: port 3(fwpr100p0) entered disabled state
Oct  9 22:04:29 virt2n1-la kernel: [ 9018.688578] fwbr100i1: port 2(tap100i1) entered disabled state
Oct  9 22:04:29 virt2n1-la kernel: [ 9018.690246] fwbr100i1: port 1(fwln100i1) entered disabled state
Oct  9 22:04:29 virt2n1-la kernel: [ 9018.690449] vmbr1: port 2(fwpr100p1) entered disabled state
Oct  9 22:04:29 virt2n1-la kernel: [ 9018.690611] device fwln100i1 left promiscuous mode
Oct  9 22:04:29 virt2n1-la kernel: [ 9018.690612] fwbr100i1: port 1(fwln100i1) entered disabled state
Oct  9 22:04:29 virt2n1-la kernel: [ 9018.690965] device fwpr100p1 left promiscuous mode
Oct  9 22:04:29 virt2n1-la kernel: [ 9018.690966] vmbr1: port 2(fwpr100p1) entered disabled state
Oct  9 22:04:29 virt2n1-la pve-ha-lrm[43783]: Task still active, waiting
Oct  9 22:04:30 virt2n1-la pve-ha-lrm[43783]: Task still active, waiting
Oct  9 22:04:31 virt2n1-la ntpd[7311]: Deleting interface #111 fwpr100p1, fe80::20ff:3fff:fe3f:8696#123, interface stats: received=0, sent=0, dropped=0, active_time=80 secs
Oct  9 22:04:31 virt2n1-la ntpd[7311]: Deleting interface #110 fwln100i0, fe80::f8a4:70ff:fe6b:a1a3#123, interface stats: received=0, sent=0, dropped=0, active_time=80 secs
Oct  9 22:04:31 virt2n1-la ntpd[7311]: Deleting interface #109 tap100i1, fe80::f430:43ff:fe85:eda5#123, interface stats: received=0, sent=0, dropped=0, active_time=80 secs
Oct  9 22:04:31 virt2n1-la ntpd[7311]: Deleting interface #108 fwbr100i0, fe80::e8f2:d8ff:fe73:5140#123, interface stats: received=0, sent=0, dropped=0, active_time=80 secs
Oct  9 22:04:31 virt2n1-la ntpd[7311]: Deleting interface #107 tap100i0, fe80::8467:feff:fe10:c470#123, interface stats: received=0, sent=0, dropped=0, active_time=80 secs
Oct  9 22:04:31 virt2n1-la ntpd[7311]: Deleting interface #106 fwbr100i1, fe80::f0bb:17ff:fe40:f487#123, interface stats: received=0, sent=0, dropped=0, active_time=80 secs
Oct  9 22:04:31 virt2n1-la ntpd[7311]: Deleting interface #105 fwpr100p0, fe80::b043:31ff:fe57:5778#123, interface stats: received=0, sent=0, dropped=0, active_time=80 secs
Oct  9 22:04:31 virt2n1-la ntpd[7311]: Deleting interface #104 fwln100i1, fe80::e89c:2bff:fe2d:ebd7#123, interface stats: received=0, sent=0, dropped=0, active_time=80 secs
Oct  9 22:04:31 virt2n1-la ntpd[7311]: peers refreshed
Oct  9 22:04:31 virt2n1-la pve-ha-lrm[43784]: migration problems
Oct  9 22:04:31 virt2n1-la pve-ha-lrm[43783]: <root@pam> end task UPID:virt2n1-la:0000AB08:000DBD8B:56189C56:qmigrate:100:root@pam: migration problems




Syslog virt2n2-la (target node of this migration)

Code:
Oct  9 22:04:23 virt2n2-la qm[41978]: <root@pam> starting task UPID:virt2n2-la:0000A3FE:000CA705:56189C57:qmstart:100:root@pam:
Oct  9 22:04:23 virt2n2-la qm[41982]: start VM 100: UPID:virt2n2-la:0000A3FE:000CA705:56189C57:qmstart:100:root@pam:
Oct  9 22:04:23 virt2n2-la kernel: [ 8298.986081] device tap100i0 entered promiscuous mode
Oct  9 22:04:23 virt2n2-la kernel: [ 8298.999154] IPv6: ADDRCONF(NETDEV_UP): fwln100i0: link is not ready
Oct  9 22:04:23 virt2n2-la kernel: [ 8299.001568] IPv6: ADDRCONF(NETDEV_CHANGE): fwln100i0: link becomes ready
Oct  9 22:04:23 virt2n2-la kernel: [ 8299.003493] device fwln100i0 entered promiscuous mode
Oct  9 22:04:23 virt2n2-la kernel: [ 8299.003693] fwbr100i0: port 1(fwln100i0) entered forwarding state
Oct  9 22:04:23 virt2n2-la kernel: [ 8299.003698] fwbr100i0: port 1(fwln100i0) entered forwarding state
Oct  9 22:04:23 virt2n2-la kernel: [ 8299.005755] device fwpr100p0 entered promiscuous mode
Oct  9 22:04:23 virt2n2-la kernel: [ 8299.005845] vmbr0: port 2(fwpr100p0) entered forwarding state
Oct  9 22:04:23 virt2n2-la kernel: [ 8299.005852] vmbr0: port 2(fwpr100p0) entered forwarding state
Oct  9 22:04:23 virt2n2-la kernel: [ 8299.005852] vmbr0: port 2(fwpr100p0) entered forwarding state
Oct  9 22:04:23 virt2n2-la kernel: [ 8299.007633] fwbr100i0: port 2(tap100i0) entered forwarding state
Oct  9 22:04:23 virt2n2-la kernel: [ 8299.007651] fwbr100i0: port 2(tap100i0) entered forwarding state
Oct  9 22:04:24 virt2n2-la kernel: [ 8299.327133] device tap100i1 entered promiscuous mode
Oct  9 22:04:24 virt2n2-la kernel: [ 8299.339725] IPv6: ADDRCONF(NETDEV_UP): fwln100i1: link is not ready
Oct  9 22:04:24 virt2n2-la kernel: [ 8299.342323] IPv6: ADDRCONF(NETDEV_CHANGE): fwln100i1: link becomes ready
Oct  9 22:04:24 virt2n2-la kernel: [ 8299.344602] device fwln100i1 entered promiscuous mode
Oct  9 22:04:24 virt2n2-la kernel: [ 8299.344817] fwbr100i1: port 1(fwln100i1) entered forwarding state
Oct  9 22:04:24 virt2n2-la kernel: [ 8299.344835] fwbr100i1: port 1(fwln100i1) entered forwarding state
Oct  9 22:04:24 virt2n2-la kernel: [ 8299.346551] device fwpr100p1 entered promiscuous mode
Oct  9 22:04:24 virt2n2-la kernel: [ 8299.346782] vmbr1: port 2(fwpr100p1) entered forwarding state
Oct  9 22:04:24 virt2n2-la kernel: [ 8299.346798] vmbr1: port 2(fwpr100p1) entered forwarding state
Oct  9 22:04:24 virt2n2-la kernel: [ 8299.348441] fwbr100i1: port 2(tap100i1) entered forwarding state
Oct  9 22:04:24 virt2n2-la kernel: [ 8299.348448] fwbr100i1: port 2(tap100i1) entered forwarding state
Oct  9 22:04:24 virt2n2-la qm[41978]: <root@pam> end task UPID:virt2n2-la:0000A3FE:000CA705:56189C57:qmstart:100:root@pam: OK
Oct  9 22:04:28 virt2n2-la ntpd[4117]: Listen normally on 108 tap100i0 fe80::4032:98ff:fe87:eeac UDP 123
Oct  9 22:04:28 virt2n2-la ntpd[4117]: Listen normally on 109 fwln100i1 fe80::4c3a:93ff:fe81:443b UDP 123
Oct  9 22:04:28 virt2n2-la ntpd[4117]: Listen normally on 110 tap100i1 fe80::7051:11ff:fed2:566e UDP 123
Oct  9 22:04:28 virt2n2-la ntpd[4117]: Listen normally on 111 fwln100i0 fe80::7c20:dcff:fe33:9bc9 UDP 123
Oct  9 22:04:28 virt2n2-la ntpd[4117]: Listen normally on 112 fwpr100p1 fe80::9861:84ff:fef2:fa66 UDP 123
Oct  9 22:04:28 virt2n2-la ntpd[4117]: Listen normally on 113 fwbr100i1 fe80::1c87:bfff:fee4:4276 UDP 123
Oct  9 22:04:28 virt2n2-la ntpd[4117]: Listen normally on 114 fwbr100i0 fe80::94f8:f6ff:fe65:89e6 UDP 123
Oct  9 22:04:28 virt2n2-la ntpd[4117]: Listen normally on 115 fwpr100p0 fe80::5411:a6ff:fee1:a128 UDP 123
Oct  9 22:04:28 virt2n2-la ntpd[4117]: peers refreshed


Also, when I run pvecm status, I get this in syslog (I don't know if it is relevant):

Code:
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] IPC credentials authenticated (2261-61669-27)
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] connecting to client [61669]
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [MAIN  ] connection created
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [CMAP  ] lib_init_fn: conn=0x562af4146f30
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] IPC credentials authenticated (2261-61669-28)
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] connecting to client [61669]
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [MAIN  ] connection created
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QUORUM] lib_init_fn: conn=0x562af4148530
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QUORUM] got quorum_type request on 0x562af4148530
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] IPC credentials authenticated (2261-61669-29)
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] connecting to client [61669]
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [MAIN  ] connection created
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] IPC credentials authenticated (2261-61669-30)
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] connecting to client [61669]
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [MAIN  ] connection created
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QUORUM] got quorate request on 0x562af4148530
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QUORUM] got trackstart request on 0x562af4148530
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QUORUM] sending initial status to 0x562af4148530
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QUORUM] sending quorum notification to 0x562af4148530, length = 64
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QUORUM] got trackstop request on 0x562af4148530
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [VOTEQ ] got getinfo request on 0x562af414bdc0 for node 2
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [VOTEQ ] getinfo response error: 1
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [VOTEQ ] got getinfo request on 0x562af414bdc0 for node 1
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [VOTEQ ] getinfo response error: 1
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [VOTEQ ] got getinfo request on 0x562af414bdc0 for node 2
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [VOTEQ ] getinfo response error: 1
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [VOTEQ ] got getinfo request on 0x562af414bdc0 for node 3
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [VOTEQ ] getinfo response error: 1
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [VOTEQ ] got getinfo request on 0x562af414bdc0 for node 4
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [VOTEQ ] getinfo response error: 1
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] HUP conn (2261-61669-27)
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] qb_ipcs_disconnect(2261-61669-27) state:2
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] epoll_ctl(del): Bad file descriptor (9)
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [MAIN  ] cs_ipcs_connection_closed()
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [CMAP  ] exit_fn for conn=0x562af4146f30
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [MAIN  ] cs_ipcs_connection_destroyed()
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] Free'ing ringbuffer: /dev/shm/qb-cmap-response-2261-61669-27-header
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] Free'ing ringbuffer: /dev/shm/qb-cmap-event-2261-61669-27-header
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] Free'ing ringbuffer: /dev/shm/qb-cmap-request-2261-61669-27-header
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] HUP conn (2261-61669-28)
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] qb_ipcs_disconnect(2261-61669-28) state:2
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] epoll_ctl(del): Bad file descriptor (9)
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [MAIN  ] cs_ipcs_connection_closed()
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QUORUM] lib_exit_fn: conn=0x562af4148530
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [MAIN  ] cs_ipcs_connection_destroyed()
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] Free'ing ringbuffer: /dev/shm/qb-quorum-response-2261-61669-28-header
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] Free'ing ringbuffer: /dev/shm/qb-quorum-event-2261-61669-28-header
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] Free'ing ringbuffer: /dev/shm/qb-quorum-request-2261-61669-28-header
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] HUP conn (2261-61669-29)
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] qb_ipcs_disconnect(2261-61669-29) state:2
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] epoll_ctl(del): Bad file descriptor (9)
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [MAIN  ] cs_ipcs_connection_closed()
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [MAIN  ] cs_ipcs_connection_destroyed()
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] Free'ing ringbuffer: /dev/shm/qb-cfg-response-2261-61669-29-header
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] Free'ing ringbuffer: /dev/shm/qb-cfg-event-2261-61669-29-header
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] Free'ing ringbuffer: /dev/shm/qb-cfg-request-2261-61669-29-header
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] HUP conn (2261-61669-30)
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] qb_ipcs_disconnect(2261-61669-30) state:2
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] epoll_ctl(del): Bad file descriptor (9)
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [MAIN  ] cs_ipcs_connection_closed()
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [MAIN  ] cs_ipcs_connection_destroyed()
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] Free'ing ringbuffer: /dev/shm/qb-votequorum-response-2261-61669-30-header
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] Free'ing ringbuffer: /dev/shm/qb-votequorum-event-2261-61669-30-header
Oct  9 22:48:44 virt2n2-la corosync[2261]:  [QB    ] Free'ing ringbuffer: /dev/shm/qb-votequorum-request-2261-61669-30-header



I tried changing hardware and options on the VM. I tested migration with and without HA; the problem can be reproduced in both scenarios.

I don't remember having these issues when I tested 3.4.
 
I am at a loss.

Can someone at least tell me what this error could mean?

Oct 10 02:44:26 ERROR: unable to find configuration file for VM 100 - no such machine
Oct 10 02:44:26 ERROR: command '/usr/bin/ssh -o 'BatchMode=yes' root@38.102.250.228 qm resume 100 --skiplock' failed: exit code 2
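
As far as I understand the layout (an assumption on my part), the VM config lives in pmxcfs under /etc/pve/nodes/<node>/qemu-server/100.conf and is moved between node directories during migration, so at least I can check where it currently sits:

Code:
# check which node directory currently holds the config (standard /etc/pve layout assumed)
ls -l /etc/pve/nodes/*/qemu-server/100.conf
# /etc/pve/qemu-server is just the view of the local node's directory
ls -l /etc/pve/qemu-server/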


 
It does not seem to be a qemu bug, as the migration itself finishes correctly,

but

Code:
Oct 09 22:04:29 ERROR: unable to find configuration file for VM 100 - no such machine
Oct 09 22:04:29 ERROR: command '/usr/bin/ssh -o 'BatchMode=yes' root@38.102.250.229 qm resume 100 --skiplock' failed: exit code 2
Oct 09 22:04:31 ERROR: migration finished with problems (duration 00:00:09)

Maybe it is HA related?
 
I don't think so. Also, the user wrote:

I tested migration with and without HA. It can be reproduced in both scenarios.
 
There seems to be somewhat different behavior on different CPUs. Two nodes have older CPUs, and there, while the error is the same, resuming puts KVM into an error state; I have to reset KVM, and then resuming works, i.e. the VM starts running again.

I just tried the same setup on 3.4.11 and migration went without a problem.

Is there anything I can do to troubleshoot?

I could try switching to the native Debian kernel if that helps.

Thank you in advance.

 
You are very lucky if you can live-migrate a KVM machine between different CPU types/architectures using "cpu=host", as the features never match. Using qemu64 as a common denominator is the best option.
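
For example, a minimal sketch of switching the CPU type with qm set (adjust the VM ID; the change takes effect after a full stop/start of the VM):

Code:
# switch the VM from cpu=host to a generic model
qm set 100 --cpu qemu64
# or the kvm64 default
qm set 100 --cpu kvm64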
 
I would perhaps agree if the nodes were all different, but I have two pairs of servers, and each pair has the same type of CPU, i.e. two older and two newer servers. So the problem happens regardless; it just differs a bit when I switch between the old and new machines. I cannot run qemu64 (it gives errors and the VM cannot start), but I tried the kvm64 default and it makes no difference.
 
Dear Proxmox developers, please help me get to the bottom of this. I have this problem on Dell PE630 and PE610 servers. Is there a way to debug qemu and see what causes the errors? I also noticed that after a reboot, migration works once and then it doesn't. It really makes no sense at this point. This is a clean install of release 4.0, and other than one VM and one CT there is nothing custom or unusual in the settings, as far as I know.

Oct 09 22:04:29 ERROR: unable to find configuration file for VM 100 - no such machine
Oct 09 22:04:29 ERROR: command '/usr/bin/ssh -o 'BatchMode=yes' root@38.102.250.229 qm resume 100 --skiplock' failed: exit code 2
 
On the target machine, in the error case, I see the following in syslog. Could this be a systemd issue?

Oct 10 12:30:24 virt2n2-la pmxcfs[1580]: [status] notice: received log
Oct 10 12:30:25 virt2n2-la qm[11675]: <root@pam> starting task UPID:virt2n2-la:00002DA4:000516CC:56196751:qmstart:100:root@pam:
Oct 10 12:30:25 virt2n2-la qm[11684]: start VM 100: UPID:virt2n2-la:00002DA4:000516CC:56196751:qmstart:100:root@pam:
Oct 10 12:30:25 virt2n2-la systemd[1]: [/run/systemd/system/100.scope.d/50-Description.conf:3] Missing '='.

Oct 10 12:30:25 virt2n2-la kernel: [ 3338.387623] device tap100i0 entered promiscuous mode
Oct 10 12:30:25 virt2n2-la kernel: [ 3338.400864] IPv6: ADDRCONF(NETDEV_UP): fwln100i0: link is not ready
Oct 10 12:30:25 virt2n2-la kernel: [ 3338.403230] IPv6: ADDRCONF(NETDEV_CHANGE): fwln100i0: link becomes ready
Oct 10 12:30:25 virt2n2-la kernel: [ 3338.404937] device fwln100i0 entered promiscuous mode
Oct 10 12:30:25 virt2n2-la kernel: [ 3338.405188] fwbr100i0: port 1(fwln100i0) entered forwarding state
Oct 10 12:30:25 virt2n2-la kernel: [ 3338.405193] fwbr100i0: port 1(fwln100i0) entered forwarding state
Oct 10 12:30:25 virt2n2-la kernel: [ 3338.407290] device fwpr100p0 entered promiscuous mode
Oct 10 12:30:25 virt2n2-la kernel: [ 3338.407580] vmbr0: port 2(fwpr100p0) entered forwarding state
Oct 10 12:30:25 virt2n2-la kernel: [ 3338.407585] vmbr0: port 2(fwpr100p0) entered forwarding state
Oct 10 12:30:25 virt2n2-la kernel: [ 3338.409751] fwbr100i0: port 2(tap100i0) entered forwarding state
Oct 10 12:30:25 virt2n2-la kernel: [ 3338.409756] fwbr100i0: port 2(tap100i0) entered forwarding state
Oct 10 12:30:26 virt2n2-la kernel: [ 3338.793683] device tap100i1 entered promiscuous mode
Oct 10 12:30:26 virt2n2-la kernel: [ 3338.807250] IPv6: ADDRCONF(NETDEV_UP): fwln100i1: link is not ready
Oct 10 12:30:26 virt2n2-la kernel: [ 3338.809806] IPv6: ADDRCONF(NETDEV_CHANGE): fwln100i1: link becomes ready
Oct 10 12:30:26 virt2n2-la kernel: [ 3338.812231] device fwln100i1 entered promiscuous mode
Oct 10 12:30:26 virt2n2-la kernel: [ 3338.812506] fwbr100i1: port 1(fwln100i1) entered forwarding state
Oct 10 12:30:26 virt2n2-la kernel: [ 3338.812514] fwbr100i1: port 1(fwln100i1) entered forwarding state
Oct 10 12:30:26 virt2n2-la kernel: [ 3338.814777] device fwpr100p1 entered promiscuous mode
Oct 10 12:30:26 virt2n2-la kernel: [ 3338.815076] vmbr1: port 2(fwpr100p1) entered forwarding state
Oct 10 12:30:26 virt2n2-la kernel: [ 3338.815084] vmbr1: port 2(fwpr100p1) entered forwarding state
Oct 10 12:30:26 virt2n2-la kernel: [ 3338.817285] fwbr100i1: port 2(tap100i1) entered forwarding state
Oct 10 12:30:26 virt2n2-la kernel: [ 3338.817293] fwbr100i1: port 2(tap100i1) entered forwarding state
Oct 10 12:30:26 virt2n2-la qm[11675]: <root@pam> end task UPID:virt2n2-la:00002DA4:000516CC:56196751:qmstart:100:root@pam: OK
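
To see whether that "Missing '='" warning points at a malformed scope drop-in, the file from the log above can simply be inspected (just an inspection step, nothing more):

Code:
# inspect the drop-in that systemd complains about; line 3 is the one flagged
cat /run/systemd/system/100.scope.d/50-Description.conf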
 
I ran a few more tests, and it actually comes down to HA being the root of the problem.

Here is what I did.

I cloned VM 100 to VM 101. I removed VM 100 from the HA group, deleted the resource, and then deleted VM 100 completely.

I started VM 101 and did 5-6 successful migrations (CPU type kvm64, no HA at this point).

I added VM 101 to HA as a resource and assigned it to a group (the group actually spans all 4 nodes).

The first migration from node 1 (virt2n1-la) to node 2 (virt2n2-la) was successful. The second migration, from node 2 back to node 1, failed with the same errors, and the same happened going from node 1 back to node 2.

I removed VM 101 from the group and removed the HA resource. The migration, while showing no sign of an HA task, failed again.

I stopped VM 101 and migrated it offline to node 1, just in case. I then started VM 101 and attempted migrations back and forth across all 4 nodes. No problem.

So my earlier claim of reproducing this in non-HA mode only held when I did not restart the VM after removing it from HA.
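
For completeness, the HA add/remove steps above were roughly the following (a sketch, syntax from memory; the group name is a placeholder for my actual group):

Code:
# add the VM as an HA resource and assign it to a group spanning all 4 nodes
ha-manager add vm:101 --group mygroup
# ... run the test migrations ...
# take it out of HA again
ha-manager remove vm:101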

Unfortunately, these are all just non-scientific observations; it would be good to be able to run a trace/debug showing what actually happens there. Can the devs help?
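
In case it helps to reproduce this, here is roughly what I can run to gather more detail (assuming the standard CLI tools; suggestions for better tracing are welcome):

Code:
# run the migration from the CLI so the full task log shows up on the console
qm migrate 101 virt2n2-la --online
# in parallel, follow the HA and cluster services on source and target
journalctl -f -u pve-ha-lrm -u pve-ha-crm -u pve-cluster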
 
I am about to give up for now, but it feels like a certain timer in HA mode tells qemu not to resume. I ran multiple migrations in HA mode, and sometimes, going from the faster to the slower node (hardware/CPU-wise), I had a successful migration. The logs I watch (syslog) show no difference between a good and a bad migration other than the messages I listed in the previous post.

I appreciate your attention to this problem.
 
Did you stop/start all your VMs after you moved away from cpu=host?
 
BTW,

Regardless of how I do the migration (via qm, ha-manager, or the web UI), I notice the web interface briefly pops up an error 500, "machine 101 doesn't exist". The message then disappears. I am not sure whether this has any correlation with the problem or is part of the normal process.
 
It looks like this message is related to VM 101 no longer being present on the source node. Could it be that during a migration with HA, once the machine is no longer on the source, your system thinks it is not there at all and throws an exception?

Error: unable to find configuration file for VM 101 - no such machine

Which in turn prevents resuming

Error: command '/usr/bin/ssh -o 'BatchMode=yes' root@38.102.250.229 qm resume 101 --skiplock' failed: exit code 2

In reality everything works: I can run the ssh command manually just fine and the VM resumes.
 
