Hello!
My cluster consists of 5 nodes, each using local LVM-thin storage. Migration traffic goes over a dedicated 10 Gbit network that is used for migrations only. The cluster runs Proxmox in production.
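In case it matters, the dedicated migration network is pinned in /etc/pve/datacenter.cfg roughly like below. Treat this as a minimal sketch: the CIDR is only inferred from the 192.168.124.x addresses in the migration log further down, and "secure" is the transport type we would expect, not a confirmed detail.

# /etc/pve/datacenter.cfg -- keep migration traffic on the dedicated 10G subnet
migration: secure,network=192.168.124.0/24

With this set, live migrations pick an address from that subnet instead of the cluster network.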
All hosts have the same configuration:
proxmox-ve: 6.0-2 (running kernel: 5.0.21-5-pve)
pve-manager: 6.0-9 (running version: 6.0-9/508dcee0)
pve-kernel-helper: 6.1-1
pve-kernel-5.0: 6.0-9
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-3-pve: 5.0.21-7
ceph-fuse: 12.2.12-pve1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-5
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.0-9
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-65
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
openvswitch-switch: 2.10.0+2018.08.28+git.8ca7c82b7d+ds1-12+deb10u1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-1
pve-cluster: 6.0-7
pve-container: 3.0-7
pve-docs: 6.0-7
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-4
pve-ha-manager: 3.0-3
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.1-3
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-9
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
After upgrading Proxmox from 5.2 to 6.0, problems with online migration began:
2020-01-09 15:35:41 start remote tunnel
2020-01-09 15:35:41 ssh tunnel ver 1
2020-01-09 15:35:41 starting storage migration
2020-01-09 15:35:41 scsi0: start migration to nbd:192.168.124.110:60001:exportname=drive-scsi0
drive mirror is starting for drive-scsi0
drive-scsi0: transferred: 133169152 bytes remaining: 21341667328 bytes total: 21474836480 bytes progression: 0.62 % busy: 1 ready: 0
drive-scsi0: transferred: 492830720 bytes remaining: 20982005760 bytes total: 21474836480 bytes progression: 2.29 % busy: 1 ready: 0
...
drive-scsi0: transferred: 20392706048 bytes remaining: 1083047936 bytes total: 21475753984 bytes progression: 94.96 % busy: 1 ready: 0
drive-scsi0: transferred: 20715667456 bytes remaining: 760086528 bytes total: 21475753984 bytes progression: 96.46 % busy: 1 ready: 0
drive-scsi0: transferred: 21053308928 bytes remaining: 422510592 bytes total: 21475819520 bytes progression: 98.03 % busy: 1 ready: 0
drive-scsi0: transferred: 21377318912 bytes remaining: 98500608 bytes total: 21475819520 bytes progression: 99.54 % busy: 1 ready: 0
drive-scsi0: transferred: 21475819520 bytes remaining: 0 bytes total: 21475819520 bytes progression: 100.00 % busy: 0 ready: 1
all mirroring jobs are ready
2020-01-09 15:36:57 starting online/live migration on tcp:192.168.124.110:60000
2020-01-09 15:36:57 migrate_set_speed: 8589934592
2020-01-09 15:36:57 migrate_set_downtime: 0.1
2020-01-09 15:36:57 set migration_caps
2020-01-09 15:36:57 set cachesize: 1073741824
2020-01-09 15:36:57 start migrate command to tcp:192.168.124.110:60000
2020-01-09 15:36:58 migration status: active (transferred 398608080, remaining 4763852800), total 6460088320)
2020-01-09 15:36:58 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 0 overflow 0
2020-01-09 15:36:59 migration status: active (transferred 894569980, remaining 3994193920), total 6460088320)
2020-01-09 15:36:59 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 0 overflow 0
2020-01-09 15:37:00 migration status: active (transferred 1428725298, remaining 3411193856), total 6460088320)
2020-01-09 15:37:00 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 0 overflow 0
2020-01-09 15:37:01 migration status: active (transferred 1858148658, remaining 1856421888), total 6460088320)
2020-01-09 15:37:01 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 0 overflow 0
2020-01-09 15:37:02 migration status: active (transferred 2307665402, remaining 1127608320), total 6460088320)
2020-01-09 15:37:02 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 0 overflow 0
2020-01-09 15:37:03 migration status: active (transferred 2782963491, remaining 540790784), total 6460088320)
2020-01-09 15:37:03 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 0 overflow 0
query migrate failed: VM 1320000 qmp command 'query-migrate' failed - client closed connection
2020-01-09 15:37:04 query migrate failed: VM 1320000 qmp command 'query-migrate' failed - client closed connection
query migrate failed: VM 1320000 not running
2020-01-09 15:37:06 query migrate failed: VM 1320000 not running
query migrate failed: VM 1320000 not running
2020-01-09 15:37:08 query migrate failed: VM 1320000 not running
query migrate failed: VM 1320000 not running
2020-01-09 15:37:10 query migrate failed: VM 1320000 not running
query migrate failed: VM 1320000 not running
2020-01-09 15:37:12 query migrate failed: VM 1320000 not running
query migrate failed: VM 1320000 not running
2020-01-09 15:37:14 query migrate failed: VM 1320000 not running
2020-01-09 15:37:14 ERROR: online migrate failure - too many query migrate failures - aborting
2020-01-09 15:37:14 aborting phase 2 - cleanup resources
2020-01-09 15:37:14 migrate_cancel
2020-01-09 15:37:14 migrate_cancel error: VM 1320000 not running
drive-scsi0: Cancelling block job
2020-01-09 15:37:14 ERROR: VM 1320000 not running
2020-01-09 15:37:22 ERROR: migration finished with problems (duration 00:01:44)
TASK ERROR: migration problems
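For anyone reading along: the "VM 1320000 not running" messages right after the QMP connection closes suggest the source QEMU process itself dies during the RAM transfer, so the source node's journal around that moment should show why (segfault, OOM kill, assertion). A rough check with standard tools, using the timestamps from the log above:

# on the source node, around the time of the abort
journalctl --since "2020-01-09 15:36:00" --until "2020-01-09 15:38:00" | grep -iE 'qemu|kvm|segfault|assert|oom'
# kernel-level hints (segfaults and OOM kills also land here)
dmesg -T | grep -iE 'segfault|oom|kvm'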
As you can see from the log, the block device transfer completes without problems, but during the RAM transfer the connection terminates. Moreover, after the transfer breaks the VM abruptly powers off, and the only remedy is to start it again manually.
We have tried changing the "migration_caps" configuration (enabling/disabling xbzrle, auto-converge, etc. in various combinations), changed the configuration of the affected VM, traced some parameters (based on https://github.com/proxmox/qemu/blob/master/trace-events), upgraded the cluster to version 6.0-9, upgraded/reinstalled the QEMU guest agent, and reinstalled the OS and Proxmox across the whole cluster, but nothing has helped. The problem appears completely at random, regardless of whether the guest OS is Linux or Windows.
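If anyone wants to poke at the same knobs, this is one way to reproduce the failing migration and inspect the per-VM migration capabilities from the CLI. The VM ID is taken from the log above, the target node is a placeholder, and turning xbzrle off is only shown as an illustration, not a known fix:

# reproduce the live migration of the affected VM, local disks included
qm migrate 1320000 <target-node> --online --with-local-disks

# inspect / toggle migration capabilities on the running VM via the HMP monitor
qm monitor 1320000
# at the monitor prompt:
info migrate_capabilities
migrate_set_capability xbzrle off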
I really need help: this is a serious problem and it is disrupting our services.