PVE 6.0: Problems with live migration on local storage

e.helix

Member
Jan 15, 2019
Hello!
My cluster consists of 5 nodes. Each host uses local LVM-thin storage. Migration traffic runs over a dedicated 10G network that is used for nothing else. Proxmox is used in production.
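For context, the dedicated migration network is wired in via the usual datacenter-wide setting, roughly like this (a sketch only; the subnet is the one visible in the log below, and "targetnode" is just a placeholder):
Code:
# /etc/pve/datacenter.cfg - send migration traffic over the dedicated 10G network
migration: secure,network=192.168.124.0/24

# equivalent per-migration form; local LVM-thin disks need --with-local-disks
qm migrate 1320000 targetnode --online --with-local-disks --migration_network 192.168.124.0/24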
All hosts have the same configuration:
proxmox-ve: 6.0-2 (running kernel: 5.0.21-5-pve)
pve-manager: 6.0-9 (running version: 6.0-9/508dcee0)
pve-kernel-helper: 6.1-1
pve-kernel-5.0: 6.0-9
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-3-pve: 5.0.21-7
ceph-fuse: 12.2.12-pve1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-5
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.0-9
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-65
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
openvswitch-switch: 2.10.0+2018.08.28+git.8ca7c82b7d+ds1-12+deb10u1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-1
pve-cluster: 6.0-7
pve-container: 3.0-7
pve-docs: 6.0-7
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-4
pve-ha-manager: 3.0-3
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.1-3
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-9
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1

After updating Proxmox from 5.2 to 6.0, problems with online migration began:
2020-01-09 15:35:41 start remote tunnel
2020-01-09 15:35:41 ssh tunnel ver 1
2020-01-09 15:35:41 starting storage migration
2020-01-09 15:35:41 scsi0: start migration to nbd:192.168.124.110:60001:exportname=drive-scsi0
drive mirror is starting for drive-scsi0
drive-scsi0: transferred: 133169152 bytes remaining: 21341667328 bytes total: 21474836480 bytes progression: 0.62 % busy: 1 ready: 0
drive-scsi0: transferred: 492830720 bytes remaining: 20982005760 bytes total: 21474836480 bytes progression: 2.29 % busy: 1 ready: 0
...
drive-scsi0: transferred: 20392706048 bytes remaining: 1083047936 bytes total: 21475753984 bytes progression: 94.96 % busy: 1 ready: 0
drive-scsi0: transferred: 20715667456 bytes remaining: 760086528 bytes total: 21475753984 bytes progression: 96.46 % busy: 1 ready: 0
drive-scsi0: transferred: 21053308928 bytes remaining: 422510592 bytes total: 21475819520 bytes progression: 98.03 % busy: 1 ready: 0
drive-scsi0: transferred: 21377318912 bytes remaining: 98500608 bytes total: 21475819520 bytes progression: 99.54 % busy: 1 ready: 0
drive-scsi0: transferred: 21475819520 bytes remaining: 0 bytes total: 21475819520 bytes progression: 100.00 % busy: 0 ready: 1
all mirroring jobs are ready
2020-01-09 15:36:57 starting online/live migration on tcp:192.168.124.110:60000
2020-01-09 15:36:57 migrate_set_speed: 8589934592
2020-01-09 15:36:57 migrate_set_downtime: 0.1
2020-01-09 15:36:57 set migration_caps
2020-01-09 15:36:57 set cachesize: 1073741824
2020-01-09 15:36:57 start migrate command to tcp:192.168.124.110:60000
2020-01-09 15:36:58 migration status: active (transferred 398608080, remaining 4763852800), total 6460088320)
2020-01-09 15:36:58 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 0 overflow 0
2020-01-09 15:36:59 migration status: active (transferred 894569980, remaining 3994193920), total 6460088320)
2020-01-09 15:36:59 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 0 overflow 0
2020-01-09 15:37:00 migration status: active (transferred 1428725298, remaining 3411193856), total 6460088320)
2020-01-09 15:37:00 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 0 overflow 0
2020-01-09 15:37:01 migration status: active (transferred 1858148658, remaining 1856421888), total 6460088320)
2020-01-09 15:37:01 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 0 overflow 0
2020-01-09 15:37:02 migration status: active (transferred 2307665402, remaining 1127608320), total 6460088320)
2020-01-09 15:37:02 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 0 overflow 0
2020-01-09 15:37:03 migration status: active (transferred 2782963491, remaining 540790784), total 6460088320)
2020-01-09 15:37:03 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 0 overflow 0
query migrate failed: VM 1320000 qmp command 'query-migrate' failed - client closed connection

2020-01-09 15:37:04 query migrate failed: VM 1320000 qmp command 'query-migrate' failed - client closed connection
query migrate failed: VM 1320000 not running

2020-01-09 15:37:06 query migrate failed: VM 1320000 not running
query migrate failed: VM 1320000 not running

2020-01-09 15:37:08 query migrate failed: VM 1320000 not running
query migrate failed: VM 1320000 not running

2020-01-09 15:37:10 query migrate failed: VM 1320000 not running
query migrate failed: VM 1320000 not running

2020-01-09 15:37:12 query migrate failed: VM 1320000 not running
query migrate failed: VM 1320000 not running

2020-01-09 15:37:14 query migrate failed: VM 1320000 not running
2020-01-09 15:37:14 ERROR: online migrate failure - too many query migrate failures - aborting
2020-01-09 15:37:14 aborting phase 2 - cleanup resources
2020-01-09 15:37:14 migrate_cancel
2020-01-09 15:37:14 migrate_cancel error: VM 1320000 not running
drive-scsi0: Cancelling block job
2020-01-09 15:37:14 ERROR: VM 1320000 not running
2020-01-09 15:37:22 ERROR: migration finished with problems (duration 00:01:44)

TASK ERROR: migration problems
As you can see from the log, the block-device transfer completes without problems, but during the RAM transfer the connection terminates. Moreover, after the transfer breaks off, the VM abruptly turns off on the source, and the only remedy is to start it again manually.
We tried changing the "migration_caps" configuration (enabling/disabling xbzrle, auto-converge, etc. in various combinations), changed the configuration of the affected VM, traced some parameters (based on https://github.com/proxmox/qemu/blob/master/trace-events), upgraded the cluster to version 6.0-9, upgraded/reinstalled the qemu-guest-agent, and reinstalled the OS and Proxmox across the whole cluster, but nothing helped. The problem appears completely at random and regardless of which guest OS is used, Linux or Windows.
I really need help; the problem is serious and is disrupting our services.
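For anyone who wants to reproduce this: the capabilities we toggled can be inspected on the running VM via the QEMU monitor, roughly like this (the VM ID is the one from the log above; the exact output depends on the QEMU build):
Code:
# open the QEMU human monitor of the affected VM on the source node
qm monitor 1320000
# at the qm> prompt:
info migrate_capabilities    # shows xbzrle, auto-converge, etc. and whether they are on
info migrate_parameters      # downtime limit, bandwidth, cache size
info migrate                 # live status/statistics during a migration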
 
Is the firewall enabled on the nodes?
Please try it with the latest version.
 
It helps narrow down the issue (perhaps there is a fix already in a newer version of one of the packages). If there's no fix, it makes it easier for us to reproduce it.
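To check both points quickly on each node, roughly:
Code:
pveversion -v                    # compare package versions across the nodes
pve-firewall status              # shows whether the PVE firewall is running/enabled here
apt update && apt dist-upgrade   # pull the latest PVE 6.x packages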
 
Hello! After updating Proxmox to version 6.1 the migration problems have gone away. It has now been two months without a single failure.
 
I think I'm seeing the same issue here.
I'm upgrading my cluster from 5.x (latest) to 6.x (latest), and for that I'm migrating all VMs to an already upgraded server.

The failure occurs roughly when the memory transfer is finished. The VM then stops and remains on the source server; I have to restart it manually.

Running the same migration command again works. And not all VMs are affected; some migrate just fine. I'm in the dark as to why...

On the target it tries to clean up the disk volumes, but fails at that too, so I remove them manually (see the sketch below).
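For the record, the manual cleanup on the target looks roughly like this (volume names taken from the syslog below; only after making sure no leftover kvm process still holds them):
Code:
# on the target node: confirm nothing still holds the volumes, then drop them
fuser -v /dev/hwraid/vm-195-disk-0 /dev/hwraid/vm-195-disk-1
pvesm free thin_pool_hwraid:vm-195-disk-0
pvesm free thin_pool_hwraid:vm-195-disk-1
# or directly via LVM
lvremove hwraid/vm-195-disk-0 hwraid/vm-195-disk-1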

Any clue what's happening?

I have one more server to empty before I can upgrade it to 6.x.


Source (cli):
Code:
2020-04-13 17:16:15 migration status: active (transferred 15438478930, remaining 56147968), total 16794853376)
2020-04-13 17:16:15 migration xbzrle cachesize: 2147483648 transferred 59977372 pages 35970 cachemiss 253905 overflow 706
2020-04-13 17:16:16 migration status: active (transferred 15450512497, remaining 29057024), total 16794853376)
2020-04-13 17:16:16 migration xbzrle cachesize: 2147483648 transferred 67073774 pages 39333 cachemiss 254937 overflow 876
2020-04-13 17:16:16 migration status: active (transferred 15462509367, remaining 31367168), total 16794853376)
2020-04-13 17:16:16 migration xbzrle cachesize: 2147483648 transferred 74193389 pages 45632 cachemiss 255992 overflow 1010
2020-04-13 17:16:16 migration status: active (transferred 15474388214, remaining 8556544), total 16794853376)
2020-04-13 17:16:16 migration xbzrle cachesize: 2147483648 transferred 80553705 pages 54346 cachemiss 257078 overflow 1269
2020-04-13 17:16:16 migration status: active (transferred 15483154620, remaining 28991488), total 16794853376)
2020-04-13 17:16:16 migration xbzrle cachesize: 2147483648 transferred 87054418 pages 73900 cachemiss 257553 overflow 1345
2020-04-13 17:16:16 migrate_set_downtime: 0.2
query migrate failed: VM 195 qmp command 'query-migrate' failed - client closed connection

2020-04-13 17:16:18 query migrate failed: VM 195 qmp command 'query-migrate' failed - client closed connection
query migrate failed: VM 195 not running

2020-04-13 17:16:19 query migrate failed: VM 195 not running
query migrate failed: VM 195 not running

2020-04-13 17:16:20 query migrate failed: VM 195 not running
query migrate failed: VM 195 not running

2020-04-13 17:16:21 query migrate failed: VM 195 not running
query migrate failed: VM 195 not running

2020-04-13 17:16:22 query migrate failed: VM 195 not running
query migrate failed: VM 195 not running

2020-04-13 17:16:24 query migrate failed: VM 195 not running
2020-04-13 17:16:24 ERROR: online migrate failure - too many query migrate failures - aborting
2020-04-13 17:16:24 aborting phase 2 - cleanup resources
2020-04-13 17:16:24 migrate_cancel
2020-04-13 17:16:24 migrate_cancel error: VM 195 not running
drive-scsi0: Cancelling block job
drive-scsi1: Cancelling block job
2020-04-13 17:16:24 ERROR: VM 195 not running
2020-04-13 17:16:37 ERROR: migration finished with problems (duration 00:32:23)
migration problems



Source (syslog):
Code:
Apr 13 17:16:17 sourcenode qmeventd[12655]: Starting cleanup for 195
Apr 13 17:16:17 sourcenode qmeventd[12655]: trying to acquire lock...
Apr 13 17:16:18 sourcenode kernel: [26936572.965579] vmbr402: port 4(tap195i0) entered disabled state
Apr 13 17:16:18 sourcenode kernel: [26936572.965901] vmbr402: port 4(tap195i0) entered disabled state
Apr 13 17:16:18 sourcenode qm[15114]: VM 195 qmp command failed - VM 195 qmp command 'query-migrate' failed - client closed connection
Apr 13 17:16:19 sourcenode qm[15114]: VM 195 qmp command failed - VM 195 not running
Apr 13 17:16:20 sourcenode qm[15114]: VM 195 qmp command failed - VM 195 not running
Apr 13 17:16:21 sourcenode qm[15114]: VM 195 qmp command failed - VM 195 not running
Apr 13 17:16:22 sourcenode qm[15114]: VM 195 qmp command failed - VM 195 not running
Apr 13 17:16:24 sourcenode qm[15114]: VM 195 qmp command failed - VM 195 not running
Apr 13 17:16:24 sourcenode qm[15114]: VM 195 qmp command failed - VM 195 not running
Apr 13 17:16:24 sourcenode qm[15114]: VM 195 qmp command failed - VM 195 not running
Apr 13 17:16:24 sourcenode qm[15114]: VM 195 qmp command failed - VM 195 not running
Apr 13 17:16:24 sourcenode qm[15114]: VM 195 qmp command failed - VM 195 not running
Apr 13 17:16:24 sourcenode pmxcfs[32003]: [status] notice: received log
Apr 13 17:16:27 sourcenode qmeventd[12655]: can't lock file '/var/lock/qemu-server/lock-195.conf' - got timeout
Apr 13 17:16:29 sourcenode pmxcfs[32003]: [status] notice: received log
Apr 13 17:16:30 sourcenode pmxcfs[32003]: [status] notice: received log
Apr 13 17:16:35 sourcenode pmxcfs[32003]: [status] notice: received log
Apr 13 17:16:36 sourcenode pmxcfs[32003]: [status] notice: received log
Apr 13 17:16:36 sourcenode pmxcfs[32003]: [status] notice: received log
Apr 13 17:16:37 sourcenode qm[15114]: migration problems
Apr 13 17:16:37 sourcenode qm[15113]: <root@pam> end task UPID:sourcenode:00003B0A:A08AE770:5E947ABE:qmigrate:195:root@pam: migration problems


Target (syslog):
Code:
Apr 13 17:16:24 targetnode pvesm[5572]: <root@pam> starting task UPID:targetnode:000015C5:001DAAA2:5E948248:imgdel:195@thin_pool_hwraid:root@pam:
Apr 13 17:16:26 targetnode corosync[3262]:   [KNET  ] pmtud: Starting PMTUD for host: 5 link: 0
Apr 13 17:16:26 targetnode corosync[3262]:   [KNET  ] udp: detected kernel MTU: 1500
Apr 13 17:16:26 targetnode corosync[3262]:   [KNET  ] pmtud: PMTUD completed for host: 5 link: 0 current link mtu: 1397
Apr 13 17:16:26 targetnode corosync[3262]:   [KNET  ] pmtud: Starting PMTUD for host: 1 link: 0
Apr 13 17:16:26 targetnode corosync[3262]:   [KNET  ] udp: detected kernel MTU: 1500
Apr 13 17:16:26 targetnode corosync[3262]:   [KNET  ] pmtud: PMTUD completed for host: 1 link: 0 current link mtu: 1397
Apr 13 17:16:29 targetnode pvesm[5573]: lvremove 'hwraid/vm-195-disk-1' error:   Logical volume hwraid/vm-195-disk-1 in use.
Apr 13 17:16:29 targetnode pvesm[5572]: <root@pam> end task UPID:targetnode:000015C5:001DAAA2:5E948248:imgdel:195@thin_pool_hwraid:root@pam: lvremove 'hwraid/vm-195-disk-1' error:   Logical volume hwraid/vm-195-disk-1 in use.
Apr 13 17:16:29 targetnode systemd[1]: session-1356.scope: Succeeded.
Apr 13 17:16:30 targetnode corosync[3262]:   [KNET  ] pmtud: Starting PMTUD for host: 6 link: 0
Apr 13 17:16:30 targetnode corosync[3262]:   [KNET  ] udp: detected kernel MTU: 1500
Apr 13 17:16:30 targetnode corosync[3262]:   [KNET  ] pmtud: PMTUD completed for host: 6 link: 0 current link mtu: 1397
Apr 13 17:16:30 targetnode pvesm[5627]: <root@pam> starting task UPID:targetnode:000015FC:001DACE4:5E94824E:imgdel:195@thin_pool_hwraid:root@pam:
Apr 13 17:16:35 targetnode pvesm[5628]: lvremove 'hwraid/vm-195-disk-0' error:   Logical volume hwraid/vm-195-disk-0 in use.
Apr 13 17:16:35 targetnode pvesm[5627]: <root@pam> end task UPID:targetnode:000015FC:001DACE4:5E94824E:imgdel:195@thin_pool_hwraid:root@pam: lvremove 'hwraid/vm-195-disk-0' error:   Logical volume hwraid/vm-195-disk-0 in use.
Apr 13 17:16:35 targetnode systemd[1]: session-1357.scope: Succeeded.
Apr 13 17:16:36 targetnode qm[5705]: <root@pam> starting task UPID:targetnode:0000164A:001DAF41:5E948254:qmstop:195:root@pam:
Apr 13 17:16:36 targetnode qm[5706]: stop VM 195: UPID:targetnode:0000164A:001DAF41:5E948254:qmstop:195:root@pam:
Apr 13 17:16:36 targetnode qm[5705]: <root@pam> end task UPID:targetnode:0000164A:001DAF41:5E948254:qmstop:195:root@pam: OK
Apr 13 17:16:36 targetnode systemd[1]: session-1358.scope: Succeeded.
Apr 13 17:16:36 targetnode systemd[1]: session-1203.scope: Succeeded.
Apr 13 17:16:36 targetnode kernel: [19454.327496] vmbr402: port 2(tap195i0) entered disabled state
Apr 13 17:16:36 targetnode systemd[1]: 195.scope: Succeeded.
Apr 13 17:16:37 targetnode systemd[1]: session-1359.scope: Succeeded.
Apr 13 17:16:37 targetnode pmxcfs[2819]: [status] notice: received log
 
Hi, Helmo!
What helped me back then was a complete upgrade of the cluster to the latest minor version (just the standard apt upgrade on each node, sketched after the version list below). I also searched the forums for similar errors, but did not find any. It seems to me that the problem was in QEMU itself and a mismatch with the Proxmox configuration, and I think I simply got lucky. That is why I am now a little wary of upgrading the cluster beyond the 6.1 minor releases.
Here are the cluster package versions with which the problem disappeared:
proxmox-ve: 6.1-2 (running kernel: 5.3.13-1-pve)
pve-manager: 6.1-5 (running version: 6.1-5/9bf06119)
pve-kernel-5.3: 6.1-1
pve-kernel-helper: 6.1-1
pve-kernel-5.0: 6.0-11
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
ceph-fuse: 12.2.12-pve1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: not correctly installed
ifupdown2: 1.2.8-1+pve4
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-5
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-9
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.1-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve3
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
openvswitch-switch: 2.10.0+2018.08.28+git.8ca7c82b7d+ds1-12+deb10u1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-2
pve-cluster: 6.1-2
pve-container: 3.0-16
pve-docs: 6.1-3
pve-edk2-firmware: 2.20191127-1
pve-firewall: 4.0-9
pve-firmware: 3.0-4
pve-ha-manager: 3.0-8
pve-i18n: 2.0-3
pve-qemu-kvm: 4.1.1-2
pve-xtermjs: 3.13.2-1
qemu-server: 6.1-4
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve2
I hope this helps you in some way.
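The upgrade itself was nothing special, roughly the standard procedure on each node (a sketch, assuming the stock PVE 6.x repositories):
Code:
# on each node, one at a time, after moving the guests away
apt update
apt dist-upgrade    # brings pve-qemu-kvm, qemu-server, etc. to the 6.1 versions listed above
reboot              # to boot the new 5.3 kernel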
 
I'd like to upgrade the last node in my cluster, but to do that I first need to migrate all VMs to another node.

:( dependency loop...

I've not been able to identify why some migrations worked and others failed (while migrating between the same nodes).
 
Hi all,

I had the exact same issue today with Proxmox 7.2, previously upgraded from 6.3.x: I could not live-migrate VMs between the nodes in the cluster.

Errors included:

* VM 105 qmp command failed - VM 105 qmp command 'query-proxmox-support' failed - got timeout
* VM 120 qmp command failed - VM 120 qmp command 'query-proxmox-support' failed - unable to connect to VM 120 qmp socket - timeout>


It turned out to be the built-in firewall in Proxmox. It was set to enabled on all hosts (this had worked before; I think the upgrade to 7.2 turned it back on), and it seems Proxmox could then no longer reach the other nodes over the second network adapter for ssh/scp and other communication.

For those who are stuck, try disabling or tuning the built-in Proxmox firewall; that solved it for me.
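If you would rather check before disabling anything, something like this shows the state (a sketch; the cluster-wide switch lives in /etc/pve/firewall/cluster.fw):
Code:
pve-firewall status                                   # per-node firewall state
grep -A2 '\[OPTIONS\]' /etc/pve/firewall/cluster.fw   # look for "enable: 1"
pve-firewall stop                                     # temporarily stop it on this node while testing a migration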
 
