Live migration failed

Hi,

I have been doing live migrations (local storage to local storage) from a 6.3 node to a 7.2 node. There were no issues with 5 of the VMs, but the last one failed, with the following error output:

Code:
2022-07-05 23:47:53 migration status: active (transferred 29642295186, remaining 33935360), total 23086309376)
2022-07-05 23:47:53 migration xbzrle cachesize: 4294967296 transferred 763131042 pages 844328 cachemiss 1716347 overflow 29330
query migrate failed: VM 500 qmp command 'query-migrate' failed - client closed connection

2022-07-05 23:47:57 query migrate failed: VM 500 qmp command 'query-migrate' failed - client closed connection
query migrate failed: VM 500 not running

2022-07-05 23:47:58 query migrate failed: VM 500 not running
query migrate failed: VM 500 not running

2022-07-05 23:47:59 query migrate failed: VM 500 not running
query migrate failed: VM 500 not running

2022-07-05 23:48:00 query migrate failed: VM 500 not running
query migrate failed: VM 500 not running

2022-07-05 23:48:01 query migrate failed: VM 500 not running
query migrate failed: VM 500 not running

2022-07-05 23:48:02 query migrate failed: VM 500 not running
2022-07-05 23:48:02 ERROR: online migrate failure - too many query migrate failures - aborting
2022-07-05 23:48:02 aborting phase 2 - cleanup resources
2022-07-05 23:48:02 migrate_cancel
2022-07-05 23:48:02 migrate_cancel error: VM 500 not running
drive-scsi0: Cancelling block job
2022-07-05 23:48:02 ERROR: VM 500 not running
2022-07-05 23:48:08 ERROR: migration finished with problems (duration 02:05:56)
TASK ERROR: migration problems

As you can see from the log, the migration was almost finished (the xbzrle cache was already in use) when it failed. I have checked the logs, but there is nothing special in them (maybe I'm not checking the right logs).
 
There should be output on the source node in syslog/the journal - either because the VM crashed (in which case either the kvm process or the kernel should log a reason/message) or because the kernel killed it (in which case the kernel should have logged an OOM message or something similar).
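For example, something along these lines on the source node should surface a QEMU crash or an OOM kill around the failure time (a rough sketch; adjust the time window to match the task log):

Code:
# journal entries around the time the migration aborted
journalctl --since "2022-07-05 23:40" --until "2022-07-05 23:50"
# or search syslog for kvm and OOM messages
grep -E 'kvm|Out of memory|oom' /var/log/syslog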
 
On the source node I have this in syslog:

Code:
Jul  5 23:47:23 px corosync[1357]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul  5 23:47:54 px QEMU[33652]: kvm: block/io.c:1891: bdrv_co_write_req_prepare: Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed.
Jul  5 23:47:57 px kernel: [48474956.376368] fwbr500i0: port 2(tap500i0) entered disabled state
Jul  5 23:47:57 px kernel: [48474956.403897] fwbr500i0: port 2(tap500i0) entered disabled state
Jul  5 23:47:57 px pvedaemon[29255]: VM 500 qmp command failed - VM 500 qmp command 'query-migrate' failed - client closed connection
Jul  5 23:47:57 px pvedaemon[29255]: query migrate failed: VM 500 qmp command 'query-migrate' failed - client closed connection#012
Jul  5 23:47:57 px systemd[1]: 500.scope: Succeeded.
Jul  5 23:47:58 px qmeventd[970]: Starting cleanup for 500
Jul  5 23:47:58 px qmeventd[970]: trying to acquire lock...
Jul  5 23:47:58 px pvedaemon[29255]: VM 500 qmp command failed - VM 500 not running
Jul  5 23:47:58 px pvedaemon[29255]: query migrate failed: VM 500 not running#012
Jul  5 23:47:59 px pvedaemon[29255]: VM 500 qmp command failed - VM 500 not running
Jul  5 23:47:59 px pvedaemon[29255]: query migrate failed: VM 500 not running#012
Jul  5 23:48:00 px systemd[1]: Starting Proxmox VE replication runner...
Jul  5 23:48:00 px pvedaemon[29255]: VM 500 qmp command failed - VM 500 not running
Jul  5 23:48:00 px pvedaemon[29255]: query migrate failed: VM 500 not running#012
Jul  5 23:48:01 px systemd[1]: pvesr.service: Succeeded.
Jul  5 23:48:01 px systemd[1]: Started Proxmox VE replication runner.
Jul  5 23:48:01 px pvedaemon[29255]: VM 500 qmp command failed - VM 500 not running
Jul  5 23:48:01 px pvedaemon[29255]: query migrate failed: VM 500 not running#012
Jul  5 23:48:02 px pvedaemon[29255]: VM 500 qmp command failed - VM 500 not running
Jul  5 23:48:02 px pvedaemon[29255]: query migrate failed: VM 500 not running#012
Jul  5 23:48:02 px pvedaemon[29255]: VM 500 qmp command failed - VM 500 not running
Jul  5 23:48:02 px pvedaemon[29255]: VM 500 qmp command failed - VM 500 not running
Jul  5 23:48:02 px pvedaemon[29255]: VM 500 qmp command failed - VM 500 not running
Jul  5 23:48:04 px pmxcfs[1225]: [status] notice: received log
Jul  5 23:48:04 px pmxcfs[1225]: [status] notice: received log
Jul  5 23:48:05 px pmxcfs[1225]: [status] notice: received log
Jul  5 23:48:07 px pmxcfs[1225]: [status] notice: received log
Jul  5 23:48:08 px qmeventd[970]: can't lock file '/var/lock/qemu-server/lock-500.conf' - got timeout
Jul  5 23:48:08 px pvedaemon[29255]: migration problems
Jul  5 23:48:08 px pvedaemon[24370]: <root@pam> end task UPID:px:00007247:120E4A210:62C49414:qmigrate:500:root@pam: migration problems

and on the VM itself there is nothing in the logs after Jul 5 23:45:59
 
Hi,
there's an assertion failure in QEMU which crashed the instance:
Code:
Jul  5 23:47:54 px QEMU[33652]: kvm: block/io.c:1891: bdrv_co_write_req_prepare: Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed.
Could you share the VM config (qm config 500) and the relevant parts of the storage config (/etc/pve/storage.cfg)? What is the currently installed version of QEMU, and when was the last time the VM was started (a guest reboot doesn't count; EDIT: a migration from another node does)? If you know the exact version that the VM was running with, that would be even better.
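For reference, that information can be collected on the source node with:

Code:
qm config 500
cat /etc/pve/storage.cfg
pveversion -v   # shows the currently installed pve-qemu-kvm version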
 
qm config 500 output:
Code:
bootdisk: scsi0
cores: 3
description: Reseller server IP
ide2: none,media=cdrom
memory: 22000
name: HOSTNAME_OF_THE_SERVER
net0: virtio=FE:99:10:F8:50:59,bridge=vmbr0,firewall=1,rate=100
numa: 0
onboot: 1
ostype: l26
scsi0: local-lvm:vm-500-disk-0,size=750G
scsihw: virtio-scsi-pci
smbios1: uuid=ef57a412-f86f-42f4-9620-afa7886dc971
sockets: 2
vmgenid: c83d27dd-08b9-4e61-a5aa-3309ef82858d

cat /etc/pve/storage.cfg

Code:
dir: local
    path /var/lib/vz
    content vztmpl,images,iso,snippets,backup
    shared 0

lvmthin: local-lvm
    thinpool data
    vgname pve
    content rootdir,images

nfs: nas-px
    export /home/px/px
    path /mnt/pve/nas-px
    server IP_OF_NAS
    content backup,images
    prune-backups keep-last=24

lvmthin: nvme
    thinpool nvme
    vgname nvme
    content rootdir,images
    nodes px4

pveversion --verbose output

Code:
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-1-pve: 5.0.21-2
pve-kernel-5.0.18-1-pve: 5.0.18-3
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 14.2.16-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.6-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-2
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1

Uptime of the server was:

Code:
5 Month(s) 18 Day(s)

Regarding the VM version, I'm not sure if you mean the VM's OS version etc.; if you meant that, this is the output:


Code:
Linux HOSTNAME_OF_SERVER 3.10.0-962.3.2.lve1.5.38.el7.x86_64 #1 SMP Thu Jun 18 05:28:41 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

CloudLinux release 7.9 (Boris Yegorov)
NAME="CloudLinux"
VERSION="7.9 (Boris Yegorov)"
ID="cloudlinux"
ID_LIKE="rhel fedora centos"
VERSION_ID="7.9"
PRETTY_NAME="CloudLinux 7.9 (Boris Yegorov)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:cloudlinux:cloudlinux:7.9:GA:server"
HOME_URL="https://www.cloudlinux.com/"
BUG_REPORT_URL="https://www.cloudlinux.com/support"

CloudLinux release 7.9 (Boris Yegorov)
CloudLinux release 7.9 (Boris Yegorov)
cpe:/o:cloudlinux:cloudlinux:7.9:ga:server
 
Sorry, I meant the version of pve-qemu-kvm with which the VM was started before the crash happened. I know that's not something people will just remember, but by checking when you last upgraded that package in the files
Code:
/var/log/apt/history.log
/var/log/apt/history.log.1.gz
/var/log/apt/history.log.2.gz
...
and estimating how long the VM had been up before that, you might be able to tell. It could be helpful, but it's not super important.
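For example, something like this should list every apt transaction that touched the package together with its Start-Date line (a sketch; zgrep also reads the rotated .gz files):

Code:
zgrep -B3 'pve-qemu-kvm' /var/log/apt/history.log*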
 
Version of pve-qemu-kvm: pve-qemu-kvm:amd64 (4.0.0-5, 5.1.0-7)
and for the uptime, I don't know that info.

The strange thing is that only this VM crashed. As I already wrote in the first post, I migrated a few other VMs from this Proxmox node to the 7.2 destination without any issues.
 
Did you retry the migration in the meantime?

I only found https://gitlab.com/qemu-project/qemu/-/commit/a13de40a05478e64726dd9861135d344837f3c30 (included in QEMU >= 6.2), which talks about the same assertion failure. Usually locks in PVE should prevent migration during backup, but I'll still ask: is there any chance that a backup of the VM was running close to the migration?
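For what it's worth, a quick way to check for backup activity around that time would be something like the following (the task index path is the usual PVE default, so treat it as an assumption):

Code:
# vzdump entries in the PVE task history
grep vzdump /var/log/pve/tasks/index
# and any vzdump messages in syslog around the migration window
grep vzdump /var/log/syslog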

Even if it's a different issue, QEMU 5.1 is rather old and there's a good chance the issue was already fixed since then. But if such an issue ever happens again on newer versions, please don't hesitate to report it.
 
I have tried the migration again, but I got the same error, and I am 100% sure there was no backup job running.
 
Ok, then it's a different issue. If you'd like to help identify it, please do the following:
  1. Install debugger/debug symbols: apt install gdb pve-qemu-kvm-dbg
  2. Start the VM and get its PID (e.g. using qm list).
  3. Run gdb -ex 'handle SIGUSR1 nostop noprint' --ex 'handle SIGPIPE nostop noprint' --ex 'set pagination off' --ex 'c' -p <PID> with the PID you obtained before.
  4. Start the migration and wait for the crash.
  5. After the crash, enter t a a bt into gdb and share the output here. This will show the backtraces which should give good hints on what is triggering the faulty write after migration already marked the drives as inactive.
If you just want to migrate the VM, you can try migrating it offline.
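For the offline route, something along these lines should do (the target node name is a placeholder; with a local-lvm storage of the same name on the target no extra options should be needed, otherwise see the --targetstorage option of qm migrate):

Code:
qm shutdown 500
qm migrate 500 <target-node>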
 
I have a similar scenario to this thread: migrating VMs from a 6.x to a 7.x install, upgrading the servers one by one to 7.x... but I could also replicate it between two 7.x servers.


After many VMs that migrated just fine, I have one that keeps failing. It's a fairly busy monitoring server, and I get the feeling that this server is just too busy. The final few % of the disk image also took longer than expected, bouncing between 99% and 98%.


In the end it helped to stop the busiest daemon on the VM being migrated during the final phase.
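In my case that boiled down to roughly this inside the guest (the service name is just a placeholder for whatever is generating the IO):

Code:
# shortly before the migration converges
systemctl stop <io-heavy-daemon>
# and once the migration has finished
systemctl start <io-heavy-daemon>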


Syslog:

Code:
QEMU[10593]: kvm: ../block/io.c:1817: bdrv_co_write_req_prepare: Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed.
pvedaemon[11550]: VM 144 qmp command failed - VM 144 qmp command 'query-migrate' failed - client closed connection
pvedaemon[11550]: query migrate failed: VM 144 qmp command 'query-migrate' failed - client closed connection#012
kernel: [36325274.324914] vmbr4: port 2(tap144i0) entered disabled state
kernel: [36325274.325122] vmbr4: port 2(tap144i0) entered disabled state
systemd[1]: 144.scope: Succeeded.
qmeventd[787]: Starting cleanup for 144
qmeventd[787]: trying to acquire lock...
pvedaemon[11550]: VM 144 qmp command failed - VM 144 not running
pvedaemon[11550]: query migrate failed: VM 144 not running#012
pvedaemon[11550]: VM 144 qmp command failed - VM 144 not running
pvedaemon[11550]: query migrate failed: VM 144 not running#012
pvedaemon[11550]: VM 144 qmp command failed - VM 144 not running
pvedaemon[11550]: query migrate failed: VM 144 not running#012
pvedaemon[11550]: VM 144 qmp command failed - VM 144 not running
pvedaemon[11550]: query migrate failed: VM 144 not running#012
pvedaemon[11550]: VM 144 qmp command failed - VM 144 not running
pvedaemon[11550]: query migrate failed: VM 144 not running#012
pvedaemon[11550]: VM 144 qmp command failed - VM 144 not running
pvedaemon[11550]: VM 144 qmp command failed - VM 144 not running
pvedaemon[11550]: VM 144 qmp command failed - VM 144 not running
pvedaemon[11550]: VM 144 qmp command failed - VM 144 not running


Job log:
Code:
2022-09-22 23:38:06 migration active, transferred 2.6 GiB of 4.0 GiB VM-state, 70.3 MiB/s
2022-09-22 23:38:06 xbzrle: send updates to 6816 pages in 10.3 MiB encoded memory, cache-miss 97.69%, overflow 156
2022-09-22 23:38:07 migration active, transferred 2.6 GiB of 4.0 GiB VM-state, 84.0 MiB/s
2022-09-22 23:38:07 xbzrle: send updates to 25045 pages in 13.9 MiB encoded memory, cache-miss 97.69%, overflow 260
2022-09-22 23:38:08 migration active, transferred 2.6 GiB of 4.0 GiB VM-state, 112.9 MiB/s
2022-09-22 23:38:08 xbzrle: send updates to 36351 pages in 19.3 MiB encoded memory, cache-miss 52.63%, overflow 306
2022-09-22 23:38:09 migration active, transferred 2.7 GiB of 4.0 GiB VM-state, 166.8 MiB/s
2022-09-22 23:38:09 xbzrle: send updates to 52007 pages in 21.3 MiB encoded memory, cache-miss 42.32%, overflow 326
2022-09-22 23:38:09 auto-increased downtime to continue migration: 200 ms
query migrate failed: VM 144 qmp command 'query-migrate' failed - client closed connection



I've collected the backtrace on both the sending and the receiving nodes.

On the receiving node the VM is not listed in qm list yet, so I just used ps | grep <vmname> to get the PID.

After the failure an 'Erase data' job runs to clean up on the receiving node; however, the original VM is down and not restarted. It would be nice if the VM were started again in this cleanup step.

On the receiving node the output was limited to:

Code:
[New Thread 0x7fb913fff700 (LWP 3902001)]
[Thread 0x7fb913fff700 (LWP 3902001) exited]
 
 
Thread 1 "kvm" received signal SIGTERM, Terminated.
0x00007fba7292eee6 in __ppoll (fds=0x55cf0d3c5e40, nfds=10, timeout=<optimized out>, timeout@entry=0x7ffcd79b2450, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
44      in ../sysdeps/unix/sysv/linux/ppoll.c


The backtrace of the sending side is in the attachment.
 


Hi,
thanks for the details! Please also post the output of pveversion -v and qm config 144 and tell us what type of storage the VM is using. Hopefully the backtrace will give some clues as to what's going on.
 
qm config 144
Code:
balloon: 1532
bootdisk: scsi0
cores: 4
ide2: none,media=cdrom
memory: 4096
name: ...
net0: virtio=AA:B1:E5:96:F4:BD,bridge=vmbr4
numa: 0
onboot: 1
ostype: l26
scsi0: thin_pool_hwraid:vm-144-disk-0,discard=on,format=raw,size=16192M
scsi1: thin_pool_hwraid:vm-144-disk-1,discard=on,format=raw,size=81G
scsihw: virtio-scsi-pci
serial0: socket
smbios1: uuid=04604ff7-3f98-4ffd-af3d-e44f43f41350
sockets: 1

pveversion -v
Code:
proxmox-ve: 6.4-1 (running kernel: 5.4.128-1-pve)
pve-manager: 6.4-15 (running version: 6.4-15/af7986e6)
pve-kernel-5.4: 6.4-20
pve-kernel-helper: 6.4-20
pve-kernel-5.4.203-1-pve: 5.4.203-1
pve-kernel-5.4.195-1-pve: 5.4.195-1
pve-kernel-5.4.128-1-pve: 5.4.128-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.5-pve2~bpo10+1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
libjs-extjs: 6.0.1-10
libknet1: 1.22-pve2~bpo10+1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-5
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-5
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.14-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-2
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-8
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.7-pve1
 
What kind of storage is thin_pool_hwraid?

That pveversion output is from the 6.4 source node; what about the 7.x host?

Could you also share the migration task logs (you can cut out most of the progress messages in the middle if you want)?
 
What kind of storage is thin_pool_hwraid?

LVM thin pool


pveversion -v on the 7.2 host:

Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.53-1-pve)
pve-manager: 7.2-11 (running version: 7.2-11/b76d3178)
pve-kernel-helper: 7.2-12
pve-kernel-5.15: 7.2-10
pve-kernel-5.4: 6.4-7
pve-kernel-5.15.53-1-pve: 5.15.53-1
pve-kernel-5.4.143-1-pve: 5.4.143-1
pve-kernel-5.4.44-2-pve: 5.4.44-2
ceph-fuse: 14.2.21-1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-3
libpve-storage-perl: 7.2-8
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.6-1
proxmox-backup-file-restore: 2.2.6-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-1
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.5-pve1
 


Thank you again for all the information! I was able to reproduce the issue now, although I had to patch my QEMU to use a much longer delay in a certain place in the drive mirror. The issue seems to only happen with lots of IO and bad luck when live migrating with local disks.

AFAICS there is no enforcement that the drive mirror has finished before the migration code inactivates the block drives, which would be the obvious way to avoid this issue. I'll test around a bit more and see if I can come up with a solution (maybe together with QEMU upstream developers if it's an issue there as well).
 
AFAICS there is no enforcement that the drive mirror has finished before the migration code inactivates the block drives, which would be the obvious way to avoid this issue.
Well, there is if one uses the correct parameters for drive-mirror ;) I sent a patch for discussion.
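For context, the knob in question is presumably the mirror job's copy-mode; a rough QMP sketch of the idea (the NBD target is a placeholder, and this is not necessarily the exact form the patch takes):

Code:
{ "execute": "drive-mirror",
  "arguments": { "device": "drive-scsi0",
                 "target": "nbd:<target-ip>:<port>:exportname=drive-scsi0",
                 "sync": "full",
                 "mode": "existing",
                 "format": "raw",
                 "copy-mode": "write-blocking" } }

With copy-mode=write-blocking every guest write has to reach the mirror target before it completes, which is also why the follow-up below mentions the network becoming a bottleneck for the whole migration rather than just the final phase.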
 
Hi,
I just had the same problem.

Do you plan to integrate this patch into PVE 7.2?
the issue with the patch at hand is that the network would become a bottleneck for the VM's IO during the whole migration, not just the final phase, which is not very nice. There is an alternative way of pausing the VM right before finishing the block job (so all IO will be mirrored in time), but that requires a bit of rework of our migration handling. I worked out a proof of concept, but I'm not sure the fix will make it into the next point release (7.3 is expected in a few weeks). More likely it will come as an update for 7.3 later.
 
Unfortunately I'm still seeing the same issue in a Proxmox 8.0 -> 8.2 live migration (same VM as last time).
My workaround is still to just stop the IO-intensive daemon inside that VM during the last phase of the migration.
 
