The usual failure looks like:
Code:
...
2021-08-03 15:24:47 migration active, transferred 32.6 GiB of 32.0 GiB VM-state, 120.2 MiB/s
2021-08-03 15:24:48 migration active, transferred 32.7 GiB of 32.0 GiB VM-state, 117.0 MiB/s
2021-08-03 15:24:49 migration active, transferred 32.8 GiB of 32.0 GiB VM-state, 305.7 MiB/s
2021-08-03 15:24:49 xbzrle: send updates to 9806 pages in 11.7 MiB encoded memory, cache-miss 96.32%, overflow 942
2021-08-03 15:24:50 migration active, transferred 33.0 GiB of 32.0 GiB VM-state, 167.8 MiB/s
2021-08-03 15:24:50 xbzrle: send updates to 38677 pages in 54.0 MiB encoded memory, cache-miss 96.32%, overflow 3865
2021-08-03 15:24:51 migration active, transferred 33.1 GiB of 32.0 GiB VM-state, 132.1 MiB/s
2021-08-03 15:24:51 xbzrle: send updates to 44964 pages in 57.5 MiB encoded memory, cache-miss 96.32%, overflow 4044
2021-08-03 15:24:52 migration active, transferred 33.2 GiB of 32.0 GiB VM-state, 216.6 MiB/s
2021-08-03 15:24:52 xbzrle: send updates to 65806 pages in 57.6 MiB encoded memory, cache-miss 96.32%, overflow 4050
2021-08-03 15:24:53 average migration speed: 112.2 MiB/s - downtime 263 ms
2021-08-03 15:24:53 migration status: completed
2021-08-03 15:24:54 ERROR: unable to open file '/etc/pve/nodes/pve1/qemu-server/109.conf.tmp.3939' - Device or resource busy
2021-08-03 15:24:54 ERROR: migration finished with problems (duration 00:05:00)
TASK ERROR: migration problems
Here's another failure from today (although this one is fixed with an easy qm unlock 150 on pve1, and everything continues running):
Code:
2021-08-03 19:09:55 migration active, transferred 3.6 GiB of 4.0 GiB VM-state, 148.3 MiB/s
2021-08-03 19:09:56 migration active, transferred 3.7 GiB of 4.0 GiB VM-state, 133.5 MiB/s
2021-08-03 19:09:58 average migration speed: 120.5 MiB/s - downtime 202 ms
2021-08-03 19:09:58 migration status: completed
2021-08-03 19:10:02 # /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve1' root@2001:redacted qm unlock 150
2021-08-03 19:10:02 ERROR: failed to clear migrate lock: Configuration file 'nodes/pve1/qemu-server/150.conf' does not exist
2021-08-03 19:10:02 ERROR: migration finished with problems (duration 00:00:43)
TASK ERROR: migration problems
Every now and then, roughly one in 20 migrations goes bad, and it's been like this for months. I'm currently on the latest pve7; it happened on pve6.* as well.
When a migration fails, the VM is frozen and locked in the 'postmigrate' state. After unlocking it, the only way to avoid resetting the VM is to hibernate and resume it (all other attempts either lose the state or keep the VM frozen). Attempting the migration again always fails, but it does clean up the previous migration:
Code:
2021-08-03 18:29:36 starting migration of VM 109 to node 'pve2' (2001:redacted2)
2021-08-03 18:29:36 starting VM 109 on remote node 'pve2'
2021-08-03 18:29:38 [pve2] VM 109 already running
2021-08-03 18:29:38 ERROR: online migrate failure - remote command failed with exit code 255
2021-08-03 18:29:38 aborting phase 2 - cleanup resources
2021-08-03 18:29:38 migrate_cancel
2021-08-03 18:29:40 ERROR: migration finished with problems (duration 00:00:04)
TASK ERROR: migration problems
After that, the second migration works fine, but it is more likely than usual to fail again; they tend to fail 3-4 times in a row. Keep in mind that each cycle, especially the hibernation, takes many minutes of (down)time, so a single migration can end up taking an hour or two. This has postponed upgrades on many occasions.
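For reference, the unlock/hibernate/resume cycle described above boils down to roughly the following (only a sketch; VMID 109 and target node pve2 are taken from the logs as examples, and 'qm suspend --todisk' is what I mean by hibernation):
Code:
# clear the stale 'postmigrate' lock left behind by the failed migration
qm unlock 109
# hibernate: save the VM state to disk and stop the VM
qm suspend 109 --todisk 1
# starting the VM again resumes it from the saved state
qm start 109
# then retry the migration
qm migrate 109 pve2 --online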
It seems that quorum is lost in the middle of the migration, and even if it is regained, the migration still fails. Sometimes migrations fail mid-way, immediately on quorum loss (the migration simply aborts, which is not a problem). Often the failure follows a quorum outage from a minute or two earlier; other times there is no quorum outage to be seen at all. My guess is that the temporary file doesn't exist, is read-only, or times out on opening. Other posts indicate that the nodes might be out of storage.
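Checks along these lines should show whether quorum or pmxcfs is to blame (a rough sketch; the time window and the test file name are examples, not taken from the logs above):
Code:
# quorum/membership state on each node
pvecm status
# corosync and pmxcfs (pve-cluster) logs around the failed migration
journalctl -u corosync -u pve-cluster --since "2021-08-03 15:20" --until "2021-08-03 15:30"
# /etc/pve goes read-only when quorum is lost; a failing write here would
# match the 'unable to open ...conf.tmp...' error above
touch /etc/pve/writetest && rm /etc/pve/writetest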
The three nodes each run from a single 256 GB SSD (245 GB real) with roughly 8 GB used, and they all mount a common NFS share:
Code:
root@pve2:~# df -h
Filesystem                                 Size  Used Avail Use% Mounted on
udev                                        95G     0   95G   0% /dev
tmpfs                                       19G  1.2M   19G   1% /run
/dev/sda1                                  229G  7.3G  210G   4% /
tmpfs                                       95G   60M   95G   1% /dev/shm
tmpfs                                      5.0M     0  5.0M   0% /run/lock
/dev/fuse                                  128M  112K  128M   1% /etc/pve
[2001:redacted]:/mnt/proxmox/pve-cluster   1.3T  371G  942G  29% /mnt/pve/pve-cluster
tmpfs                                       19G     0   19G   0% /run/user/0