The usual failure looks like:
Code:
...
2021-08-03 15:24:47 migration active, transferred 32.6 GiB of 32.0 GiB VM-state, 120.2 MiB/s
2021-08-03 15:24:48 migration active, transferred 32.7 GiB of 32.0 GiB VM-state, 117.0 MiB/s
2021-08-03 15:24:49 migration active, transferred 32.8 GiB of 32.0 GiB VM-state, 305.7 MiB/s
2021-08-03 15:24:49 xbzrle: send updates to 9806 pages in 11.7 MiB encoded memory, cache-miss 96.32%, overflow 942
2021-08-03 15:24:50 migration active, transferred 33.0 GiB of 32.0 GiB VM-state, 167.8 MiB/s
2021-08-03 15:24:50 xbzrle: send updates to 38677 pages in 54.0 MiB encoded memory, cache-miss 96.32%, overflow 3865
2021-08-03 15:24:51 migration active, transferred 33.1 GiB of 32.0 GiB VM-state, 132.1 MiB/s
2021-08-03 15:24:51 xbzrle: send updates to 44964 pages in 57.5 MiB encoded memory, cache-miss 96.32%, overflow 4044
2021-08-03 15:24:52 migration active, transferred 33.2 GiB of 32.0 GiB VM-state, 216.6 MiB/s
2021-08-03 15:24:52 xbzrle: send updates to 65806 pages in 57.6 MiB encoded memory, cache-miss 96.32%, overflow 4050
2021-08-03 15:24:53 average migration speed: 112.2 MiB/s - downtime 263 ms
2021-08-03 15:24:53 migration status: completed
2021-08-03 15:24:54 ERROR: unable to open file '/etc/pve/nodes/pve1/qemu-server/109.conf.tmp.3939' - Device or resource busy
2021-08-03 15:24:54 ERROR: migration finished with problems (duration 00:05:00)
TASK ERROR: migration problems
Here's another failure from today (although this one is fixed with an easy qm unlock 150 on pve1, and everything continues running):
Code:
2021-08-03 19:09:55 migration active, transferred 3.6 GiB of 4.0 GiB VM-state, 148.3 MiB/s
2021-08-03 19:09:56 migration active, transferred 3.7 GiB of 4.0 GiB VM-state, 133.5 MiB/s
2021-08-03 19:09:58 average migration speed: 120.5 MiB/s - downtime 202 ms
2021-08-03 19:09:58 migration status: completed
2021-08-03 19:10:02 # /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve1' root@2001:redacted qm unlock 150
2021-08-03 19:10:02 ERROR: failed to clear migrate lock: Configuration file 'nodes/pve1/qemu-server/150.conf' does not exist
2021-08-03 19:10:02 ERROR: migration finished with problems (duration 00:00:43)
TASK ERROR: migration problems
Every now and then, roughly one in 20 migrations goes bad, and it's been like this for months. I'm currently on the latest pve7; it happened on pve6.* as well.
When a migration fails, the VM is frozen and locked in the 'postmigrate' state. After unlocking it, the only way to avoid resetting the VM is to hibernate and resume it (all other attempts either lose the state or keep the VM frozen). Attempting the migration again always fails, but it does clean up the previous migration:
Code:
2021-08-03 18:29:36 starting migration of VM 109 to node 'pve2' (2001:redacted2)
2021-08-03 18:29:36 starting VM 109 on remote node 'pve2'
2021-08-03 18:29:38 [pve2] VM 109 already running
2021-08-03 18:29:38 ERROR: online migrate failure - remote command failed with exit code 255
2021-08-03 18:29:38 aborting phase 2 - cleanup resources
2021-08-03 18:29:38 migrate_cancel
2021-08-03 18:29:40 ERROR: migration finished with problems (duration 00:00:04)
TASK ERROR: migration problems
After that, the second migration works fine, but it is more likely than usual to fail again; they tend to fail 3-4 times in a row. Keep in mind that each cycle, especially the hibernation, takes many minutes of (down)time, so a single migration can end up taking an hour or two. This has postponed upgrades on many occasions.
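For reference, the unlock/hibernate/resume cycle described above boils down to roughly the following (only a sketch; VMID 109 and target node pve2 are taken from the logs as examples, and 'qm suspend --todisk' is what I mean by hibernation):
Code:
# clear the stale 'postmigrate' lock left behind by the failed migration
qm unlock 109
# hibernate: save the VM state to disk and stop the VM
qm suspend 109 --todisk 1
# starting the VM again resumes it from the saved state
qm start 109
# then retry the migration
qm migrate 109 pve2 --online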
It seems that quorum is lost in the middle of the migration, and even if it is regained, the migration still fails. Sometimes migrations fail mid-way, immediately on quorum loss (the migration simply aborts, which is not a problem). Often the failure follows a quorum outage from a minute or two earlier; other times there is no quorum outage to be seen at all. My guess is that the temporary file doesn't exist, is read-only, or times out on opening. Other posts indicate that the nodes might be out of storage.
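Checks along these lines should show whether quorum or pmxcfs is to blame (a rough sketch; the time window and the test file name are examples, not taken from the logs above):
Code:
# quorum/membership state on each node
pvecm status
# corosync and pmxcfs (pve-cluster) logs around the failed migration
journalctl -u corosync -u pve-cluster --since "2021-08-03 15:20" --until "2021-08-03 15:30"
# /etc/pve goes read-only when quorum is lost; a failing write here would
# match the 'unable to open ...conf.tmp...' error above
touch /etc/pve/writetest && rm /etc/pve/writetest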
The three nodes each run from a single 256 GB SSD (245 GB real) with roughly 8 GB used, and they all mount a common NFS share:
Code:
root@pve2:~# df -h
Filesystem                                 Size  Used Avail Use% Mounted on
udev                                        95G     0   95G   0% /dev
tmpfs                                       19G  1.2M   19G   1% /run
/dev/sda1                                  229G  7.3G  210G   4% /
tmpfs                                       95G   60M   95G   1% /dev/shm
tmpfs                                      5.0M     0  5.0M   0% /run/lock
/dev/fuse                                  128M  112K  128M   1% /etc/pve
[2001:redacted]:/mnt/proxmox/pve-cluster   1.3T  371G  942G  29% /mnt/pve/pve-cluster
tmpfs                                       19G     0   19G   0% /run/user/0