Failed Fail Over

ballybob

I have a two-node cluster with a qdevice. I have an HA group set up (mainly to set the priority node) and I'm using replication to keep data in sync on both nodes.
I yanked the plug on the first node and waited a few minutes. Everything failed over great; I was very impressed. It took a little longer than I expected for the cluster to notice, but it handled it all like a champ.
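For reference, the HA group and replication are set up roughly along these lines (the group name, priorities and schedule below are just illustrative examples, not my exact values):

Code:
# HA group that prefers pve1 but allows failover to pve2
ha-manager groupadd prefer-pve1 --nodes "pve1:2,pve2:1"
# put the VM under HA control in that group
ha-manager add vm:101 --group prefer-pve1 --state started
# storage replication job to the other node every 5 minutes
pvesr create-local-job 101-0 pve2 --schedule "*/5"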

Then I plugged node 1 back in and watched everything fail back. That worked too, for all but one VM.

This VM failed (error below) and kept retrying the migration over and over without success. I tried to stop it, I tried to shut it down and do my own migration while it was shut down, but it ignored all of that and just kept trying to migrate. Eventually I rebooted that node, and all was well again.

Below are the final messages in the migration log for this VM.

Code:
drive-efidisk0: Completing block job...
drive-efidisk0: Completed successfully.
drive-scsi0: Completing block job...
drive-scsi0: Completed successfully.
drive-efidisk0: Cancelling block job
drive-scsi0: Cancelling block job
drive-efidisk0: Done.
WARN: drive-scsi0: Input/output error (io-status: ok)
drive-scsi0: Done.
2024-07-21 09:31:29 ERROR: online migrate failure - Failed to complete storage migration: block job (mirror) error: drive-efidisk0: Input/output error (io-status: ok)
2024-07-21 09:31:29 aborting phase 2 - cleanup resources
2024-07-21 09:31:29 migrate_cancel
2024-07-21 09:31:29 scsi0: removing block-dirty-bitmap 'repl_scsi0'
2024-07-21 09:31:29 efidisk0: removing block-dirty-bitmap 'repl_efidisk0'
2024-07-21 09:31:31 ERROR: migration finished with problems (duration 00:00:19)
TASK ERROR: migration problems
 
Can you provide the configuration (in CODE tags) of the failed VM and of one that worked? Maybe there is something different.
 
Can you provide the configuration (in CODE tags) of the failed VM and of one that worked? Maybe there is something different.
How do I get the text configuration without using the file system? Or is that the only way?
I think it has to do with the EFI partition. This one has EFI (it's a Home Assistant OS VM with OVMF BIOS).
 
How do I get the text configuration without using the file system? Or is that the only way?
I think it has to do with the EFI partition. This one has EFI (it's a Home Assistant OS VM with OVMF BIOS).
Yes, the error message was about UEFI. Is your UEFI disk not replicated?
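You can check that quickly with something like this (assuming ZFS storage replication, with the names from your setup):

Code:
# on the source node: state of the replication job(s)
pvesr status
# on the target node: replication snapshots should exist for both disks
zfs list -t snapshot -o name,creation | grep vm-101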
 
Yes, the error message was about UEFI. Is your UEFI disk not replicated?
I don't really get a choice about that; I assumed the entire VM and everything belonging to it would have been replicated. Here is what I see.

[screenshot attachment]
 
Yes, the error message was about UEFI. Is your UEFI disk not replicated?
I confirmed in the logs that both the primary and the UEFI disk are replicated.
After a node failure/disconnect, it just tries over and over to migrate. It migrated from the primary to the secondary on failure, but it can't get back to the primary and is stuck in an endless migration loop that I can't do anything about without taking the 2nd node offline (which now only has this VM).

It migrates perfectly fine if I reboot the 2nd node and it moves back on its own.
 
Hi,
please share the output of pveversion -v from both nodes as well as the VM configuration with qm config 101.

Code:
drive-efidisk0: Completing block job...
drive-efidisk0: Completed successfully.
drive-scsi0: Completing block job...
drive-scsi0: Completed successfully.
drive-efidisk0: Cancelling block job
drive-scsi0: Cancelling block job
drive-efidisk0: Done.
WARN: drive-scsi0: Input/output error (io-status: ok)
drive-scsi0: Done.
2024-07-21 09:31:29 ERROR: online migrate failure - Failed to complete storage migration: block job (mirror) error: drive-efidisk0: Input/output error (io-status: ok)
So it seems like after successful completion an error appears in the job. That sounds like it shouldn't happen. There were recent changes to improve error detection for mirror jobs, but if it's already completed, such an error is probably better ignored. But it would still be good to find out where it comes from in the first place.
 
Hi,
please share the output of pveversion -v from both nodes as well as the VM configuration with qm config 101.


So it seems like after successful completion an error appears in the job. That sounds like it shouldn't happen. There were recent changes to improve error detection for mirror jobs, but if it's already completed, such an error is probably better ignored. But it would still be good to find out where it comes from in the first place.
I can ignore it, but the end result is still a VM stuck on a node trying to migrate over and over every 5 seconds until I reboot the node.
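(If there is a way to break that retry loop without a reboot I'd take it; I'm guessing something along these lines would take the VM out of HA's hands, but I haven't tried it yet:)

Code:
# untested guess: stop the HA stack from acting on this resource
ha-manager set vm:101 --state ignored
# ...or drop it from HA entirely and re-add it later
ha-manager remove vm:101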
 
Hi,
please share the output of pveversion -v from both nodes as well as the VM configuration with qm config 101.


So it seems like after successful completion an error appears in the job. That sounds like it shouldn't happen. There were recent changes to improve error detection for mirror jobs, but if it's already completed, such an error is probably better ignored. But it would still be good to find out where it comes from in the first place.

Code:
root@pve1:~# pveversion -v
proxmox-ve: 8.2.0 (running kernel: 6.8.8-3-pve)
pve-manager: 8.2.4 (running version: 8.2.4/faa83925c9641325)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.8-3
proxmox-kernel-6.8.8-3-pve-signed: 6.8.8-3
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx9
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.7
libpve-cluster-perl: 8.0.7
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.3
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.9
libpve-storage-perl: 8.2.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.7-1
proxmox-backup-file-restore: 3.2.7-1
proxmox-firewall: 0.4.2
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.7
pve-container: 5.1.12
pve-docs: 8.2.2
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.1
pve-firewall: 5.0.7
pve-firmware: 3.13-1
pve-ha-manager: 4.0.5
pve-i18n: 3.2.2
pve-qemu-kvm: 9.0.0-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.2
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.4-pve1

Code:
agent: 1
bios: ovmf
boot: order=scsi0
cores: 2
cpu: host
description: <div align='center'><a href='https%3A//Helper-Scripts.com' target='_blank' rel='noopener noreferrer'><img src='https%3A//raw.githubusercontent.com/tteck/Proxmox/main/misc/images/logo-81x112.png'/></a>%0A%0A  # Home Assistant OS%0A%0A  <a href='https%3A//ko-fi.com/D1D7EP4GF'><img src='https%3A//img.shields.io/badge/&#x2615;-Buy me a coffee-blue' /></a>%0A  </div>
efidisk0: local-zfs:vm-101-disk-0,efitype=4m,size=528K
localtime: 1
memory: 4096
meta: creation-qemu=8.1.5,ctime=1717440377
name: haos12.3
net0: virtio=02....,bridge=vmbr0
onboot: 1
ostype: l26
scsi0: local-zfs:vm-101-disk-1,cache=writethrough,discard=on,size=32G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=1ba8829......
tablet: 0
tags: proxmox-helper-scripts
vmgenid: f5a8fbe9......
 
I can ignore it, but the end result is still a VM stuck on a node trying to migrate over and over every 5 seconds until I reboot the node.
Sorry, I didn't mean for you to ignore it. I was thinking about making the error handling ignore it, because it seemed to be an error after completion of the mirror job. But in reality, the 'Completed successfully' message was mistakenly logged, meaning the error doesn't occur after completion, but during completion, so it cannot be ignored.

Could you also share the system logs for both source and target node around the time the issue occurs?
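Something like this on both nodes, covering the window of the failed migration from your first post, would be ideal (adjust the timestamps to your attempt; the output path is just an example):

Code:
# system log around the failed migration attempt
journalctl --since "2024-07-21 09:25" --until "2024-07-21 09:35" > "/tmp/journal-$(hostname).txt"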
 
Sorry, I didn't mean for you to ignore it. I was thinking about making the error handling ignore it, because it seemed to be an error after completion of the mirror job. But in reality, the 'Completed successfully' message was mistakenly logged, meaning the error doesn't occur after completion, but during completion, so it cannot be ignored.

Could you also share the system logs for both source and target node around the time the issue occurs?
I didn't want to parse through past logs to find when it happened, so I just disconnected the primary and forced a failover.
As before, this VM failed to migrate back to the primary when it was brought back online, while all the other VMs and LXCs returned to the primary.
This time, however, it eventually worked: after 4 failed migration attempts, the 5th attempt succeeded without intervention.

Something is obviously wrong, but this time I didn't have to reboot the 2nd node to force the first one to pick the VM up locally. It does appear to be related to the UEFI disk in the VM, but the replication logs clearly say both the main disk and the UEFI disk (disk0) were replicated successfully each time. For some reason pve2 doesn't have, or doesn't think it has, the UEFI disk0 for the VM. This is the only VM using UEFI.
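(For what it's worth, a quick existence check for the EFI disk's zvol on either node would be something like this, with the dataset paths from my pool:)

Code:
# the zvol and its replication snapshots should show up here
zfs list -t all -r rpool/data | grep vm-101-disk-0
ls -l /dev/zvol/rpool/data/ | grep vm-101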

Below are the logs from pve2. Oddly, I don't see anything relevant on pve1.


Code:
Jul 25 01:15:10 pve2 pve-ha-crm[1222]: service 'vm:101': state changed from 'migrate' to 'started'  (node = pve2)
Jul 25 01:15:10 pve2 pve-ha-crm[1222]: migrate service 'vm:101' to node 'pve1' (running)
Jul 25 01:15:10 pve2 pve-ha-crm[1222]: service 'vm:101': state changed from 'started' to 'migrate'  (node = pve2, target = pve1)
Jul 25 01:15:16 pve2 pve-ha-lrm[572287]: <root@pam> starting task UPID:pve2:0008:qmigrate:101:root@pam:
Jul 25 01:15:19 pve2 systemd[1]: Stopping user@0.service - User Manager for UID 0...
Jul 25 01:15:19 pve2 systemd[571984]: Activating special unit exit.target...
Jul 25 01:15:19 pve2 systemd[571984]: Stopped target default.target - Main User Target.
Jul 25 01:15:19 pve2 systemd[571984]: Stopped target basic.target - Basic System.
Jul 25 01:15:19 pve2 systemd[571984]: Stopped target paths.target - Paths.
Jul 25 01:15:19 pve2 systemd[571984]: Stopped target sockets.target - Sockets.
Jul 25 01:15:19 pve2 systemd[571984]: Stopped target timers.target - Timers.
Jul 25 01:15:19 pve2 systemd[571984]: Closed dirmngr.socket - GnuPG network certificate management daemon.
Jul 25 01:15:19 pve2 systemd[571984]: Closed gpg-agent-browser.socket - GnuPG cryptographic agent and passphrase cache (access for web browsers).
Jul 25 01:15:19 pve2 systemd[571984]: Closed gpg-agent-extra.socket - GnuPG cryptographic agent and passphrase cache (restricted).
Jul 25 01:15:19 pve2 systemd[571984]: Closed gpg-agent-ssh.socket - GnuPG cryptographic agent (ssh-agent emulation).
Jul 25 01:15:19 pve2 systemd[571984]: Closed gpg-agent.socket - GnuPG cryptographic agent and passphrase cache.
Jul 25 01:15:19 pve2 systemd[571984]: Removed slice app.slice - User Application Slice.
Jul 25 01:15:19 pve2 systemd[571984]: Reached target shutdown.target - Shutdown.
Jul 25 01:15:19 pve2 systemd[571984]: Finished systemd-exit.service - Exit the Session.
Jul 25 01:15:19 pve2 systemd[571984]: Reached target exit.target - Exit the Session.
Jul 25 01:15:19 pve2 systemd[1]: user@0.service: Deactivated successfully.
Jul 25 01:15:19 pve2 systemd[1]: Stopped user@0.service - User Manager for UID 0.
Jul 25 01:15:19 pve2 systemd[1]: Stopping user-runtime-dir@0.service - User Runtime Directory /run/user/0...
Jul 25 01:15:19 pve2 systemd[1]: run-user-0.mount: Deactivated successfully.
Jul 25 01:15:19 pve2 systemd[1]: user-runtime-dir@0.service: Deactivated successfully.
Jul 25 01:15:19 pve2 systemd[1]: Stopped user-runtime-dir@0.service - User Runtime Directory /run/user/0.
Jul 25 01:15:19 pve2 systemd[1]: Removed slice user-0.slice - User Slice of UID 0.
Jul 25 01:15:19 pve2 systemd[1]: user-0.slice: Consumed 5.628s CPU time.
Jul 25 01:15:21 pve2 pve-ha-lrm[572287]: Task 'UPID:pve2:0008BB80:qmigrate:101:root@pam:' still active, waiting
Jul 25 01:15:21 pve2 pmxcfs[1061]: [status] notice: received log
Jul 25 01:15:22 pve2 pmxcfs[1061]: [status] notice: received log
Jul 25 01:15:24 pve2 pmxcfs[1061]: [status] notice: received log
Jul 25 01:15:24 pve2 pmxcfs[1061]: [status] notice: received log
Jul 25 01:15:24 pve2 pmxcfs[1061]: [status] notice: received log
Jul 25 01:15:25 pve2 pmxcfs[1061]: [status] notice: received log
Jul 25 01:15:25 pve2 pmxcfs[1061]: [status] notice: received log
Jul 25 01:15:25 pve2 pmxcfs[1061]: [status] notice: received log
Jul 25 01:15:26 pve2 pve-ha-lrm[572287]: Task 'UPID:pve2:0008BB80:qmigrate:101:root@pam:' still active, waiting
Jul 25 01:15:30 pve2 kernel:  zd16: p1 p2 p3 p4 p5 p6 p7 p8
Jul 25 01:15:31 pve2 pve-ha-lrm[572287]: Task 'UPID:pve2:0008BB80:qmigrate:101:root@pam:' still active, waiting
Jul 25 01:15:31 pve2 pve-ha-lrm[572288]: VM 101 qmp command failed - VM 101 qmp command 'block-job-cancel' failed - Job 'drive-scsi0' in state 'concluded' cannot accept command verb 'cancel'
Jul 25 01:15:31 pve2 pve-ha-lrm[572288]: VM 101 qmp command failed - VM 101 qmp command 'block-job-cancel' failed - Block job 'drive-efidisk0' not found
Jul 25 01:15:33 pve2 pmxcfs[1061]: [status] notice: received log
Jul 25 01:15:33 pve2 pmxcfs[1061]: [status] notice: received log
Jul 25 01:15:34 pve2 pve-ha-lrm[572288]: migration problems
Jul 25 01:15:34 pve2 pve-ha-lrm[572287]: <root@pam> end task UPID:pve2:0008BB80:qmigrate:101:root@pam: migration problems
Jul 25 01:15:34 pve2 pve-ha-lrm[572287]: service vm:101 not moved (migration error)
 
I didn't want to parse through past logs to find when it happened, so I just disconnected the primary and forced a failover.
As before, this VM failed to migrate back to the primary when it was brought back online, while all the other VMs and LXCs returned to the primary.
This time, however, it eventually worked: after 4 failed migration attempts, the 5th attempt succeeded without intervention.
Was the failure the same in all 4 attempts? Please share a full migration task log for a failed attempt.
Something is obviously wrong, but this time I didn't have to reboot the 2nd node to force the first one to pick the VM up locally. It does appear to be related to the UEFI disk in the VM, but the replication logs clearly say both the main disk and the UEFI disk (disk0) were replicated successfully each time.
Can you post the output of zfs get all rpool/data/vm-101-disk-0 on both nodes?
For some reason pve2 doesn't have, or doesn't think it has, the UEFI disk0 for the VM.
From the log in your first comment, it rather seems like there is some IO error happening when QEMU tries to complete the drive mirror job, not that the target wasn't found at all.

Jul 25 01:15:31 pve2 pve-ha-lrm[572288]: VM 101 qmp command failed - VM 101 qmp command 'block-job-cancel' failed - Job 'drive-scsi0' in state 'concluded' cannot accept command verb 'cancel'
Jul 25 01:15:31 pve2 pve-ha-lrm[572288]: VM 101 qmp command failed - VM 101 qmp command 'block-job-cancel' failed - Block job 'drive-efidisk0' not found
This just means the mirror job for the disk was already gone. After a failure, Proxmox VE tries to clean up again, just to be sure.
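If it happens again, it would also be interesting to look at the block job state from inside QEMU while the migration is stuck, for example via the monitor (just to gather more information, not a fix):

Code:
# on the node currently running VM 101, open the QEMU human monitor
qm monitor 101
# then, at the qm> prompt, inspect the mirror jobs and their io-status:
#   info block-jobs
#   info status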
 
Can you post the output of zfs get all rpool/data/vm-101-disk-0 on both nodes?
Code:
root@pve1:~# zfs get all rpool/data/vm-101-disk-0
NAME                      PROPERTY              VALUE                     SOURCE
rpool/data/vm-101-disk-0  type                  volume                    -
rpool/data/vm-101-disk-0  creation              Sat Jul 20 17:04 2024     -
rpool/data/vm-101-disk-0  used                  56K                       -
rpool/data/vm-101-disk-0  available             874G                      -
rpool/data/vm-101-disk-0  referenced            56K                       -
rpool/data/vm-101-disk-0  compressratio         1.00x                     -
rpool/data/vm-101-disk-0  reservation           none                      default
rpool/data/vm-101-disk-0  volsize               1M                        local
rpool/data/vm-101-disk-0  volblocksize          16K                       default
rpool/data/vm-101-disk-0  checksum              on                        default
rpool/data/vm-101-disk-0  compression           on                        inherited from rpool
rpool/data/vm-101-disk-0  readonly              off                       default
rpool/data/vm-101-disk-0  createtxg             327                       -
rpool/data/vm-101-disk-0  copies                1                         default
rpool/data/vm-101-disk-0  refreservation        none                      default
rpool/data/vm-101-disk-0  guid                  12720012326769879953      -
rpool/data/vm-101-disk-0  primarycache          all                       default
rpool/data/vm-101-disk-0  secondarycache        all                       default
rpool/data/vm-101-disk-0  usedbysnapshots       0B                        -
rpool/data/vm-101-disk-0  usedbydataset         56K                       -
rpool/data/vm-101-disk-0  usedbychildren        0B                        -
rpool/data/vm-101-disk-0  usedbyrefreservation  0B                        -
rpool/data/vm-101-disk-0  logbias               latency                   default
rpool/data/vm-101-disk-0  objsetid              141                       -
rpool/data/vm-101-disk-0  dedup                 off                       default
rpool/data/vm-101-disk-0  mlslabel              none                      default
rpool/data/vm-101-disk-0  sync                  standard                  inherited from rpool
rpool/data/vm-101-disk-0  refcompressratio      1.00x                     -
rpool/data/vm-101-disk-0  written               0                         -
rpool/data/vm-101-disk-0  logicalused           28K                       -
rpool/data/vm-101-disk-0  logicalreferenced     28K                       -
rpool/data/vm-101-disk-0  volmode               default                   default
rpool/data/vm-101-disk-0  snapshot_limit        none                      default
rpool/data/vm-101-disk-0  snapshot_count        none                      default
rpool/data/vm-101-disk-0  snapdev               hidden                    default
rpool/data/vm-101-disk-0  context               none                      default
rpool/data/vm-101-disk-0  fscontext             none                      default
rpool/data/vm-101-disk-0  defcontext            none                      default
rpool/data/vm-101-disk-0  rootcontext           none                      default
rpool/data/vm-101-disk-0  redundant_metadata    all                       default
rpool/data/vm-101-disk-0  encryption            off                       default
rpool/data/vm-101-disk-0  keylocation           none                      default
rpool/data/vm-101-disk-0  keyformat             none                      default
rpool/data/vm-101-disk-0  pbkdf2iters           0                         default
rpool/data/vm-101-disk-0  snapshots_changed     Thu Jul 25 10:30:10 2024  -
rpool/data/vm-101-disk-0  prefetch              all                       default

Code:
root@pve2:~# zfs get all rpool/data/vm-101-disk-0
NAME                      PROPERTY              VALUE                     SOURCE
rpool/data/vm-101-disk-0  type                  volume                    -
rpool/data/vm-101-disk-0  creation              Sat Jul 20 19:54 2024     -
rpool/data/vm-101-disk-0  used                  56K                       -
rpool/data/vm-101-disk-0  available             878G                      -
rpool/data/vm-101-disk-0  referenced            56K                       -
rpool/data/vm-101-disk-0  compressratio         1.00x                     -
rpool/data/vm-101-disk-0  reservation           none                      default
rpool/data/vm-101-disk-0  volsize               1M                        local
rpool/data/vm-101-disk-0  volblocksize          16K                       default
rpool/data/vm-101-disk-0  checksum              on                        default
rpool/data/vm-101-disk-0  compression           on                        inherited from rpool
rpool/data/vm-101-disk-0  readonly              off                       default
rpool/data/vm-101-disk-0  createtxg             1555                      -
rpool/data/vm-101-disk-0  copies                1                         default
rpool/data/vm-101-disk-0  refreservation        none                      default
rpool/data/vm-101-disk-0  guid                  9213793783124796618       -
rpool/data/vm-101-disk-0  primarycache          all                       default
rpool/data/vm-101-disk-0  secondarycache        all                       default
rpool/data/vm-101-disk-0  usedbysnapshots       0B                        -
rpool/data/vm-101-disk-0  usedbydataset         56K                       -
rpool/data/vm-101-disk-0  usedbychildren        0B                        -
rpool/data/vm-101-disk-0  usedbyrefreservation  0B                        -
rpool/data/vm-101-disk-0  logbias               latency                   default
rpool/data/vm-101-disk-0  objsetid              369                       -
rpool/data/vm-101-disk-0  dedup                 off                       default
rpool/data/vm-101-disk-0  mlslabel              none                      default
rpool/data/vm-101-disk-0  sync                  standard                  inherited from rpool
rpool/data/vm-101-disk-0  refcompressratio      1.00x                     -
rpool/data/vm-101-disk-0  written               0                         -
rpool/data/vm-101-disk-0  logicalused           28K                       -
rpool/data/vm-101-disk-0  logicalreferenced     28K                       -
rpool/data/vm-101-disk-0  volmode               default                   default
rpool/data/vm-101-disk-0  snapshot_limit        none                      default
rpool/data/vm-101-disk-0  snapshot_count        none                      default
rpool/data/vm-101-disk-0  snapdev               hidden                    default
rpool/data/vm-101-disk-0  context               none                      default
rpool/data/vm-101-disk-0  fscontext             none                      default
rpool/data/vm-101-disk-0  defcontext            none                      default
rpool/data/vm-101-disk-0  rootcontext           none                      default
rpool/data/vm-101-disk-0  redundant_metadata    all                       default
rpool/data/vm-101-disk-0  encryption            off                       default
rpool/data/vm-101-disk-0  keylocation           none                      default
rpool/data/vm-101-disk-0  keyformat             none                      default
rpool/data/vm-101-disk-0  pbkdf2iters           0                         default
rpool/data/vm-101-disk-0  snapshots_changed     Thu Jul 25 10:30:11 2024  -
rpool/data/vm-101-disk-0  prefetch              all                       default
 
Was the failure the same in all 4 attempts? Please share a full migration task log for a failed attempt.
Yes, it's always this error:

Code:
2024-07-25 01:15:32 ERROR: online migrate failure - Failed to complete storage migration: block job (mirror) error: drive-efidisk0: Input/output error (io-status: ok)

Code:
task started by HA resource agent
2024-07-25 01:13:36 starting migration of VM 101 to node 'pve1' (10.75.54.40)
2024-07-25 01:13:36 found local, replicated disk 'local-zfs:vm-101-disk-0' (attached)
2024-07-25 01:13:36 found local, replicated disk 'local-zfs:vm-101-disk-1' (attached)
2024-07-25 01:13:36 efidisk0: start tracking writes using block-dirty-bitmap 'repl_efidisk0'
2024-07-25 01:13:36 scsi0: start tracking writes using block-dirty-bitmap 'repl_scsi0'
2024-07-25 01:13:36 replicating disk images
2024-07-25 01:13:36 start replication job
2024-07-25 01:13:36 guest => VM 101, running => 568950
2024-07-25 01:13:36 volumes => local-zfs:vm-101-disk-0,local-zfs:vm-101-disk-1
2024-07-25 01:13:37 freeze guest filesystem
2024-07-25 01:13:37 create snapshot '__replicate_101-0_1721884416__' on local-zfs:vm-101-disk-0
2024-07-25 01:13:37 create snapshot '__replicate_101-0_1721884416__' on local-zfs:vm-101-disk-1
2024-07-25 01:13:37 thaw guest filesystem
2024-07-25 01:13:37 using secure transmission, rate limit: none
2024-07-25 01:13:37 incremental sync 'local-zfs:vm-101-disk-0' (__replicate_101-0_1721883607__ => __replicate_101-0_1721884416__)
2024-07-25 01:13:38 send from @__replicate_101-0_1721883607__ to rpool/data/vm-101-disk-0@__replicate_101-0_1721884416__ estimated size is 624B
2024-07-25 01:13:38 total estimated size is 624B
2024-07-25 01:13:38 TIME        SENT   SNAPSHOT rpool/data/vm-101-disk-0@__replicate_101-0_1721884416__
2024-07-25 01:13:39 successfully imported 'local-zfs:vm-101-disk-0'
2024-07-25 01:13:39 incremental sync 'local-zfs:vm-101-disk-1' (__replicate_101-0_1721883607__ => __replicate_101-0_1721884416__)
2024-07-25 01:13:40 send from @__replicate_101-0_1721883607__ to rpool/data/vm-101-disk-1@__replicate_101-0_1721884416__ estimated size is 78.6M
2024-07-25 01:13:40 total estimated size is 78.6M
2024-07-25 01:13:40 TIME        SENT   SNAPSHOT rpool/data/vm-101-disk-1@__replicate_101-0_1721884416__
2024-07-25 01:13:41 successfully imported 'local-zfs:vm-101-disk-1'
2024-07-25 01:13:41 (remote_finalize_local_job) delete stale replication snapshot '__replicate_101-0_1721883607__' on local-zfs:vm-101-disk-0
2024-07-25 01:13:41 (remote_finalize_local_job) delete stale replication snapshot '__replicate_101-0_1721883607__' on local-zfs:vm-101-disk-1
2024-07-25 01:13:41 end replication job
2024-07-25 01:13:41 starting VM 101 on remote node 'pve1'
2024-07-25 01:13:43 volume 'local-zfs:vm-101-disk-0' is 'local-zfs:vm-101-disk-0' on the target
2024-07-25 01:13:43 volume 'local-zfs:vm-101-disk-1' is 'local-zfs:vm-101-disk-1' on the target
2024-07-25 01:13:43 start remote tunnel
2024-07-25 01:13:44 ssh tunnel ver 1
2024-07-25 01:13:44 starting storage migration
2024-07-25 01:13:44 scsi0: start migration to nbd:unix:/run/qemu-server/101_nbd.migrate:exportname=drive-scsi0
drive mirror re-using dirty bitmap 'repl_scsi0'
drive mirror is starting for drive-scsi0
drive-scsi0: transferred 0.0 B of 16.8 MiB (0.00%) in 0s
drive-scsi0: transferred 16.8 MiB of 16.8 MiB (100.00%) in 1s, ready
all 'mirror' jobs are ready
2024-07-25 01:13:45 efidisk0: start migration to nbd:unix:/run/qemu-server/101_nbd.migrate:exportname=drive-efidisk0
drive mirror re-using dirty bitmap 'repl_efidisk0'
drive mirror is starting for drive-efidisk0
all 'mirror' jobs are ready
2024-07-25 01:13:45 starting online/live migration on unix:/run/qemu-server/101.migrate
2024-07-25 01:13:45 set migration capabilities
2024-07-25 01:13:45 migration downtime limit: 100 ms
2024-07-25 01:13:45 migration cachesize: 512.0 MiB
2024-07-25 01:13:45 set migration parameters
2024-07-25 01:13:45 start migrate command to unix:/run/qemu-server/101.migrate
2024-07-25 01:13:46 migration active, transferred 143.8 MiB of 4.0 GiB VM-state, 158.1 MiB/s
2024-07-25 01:13:47 migration active, transferred 284.4 MiB of 4.0 GiB VM-state, 141.3 MiB/s
2024-07-25 01:13:48 migration active, transferred 425.2 MiB of 4.0 GiB VM-state, 143.6 MiB/s
2024-07-25 01:13:49 migration active, transferred 558.3 MiB of 4.0 GiB VM-state, 250.1 MiB/s
2024-07-25 01:13:50 migration active, transferred 698.8 MiB of 4.0 GiB VM-state, 143.6 MiB/s
2024-07-25 01:13:51 migration active, transferred 839.8 MiB of 4.0 GiB VM-state, 142.3 MiB/s
2024-07-25 01:13:52 migration active, transferred 980.2 MiB of 4.0 GiB VM-state, 138.2 MiB/s
2024-07-25 01:13:53 migration active, transferred 1.1 GiB of 4.0 GiB VM-state, 142.3 MiB/s
2024-07-25 01:13:54 migration active, transferred 1.2 GiB of 4.0 GiB VM-state, 142.6 MiB/s
2024-07-25 01:13:55 migration active, transferred 1.4 GiB of 4.0 GiB VM-state, 165.2 MiB/s
2024-07-25 01:13:57 average migration speed: 343.1 MiB/s - downtime 84 ms
2024-07-25 01:13:57 migration status: completed
all 'mirror' jobs are ready
drive-efidisk0: Completing block job...
drive-efidisk0: Completed successfully.
drive-scsi0: Completing block job...
drive-scsi0: Completed successfully.
drive-scsi0: Cancelling block job
drive-efidisk0: Cancelling block job
WARN: drive-scsi0: Input/output error (io-status: ok)
drive-efidisk0: Done.
drive-scsi0: Done.
2024-07-25 01:13:59 ERROR: online migrate failure - Failed to complete storage migration: block job (mirror) error: drive-efidisk0: Input/output error (io-status: ok)
2024-07-25 01:13:59 aborting phase 2 - cleanup resources
2024-07-25 01:13:59 migrate_cancel
2024-07-25 01:13:59 scsi0: removing block-dirty-bitmap 'repl_scsi0'
2024-07-25 01:13:59 efidisk0: removing block-dirty-bitmap 'repl_efidisk0'
2024-07-25 01:14:01 ERROR: migration finished with problems (duration 00:00:25)
TASK ERROR: migration problems
 
Nothing interesting in the zvol properties, they are identical except dates and IDs.

Could you also share the log for the successful migration for comparison? I wasn't able to reproduce the issue yet locally.
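(For reference, this is roughly how such a comparison can be done directly from one of the nodes; cluster SSH between pve1 and pve2 is assumed:)

Code:
# diff the property dumps of the EFI disk zvol from both nodes
diff <(ssh pve1 zfs get -H -o property,value all rpool/data/vm-101-disk-0) \
     <(ssh pve2 zfs get -H -o property,value all rpool/data/vm-101-disk-0)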
 
Nothing interesting in the zvol properties, they are identical except dates and IDs.

Could you also share the log for the successful migration for comparison? I wasn't able to reproduce the issue yet locally.

Looking through the logs on pve2, I can't see a successful migration. It looks like it just failed and then appeared on the primary. I assume pve1 might have picked it up from local storage? I don't really get it. Before, it tried to migrate 10-20 times before I ended up just rebooting pve2 so it would appear on pve1 again. This time it tried 4-5 times, all failures, but then just showed up on pve1 without a reboot. Every simulated failure (pulling power from pve1) results in this problem.

Code:
Jul 25 01:15:31 pve2 pve-ha-lrm[572288]: VM 101 qmp command failed - VM 101 qmp command 'block-job-cancel' failed - Job 'drive-scsi0' in state 'concluded' cannot accept command verb 'cancel'
Jul 25 01:15:31 pve2 pve-ha-lrm[572288]: VM 101 qmp command failed - VM 101 qmp command 'block-job-cancel' failed - Block job 'drive-efidisk0' not found
Jul 25 01:15:33 pve2 pmxcfs[1061]: [status] notice: received log
Jul 25 01:15:33 pve2 pmxcfs[1061]: [status] notice: received log
Jul 25 01:15:34 pve2 pve-ha-lrm[572288]: migration problems
Jul 25 01:15:34 pve2 pve-ha-lrm[572287]: <root@pam> end task UPID:pve2:0008BB80:qmigrate:101:root@pam: migration problems
Jul 25 01:15:34 pve2 pve-ha-lrm[572287]: service vm:101 not moved (migration error)
Jul 25 01:15:40 pve2 pve-ha-crm[1222]: service 'vm:101' - migration failed (exit code 1)
Jul 25 01:15:40 pve2 pve-ha-crm[1222]: service 'vm:101': state changed from 'migrate' to 'started'  (node = pve2)
Jul 25 01:15:40 pve2 pve-ha-crm[1222]: migrate service 'vm:101' to node 'pve1' (running)
Jul 25 01:15:40 pve2 pve-ha-crm[1222]: service 'vm:101': state changed from 'started' to 'migrate'  (node = pve2, target = pve1)
Jul 25 01:15:45 pve2 pve-ha-lrm[572590]: <root@pam> starting task UPID:pve2:0008BCAF:qmigrate:101:root@pam:
Jul 25 01:15:50 pve2 pve-ha-lrm[572590]: Task 'UPID:pve2:0008BCAF:qmigrate:101:root@pam:' still active, waiting
Jul 25 01:15:52 pve2 pmxcfs[1061]: [status] notice: received log
Jul 25 01:15:53 pve2 pmxcfs[1061]: [status] notice: received log
Jul 25 01:15:55 pve2 pve-ha-lrm[572590]: Task 'UPID:pve2:0008BCAF:qmigrate:101:root@pam:' still active, waiting
Jul 25 01:16:00 pve2 pve-ha-lrm[572590]: Task 'UPID:pve2:0008BCAF:qmigrate:101:root@pam:' still active, waiting
Jul 25 01:16:02 pve2 kernel:  zd16: p1 p2 p3 p4 p5 p6 p7 p8
Jul 25 01:16:05 pve2 pve-ha-lrm[572590]: Task 'UPID:pve2:0008BCAF:qmigrate:101:root@pam:' still active, waiting
Jul 25 01:16:07 pve2 kernel: tap101i0: left allmulticast mode
Jul 25 01:16:07 pve2 kernel: vmbr0: port 2(tap101i0) entered disabled state
Jul 25 01:16:07 pve2 qmeventd[872]: read: Connection reset by peer
Jul 25 01:16:07 pve2 systemd[1]: 101.scope: Deactivated successfully.
Jul 25 01:16:07 pve2 systemd[1]: 101.scope: Consumed 40.643s CPU time.
Jul 25 01:16:08 pve2 pve-ha-lrm[572591]: migration problems
 
Nothing interesting in the zvol properties, they are identical except dates and IDs.

Could you also share the log for the successful migration for comparison? I wasn't able to reproduce the issue yet locally.
If you want to reproduce it, you could use the Proxmox Helper Script to create a Home Assistant OS VM, then try to fail it over and see if it does the same thing.

https://tteck.github.io/Proxmox/#home-assistant-os-vm

My VM was created with this script.
 

Buy now!