ZFS Live Migration (Windows IO Blocked)

mjoconr

Hi All,
Does anyone know of a reason why a Windows 2012R2 guest using close to the latest virtio drivers would have blocked IO after a live migration between two instances running ZFS storage (not shared)?
Migrating it back caused the blocked IO to clear up.

We did get the following error:
------
2022-10-19 06:20:10 stopping NBD storage migration server on target.
2022-10-19 06:20:11 issuing guest fstrim
2022-10-19 06:20:15 ERROR: fstrim failed - command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=proxmox1' root@10.1.5.1 qm guest cmd 100 fstrim' failed: exit code 255
2022-10-19 06:20:32 ERROR: migration finished with problems (duration 01:08:31)
TASK ERROR: migration problems
----

Looking for pointers as to where I should look.

Thanks
 
Hi,
did you get the error both times? Does issuing qm guest cmd 100 fstrim manually work? Please share the output of pveversion -v from both servers, qm config 100 and the full migration log.

You could also check /var/log/syslog for any related messages around the time of the migration.
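
The syslog check suggested above can be narrowed to one VM and one time window. The following is a minimal sketch, not an official Proxmox tool: the function name `filter_vm_log` and its arguments are made up for illustration, and it assumes the traditional syslog timestamp prefix (for example `Oct 19 06:`) at the start of each line.

```shell
# Hypothetical helper: print log lines that mention a given VM ID and
# start with a given timestamp prefix.
# Usage: filter_vm_log <logfile> <time-prefix> <vmid>
filter_vm_log() {
    logfile=$1
    prefix=$2
    vmid=$3
    # First grep: fixed-string match on "VM <id> ".
    # Second grep: keep only lines that begin with the timestamp prefix.
    grep -F -- "VM $vmid " "$logfile" | grep -- "^$prefix"
}

# On a Proxmox VE node, alongside the commands requested above:
#   qm guest cmd 100 fstrim
#   pveversion -v
#   qm config 100
#   filter_vm_log /var/log/syslog "Oct 19 06:" 100
```

The time prefix is passed to `grep` as a regular expression, so a prefix containing characters such as `.` or `*` would need escaping.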
 
Code:
root@proxmox1:~# qm config 100
agent: 1,fstrim_cloned_disks=1
balloon: 0
bootdisk: virtio0
cores: 8
ide0: none,media=cdrom
memory: 87125
name: 2012R2-RDS
net0: virtio=AE:21:DA:C4:B3:C3,bridge=vmbr0
numa: 0
onboot: 1
ostype: win8
scsihw: virtio-scsi-single
sockets: 2
virtio0: rbd:vm-100-disk-0,cache=writeback,discard=on,format=raw,iothread=1,size=500G
virtio1: zbackup:vm-100-disk-0,backup=0,cache=writeback,discard=on,format=raw,iothread=1,size=250G


pve1
Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.39-1-pve)
pve-manager: 7.2-11 (running version: 7.2-11/b76d3178)
pve-kernel-helper: 7.2-13
pve-kernel-5.15: 7.2-12
pve-kernel-5.4: 6.4-19
pve-kernel-5.15.60-2-pve: 5.15.60-2
pve-kernel-5.15.39-1-pve: 5.15.39-1
pve-kernel-5.4.195-1-pve: 5.4.195-1
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 14.2.21-1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-3
libpve-guest-common-perl: 4.1-3
libpve-http-server-perl: 4.1-4
libpve-storage-perl: 7.2-10
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.7-1
proxmox-backup-file-restore: 2.2.7-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-4
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.6-pve1

pve2
Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.60-2-pve)
pve-manager: 7.2-11 (running version: 7.2-11/b76d3178)
pve-kernel-helper: 7.2-13
pve-kernel-5.15: 7.2-12
pve-kernel-5.15.60-2-pve: 5.15.60-2
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 15.2.16-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-3
libpve-guest-common-perl: 4.1-3
libpve-http-server-perl: 4.1-4
libpve-storage-perl: 7.2-10
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.7-1
proxmox-backup-file-restore: 2.2.7-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-4
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.6-pve1

fstrim fails
Code:
root@proxmox1:~# qm guest cmd 100 fstrim
VM 100 qmp command 'guest-fstrim' failed - got timeout
root@proxmox1:~#
But the agent seems to be working, because IP addresses are reported:
Screen Shot 2022-10-19 at 6.26.24 pm.png

I do not see how to
 

Attachments

  • pve2topve1.log
    18.7 KB
I do not know how to get hold of older logs (from late yesterday); they're not showing in the GUI.
 
The timeout for guest fstrim was increased recently, but it has not been packaged yet. It will be included in the next version, i.e. qemu-server >= 7.2-5.
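
To see whether a node already has the package with the longer timeout, the installed version can be compared against 7.2-5. This is a minimal sketch, assuming GNU `sort -V` is available; `version_at_least` is a made-up helper name, and `sort -V` ordering only approximates Debian's version comparison (it is adequate for simple versions like these).

```shell
# Hypothetical helper: succeed if version $1 is >= version $2,
# using GNU sort's version ordering (an approximation of dpkg's rules).
version_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# On a Proxmox VE node:
#   installed=$(dpkg-query -W -f='${Version}' qemu-server)
#   version_at_least "$installed" 7.2-5 && echo "longer fstrim timeout included"
```

On the node itself, `dpkg --compare-versions "$installed" ge 7.2-5` performs the exact Debian comparison and is the safer choice when versions contain suffixes such as `~bpo11+1`.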

My guess would be that the fstrim command continued running in the VM, and that it was simply overloaded a bit, but it's just a guess.

I do not know how to get hold of older logs (from late yesterday); they're not showing in the GUI.
There are Task History panels for each guest and each node. Or do you mean syslog? That one is rotated to /var/log/syslog.1 etc.
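
Because older entries move into rotated files, a search has to cover those as well. A minimal sketch, assuming the usual logrotate naming (`syslog`, `syslog.1`, `syslog.2.gz`, and so on) and GNU gzip, whose `-cdf` options copy uncompressed files through unchanged; `search_rotated` is a hypothetical helper name.

```shell
# Hypothetical helper: search a log file and all of its rotated copies
# for a fixed string.
# Usage: search_rotated <base-logfile> <fixed-string>
search_rotated() {
    base=$1
    needle=$2
    for f in "$base" "$base".*; do
        [ -e "$f" ] || continue   # skip the literal glob if nothing matched
        # gzip -cdf decompresses .gz files and passes plain files through as-is.
        gzip -cdf -- "$f" | grep -F -- "$needle"
    done
}

# On a Proxmox VE node:
#   search_rotated /var/log/syslog "guest-ping"
```

Files are visited in the shell's glob order, so the output is not strictly chronological.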
 
Extra logs
 

Attachments

  • Proxmox1.log
    35.4 KB
  • Proxmox2.log
    55.4 KB
The increased timeout would avoid the error message, and migration would wait longer for the fstrim, but if the VM/storage was overloaded by the trim command and/or the migration, it wouldn't avoid the actual issue.

So the migration finished at
2022-10-19 06:20:32 ERROR: migration finished with problems (duration 01:08:31)
but the errors about guest-ping failing only started about 15 minutes later:
/var/log/syslog:Oct 19 06:34:15 proxmox1 pvedaemon[1750400]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout

Did you see the VM's IO getting stuck directly after migration? The log seems to indicate a bit of a delay.
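
The gap between the two log timestamps can be computed directly. A small sketch, assuming GNU `date` (for `-d`); `seconds_between` is a made-up helper, the year is taken from the migration log, and both timestamps are read as UTC purely to sidestep daylight-saving jumps, which does not change the difference here.

```shell
# Hypothetical helper: seconds elapsed from timestamp $1 to timestamp $2.
seconds_between() {
    start=$(date -u -d "$1" +%s)
    end=$(date -u -d "$2" +%s)
    echo $((end - start))
}

# Migration finished vs. first guest-ping timeout (timestamps from the logs above):
seconds_between "2022-10-19 06:20:32" "2022-10-19 06:34:15"   # prints 823 (13 min 43 s)
```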


Regarding some other errors in the syslog:
parse error in '/etc/pve/datacenter.cfg' - 'migration': invalid format - format error
migration.type: property is missing and it is not optional
Example of how it can look: migration: secure,network=10.10.50.0/24
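
As a rough pre-check before editing /etc/pve/datacenter.cfg, a candidate line can be tested against the shape of the example. This is a simplified sketch and not Proxmox's own parser: `check_migration_line` is a hypothetical helper that only accepts the type first, optionally followed by an IPv4 CIDR network, which is narrower than what Proxmox VE itself accepts.

```shell
# Hypothetical helper: succeed if $1 looks like
#   migration: <secure|insecure>[,network=<IPv4 CIDR>]
# (deliberately narrower than the real datacenter.cfg format).
check_migration_line() {
    printf '%s\n' "$1" |
        grep -Eq '^migration: (secure|insecure)(,network=[0-9]{1,3}(\.[0-9]{1,3}){3}/[0-9]{1,2})?$'
}

check_migration_line "migration: secure,network=10.10.50.0/24" && echo "looks valid"
check_migration_line "migration: network=10.10.50.0/24" || echo "type is missing"
```

The second call mirrors the "migration.type: property is missing" error: a line that gives only a network and no type is rejected.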

zfs error: cannot open 'zbackup': no such pool
If the storage is not available on that node, you can restrict it to specific nodes using the nodes property in the storage configuration (or do so via the GUI).
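
Before restricting the storage, it can help to confirm which nodes actually have the pool. A minimal sketch: `pool_in_list` is a hypothetical helper that reads pool names on standard input, one per line, which is the shape of `zpool list -H -o name` output. The storage ID `zbackup` is assumed from the VM config earlier in the thread, and the node name is left as a placeholder.

```shell
# Hypothetical helper: succeed if pool name $1 appears on stdin
# as a whole line (exact match, not a substring).
pool_in_list() {
    grep -Fxq -- "$1"
}

# On each node:
#   zpool list -H -o name | pool_in_list zbackup || echo "zbackup missing on $(hostname)"
# Then restrict the storage to the nodes that have the pool, for example:
#   pvesm set zbackup --nodes <node-with-the-pool>
```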
 
Did you see the VM's IO getting stuck directly after migration? The log seems to indicate a bit of a delay.
Sorry, I'm not sure; I do not have a login for the VM. When it was first transferred, the agent was not connected. A little later it did connect, and later still, when the person who owns it logged in, they were getting a number of warnings about lagged IO.
Example of how it can look: migration: secure,network=10.10.50.0/24
Yes, it took me a while to get the config setting correct earlier in the day. The config using secure did not seem to work, so I tried setting insecure, and a transfer of a Linux VM worked.
Should I try secure again?
If the storage is not available on that node, you can restrict it to specific nodes using the nodes property in the storage configuration (or do so via the GUI).
This is a new second box, and I had gotten the name wrong for the second pool. I fixed this before the Windows migration was attempted. An earlier migration of a Linux VM did work before the pool name was fixed.
 
