ZFS replication & migration fails after migration (CT & VM)

H25E

Hello everybody,

After tinkering a little bit with PBS, I uninstalled it and installed PVE on my second node as well. Then I set up replication between my two nodes. Everything worked well in the end.

Then I tried migrating a CT and a VM to my new node. After that, the web GUI showed that replication had been reconfigured automatically to target the old node, which looked correct, but it stopped working.

When I run it manually I get:
Code:
root@pve02:~# pvesr run --id=104-0 --verbose
start replication job
guest => CT 104, running => 1
volumes => pveS-HDD:subvol-104-nginx
freeze guest filesystem
create snapshot '__replicate_104-0_1671545732__' on pveS-HDD:subvol-104-nginx
thaw guest filesystem
using secure transmission, rate limit: none
incremental sync 'pveS-HDD:subvol-104-nginx' (__replicate_104-0_1671544736__ => __replicate_104-0_1671545732__)
send from @__replicate_104-0_1671544736__ to HDD/subvol-104-nginx@__replicate_104-0_1671545732__ estimated size is 5.44M
total estimated size is 5.44M
Unknown option: snapshot
400 unable to parse option
pvesm import <volume> <format> <filename> [OPTIONS]
warning: cannot send 'HDD/subvol-104-nginx@__replicate_104-0_1671545732__': signal received
cannot send 'HDD/subvol-104-nginx': I/O error
command 'zfs send -Rpv -I __replicate_104-0_1671544736__ -- HDD/subvol-104-nginx@__replicate_104-0_1671545732__' failed: exit code 1
send/receive failed, cleaning up snapshot(s)..
delete previous replication snapshot '__replicate_104-0_1671545732__' on pveS-HDD:subvol-104-nginx
end replication job with error: command 'set -o pipefail && pvesm export pveS-HDD:subvol-104-nginx zfs - -with-snapshots 1 -snapshot __replicate_104-0_1671545732__ -base __replicate_104-0_1671544736__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=proxmox' root@192.168.100.2 -- pvesm import pveS-HDD:subvol-104-nginx zfs - -with-snapshots 1 -snapshot __replicate_104-0_1671545732__ -allow-rename 0' failed: exit code 255
command 'set -o pipefail && pvesm export pveS-HDD:subvol-104-nginx zfs - -with-snapshots 1 -snapshot __replicate_104-0_1671545732__ -base __replicate_104-0_1671544736__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=proxmox' root@192.168.100.2 -- pvesm import pveS-HDD:subvol-104-nginx zfs - -with-snapshots 1 -snapshot __replicate_104-0_1671545732__ -allow-rename 0' failed: exit code 255

Then I deleted all snapshots of the dataset on both nodes, but it still fails:
Code:
root@pve02:~# pvesr run --id=104-0 --verbose
start replication job
guest => CT 104, running => 1
volumes => pveS-HDD:subvol-104-nginx
freeze guest filesystem
create snapshot '__replicate_104-0_1671553394__' on pveS-HDD:subvol-104-nginx
thaw guest filesystem
using secure transmission, rate limit: none
full sync 'pveS-HDD:subvol-104-nginx' (__replicate_104-0_1671553394__)
full send of HDD/subvol-104-nginx@__replicate_104-0_1671553394__ estimated size is 1.55G
total estimated size is 1.55G
Unknown option: snapshot
400 unable to parse option
pvesm import <volume> <format> <filename> [OPTIONS]
warning: cannot send 'HDD/subvol-104-nginx@__replicate_104-0_1671553394__': signal received
cannot send 'HDD/subvol-104-nginx': I/O error
command 'zfs send -Rpv -- HDD/subvol-104-nginx@__replicate_104-0_1671553394__' failed: exit code 1
send/receive failed, cleaning up snapshot(s)..
delete previous replication snapshot '__replicate_104-0_1671553394__' on pveS-HDD:subvol-104-nginx
end replication job with error: command 'set -o pipefail && pvesm export pveS-HDD:subvol-104-nginx zfs - -with-snapshots 1 -snapshot __replicate_104-0_1671553394__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=proxmox' root@192.168.100.2 -- pvesm import pveS-HDD:subvol-104-nginx zfs - -with-snapshots 1 -snapshot __replicate_104-0_1671553394__ -allow-rename 0' failed: exit code 255
command 'set -o pipefail && pvesm export pveS-HDD:subvol-104-nginx zfs - -with-snapshots 1 -snapshot __replicate_104-0_1671553394__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=proxmox' root@192.168.100.2 -- pvesm import pveS-HDD:subvol-104-nginx zfs - -with-snapshots 1 -snapshot __replicate_104-0_1671553394__ -allow-rename 0' failed: exit code 255

Migration from the GUI now fails too:
Code:
2022-12-20 17:34:21 shutdown CT 104
2022-12-20 17:34:24 starting migration of CT 104 to node 'proxmox' (192.168.100.2)
2022-12-20 17:34:24 found local volume 'pveS-HDD:subvol-104-nginx' (in current VM config)
2022-12-20 17:34:24 start replication job
2022-12-20 17:34:24 guest => CT 104, running => 0
2022-12-20 17:34:24 volumes => pveS-HDD:subvol-104-nginx
2022-12-20 17:34:25 create snapshot '__replicate_104-0_1671554064__' on pveS-HDD:subvol-104-nginx
2022-12-20 17:34:25 using secure transmission, rate limit: none
2022-12-20 17:34:25 full sync 'pveS-HDD:subvol-104-nginx' (__replicate_104-0_1671554064__)
2022-12-20 17:34:26 full send of HDD/subvol-104-nginx@__replicate_104-0_1671554064__ estimated size is 1.55G
2022-12-20 17:34:26 total estimated size is 1.55G
2022-12-20 17:34:26 Unknown option: snapshot
2022-12-20 17:34:26 400 unable to parse option
2022-12-20 17:34:26 pvesm import <volume> <format> <filename> [OPTIONS]
2022-12-20 17:34:26 command 'zfs send -Rpv -- HDD/subvol-104-nginx@__replicate_104-0_1671554064__' failed: got signal 13
send/receive failed, cleaning up snapshot(s)..
2022-12-20 17:34:26 delete previous replication snapshot '__replicate_104-0_1671554064__' on pveS-HDD:subvol-104-nginx
2022-12-20 17:34:26 end replication job with error: command 'set -o pipefail && pvesm export pveS-HDD:subvol-104-nginx zfs - -with-snapshots 1 -snapshot __replicate_104-0_1671554064__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=proxmox' root@192.168.100.2 -- pvesm import pveS-HDD:subvol-104-nginx zfs - -with-snapshots 1 -snapshot __replicate_104-0_1671554064__ -allow-rename 0' failed: exit code 255
2022-12-20 17:34:26 ERROR: command 'set -o pipefail && pvesm export pveS-HDD:subvol-104-nginx zfs - -with-snapshots 1 -snapshot __replicate_104-0_1671554064__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=proxmox' root@192.168.100.2 -- pvesm import pveS-HDD:subvol-104-nginx zfs - -with-snapshots 1 -snapshot __replicate_104-0_1671554064__ -allow-rename 0' failed: exit code 255
2022-12-20 17:34:26 aborting phase 1 - cleanup resources
2022-12-20 17:34:26 start final cleanup
2022-12-20 17:34:26 start container on source node
2022-12-20 17:34:27 ERROR: migration aborted (duration 00:00:06): command 'set -o pipefail && pvesm export pveS-HDD:subvol-104-nginx zfs - -with-snapshots 1 -snapshot __replicate_104-0_1671554064__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=proxmox' root@192.168.100.2 -- pvesm import pveS-HDD:subvol-104-nginx zfs - -with-snapshots 1 -snapshot __replicate_104-0_1671554064__ -allow-rename 0' failed: exit code 255
TASK ERROR: migration aborted

So now it is stuck on the new node. The same happened with the VM, except that the VM hung during migration and a hard reset was necessary to make it work again. I don't know if that changes anything.

Also, the old node is on v6.4 (update pending) and the new one is on v7.3.
 
I would strongly recommend upgrading the old node to v7.3 first and then trying again. We cannot guarantee that migration works between nodes with different major versions, particularly when migrating from the newer node to the older one.
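Before retrying, it may help to confirm that both nodes report the same major version. The following is only a minimal sketch: it assumes `pveversion` prints a line starting with `pve-manager/<major>.<minor>-...`, and the helper names `pve_major` and `same_major` are hypothetical, not part of PVE.

```shell
#!/bin/sh
# Hypothetical helper: extract the major version from a pveversion line.
# Assumed input format: "pve-manager/7.3-3/<hash> (running kernel: ...)"
pve_major() {
    printf '%s\n' "$1" | sed -n 's|^pve-manager/\([0-9][0-9]*\)\..*|\1|p'
}

# Succeeds only when both version lines parse and share the same major version.
same_major() {
    a=$(pve_major "$1")
    b=$(pve_major "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

On a real cluster this could be called as `same_major "$(pveversion)" "$(ssh root@192.168.100.2 pveversion)" || echo "major versions differ"`, using the peer address from the logs above.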
 
Hello @shanreich,

You are right, and I noticed that it's specified in the docs too, so it's my fault. I have updated the old node, and now CT migration and offline VM migration work perfectly and neither of them breaks replication.

On the other hand, online VM migration sometimes fails (roughly 50% of the time) with the following message:
Code:
2022-12-21 20:54:36 issuing guest fstrim
2022-12-21 20:54:40 ERROR: fstrim failed - command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=proxmox' root@192.168.100.2 qm guest cmd 111 fstrim' failed: exit code 255
2022-12-21 20:54:42 ERROR: migration finished with problems (duration 00:00:19)

When this happens, the guest is left hung with high CPU usage for an indeterminate amount of time. The only way I have found to make the VM responsive again is to forcefully reset it (SSH does not connect, and the VNC client connects but the terminal is frozen).

From what I have seen, it is not specific to a single VM, and it can happen from node A to B and vice versa.

(Should I create a new thread with a more appropriate title for the new problem?)
 
This is an example, but it has happened with all the VMs I have tested:

Code:
agent: 1,fstrim_cloned_disks=1
balloon: 4096
bootdisk: scsi0
cores: 8
cpu: host
memory: 8192
name: WebApi
net0: virtio=06:92:B6:4E:6C:2E,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: pveS-SSD:vm-110-webapi-system,discard=on,size=100G
scsi1: pveS-SSD:vm-110-webapi-data,discard=on,size=200G
scsihw: virtio-scsi-pci
smbios1: uuid=ef134303-d30a-489d-8d49-82dbfc60cfd2
sockets: 1
vmgenid: 174a001a-cb2f-42f7-830b-c2baa51679a5

Tell me if more info is needed.

EDIT: Could it be the CPU? Both nodes have Intel CPUs, but from different generations. Should I try a generic KVM CPU type?
 
Do you have the Qemu guest agent installed inside the VM? The config is set with fstrim_cloned_disks, so the node will try to issue an fstrim command via qm guest cmd. If Qemu guest agent is not installed properly this will fail. You could either unset this (Select VM > Options > Qemu Guest Agent > Edit) or make sure that the Qemu guest agent is properly installed.
 
as a workaround you could also disable that option and manually trim after migration.. (and it might make sense for PVE to treat a failing trim as non-fatal, since it doesn't actually fail the migration for real..)
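The suggestions above could be scripted roughly as follows. This is only a sketch: the function names and the `DRY_RUN` variable are hypothetical, and it assumes the guest agent is enabled for the VM. `qm guest cmd <vmid> fstrim` is the same call that appears in the migration log; `qm guest cmd <vmid> ping` checks that the agent answers, and `qm set --agent` turns `fstrim_cloned_disks` off.

```shell
#!/bin/sh
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Check that the guest agent inside the VM responds.
agent_ping() {
    run qm guest cmd "$1" ping
}

# Disable the automatic trim after migration/clone for a VM.
# Note: this rewrites the whole agent option string for that VM.
disable_auto_trim() {
    run qm set "$1" --agent "enabled=1,fstrim_cloned_disks=0"
}

# Trim manually once the VM is running on the target node.
manual_trim() {
    run qm guest cmd "$1" fstrim
}
```

Running with `DRY_RUN=1`, for example `DRY_RUN=1 manual_trim 111`, only prints the command, which is a cheap way to review it before touching a VM.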
 
The QEMU guest agent is installed and working inside the VM.

I have disabled the fstrim_cloned_disks option, and now the behaviour is the same as before but without the fstrim error. Migration now always reports success, but half of the time the VM ends up hung. So the fstrim error was a consequence of the VM hanging and not an issue in itself.

I have also changed the cpu type from host to qemu64 and the behaviour hasn't changed.
 
what are the source and target physical CPUs of your PVE nodes?
 
Additionally your kernel version(s) would be interesting: uname -a
 
  • Old node:
    • CPU: Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz
    • uname -a: Linux proxmox 5.15.74-1-pve #1 SMP PVE 5.15.74-1 (Mon, 14 Nov 2022 20:17:15 +0100) x86_64 GNU/Linux
  • New node:
    • CPU: 12th Gen Intel(R) Core(TM) i9-12900K
    • uname -a: Linux pve02 5.15.74-1-pve #1 SMP PVE 5.15.74-1 (Mon, 14 Nov 2022 20:17:15 +0100) x86_64 GNU/Linux
Yesterday it was failing mostly from the old node to the new node, if I remember correctly. Today it's the reverse.
 
Sorry, I'm not a professional Linux admin. Could you point me to some docs about how to do that process?
 
The linked thread should contain the info, the process should be straightforward:

How to install:
  • apt update
  • apt install pve-kernel-5.19
  • reboot
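After the reboot, one way to confirm that the node actually booted the new kernel is to compare `uname -r` against the expected minimum. A minimal sketch, assuming GNU `sort -V` is available; the helper name `kernel_at_least` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if kernel release $1 is at least version $2.
# Uses version sort: if $2 sorts first (or is equal), $1 is new enough.
kernel_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}
```

For example, `kernel_at_least "$(uname -r)" 5.19 && echo "running 5.19 or newer"` would stay silent on the 5.15.74-1-pve kernel reported earlier in the thread.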
 
Oh, I didn't know it was that easy! Rookies gonna rock!

I have updated both nodes and now online VM migration works flawlessly with and without fstrim. Thank you both very much! You are doing an amazing job. I hope one day I can start my own business and buy a license from you.

The only thing is that (obviously) online migration doesn't work with the host CPU type when the nodes have different CPU families. So which generic CPU type should I use, kvm64 or qemu64? I can't find a comparison between them online.
 
@shanreich I have noticed that kernel v6.1 is outside the list of kernels supported by ZFS v2.1.7, the latest release (up to 6.0).

Is my data in danger?

I have installed Zabbix in a CT, and since the update I'm receiving a lot of alerts saying Problem name: sdX: Disk read/write request responses are too high (read > 20 ms for 15m or write > 20 ms for 15m) when the disks are idle or under low load. The problem resolves itself when the load is high. Is this expected?

Should I downgrade to 5.19?

Thanks for your time.
 
Yes, in this case it might be smarter to stay on 5.19. We currently know of some issues with ZFS with regards to the 6.1 kernel - you can find them in the thread I linked you [1]. They shouldn't put your data at risk, but of course we cannot guarantee that 100%. I am not sure whether your particular problem is related to ZFS and kernel 6.1 but checking whether 5.19 fixes your issue is worth a try.
 
I downgraded the kernel to 5.19, so it is now a version supported by OpenZFS. VM migration still works correctly, but the new problem persists too. The Zabbix chart shows that it started just after the kernel upgrade:

[Attached image: 1672155617393.png]

It only happens at low load and only for read requests. If I do if=/dev/sdx... the wait time drops to approximately 2-3 ms. Could it be related to some power-saving setting in the new kernel?