severe performance regression on virtual disk migration for qcow2 on ZFS with 5.15.39-2-pve

thanks for the heads up! :)
 
i have dug into this some more and found that the performance problem goes away when setting "relatime=on" or "atime=off" for the dataset.
[...]
ADDON (after finding out more, see below):
the performance problem also goes away when setting the preallocation policy to "off" on the datastore.
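
For completeness, the two workarounds boil down to something like this (the dataset name "tank/vmdata" and the storage ID "truenas-nfs" are placeholders, adjust to your own setup):

# on the TrueNAS/ZFS side: relax or disable access-time updates on the dataset
zfs set relatime=on tank/vmdata
# or, more aggressively:
zfs set atime=off tank/vmdata

# on the PVE side: disable qcow2 preallocation for the file-based storage
# (equivalent to adding "preallocation off" to the entry in /etc/pve/storage.cfg)
pvesm set truenas-nfs --preallocation off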
Many, many, many thanks for that one!
I had been going mad for a week...
I made a mistake last week by doing two different upgrades at the same time: I upgraded PVE 7 to 8 and TrueNAS 12 to 13. Everything went ok (except a loss of connectivity with ifupdown2; perhaps I forgot to read some release notes, my bad).
But the nightly backup job was hell: in the morning some VMs were stuck with IO speeds from the 9600-baud modem era. For example, a VM that took 25 minutes to back up on PVE 7 took 1h48 the day after on PVE 8!
I have two PVE clusters and 3 TrueNAS boxes sharing NFS for qcow2 files, and it was a real mystery...
For 2 days I tried many things without success: mitigations=off, iothread=1, virtio scsi single, even MTU tweaking or cross-mounting between the 2 clusters and the 3 TrueNAS boxes, but none of the tests gave logical results.
iperf3 gives max bandwidth, and with dd on the PVE host or inside a VM the bandwidth is at its maximum, but moving a disk (offline) never ends.
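
The sanity checks mentioned here were roughly of this kind (the server address 192.168.1.10 and the mount path /mnt/pve/truenas-nfs are placeholders, not the exact commands used):

# raw network throughput between the PVE node and the TrueNAS box
iperf3 -c 192.168.1.10

# sequential write throughput onto the NFS-backed storage mount
# (flush at the end so the result isn't just page cache)
dd if=/dev/zero of=/mnt/pve/truenas-nfs/ddtest.bin bs=1M count=4096 conv=fdatasync

# sequential read back, bypassing the client page cache
dd if=/mnt/pve/truenas-nfs/ddtest.bin of=/dev/null bs=1M iflag=direct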
I was moving a 64 Gb disk when I found your post: in 5 hours it had moved 33%; I went onto the TrueNAS, set atime=off, and the remaining 67% moved in a few minutes!
I added the preallocation off option on all my NFS datastores and launched a backup, which seems to run much faster than before.
I am not sure that all regressions are gone, but it is far better for the moment... I'll check carefully over the next days.
 
what zfs version does your truenas have?
It is TrueNAS 13.0-U6.1:
zfs-2.1.14-1
zfs-kmod-v2023120100-zfs_f4871096b
But I have not activated the new features yet.
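
In case someone wants to check the same thing on their pools, a quick sketch (the pool name "tank" is just an example):

# list pools whose on-disk format still has supported features disabled
zpool upgrade

# show the state of the individual feature flags for one pool
zpool get all tank | grep feature@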

I checked the backups and there are still slowness issues compared to before, plus many weird results:

VM  | Node   | OS       | Hardware / qemu-kvm version | Virtio | Guest Agent | Disk size | Cache              | Backup on PVE v7 (time / size) | Backup on PVE v8 (time / size)
----|--------|----------|-----------------------------|--------|-------------|-----------|--------------------|--------------------------------|-------------------------------
117 | pve2-1 | Deb10    | i440fx 3.1.0                |        |             | 16 Gb     | writethrough       | 00:10:54 / 6.28 GB             | 00:34:44 / 6.21 GB
207 | pve2-1 | Win2k19  | i440fx-5.2                  | 21500  | 102.10.0    | 512 Gb    | writethrough       | 00:27:28 / 35.50 GB            | 01:19:53 / 36.11 GB
253 | pve2-1 | W10/22H2 | i440fx-7.2                  | 22900  | 105.00.2    | 64 Gb     | writethrough       | 00:09:14 / 16.49 GB            | 04:43:19 / 16.26 GB
101 | pve2-2 | Deb12    | i440fx 5.2.0                |        |             | 32 Gb     | writethrough       | 00:11:44 / 4.23 GB             | 00:06:15 / 4.11 GB
107 | pve2-2 | Deb9     | i440fx 3.1.0                |        |             | 32 Gb     | writethrough       | 00:02:45 / 0.82 GB             | 01:45:32 / 0.87 GB
146 | pve2-2 | Win2k19  | i440fx-5.2                  | 22900  | 105.00.2    | 128 Gb    | writeback          | 00:26:17 / 21.26 GB            | 00:55:37 / 21.55 GB
210 | pve2-2 | Deb10    | i440fx                      |        |             | 16 Gb     | writethrough       | 00:02:16 / 0.72 GB             | 00:02:05 / 0.72 GB
231 | pve2-2 | Deb11    | i440fx                      |        |             | 16 Gb     | writethrough       | 00:03:31 / 1.05 GB             | 01:09:50 / 1.13 GB
220 | pve2-3 | Win2k19  | i440fx-5.1                  | 21500  | 102.10.0    | 256 Gb    | default (no cache) | 00:17:07 / 19.86 GB            | 00:09:37 / 20.22 GB
105 | pve1-1 | Deb10    | i440fx 3.1.0                |        |             | 32 Gb     | writethrough       | 00:06:47 / 2.30 GB             | 02:23:23 / 2.28 GB
132 | pve1-4 | Win2k19  | i440fx-5.1                  | 21500  | 102.10.0    | 64 Gb     | default (no cache) | 00:14:31 / 12.03 GB            | 02:41:26 / 12.59 GB
133 | pve1-4 | Win2k19  | i440fx-5.1                  | 21500  | 102.10.0    | 64 Gb     | default (no cache) | 00:12:58 / 12.14 GB            | 00:08:09 / 12.67 GB
 
i have observed a similar performance issue on zfs shared via samba; unfortunately i'm not yet able to reproduce it.

when live migrating a qcow2 virtual disk hosted on a zfs/samba share to a local ssd, i observed pathological slowness and saw lots of write IOPS on the source (!), whereas i would not expect any write IOPS for this at all. unlike before, it did not go away by disabling atime.
i had seen similar slowness during backups when the vdisk was on the zfs/samba share. i'm using that share to clear up local ssd space when a VM is obsolete or not used for a longer time.
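
a quick way to watch those unexpected writes on the source side is zpool iostat (the pool name "tank" is just a placeholder):

# per-second read/write operations and bandwidth on the source pool;
# -v breaks it down per vdev, -y skips the since-boot summary report
zpool iostat -vy tank 1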

the read performance during live migration was well below 1 MB/s.

i stopped the migration and instead tried to migrate the virtual disk offline, which was surprisingly fast and worked without a problem.
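
for reference, the offline move can also be done on the CLI; a sketch, with VM ID, disk name and target storage as placeholders:

# move scsi0 of VM 100 to the storage "local-ssd" while the VM is shut down
# (older PVE 7 releases call this "qm move-disk"; it still works as an alias)
qm disk move 100 scsi0 local-ssd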

then i migrated the vdisk back to the samba share, which also performed well, and tried again out of curiosity, but the problem was gone and isn't reproducible.

very very weird....
 

that has since been isolated as a samba problem and was resolved a while ago, see https://github.com/openzfs/zfs/issues/16490