severe performance regression on virtual disk migration for qcow2 on ZFS with 5.15.39-2-pve

RolandK · Mar 4, 2023

NOTE:
old title of this thread is "weird disk write i/o pattern on source disks when moving virtual disk".
it has been adjusted to match the bugzilla ticket .it's about a performance regression finding which seems to affect virtual machines with qcow2 when those are stored on zfs. it was found/tested with local zfs. currently it's unknown, if this is reproducable with remote/nfs attached zfs storage. maybe someone has a setup to test this. please report the results.

hello,

i made some really weird observation i cannot explain to myself.

i moved some vm's from an old 6.4 cluster to a new 7.3 cluster. (copy via scp)

the virtual machines have qcow2 disks and are located on zfs datasets.

after moving the virtual machine to the new cluster, i moved one of the vm's virtual disk from a hdd zfs dataset to an ssd zfs dataset with "move disk" via webgui (while vm was running but same applies when vm is offline and qemu-img is being used instead of drive mirror)

the disk move is totally slow , about 5-20Mb/s

the weird thing is, that it is slow because apparently there are high WRITE IOPS on the source dataset, which bring the ordinary harddisk to their iops limit.

there is nothing other on that dataset - only the virtual machine , which is completely idle inside.

it's the disk move causing the high write IOPS on the SOURCE dataset, but i can't determine what's being written there and who's writing - and why.

from my understanding, there should be reads on the source and writes on the target dataset.

does qemu-img or qemu drive mirror issue writes to the source virtual disk file when being moved ?

does anybody have a clue what's happening here and why the problem goes away after the first move ?

i can move the virtual disk forth and back and it's always fast and i don't see the high write iops anymore.

only the first move is slow

Code:

                                                         capacity     operations     bandwidth
pool                                                   alloc   free   read  write   read  write
-----------------------------------------------------  -----  -----  -----  -----  -----  -----
hddpool                                                 444G   668G     30  1.04K<-! 918K  6.56M
  mirror-0                                              222G   334G     25    536   775K  2.85M
    scsi-35000cca0561119d4                                 -      -     10    271   319K  1.43M
    scsi-35000cca05601fd28                                 -      -     14    265   456K  1.42M
  mirror-1                                              222G   334G      4    534   143K  3.71M
    scsi-35000cca05601fad8                                 -      -      4    263   143K  1.85M
    scsi-35000cca043dae6dc                                 -      -      0    271      0  1.86M
-----------------------------------------------------  -----  -----  -----  -----  -----  -----
rpool                                                  2.67G   106G      0      0      0      0
  mirror-0                                             2.67G   106G      0      0      0      0
    ata-INTEL_SSDSC2BB120G6R_PHWA6384053K120CGN-part3      -      -      0      0      0      0
    ata-INTEL_SSDSC2BB120G4_PHWL442300A2120LGN-part3       -      -      0      0      0      0
-----------------------------------------------------  -----  -----  -----  -----  -----  -----
ssdpool                                                 277G   611G      0    440      0  20.2M
  mirror-0                                              277G   611G      0    440      0  20.2M
    sdd                                                    -      -      0    216      0  10.1M
    sdc                                                    -      -      0    224      0  10.1M
-----------------------------------------------------  -----  -----  -----  -----  -----  -----

slow
https://pastebin.com/VeXeJdnz

fast:
https://pastebin.com/hrLhnFhQ

Code:

# cat /etc/pve/qemu-server/223.conf
agent: 1
boot: order=ide2;scsi0
cores: 4
cpu: host
ide2: none,media=cdrom
memory: 8192
name: gitlab
net0: virtio=72:A6:BD:68:E4:3A,bridge=vmbr1,firewall=1,tag=23
numa: 0
onboot: 1
ostype: l26
scsi0: vms-qcow2-ssdpool:223/vm-223-disk-0.qcow2,aio=threads,discard=on,iothread=1,size=40G
scsi1: vms-qcow2-ssdpool:223/vm-223-disk-1.qcow2,aio=threads,discard=on,iothread=1,size=50G
scsi2: vms-qcow2-hddpool:223/vm-223-disk-1.qcow2,aio=threads,discard=on,iothread=1,size=300G
scsihw: virtio-scsi-single
smbios1: uuid=0df9d070-b1a6-4fa5-8512-0ddf8673fe87
sockets: 1
tablet: 0
tags: centos7
vmgenid: 509ea882-dabf-4394-8477-06aaa931da1b

RolandK · Mar 4, 2023

apparently, there is something about my observation....

https://forum.proxmox.com/threads/l...seems-to-be-rewriting-the-source-disk.121798/

RolandK · Mar 4, 2023

Code:

root@pve2:/hddpool/qcow2# cat /etc/pve/storage.cfg
dir: vms-qcow2-ssdpool
    path /ssdpool/qcow2
    content images
    nodes pve3,pve5,pve1,pve4,pve6,pve2
    prune-backups keep-all=1
    shared 0

dir: vms-qcow2-hddpool
    path /hddpool/qcow2
    content images
    nodes pve1,pve6,pve4,pve2,pve3,pve5
    prune-backups keep-all=1
    shared 0

<snip>

it has nothing to do with the VM copy from the old to the new cluster

i hot added a new virtual disk to an existing ubuntu VM and did not initialize it from inside the VM, so it's virgin one.

live migrating this disk from zfs hdd pool/dataset to zfs ssd pool/dataset shows the same i/o behaviour as before - write iops nearly up to the limit of the rotating harddisks, when there should only be reads.

moving the qcow2 file on command line with "mv" command is fast/normal. no write iops on the source observed.

Code:

                                                         capacity     operations     bandwidth
pool                                                   alloc   free   read  write   read  write
-----------------------------------------------------  -----  -----  -----  -----  -----  -----
hddpool                                                48.9M   556G      0  1.10K <-!  0  5.51M
  mirror-0                                             48.9M   556G      0  1.10K      0  5.51M
    scsi-35000cca043d57c28                                 -      -      0    565      0  2.75M
    scsi-35000cca043da1740                                 -      -      0    563      0  2.77M
-----------------------------------------------------  -----  -----  -----  -----  -----  -----
rpool                                                  2.45G   107G      0      0      0      0
  mirror-0                                             2.45G   107G      0      0      0      0
    ata-INTEL_SSDSC2BB120G4_PHWL4423009V120LGN-part3       -      -      0      0      0      0
    ata-INTEL_SSDSC2BB120G6R_PHWA640600U8120CGN-part3      -      -      0      0      0      0
-----------------------------------------------------  -----  -----  -----  -----  -----  -----
ssdpool                                                 351G   537G      0      0      0      0
  mirror-0                                              351G   537G      0      0      0      0
    sdd                                                    -      -      0      0      0      0
    sdc                                                    -      -      0      0      0      0
-----------------------------------------------------  -----  -----  -----  -----  -----  -----

Code:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.54    0.00    2.12    4.84    0.00   91.50

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
sda              2.00      0.00     0.00   0.00    0.00     0.00   96.00   1516.00     0.00   0.00    0.10    15.79    0.00      0.00     0.00   0.00    0.00     0.00    2.00    0.00    0.01   1.20
sdb              2.00      0.00     0.00   0.00    0.00     0.00   99.00   1516.00     1.00   1.00    0.09    15.31    0.00      0.00     0.00   0.00    0.00     0.00    2.00    0.50    0.01   1.20
sdc              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
sdd              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
sde             12.00      0.00     0.00   0.00   33.75     0.00  532.00   2507.00     0.00   0.00    1.25     4.71    0.00      0.00     0.00   0.00    0.00     0.00   12.00   33.83    1.48  85.60
sdf             12.00      0.00     0.00   0.00   29.17     0.00  531.00   2502.00     0.00   0.00    1.40     4.71    0.00      0.00     0.00   0.00    0.00     0.00   12.00   29.08    1.44  86.40

RolandK · Mar 4, 2023

the problem only seems to happen with qcow2 but not with raw or vmdk virtual disk

RolandK · Mar 4, 2023

while trying to create a repro-case script to show the problem, i made another weird observation.

can someone explain why qemu-img info shows the virtual disk with different "disk size" on each invocation? ( disk size = "How much space the image file occupies on the host file system" )

this only seems to happen on zfs but not on ext4

Code:

root@dell-r620:~# while true;do /usr/bin/qemu-img create -o preallocation=metadata -f qcow2 /hddpool/disk.qcow2 33554432K;sync; qemu-img info /hddpool/disk.qcow2;rm -rf /hddpool/disk.qcow2;done |grep "disk size"
disk size: 1.6 MiB
disk size: 512 B
disk size: 1.12 MiB
disk size: 512 B
disk size: 594 KiB
disk size: 512 B
disk size: 177 KiB
disk size: 2.01 MiB
disk size: 512 B
disk size: 1.6 MiB
disk size: 512 B
disk size: 1.12 MiB
disk size: 512 B
disk size: 702 KiB
disk size: 512 B
disk size: 212 KiB
disk size: 1.9 MiB
disk size: 512 B
disk size: 1.46 MiB
^C


root@dell-r620:~# while true;do /usr/bin/qemu-img create -o preallocation=metadata -f qcow2 /ext4pool/disk.qcow2 33554432K;sync; qemu-img info /ext4pool/disk.qcow2;rm -rf /ext4pool/disk.qcow2;done |grep "disk size"
disk size: 5.2 MiB
disk size: 5.2 MiB
disk size: 5.2 MiB
disk size: 5.2 MiB
disk size: 5.2 MiB
disk size: 5.2 MiB
disk size: 5.2 MiB
disk size: 5.2 MiB
disk size: 5.2 MiB
disk size: 5.2 MiB
^C


root@dell-r620:~# while true;do /usr/bin/qemu-img create -o preallocation=metadata -f qcow2 /nvmepool/disk.qcow2 33554432K;sync; qemu-img info /nvmepool/disk.qcow2;rm -rf /nvmepool/disk.qcow2;done | grep "disk size"
disk size: 416 KiB
disk size: 512 B
disk size: 512 B
disk size: 512 B
disk size: 512 B
disk size: 512 B
disk size: 512 B
disk size: 512 B
disk size: 512 B
disk size: 512 B
disk size: 512 B
disk size: 512 B
disk size: 512 B
disk size: 512 B
disk size: 512 B
disk size: 512 B
disk size: 512 B
disk size: 512 B
disk size: 592 KiB
disk size: 512 B
disk size: 512 B
disk size: 512 B
disk size: 512 B
disk size: 512 B
disk size: 512 B
disk size: 512 B
disk size: 512 B
disk size: 512 B

RolandK · Mar 4, 2023

here is the repo-case script

Code:

#!/bin/bash
hddfile="/hddpool/disk.qcow2"
nvmefile="/nvmepool/disk.qcow2"

rm -rf $hddfile
rm -rf $nvmefile

/usr/bin/qemu-img create -o preallocation=metadata -f qcow2 $hddfile 33554432K
echo 3 >/proc/sys/vm/drop_caches

# hddpool -> nvmepool slow
time /usr/bin/qemu-img convert -p -f qcow2 -O qcow2 $hddfile $nvmefile

rm -rf $hddfile
echo 3 >/proc/sys/vm/drop_caches

# nvmepool -> hddpool fast
time /usr/bin/qemu-img convert -p -f qcow2 -O qcow2 $nvmefile $hddfile

rm -rf $nvmefile
echo 3 >/proc/sys/vm/drop_caches

# hddpool -> nvmepool fast
time /usr/bin/qemu-img convert -p -f qcow2 -O qcow2 $hddfile $nvmefile

we can see, the initial move of the freshly created qcow2 file is slow. that's where the write iops being observed on the source.

subsequent moves of the file are fast.

it looks that "qemu-img convert" does change the qcow2 file in a special way , which makes it behave different from the initial version created with "qemu-img create"

i could not find out what's the difference

Code:

root@dell-r620:/hddpool# /root/test.sh
Formatting '/hddpool/disk.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off preallocation=metadata compression_type=zlib size=34359738368 lazy_refcounts=off refcount_bits=16
    (100.00/100%)

real    0m22.993s  <- !!!
user    0m0.051s
sys    0m0.075s
    (100.00/100%)

real    0m0.083s
user    0m0.018s
sys    0m0.017s
    (100.00/100%)

real    0m0.059s
user    0m0.026s
sys    0m0.033s

22.993s vs 0.059s is a slowdown by a factor of 389

RolandK · Mar 4, 2023

i have dug into this some more and found that the performance-problem goes away when setting "relatime=on" or "atime=off" for the dataset.

apparently, the initial conversion of the qcow2 file is causing rapid atime change, which is causing the slowness.

questions:
- why does this only happen on the initial conversion run, but not on subsequent ones ?

i did compare the source file with md5 before and after conversion, it's the same as before, so it seems there are no writes to the source file being issued

ADDON (after finding out more , see below) :
the performance problem also goes away when setting preallocation-policy to "off" on the datastore.

RolandK · Mar 4, 2023

i did strace the "qemu-img convert" commands and the slow one is issuing a lot of lseek an the source file. not sure if lseek is causing atime change, though.

apparently, the "qemu-img convert" modifies the qcow2 image in a way that lseek does not happen anymore on subsequent runs.

i found this nice tool at https://github.com/zhangyoujia/qcow2-dump and had a closer look

There is significant difference in L1 table and in refcount table.

not sure, but maybe "qemu-img convert -n" does not preserve preallocated metadata from qemu-img create -o preallocation=metadata , so moving a metadata preallocated disk de-allocates the preallocated metadata!?

when qcow2 is created via proxmox gui, then preallocation=metadata is being set

maybe "--target-is-zero" is missing for the conversion/disk-move ? ( https://bugzilla.redhat.com/show_bug.cgi?id=1845518 )
at least, i don't see the advantage of virtual disk being createed with metadata being allocated , when it's removed again on disk move.

Code:

<snipp>

disk_size: 395776 / 0.38M / 0.00G                                         |     disk_size: 512 / 0.00M / 0.00G

<snipp>
Active Snapshot:                                                                Active Snapshot:
----------------------------------------------------------------------          ----------------------------------------------------------------------
L1 Table:       [offset: 0x30000, len: 1]                                       L1 Table:       [offset: 0x30000, len: 1]

l1 table[   0], l2 offset: 0x40000                                              l1 table[   0], l2 offset: 0x40000
        l2 table[   0], data offset: 0x50000 | vdisk offset: 0x0          |             l2 table[   0], data offset: 0x0 | vdisk offset: 0x0
        l2 table[   1], data offset: 0x60000 | vdisk offset: 0x10000      |             l2 table[   1], data offset: 0x0 | vdisk offset: 0x10000
        l2 table[   2], data offset: 0x70000 | vdisk offset: 0x20000      |             l2 table[   2], data offset: 0x0 | vdisk offset: 0x20000
        l2 table[   3], data offset: 0x80000 | vdisk offset: 0x30000      |             l2 table[   3], data offset: 0x0 | vdisk offset: 0x30000
        l2 table[   4], data offset: 0x90000 | vdisk offset: 0x40000      |             l2 table[   4], data offset: 0x0 | vdisk offset: 0x40000
        l2 table[   5], data offset: 0xa0000 | vdisk offset: 0x50000      |             l2 table[   5], data offset: 0x0 | vdisk offset: 0x50000
        l2 table[   6], data offset: 0xb0000 | vdisk offset: 0x60000      |             l2 table[   6], data offset: 0x0 | vdisk offset: 0x60000
        l2 table[   7], data offset: 0xc0000 | vdisk offset: 0x70000      |             l2 table[   7], data offset: 0x0 | vdisk offset: 0x70000
        l2 table[   8], data offset: 0xd0000 | vdisk offset: 0x80000      |             l2 table[   8], data offset: 0x0 | vdisk offset: 0x80000
        l2 table[   9], data offset: 0xe0000 | vdisk offset: 0x90000      |             l2 table[   9], data offset: 0x0 | vdisk offset: 0x90000
        l2 table[  10], data offset: 0xf0000 | vdisk offset: 0xa0000      |             l2 table[  10], data offset: 0x0 | vdisk offset: 0xa0000
        l2 table[  11], data offset: 0x100000 | vdisk offset: 0xb0000     |             l2 table[  11], data offset: 0x0 | vdisk offset: 0xb0000
    
<snipp>

Refcount Table:                                                                 Refcount Table:
----------------------------------------------------------------------          ----------------------------------------------------------------------
Refcount Table: [offset: 0x10000, len: 8192]                                    Refcount Table: [offset: 0x10000, len: 8192]

refcount table[   0], offset: 0x20000                                           refcount table[   0], offset: 0x20000
        refcount block[    0] cluster[0], refcount: 1 | reference: 1                    refcount block[    0] cluster[0], refcount: 1 | reference: 1
        refcount block[    1] cluster[1], refcount: 1 | reference: 1                    refcount block[    1] cluster[1], refcount: 1 | reference: 1
        refcount block[    2] cluster[2], refcount: 1 | reference: 1                    refcount block[    2] cluster[2], refcount: 1 | reference: 1
        refcount block[    3] cluster[3], refcount: 1 | reference: 1                    refcount block[    3] cluster[3], refcount: 1 | reference: 1
        refcount block[    4] cluster[4], refcount: 1 | reference: 1                    refcount block[    4] cluster[4], refcount: 1 | reference: 1
        refcount block[    5] cluster[5], refcount: 1 | reference: 1      |             refcount block[    5] cluster[5], refcount: 0 | reference: 0
        refcount block[    6] cluster[6], refcount: 1 | reference: 1      |             refcount block[    6] cluster[6], refcount: 0 | reference: 0
        refcount block[    7] cluster[7], refcount: 1 | reference: 1      |             refcount block[    7] cluster[7], refcount: 0 | reference: 0
        refcount block[    8] cluster[8], refcount: 1 | reference: 1      |             refcount block[    8] cluster[8], refcount: 0 | reference: 0
        refcount block[    9] cluster[9], refcount: 1 | reference: 1      |             refcount block[    9] cluster[9], refcount: 0 | reference: 0
        refcount block[   10] cluster[10], refcount: 1 | reference: 1     |             refcount block[   10] cluster[10], refcount: 0 | reference: 0
        refcount block[   11] cluster[11], refcount: 1 | reference: 1     |             refcount block[   11] cluster[11], refcount: 0 | reference: 0
        refcount block[   12] cluster[12], refcount: 1 | reference: 1     |             refcount block[   12] cluster[12], refcount: 0 | reference: 0
        refcount block[   13] cluster[13], refcount: 1 | reference: 1     |             refcount block[   13] cluster[13], refcount: 0 | reference: 0
        refcount block[   14] cluster[14], refcount: 1 | reference: 1     |             refcount block[   14] cluster[14], refcount: 0 | reference: 0
        refcount block[   15] cluster[15], refcount: 1 | reference: 1     |             refcount block[   15] cluster[15], refcount: 0 | reference: 0
        refcount block[   16] cluster[16], refcount: 1 | reference: 1     |             refcount block[   16] cluster[16], refcount: 0 | reference: 0
        refcount block[   17] cluster[17], refcount: 1 | reference: 1     |             refcount block[   17] cluster[17], refcount: 0 | reference: 0
        refcount block[   18] cluster[18], refcount: 1 | reference: 1     |             refcount block[   18] cluster[18], refcount: 0 | reference: 0
    
<snipp>


        Refcount: error: 0, leak: 0, unused: 1019, used: 8197             |             Refcount: error: 0, leak: 0, unused: 9211, used: 5
        --------------------------------------------------------------                  --------------------------------------------------------------

Result:                                                                         Result:
Refcount Table: unaligned: 0, invalid: 0, unused: 8191, used: 1                 Refcount Table: unaligned: 0, invalid: 0, unused: 8191, used: 1
Refcount:       error: 0, leak: 0, unused: 1019, used: 8197               |     Refcount:       error: 0, leak: 0, unused: 9211, used: 5

======================================================================          ======================================================================

COPIED OFLAG:                                                                   COPIED OFLAG:
----------------------------------------------------------------------          ----------------------------------------------------------------------

Result:                                                                         Result:
L1 Table ERROR OFLAG_COPIED: 0                                                  L1 Table ERROR OFLAG_COPIED: 0
L2 Table ERROR OFLAG_COPIED: 0                                                  L2 Table ERROR OFLAG_COPIED: 0

Active Cluster:   8192 [536870912 / 512M / 0G]                            |     Active Cluster:   0 [0 / 0M / 0G]
Active L2 COPIED: 8192 [536870912 / 512M / 0G]                            |     Active L2 COPIED: 0 [0 / 0M / 0G]
All Used Cluster: 8197 [537198592 / 512M / 0G]                            |     All Used Cluster: 5 [327680 / 0M / 0G]

======================================================================          ======================================================================

Summary:                                                                        Summary:
preallocation:  metadata                                                        preallocation:  metadata
Refcount Table: unaligned: 0, invalid: 0, used: 1                               Refcount Table: unaligned: 0, invalid: 0, used: 1
Refcount:       error: 0, leak: 0, unused: 1019, used: 8197               |     Refcount:       error: 0, leak: 0, unused: 9211, used: 5
L1 Table:       unaligned: 0, invalid: 0, unused: 0, used: 1                    L1 Table:       unaligned: 0, invalid: 0, unused: 0, used: 1
L2 Table:       unaligned: 0, invalid: 0, unused: 0, used: 8192           |     L2 Table:       unaligned: 0, invalid: 0, unused: 0, used: 0
                zero_alloc: 0, zero_plain: 0                              |                     zero_alloc: 0, zero_plain: 8192

######################################################################          ######################################################################
######             Haha:  qcow2 image is good!  Y(^_^)Y             ##          ######             Haha:  qcow2 image is good!  Y(^_^)Y             ##
######################################################################          ######################################################################

RolandK · Mar 6, 2023

i tested moving empty virtual disk in proxmox 6.4 between zfs datastores and it's moving nearly instantaneus there.

in proxmox 7.3 it's slow

@sterzy , @Thomas Lamprecht @fabian @fiona not sure if this is worth a look for you. can open a bugticket so this won't get lost if you like.

there may be 2 possible issues described in this:

- pathological slowness on qcow2 image file read , because of bulk atime updates on the source disk
- loss of pre-allocated metadata on disk move

this could have quite some performance impact for some people and may explain this or that slowness in the past...

fiona · Mar 6, 2023

Hi,

RolandK said:
i tested moving empty virtual disk in proxmox 6.4 between zfs datastores and it's moving nearly instantaneus there.

in proxmox 7.3 it's slow

@sterzy , @Thomas Lamprecht @fabian @fiona not sure if this is worth a look for you. can open a bugticket so this won't get lost if you like.

there may be 2 possible issues described in this:

- pathological slowness on qcow2 image file read , because of bulk atime updates on the source disk
- loss of pre-allocated metadata on disk move

this could have quite some performance impact for some people and may explain this or that slowness in the past...

yes, if it's a regression, please open a bug for it. Did you already test with older kernel and QEMU versions on Proxmox VE 7? Knowing which component/version introduced the issue would be very helpful. But IMHO, using qcow2 on top of ZFS is a bad idea to begin with

RolandK said:
maybe "--target-is-zero" is missing for the conversion/disk-move ? ( https://bugzilla.redhat.com/show_bug.cgi?id=1845518 )
at least, i don't see the advantage of virtual disk being createed with metadata being allocated , when it's removed again on disk move.

Well, if it's because of that, it's unfortunate. We'd need to carefully evaluate when we can guarantee the image to be zero if we'd like to use that option.

RolandK said:
apparently, there is something about my observation....

https://forum.proxmox.com/threads/l...seems-to-be-rewriting-the-source-disk.121798/

Hmm, but it's very strange that it's happening during live migration too. That uses drive-mirror with NBD exports and the resolution of the bug report makes it sound like it shouldn't happen for NBD.

RolandK · Mar 6, 2023

Hmm, but it's very strange that it's happening during live migration too. That

yes, but it does

the problem goes away for both migration scenarios (offline + online) when setting relatime=on for the source datapool

RolandK · Mar 6, 2023

yes, if it's a regression, please open a bug for it.

it is

Did you already test with older kernel and QEMU versions on Proxmox VE 7?
Knowing which component/version introduced the issue would be very helpful.

i dug further and apparently it's a kernel issue.

creating a fresh 100gb qcow2 disk on hddpool , reboot afterwards on each create (for fresh zpool statistics) and moving the file to another device give this result.

with kernel 5.15.39-1-pve it's much much faster then with 5.15.39-2-pve.

maybe change from zfs 2.1.4 to 2.1.5 ? @Stoiko Ivanov ?

you can see there are lots of additional writes issued for the same disk move operation, which are not there with kernel 5.15.39-1-pve

Code:

5.15.39-1-pve with zfs 2.1.4:
time qm disk move 100 scsi0  nvmepool
real    0m19.974s

zpool iostat -r hddpool:
 
hddpool       sync_read    sync_write    async_read    async_write      scrub         trim
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512            35      0      0      0      1      0    316      0     10      0      0      0
1K            307      0    144      0     54      0    329      5     41      0      0      0
2K             33      0      0      0      0      0    252     56      2     12      0      0
4K              9      0      0      0      0      2    335     77      0     12      0      0
8K            213      0     12      0      2      5      0    153     12      7      0      0
16K             0      0      0      0      0      5      0     71      0     20      0      0
32K             0      0      0      0      0      2      0      6      0     15      0      0
64K            16      0     32      0      0      0      0      0      0      2      0      0
128K            6      0      0      0      0      0      0      0      0      5      0      0
256K            0      0      0      0      0      0      0      0      0      3      0      0
512K            0      0      0      0      0      0      0      0      0      3      0      0
1M              0      0      0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0      0      0
----------------------------------------------------------------------------------------------

Code:

5.15.39-2-pve with zfs 2.1.5:
time qm disk move 100 scsi0  nvmepool
real    1m10.214s

zpool iostat -r hddpool:

hddpool       sync_read    sync_write    async_read    async_write      scrub         trim
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512            33      0      0      0      3      0  13.2K      0      7      0      0      0
1K            324      0  5.94K      0     61      0  15.9K      0    171      1      0      0
2K             47      0      0      0      0      5  10.0K  2.62K     11      9      0      0
4K             16      0      0      0      0      3  22.9K  2.94K      8      8      0      0
8K            216      0     12      0      2      4     20  3.73K      1     13      0      0
16K             2      0      0      0      0      7    238  3.04K      0      8      0      0
32K             0      0      0      0      0      0      0    111      0      8      0      0
64K            16      0     32      0      0      0      0     87      0      0      0      0
128K            1      0      0      0      0      0    217      7      0      2      0      0
256K            0      0      0      0      0      0      0     44      0      2      0      0
512K            0      0      0      0      0      0      0      0      0      4      0      0
1M              0      0      0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0      0      0
----------------------------------------------------------------------------------------------

Code:

5.15.39-2-pve with zfs 2.1.5 with atime=on/relatime=on:
time qm disk move 100 scsi0  nvmepool
real    0m3.365s


# zpool iostat -r hddpool

hddpool       sync_read    sync_write    async_read    async_write      scrub         trim
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512            38      0      0      0      0      0    169      0      5      0      0      0
1K             77      0     80      0     64      0    198      7    113      3      0      0
2K             17      0      0      0      0      3    133     13      4     20      0      0
4K             11      0      0      0      0      7    171     46      4      5      0      0
8K            216      0     12      0      2      5      0     52      5      6      0      0
16K             1      0      0      0      0      5      0     33      0      5      0      0
32K             1      0      0      0      0      1      0      0      0     11      0      0
64K            16      0     32      0      0      0      0      0      0      4      0      0
128K            1      0      0      0      0      0      0      0      0      3      0      0
256K            0      0      0      0      0      0      0      0      0      4      0      0
512K            0      0      0      0      0      0      0      0      0      4      0      0
1M              0      0      0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0      0      0
----------------------------------------------------------------------------------------------

But IMHO, using qcow2 on top of ZFS is a bad idea to begin with

we run qcow2 on top of zfs local storage with sanoid for replication for quite a while and are happy with that.
we prefer managing files instead of virtual devices, that's easier to handle.
performance is good for us.

RolandK · Mar 6, 2023

yes, if it's a regression, please open a bug for it.

done

https://bugzilla.proxmox.com/show_bug.cgi?id=4574

RolandK · Mar 7, 2023

i've investigated further and i downgraded zfs in 5.15.85-1-pve to plain vanilla zfs 2.1.4 , and the problem is gone there.

so, the problem must have been introduced by zfs 2.1.5

swo · Mar 7, 2023

This seems to be a change in 2.1.5 that could/would affect sparse image creation

https://github.com/openzfs/zfs/pull/13338

RolandK · Mar 7, 2023

yes, i have seen that, but how could this explain change in behaviour on READ / atime updates of the file ?

i have also opened an upstream bugreport at https://github.com/openzfs/zfs/issues/14594

Stoiko Ivanov · Mar 7, 2023

Since it seems the discussion here is more active than in the bugreport in bugzilla - I'll respond here ...

RolandK said:
i've investigated further and i downgraded zfs in 5.15.85-1-pve to plain vanilla zfs 2.1.4 , and the problem is gone there.

so, the problem must have been introduced by zfs 2.1.5

How exactly did you do that?
* install debian's dkms modules on top of the PVE kernel?
* only downgrade the userspace part (plain-debian, or our packages)?
* recompile the kernel with our ZFS submodule at zfs 2.1.4

Does the issue also persist with pve-kernel-6.1 or pve-kernel-5.19 (the latter does not get any updates anymore - but it still might help to narrow this down)

RolandK · Mar 7, 2023

How exactly did you do that?
* install debian's dkms modules on top of the PVE kernel?

i compiled from source and did "make install" so the modules were installed in the kernel module "extra" dir. then cleaned tools from /usr/local afterwards and re-installed zfs-initramfs package because setting up initramfs was broken after that (the hook script was also overwritten by make install).
so i have zfs 2.1.4 with 2.1.5 userspace tools. that shouldn't matter. it's a testing system, so no problem if it gets broken...

Does the issue also persist with pve-kernel-6.1 or pve-kernel-5.19 (the latter does not get any updates anymore - but it still might help to narrow this down)

yes, the problem persists with kernel 6.1.14-1 , including zfs-2.1.9 - and it seems it's gotten worse there:

real 1m27.114s ( 3.18s with relatime=on)

RolandK · Mar 14, 2023

should be fixed :

https://github.com/behlendorf/zfs/commit/d24dc0e833c8592c2f4b2dbae5e03e1052a5ce66

RolandK · Apr 16, 2023

that metioned patch seems to have problems (is in 2.1.10, so be careful !!!) @Stoiko Ivanov @fabian @fiona

https://github.com/openzfs/zfs/issues/14753

https://github.com/openzfs/zfs/releases/tag/zfs-2.1.10

ZFS_IOC_COUNT_FILLED does unnecessary txg_wait_synced() #13368

severe performance regression on virtual disk migration for qcow2 on ZFS with 5.15.39-2-pve

Famous Member

Famous Member

Famous Member

Famous Member

Famous Member

Famous Member

Famous Member

Famous Member

Famous Member

​

Proxmox Staff Member

​

Famous Member

Famous Member

Famous Member

Famous Member

Member

Famous Member

Proxmox Staff Member

Famous Member

Famous Member

Famous Member

We value your privacy