High I/O delay and non-responsive VMs while migrating VM disk to zvol

Jul 11, 2019
I am migrating VMDK disks to zvols, and while I am running something like

Code:
qemu-img convert -f vmdk sv0044_2.vmdk -O raw /dev/zvol/rpool/vm-1044-disk-1

I get quite high I/O delay on the Proxmox node, and even VMs that are already running become non-responsive and log

Code:
[ 3019.135857] sd 2:0:0:1: [sdb] abort
[ 3019.135878] sd 2:0:0:1: [sdb] abort
[ 3019.135894] sd 2:0:0:1: [sdb] abort
[ 3019.135911] sd 2:0:0:1: [sdb] abort
[ 3080.804019] sd 2:0:0:1: [sdb] abort

inside the VM. The disk above is about 64 GiB in size and the conversion takes a good amount of time (but that's not my concern).

While I'm a ZFS newbie, I have the impression that most or all other I/O is blocked during the conversion above. At least, the roughly 5 VMs I have running do not do any I/O during this time.
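
Next time I run such a conversion I plan to watch the pool while it happens, to see whether the mirror devices are actually saturated (just an idea on my side, I have no numbers for this yet):

Code:
# print per-vdev read/write throughput every 5 seconds while the conversion runs
zpool iostat -v rpool 5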

What might be happening here?
 
Code:
root@vmhost02:~# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
rpool  2.17T   747G  1.44T        -         -     5%    33%  1.07x    ONLINE  -

Code:
root@vmhost02:~# zpool status -v
  pool: rpool
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 0 days 01:42:12 with 0 errors on Sun Jul 14 02:06:13 2019
config:

    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sdc     ONLINE       0     0     0
        sdd     ONLINE       0     0     0
    logs  
      sdb       ONLINE       0     0     0
    cache
      sda       ONLINE       0     0     0

errors: No known data errors

Additionally, if that is of interest:

Code:
root@vmhost02:~# zfs get dedup rpool
NAME   PROPERTY  VALUE          SOURCE
rpool  dedup     off            local

Code:
root@vmhost02:~# zfs get compression rpool
NAME   PROPERTY     VALUE     SOURCE
rpool  compression  lz4       local
 
You enabled dedup and deactivated it. That could be your main performance killer in a simple mirror setup. Recreate your pool without dedup and you should see a performance gain. There is no other way to get rid of the dedup table in your I/O path.

Another point is that you only have a single mirror vdev. Even with the L2ARC and SLOG devices you added, you cannot expect miracles from this hardware. What hardware is it, by the way?
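
If you want to verify whether a dedup table is still present on the pool before you recreate it, the status output can include DDT statistics (rpool as in your listing above):

Code:
# -D additionally prints deduplication table (DDT) statistics
zpool status -D rpool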
 
It is a

Code:
Dell r740, 1x 12/24x Xeon Gold 6146 @3.2GHz
PERC H740P Mini (HBA Mode)
128GB RAM
sda, sdb: SAS 3, TOSHIBA, AL15SEB24EQY, SSD, 960GB
sdc, sdd: SAS 3, SAMSUNG, MZILS960HEHP0D3, 2.4TB, 10k
 
You enabled dedup and deactivated it. That could be your main performance killer in a simple mirror setup. Recreate your pool without dedup and you should see a performance gain. There is no other way to get rid of the dedup table in your I/O path.

Another point is that you only have a single mirror vdev. Even with the L2ARC and SLOG devices you added, you cannot expect miracles from this hardware. What hardware is it, by the way?

Performance is good otherwise... It only surprises me that such a copy operation seems to block nearly all other I/O.
 
You enabled dedup and deactivated it. That could be your main performance killer in a simple mirror setup. Recreate your pool without dedup and you should see a performance gain. There is no other way to get rid of the dedup table in your I/O path.

Another point is that you only have a single mirror vdev. Even with the L2ARC and SLOG devices you added, you cannot expect miracles from this hardware. What hardware is it, by the way?

I had enabled dedup for a very short amount of time.
 
You enabled dedup and deactivated it. That could be your main performance killer in a simple mirror setup. Recreate your pool without dedup and you should see a performance gain. There is no other way to get rid of the dedup table in your I/O path.

Another point is that you only have a single mirror vdev. Even with the L2ARC and SLOG devices you added, you cannot expect miracles from this hardware. What hardware is it, by the way?

I moved all my disks to external storage, rebuilt the pool, and am now moving those disks back... surprise: I get the same behaviour.

If I didn't know better, I'd think I'm experiencing the very same thing that is described here:
https://forum.proxmox.com/threads/k...ks-during-backup-restore-migrate.34362/page-2

during high I/O, running VMs get "hung".
 
please add '-t none' to your qemu-img command
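
For example, with the same source and target paths as in your first post:

Code:
# -t none bypasses the host page cache for the target, so data is written directly to the zvol
qemu-img convert -f vmdk -O raw -t none sv0044_2.vmdk /dev/zvol/rpool/vm-1044-disk-1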
 
please add '-t none' to your qemu-img command

Thank you. While I understand that this option might help elsewhere, I also want to point out that the behaviour above also occurs when I move disks with the built-in GUI command "Move disk". So using "-t…" might have an effect on some operations, but it does not help in the situation described above, which is: high I/O load seems to freeze VMs.
 
yes, high I/O load can affect other VMs using the same storage. if you use online move disk, you can set a bwlimit to mitigate this.
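
A minimal sketch of both ways to set such a limit (VMID, disk slot, and storage name below are placeholders, the value is in KiB/s; adjust to your setup and PVE version):

Code:
# per-operation limit when moving a disk from the CLI (~50 MiB/s here)
qm move_disk 1044 scsi0 local-zfs --bwlimit 51200

# or a cluster-wide default for all 'move disk' operations, in /etc/pve/datacenter.cfg
bwlimit: move=51200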
 
yes, high I/O load can affect other VMs using the same storage. if you use online move disk, you can set a bwlimit to mitigate this.

The VMs were completely blocked. Even the Proxmox UI stalled for a long time. I/O delay was at about 30-40%.

We will collect more factual information about that.
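
In concrete terms, the numbers I would try to capture the next time it happens (assuming the sysstat package is installed; nothing collected yet):

Code:
# extended per-device statistics (utilization, wait times) every 5 seconds
iostat -x 5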
 
Maybe off topic, but I recently faced the same problem.

In my case all storages are LVM-Thin. After many tries and reviewing the logs (there is nothing interesting in them), I found that the problem occurs when the "discard" option is enabled on the VM disk. Simply disabling discard on the VM that needs to be migrated makes everything work perfectly.
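
For reference, one way to toggle that from the CLI would be something like this (VMID, disk slot, and storage/volume name are placeholders; the checkbox on the disk in the GUI does the same, and any other drive options you use have to be repeated in the spec):

Code:
# re-define the drive with discard disabled ('ignore' is the off value for this option)
qm set 100 --scsi0 local-lvm:vm-100-disk-0,discard=ignore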
 
Thanks a lot for sharing your experience. We are still having the same issues, but we disable features like replication while doing backups. I'll have a look at how the disks are configured and whether we can reproduce it with the help of your post here.
 
My hardware config is very simple: 1 NVMe drive with Proxmox installed and 3 SSDs on SATA ports. I didn't try it with replication enabled.
But the very interesting thing is that I couldn't see what causes this, neither with nmap nor in the logs.
 
Interesting. Yeah, I have to say, in the many years I've used Linux (~25) I'm not used to seeing behaviour where one workload locks out everything else like this. At worst things slow down, but lock-like stalls or timeouts of this size surprise me.
 
