High I/O delay and non-responsive VMs while migrating a VM disk to a zvol

Jul 11, 2019
I am migrating VMDK disks to zvols, and while running something like

Code:
qemu-img convert -f vmdk sv0044_2.vmdk -O raw /dev/zvol/rpool/vm-1044-disk-1

I get quite a high I/O delay on the Proxmox node, and even VMs that are already running become non-responsive, logging

Code:
[ 3019.135857] sd 2:0:0:1: [sdb] abort
[ 3019.135878] sd 2:0:0:1: [sdb] abort
[ 3019.135894] sd 2:0:0:1: [sdb] abort
[ 3019.135911] sd 2:0:0:1: [sdb] abort
[ 3080.804019] sd 2:0:0:1: [sdb] abort

inside the VM. The disk above is about 64 GiB in size and the conversion takes a good amount of time (but that's not my concern).

I'm a ZFS newbie, but I have the impression that all or most other I/O is blocked during the conversion above. At least the roughly five VMs I have running do not do any I/O during this time.
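For what it's worth, this is roughly how one can watch the pool and the individual disks while the conversion runs (standard zpool/sysstat tooling, nothing unusual):

Code:
# per-vdev throughput and ops of the pool, refreshed every 2 seconds
zpool iostat -v rpool 2

# per-device utilisation and wait times (from the sysstat package)
iostat -x 2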

What might be happening here?
 
Code:
root@vmhost02:~# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
rpool  2.17T   747G  1.44T        -         -     5%    33%  1.07x    ONLINE  -

Code:
root@vmhost02:~# zpool status -v
  pool: rpool
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 0 days 01:42:12 with 0 errors on Sun Jul 14 02:06:13 2019
config:

    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sdc     ONLINE       0     0     0
        sdd     ONLINE       0     0     0
    logs  
      sdb       ONLINE       0     0     0
    cache
      sda       ONLINE       0     0     0

errors: No known data errors

Additionally, in case it is of interest:

Code:
root@vmhost02:~# zfs get dedup rpool
NAME   PROPERTY  VALUE          SOURCE
rpool  dedup     off            local

Code:
root@vmhost02:~# zfs get compression rpool
NAME   PROPERTY     VALUE     SOURCE
rpool  compression  lz4       local
 
You enabled dedup and deactivated it. That could be your main performance killer in a simple mirror setup. Recreate your pool without dedup and you should see a performance gain. There is no other way to get rid of the dedup table in your I/O path.

Another point is that you only have a single mirror vdev. Even with the L2ARC and SLOG devices you have added, you cannot expect miracles from this hardware. What hardware is it, by the way?
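A quick way to check whether a dedup table (DDT) is still populated, before going through a full pool rebuild, is something along these lines (output format differs between ZFS versions):

Code:
# -D prints DDT statistics; a non-empty histogram means deduplicated
# blocks are still referenced on the pool
zpool status -D rpool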
 
It is a

Code:
Dell r740, 1x 12/24x Xeon Gold 6146 @3.2GHz
PERC H740P Mini (HBA Mode)
128GB RAM
sda, sdb: SAS 3, TOSHIBA, AL15SEB24EQY, SSD, 960GB
sdc, sdd: SAS 3, SAMSUNG, MZILS960HEHP0D3, 2.4TB, 10k
 
Performance is good otherwise... It only surprises me that such a copy operation seems to block nearly all other I/O.
 
I had enabled dedup for a very short amount of time.
 
I moved all my disks to external storage, rebuilt the pool, and am now moving the disks back... surprise: I get the same behaviour.

If I didn't know better, I'd think I am experiencing the very same thing described here:
https://forum.proxmox.com/threads/k...ks-during-backup-restore-migrate.34362/page-2

during high I/O, running VMs get "hung".
 
please add '-t none' to your qemu-img command
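For reference, the converted command would look roughly like this (same file names as in the first post):

Code:
# -t none opens the destination with O_DIRECT, so the copy does not
# flood the host page cache
qemu-img convert -f vmdk -O raw -t none sv0044_2.vmdk /dev/zvol/rpool/vm-1044-disk-1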
 
Thank you. While I understand that this option might help elsewhere, I want to point out that the behaviour above also occurs when I move disks with the built-in GUI command "Move disk". So using "-t none" may have an effect on some operations, but it does not help in the situation described above, which is: high I/O load seems to freeze VMs.
 
Yes, high I/O load can affect other VMs using the same storage. If you use online move disk, you can set a bwlimit to mitigate this.
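As a sketch, assuming a PVE version where move disk accepts a bandwidth limit, that could look roughly like this (VM ID, disk name, target storage and the limit value are only examples):

Code:
# limit the online move of scsi0 to ~50 MiB/s (value is in KiB/s)
qm move_disk 1044 scsi0 local-zfs --bwlimit 51200

# or set a default for all move operations in /etc/pve/datacenter.cfg:
# bwlimit: move=51200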
 
The VMs were completely blocked. Even the Proxmox UI stalled for a long time. I/O delay was at about 30-40%.

We are collecting more factual information about this.
 
Maybe off-topic, but I recently faced the same problem.

In my case, however, all storages are LVM-Thin. After many attempts and reviewing the logs (there is nothing interesting in them), I found that the problem only occurs if the "discard" option is enabled on the VM's disk. Simply disabling discard on the VM that needs to be migrated makes everything work perfectly.
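In case someone wants to reproduce this: the discard flag is part of the disk line in the VM config and can be toggled with qm set. The VM ID and volume name below are only examples:

Code:
# show the current disk options of the VM
qm config 1044 | grep scsi0

# re-issue the disk line with discard off (the default is 'ignore');
# copy any other options from the qm config output first
qm set 1044 --scsi0 local-lvm:vm-1044-disk-0,discard=ignore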
 
Thanks a lot for sharing your experience. We are still having the same issues, but we disable features like replication while doing backups. I'll have a look at how the disks are configured and see whether we can reproduce it with the help of your post.
 
My hardware config is very simple: one NVMe drive with Proxmox installed and three SSDs on SATA ports. I didn't try it with replication enabled.
The very interesting thing is that I couldn't see what causes this, neither with nmap nor in the logs.
 
Interesting. I have to say, in the many years I have been using Linux (~25), I'm not used to seeing one workload lock out everything else like this. At worst things slow down, but hard stalls or timeouts of this size are surprising to me.
 
