Terribly slow full cloning VM on Proxmox VE 5.1 with ZFS ZVOL

chrone

Using the latest Proxmox VE 5.1.x with a ZFS RAID1 ZVOL on an 8-core CPU with 16GB RAM, full cloning an offline VM is very slow, even though CPU %wa and load average stay below 11% and 8 respectively.

Is this because the new Meltdown and Spectre kernel patches are not compatible with ZFS and the qemu image convert?

I also experienced the Proxmox GUI freezing until a VM restore from a vzdump backup completed, on a more powerful server with no VMs running and a ZFS RAID1 ZVOL on Intel DC S3500 series SSDs.


Code:
proxmox-ve: 5.1-36 (running kernel: 4.13.13-4-pve)
pve-manager: 5.1-43 (running version: 5.1-43/bdb08029)
pve-kernel-4.4.98-3-pve: 4.4.98-103
pve-kernel-4.13.13-4-pve: 4.13.13-35
pve-kernel-4.4.19-1-pve: 4.4.19-66
pve-kernel-4.13.13-5-pve: 4.13.13-36
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-20
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-16
pve-qemu-kvm: 2.9.1-6
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
 
A clone with qemu-img reads from the disks, writes the buffer in RAM and stores it back on the disk. ZFS then adds its ARC on top of that. So you might just be running out of memory here. You could try to make a clone with ZFS and copy the vmid.conf (with adaptations).

Also, to test the difference KPTI makes, you can set nopti on the kernel command line to disable it.
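
If you want to try the ZFS-level clone, a rough sketch could look like the following (assuming the default rpool/data layout; VMIDs 100 and 101 are only placeholders):

Code:
# Full, independent copy of the source zvol via snapshot + send/receive
zfs snapshot rpool/data/vm-100-disk-1@clone
zfs send rpool/data/vm-100-disk-1@clone | zfs receive rpool/data/vm-101-disk-1

# Copy the VM configuration and adapt it (new VMID, disk names, MAC addresses, ...)
cp /etc/pve/qemu-server/100.conf /etc/pve/qemu-server/101.conf

# To test without KPTI: append nopti to GRUB_CMDLINE_LINUX_DEFAULT in
# /etc/default/grub, then regenerate the config and reboot
update-grub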
 

Hi Alwin,

I guess you're right. It seems the qemu image convert used lots of buffered pages, and at some point it also used up to 2GB of swap until the convert completed.

I didn't notice this slowness with ZFS 0.6.x on Proxmox VE 4.4 last year, before the Meltdown and Spectre fiasco. But I do notice that ZFS 0.7.x introduced high CPU load and high %wa when restoring a VM and when migrating a VM (online with local storage, or offline).
 
There are some other threads in the forum where the OPs report similar sightings, but these have to be fixed upstream first. :(
 

Hi Alwin, is there a bug report I could watch for this issue?

I just hit this kind of slowdown when full cloning a 100GB template on a ZFS ZVOL. "qemu-img convert" used all memory (buffered) and caused an overall system slowdown. My system only has 64GB of RAM, which isn't enough to buffer the 100GB ZVOL read by qemu-img.

Or is there a way to limit the memory usage of "qemu-img convert", so it only buffers up to 1GB of the ZVOL, perhaps?
 
There is a new ZFS package in the pvetest repository, please test this one. There is also a newer kernel, worth a try. A clone is for sure an IO-intensive operation; AFAIK the only way to put some limit on it would be to use either cstream or a cgroup (memory limit).
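
If you want to experiment with either approach, a rough sketch could look like this (the 1GB cap, the ~100 MB/s throughput value and the zvol paths are only placeholders):

Code:
# Cgroup memory limit: run the convert in a transient systemd scope capped at 1GB
systemd-run --scope -p MemoryLimit=1G \
  qemu-img convert -p -n -f raw -O raw \
  /dev/zvol/rpool/data/vm-100-disk-1 /dev/zvol/rpool/data/vm-101-disk-1

# Throughput limit: throttle a raw copy to roughly 100 MB/s with cstream
dd if=/dev/zvol/rpool/data/vm-100-disk-1 bs=1M \
  | cstream -t 100000000 \
  | dd of=/dev/zvol/rpool/data/vm-101-disk-1 bs=1M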
 

I see. I'll try it tomorrow and will also test with another block store (LVM) or file store (qcow2) to see whether qemu-img convert still uses all available memory to buffer the 100GB of storage, and whether the slowdown is caused by IO to swap.
 

I just tested with the latest packages from pvetest on a 32GB RAM host today, and the issue is still there.

Full cloning a 100GB template from ZFS ZVOL to ZFS ZVOL causes "qemu-img convert" to use all available memory as buffered pages; this slowed the whole system down due to memory pressure and a big increase in swap usage. Disabling zfs_compressed_arc_enabled didn't help either.

I performed several full-clone tests, and it seems qemu-img convert uses all available memory as buffered pages if the source is bigger than total memory and is a ZFS ZVOL:
  • Full clone of a 100GB template from qcow2 on ZFS to qcow2 on ZFS uses roughly 1% of the 32GB RAM as buffered pages.
  • Full clone of a 100GB template from qcow2 on ZFS to a ZFS ZVOL uses roughly 20% of the 32GB RAM as buffered pages.
  • Full clone of a 100GB template from a ZFS ZVOL to qcow2 on ZFS uses all available memory (out of 32GB RAM) as buffered pages.
  • Full clone of a 100GB template from a ZFS ZVOL to a ZFS ZVOL uses all available memory (out of 32GB RAM) as buffered pages.
I think there's an issue with qemu-img convert when a ZFS ZVOL is the source. Is this some kind of memory leak, or just a bug in qemu or ZFS?
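
For anyone wanting to reproduce this, a rough sketch of how I watched the buffered pages, swap and ARC size during a clone (standard Linux/ZFS paths, run as root):

Code:
# Watch buffer/cache and swap usage while the clone runs
watch -n 1 'free -m'

# Current ARC size in bytes
grep "^size" /proc/spl/kstat/zfs/arcstats

# Toggle compressed ARC at runtime (did not help in my tests)
echo 0 > /sys/module/zfs/parameters/zfs_compressed_arc_enabled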
 
As pointed out by @fabian, using qemu-img convert with -t none and -T none (the destination and source drive cache modes) fixed the issue.

Code:
/usr/bin/qemu-img convert -p -n -f raw -O raw -t none -T none /dev/zvol/rpool/data/vm-1081-disk-1 zeroinit:/dev/zvol/rpool/data/vm-1101-disk-2

Will there be an interactive option in the online disk migration UI and the clone UI to enable -t none and -T none when we use a ZFS ZVOL as the source, or will both UIs detect the format automatically?
 

I think we can just make an informed, automatic decision using the defaults of source and target storage types as well as formats.
 

Hi Fabian,

Restoring a 200GB vzdump backup file to a ZVOL also slows down the system (the Proxmox web UI becomes unresponsive). Is this still related to the qemu-img issue for ZVOLs?
 

Attachment: proxmox 5.1 zfs vm restore - web ui not responding.png

It's the same underlying issue (opening the zvol in Qemu with no-flush, which leads to huge amounts of buffered memory on the kernel side, which in turn leads to ARC collapse and possibly swapping). The fix for qemu-img convert (offline move-disk / full clone) is already in git; the one for vma extract / qmrestore is still in the works.
 

Thanks Fabian. Looking forward to it.
 
This issue somehow still exists under 6.1-3. I cloned a 225G raw ZFS block device through the GUI and got IO delay of up to 25%. I tried again with the "qemu-img ... -t none -T none" command from above and IO delay peaked at 5%; arcsz was maxed out at 62G after that, so it looked like the ARC was not touched while cloning via the GUI.
 

Offline in both cases? Online "move disk" is a completely different code path.
 
You can set a bwlimit for online cloning (it's not yet available in the clone dialogue, but you can set it in datacenter.cfg for all clone operations, or manually via the API / qm clone for individual clone operations); that should alleviate the problem.
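
For example, a clone bandwidth limit of roughly 50 MB/s could look like this (values are in KiB/s; the VMIDs are only placeholders):

Code:
# /etc/pve/datacenter.cfg - applies to all clone operations
bwlimit: clone=51200

# Or per operation, e.g. a full clone of VM 100 to new VMID 101
qm clone 100 101 --full --bwlimit 51200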
 
