Clone VM on ZFS Raid10 very slow

DominikD · Jan 29, 2020

We are currently running tests to evaluate Proxmox as a Xenserver alternative.

I have a Dell R7515 (AMD 7302P based) with 14 x Intel NVME P4510 2Tb in a ZFS Raid10 (OS is running on another device, so the NVMe Disks are pure VM storage).

I now have a VM with 750 Gb HDD on that ZFS Raid10 and i'm trying to clone it while the VM is offline. That is now running for about 10 minutes and is only 17% done. The speed is horribly slow. Also the process is using HUGE amounts of CPU power. Please see the attached top screenshot.

The proxmox host is currently running absolutely nothing except this VM clone and has a load average of 40.

Anybody has an idea if this is normal?

tom · Jan 29, 2020

How are the disks connected, which controller?

DominikD · Jan 29, 2020

They are NVMe disks, so they are all directly connected to the PCIe bus.

tom · Jan 29, 2020

Please post the full hardware specs.

DominikD · Jan 29, 2020

No problem:
- Dell Poweredge R7515
- 1 x AMD Epyc 7302 (16 x 3.0GHz)
- 16 x 32 Gb Samsung DDR4 PC2933 Registered ECC Memory
- 2 x 240 Gb SATA SSD (Micron 5100 Enterprise m.2 SSDs) with Dell BOSS Card (Marvel AHCI Raid1) for Proxmox OS
- 14 x Intel P4510 2Tb NVMe SSDs directly connected to PCIe bus (no PCIe switch) as Proxmox VM Storage ZFS Raid10
- 2 x Broadcom 10Gbase-T NIC Ports

Anything more? Tom, we already have a license with you for a small Proxmox host, if you would allow it i can also create a ticket through this different host. If we can solve the problems with this machine we will buy another license for this system and the systems that will follow.

DominikD · Jan 30, 2020

What i also see is that the performance when benchmarking the disks directly on the host is about 3-3.5gb / sec. When doing some rough benchmarks inside of a VM i only get about 450-500mb/sec. But while trying some failover and disk replacement tests the resilvering of ZFS was running with 12-15Gb/sec and was very very fast. As 14 x NVMe should be very fast, why do we loose so much performance when using ZFS Raid10 (should be the fastest raid level, right) and even more inside of a VM?

guletz · Jan 30, 2020

Hi,

NVME are fast if you have many process that do i/o on it.

Could you post the output for this commands:

zpool status -v
zpool list -v

and the pmx config of this VM?

Good luck / Bafta.

DominikD · Jan 30, 2020

I know that the strenght with NVMe is with many parallel IOs but there should be at least more than 500mb/sec possible inside a Vm on a huge 14 x P4510 Raid10 or am i wrong? Also it should be a lot faster to clone a 750Gb VM on that array than almost 1 hour runtime and this huge CPU load during clone.

Here is the output:

root@pve:~# zpool status -v
pool: pve1-zfsnvme
state: ONLINE
scan: resilvered 57.1G in 0 days 00:01:28 with 0 errors on Wed Jan 29 19:44:59 2020
config:

NAME STATE READ WRITE CKSUM
pve1-zfsnvme ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
nvme-eui.01000000010000005cd2e496a98d5051 ONLINE 0 0 0
nvme-eui.01000000010000005cd2e453188e5051 ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
nvme-eui.01000000010000005cd2e47eae8d5051 ONLINE 0 0 0
nvme-eui.01000000010000005cd2e4d5ba8d5051 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
nvme-eui.01000000010000005cd2e41b5e594f51 ONLINE 0 0 0
nvme-eui.01000000010000005cd2e4a1a38d5051 ONLINE 0 0 0
mirror-3 ONLINE 0 0 0
nvme-eui.01000000010000005cd2e43ed0564f51 ONLINE 0 0 0
nvme-eui.01000000010000005cd2e48c188e5051 ONLINE 0 0 0
mirror-4 ONLINE 0 0 0
nvme-eui.01000000010000005cd2e48ea38d5051 ONLINE 0 0 0
nvme-eui.01000000010000005cd2e4b3b4594f51 ONLINE 0 0 0
mirror-5 ONLINE 0 0 0
nvme-eui.01000000010000005cd2e4ca5f594f51 ONLINE 0 0 0
nvme10n1 ONLINE 0 0 0
mirror-6 ONLINE 0 0 0
nvme-eui.01000000010000005cd2e453a38d5051 ONLINE 0 0 0
nvme-eui.01000000010000005cd2e485ae8d5051 ONLINE 0 0 0

errors: No known data errors

root@pve:~# zpool list -v
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
pve1-zfsnvme 12.7T 793G 11.9T - - 0% 6% 1.00x ONLINE -
mirror 1.81T 113G 1.70T - - 0% 6.10% - ONLINE
nvme-eui.01000000010000005cd2e496a98d5051 - - - - - - - - ONLINE
nvme-eui.01000000010000005cd2e453188e5051 - - - - - - - - ONLINE
mirror 1.81T 113G 1.70T - - 0% 6.10% - ONLINE
nvme-eui.01000000010000005cd2e47eae8d5051 - - - - - - - - ONLINE
nvme-eui.01000000010000005cd2e4d5ba8d5051 - - - - - - - - ONLINE
mirror 1.81T 113G 1.70T - - 0% 6.10% - ONLINE
nvme-eui.01000000010000005cd2e41b5e594f51 - - - - - - - - ONLINE
nvme-eui.01000000010000005cd2e4a1a38d5051 - - - - - - - - ONLINE
mirror 1.81T 113G 1.70T - - 0% 6.10% - ONLINE
nvme-eui.01000000010000005cd2e43ed0564f51 - - - - - - - - ONLINE
nvme-eui.01000000010000005cd2e48c188e5051 - - - - - - - - ONLINE
mirror 1.81T 113G 1.70T - - 0% 6.09% - ONLINE
nvme-eui.01000000010000005cd2e48ea38d5051 - - - - - - - - ONLINE
nvme-eui.01000000010000005cd2e4b3b4594f51 - - - - - - - - ONLINE
mirror 1.81T 113G 1.70T - - 0% 6.10% - ONLINE
nvme-eui.01000000010000005cd2e4ca5f594f51 - - - - - - - - ONLINE
nvme10n1 - - - - - - - - ONLINE
mirror 1.81T 113G 1.70T - - 0% 6.10% - ONLINE
nvme-eui.01000000010000005cd2e453a38d5051 - - - - - - - - ONLINE
nvme-eui.01000000010000005cd2e485ae8d5051 - - - - - - - - ONLINE

root@pve:~# cat /etc/pve/qemu-server/100.conf
agent: 1
bootdisk: scsi0
cores: 8
cpu: host
ide2: none,media=cdrom
memory: 8192
name: testvm-deb10
net0: virtio=BE

A:5A:3F:B8:9E,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: pve1-zfsnvme:vm-100-disk-0,discard=on,size=750G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=e9fd1864-39d5-4c08-8aa7-dbce84e2748c
sockets: 1
vmgenid: 51c54816-0985-495a-baf5-444b5ffb2a78

guletz · Jan 31, 2020

Hi @DominikD

For me is absolut normal what you see it(long time clone). zfs will need to copy each block of your VM(default size = 512 ). I guess that you had use default volblocksize=8k. So each 8k block will be need to be split on 6 vdev => 8/6 = 1.33, so it will be 2 x 8k blocks/vdev. In the end you will need to write a double size data for your cloned VM at least(+2 metadata/vdev). If you have lucky your NVME will use internal 16 K, if not ....you are in trouble !

What I say is only a aproximation of the process, because could be other factors who can make a influence(like compression).

Good luck / Bafta.

wolfgang · Jan 31, 2020

Hi,

how do you make the benchmarks inside the VM?
The point is KVM is limited on IOPS if this benchmark is a 4k benchmark these results are very impressive.

About the NVME topology, can you send the output of this command?

Code:

lspci -tv

DominikD · Feb 7, 2020

I have testet with various easy benchmarks, there is no claim to be perfect. It was meant to have a short performance overview. We now have tested to recreate the ZFS pool as a Raid60 (2 x VDEVs Raidz2 with 7 drives each) and i know that performance is worse in Raidz2, but it is reall really weak,

Do you have any benchmark tools / commands which you recommend to have a better overview?

The lspci output is attached.

DominikD · Feb 7, 2020

I have now cloned a basic Debian 10 VM and made 8 Test VMs. Running a simple "dbench -s 10" on all VMs in parallel shows that each VM does not get more than 140 MB / Sec.

While this simple tests runs kvm processes on the host use up a high amount of CPU, maybe this is normal. But there are lots of "z_wr_iss" processes runnign eating almost every cpu cycle available and driving the average load on the system to about 50. Is this normal if only 8 VMs are making IO and no other cpu intensive tasks?

I would be more than happy to give ssh access to the system. I can also buy a license for this system if that would speed up diagnosing the issue.

DominikD · Feb 7, 2020

During the testing after some time he startet spitting the following errors on the Proxmox host:

[ 5889.944895] Uhhuh. NMI received for unknown reason 2d on CPU 16.
[ 5889.944896] Do you have a strange power saving mode enabled?
[ 5889.944897] Dazed and confused, but trying to continue
[ 5890.105937] Uhhuh. NMI received for unknown reason 2d on CPU 16.
[ 5890.105938] Do you have a strange power saving mode enabled?
[ 5890.105939] Dazed and confused, but trying to continue
[ 5890.836785] Uhhuh. NMI received for unknown reason 2d on CPU 16.
[ 5890.836786] Do you have a strange power saving mode enabled?
[ 5890.836787] Dazed and confused, but trying to continue
[ 5891.807908] Uhhuh. NMI received for unknown reason 2d on CPU 22.
[ 5891.807909] Do you have a strange power saving mode enabled?
[ 5891.807910] Dazed and confused, but trying to continue
[ 5894.268946] Uhhuh. NMI received for unknown reason 2d on CPU 28.
[ 5894.268948] Do you have a strange power saving mode enabled?
[ 5894.268948] Dazed and confused, but trying to continue
[ 5897.400678] Uhhuh. NMI received for unknown reason 2d on CPU 25.
[ 5897.400680] Do you have a strange power saving mode enabled?
[ 5897.400680] Dazed and confused, but trying to continue
[ 5897.531486] Uhhuh. NMI received for unknown reason 2d on CPU 22.
[ 5897.531487] Do you have a strange power saving mode enabled?
[ 5897.531488] Dazed and confused, but trying to continue
[ 5898.987464] Uhhuh. NMI received for unknown reason 2d on CPU 28.
[ 5898.987465] Do you have a strange power saving mode enabled?
[ 5898.987466] Dazed and confused, but trying to continue
[ 5901.600507] Uhhuh. NMI received for unknown reason 2d on CPU 19.
[ 5901.600508] Do you have a strange power saving mode enabled?
[ 5901.600508] Dazed and confused, but trying to continue
[ 5906.766109] Uhhuh. NMI received for unknown reason 2d on CPU 22.
[ 5906.766110] Do you have a strange power saving mode enabled?
[ 5906.766111] Dazed and confused, but trying to continue
[ 5908.673252] Uhhuh. NMI received for unknown reason 2d on CPU 19.
[ 5908.673260] Do you have a strange power saving mode enabled?
[ 5908.673261] Dazed and confused, but trying to continue

guletz · Feb 7, 2020

DominikD said:
While this simple tests runs kvm processes on the host use up a high amount of CPU, maybe this is normal. But there are lots of "z_wr_iss" processes runnign eating almost every cpu cycle available and driving the average load on the system to about 50

See here why is happend:

https://github.com/zfsonlinux/zfs/issues/1637

DominikD · Feb 7, 2020

The linked article says that CPU usage is coming from ZFS compression. I have already disabled compression, just to be sure that CPU usage by compressing is no problem here.

Compression is off on the pool, and compression is off on each vm disk:

root@pve:~# zfs get all pve1-nvme | grep compression
pve1-nvme compression off local

root@pve:~# zfs get all pve1-nvme/vm-100-disk-0 | grep compression
pve1-nvme/vm-100-disk-0 compression off inherited from pve1-nvme
(same on each other virtual disk)

wolfgang · Feb 7, 2020

We have to isolate this problem and not mix everything.
So, please let's start with the NVMe system first.

It is possible to dump all data of the ZFS pool?

I would like to benchmark all 14 NVMe in parallel and compare it with a single NVMe.

You got 26 devices that authenticate as PCI bridge.
Dell does not explain in there manuals what this is.
I guess it is a 26 port PCIe switch for the storage backplane. [1]
There is a second device that I can't identify. [2]

The syslog output what you wrote indicates that de device must reset because nothing happens.
I guess an interrupt gets lost.

This is all new staff so the analysis will take the is no quick fix for it.
But we will find it.

1.) https://www.broadcom.com/products/pcie-switches-bridges/expressfabric#tab-PCIe2
2.) https://www.plda.com/applications/enterprise-storage

DominikD · Feb 7, 2020

Dear Wolfgang thanks for your help, i just wrote you a direct message. I also bought a license for this host, so we can start a more depth analysis and i can give you ssh access to the machine.

DominikD · Feb 7, 2020

Made some tests on the same machine with 4 x Samsung PM1643 SAS SSDs on a Perc H730p controller in a Raid10.

The performance in my 8 benchmark VMs on the same dbench based test is more than double what i get from the 14 x NVMe ZFS raid.

ThinkAgain · Apr 4, 2020

Currently looking to setup Proxmox on a similar system, and am wondering if you‘ve made any progress?

I also found these, by the way:

https://github.com/openzfs/zfs/issues/8381

https://github.com/openzfs/zfs/pull/10018

mmenaz · Apr 5, 2020

DominikD said:
They are NVMe disks, so they are all directly connected to the PCIe bus.

Just for curiosity and forgive my ignorance, what do you mean with "directly connected"? What cables / slots / whatever are you using to put 14 of those drivers in Dell R7515? Does Dell R7515 have 14 U.2 connectors?

Clone VM on ZFS Raid10 very slow

Member

Attachments

Proxmox Staff Member

Member

Proxmox Staff Member

Member

Member

Famous Member

Member

Famous Member

Proxmox Retired Staff

Member

Attachments

Member

Member

Famous Member

Member

Proxmox Retired Staff

Member

Member

Active Member

Renowned Member