Clone VM on ZFS Raid10 very slow

DominikD

Member
Jan 29, 2020
We are currently running tests to evaluate Proxmox as a Xenserver alternative.

I have a Dell R7515 (AMD 7302P based) with 14 x Intel P4510 2 TB NVMe SSDs in a ZFS Raid10 (the OS is running on another device, so the NVMe disks are pure VM storage).

I now have a VM with a 750 GB disk on that ZFS Raid10 and I'm trying to clone it while the VM is offline. The clone has now been running for about 10 minutes and is only 17% done. The speed is horribly slow. The process is also using HUGE amounts of CPU power. Please see the attached top screenshot.

The Proxmox host is currently running absolutely nothing except this VM clone and has a load average of 40.

Does anybody know whether this is normal?
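
For a rough sense of scale, the figures above (750 GB disk, 17% done after about 10 minutes) can be turned into a throughput and a projected runtime. This is only a back-of-the-envelope sketch: it assumes the clone progresses linearly and that the percentage refers to the full 750 GB. The function name `clone_estimate` is made up for illustration.

```shell
#!/bin/sh
# Rough clone estimate; assumes linear progress over the whole disk size.
# Arguments: disk size in GB, percent done, elapsed minutes.
clone_estimate() {
    awk -v size_gb="$1" -v pct="$2" -v mins="$3" 'BEGIN {
        copied_gb = size_gb * pct / 100          # data copied so far
        mb_per_s  = copied_gb * 1000 / (mins * 60)
        total_min = mins * 100 / pct             # projected full runtime
        printf "%.1f MB/s, about %.0f minutes in total\n", mb_per_s, total_min
    }'
}

clone_estimate 750 17 10
```

With these inputs the result is a bit over 200 MB/s and close to an hour for the whole clone.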
 

Attachments

  • zfs clone cpu usage.png
How are the disks connected, which controller?
 
Please post the full hardware specs.
 
No problem:
- Dell PowerEdge R7515
- 1 x AMD Epyc 7302 (16 x 3.0GHz)
- 16 x 32 GB Samsung DDR4 PC2933 Registered ECC Memory
- 2 x 240 GB SATA SSD (Micron 5100 Enterprise M.2 SSDs) with Dell BOSS card (Marvell AHCI Raid1) for Proxmox OS
- 14 x Intel P4510 2 TB NVMe SSDs directly connected to PCIe bus (no PCIe switch) as Proxmox VM storage ZFS Raid10
- 2 x Broadcom 10GBase-T NIC ports

Anything more? Tom, we already have a license with you for a small Proxmox host. If you allow it, I can also create a ticket through that other host. If we can solve the problems with this machine, we will buy another license for this system and for the systems that will follow.
 
What I also see is that the performance when benchmarking the disks directly on the host is about 3-3.5 GB/s. When doing some rough benchmarks inside a VM I only get about 450-500 MB/s. But while trying some failover and disk replacement tests, the ZFS resilvering was running at 12-15 GB/s and was very, very fast. As 14 x NVMe should be very fast, why do we lose so much performance when using ZFS Raid10 (which should be the fastest raid level, right?) and even more inside a VM?
 
Hi,

NVMe drives are fast when many processes do I/O on them in parallel.

Could you post the output of these commands:

zpool status -v
zpool list -v

and the Proxmox config of this VM?

Good luck / Bafta.
 
I know that the strength of NVMe lies in many parallel IOs, but more than 500 MB/s should at least be possible inside a VM on a huge 14 x P4510 Raid10, or am I wrong? Cloning a 750 GB VM on that array should also be a lot faster than almost 1 hour of runtime, and without this huge CPU load during the clone.

Here is the output:

root@pve:~# zpool status -v
  pool: pve1-zfsnvme
 state: ONLINE
  scan: resilvered 57.1G in 0 days 00:01:28 with 0 errors on Wed Jan 29 19:44:59 2020
config:

NAME                                           STATE     READ WRITE CKSUM
pve1-zfsnvme                                   ONLINE       0     0     0
  mirror-0                                     ONLINE       0     0     0
    nvme-eui.01000000010000005cd2e496a98d5051  ONLINE       0     0     0
    nvme-eui.01000000010000005cd2e453188e5051  ONLINE       0     0     0
  mirror-1                                     ONLINE       0     0     0
    nvme-eui.01000000010000005cd2e47eae8d5051  ONLINE       0     0     0
    nvme-eui.01000000010000005cd2e4d5ba8d5051  ONLINE       0     0     0
  mirror-2                                     ONLINE       0     0     0
    nvme-eui.01000000010000005cd2e41b5e594f51  ONLINE       0     0     0
    nvme-eui.01000000010000005cd2e4a1a38d5051  ONLINE       0     0     0
  mirror-3                                     ONLINE       0     0     0
    nvme-eui.01000000010000005cd2e43ed0564f51  ONLINE       0     0     0
    nvme-eui.01000000010000005cd2e48c188e5051  ONLINE       0     0     0
  mirror-4                                     ONLINE       0     0     0
    nvme-eui.01000000010000005cd2e48ea38d5051  ONLINE       0     0     0
    nvme-eui.01000000010000005cd2e4b3b4594f51  ONLINE       0     0     0
  mirror-5                                     ONLINE       0     0     0
    nvme-eui.01000000010000005cd2e4ca5f594f51  ONLINE       0     0     0
    nvme10n1                                   ONLINE       0     0     0
  mirror-6                                     ONLINE       0     0     0
    nvme-eui.01000000010000005cd2e453a38d5051  ONLINE       0     0     0
    nvme-eui.01000000010000005cd2e485ae8d5051  ONLINE       0     0     0

errors: No known data errors


root@pve:~# zpool list -v
NAME                                             SIZE ALLOC  FREE CKPOINT EXPANDSZ FRAG   CAP DEDUP  HEALTH  ALTROOT
pve1-zfsnvme                                    12.7T  793G 11.9T       -        -   0%    6% 1.00x  ONLINE  -
  mirror                                        1.81T  113G 1.70T       -        -   0% 6.10%     -  ONLINE
    nvme-eui.01000000010000005cd2e496a98d5051       -     -     -       -        -    -     -     -  ONLINE
    nvme-eui.01000000010000005cd2e453188e5051       -     -     -       -        -    -     -     -  ONLINE
  mirror                                        1.81T  113G 1.70T       -        -   0% 6.10%     -  ONLINE
    nvme-eui.01000000010000005cd2e47eae8d5051       -     -     -       -        -    -     -     -  ONLINE
    nvme-eui.01000000010000005cd2e4d5ba8d5051       -     -     -       -        -    -     -     -  ONLINE
  mirror                                        1.81T  113G 1.70T       -        -   0% 6.10%     -  ONLINE
    nvme-eui.01000000010000005cd2e41b5e594f51       -     -     -       -        -    -     -     -  ONLINE
    nvme-eui.01000000010000005cd2e4a1a38d5051       -     -     -       -        -    -     -     -  ONLINE
  mirror                                        1.81T  113G 1.70T       -        -   0% 6.10%     -  ONLINE
    nvme-eui.01000000010000005cd2e43ed0564f51       -     -     -       -        -    -     -     -  ONLINE
    nvme-eui.01000000010000005cd2e48c188e5051       -     -     -       -        -    -     -     -  ONLINE
  mirror                                        1.81T  113G 1.70T       -        -   0% 6.09%     -  ONLINE
    nvme-eui.01000000010000005cd2e48ea38d5051       -     -     -       -        -    -     -     -  ONLINE
    nvme-eui.01000000010000005cd2e4b3b4594f51       -     -     -       -        -    -     -     -  ONLINE
  mirror                                        1.81T  113G 1.70T       -        -   0% 6.10%     -  ONLINE
    nvme-eui.01000000010000005cd2e4ca5f594f51       -     -     -       -        -    -     -     -  ONLINE
    nvme10n1                                        -     -     -       -        -    -     -     -  ONLINE
  mirror                                        1.81T  113G 1.70T       -        -   0% 6.10%     -  ONLINE
    nvme-eui.01000000010000005cd2e453a38d5051       -     -     -       -        -    -     -     -  ONLINE
    nvme-eui.01000000010000005cd2e485ae8d5051       -     -     -       -        -    -     -     -  ONLINE

root@pve:~# cat /etc/pve/qemu-server/100.conf
agent: 1
bootdisk: scsi0
cores: 8
cpu: host
ide2: none,media=cdrom
memory: 8192
name: testvm-deb10
net0: virtio=BE:DA:5A:3F:B8:9E,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: pve1-zfsnvme:vm-100-disk-0,discard=on,size=750G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=e9fd1864-39d5-4c08-8aa7-dbce84e2748c
sockets: 1
vmgenid: 51c54816-0985-495a-baf5-444b5ffb2a78
 
Hi @DominikD

To me, what you see (a long clone time) is absolutely normal. ZFS needs to copy each block of your VM (default size = 512). I guess you used the default volblocksize=8k. So each 8k block needs to be split across 6 vdevs => 8/6 = 1.33, so it ends up as 2 x 8k blocks per vdev. In the end you need to write at least double the amount of data for your cloned VM (+2 metadata blocks per vdev). If you are lucky, your NVMe drives use 16K internally; if not ... you are in trouble!

What I describe is only an approximation of the process, because other factors (like compression) can also have an influence.
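
To put "copy each block" into numbers, here is a small sketch that counts how many volblocksize-sized blocks a 750G zvol consists of. The 8k value is the guess from above, not something read from the pool (the real value can be checked with `zfs get volblocksize` on the zvol), and `zvol_blocks` is just an illustrative helper name.

```shell
#!/bin/sh
# Number of blocks in a zvol: size in GiB divided by the block size in KiB.
# Assumes the 750G from the VM config is meant in binary units (GiB).
zvol_blocks() {
    size_gib=$1
    volblock_kib=$2
    echo $(( size_gib * 1024 * 1024 / volblock_kib ))
}

zvol_blocks 750 8     # assumed 8k volblocksize
zvol_blocks 750 16    # for comparison, a 16k volblocksize
```

At 8k that is on the order of 98 million blocks for a full copy, and half as many at 16k.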

Good luck / Bafta.
 
Hi,

How do you run the benchmarks inside the VM?
The point is that KVM is limited in IOPS; if this is a 4k benchmark, these results are very impressive.

About the NVMe topology, can you send the output of this command?

Code:
lspci -tv
 
I have tested with various simple benchmarks; they make no claim to be perfect and were only meant to give a quick performance overview. We have now recreated the ZFS pool as a Raid60 (2 x Raidz2 vdevs with 7 drives each). I know that performance is worse with Raidz2, but it is really, really weak.

Do you have any benchmark tools / commands that you recommend for getting a better overview?
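
One commonly used tool for this kind of test is fio. The sketch below only prints two fio command lines (a single-job run and a parallel run) so they can be reviewed before anything is executed inside a test VM. The test file path, file size and runtime are placeholder assumptions, and `fio_randwrite_cmd` is a made-up helper.

```shell
#!/bin/sh
# Print (not run) a 4k random-write fio command line.
# Arguments: test file path, queue depth, number of parallel jobs.
# Path, size and runtime are placeholders; adjust before running anything.
fio_randwrite_cmd() {
    printf 'fio --name=randwrite --filename=%s --rw=randwrite --bs=4k --direct=1 --ioengine=libaio --iodepth=%s --numjobs=%s --size=4G --runtime=60 --time_based --group_reporting\n' "$1" "$2" "$3"
}

# One job with queue depth 1, then 8 jobs with queue depth 32
fio_randwrite_cmd /root/fio-test.bin 1 1
fio_randwrite_cmd /root/fio-test.bin 32 8
```

Comparing the two runs shows how much of the result depends on parallelism rather than on the storage itself.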

The lspci output is attached.
 

I have now cloned a basic Debian 10 VM and made 8 test VMs. Running a simple "dbench -s 10" on all VMs in parallel shows that each VM does not get more than 140 MB/s.

While this simple test runs, the kvm processes on the host use a high amount of CPU; maybe this is normal. But there are also lots of "z_wr_iss" processes running, eating almost every CPU cycle available and driving the load average on the system to about 50. Is this normal if only 8 VMs are doing IO and there are no other CPU-intensive tasks?
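
Taken together, the per-VM figure gives an upper bound for what the pool delivered during this test. A quick sketch of that arithmetic, assuming all 8 VMs ran at the 140 MB/s ceiling at the same time (the helper name `aggregate_mb` is only for illustration):

```shell
#!/bin/sh
# Upper bound on combined throughput: number of VMs times per-VM MB/s.
aggregate_mb() {
    echo $(( $1 * $2 ))
}

aggregate_mb 8 140
```

That is at most about 1.1 GB/s in total across all VMs.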

I would be more than happy to give ssh access to the system. I can also buy a license for this system if that would speed up diagnosing the issue.
 
During the testing, after some time the Proxmox host started printing the following errors:

[ 5889.944895] Uhhuh. NMI received for unknown reason 2d on CPU 16.
[ 5889.944896] Do you have a strange power saving mode enabled?
[ 5889.944897] Dazed and confused, but trying to continue
[ 5890.105937] Uhhuh. NMI received for unknown reason 2d on CPU 16.
[ 5890.105938] Do you have a strange power saving mode enabled?
[ 5890.105939] Dazed and confused, but trying to continue
[ 5890.836785] Uhhuh. NMI received for unknown reason 2d on CPU 16.
[ 5890.836786] Do you have a strange power saving mode enabled?
[ 5890.836787] Dazed and confused, but trying to continue
[ 5891.807908] Uhhuh. NMI received for unknown reason 2d on CPU 22.
[ 5891.807909] Do you have a strange power saving mode enabled?
[ 5891.807910] Dazed and confused, but trying to continue
[ 5894.268946] Uhhuh. NMI received for unknown reason 2d on CPU 28.
[ 5894.268948] Do you have a strange power saving mode enabled?
[ 5894.268948] Dazed and confused, but trying to continue
[ 5897.400678] Uhhuh. NMI received for unknown reason 2d on CPU 25.
[ 5897.400680] Do you have a strange power saving mode enabled?
[ 5897.400680] Dazed and confused, but trying to continue
[ 5897.531486] Uhhuh. NMI received for unknown reason 2d on CPU 22.
[ 5897.531487] Do you have a strange power saving mode enabled?
[ 5897.531488] Dazed and confused, but trying to continue
[ 5898.987464] Uhhuh. NMI received for unknown reason 2d on CPU 28.
[ 5898.987465] Do you have a strange power saving mode enabled?
[ 5898.987466] Dazed and confused, but trying to continue
[ 5901.600507] Uhhuh. NMI received for unknown reason 2d on CPU 19.
[ 5901.600508] Do you have a strange power saving mode enabled?
[ 5901.600508] Dazed and confused, but trying to continue
[ 5906.766109] Uhhuh. NMI received for unknown reason 2d on CPU 22.
[ 5906.766110] Do you have a strange power saving mode enabled?
[ 5906.766111] Dazed and confused, but trying to continue
[ 5908.673252] Uhhuh. NMI received for unknown reason 2d on CPU 19.
[ 5908.673260] Do you have a strange power saving mode enabled?
[ 5908.673261] Dazed and confused, but trying to continue
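
To see whether the NMIs cluster on particular cores, the messages can be counted per CPU. A minimal sketch, assuming the kernel log lines look exactly like the ones above; `nmi_per_cpu` is an illustrative name, and on the host it would be fed from `dmesg`:

```shell
#!/bin/sh
# Count "NMI received for unknown reason" messages per CPU.
# Reads kernel log lines on stdin; the CPU number is the last field ("16.").
nmi_per_cpu() {
    awk '/NMI received for unknown reason/ {
        cpu = $NF
        sub(/\.$/, "", cpu)      # strip the trailing dot
        count[cpu]++
    }
    END {
        for (c in count) printf "CPU %s: %d\n", c, count[c]
    }' | sort -k2,2n
}

# Small sample taken from the log above; on the host use: dmesg | nmi_per_cpu
nmi_per_cpu <<'EOF'
[ 5889.944895] Uhhuh. NMI received for unknown reason 2d on CPU 16.
[ 5889.944896] Do you have a strange power saving mode enabled?
[ 5890.105937] Uhhuh. NMI received for unknown reason 2d on CPU 16.
[ 5891.807908] Uhhuh. NMI received for unknown reason 2d on CPU 22.
EOF
```

In the excerpt above the messages come from CPUs 16, 19, 22, 25 and 28.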
 
The linked article says that CPU usage is coming from ZFS compression. I have already disabled compression, just to be sure that CPU usage by compressing is no problem here.

Compression is off on the pool, and compression is off on each vm disk:

root@pve:~# zfs get all pve1-nvme | grep compression
pve1-nvme compression off local

root@pve:~# zfs get all pve1-nvme/vm-100-disk-0 | grep compression
pve1-nvme/vm-100-disk-0 compression off inherited from pve1-nvme
(same on every other virtual disk)
 
We have to isolate this problem and not mix everything together.
So, please let's start with the NVMe system first.

Is it possible to dump all data of the ZFS pool?

I would like to benchmark all 14 NVMe drives in parallel and compare the result with a single NVMe drive.

You have 26 devices that identify as PCI bridges.
Dell does not explain in their manuals what this is.
I guess it is a 26-port PCIe switch for the storage backplane. [1]
There is a second device that I can't identify. [2]

The syslog output you posted indicates that the device has to reset because nothing happens.
I guess an interrupt gets lost.

This is all new stuff, so the analysis will take time; there is no quick fix for it.
But we will find it.

1.) https://www.broadcom.com/products/pcie-switches-bridges/expressfabric#tab-PCIe2
2.) https://www.plda.com/applications/enterprise-storage
 
Dear Wolfgang, thanks for your help. I just sent you a direct message. I also bought a license for this host, so we can start a more in-depth analysis, and I can give you SSH access to the machine.
 
I made some tests on the same machine with 4 x Samsung PM1643 SAS SSDs on a PERC H730P controller in a Raid10.

The performance in my 8 benchmark VMs with the same dbench-based test is more than double what I get from the 14 x NVMe ZFS raid. :(
 
They are NVMe disks, so they are all directly connected to the PCIe bus.
Just out of curiosity, and forgive my ignorance: what do you mean by "directly connected"? What cables / slots / whatever are you using to put 14 of those drives in a Dell R7515? Does the Dell R7515 have 14 U.2 connectors?
 
