ZFS and I/O delay measurements in Proxmox 4.1.33

vigilian

Renowned Member
Oct 9, 2015
Hi,

A little introduction: a few months ago I had a big problem with high I/O delay and corrupted file contents inside the VMs on my OVH (provider) server, after a consistency error in my mdadm RAID. I had to completely restore all my VMs to get my setup running again. Since that event I've been a bit afraid this might happen again, especially since I don't understand how it could happen on a RAID 1. So since yesterday I've been installing my home Proxmox server on a ZFS RAID 10 pool and restoring my 20+ VMs. It was on an Adaptec RAID 1+0 card before. The disk setup is 4x 1TB Western Digital RE4 drives. The wish to change was motivated by the safety that ZFS offers compared with a traditional RAID.

Maybe it's a bad moment to be doing this, since Proxmox just changed the interface, so I don't know if they changed the I/O delay measurement too.
Anyway, I've completed the restoration of the VMs and I can state a few facts here:
- the I/O delay after starting all the VMs at once from a cold boot is really high and stays that way for a long time, something like 20 or 30 minutes at 30-60% I/O delay.

- when everything is started, the I/O delay settles down to 0.11-4%.

- the CPU usage is 50% of my i7-4790K because of a Windows 10 VM, but inside the VM it is only 2% of the CPU, because it basically doesn't do much, just a bit of browsing and music playing for testing purposes. It seems extraordinary to me that playing one song in a browser in one VM adds 25% more load on the CPU.


I didn't put any RAM limit on the system; it has 32GB, and that should be okay for now even if it currently uses 29GB with all the VMs launched.

Given these facts, some tasks are a bit slower than on my Adaptec RAID, some faster.
Can someone enlighten me about the I/O delay and CPU usage? Maybe it's a bad setup to do such a thing, and maybe I should return to a traditional Adaptec RAID?
All the help I can get will be appreciated.
 
To measure I/O you can look at zpool iostat and run an individual iostat on your disks. I had very bad performance on my 12-disk ZFS pool and was able to pin down a bad disk with very, very high I/O delay using iostat.

Concerning your startup issues: yes, it takes a little time at start, because your ARC is completely empty and needs to be filled before you gain performance. You can monitor your ARC with arcstat.py.

All of the mentioned tools accept a trailing interval parameter, e.g. 5 for a measurement every 5 seconds. This normally establishes a baseline for further investigation.
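For example, a minimal monitoring session could look like this (just a sketch; iostat comes from the sysstat package, and the 5-second interval is arbitrary):

Code:
# pool-level statistics, refreshed every 5 seconds
zpool iostat -v 5
# per-disk latency and utilisation, every 5 seconds
iostat -x 5
# ARC size and hit rate, every 5 seconds
arcstat.py 5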
 
thanks :)
And more generally, should I expect the same results as with my Adaptec RAID configuration, or is my system maybe not big enough to be as efficient as that?
Is it possible to know how much RAM is taken by ZFS?
 
Code:
root@ns001007:~# arcstat.py
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
20:55:27     0     0      0     0    0     0    0     0    0   10G   10G

Is that the RAM it's using?


Code:
                 capacity     operations    bandwidth
pool           alloc   free   read  write   read  write
----------     -----  -----  -----  -----  -----  -----
zpoolada       1016G   840G     19     71   158K   424K
  mirror        508G   420G      3     47  27.2K   272K
    sdb            -      -      0     15  4.80K   280K
    sdc            -      -      0     15  22.4K   280K
  mirror        508G   420G     16     24   131K   152K
    sdd            -      -      0      7  6.40K   158K
    sde            -      -      1      7   125K   158K
----------     -----  -----  -----  -----  -----  -----

Do you have any tutorial on how to debug this?
 
Code:
root@ns001007:~# arcstat.py
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
20:55:27     0     0      0     0    0     0    0     0    0   10G   10G

Is that the RAM it's using?

Yes, you are currently using 10 GB; ZFS will use at most half the RAM in your machine if not configured otherwise.
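If you want the exact figures, the ZFS module also exports them directly; a quick check (field names are from the ZFS on Linux arcstats, values are in bytes):

Code:
# current ARC size and configured maximum
grep -E '^(size|c_max)' /proc/spl/kstat/zfs/arcstats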

Code:
                 capacity     operations    bandwidth
pool           alloc   free   read  write   read  write
----------     -----  -----  -----  -----  -----  -----
zpoolada       1016G   840G     19     71   158K   424K
  mirror        508G   420G      3     47  27.2K   272K
    sdb            -      -      0     15  4.80K   280K
    sdc            -      -      0     15  22.4K   280K
  mirror        508G   420G     16     24   131K   152K
    sdd            -      -      0      7  6.40K   158K
    sde            -      -      1      7   125K   158K
----------     -----  -----  -----  -----  -----  -----

Do you have any tutorial on how to debug this?

So your ZFS pool is just idling at the moment.

What exactly do you want to debug? With iostat you can measure your per-disk latency. Keep in mind to run it with an interval, then you'll get a better impression.


You did not enable deduplication, did you? That's a very big performance killer and you need LOTS of RAM for that.
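If in doubt, you can check the properties directly on the pool (pool name taken from your zpool iostat output above):

Code:
zfs get dedup,compression zpoolada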
 
Thanks for the RAM indication. I'm quite a newbie in this area, so...
I did not enable deduplication, indeed. I've tried to keep it as simple as I could. That's why I dropped the Adaptec RAID: I wanted something more efficient than an external controller.
But I did not limit the RAM usage for ZFS, as I read that it was a good idea to let it use as much as it can in any circumstances.
Do you have some examples of a problem disk detected with iostat? Because as I'm new to this, I guess I wouldn't recognize the problem.
 

Attachments

  • iostatzpool.txt
    98.8 KB
During my test here, I've launched several apps and used others in my Windows 10 VM, with all the other VMs running.
Should I stress test the HDD inside a VM to get more accurate results? Or something else maybe?
It actually seems slower to me than my Adaptec RAID, and I don't really see why, since it's not a CPU problem nor a RAM problem, and basically the Adaptec RAID does the same job as ZFS here... Any advice?
 

That's what I'm talking about. With my Adaptec card it was more like 0.33%... here it's 1.50% and above. Normal or not normal?
 
Just in case it's related to the chipset:
Code:
root@ns001007:~# lspci
00:00.0 Host bridge: Intel Corporation 4th Gen Core Processor DRAM Controller (rev 06)
00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor Integrated Graphics Controller (rev 06)
00:03.0 Audio device: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor HD Audio Controller (rev 06)
00:14.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB xHCI (rev 04)
00:16.0 Communication controller: Intel Corporation 8 Series/C220 Series Chipset Family MEI Controller #1 (rev 04)
00:19.0 Ethernet controller: Intel Corporation Ethernet Connection I217-V (rev 04)
00:1a.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #2 (rev 04)
00:1b.0 Audio device: Intel Corporation 8 Series/C220 Series Chipset High Definition Audio Controller (rev 04)
00:1d.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #1 (rev 04)
00:1f.0 ISA bridge: Intel Corporation Z87 Express LPC Controller (rev 04)
00:1f.2 SATA controller: Intel Corporation 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode] (rev 04)
00:1f.3 SMBus: Intel Corporation 8 Series/C220 Series Chipset Family SMBus Controller (rev 04)
 
I didn't put any RAM limit on the system; it has 32GB, and that should be okay for now even if it currently uses 29GB with all the VMs launched.
ZFS is very RAM hungry. If you change nothing, ZFS will always use up to half of your memory. The minimum it can work with is 4GB (I have one backup machine with that). But 16-32GB is recommended for ZFS alone. So you are using all your RAM for VMs; that can't work without problems. A second problem can be a HW RAID controller. I talked to the PVE support and they said never to use a HW controller with ZFS. And yes, we tested it (also with an Adaptec, I don't know which one exactly). With FreeNAS we got a kernel panic under high I/O; with Proxmox it works, but with poor performance. So if you can erase your controller BIOS completely it should work; I've done this with my IBM SAS controller. Or buy a real SAS/SATA-only controller, build in enough memory and a nice SSD cache: https://pve.proxmox.com/wiki/Storage:_ZFS#Add_Cache_and_Log_to_existing_pool
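If you ever want to cap the ARC instead of letting it grow to half of the RAM, a minimal sketch would be a module option like this (the 8 GiB value is only an example, pick whatever leaves enough room for your VMs):

Code:
# /etc/modprobe.d/zfs.conf -- limit the ARC to 8 GiB (value in bytes)
options zfs zfs_arc_max=8589934592

Afterwards run update-initramfs -u and reboot so the option is picked up early.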

Hope this helps a little bit. I've done a lot of tests with ZFS before I used it in production.

What does "pveperf" say?
 
The minimum it can work with is 4GB (I have one backup machine with that).

Hmm, I'm running it with a 256MB ARC and it still works perfectly. So this 4 GB minimum is not a technical one.

But 16-32GB is recommended for ZFS alone. So you are using all your RAM for VMs; that can't work without problems. A second problem can be a HW RAID controller. I talked to the PVE support and they said never to use a HW controller with ZFS. And yes, we tested it (also with an Adaptec, I don't know which one exactly). With FreeNAS we got a kernel panic under high I/O; with Proxmox it works, but with poor performance. So if you can erase your controller BIOS completely it should work; I've done this with my IBM SAS controller. Or buy a real SAS/SATA-only controller, build in enough memory and a nice SSD cache: https://pve.proxmox.com/wiki/Storage:_ZFS#Add_Cache_and_Log_to_existing_pool

In general, do not use a RAID controller which cannot do JBOD. You have to use JBOD for ZFS, and that is what fireon described there with the SAS controller. I'm running one of those IT-flashed firmware controllers myself.
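A quick way to check whether the disks really are passed through untouched is to look at them by ID and query SMART directly (a sketch; smartctl comes from the smartmontools package and the device name is just an example):

Code:
# whole disks should show up with their real model and serial
ls -l /dev/disk/by-id/ | grep -v part
# SMART should answer without any controller-specific -d option
smartctl -i /dev/sdb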
 
Thanks for the RAM indication. I'm quite a newbie in this area, so...
I did not enable deduplication, indeed. I've tried to keep it as simple as I could. That's why I dropped the Adaptec RAID: I wanted something more efficient than an external controller.
But I did not limit the RAM usage for ZFS, as I read that it was a good idea to let it use as much as it can in any circumstances.
Do you have some examples of a problem disk detected with iostat? Because as I'm new to this, I guess I wouldn't recognize the problem.

Please provide "iostat 5" output while doing heavy I/O; your zpool output does not show any heavy load.
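As a sketch of how such a measurement could look (assuming the pool is mounted at its default /zpoolada; the test file is just an example, delete it afterwards):

Code:
# terminal 1: watch per-disk latency
iostat -x 5
# terminal 2: generate some write load on the pool
dd if=/dev/zero of=/zpoolada/ddtest bs=1M count=4096 conv=fdatasync
rm /zpoolada/ddtest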
 
ZFS is very RAM hungry. If you change nothing, ZFS will always use up to half of your memory. The minimum it can work with is 4GB (I have one backup machine with that). But 16-32GB is recommended for ZFS alone. So you are using all your RAM for VMs; that can't work without problems. A second problem can be a HW RAID controller. I talked to the PVE support and they said never to use a HW controller with ZFS. And yes, we tested it (also with an Adaptec, I don't know which one exactly). With FreeNAS we got a kernel panic under high I/O; with Proxmox it works, but with poor performance. So if you can erase your controller BIOS completely it should work; I've done this with my IBM SAS controller. Or buy a real SAS/SATA-only controller, build in enough memory and a nice SSD cache: https://pve.proxmox.com/wiki/Storage:_ZFS#Add_Cache_and_Log_to_existing_pool

Hope this helps a little bit. I've done a lot of tests with ZFS before I used it in production.

What does "pveperf" say?

@fireon and @LnxBil Maybe I wasn't clear enough, but I clearly stated that I previously had an Adaptec configuration with RAID 10 and without ZFS, and I didn't have any I/O problems with it. I wanted to change because of the safety features of ZFS, not really for performance, because I don't really buy the whole story about ZFS performance: you always take the side of the 128GB-of-RAM server, which is mostly not the case for people who rent an OVH server, for example, or any other provider's server. That is the case when YOU are an enterprise with a lot of big servers in a datacenter etc., but that kind of server is not necessarily the majority out there, since those companies often have their own management systems.
So please don't talk about RAID controllers, that's not what we are discussing here; plus it's clearly stated in the Proxmox wiki that we can't use ZFS with a RAID controller.

@fireon, you clearly didn't look at what I posted in the thread. I've posted the arcstat output, which shows that it uses 8-10GB of RAM, so no, my VMs don't eat all of my RAM.

@LnxBil this command is not recognized on my Proxmox, so I don't know what you are talking about. I did see it on the forum, but iostat doesn't exist on any of my Proxmox servers. zpool iostat exists, and that's what I posted earlier, but not iostat, so maybe point me to where I can find it.
 
@fireon
@LnxBil this command is not recognized on my Proxmox, so I don't know what you are talking about. I did see it on the forum, but iostat doesn't exist on any of my Proxmox servers. zpool iostat exists, and that's what I posted earlier, but not iostat, so maybe point me to where I can find it.

iostat is in the sysstat package and you should install it. It has very handy tools for analysing performance bottlenecks.
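Installing it and running it with an interval is enough to get started:

Code:
apt-get install sysstat
# extended per-device statistics, refreshed every 5 seconds
iostat -x 5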
 
Here is an example of a failed disk:

Code:
root@backup  ~ > iostat -x 5
...
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,06    0,00    0,09   17,38    0,00   82,47

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00    0,00    0,60     0,00     2,40     8,00     0,00    0,00    0,00    0,00   0,00   0,00
sdb               0,20     0,00    0,60    0,00   245,60     0,00   818,67     0,02   25,33   25,33    0,00   9,33   0,56
sdc               0,20     0,00    0,60    0,00   302,40     0,00  1008,00     0,01   16,00   16,00    0,00   9,33   0,56
sde               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00    0,00    0,00   0,00   0,00
sdd               0,20     0,00    0,60    0,00   254,40     0,00   848,00     0,00    6,67    6,67    0,00   5,33   0,32
sdq               0,00     0,00    1,40    0,00     7,20     0,00    10,29     0,01    9,14    9,14    0,00   9,14   1,28
sdn               0,00     0,00    1,00    0,00     6,40     0,00    12,80     0,00    2,40    2,40    0,00   2,40   0,24
sdk               0,00     0,00    6,60    0,00    85,60     0,00    25,94     0,00    0,61    0,61    0,00   0,24   0,16
sdh               0,00     0,00    2,80    0,00    66,40     0,00    47,43     0,00    0,86    0,86    0,00   0,57   0,16
sdo               0,00     0,00    4,00    0,00    92,00     0,00    46,00     0,01    2,60    2,60    0,00   2,40   0,96
sdf               0,00     0,00    3,00    0,00    66,40     0,00    44,27     0,01    3,20    3,20    0,00   3,20   0,96
sdi               0,00     0,00    0,20    0,00     0,80     0,00     8,00     0,00    0,00    0,00    0,00   0,00   0,00
sdl               0,00     0,00    4,00    0,00    92,80     0,00    46,40     0,01    3,40    3,40    0,00   1,40   0,56
sdj               0,00     0,00    5,00    0,00    83,20     0,00    33,28     0,01    2,40    2,40    0,00   0,80   0,40
sdg               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00    0,00    0,00   0,00   0,00
sdm               0,00     0,00    7,60    0,00   187,20     0,00    49,26     0,07    9,16    9,16    0,00   3,47   2,64
sdp               0,00     0,00    3,40   13,00   117,60   342,40    56,10    14,73 1005,85 1588,47  853,48  60,98 100,00

The faulty disk was sdp, which is clearly out of line with the others.
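Once a disk stands out like that, SMART usually confirms it; for example (device name from the output above, attribute names can vary by vendor):

Code:
smartctl -a /dev/sdp | grep -i -E 'reallocated|pending|uncorrect'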
 
Code:
root@ns001007:~# pveperf
CPU BOGOMIPS:      63980.16
REGEX/SECOND:      3284731
HD SIZE:           39.25 GB (/dev/dm-0)
BUFFERED READS:    106.39 MB/sec
AVERAGE SEEK TIME: 0.09 ms
FSYNCS/SECOND:     220.90
DNS EXT:           46.27 ms
DNS INT:           102.49 ms
root@ns001007:~# pveperf /zpoolada/
CPU BOGOMIPS:      63980.16
REGEX/SECOND:      3241946
HD SIZE:           252.92 GB (zpoolada)
FSYNCS/SECOND:     110.98
DNS EXT:           41.26 ms
DNS INT:           53.60 ms

As requested, the pveperf output. I guess the first run is on the system disk? The values seem a bit odd to me because it's a Samsung 840 Pro... seems a bit slow, doesn't it?

For the second one I don't know if I did it right, because it doesn't show the right size, which should be 2TB.
 
Maybe that's better?
Code:
root@ns001007:~# pveperf /dev/zpoolada/
CPU BOGOMIPS:      63980.16
REGEX/SECOND:      3302348
HD SIZE:           0.01 GB (udev)
FSYNCS/SECOND:     161446.17
DNS EXT:           43.48 ms
DNS INT:           26.18 ms
 
Any idea what would make a good workload for an iostat test?
Resetting a VM is a bad idea, since it keeps everything in RAM.
I would really be glad to get some clarification about pveperf and whether I'm testing the right locations, since the default test doesn't seem right, etc.

And I'm more worried about the iowait at idle... Do you all have that, or is my system maybe corrupted in some way, and should I perhaps reinstall the whole thing?
 
Code:
root@ns001007:~# pveperf
CPU BOGOMIPS:      63980.16
REGEX/SECOND:      3284731
HD SIZE:           39.25 GB (/dev/dm-0)
BUFFERED READS:    106.39 MB/sec
AVERAGE SEEK TIME: 0.09 ms
FSYNCS/SECOND:     220.90
DNS EXT:           46.27 ms
DNS INT:           102.49 ms
root@ns001007:~# pveperf /zpoolada/
CPU BOGOMIPS:      63980.16
REGEX/SECOND:      3241946
HD SIZE:           252.92 GB (zpoolada)
FSYNCS/SECOND:     110.98
DNS EXT:           41.26 ms
DNS INT:           53.60 ms

As requested, the pveperf output. I guess the first run is on the system disk? The values seem a bit odd to me because it's a Samsung 840 Pro... seems a bit slow, doesn't it?

For the second one I don't know if I did it right, because it doesn't show the right size, which should be 2TB.

Where does the Samsung come into play? You never mentioned it before. But yes, these are not fast devices for synchronous writes, though not that slow either; you should see much better values there.

Unfortunately, pveperf is not a good tool for measuring I/O performance, because it can produce very strange values, e.g.

Code:
BUFFERED READS:    121.36 MB/sec
AVERAGE SEEK TIME: 7.26 ms
FSYNCS/SECOND:     4264.41

This is a stock hardware RAID controller in an HP DL360 G6 with two 146 GB drives. It is completely unreasonable to get such a big number there. For performance testing, please always consider a fio-based test.
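A minimal fio sketch for synchronous random writes against the pool could look like this (filename, size and runtime are just examples; direct I/O is left off because ZFS datasets generally don't accept O_DIRECT):

Code:
apt-get install fio
fio --name=sync-randwrite --filename=/zpoolada/fio-test --rw=randwrite \
    --bs=4k --size=1G --ioengine=libaio --iodepth=16 \
    --fsync=1 --runtime=60 --time_based --group_reporting

The fsync=1 option makes the result roughly comparable to the FSYNCS/SECOND value from pveperf; remove the test file afterwards.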
 
