IO delay

Oct 9, 2024
I'm using Proxmox VE 8.4.1 with 2 nodes: one with a 24 x Intel(R) Xeon(R) Silver 4510 CPU and 128G of RAM, the other with a 20 x Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz and 256G of RAM. Both have 4 x 2TB HDD disks in RAID5 ZFS. On the node with the Xeon Silver 4510 and 128G of RAM I experience a lot of IO delay. On both nodes I only have one VM, which I migrate between one node and the other, but on the node with the more performant CPU the IO delay is very high, and during migration the delay increases a lot. What could it be?
 
4 x 2TB HDD disks in RAID5 ZFS
From that sentence it is not clear whether you have a hardware Raid5 and use ZFS on top of it, or whether you have a RaidZ1, which is similar to a Raid5.

Whichever it is - that does not work well. No way!

My recommendation, as mentioned multiple times here in the forum:
  • use an HBA, not hardware Raid - make sure PVE sees all physical disks
  • during installation build a single ZFS pool from mirrored vdevs (similar to Raid10) - and use all HDDs for this; each and every disk will be bootable at the end, which is really nice, isn't it?
  • after initial setup add a fast "Special Device", at least mirrored. This is crucial! Use two NVMe if possible, or two SATA SSDs. Use "Enterprise class" devices with PLP for this. This must be done on the CLI, afaik - see the sketch below for the basic commands. You'll find more detailed tutorials if you search for it...
This is the only approach I know which possibly(!) might get you acceptable performance for generic use. Note that "HDD only" just doesn't cut it nowadays...
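For the "Special Device" step, a minimal sketch of what the CLI side could look like; the pool name rpool and the /dev/disk/by-id/... paths are placeholders you would replace with your actual pool and SSDs:

```
# Add a mirrored special vdev (metadata) to an existing pool.
# Placeholder device paths - pick the real ones from: ls -l /dev/disk/by-id/
zpool add rpool special mirror /dev/disk/by-id/nvme-SSD_A /dev/disk/by-id/nvme-SSD_B

# Optionally let small data blocks (here: up to 16K) go to the special vdev as well:
zfs set special_small_blocks=16K rpool
```

Afterwards check with zpool status that the special vdev really shows up as a mirror.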
 
Before using ZFS I was using Ceph, but after a month I had serious IO problems. I would like to go back to Ceph and use 4 x 16TB HDDs with the addition of a 1TB SSD for DB/WAL - what do you think?
 
both have 4 x 2TB HDD disks

  • Are all eight the exact same model?
  • Are the controllers, to which those are connected, identical?
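Both points can be checked quickly from the shell with standard tools; this is just an illustration, nothing PVE-specific:

```
# Model, size and rotational flag of every physical disk:
lsblk -d -o NAME,MODEL,SIZE,ROTA

# Which SATA/SAS/RAID controllers are present:
lspci | grep -iE 'sata|sas|raid'
```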


Before using ZFS I was using Ceph, but after a month I had serious IO problems. I would like to go back to Ceph and use 4 x 16TB HDDs with the addition of a 1TB SSD for DB/WAL - what do you think?

Did you consider the fact that HDDs may simply not be sufficient for your workload?
 
I would like to go back to Ceph and use 4 x 16TB HDDs
Well..., no! You need several nodes and multiple OSDs per node for a good experience, besides other things like a fast (>=10GBit/s) and redundant network. I used Ceph for over a year in my "productive Homelab(!)" - starting as small as possible. Some notes:

 
I read that for Ceph they recommend using a few larger-capacity HDDs instead of many smaller ones, and they talk about a minimum of at least 3 nodes. Honestly, I find conflicting advice on the web.
 
I read that for Ceph they recommend using a few larger-capacity HDDs instead of many smaller ones,

Under which specific circumstances? To me this sounds plainly wrong. Especially with HDDs you want a zillion independent ones, not only a few.

Disclaimer, as already noted: I have dropped Ceph.
 
so it seems to me that it is better to use ZFS in RAID10, and not Ceph?

Yes.

Disclaimer: I am a well-known ZFS-fanboy... ;-)

 
I would like to make a ZFS RAID10 pool with 4 x 16TB HDDs
Okay.
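Just to illustrate the layout, a sketch of how such a mirrored-vdev ("Raid10"-like) pool could be created on the CLI - pool name and disk IDs are made up, and for the PVE boot pool you would normally let the installer build it for you:

```
# Hypothetical disk IDs - use the real ones from /dev/disk/by-id/
zpool create -o ashift=12 tank \
  mirror /dev/disk/by-id/ata-HDD_1 /dev/disk/by-id/ata-HDD_2 \
  mirror /dev/disk/by-id/ata-HDD_3 /dev/disk/by-id/ata-HDD_4

# Two mirror vdevs; writes are striped across them:
zpool status tank
```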

and then add a 1TB SSD for DB/WAL,

That's the wrong terminology. In ZFS there is an optional CACHE (L2ARC), which is a "read-only" caching device. And there is an SLOG, a "Separate LOG" for the ZIL (the ZFS Intent Log).

Both are usually NOT recommended, as they work differently than expected - most of the time. A Cache (L2ARC) is a second-level extension to the ARC (adaptive replacement cache), which always lives in RAM. When you add a secondary cache, it needs RAM of its own to work. This RAM is taken away from the system --> less RAM left for the normal ARC. Adding a large cache may therefore slow down your system. The recommendation is: upgrade your RAM to the absolute maximum possible. Only then re-evaluate (learn to read the output of arc_summary) the usefulness of a second-level cache.
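A quick sketch of how one might look at the ARC before deciding on an L2ARC (arc_summary ships with OpenZFS on PVE; the awk line is just one way to pull a few raw counters):

```
# Human-readable ARC report (size, target, hit/miss ratios):
arc_summary

# Or read a few raw counters straight from the kernel stats:
awk '$1=="size" || $1=="c_max" || $1=="hits" || $1=="misses"' /proc/spl/kstat/zfs/arcstats
```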

An SLOG is often understood as a write-cache, which it is not. The SLOG is essentially write-only; data is never read from it, with the one exception of a power failure where data was written to the SLOG but not yet to the data disks. In that case its data is read during the next boot when the pool is imported. Another aspect is that an SLOG accelerates SYNC writes only. "Normal" writes are asynchronous, and the SLOG has nothing to do with them.

It is worth noting that a ZIL exists with and without an SLOG. Without a dedicated SLOG the ZIL lives on the data disks. That's the main reason why SYNC writes are slow, and why a separate SLOG helps with exactly that.
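If it turns out that the workload really is heavy on SYNC writes (databases, NFS, ...), adding a mirrored SLOG is a one-liner on the CLI - the device paths below are placeholders:

```
# Placeholder device paths - use stable /dev/disk/by-id/ names in practice.
zpool add tank log mirror /dev/disk/by-id/nvme-SSD_A /dev/disk/by-id/nvme-SSD_B

# Watch per-vdev traffic; only sync writes should hit the log vdev:
zpool iostat -v tank 5
```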


In #3 I already mentioned a "Special Device". That's my recommendation. It must be good quality (mirrored and w/ PLP), as losing this one means losing the complete pool.
 
When I create the ZFS pool I can't find the option to add a disk as cache (L2ARC) or log (SLOG) - why?
Only staff can tell you.

A lot of things only work on the CLI... and with knowledge not presented in the PVE documentation...
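For completeness, a sketch of the CLI equivalents the GUI does not expose - pool and device names are placeholders, and as discussed above you should think twice before adding either:

```
# Add an L2ARC (cache) device:
zpool add rpool cache /dev/disk/by-id/nvme-SSD_C

# Add a log (SLOG) device - ideally mirrored, see the earlier example:
zpool add rpool log /dev/disk/by-id/nvme-SSD_D

# Both can be removed again later if they do not help:
zpool remove rpool /dev/disk/by-id/nvme-SSD_C
```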