Extremely high i/o delay

clanktron

New Member
Dec 23, 2021
I've recently set up a few LXCs on my PVE node and am seeing terrible performance due to ridiculously high I/O delay (a minimum of 5-10% under little to no load, spiking to 40-70% when doing pretty much anything). My containers are running on a mirrored ZFS array of two 2TB HDDs, and I've made sure to allocate 6GB as the maximum RAM usage for the ARC cache. Not sure what the culprit is, but any help is appreciated. (Screenshot attached: Screen Shot 2021-12-29 at 12.29.33 PM.png)
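For reference, a 6GB ARC cap is normally set via the `zfs_arc_max` module parameter. A sketch, with the byte math spelled out (the commented lines require root on the PVE host):

```shell
# 6 GiB expressed in bytes for the zfs_arc_max module parameter
ARC_MAX=$((6 * 1024 * 1024 * 1024))
echo "$ARC_MAX"   # 6442450944

# Persist the cap across reboots:
# echo "options zfs zfs_arc_max=$ARC_MAX" > /etc/modprobe.d/zfs.conf
# update-initramfs -u
# Apply immediately without a reboot:
# echo "$ARC_MAX" > /sys/module/zfs/parameters/zfs_arc_max
```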
 
What HDDs do you use exactly?

A mirror of two HDDs will not perform very well when it comes to random IOPS. What are your containers doing more of: reading or writing data?

If it is reading, then increasing the ARC size (if you have the memory available) is a good way to reduce the IO delay during normal operation. The ARC will use up to 50% of the memory if available. If you run arcstat, you can check how often data reads can be served from the ARC and how often it needs to go down to the actual disks (costing performance). Check the manual page to see what each column represents: man arcstat.
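For example, a quick look at the hit ratio (the live command is commented out since it needs a running ZFS host; the counter values below are made up purely for illustration):

```shell
# Live view: print ARC stats every 5 seconds; watch 'hit%' (reads served
# from the ARC) vs 'miss' (reads that had to go to the physical disks):
# arcstat 5

# The hit ratio arcstat reports is just hits / (hits + misses).
# Illustrative numbers:
hits=950000
misses=50000
ratio=$((100 * hits / (hits + misses)))
echo "ARC hit ratio: ${ratio}%"   # prints "ARC hit ratio: 95%"
```

A sustained low hit ratio under your normal workload is the sign that the reads are disk-bound and a bigger ARC would help.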

If it is writing, and they do a lot of sync writes (a DB, for example), you could add a ZIL/SLOG device on a faster SSD (a small Intel Optane, for example) to store the ZIL on the fast SSD before it is written to the slower HDDs. It really does not need to be large. In my setup (luckily switched over to SSD only by now), the ZIL rarely used up more than 1GB.
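Attaching a SLOG is a one-liner; the sketch below uses placeholder pool and device names ('rpool' and the by-id path), so substitute your own:

```shell
# Attach a fast SSD partition as a dedicated log (SLOG) vdev:
zpool add rpool log /dev/disk/by-id/nvme-YOUR_SSD-part1
zpool status rpool   # the new 'logs' section should list the device
```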
 
What HDDs do you use exactly?
The HDDs are repurposed PS4 drives (Seagate 2TB, 5400RPM, 128MB cache), so not ideal for performance, but I didn't think it would be this bad.
A mirror of two HDDs will not perform very great when it comes to random IOPS. What are your containers doing more? Reading or writing data?
My containers are primarily reading data, with two of them being extremely small running a vpn and dns respectively. The third is running docker with a few web servers, nothing too intensive though.
If it is reading, then increasing the ARC size (if you have the memory available) is a good way to reduce the IO delay during normal operation. The ARC will use up to 50% of the memory if available. If you run arcstat, you can check how often data reads can be served from the ARC and how often it needs to go down to the actual disks (costing performance). Check the manual page to see what each column represents: man arcstat.
I could be interpreting my ARC data wrong, but it seems to be fine (see the attachments below).
If it is writing, and they have a lot of sync writes (DB for example), you could add a ZIL/SLOG device on a faster SSD (a small intel Optane for example) to store the ZIL on the fast SSD before it is written to the slower HDDs. It really does not need to be large. In my setup (luckily switched over to SSD only by now), the ZIL rarely used up more than 1GB.
Despite the lack of writes, I added a SLOG partition on the same SSD that the main OS runs on, just in case it might help, though it doesn't seem to be having any effect even after a reboot.
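One way to see whether the new SLOG is being touched at all (it only absorbs sync writes, so a read-heavy workload will leave it idle; 'rpool' is a placeholder for the actual pool name):

```shell
# Per-vdev activity every 5 seconds; the 'logs' row shows SLOG writes:
zpool iostat -v rpool 5
```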
 

Attachments

  • Screen Shot 2021-12-30 at 3.16.28 PM.png (31.4 KB)
  • Screen Shot 2021-12-30 at 3.20.05 PM.png (95.9 KB)
The HDDs are repurposed PS4 drives (Seagate 2TB, 5400RPM, 128MB cache), so not ideal for performance, but I didn't think it would be this bad.
Hehe, I checked the product number in the photos, and if it really is an ST2000LM007, then it uses SMR:
Recording process: Shingled Magnetic Recording (SMR), Drive Managed SMR
https://geizhals.eu/seagate-mobile-hdd-2tb-st2000lm007-a1394770.html
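To confirm the exact model on the host itself (the device path is a placeholder; smartctl comes from the smartmontools package), plus an illustrative check of the reported model string:

```shell
# Print drive identity, including the model number (needs root):
# smartctl -i /dev/sda

# Matching the reported model against known 2.5" Seagate SMR models
# (this list is illustrative, not exhaustive):
model="ST2000LM007"
case "$model" in
  ST2000LM007|ST2000LM015) echo "SMR drive" ;;
  *) echo "check the datasheet" ;;
esac
```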


Even if they performed okayish, as soon as one fails and you replace it with the same kind of drive, it is possible that the pool will never finish resilvering: those SMR drives tend to be problematic when large amounts of data are being written to them, causing the kernel to consider them failed if they don't respond in time. There was quite the drama in 2020 when WD sold Red drives with SMR without telling anyone about it (and they were not the only vendor who did that).

See https://blocksandfiles.com/2020/04/15/shingled-drives-have-non-shingled-zones-for-caching-writes/ for some background.

I recommend that you get some non-SMR disks to avoid problems in the future. Geizhals is a website that I really like because you can filter for exactly that, for example: https://geizhals.eu/?cat=hde7s&xf=13745_2000~3772_2.5~8457_Conventional+Magnetic+Recording+(CMR)
 
Even if they performed okayish, as soon as one fails and you replace it with the same kind of drive, it is possible that the pool will never finish resilvering: those SMR drives tend to be problematic when large amounts of data are being written to them, causing the kernel to consider them failed if they don't respond in time. There was quite the drama in 2020 when WD sold Red drives with SMR without telling anyone about it (and they were not the only vendor who did that).

Damn that really blows. Guess I can still use them for backups or something. Thanks for the help!
 
Damn that really blows. Guess I can still use them for backups or something.
That depends. They are really bad at writing a lot of data at once and might drop to 1MB/s or so. So if you want to use them for backups, you might want to only back up a few GB at a time, which is most of the time not what you want from a backup drive. But you could try them as a PBS datastore. It's not recommended to use HDDs for that, but at least the backups will be incremental, so only the difference from the last backup needs to be written, and that should write far less data at once than a vzdump backup.
 
But you could try them as a PBS datastore. It's not recommended to use HDDs for that, but at least the backups will be incremental, so only the difference from the last backup needs to be written, and that should write far less data at once than a vzdump backup.
That was my intention; I already have PBS running alongside PVE. Though I just checked, and the drives are luckily still within their return period, so I'll be getting reimbursed instead lol.
 
That's of course the best option.

BTW: If you really want to have some fun with your server, you should get some SSDs (enterprise SSDs are recommended; refurbished ones will also work and won't cost more than new consumer SSDs). The more VMs/LXCs you run, the more IOPS you need, and HDDs are really terrible at handling IOPS.
 
