PVE kernel 4.15.17 comes with high and unstable disk latency

PEB

Hi,

Today I realized that since the night of May 27 to 28, when we rebooted one of our nodes, its disk latency has gone nuts. Alas, it's not just the host's own disks: the disks of any VM started on this node are affected too. After investigating, I was able to correlate the start of the issue with that reboot and the resulting switch to PVE kernel 4.15.17-1.

For the host's disk latency, see fz_daily.png (daily graph) or fz_weekly.png (weekly graph). You can see that the problem started on May 28. The return to normal you see on the daily graph at 18:00 is from when we rebooted into PVE kernel 4.13.16-2-pve, after first trying an upgrade to 4.15.17-2-pve to see whether the issue had been fixed there.

For one of the VMs that was on the machine, see e.g. vert.png.

Note that the VM was moved to this host from another one on May 25, hence the burst between the 25th and the 28th, which is perfectly normal. The second burst, between the 27th and the 28th, is the issue.

The end of the graph is where we moved the VM off the host, as this VM is critical and can't tolerate a tenfold increase in disk I/O time/latency.

So there is an issue with the 4.15 PVE kernel.

Have you already been informed? If not, I hope this post serves its bug report purpose.

Cheers, and thanks for your work!
 

marsian

Interesting, but I would assume some more technical information on your environment could be helpful here ;) HDD/SSD types, Controller, Cache, Type of Storage (local 7k2/10k, iSCSI etc.), ...?
 

micush

Phoronix benchmarking shows that the Spectre and Meltdown patches can affect both CPU and disk performance. Perhaps that is what is going on here.
 

PEB

marsian said: "Interesting, but I would assume some more technical information on your environment could be helpful here ;) HDD/SSD types, Controller, Cache, Type of Storage (local 7k2/10k, iSCSI etc.), ...?"

Since the host's own disks are impacted too, I'd bet those details don't matter much. But if you need specific information, I'd be happy to provide it.

Regarding the Meltdown/Spectre patches, that may be it. But as you can see, it's not just an increase in average access times, it's an increase that comes with instability. Also, I can imagine slowdowns in many things, but x5 to x20 for latency? I'm doubtful.
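
To put numbers on a claim like "x5 to x20", one option is to sample /proc/diskstats under each kernel and compare the mean time per completed I/O. The following is only a minimal sketch and not something taken from the posts above: the function names parse_diskstats and mean_latency_ms are made up for illustration, and it assumes the standard Linux /proc/diskstats field layout (reads completed, ms spent reading, writes completed, ms spent writing).

```python
def parse_diskstats(text):
    """Return {device: (ios_completed, ms_spent)} from /proc/diskstats content."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or truncated lines
        name = fields[2]
        ios = int(fields[3]) + int(fields[7])    # reads + writes completed
        ms = int(fields[6]) + int(fields[10])    # ms spent reading + writing
        stats[name] = (ios, ms)
    return stats


def mean_latency_ms(before, after):
    """Mean ms per completed I/O for each device between two snapshots."""
    result = {}
    for dev, (ios_after, ms_after) in after.items():
        if dev not in before:
            continue
        ios_before, ms_before = before[dev]
        delta_ios = ios_after - ios_before
        if delta_ios > 0:  # idle devices have no meaningful latency
            result[dev] = (ms_after - ms_before) / delta_ios
    return result
```

Read /proc/diskstats twice a few seconds apart, pass both snapshots through parse_diskstats, and compare the result of mean_latency_ms under 4.13 and 4.15 with a similar workload. This only yields an average over the interval, so showing the instability would require repeated sampling.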
 

mac.linux.free

Do you have PTI enabled? Turn it off and see if that fixes it.

We took a 30% hit when we enabled Spectre/Meltdown mitigation. Same symptoms you're seeing.

For me, at least, it fixes it... hard to believe, but true.
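
A quick way to check whether PTI is actually active on a node is to look at the kernel's own report in /sys/devices/system/cpu/vulnerabilities/meltdown and at the boot parameters in /proc/cmdline. This is a rough sketch, assuming the running kernel exposes that sysfs file; pti_status and read_text are hypothetical helper names, and the returned labels are just illustrative.

```python
from pathlib import Path


def pti_status(meltdown_text, cmdline):
    """Classify PTI state from the sysfs Meltdown report and the kernel command line."""
    params = cmdline.split()
    if "Mitigation: PTI" in meltdown_text:
        return "enabled"
    if "nopti" in params or "pti=off" in params:
        return "disabled (boot parameter)"
    if meltdown_text.startswith("Not affected"):
        return "not needed"
    return "unknown"


def read_text(path):
    """Read a small text file, returning an empty string if it is missing."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return ""


if __name__ == "__main__":
    meltdown = read_text("/sys/devices/system/cpu/vulnerabilities/meltdown")
    cmdline = read_text("/proc/cmdline")
    print("PTI:", pti_status(meltdown, cmdline))
```

To turn PTI off for a test, nopti (or pti=off) can be added to the kernel command line, followed by a reboot. That removes the Meltdown mitigation, so it is better suited to a diagnostic run than to a permanent setting.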
 
