Hi,
Today I realized that since the night of May 27th to 28th, when we rebooted one of our nodes, its disk latency has gone through the roof. Unfortunately, it's not just the host's own disks: the disks of any VM started on this node are affected too. While investigating, I was able to correlate the start of the issue with the reboot and the subsequent switch to pve kernel 4.15.17-1.
For the host's disk latency, see the daily graph or the weekly version. You can see that the problem started on the 28th of May. The "return to normal" visible on the daily graph at 18:00 is after we rebooted into pve kernel 4.13.16-2-pve, having first tried upgrading to 4.15.17-2-pve to see whether the issue had been fixed there.
For the VMs that were on the machine, see e.g.
Note that the VM was migrated to this host from another one on the 25th of May, hence the burst between the 25th and the 28th, which is perfectly normal. The second burst, between the 27th and the 28th, is the issue.
The end of the graph is when we moved the VM off the host, as this VM is critical and can't afford a 10x increase in the I/O time/latency of its disks.
So, there is an issue with the 4.15 pve kernel.
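In case anyone wants to check their own node without a monitoring stack, here is a minimal Python sketch that estimates the average time per completed I/O from two snapshots of `/proc/diskstats`. This is not how the graphs above were produced; it is only an illustration. The field positions follow the kernel's documented `/proc/diskstats` layout, while the function names and the `interval_s` parameter are made up for this example.

```python
import time


def parse_diskstats(text):
    """Return {device: (ios_completed, ms_spent)} from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 11:
            continue  # skip short or malformed lines
        name = parts[2]
        reads, read_ms = int(parts[3]), int(parts[6])
        writes, write_ms = int(parts[7]), int(parts[10])
        stats[name] = (reads + writes, read_ms + write_ms)
    return stats


def average_latency_ms(before, after):
    """Average ms per completed I/O for each device between two snapshots."""
    result = {}
    for name, (ios_after, ms_after) in after.items():
        if name not in before:
            continue
        ios_before, ms_before = before[name]
        delta_ios = ios_after - ios_before
        if delta_ios > 0:  # devices with no I/O in the interval are omitted
            result[name] = (ms_after - ms_before) / delta_ios
    return result


def sample_latency(interval_s=10, path="/proc/diskstats"):
    """Hypothetical helper: take two snapshots interval_s seconds apart."""
    with open(path) as f:
        first = parse_diskstats(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        second = parse_diskstats(f.read())
    return average_latency_ms(first, second)
```

Running `sample_latency(10)` under each kernel during a comparable load should give per-device numbers that can be compared directly.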
Have you already been made aware of this? If not, I hope this post can serve as a bug report.
Cheers, and thanks for your work!