Random High IO issues

shaunieb

New Member
Oct 29, 2012
South Africa
Hi

Over the last month I've been experiencing random high IO issues on random nodes in my PVE4 cluster.

There is no indication of which VM is causing the high I/O, as all VMs can be idle at the time of the event.

If I run iotop on the VM with high I/O, the disk reads and writes are minimal: a few megs read/written on occasion.
What I do notice is that if I run top, id sits at 40-50, although there is no substantial resource use besides idle processes.
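To put a number on what top is showing, here is a minimal sketch of mine (not Proxmox tooling) that computes the system-wide iowait share from the same /proc/stat counters top reads; a high "wa" alongside a modest "id" would point at the storage path rather than the CPUs:

```python
# Minimal sketch: compute the system-wide iowait percentage from /proc/stat,
# the same source behind top's "wa" column. Two samples, a short interval apart.
import time

def cpu_times():
    """Return the aggregate CPU counters from the first line of /proc/stat."""
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]  # drop the leading "cpu" label
    return [int(x) for x in fields]

def iowait_percent(interval=1.0):
    """Percentage of CPU time spent in iowait over the sampling interval."""
    a = cpu_times()
    time.sleep(interval)
    b = cpu_times()
    delta = [y - x for x, y in zip(a, b)]
    total = sum(delta)
    # Field order: user, nice, system, idle, iowait, ... -> index 4 is iowait
    return 100.0 * delta[4] / total if total else 0.0

if __name__ == "__main__":
    print(f"iowait: {iowait_percent():.1f}%")
```

Run on the node itself (it needs Linux's /proc); a per-process view would still need iotop or pidstat, but this at least confirms whether the node-level spike is iowait.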

To resolve the issue, I need to migrate or shut down a VM or container. But not any VM or container. It will always be a specific one.

What I have noticed is that if I shut down the offending container, the issue resolves, but if I start the container up again, the issue resumes. Only if I restart the node can I start that specific container back up on it.

I'll attach some relevant data to this post. If anyone has had a similar problem, please let me know. On the storage side I am using Ceph with an InfiniBand backbone.
 

Attachments

  • IO_Issues.PNG (34.2 KB)
  • nodeA_Packages.PNG (25.9 KB)
  • nodeB_packages.PNG (28.9 KB)
  • nodeC_packages.PNG (29.5 KB)
The spikes here indicate that the processes of the VM/container were processing data faster than they could read it from the storage (Ceph in your case).
It's not indicating a problem per se; it just shows that your workload depends more on I/O than on CPU.
I suggest you run a benchmarking tool like bonnie++ inside the VM/container so you have numbers to see whether your storage matches your expectations (a few MB/s causing I/O wait is rather unusual).
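Before reaching for bonnie++, a quick sequential-write timing from inside the guest can serve as a first sanity check. This is a rough sketch of my own, not a substitute for a proper benchmark (no O_DIRECT, no random I/O, single thread):

```python
# Rough sanity check: time a sequential write with fsync to estimate
# write throughput in MB/s as seen from inside the guest.
import os
import tempfile
import time

def write_throughput_mb_s(size_mb=64, block_kb=1024):
    """Write size_mb of zeros in block_kb chunks, fsync, return MB/s."""
    block = b"\0" * (block_kb * 1024)
    fd, path = tempfile.mkstemp()
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(size_mb * 1024 // block_kb):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # force the data out to the backing storage
        elapsed = time.perf_counter() - start
        return size_mb / elapsed
    finally:
        os.remove(path)

if __name__ == "__main__":
    print(f"sequential write: {write_throughput_mb_s():.1f} MB/s")
```

If this already shows only a few MB/s against a Ceph-backed disk, the bottleneck is below the guest, and bonnie++ or fio would be the next step for proper numbers.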
 

Hi

Thanks for your response.
I do know what I/O wait indicates. The issue is that there is no indication of where that I/O load is coming from. None of the VMs show anything but idle I/O. When I migrate the VMs off that node, the I/O on the node I migrate them to remains static, and the I/O on the node I migrated them from drops back to idle. This would indicate that those VMs are definitely not contributing to a high I/O workload, as the workload would have moved with them.
 
