Why are my VMs so slow?

proxwolfe

Hi,

My impression is that VMs on my server are running (more) slowly (than necessary).

To be fair, my server is not exactly cutting-edge hardware. However, while the VMs are slow, they only use a small percentage of the CPU allocated to them.

I would understand, if they were reaching their CPU limit. But why are they not using all the CPU they can get in order to run faster??? (And the reason is not that other VMs are using the remaining CPU power. The PVE host is also not using more than maybe 20% CPU power.)

So what is going on?

Thanks!
 
This is what I use to optimize IOPS (rough example commands are sketched after the list):

Set write cache enable (WCE) to 1 on SAS drives
Set VM cache to none
Set VM to use VirtIO-single SCSI controller and enable IO thread and discard option
Set VM CPU type to 'host'
Set VM VirtIO Multiqueue to number of cores/vCPUs

If using Linux:
Set Linux VMs IO scheduler to none/noop

If using Ceph:
Set RBD pool to use the 'krbd' option

I don't run Windows VMs.
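
For reference, this is roughly how those settings translate to commands. Only a sketch: the VMID 101, the storage/volume name, the device /dev/sda and the network bridge are placeholders for whatever your setup uses, and most of it can also be done in the GUI under the VM's Hardware tab:

# enable the write cache on a SAS drive (check the current value with: sdparm --get=WCE /dev/sda)
sdparm --set=WCE --save /dev/sda

# VM settings via the Proxmox CLI (VMID 101 and the volume name are examples)
qm set 101 --cpu host
qm set 101 --scsihw virtio-scsi-single
qm set 101 --scsi0 local-lvm:vm-101-disk-0,cache=none,iothread=1,discard=on
# multiqueue: set queues to the number of vCPUs; re-specifying net0 without the old MAC generates a new one
qm set 101 --net0 virtio,bridge=vmbr0,queues=4

# inside a Linux guest: IO scheduler for the virtual disk ('noop' on older, non-blk-mq kernels)
echo none > /sys/block/sda/queue/scheduler

# if the VM disks are on Ceph RBD: enable krbd on the storage definition (or tick KRBD in the GUI)
pvesm set your-rbd-storage --krbd 1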
 
To be able to suggest anything, you should provide more information; for example:
  • What are the full hardware specifications of the PVE-host? (Especially: CPU, RAM, VM-Storage)
  • What does the full storage environment for the VMs look like? (Local/Network? Raid? HW/SW? Exact configuration, please! Which filesystems?)
  • What are the overall tasks/workload of the VMs?
  • How many VMs are running?
  • How many resources do you give them?
  • What OSs are running in those VMs?
  • On what VMs do you see the slowness? All? Only Windows? Only Linux? Only Linux with desktop environments?

Do you have any comparable numbers?
Against what do you compare?

Slowness (especially when it is only judged subjectively) does not necessarily come from the CPU and/or its (under-)utilization. It can also come from overcommitting RAM and therefore swapping, and/or from disk IO. Or, in the case of desktop environments, from all the graphics having to be rendered by the CPU if you do not pass through a physical graphics card, for example. (A few quick checks for this are sketched at the end of this post.)

If you do not need live-migration between nodes with different hardware platforms, the recommendation to set the CPU-type of the VMs to host is a good start, even without knowing more about your exact environment.
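
A generic sketch of such checks on the PVE host (iostat comes from the sysstat package, which may need to be installed):

free -h        # is swap in use, is memory nearly full?
vmstat 1 5     # si/so = swap in/out, wa = CPU time spent waiting on IO
iostat -x 1 5  # per-disk utilization and latency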
 
Thanks for the IOPS optimization list - I will try all of it when I return from my vacation and report back.

In the meantime, and since I always want to learn, could you please also explain what the problem is? (You gave me the - supposed - solution, but I would love to understand how it works...) Many thanks in advance!
 
Thanks for stopping by!

I am not sure how the hardware specs of my host are relevant.

Maybe I need to clarify my issue: I am not complaining that my VMs are running slowly in an absolute sense. My issue is that they are running slowly while the host still has lots and lots of CPU cycles to spare. I am asking why the VMs are not using the CPU power available on the host (not why my host doesn't have more CPU power).
 
Let's say you use slow disks: then the CPU can't do its job because it has to wait for data to be read from or written to the disks first.
Or you are heavily overcommitting your RAM, and the resulting swapping slows everything down.
Or you are assigning more vCPUs than your hardware can serve and processes queue up.

All of these slow VMs down, and for all of them it is essential to know the specific hardware, VM config and workloads.
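
If you want to see which of these is biting, the pressure stall information of reasonably recent kernels (current PVE kernels have it) is a quick, generic way to check; it shows how much of the time tasks were stalled waiting on each resource:

cat /proc/pressure/cpu      # high values -> CPU contention, e.g. too many vCPUs
cat /proc/pressure/memory   # high values -> RAM pressure / swapping
cat /proc/pressure/io       # high values -> processes waiting on disks or network storage

In top, the "wa" value in the CPU line is the classic IO-wait indicator as well.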
 
Ah, I see. Thanks for clearing that up for me!!!

I can rule out overcommitting of RAM.

About too many vCPUs - would that not show as a high CPU load on the host? (Because that is not the case.)

With respect to disk usage - would that show up as IO delay? Because I do see that sometimes. I will do some tests with only one or two VMs that do not use the disks much and report back.
 
About too many vCPUs - would that not show as a high CPU load on the host? (Because that is not the case.)
You should see a lot of load, yes, but it doesn't have to be very high load.
With respect to disk usage - would that show up as IO delay? Because I do see that sometimes. I will do some tests with only one or two VMs that do not use the disks much and report back.
Yes, then IO delay should be high. In another thread you wrote that you are using Ceph, so even network problems could keep your CPU waiting for data.
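
Since Ceph writes have to cross the network, a quick sanity check of the links between the nodes can be worth it. A rough sketch, assuming iperf3 is installed on both nodes and 10.0.0.2 stands in for the other node's Ceph-network address:

# latency on the Ceph network (ideally well below a millisecond)
ping -c 100 -i 0.2 10.0.0.2

# throughput: run "iperf3 -s" on one node, then from another node:
iperf3 -c 10.0.0.2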
 
About that:

One thing I don't fully understand yet is whether Ceph is actually slowing down my system:

I have an OSD in each node, so I would expect access times to be similar to a non-Ceph setup with only local disks. For reads that should be the case, but for writes it might be that an operation is only reported back as completed once it has been written to all OSDs across all nodes, and that could then be delayed by the network.

Do you happen to know how this actually works with Ceph?
 
I'm no Ceph expert, but as far as I know Ceph keeps 3 copies of everything by default. At least sync writes should wait until all 3 OSDs have reported back that the data is safely stored, so a low-latency, fast NIC is needed for good sync write performance.
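
If you want numbers instead of a feeling, you can measure that directly. A rough sketch (pool name, file name and sizes are placeholders; rados bench writes test objects into the pool, so prefer a test pool or keep the run short):

# 10 seconds of single-threaded 4 KiB writes -> shows the per-write latency Ceph can deliver
rados bench -p testpool 10 write -b 4096 -t 1

# per-OSD commit/apply latency as seen by Ceph
ceph osd perf

# inside a VM: sync write latency as the guest sees it (needs fio installed)
fio --name=synctest --filename=/root/fio.test --size=1G --bs=4k --rw=write --fsync=1 --runtime=30 --time_based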
 