VM I/O Performance with Ceph Storage

t0mc@ · Jan 12, 2023

Hi everybody,

a while ago we set up a three node Proxmox cluster and as storage backend we use the built in ceph features.
After a while we noticed a strong decreasing of I/O Perfomance in the VMs when it comes to writes of small files.
Writing a single big file at once seems to perform quit normal, also does reading in all cases.
When moving complete VM Disks from Ceph to another external storage and back also performs very good, no problems here.

A concrete test case:
Having a test VM with a plain Debian installation, we install gimp ("apt install gimp"). The extraction of these 116 packages, mostly each of them just a few KB in size takes much longer as it should. This is even worse when removing gimp and all dependencies completely ("apt autoremove gimp").
What is to be mentioned: When we do this little test, it doesn't matter if this Test VM is running completely alone in the cluster or if there are other VMs running; performance is always the same bad.

This is our architecture:
Three identical HP ProLiant DL380 Gen9, each of them:

CPU: 2x Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
RAM: 128GB DDR4 2133 MHz
Storage
- HDD Controller: HP Smart HBA H240 (current firmware v 7.20) with these HDDs:
  - 4 identical 3TB SAS 7.200 rpm (Model: HP MB3000FCWDH)
- 1xNVMe 240GB
- 1xNVMe 1TB
- 1xNVMe 2TB
Network:
- HPE Ethernet 1Gb 4-port 331i Adapter - NIC
- HP InfiniBand FDR/Ethernet 40Gb 2-port 544+FLR-QSFP

Configuration:

Proxmox VE 7.3-4 OS installed at the 240GB NVMe
Proxmox 3 Node Cluster:
Network config:
- 1Gb Ports of 331i connected to a switch for Proxmox and "outside" traffic, as a LACP bond (Network 172.20.4.0/24)
- 40 Gb Ports of InfiniBand directly connected between the 3 servers as a RSTP Loop for Ceph Cluster Traffic (Network 172.20.81.0/24), as described here: https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server
Ceph config:
- cluster_network: 172.20.81.0/24 (The 40Gb InfiniBand RSTP Loop)
- public_network: 172.20.4.0/24 (The 1Gb LACP Bond)
- 1 Monitor and Manager Service running on each Proxmox node in 172.20.81.0/24 Net:
- 1 MDS on each node for cephfs running in 172.20.4.0/24 Net:
- OSD: 4xSAS HDDs and 2 remaining NVMes (1x 1TB, 1x2TB) on each Node:
- Pools / Proxmox Storage:
  - 1 Pool just using NVMes (crush rules)
  - 1 Pool just using HDDs (crush rules)

(Now reaching the 10 screenshot Limit, continuing in the next Post)

t0mc@ · Jan 12, 2023

(Continue: )

Test VM Config:

Installing Gimp at this VM lasts 3-4 mins, just made a screencast of it and uploaded to a dropbox:

https://www.dropbox.com/s/ws7hmxzdhpgtuaa/InstallGimp.webm?dl=0
Notice the "Extraction" or "Config" Phase... this is original speed, not SlowMo

BTW:
For tracking down this issue we activated Prometheus Metrics in the Ceph Cluster and scrape it with Prometheus. This is shown in the corresponding Grafana Dashboard:

We notice the "Avg Apply Latency" and "Avg Commit Latency" (and "OSD Apply/Commit Latency Distribution") which never seems to get below ~18ms and also seen much higher vlaues (>128ms) but can't really interpret if these values are normal or too slow?!

Is there anybody who has a hint or an idea what could cause this situation?
Thx a lot in advance, we are investigating this for several weeks now without a conclusion.
T0mc@

shanreich · Jan 12, 2023

One thing I noticed in your setup, although I am not sure whether this is related to the issues you are experiencing:

You have the public network set to the 1Gb bond, which is probably not what you want. The communication with Ceph clients runs via the public network. The cluster network is used for heartbeat and replication traffic to lessen load on the public network. This means that communication of the clients (VMs) runs via the 1Gb network, which can quickly congest.

You can find more information about this in the Ceph documentation [1]

I would try using the 40Gb network as your public network and see if that resolves the issue.

[1] https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/

t0mc@ · Jan 12, 2023

@shanreich
Thx for the hint, but this was the original setup we used till some days ago. (Meaning: we HAD the public network also at the 40Gb Nics).
During the try - and - error investigations in the last weeks, this was the last we changed: Set the public network from 172.20.81.0/24 to 172.20.4.0/24 a few days ago. We wanted to make sure, the ceph inter-cluster traffic should be as fast as possible because - as far as we understood - this belongs to the repliaction traffic spanning the nodes, too.
But the conclusion is:
No matter, if public network is configured to the 4x1Gb bond or the 40Gb Loop, VM small-file-write i/o performance isn't affected, the same bad in both cases.

t0mc@ · Jan 14, 2023

Perhaps any other idea?

shanreich · Jan 16, 2023

What kind of disks are those NVMEs? Have you checked with iperf whether the network is fine and delivers the bandwidth you are expecting?

You could also try setting the VM Disk Cache to Writeback and see if that increases performance.

dragon2611 · Jan 16, 2023

I suspect this won't be needed if you are using proper enterprise SSDs but if not you may need to enable trim

I believe it's

Code:

ceph config set global bdev_enable_discard true

But I'm taking that from my very badly written notes on what I did on my test cluster so the command may not be correct

It can also be set per OSD if prefered.

t0mc@ · Jan 16, 2023

shanreich said:
What kind of disks are those NVMEs? Have you checked with iperf whether the network is fine and delivers the bandwidth you are expecting?

You could also try setting the VM Disk Cache to Writeback and see if that increases performance.

-> The disks are those:
Crucial P2 CT1000P2SSD8 (1TB)
Crucial P2 CT2000P2SSD8 (2TB)
Connected via PCIe Adapter Card to PCIe 4x Slots

-> iperf gave this:

-> Setting the VM Disk Cache to "WriteBack" doesn't really change anything. BUT: Setting this to "WriteBack (unsafe)" massively increases performance of our little "Gimp Install" Test Case.
BUT: When I remember correctly, Proxmox itself doesn't advise to use any caches at the VM Disks at all, but set it to "No Cache" (which also is the default) when using Storage Backends like Ceph or ZFS (especially in productive environments). Am I wrong?

dragon2611 · Jan 16, 2023

You may need the trim support given those are consumer SSD's

t0mc@ · Jan 16, 2023

dragon2611 said:
I suspect this won't be needed if you are using proper enterprise SSDs but if not you may need to enable trim

I believe it's

Code:

ceph config set global bdev_enable_discard true

But I'm taking that from my very badly written notes on what I did on my test cluster so the command may not be correct

It can also be set per OSD if prefered.

We just use commands like this to enable this Option at OSD with NVMe Disks:

Code:

ceph config set osd.4 bdev_enable_discard true

BUT now things even get much worse... immediatly latencies increases on these disks:

t0mc@ · Jan 16, 2023

dragon2611 · Jan 16, 2023

t0mc@ said:
We just use commands like this to enable this Option at OSD with NVMe Disks:

Code:

ceph config set osd.4 bdev_enable_discard true

BUT now things even get much worse... immediatly latencies increases on these disks:
View attachment 45662

Probably will do initially yes.

t0mc@ · Jan 16, 2023

OK, then we just wait a while...fingers crossed

dragon2611 · Jan 16, 2023

From what I read the reason the Ceph devs decided to disable Trim is they believe enterprise SSD's will do it internally, but consumer ones may not. I suspect it needs to 'catch up' now that you've enabled it.

You may see odd spikes when it trims but it shouldn't be constant high latency.

t0mc@ · Jan 16, 2023

@dragon2611
Do u have any Idea, how long it should take when things calm down again? Until now the latencies have horrible values:

VMs with Disks at the NVMes are near to being unusable

dragon2611 · Jan 16, 2023

Might be worth doing it overnight if you can.

t0mc@ · Jan 16, 2023

OK, then we will give it a try over night..

BTW: Just stumbled across this:
https://github.com/rook/rook/issues/6964
There the Option "bdev_async_discard" beside "bdev_enable_discard" is mentioned and also set to "true". When being set to false in combination with "bdev_enable_discard" enabled should be a performance killer, too, as far as I understood. (Yes, this is for Kubernetes rook implementation of ceph, but may also an issue here)
Will test that also...

Neobin · Jan 16, 2023

t0mc@ said:
-> The disks are those:
Crucial P2 CT1000P2SSD8 (1TB)
Crucial P2 CT2000P2SSD8 (2TB)

Your initial post is really well structured and detailed, but the only thing you did not mention, but is one of the most important factors for IO-problems, were the exact model numbers of the SSDs used.
Why? Did you already know, that those consumer drives are really bad for things like, in your case, Ceph?: [1]

It looks like those drives get meanwhile also produced with QLC-nand, which would even be more worst, if you got those... (But even with TLC-nand on those, it would not really be better!)

If you want appropriate performance, get reasonable enterprise SSDs with PLP.

[1] https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2020-09-hyper-converged-with-nvme.76516 (Especially the comparison table on page 3.)

t0mc@ · Jan 16, 2023

@Neobin
Thx for your hints... in the meanwhile we also came across the fact that it is not the best idea to use consumer NVMes for ceph when it comes to performance, so far, so good...
BUT: is it really realistic that the VM I/O performance (and notice: just when writing small files) when using NVMEs as backend feels even SLOWER compared when using SAS HDDs with 7.200 rpm? Sure, we understood that consumer NVMEs would not perform as good as enterprise ones... but even SLOWER than pure HDDs? Really?
We can't really believe that this should be the case just because these are not enterprise NVMEs... We assume, we have another - more general - problem here.

dragon2611 · Jan 16, 2023

Should have commented this earlier but wasn't really thinking about it then.

Those P2's are really slow for sustained writes once you exceed the SLC cache, I was getting ~80MB/s on the one in my gaming pc (Although it's a rarely an issue in that use case as i'm rarely writing enough data in a short timeframe to trigger this).

If that machine is doing anything close to production then it needs higher quality SSDs.

I do have some really jankily built proxmox deployments including a single machine using CEPH and 3 low power USFF nodes running it over Gigabit and lowly i3's, and yes I use consumer SSD's

But the difference there is the loss of "production" just means my home automation doesn't work or I may lose a couple of servers that maybe used by myself and a few friends. Whilst having to actually use a light switch is annoying it's not going to cause more than an inconvience.

For actual production clusters where loss of service = revenue loss then it's worth shelling out for the proper hardware, I will likely be building a Hyperconverged proxmox setup soon and that one will be using enterprise rated SSD's

VM I/O Performance with Ceph Storage

Well-Known Member

Attachments

Well-Known Member

Attachments

Proxmox Staff Member

Well-Known Member

Well-Known Member

Proxmox Staff Member

Renowned Member

Well-Known Member

Renowned Member

Well-Known Member

Well-Known Member

Renowned Member

Well-Known Member

Renowned Member

Well-Known Member

Renowned Member

Well-Known Member

Famous Member

Well-Known Member

Renowned Member