VM I/O Performance with Ceph Storage

t0mc@

Hi everybody,

A while ago we set up a three-node Proxmox cluster, using the built-in Ceph features as the storage backend.
After a while we noticed a strong decrease in I/O performance in the VMs when it comes to writes of small files.
Writing a single big file at once seems to perform quite normally, as does reading in all cases.
Moving complete VM disks from Ceph to another external storage and back also performs very well; no problems there.

A concrete test case:
On a test VM with a plain Debian installation, we install GIMP ("apt install gimp"). The extraction of these 116 packages, most of them just a few KB in size, takes much longer than it should. It is even worse when removing GIMP and all its dependencies again ("apt autoremove gimp").
Worth mentioning: it doesn't matter whether this test VM is the only one running in the cluster or other VMs are running as well; performance is equally bad either way.
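(To put a number on the small-file write path independent of apt, a quick sync-write fio run inside the VM could look roughly like this; the target file /root/fiotest and the sizes are just examples:)

Code:
# 4k random writes with an fsync after every write, similar in spirit to dpkg's extract/fsync pattern
fio --name=smallwrite --filename=/root/fiotest --size=256M \
    --rw=randwrite --bs=4k --fsync=1 --ioengine=psync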

This is our architecture:
Three identical HP ProLiant DL380 Gen9, each of them:
  • CPU: 2x Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
  • RAM: 128GB DDR4 2133 MHz
  • Storage
    • HDD Controller: HP Smart HBA H240 (current firmware v 7.20) with these HDDs:
      • 4x identical 3TB SAS 7,200 rpm (model: HP MB3000FCWDH)
    • 1x NVMe 240GB
    • 1x NVMe 1TB
    • 1x NVMe 2TB
  • Network:
    • HPE Ethernet 1Gb 4-port 331i Adapter - NIC
    • HP InfiniBand FDR/Ethernet 40Gb 2-port 544+FLR-QSFP
Configuration:
  • Proxmox VE 7.3-4 OS installed on the 240GB NVMe
  • Proxmox 3-node cluster (screenshot)
  • Network config:
    • 1Gb ports of the 331i connected to a switch for Proxmox and "outside" traffic, as an LACP bond (network 172.20.4.0/24)
    • 40Gb ports of the InfiniBand adapter directly connected between the 3 servers as an RSTP loop for Ceph cluster traffic (network 172.20.81.0/24), as described here: https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server (screenshot)
  • Ceph config:
    • cluster_network: 172.20.81.0/24 (The 40Gb InfiniBand RSTP Loop)
    • public_network: 172.20.4.0/24 (The 1Gb LACP Bond)
      (screenshot)
    • 1 monitor and 1 manager service running on each Proxmox node in the 172.20.81.0/24 net (screenshots)
    • 1 MDS for CephFS running on each node in the 172.20.4.0/24 net (screenshot)
    • OSDs: 4x SAS HDDs and the 2 remaining NVMes (1x 1TB, 1x 2TB) on each node (screenshot)
    • Pools / Proxmox storage:
      • 1 pool using only NVMes (via CRUSH rule)
      • 1 pool using only HDDs (via CRUSH rule)
        (screenshot)
(Now reaching the 10-screenshot limit, continuing in the next post)
 

(Continued:)

  • Test VM config (screenshots)
Installing GIMP on this VM takes 3-4 minutes; we made a screencast of it and uploaded it to Dropbox:

https://www.dropbox.com/s/ws7hmxzdhpgtuaa/InstallGimp.webm?dl=0
Notice the "Extraction" or "Config" Phase... this is original speed, not SlowMo ;)

BTW:
To track down this issue we enabled Prometheus metrics in the Ceph cluster and scrape them with Prometheus. The corresponding Grafana dashboard (screenshots) shows the "Avg Apply Latency" and "Avg Commit Latency" (and the "OSD Apply/Commit Latency Distribution"), which never seem to drop below ~18ms; we have also seen much higher values (>128ms), but we can't really tell whether these values are normal or too slow.
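For anyone wanting to cross-check without Grafana: the same per-OSD latencies can also be read directly on a node, e.g. with

Code:
# shows per-OSD commit/apply latency in milliseconds, the same values the dashboard plots
ceph osd perf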

Does anybody have a hint or an idea what could cause this situation?
Thanks a lot in advance; we have been investigating this for several weeks now without a conclusion.
T0mc@
 

One thing I noticed in your setup, although I am not sure whether this is related to the issues you are experiencing:

You have the public network set to the 1Gb bond, which is probably not what you want. The communication with Ceph clients runs via the public network; the cluster network is only used for heartbeat and replication traffic, to lessen the load on the public network. This means that the client (VM) traffic runs via the 1Gb network, which can quickly become congested.

You can find more information about this in the Ceph documentation [1].

I would try using the 40Gb network as your public network and see if that resolves the issue.

[1] https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/
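For reference, on a PVE-managed cluster both settings live in /etc/pve/ceph.conf; a minimal sketch of what that section could look like with the 40Gb mesh as public network (addresses taken from your post):

Code:
# /etc/pve/ceph.conf (excerpt)
[global]
    public_network  = 172.20.81.0/24   # 40Gb full mesh: client/VM traffic
    cluster_network = 172.20.81.0/24   # heartbeat + replication traffic

Note that the monitors bind to addresses in the public network, so they typically have to be recreated (or re-addressed) after such a change.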
 
@shanreich
Thanks for the hint, but this was the original setup we used until a few days ago (meaning: we HAD the public network on the 40Gb NICs as well).
During the trial-and-error investigations of the last weeks, this was the last thing we changed: a few days ago we set the public network from 172.20.81.0/24 to 172.20.4.0/24. We wanted to make sure the Ceph inter-cluster traffic is as fast as possible, because, as far as we understood, the replication traffic spanning the nodes belongs there too.
But the conclusion is:
No matter whether the public network is configured on the 4x 1Gb bond or the 40Gb loop, VM small-file-write I/O performance isn't affected; it is equally bad in both cases.
 
What kind of disks are those NVMEs? Have you checked with iperf whether the network is fine and delivers the bandwidth you are expecting?

You could also try setting the VM Disk Cache to Writeback and see if that increases performance.
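A quick bandwidth sanity check between two nodes over the 40Gb mesh could look like this (assuming iperf3 is installed and 172.20.81.12 is the peer's mesh address):

Code:
# on the first node
iperf3 -s
# on the second node, 4 parallel streams for 30 seconds
iperf3 -c 172.20.81.12 -P 4 -t 30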
 
I suspect this won't be needed if you are using proper enterprise SSDs but if not you may need to enable trim

I believe it's
Code:
ceph config set global bdev_enable_discard true

But I'm taking that from my very badly written notes on what I did on my test cluster, so the command may not be correct.

It can also be set per OSD if preferred.
 
What kind of disks are those NVMEs? Have you checked with iperf whether the network is fine and delivers the bandwidth you are expecting?

You could also try setting the VM Disk Cache to Writeback and see if that increases performance.
-> The disks are these:
Crucial P2 CT1000P2SSD8 (1TB)
Crucial P2 CT2000P2SSD8 (2TB)
Connected via PCIe adapter cards to PCIe x4 slots

-> iperf gave this:
(screenshot)

-> Setting the VM disk cache to "Writeback" doesn't really change anything. BUT: setting it to "Writeback (unsafe)" massively increases performance in our little "GIMP install" test case.
BUT: if I remember correctly, Proxmox itself advises against using any cache on the VM disks and recommends "No cache" (which is also the default) when using storage backends like Ceph or ZFS, especially in production environments. Am I wrong?
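(For anyone following along: the cache mode can also be switched on the CLI, roughly like this; the VM ID, pool and disk name are placeholders:)

Code:
# re-specify the existing volume with the desired cache mode (names are placeholders)
qm set 101 --scsi0 nvme_pool:vm-101-disk-0,cache=writeback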
 
I suspect this won't be needed if you are using proper enterprise SSDs but if not you may need to enable trim

I believe it's
Code:
ceph config set global bdev_enable_discard true

But I'm taking that from my very badly written notes on what I did on my test cluster, so the command may not be correct.

It can also be set per OSD if preferred.
We just used commands like this to enable this option on the OSDs with NVMe disks:

Code:
ceph config set osd.4 bdev_enable_discard true

BUT now things got even worse... latencies on these disks increased immediately:
(screenshot)
 
From what I read, the reason the Ceph devs decided to disable trim is that they believe enterprise SSDs will do it internally, but consumer ones may not. I suspect it needs to 'catch up' now that you've enabled it.

You may see odd spikes when it trims, but it shouldn't be constant high latency.
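If it never settles, the setting can be rolled back the same way it was set (again from my notes, so double-check):

Code:
ceph config rm osd.4 bdev_enable_discard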
 
@dragon2611
Do you have any idea how long it should take until things calm down again? So far the latencies show horrible values:

(screenshots)


VMs with disks on the NVMes are close to unusable.
 
OK, then we will give it a try overnight...

BTW: Just stumbled across this:
https://github.com/rook/rook/issues/6964
There the option "bdev_async_discard" is mentioned alongside "bdev_enable_discard" and is also set to "true". Leaving it at false while "bdev_enable_discard" is enabled is apparently a performance killer, too, as far as we understood. (Yes, this is about the Kubernetes Rook implementation of Ceph, but it may be an issue here as well.)
Will test that also...
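(For the record, the analogous command would presumably be the following, again per OSD; whether it behaves the same outside of Rook is an assumption on our side:)

Code:
ceph config set osd.4 bdev_async_discard true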
 
-> The disks are these:
Crucial P2 CT1000P2SSD8 (1TB)
Crucial P2 CT2000P2SSD8 (2TB)

Your initial post is really well structured and detailed, but the one thing you did not mention, although it is one of the most important factors for I/O problems, is the exact model of the SSDs used.
Why? Did you already know that those consumer drives are really bad for workloads like, in your case, Ceph? [1]

It looks like those drives are meanwhile also being produced with QLC NAND, which would be even worse if that is what you got... (But even with TLC NAND they would not really be better!)

If you want appropriate performance, get reasonable enterprise SSDs with PLP.

[1] https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2020-09-hyper-converged-with-nvme.76516 (Especially the comparison table on page 3.)
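If you want to see the difference yourself, the usual quick check (roughly in the spirit of the linked benchmark) is a single-job 4k sync-write fio run directly against the device; destructive, so only on an empty or spare disk, and the device name is just an example:

Code:
# DESTRUCTIVE: overwrites the device, only run on an unused disk
fio --name=synctest --filename=/dev/nvme1n1 --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based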
 
@Neobin
Thanks for your hints... in the meantime we have also come across the fact that consumer NVMes are not the best idea for Ceph when it comes to performance; so far, so good...
BUT: is it really realistic that VM I/O performance (and note: only when writing small files) with NVMes as the backend feels even SLOWER than with 7,200 rpm SAS HDDs? Sure, we understand that consumer NVMes won't perform as well as enterprise ones... but even SLOWER than plain HDDs? Really?
We can't really believe this should be the case just because they are not enterprise NVMes... We assume we have another, more general, problem here.
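(To compare the two pools directly, without any VM or filesystem in between, a small-block rados bench against each pool might be telling; the pool names below are placeholders for our NVMe and HDD pools:)

Code:
# 60 seconds of 4k writes with 16 concurrent ops against each pool
rados bench -p nvme_pool 60 write -b 4096 -t 16
rados bench -p hdd_pool 60 write -b 4096 -t 16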
 
I should have commented on this earlier but wasn't really thinking about it then.

Those P2s are really slow for sustained writes once you exceed the SLC cache; I was getting ~80MB/s on the one in my gaming PC (although it's rarely an issue in that use case, as I'm rarely writing enough data in a short timeframe to trigger it).

If that machine is doing anything close to production then it needs higher quality SSDs.

I do have some really jankily built Proxmox deployments, including a single machine using Ceph and 3 low-power USFF nodes running it over gigabit and lowly i3s, and yes, I use consumer SSDs.

But the difference there is that the loss of "production" just means my home automation doesn't work or I may lose a couple of servers that may be used by myself and a few friends. Whilst having to actually use a light switch is annoying, it's not going to cause more than an inconvenience.

For actual production clusters where loss of service = revenue loss, it's worth shelling out for proper hardware. I will likely be building a hyperconverged Proxmox setup soon, and that one will be using enterprise-rated SSDs.
 
