Poor write performance on Ceph-backed virtual disks

Ceph writes are slowed down by network latency, which does not improve with higher-speed cards.
Best options are: krbd on, write-back cache, and iothread=1, but I see others have already suggested them to you.
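For reference, here is a minimal sketch of where those settings live on a Proxmox host; the storage ID, pool name, and VM ID are made-up examples, not taken from this thread:
Code:
# /etc/pve/storage.cfg -- enable the kernel RBD client (krbd) for the pool
rbd: ceph-vm
        pool vm-storage
        content images
        krbd 1

# /etc/pve/qemu-server/101.conf -- per-disk write-back cache and a dedicated IO thread
scsihw: virtio-scsi-single
scsi0: ceph-vm:vm-101-disk-0,cache=writeback,iothread=1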
 
Just upgraded the production cluster to Octopus.
Did you check if your OSDs perform better with buffers enabled?
https://forum.proxmox.com/threads/d...ter-upgrade-to-ceph-octopus.81542/post-363772
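If that linked post is the usual bluefs_buffered_io discussion (an assumption on my part, since the URL is truncated here), a hedged way to check and flip that OSD option on a recent Ceph release would be:
Code:
# show what a running OSD currently uses (osd.0 is just an example)
ceph config show osd.0 bluefs_buffered_io
# enable buffered IO for all OSDs via the config database
ceph config set osd bluefs_buffered_io true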

Observing ~30MB/s instead of ~20MB/s
Octopus uses a write-around cache policy when librbd is used; krbd uses the page cache. More details in the link.
https://forum.proxmox.com/threads/ceph-rbd-cache-does-not-apply-to-vms.83051/post-365987
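For anyone who wants to experiment with the librbd cache behaviour, a rough sketch of the relevant client options (illustrative values; krbd ignores these and goes through the page cache instead):
Code:
# /etc/ceph/ceph.conf -- librbd client settings only
[client]
    rbd_cache = true
    rbd_cache_policy = writeback   # Octopus defaults to writearound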

Ceph writes are slowed down by network latency that does not improve with higher speeds cards.
@mgiammarco, I suppose the "not" is a typo?
25 GbE has lower latency than 10 GbE. 40 GbE is 4x 10 GbE, and 100 GbE is 4x 25 GbE.
 
Just a quick question, since we suffered some performance issues as well.
18x 16TB HDDs (3x Toshiba MG08ACA16TE per node); the big spinners here are connected to each node via LSI SAS 9300-8e HBAs in a 1-to-1 direct-attach (non-expander) enclosure.
There are 3x 16TB HDDs for each node, connected via an external SAS connection. I may have overlooked it, but where do DB and WAL live? Are DB and WAL located on the HDDs? If so, I will bet this is the cause.
I did some tests a while ago. Look here, at the end of my post:
https://forum.proxmox.com/threads/p...d-unable-to-get-device-info.58865/post-271831
Edit: The write performance here does remind me of the performance we had when we just put the WAL on SSD and left the DB living on the HDD.
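For anyone unsure where their DB/WAL actually lives, a hedged way to check (osd.0 is just an example; if no separate device was given at OSD creation time, everything sits on the HDD):
Code:
# on the OSD node: lists each OSD and any separate [db]/[wal] devices
ceph-volume lvm list
# or ask the cluster about a specific OSD
ceph osd metadata 0 | grep -E 'bluefs_dedicated_db|bluefs_dedicated_wal'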
 

Hi Ingo,

Our DB/WAL is directly on the spinners on both my home cluster and work cluster.

Write performance on my home cluster, with far less hardware power, seems to be better. Odd, eh?

I'm willing to try a dedicated WAL/DB disk. The servers have multiple M.2 slots on the motherboards and plenty of full-height expansion slots for quad-M.2 adapters on the back end.

So we plan to have up to 64TB of spinners per node (4x 16TB drives). Most of this storage will be virtual disks for VMs, mostly for large-file storage (big zip archives, video recordings, pcap data), so I believe we'd be alright with a 2TB SSD per node as the WAL/DB drive (~3%). Does that sound reasonable?

Looking at the Intel DC P4511, Samsung 983 DCT, and Micron 7300 PRO as contenders.

The Micron appears to have the best price and performance claims, but I can't tell if its PLP implementation is equal or not. Thoughts?

Thanks!
 
How many 10Gb links does each node have? Where are the fio/iperf/ceph bench tests? What is your Ceph setup, 3/2 etc.?
 
How many 10Gb links does each node have? Where are the fio/iperf/ceph bench tests? What is your Ceph setup, 3/2 etc.?
6 X 10Gb

1: Coro1
2: Coro2
3: CephP
4: CephC
5: Network Trunks
6: Unused

Write performance from Windows guests is limited to approximately the sync-write performance of the drives in the pool. Other guests do slightly better.

Ceph bench shows results similar to expected bare-drive performance (good), and rebalancing performance is within expected margins. The issue seems to be something within the guest driver stack forcing all writes to be issued as sync writes.
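One way to confirm that suspicion would be a guest-side fio comparison between buffered and forced-sync writes; a rough sketch, with file name and sizes as placeholders:
Code:
# buffered large writes (roughly what a plain file copy looks like)
fio --name=buffered --filename=fio-testfile --rw=write --bs=4M --size=4G --direct=0
# forced sync writes (what the guest seems to be doing for everything)
fio --name=syncwrite --filename=fio-testfile --rw=write --bs=4M --size=4G --direct=1 --sync=1
If the second run matches the slow numbers seen from Windows, that would point at the cache/flush behaviour in the guest stack rather than at Ceph itself.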

Going to try WAL/DB on NVMe and see if it fixes it. Compared to the overall cost of the cluster, it's probably the best next step.
 
We moved forward with the install of some NVME DB/WAL drives based on Ingo S's post.

We are using the Micron 7300 Pro M.2 2TB drives for this and have 437GB per 16TB drive assigned as DB/WAL space. The result is about 3% of total space being NVMe for DB/WAL.
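For reference, recreating an OSD with its DB on the NVMe looks roughly like this; a sketch only, device names are examples, and the old OSD has to be destroyed and zapped first (the Proxmox pveceph tooling or the GUI can do the equivalent):
Code:
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1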

BIG improvement!

Now seeing ~150MB/s (5x the performance) on big data transfers to our archive share backed by the spinning pool. So far we haven't had any transfer errors, but time will tell.

This is more in line with the performance I would expect from these drives, given the various overheads of software-defined storage and virtualized workloads.

For ~$2000 in parts I'm very happy with this upgrade relative to the total cost of the cluster and its impact on fixing this issue.
 
You are lucky! You get 50 MB/s over 1GbE... I get that over 10GbE!!!

2x switches in MLAG, LACP (layer 2+3) for public
2x switches in MLAG, LACP (layer 2+3) for cluster

Super performance: 36x 1-2TB consumer SSDs across 6 hosts.

All hosts have Zen 2 CPUs.

[attachment: guest disk benchmark screenshot]
 
Hi AngryAdm,

Random 4K reads at queue depth 1 will always be pretty slow. They are heavily impacted by network latency combined with drive access latency / service time.
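As a rough illustration with made-up numbers (not measured on this cluster), queue depth 1 means every 4K read pays the full round trip, so latency caps throughput directly:
Code:
# ~0.2 ms network round trip + ~0.1 ms drive/OSD service time ~= 0.3 ms per IO
# 1 / 0.0003 s ~= 3300 IOPS;  3300 x 4 KiB ~= 13 MiB/s at QD1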
 
o/

Yeah, but as soon as I run a backup at 200 MB/s, Exchange crawls to almost a stop :/
Weak for 36 SSDs in 6 hosts.
 
Good question... Curious if anyone else is still struggling with these issues.

I thought we had this problem whipped, but within a few months of adding the NVMe drives for WAL/DB, write speed from Windows guests to the spinning pool collapsed again. It's just terrible. It seems to be related to the driver stack provided for Windows guests under the VirtIO project.

I'm planning on expanding our SSD pool with more 4TB SSDs this year, which will leave us a handful of 2TB SATA SSDs. I'm going to try dedicating a 2TB SSD per 16TB spinner as the WAL/DB drive for each. I'm hopeful that this will provide enough buffer to maintain consistently reasonable write speeds.
 
I hate to necro a thread like this, but I also am curious what happened for everyone here.
When I set up my current cluster I thought I'd try out Ceph, and I've been impressed by a lot of its capabilities.
The performance, on the other hand, is downright abysmal, and I'm struggling to determine why.

Granted, my setup is a bit weird, since I'm using 40Gb InfiniBand between the nodes, but even so I get iperf results of 11-13 Gbps.
Regardless, I am getting write speeds of 50-90MB/s on both replicated and erasure-coded pools with rados bench.
Meanwhile, I get 350-430MB/s sequential read and 400-500MB/s random.
I've even tried enabling RDMA for Ceph, because I can get 5-6GB/s transfers when using qperf vs iperf. As an aside, this has sadly done absolutely nothing, so I'm not sure if I just did it wrong or if it is actually useless.
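(For anyone wanting to reproduce numbers like these, the figures above presumably come from rados bench; the usual invocation is roughly the following, with the pool name as a placeholder:)
Code:
rados bench -p testpool 60 write -b 4M -t 16 --no-cleanup
rados bench -p testpool 60 seq -t 16
rados bench -p testpool 60 rand -t 16
rados -p testpool cleanup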

The ping between all of my nodes is less than 1ms, and this awful performance shows up even when there is zero load on the pool.
I was thinking that maybe my write speed could somehow be related to the network the traffic is going over, but that logically makes no sense to me if my reads are almost 10 times faster.

I'm not really sure where to look, but when I run the rados write benchmark I am seeing maximum latencies of 4 with an average of 0.9.
I'm not really sure what this means; the output seems to indicate seconds (s), but it seems like I'd have a bigger issue if it were that bad.

Did anyone ever narrow down what is happening, or have a good diagnostic path to follow to find out? I'm just getting sick of fighting with Ceph to get it to not have the same speed as my 60GB laptop drive from 2003.
 
Hi Superfish,

The particular problems I have had with Ceph write performance are mostly related to Windows driver/cache stack issues. I get pretty good performance when running benchmarks directly on a pool.

Your problem sounds to me like the sync-write performance limitation of many SSDs. Can you describe your Ceph drive makeup? Which drive models? How many nodes? Those write speeds in a pool benchmark are what I would expect if the SSDs selected were of the non-enterprise variety without full PLP.
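A quick way to see whether a given SSD falls off a cliff on sync writes (the usual DB/WAL suitability check) is a small-block sync fio run against the raw device; a sketch only, /dev/sdX is a placeholder and this will overwrite data on it:
Code:
fio --name=synctest --filename=/dev/sdX --rw=write --bs=4k --iodepth=1 --numjobs=1 \
    --direct=1 --sync=1 --runtime=60 --time_based
Consumer SSDs without power-loss protection often collapse to a few hundred or a few thousand IOPS here, while PLP-equipped datacenter SSDs stay far higher.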
 
Actually, after banging my head against the keyboard for a few days, I think I might have finally gotten a lead on my issue.
I have 3 active nodes at the moment, with two drives (1TB/16TB) each. (I know this is suboptimal and it's really not helping, but I don't think it's my real issue.)
Each set of HDDs has an SSD assigned to it as a 256GB WAL/DB disk, and this is where I think my problem is coming in.

I started digging into how I could test the OSDs themselves and found that you can benchmark them using this command:
ceph tell osd.x bench
This led me to realize that my OSD itself was running at roughly 40-60MB/s. Obviously, this isn't really going to work out well, and my write speed is never going to be usable like that.

I then found a thread, here, that discusses checking individual OSD performance and looking for "slow_*_bytes" values under the "bluefs" field:
ceph daemon osd.x perf dump
Surprise, surprise, my drive was churning out slow bytes like its life depended on it.
I then finally read the documentation on how to size the DB and WAL space and realized that I would need about 1.3TB for the desired 32TB (2x 16TB) worth of drive space.
This means the DB and WAL data constantly overflows onto the HDDs themselves, and everything slows to a crawl.

Now I've got to go fix that and see if I find a new thing I didn't read the documentation for properly.
Marvelous thing reading the documentation is...
 
Oh man... this is great news.
I just checked ceph daemon osd.x perf dump and yeah, I guess we are using too small a WAL/DB as well.
The slow bytes part is this one here?
Code:
"bluefs": {
        "db_total_bytes": 62495121408,
        "db_used_bytes": 5754585088,
        "wal_total_bytes": 0,
        "wal_used_bytes": 0,
---->>  "slow_total_bytes": 4000220971008,
        "slow_used_bytes": 0,

Maaayyyybeee I should try looking into the manual too. Just maybe... :rolleyes:
 
The slow bytes part is this one here?
I can't claim any expertise there, but my understanding is that any "slow_*" values would indicate that something is bottlenecking. In this case it was related to not having enough space for the DB/WAL data, which should be about 4% of the size of the storage drives, from what I have been reading.
It seems like it can be 1% - 4%, but 4% is recommended.
https://docs.ceph.com/en/quincy/rados/configuration/bluestore-config-ref/#sizing

At the lowest end that would mean I would need about 328GB of space for 32TB, but I am leaning toward the 3% - 4% range probably being the safest bet without me needing to do extensive testing.
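Spelled out with illustrative numbers (the exact figures shift a bit depending on TB vs TiB):
Code:
# per-OSD DB/WAL target = OSD capacity x 1% .. 4%
# 16 TB OSD                      -> ~160 GB .. ~640 GB
# 2x 16 TB behind one DB device  -> ~320 GB .. ~1.3 TB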

Obviously, if you're building a significant number of storage nodes and are investing in drives, then you would probably want to test whether you can use less and get away with it. As for me, I think I'll probably just buy some 2TB drives for my servers because I'm tired of fighting with it and guessing. lol
 
Yeah, I can totally see that.
I just did some maths on our cluster, and yeah, we need to shrink down to 2-3 OSDs per server to get away with our 375GB DB/WAL SSDs. Luckily we can shrink it down that much.
I just started moving EVERYTHING around and reconfiguring the OSDs. This will take about 1-2 weeks.

ooof...
Still much to learn about Ceph.
 
Ouch, that balance time...
I would be very interested to know how it goes for you once it is complete. I set my cluster up over a year and a half ago and just kind of took for granted at the time that my performance was super bad because of the wonky hardware.
Now that I've finally gotten annoyed enough to dig into the problem, it looks much more like a PEBKAC error than a Ceph or hardware one.
 
