Poor write performance on Ceph-backed virtual disks.

That balance time is just... I need to clear one server at a time, destroy its OSDs, reduce to three per node, then create new OSDs with properly sized DB/WAL.
Clearing a server takes about 20h. Backfill starts at about 400-500 MB/s and slows to a crawl over time.
All of this while the system is live and while I'm busy with the other regular support duties. o_O
Times 6, because we have 6 nodes. *insert party emote here*
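For my own later reference, the per-OSD dance looks roughly like the sketch below. Device paths, the OSD ID and the DB size are placeholders, not our exact values:
Code:
# take the OSD out and let Ceph drain it
ceph osd out 12
# check that all data has been moved off and it is safe to remove
ceph osd safe-to-destroy osd.12
# stop the OSD service, then destroy it and wipe its volumes
systemctl stop ceph-osd@12.service
pveceph osd destroy 12 --cleanup
# recreate it with a properly sized DB/WAL partition on the NVMe (size in GiB)
pveceph osd create /dev/sdX --db_dev /dev/nvme0n1 --db_size 300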

I'll keep you updated!
 
I currently have ~400GB of DB/WAL space per 16TB drive on our cluster. A single 2TB NVMe drive is servicing 4 x 16TB spinners in each server. It performed really well at first, but once the drives had been in service a while, performance suffered, especially in Windows guests.

I intend to get more 4TB SSDs to expand our SSD pool with, and to re-assign all of our 2TB SATA SSDs as DB/WAL drives, one for each 16TB spinner, in an attempt to improve the performance of the spinning pool.
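Before re-assigning the SSDs, I want to confirm whether the existing DB/WAL partitions are actually spilling over; as far as I understand it, a quick check is:
Code:
# BLUEFS_SPILLOVER shows up here if any OSD's DB has spilled onto its HDD
ceph health detail | grep -i spillover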
 
Granted, my setup is a bit weird, since I'm using 40Gb InfiniBand between the nodes, but even so I get iperf results of 11-13 Gbps.

What makes your setup weird? May be worth looking at your /etc/network/interfaces file.
Regardless of this, I am getting write speeds of 50-90 MB/s on both replicated and erasure-coded pool types with rados bench.
On what benchmark? Your fio arguments would be instructive. Also, is that from a member node or a guest VM?

I have 3 active nodes at the moment with two drives (1TB and 16TB) each. (I know this is suboptimal and it's really not helping, but I don't think it's my real issue.)
Laughs. You might want to reread the section on how OSDs work. With only 2 OSDs that aren't even matched, and that are HDDs no less, you're getting better performance than I would expect. BTW, how are you managing to have an EC pool with 3 nodes and 6 drives total? This isn't a very sane configuration.
Each set of HDDs has an SSD assigned to it as a 256GB WAL/DB disk, and this is where I think my problem is coming from.
I doubt this is making any difference in your benchmark. Recreate without a separate DB disk if you're curious.

This post (https://forum.proxmox.com/threads/osd-rebalance-at-1gb-s-over-10gb-s-network.136882/post-607900) covers some similarities with your config and may be of help to understand rebalance performance.
 
So, I did a little digging.
Ceph and its various services/tools have such a ton of features, settings, values and metrics that it is really hard to get into if you are not a specially trained professional, but just a "normal" IT guy managing a small to medium-sized cluster.

I found some more information about the slow_bytes metrics, and I wanted to summarize it a bit for later reference:

slow_total_bytes:
The total amount of space on your slow OSD device (the hard disk) that is available for BlueFS to store data and DB/WAL. This is space that CAN be used for DB/WAL, but a separate DB and/or WAL device takes precedence.

slow_used_bytes:
The amount of space on the slow device that is actually being used for DB/WAL. A non-zero value indicates spillover from your fast DB/WAL device due to insufficient sizing. This usually affects the performance of the OSD negatively.

db_total_bytes:
Total size of your primary DB device. This is where RocksDB and other OSD metadata live, as well as the WAL, if there is enough space available and no dedicated WAL device was chosen.

db_used_bytes:
This is the actual size of the RocksDB and OSD metadata. I guess you could monitor db_used_bytes/db_total_bytes and, in conjunction with slow_used_bytes ≠ 0, trigger a warning for DB/WAL spillover.

wal_total_bytes and wal_used_bytes:
Same as the DB metrics, just for the WAL specifically. Both will be zero if you don't use a dedicated WAL device.
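To actually read these counters, something like the following should work (run on the node hosting the OSD; osd.12 is just an example ID, and jq is assumed to be installed):
Code:
# dump the BlueFS counters of one OSD and pick out the interesting values
ceph daemon osd.12 perf dump | jq '.bluefs | {slow_used_bytes, db_used_bytes, db_total_bytes, wal_used_bytes}'
# rough DB fill ratio; together with slow_used_bytes > 0 this could drive a spillover alert
ceph daemon osd.12 perf dump | jq '.bluefs.db_used_bytes / .bluefs.db_total_bytes'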


So what I found is that, even though we "only" allocated 58GB (1.5%) of DB space for each of our 4TB HDD OSDs, this seems to be sufficient, as slow_used_bytes is zero on all of our HDD-based OSDs. db_used_bytes indicates that only about 5.4GB is used by the DB. This might be because the HDD pool is not very full, but it seems quite low to me.

So I guess sequential writes to the HDD pool really aren't bottlenecked then, and at 15-50 MB/s it simply IS just that slow altogether? :oops:
 
So I guess sequential writes to the HDD pool really aren't bottlenecked then, and at 15-50 MB/s it simply IS just that slow altogether?
Slow, yes - but bear in mind that the write performance expectation for an HDD pool is a MAXIMUM of a SINGLE DISK, and that is only achievable on large, aligned, sequential writes - which for the most part are not a normal use case AT ALL. A single HDD is capable of 100-150 MB/s, but that's only half the story. HDDs are TERRIBLE at concurrent IO, which makes their IOPS abysmal. A virtualization load pattern is much more sensitive to IOPS than to gross throughput. If I were you, I'd get rid of your DB NVMes and redeploy them as a separate pool for your VM boot disks.
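If you go that route, a device-class based CRUSH rule plus a dedicated pool is roughly all it takes. The pool name, PG count and the assumption that the NVMes carry the nvme device class are just for illustration:
Code:
# replicated rule that only selects OSDs with the nvme device class
ceph osd crush rule create-replicated replicated-nvme default host nvme
# dedicated RBD pool for VM boot disks using that rule
ceph osd pool create vm-boot 128 128 replicated replicated-nvme
ceph osd pool application enable vm-boot rbd
From there it's just a matter of pointing the VM boot disks' RBD storage at the new pool.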
 
Sadly, that's not possible. We need about 15-20TB of storage, so around 45-60TB raw. Much of that is very idle data. But IF some of it has to move, these speeds are really not great.
Our databases and other high-IO stuff like web pages, applications, etc. are already on SAS SSDs. Space there is limited, but performance is quite OK, though I had imagined better.
Hopefully we will soon get a new, all-NVMe cluster on a 25Gbit Ceph network. I hope this will be a big improvement even over SAS SSDs.
 
Sadly, that's not possible. We need about 15-20TB of storage
Here's the thing.

Not all data is equivalent, and it shouldn't be treated so. While I don't have any details about your particular use case, there is nothing stopping you from deploying a NAS with 20TB usable in a RAID6 configuration and parking all your "idle data" there, leaving your "production" storage footprint much smaller and easier (cheaper) to deploy via Ceph.
Our databases and other high-IO stuff like web pages, applications, etc. are already on SAS SSDs. Space there is limited, but performance is quite OK, though I had imagined better.
If you can (or would like to) describe the configuration, performance expectations, and measured performance, it may be possible to tune it better.
Hopefully we will soon get a new, all-NVMe cluster on a 25Gbit Ceph network. I hope this will be a big improvement even over SAS SSDs.
Not everything can be solved by throwing more hardware at it. As I hinted above, configuration and elimination of bottlenecks would likely yield results even without that (and possibly better ones).
 
To be honest, virtualization with Ceph on HDDs is a pain in the ass even with DB/WAL devices... I would never recommend doing this.
It's just not fun, and if your DB/WAL device fails, all OSDs associated with it will go down as well. That's one more thing to consider, and it's risky.
 
Here's the thing.

Not all data is equivalent, and it shouldn't be treated so. While I don't have any details about your particular use case, there is nothing stopping you from deploying a NAS with 20TB usable in a RAID6 configuration and parking all your "idle data" there, leaving your "production" storage footprint much smaller and easier (cheaper) to deploy via Ceph.
That's absolutely a valid strategy. Most of the data that's idle is SMB shares with files for daily work. So it's not totally idle, just not used very much, but still used. Lots of that is documents and invoices in our document management system and financial accounting.
There is not much need for big performance there, so Ceph on HDD is totally fine. But since this is very important archival data, we really like the resilience of Ceph with regard to disk or node failure. (Yes, there is an offline backup on tape ;))
With our old NAS we already had two disks fail at once on that RAID6 system (after some road construction took place and the whole building was trembling from time to time). Over the following months, one disk after another failed, 6 of 12 in total. That was very, very scary.

If you can (or would like to) describe the configuration, performance expectations, and measured performance, it may be possible to tune it better.

Not everything can be solved by throwing more hardware at it. As I hinted above, configuration and elimination of bottlenecks would likely yield results even without that (and possibly better ones).
Our server cluster was never really designed to be a hyperconverged setup. The cluster is about 7-8 years old, and the concept of hyperconverged, virtualized systems hadn't really reached us back then.
We have been planning a new cluster for a while now, and hopefully next year we will get to order the hardware.
We will have terminal servers for about 150 users, file storage, databases, a mail system, application servers, everything really.

I'm not quite sure how I can describe the configuration and performance expectations, because I'm not sure which metrics might be helpful.
We run a pool of 24 mixed SAS/SATA 7200rpm HDDs distributed across 4 of the 6 cluster nodes. The other nodes don't have space for a DB/WAL SSD.
Our SSD pool consists of 12x 1.6TB Kioxia PM5-V SAS 12Gb/s SSDs evenly distributed across all 6 nodes.
Nodes are connected via a dedicated 10G Ethernet network for Ceph, a dedicated 10G network facing the users, and a dedicated 1Gbit network for Corosync.

On the HDD pool we become bottlenecked when IO reaches around 1200-1500, sometimes 1800 IO/s. Read and write performance scales greatly with parallelism, which is to be expected. For running applications, databases, or fetching a small document, I guess access latency is the most important factor. But I don't really know how to measure and compare this.
Sure, I can do a test with fio, and I can see the OSD commit latency. But I don't know whether a latency of around 30-50ms is good or bad.
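If it helps, this is roughly the kind of latency-focused fio run I could do from inside a test VM (the file path and size are just placeholders); the clat percentiles it reports are what I would compare against the 30-50ms OSD commit latency:
Code:
fio --name=lat-test --filename=/mnt/test/fio.bin --size=2G \
    --ioengine=libaio --direct=1 --fsync=1 \
    --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
    --runtime=60 --time_based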

One of the biggest pains in this setup is the slow speed for backups. There are some locations with hundreds of thousands of files 4kB-1MB in size, and on the SMB shares there are even over 1 million files of various sizes, scattered all around.
These take a very long time to write to tape. We get speeds between 24-70 MB/s, whereas databases and other large blobs from the SSD pool come in at about 180 MB/s (I think that's the limit of the LTO-8 tape library). Blobs from the HDD pool come in at around 50-70 MB/s.

Thanks very much for taking a look at this. :D
 
To be honest, virtualization with Ceph on HDDs is a pain in the ass even with DB/WAL devices... I would never recommend doing this.
It's just not fun, and if your DB/WAL device fails, all OSDs associated with it will go down as well. That's one more thing to consider, and it's risky.
Yeah, true, that can happen. We run a 2/3 replica setup on a 6-node cluster, and we had one node go down with a failed DB/WAL disk. But frankly, that was really not much of a deal. We moved all affected VMs to another node. After 10 minutes we had all affected VMs running again; after 1 day of rebalancing, Ceph's health status was fine again, and when we replaced the failed SSD we added the missing OSDs back to the storage. We didn't even see much of a performance impact.
You really can't do that with a RAID system, can you? :D
 
With our old NAS we already had two disks fail at once on that RAID6 system (after some road construction took place and the whole building was trembling from time to time). Over the following months, one disk after another failed, 6 of 12 in total. That was very, very scary.
No doubt. It's worth considering that your Ceph hardware might not have fared better. Good on you for having actual backups in either case. What's more, you haven't been bitten by Ceph yet; trust me, it happens.

The reason a NAS is superior for your use case is that you'd get better single-thread performance than you would with Ceph, and probably better throughput too. As you alluded, this could change as a larger number of initiators comes into play, but considering your use description this isn't likely. And you ARE reporting a slower-than-desirable experience.

We run a pool of 24 mixed SAS/SATA 7200rpm HDDs distributed across 4 of the 6 cluster nodes. The other nodes don't have space for a DB/WAL SSD.
I have a 9-node cluster that's dedicated to a 6+2 CephFS on HDD (~1.6PB raw). I don't have any DB/WAL SSDs because they don't make enough difference to offset the major complexity they introduce, although I do use an SSD-replicated pool for FS metadata. It is NOT fast, and it's not meant to be. The reason it was deployed on Ceph in the first place is that it was already 500TB at deployment. If you don't have plans to scale your file system, a proper dual-headed filer might have been the better choice.

Our SSD pool consists of 12x 1.6TB Kioxia PM5-V SAS 12Gb/s SSDs evenly distributed across all 6 nodes.
Nodes are connected via a dedicated 10G Ethernet network for Ceph, a dedicated 10G network facing the users, and a dedicated 1Gbit network for Corosync.
A couple of points:
1. If you're only using one switch, you have a SPOF.
2. If you're using two switches, are they set up to LAG across each other? If so (or if they can be), you should do that instead and LAG your 10Gb interfaces together, then use VLANs to separate your Ceph private and public networks. I would advise NOT sharing the Ceph LAG with your user traffic; create a separate 1Gb LAG for that (and if you DO have users that actually benefit from a faster connection, or you have parallel congestion, add more links).
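As a rough illustration only (interface names, bond mode and addresses are placeholders, not a drop-in config), the /etc/network/interfaces side of that could look like:
Code:
auto bond0
iface bond0 inet manual
        bond-slaves enp65s0f0 enp65s0f1
        bond-mode 802.3ad
        bond-miimon 100
        bond-xmit-hash-policy layer3+4

# ceph public network on one vlan...
auto bond0.100
iface bond0.100 inet static
        address 192.168.100.11/24

# ...ceph cluster (private) network on another
auto bond0.101
iface bond0.101 inet static
        address 192.168.101.11/24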

which is to be expected. For running applications, databases, or fetching a small document, I guess access latency is the most important factor. But I don't really know how to measure and compare this.
Sure, I can do a test with fio, and I can see the OSD commit latency. But I don't know whether a latency of around 30-50ms is good or bad.
30-50ms is fine for HDD-backed pools. It is utter shit for SSD. Don't run your databases on an HDD-backed pool ;)
One of the biggest pains in this setup is the slow speed for backups. There are some locations with hundreds of thousands of files 4kB-1MB in size, and on the SMB shares there are even over 1 million files of various sizes, scattered all around.
That would pose a challenge regardless of your backing store. What do you use for backup? If your data is CephFS-backed, I would suggest a tar pipe or parsyncfp2 to have a hope of good results. parsyncfp2 can be distributed between multiple initiators, which would give you the best overall results, but there's no beating tar for efficiency. If your data is on RBD, use a mechanism that can stream RBD snapshots: PBS works, or you can home-grow something like https://github.com/Corsinvest/cv4pve-barc
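For the RBD case, the core of such a mechanism is just streaming snapshots and diffs; pool, image and snapshot names here are made up:
Code:
# full export of a consistent snapshot, streamed to stdout and compressed
rbd snap create vm_pool/vm-101-disk-0@backup1
rbd export vm_pool/vm-101-disk-0@backup1 - | zstd > vm-101-disk-0@backup1.img.zst
# later runs only need the delta between two snapshots
rbd snap create vm_pool/vm-101-disk-0@backup2
rbd export-diff --from-snap backup1 vm_pool/vm-101-disk-0@backup2 - | zstd > vm-101-disk-0@backup1-2.diff.zst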
 
No doubt. It's worth considering that your Ceph hardware might not have fared better. Good on you for having actual backups in either case. What's more, you haven't been bitten by Ceph yet; trust me, it happens.
While we had our fair share of problems with Ceph in the past, mostly due to inexperience, what would bite us? I mean in terms of storing our data reliably and running consistently.

The reason a NAS is superior for your use case is that you'd get better single-thread performance than you would with Ceph, and probably better throughput too. As you alluded, this could change as a larger number of initiators comes into play, but considering your use description this isn't likely. And you ARE reporting a slower-than-desirable experience.

I have a 9-node cluster that's dedicated to a 6+2 CephFS on HDD (~1.6PB raw). I don't have any DB/WAL SSDs because they don't make enough difference to offset the major complexity they introduce, although I do use an SSD-replicated pool for FS metadata. It is NOT fast, and it's not meant to be. The reason it was deployed on Ceph in the first place is that it was already 500TB at deployment. If you don't have plans to scale your file system, a proper dual-headed filer might have been the better choice.
That's a valid point. The good thing is, our new cluster isn't set in stone yet. It's probably a good idea to reassess the kind of storage needs we have. Although we are quite small, our systems seem quite complex. Maybe there is some potential for simplifying things.

A couple of points:
1. If you're only using one switch, you have a SPOF.
2. If you're using two switches, are they set up to LAG across each other? If so (or if they can be), you should do that instead and LAG your 10Gb interfaces together, then use VLANs to separate your Ceph private and public networks. I would advise NOT sharing the Ceph LAG with your user traffic; create a separate 1Gb LAG for that (and if you DO have users that actually benefit from a faster connection, or you have parallel congestion, add more links).
That's an interesting take on building the cluster network. We do use 2 switches, but for different networks, no LAG. Our servers have a dual-port Intel X520-DA2 and two 1G onboard Intel I350 NICs, one of which is used for Corosync and one of which is IPMI only.
The reason there is no LAG is that I didn't want user traffic on the same link as Ceph, so I use one 10G link for Ceph and one 10G link for user data. And because backups run on the user network, it wouldn't be great to run it on 1G links, I guess. Since this cluster is quite old, we never split Ceph networking into a private and a public Ceph network.
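If I understand it correctly, splitting would mostly mean something like this in ceph.conf (the subnets are made-up examples), plus putting the OSD replication traffic on its own link or VLAN:
Code:
[global]
    public_network  = 192.168.100.0/24
    cluster_network = 192.168.101.0/24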
We don't use CephFS. All storage is virtual disks attached to containers and VMs, so RBD only.
I'm beginning to think I should sit down and write a whole new storage concept for our new cluster. What we are running today is a growing heap of a mess.

30-50ms is fine for HDD-backed pools. It is utter shit for SSD. Don't run your databases on an HDD-backed pool ;)
HAHAHAHA no, I would never do that. :lol:

That would pose a challenge regardless of your backing store. What do you use for backup? If your data is CephFS-backed, I would suggest a tar pipe or parsyncfp2 to have a hope of good results. parsyncfp2 can be distributed between multiple initiators, which would give you the best overall results, but there's no beating tar for efficiency. If your data is on RBD, use a mechanism that can stream RBD snapshots: PBS works, or you can home-grow something like https://github.com/Corsinvest/cv4pve-barc
We use a system called Bareos on a dedicated server with a 10G link and an LTO-8 tape library connected via SAS. I'm not quite sure how Bareos writes to the tape, but I think it uses tar with a lot of things built around it.
All the files are on RBD volumes. We mount the storage via NFS on the backup server and pull the files from there. This is because sometimes we need to restore single files from the backup, e.g. if a user screwed up his Excel sheet or finds some document lost after a month or so.
This is really tricky, and we've been pondering how to do better for a long time.
I am considering PBS and will take a look at it in detail in the future. Maybe it can be a valid solution, or a first tier of backup (daily to a NAS, weekly to tape from the NAS, or something like that).
 
Why was this necessary?
I mixed something up. We had an incident where the OS disk of a server died. That was when the VMs were down and we got them back up on another node in 10 minutes.
When a DB/WAL SSD died, the VMs were not down, but we needed to shut down the server to replace the NVMe drive, so we live-migrated the VMs.
 
What makes your setup weird? May be worth looking at your /etc/network/interfaces file.
I'm using IPoIB and not Ethernet, and all of my hardware is kind of a mixed bag of eBay specials. The systems' boot disks are actually shucked PNY 128GB SSDs that I mounted to a custom 3D-printed bracket and zip-tied to the back of the case, because I didn't want to use one of the 3 drive bays for a boot disk and I only had one PCIe slot per server (Supermicro 4-node Fat Twin).

Apart from that, I don't think any of my other networking setup is super weird, except that my migration network is on IPoIB but is then bridged to Ethernet using a Voltaire VLT-4036E. That Ethernet connection is then bridged to a VPN, and that VPN is configured on two other nodes on the other side of the globe.

I have 7 nodes total with 2 in Tokyo and 5 in the US, all linked over a ZeroTier VPN.

On what benchmark? Your fio arguments would be instructive. Also, is that from a member node or a guest VM?
When I was testing the pool itself, I was using the rados bench command.
Code:
rados bench -p [Pool Name] 10 write --no-cleanup
rados bench -p [Pool Name] 10 seq
rados bench -p [Pool Name] 10 rand
rados -p [Pool Name] cleanup

When I was testing the OSDs, I was using the ceph tell command.
Code:
ceph tell osd.* bench -f plain
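I haven't run fio against the pool yet; if I do, it would probably be something along these lines with the rbd engine (it needs fio built with librbd support, and the pool and image names are placeholders - the image has to exist first):
Code:
rbd create test_pool/fio-test --size 10G
fio --name=rbd-write --ioengine=rbd --clientname=admin --pool=test_pool --rbdname=fio-test \
    --rw=write --bs=4M --iodepth=16 --direct=1 --runtime=60 --time_based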
Laughs. You might want to reread the section on how OSDs work. With only 2 OSDs that aren't even matched, and that are HDDs no less, you're getting better performance than I would expect. BTW, how are you managing to have an EC pool with 3 nodes and 6 drives total? This isn't a very sane configuration.
[attached screenshot: 1703166144545.png]

I doubt this is making any difference in your benchmark. Recreate without a separate DB disk if you're curious.
I had the pools set up without a DB/WAL disk before, and the disk latency was massive. It went down a lot after adding the SSDs, but I think now I'm starting to be punished for not allocating enough drive space. I'm curious whether the OSD benchmark would improve without the DB/WAL though... Maybe I will try reconfiguring things...
 
I'm using IPoIB and not Ethernet, and all of my hardware is kind of a mixed bag of eBay specials. The systems' boot disks are actually shucked PNY 128GB SSDs that I mounted to a custom 3D-printed bracket and zip-tied to the back of the case, because I didn't want to use one of the 3 drive bays for a boot disk and I only had one PCIe slot per server (Supermicro 4-node Fat Twin).
Nothing inherently problematic about the above. In some ways having Ceph over IPoIB is even preferable, since it eliminates the user tendency to want to attach a vmbr to Ceph interfaces ;) The hacky nature of your config is more problematic physically than logically; it sounds like a nightmare to administer.

I'm curious whether the OSD benchmark would improve without the DB/WAL though... Maybe I will try reconfiguring things...
Most likely not. Offloading the DB to a faster device only helps when you have a lot of requests; it doesn't help with actual IO. A single benchmark thread wouldn't stress a DB that's on the HDDs, much less one on SSD. If you want meaningful performance improvement, add a lot more OSDs, or SSDs, or both.
 
