Bad Disk Performance with CEPH!

Hi all, we deployed a Proxmox cluster with 3 nodes last year. It has been running relatively well and we are getting quite familiar with the system.

The storage is Ceph shared storage with 12 OSDs across the 3 nodes (4 each). In the last week we deployed new software which runs an Informix database, and that database performs checkpoints whose duration depends on disk write speed. It is failing these checkpoints and causing major issues operationally. It is monitoring software, so we have operators working on the client software 24 hours per day. It is causing severe issues and crashes for them, so this is now quite urgent.

We have Kingston DC600M SSDs for DB & WAL, and the rest of the storage is made up of Western Digital Gold data-center drives.

According to the vendor's support, they are seeing checkpoints take 60 seconds at times that should take a maximum of 10 seconds to complete, and they are telling us it is disk related due to write speeds.

Any idea how we could troubleshoot such an issue?
Hello.

From my experience, a monitoring software database is the worst-case scenario for I/O. Even more so in your case: small writes (thus read-modify-write), synchronous random writes, and synchronous network replication to the other nodes!
Maybe you are experiencing I/O starvation across the whole platform? Did you look at the maximum I/O wait on the servers during checkpoints?
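
For example (a rough sketch; the sampling interval is arbitrary and device names will differ on your nodes), watching per-device stats while a checkpoint runs:

Code:
# Requires the sysstat package; sample extended device stats every 2 seconds
iostat -x 2
# Watch %iowait overall, and per device %util and w_await (write latency, ms)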

Did you do any speed testing of your disk devices, raw and over Ceph, when the platform was installed? Maybe your storage platform cannot handle the load?
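
If not, a rough sketch (the fio run is destructive on a raw device, so point it at a spare disk or a test file; the device path and pool name are placeholders):

Code:
# Raw 4k synchronous random writes, similar to a database checkpoint pattern
fio --name=synctest --filename=/dev/sdX --rw=randwrite --bs=4k \
    --ioengine=libaio --iodepth=1 --direct=1 --sync=1 \
    --runtime=30 --time_based --group_reporting

# Roughly the same load through Ceph, against a throwaway test pool
rados bench -p testpool 10 write -b 4096 -t 1
rados -p testpool cleanup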

A workaround could be to enable RAM caching with host writeback, but you will lose data in case of a power outage. Be sure you understand what you are doing; read: https://pve.proxmox.com/wiki/Performance_Tweaks
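
As a sketch only (the VM ID and disk spec are placeholders; you re-specify the disk with its new cache option):

Code:
# Switch an existing VM disk to writeback caching; data sitting in the
# host page cache is lost on power failure, hence the warning above
qm set 100 --scsi0 local-lvm:vm-100-disk-0,cache=writeback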

And just a note: if these servers run production, you should get a Proxmox Virtual Environment subscription for support :-)

BR
Gautier
 
My understanding was that Ceph was on the cluster network?

We are a monitoring company, so we receive thousands of signals each hour, and the downtime from a failover incident on ZFS HA would be a major issue for us. Maybe I should consider this to resolve the current issue before building an SSD-only Ceph pool.

How can I confirm and change the Ceph public/cluster network? My understanding is that they currently share the same NICs?

Hi @Jackmynet

Proxmox HA

- There will be downtime in all cases: Proxmox HA needs to identify that the node is down, perform fencing, and start the VM on a different node. Whether the storage is ZFS or Ceph does not change that; it only affects whether any data is "lost" in the failover.

On switching the Ceph public network to the cluster network, there are many threads on the same topic in the forum:

https://forum.proxmox.com/threads/ceph-public-private-and-what-goes-over-the-network.147567/
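
To check what is currently configured (standard paths on PVE; recent Ceph versions may also hold these options in the config database):

Code:
# The cluster-wide Ceph config on any Proxmox node
grep -E 'public_network|cluster_network' /etc/pve/ceph.conf

# Or ask the running cluster directly
ceph config get mon public_network
ceph config get mon cluster_network

# Changing them means editing /etc/pve/ceph.conf and then restarting the
# Ceph daemons node by node, e.g.: systemctl restart ceph-osd.target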

At this point, since your knowledge is not yet at an advanced level, I really suggest the official Proxmox training (Basic + Advanced) or professional Proxmox support.
 
As others here have said, you are effectively using one spinning disk as your entire database backend. What are you expecting? A spinning disk does 10-100 IOPS, and an OS alone eats 5-20 IOPS just 'being idle' (logging, authentication, etc.) in my environment.

It will be slow; Informix isn't exactly a performance beast either, hence why IBM is ending the product line. Using your SSD for WAL/DB only helps the underlying Ceph backend (BlueStore metadata and journaling); the data is still on the spinning disk. Using ZFS won't help either, and you'll also have to deal with consistency issues if you have a node or pool failure.
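
To see what the individual disks behind Ceph can actually do, the built-in OSD bench is a quick check (osd.0 is just an example; compare a spinner against an SSD-backed OSD):

Code:
# Writes 1 GiB in 4 MiB blocks directly on the OSD, bypassing VMs and RBD
ceph tell osd.0 bench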
 
OK, yes, the design was based on VMware, which we changed at the last minute, and it seems Proxmox is still a long way behind in this regard when building shared storage.

I think my best plan of action is to simply create a new Ceph pool with SSD disks for the database to run on, so that it has access to the speed it needs. What do you think?

Do you recommend a specific SSD for this? We do not have endless money to throw at this, so something with solid performance but not super expensive. This storage doesn't need to be lightning fast :)
 
OK, yes, the design was based on VMware, which we changed at the last minute, and it seems Proxmox is still a long way behind in this regard when building shared storage.
If your intention had been to use vSAN with the same drives you deployed, your experience would be similar. Don't conflate your choice of hardware with the relative "features" of the software; neither of them can make a turtle outpace a hare.

I think my best plan of action is to simply create a new Ceph pool with SSD disks for the database to run on, so that it has access to the speed it needs. What do you think?
That seems like a much better idea. You can operate two separate pools on two separate disk classes in the same cluster; a sketch follows below. Aside from admonishing you to pick high-write-endurance devices, it would be to your benefit to define acceptance criteria for what you want the storage to handle in IOPS. You should probably also rethink your network to accommodate the much faster devices.
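
For reference, the device-class mechanics could look something like this (the rule and pool names are made up, and 128 PGs is only an example, not a sizing recommendation):

Code:
# Confirm the new drives were detected with device class "ssd"
ceph osd tree

# CRUSH rule that only places data on class-ssd OSDs, host failure domain
ceph osd crush rule create-replicated replicated_ssd default host ssd

# New pool bound to that rule; the existing pool keeps the spinners
ceph osd pool create ssd-pool 128 128 replicated replicated_ssd
ceph osd pool application enable ssd-pool rbd

# Or via the Proxmox tooling
pveceph pool create ssd-pool --crush_rule replicated_ssd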

We do not have endless money to throw at this
This is, with respect, completely the wrong approach. The price of the components should never be considered in a vacuum. At what points do you cause the enterprise financial harm? Have you actually done the due diligence to figure out the absolute floor (performance, downtime, etc.) below which that harm occurs? If you really can't afford to do it "right", it may be a better idea to engage a service you can pay monthly that will; then it's just a matter of a recurring service charge. It will cost more in the long run, but you will likely get the result you desire.
 
Do you have any recommendations for SSD drives for Ceph?

We are bound by compliance and are a company that would not exist if we outsourced these builds, and doing them ourselves has served us very well up to now. We are not going to throw endless money at it, as we simply cannot and do not need to.

The intention would be to add a dual 25Gb NIC to each server to dedicate to Ceph and leave the 10Gb for the cluster, add an SSD Ceph pool, and move the database storage over to it for the speed it needs.
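
Once that pool is added as RBD storage, moving the disk could be as simple as this (the VM ID, disk, and storage names are placeholders):

Code:
# Live-move the database disk onto the SSD-backed storage
qm move-disk 101 scsi1 ssd-pool
# Add --delete 1 to drop the old copy once you have verified the move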
 
I have read good things about the Samsung SSDs and also the Kingston ones. Would you recommend mixing these, or using all one type of drive in the Ceph pool?

Also, am I best allocating whole drives as OSDs? It would appear so based on what I am reading, but I'm unsure.
 
I have read good things about the Samsung SSDs and also the Kingston ones.
Generally speaking, manufacturer brands are preferable to repackagers (vendors that don't make any of the modules or components, e.g. Kingston).

Samsung, Toshiba, Solidigm, and Kioxia all make suitable parts. Again, stick to write-optimized enterprise models so you don't end up killing your drives in short order.
Also, am I best allocating whole drives as OSDs?
Yes. As a rule, don't ever share devices meant for aggregation (e.g. ZFS, RAID, Ceph, etc.). If you really want to understand why, consider that you're not ACTUALLY dividing the device, and think about what happens when you introduce contention into its performance and lifecycle expectations.
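
For completeness, a sketch of creating whole-device OSDs on PVE (device paths are examples only):

Code:
# The whole SSD becomes one OSD; Ceph takes over the entire device
pveceph osd create /dev/sdd

# The existing spinner OSDs with SSD-hosted DB/WAL would have been created as:
# pveceph osd create /dev/sdb --db_dev /dev/sdc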
 
Thanks for your advice.

I am thinking I will go with all the same SSD model to avoid the risk of any difference in performance; however, I am concerned that this is potentially bad practice, in case a bad production batch causes simultaneous problems across drives.

Am I being paranoid, or should I keep them the same?
 
Attached are some readings from the disks; they look fairly horrific :)
 

Attachments: Screenshot 2025-02-18 105853.png, Screenshot 2025-02-18 105906.png (disk performance readings)
Thanks for your advice.

I am thinking I will go with all the same SSD model to avoid the risk of any difference in performance; however, I am concerned that this is potentially bad practice, in case a bad production batch causes simultaneous problems across drives.

Am I being paranoid, or should I keep them the same?
We mitigate this by buying the SSDs from multiple suppliers (same model), to try to avoid getting the same production batch.
 
So I have narrowed down with our supplier that it is really only the Informix database that needs fast storage.

The plan would be to create a Ceph pool with SSD disks. If I put 3 × 960 GB SSDs into each node, and each one is an OSD on its own, how much SSD storage will I have on each node? (9 × 960 GB OSDs in total across the cluster, 3 per node.)

We need 1 TB of storage for that database, as there is a primary and a secondary, each with a 500 GB disk. They run on separate nodes; however, in a failure event they could potentially fail over to the same node due to HA.

I have tried calculating this with the Ceph storage calculator but am not fully understanding it.

Thanks
 
VMware has no magic: if you put a single spinning disk per server, it will be slow, and if it appears faster than a single disk is capable of, it is lying to you (caching in RAM), which is fine until it isn't (server crashes, power outages, etc.). That is why, in poorly constructed systems *cough*IBM storage*cough*, VMware outages can take days to recover when the power goes out and a single RAID module doesn't come back (been there). That is also why I don't like closed systems: the vendor often lies to sell you the system, and it's only in year 4, when things go south, that you realize your mistake.

That being said, Ceph is really easy to calculate: add all the OSDs in the pool together (e.g. each 960 GB disk in every server) and then multiply by the yield of your Ceph configuration. With size=3, min_size=2, in an optimal state (with 3 nodes) you get 960 GB × n(OSDs) / 3. So if you have 3 nodes with 3 disks each, that is ~2.8 TB usable. The min_size is how many copies of the data must remain available in order to keep serving requests.

It becomes a bit more complex when you add a 4th and 5th node: ideally the same calculation holds; however, when a node fails, its blocks are re-created on the other nodes to restore size=3, meaning that a node going down effectively shrinks your available storage.
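
Applying that to the numbers above (9 × 960 GB OSDs, size=3), a quick sanity check:

Code:
# raw capacity / replication size = usable capacity, before overhead
echo $(( 9 * 960 / 3 ))   # -> 2880 GB, roughly 2.8 TB usable
# Keep usage well below ~80% so a failed node's data can be re-replicated
ceph df    # shows the real numbers once the pool is deployed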
 