Bad Disk Performance with CEPH!

Jackmynet

Hi all, we deployed a Proxmox cluster with 3 nodes last year. It has been running relatively OK and we are getting quite familiar with the system.

The storage is Ceph shared storage with 12 OSDs across the 3 nodes (4 each). In the last week we deployed new software which runs an Informix database that performs periodic checkpoints depending on disk write speed. It is failing these checkpoints and causing major issues operationally. It is monitoring software, so we have operators working on the client software 24 hours per day. It is causing severe issues and crashes for them, so this is now quite urgent.

We have Kingston DC600M SSDs for the DB & WAL, and the rest of the storage is made up of Western Digital Gold data center (HDD) drives.

According to the company's support, they are seeing checkpoints take 60 seconds at times, when they should take a maximum of 10 seconds to complete, and they are telling us it is disk related due to write speeds.

Any idea how we could troubleshoot such an issue?
 
Hello @Jackmynet

It may depend on many factors, and the configuration is not clear from your description.

What is your network configuration?
- Do you have a dedicated network interface for Ceph?
- What is the speed of the network interface controller?

Similar issue: https://forum.proxmox.com/threads/ceph-very-slow-what-am-i-doing-wrong.148680/

Are you able to experiment? See https://yourcmc.ru/wiki/Ceph_performance

From the description, you are mixing SSD / HDD. What is the configuration in Ceph? Do you use different device classes based on speed?

https://pve.proxmox.com/pve-docs/chapter-pveceph.html#pve_ceph_device_classes

https://forum.proxmox.com/threads/ceph-classes-rules-and-pools.60013/
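If you do end up separating the classes, a minimal sketch of the usual commands (the rule and pool names below are only examples, not taken from your cluster):

    # CRUSH rules restricted to one device class
    ceph osd crush rule create-replicated replicated_ssd default host ssd
    ceph osd crush rule create-replicated replicated_hdd default host hdd

    # assign each pool to the matching rule
    ceph osd pool set vm-fast crush_rule replicated_ssd
    ceph osd pool set vm-bulk crush_rule replicated_hdd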

Update: for DBs with heavy workloads we use ZFS with HA / replication, as ZFS fits DB deployments better.
 
Thanks for the reply.

I am not very experienced with this, but we have 10Gb fibre connecting all 3 nodes, dedicated to Ceph. It does not go via a switch; it is a direct broadcast (full-mesh) network between the ports, which I would think is more than sufficient?

I am looking at the other threads you sent now to see what may be relevant to me.

Essentially it is a simple Ceph setup where we allocated partitions of the SSD drives for the DB & WAL of each OSD.
 
"I would think is more than sufficient?"

"According to the company's support, they are seeing checkpoints take 60 seconds at times, when they should take a maximum of 10 seconds to complete, and they are telling us it is disk related due to write speeds."

These things look contradictory.

I would advise putting aside the current design (at least logically) and beginning by defining exactly what you expect from your storage. If you want some assistance in designing a storage solution that meets the required minimums of your application, I (and others here) can make suggestions, but you have to define that first.
 
I have gone and verified that the Ceph network does in fact have 10Gb network speed.
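(For reference, a check along these lines between the nodes' Ceph addresses, using iperf3, is the kind of test I mean; the target IP is one node's address on our 10.11.250.0/24 Ceph subnet:)

    # on node 1
    iperf3 -s

    # on node 2
    iperf3 -c 10.11.250.1 -t 30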

The strange thing is that on one server the eno0 and eno2 interfaces are the 1Gb Ethernet ports used for internet access, while the fibre NICs are named ens3f0 and ens3f1, but I remember identifying this when setting it up, so it is configured with this in mind.

The requirement for this storage is high availability and redundancy. We do not need lightning-fast speeds, but it needs to be fast enough for the software to run these checkpoints successfully, which should not require anything extreme.

So, considering my cluster/Ceph network is 10Gb, what would be the next thing to look at, and how do I troubleshoot it?
 
Hi @Jackmynet

Can you please share / describe how you separated the drive classes in Ceph? This is not mentioned in your original post.

If you are running both drive classes (HDD + SSD) in one Ceph cluster, do you use replication rules per device class, or how do you separate them?
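If it is easier, the output of a few standard commands would also show the layout (these are generic Ceph/Proxmox commands, nothing specific to your setup):

    cat /etc/pve/ceph.conf
    ceph osd tree
    ceph osd crush rule dump
    ceph df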

Also, it would be good to know what exactly the "checkpoints" are.

Thank you

Lukas
 
Hi Lukas,

Is there a way I can export the config?

The Ceph configuration is 4 HDD OSDs on each node. Each HDD OSD is a partition of the drives on that machine. Each OSD has 80GB of SSD allocated as DB and 80GB allocated as WAL.

I have not done anything with classes etc. I have tested the bandwidth between each node and it shows 10Gb as expected.

I have just now changed the disk cache to writeback in the hope of slightly reducing the issue.

According to the supplier, the checkpoints are done every 15 minutes by the Informix database, and they are taking a long time due to write speeds on the disks.
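(For context, each OSD was created with something equivalent to the following; the device paths are placeholders, not our exact partition names:)

    # HDD partition as the data device, DB and WAL on partitions of the shared SSD
    ceph-volume lvm create --data /dev/sdX4 --block.db /dev/sdY1 --block.wal /dev/sdY2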
 
See attached. Some of what is in the configuration window is pasted into the notes file, as it wouldn't fit in the screenshot.

Let me know your thoughts.
 

Attachments

  • Screenshot 2025-02-16 204333.png
  • Screenshot 2025-02-16 204358.png
  • Screenshot 2025-02-16 204424.png
  • Screenshot 2025-02-16 203853.png
  • Screenshot 2025-02-16 203919.png
  • Screenshot 2025-02-16 204301.png
  • note proxmox 1.txt
  • # begin crush map.txt
Not a Ceph user here, but:
Ceph is RAID over the network, so writes are replicated in real time to the two other nodes (with the default replicas = 3).
HDDs are slow, with too few IOPS; running VM OS and DB from them is expected to be slow.
Adding nodes + adding many HDDs might help, but IMO full-flash OSDs are mandatory, even more so for a 3-node cluster; HDDs will be for backup :-/

edit: post a disk benchmark from within the VM, showing IOPS and bandwidth.
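Something like fio inside the VM would already give useful numbers (the file path and size are just examples):

    # 4k random write - the pattern a database checkpoint mostly cares about
    fio --name=randwrite --filename=/root/fio.test --size=4G --bs=4k --rw=randwrite \
        --ioengine=libaio --iodepth=32 --direct=1 --runtime=60 --time_based --group_reporting

    # sequential write bandwidth
    fio --name=seqwrite --filename=/root/fio.test --size=4G --bs=1M --rw=write \
        --ioengine=libaio --iodepth=8 --direct=1 --runtime=60 --time_based --group_reporting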
 
Thanks

I see two potential problems

- Only HDDs in Ceph; are you sure that is enough for the Informix database? You also mentioned using SSDs; can you please share a screenshot of Node -> Disks?
- You are running Ceph on the public network (explanation below)
- cluster_network = 10.11.250.1/24
- public_network = 10.11.250.1/24


Public Network

  • Purpose:
    • Used for client-to-Ceph cluster communication.
    • Proxmox nodes, VMs, and other clients access Ceph storage through this network.
    • Handles operations like reading/writing data, metadata communication, and cluster status checks.
  • Characteristics:
    • Typically runs on a high-speed network (10GbE preferred).
    • Should be highly available and reliable because all external I/O depends on it.

Cluster Network

  • Purpose:
    • Used for Ceph internal communication between OSDs, monitors, and managers.
    • Handles replication, recovery, heartbeats, and data backfilling.
    • This traffic is heavy, especially during data replication and balancing.
  • Characteristics:
    • Should be on a separate physical network from the public network to avoid congestion.
    • High bandwidth and low latency are crucial here (again, 10GbE recommended).
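In practice the separation is just two different subnets in /etc/pve/ceph.conf, for example (the second subnet below is made up for illustration; yours would be whatever range you put on the second link):

    [global]
        public_network  = 10.11.250.0/24   # clients / VM I/O
        cluster_network = 10.12.250.0/24   # OSD replication and recovery, on its own NICs

After changing this, the OSDs have to be restarted so they bind to the new network.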
 
See attached. I am using the SSDs as DB and WAL only in the Ceph config, based on advice I got from the forum.

So are you saying we should have a totally separate 10Gb network for the cluster and for Ceph?
 

Attachments

  • Screenshot 2025-02-16 215224.png
  • Screenshot 2025-02-16 215236.png
From the screenshots, you split each drive into several partitions (ZFS, Ceph, ...), but for Ceph you are using the one which is on a classic HDD (a partition).

Under Datacenter -> Storage I can see myntie-ceph, which consists of the HDD partitions (see the attached screenshots).

Please share the device details for osd.3 (double-click it).

An example from our lab is attached as well.
 

Attachments

  • 1739744367173.png
  • 1739744312898.png
  • 1739744219231.png
Yes, the majority of this Ceph storage is made up of HDD partitions as explained, but we added the SSDs for DB & WAL.

See attached as requested. OSD 3 is identical to all the others.
 

Attachments

  • Screenshot 2025-02-16 222729.png
  • Screenshot 2025-02-16 222958.png
  • Screenshot 2025-02-16 223005.png
Now I understand (apologies, I had something in my mind and was a little bit blocked).

The potential performance problems:

- The Ceph cluster is running on the same public network
- You are using ZFS and also Ceph on the same HDD device, which is really not a good idea and can be the main cause of the performance issues.
-- What is stored on the ZFS: 79 GB (OS?), 100 GB and 200 GB?
-- Recommendation for Ceph: one dedicated device per OSD

Regarding the WAL / DB, can you share the "Devices" details for OSD.0, OSD.4 and OSD.5 from node PVE?
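If the GUI is awkward for that, the same information can be read on the node with standard commands, for example:

    ceph osd metadata 0 | grep -i -E 'devices|db|wal'
    ceph-volume lvm list    # run on the node that hosts the OSD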
 
There is currently actually nothing running on the ZFS datasets I created; we used these for testing at the start.

Would it be best to simply delete these partitions? I still have OSDs that are only partitions of drives rather than whole disks.

See attached as requested
 

Attachments

  • Screenshot 2025-02-16 230613.png
  • Screenshot 2025-02-16 230605.png
  • Screenshot 2025-02-16 230559.png
  • Screenshot 2025-02-16 230549.png
  • Screenshot 2025-02-16 230541.png
  • Screenshot 2025-02-16 230534.png
  • Screenshot 2025-02-16 230521.png
  • Screenshot 2025-02-16 230512.png
  • Screenshot 2025-02-16 230500.png
to add:

WAL/DB on SSD is NOT a substitute for actually performant OSD devices. HDDs are piss poor for random IO, which is primarily what a virtualization workload demands; moreover, under normal circumstances you would end up bottlenecked by your relatively poor network infrastructure (a single 10Gb link for both public and private traffic; depending on what else is using this link there could be further consequences), but in this case your disks will be the bottleneck.
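A rough back-of-envelope illustrates it (assuming the typical ~100-200 random write IOPS per 7200 rpm spindle): 12 HDD OSDs x ~150 IOPS / 3 replicas ≈ ~600 sustained client write IOPS for the whole cluster, shared by every VM, before any Ceph overhead. A single decent datacenter SSD delivers tens of thousands.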
 
OK, I can see a third potential problem: you are using the same two SSDs (I missed the "+" in the screenshot to expand the list) for the DB and especially the WAL of all OSDs.

ZFS on the same device as Ceph should not be a problem if there are no R/W operations on it, but it is not standard.

Ceph running on the public network can also be a problem.

And the HDDs ... themselves.

Let's wait a few hours for other members' opinions :)
 
Yes, each node has just the two 480GB SSDs, which are used for DB and WAL for all the OSDs on that node only: 80GB of SSD for DB/WAL per OSD.
 
From my perspective

- Move Ceph from the public network to a dedicated cluster network: https://forum.proxmox.com/threads/ceph-changing-public-network.119116/
- Switch from HDD to SSD

- Consider using ZFS; I do not know what the maximum acceptable period of data loss is for your business. For large DBs with heavy R/W we use ZFS in RAID with replication to secondary servers.

The Ceph cluster needs a redesign, as there are bottlenecks: the network (public), the HDDs, and also one SSD for the WAL shared by all OSDs on the same bus, which is not ideal either.
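Whatever you change, a quick baseline before and after will show whether it helped, for example with rados bench (assuming the pool behind your myntie-ceph storage; adjust the pool name if it differs):

    rados bench -p myntie-ceph 60 write -b 4M -t 16 --no-cleanup
    rados bench -p myntie-ceph 60 seq -t 16
    rados -p myntie-ceph cleanup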