Bad Disk Performance with CEPH!

Jackmynet

Hi all, we deployed a Proxmox cluster with 3 nodes last year. It has been running relatively OK and we are getting quite familiar with the system.

The storage is Ceph shared storage with 12 OSDs across the 3 nodes (4 each). In the last week we deployed new software which runs an Informix database that performs periodic checkpoints depending on disk write speed. It is failing these checkpoints and causing major issues operationally. It is monitoring software, so we have operators working on the client software 24 hours per day. It is causing severe issues and crashes for them, so this is now quite urgent.

We have Kingston DC600M SSDs for the DB & WAL, and the rest of the storage is made up of Western Digital Gold data center (HDD) drives.

According to the company's support, they are seeing checkpoints take 60 seconds at times, when they should take a maximum of 10 seconds to complete, and they are telling us it is disk related due to write speeds.

Any idea how we could troubleshoot such an issue?
 
Hello @Jackmynet

It may depend on many factors, and the configuration is not clear from your description.

What is your network configuration?
- Do you have a dedicated network interface for Ceph?
- What is the speed of the network interface controller?

Similar issue: https://forum.proxmox.com/threads/ceph-very-slow-what-am-i-doing-wrong.148680/

Are you able to experiment? See https://yourcmc.ru/wiki/Ceph_performance

From the description, you are mixing SSD / HDD. What is the configuration in Ceph? Do you use different device classes based on speed?

https://pve.proxmox.com/pve-docs/chapter-pveceph.html#pve_ceph_device_classes

https://forum.proxmox.com/threads/ceph-classes-rules-and-pools.60013/
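If you do end up separating the classes, a minimal sketch of the usual commands (the rule and pool names below are only examples, not taken from your cluster):

    # CRUSH rules restricted to one device class
    ceph osd crush rule create-replicated replicated_ssd default host ssd
    ceph osd crush rule create-replicated replicated_hdd default host hdd

    # assign each pool to the matching rule
    ceph osd pool set vm-fast crush_rule replicated_ssd
    ceph osd pool set vm-bulk crush_rule replicated_hdd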

Update: for DBs with heavy workloads we use ZFS with HA / replication, as ZFS fits DB deployments better.
 
Thanks for the reply.

I am not very experienced with this, but we have 10Gb fibre connecting all 3 nodes, dedicated to Ceph. It does not go via a switch; it is a direct broadcast (full-mesh) network between the ports, which I would think is more than sufficient?

I am looking at the other threads you sent now to see what may be relevant to me.

Essentially it is a simple Ceph setup where we allocated partitions of the SSD drives for the DB & WAL of each OSD.
 
"I would think is more than sufficient?"

"According to the company's support, they are seeing checkpoints take 60 seconds at times, when they should take a maximum of 10 seconds to complete, and they are telling us it is disk related due to write speeds."

These things look contradictory.

I would advise putting aside the current design (at least logically) and beginning by defining exactly what you expect from your storage. If you want some assistance in designing a storage solution that meets the required minimums of your application, I (and others here) can make suggestions, but you have to define that first.
 
I have gone and verified that the Ceph network does in fact have 10Gb network speed.
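(For reference, a check along these lines between the nodes' Ceph addresses, using iperf3, is the kind of test I mean; the target IP is one node's address on our 10.11.250.0/24 Ceph subnet:)

    # on node 1
    iperf3 -s

    # on node 2
    iperf3 -c 10.11.250.1 -t 30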

The strange thing is that on one server the eno0 and eno2 interfaces are the 1Gb Ethernet ports used for internet access, while the fibre NICs are named ens3f0 and ens3f1, but I remember identifying this when setting it up, so it is configured with this in mind.

The requirement for this storage is high availability and redundancy. We do not need lightning-fast speeds, but it needs to be fast enough for the software to run these checkpoints successfully, which should not require anything extreme.

So, considering my cluster/Ceph network is 10Gb, what would be the next thing to look at, and how do I troubleshoot it?
 
Hi @Jackmynet

Can you please share / describe how you separated the drive classes in Ceph? This is not mentioned in your original post.

If you are running both drive classes (HDD + SSD) in one Ceph cluster, do you use replication rules per device class, or how do you separate them?
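If it is easier, the output of a few standard commands would also show the layout (these are generic Ceph/Proxmox commands, nothing specific to your setup):

    cat /etc/pve/ceph.conf
    ceph osd tree
    ceph osd crush rule dump
    ceph df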

Also, it would be good to know what exactly the "checkpoints" are.

Thank you

Lukas
 
Hi Lukas,

Is there a way I can export the config?

The Ceph configuration is 4 HDD OSDs on each node. Each HDD OSD is a partition of the drives on that machine. Each OSD has 80GB of SSD allocated as DB and 80GB allocated as WAL.

I have not done anything with classes etc. I have tested the bandwidth between each node and it shows 10Gb as expected.

I have just now changed the disk cache to writeback in the hope of slightly reducing the issue.

According to the supplier, the checkpoints are done every 15 minutes by the Informix database, and they are taking a long time due to write speeds on the disks.
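(For context, each OSD was created with something equivalent to the following; the device paths are placeholders, not our exact partition names:)

    # HDD partition as the data device, DB and WAL on partitions of the shared SSD
    ceph-volume lvm create --data /dev/sdX4 --block.db /dev/sdY1 --block.wal /dev/sdY2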
 
See attached. Some of what is in the configuration window is pasted into the notes file, as it wouldn't fit in the screenshot.

Let me know your thoughts.
 

Attachments

  • Screenshot 2025-02-16 204333.png
  • Screenshot 2025-02-16 204358.png
  • Screenshot 2025-02-16 204424.png
  • Screenshot 2025-02-16 203853.png
  • Screenshot 2025-02-16 203919.png
  • Screenshot 2025-02-16 204301.png
  • note proxmox 1.txt
  • # begin crush map.txt
Not a Ceph user here, but:
Ceph is RAID over the network, so writes are replicated in real time to the two other nodes (with the default replicas = 3).
HDDs are slow, with too few IOPS; running VM OS and DB from them is expected to be slow.
Adding nodes + adding many HDDs might help, but IMO full-flash OSDs are mandatory, even more so for a 3-node cluster; HDDs will be for backup :-/

edit: post a disk benchmark from within the VM, showing IOPS and bandwidth.
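Something like fio inside the VM would already give useful numbers (the file path and size are just examples):

    # 4k random write - the pattern a database checkpoint mostly cares about
    fio --name=randwrite --filename=/root/fio.test --size=4G --bs=4k --rw=randwrite \
        --ioengine=libaio --iodepth=32 --direct=1 --runtime=60 --time_based --group_reporting

    # sequential write bandwidth
    fio --name=seqwrite --filename=/root/fio.test --size=4G --bs=1M --rw=write \
        --ioengine=libaio --iodepth=8 --direct=1 --runtime=60 --time_based --group_reporting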
 
Thanks

I see two potential problems

- Only HDDs in Ceph; are you sure that is enough for the Informix database? You also mentioned using SSDs; can you please share a screenshot of Node -> Disks?
- You are running Ceph on the public network (explanation below)
- cluster_network = 10.11.250.1/24
- public_network = 10.11.250.1/24


Public Network

  • Purpose:
    • Used for client-to-Ceph cluster communication.
    • Proxmox nodes, VMs, and other clients access Ceph storage through this network.
    • Handles operations like reading/writing data, metadata communication, and cluster status checks.
  • Characteristics:
    • Typically runs on a high-speed network (10GbE preferred).
    • Should be highly available and reliable because all external I/O depends on it.

Cluster Network

  • Purpose:
    • Used for Ceph internal communication between OSDs, monitors, and managers.
    • Handles replication, recovery, heartbeats, and data backfilling.
    • This traffic is heavy, especially during data replication and balancing.
  • Characteristics:
    • Should be on a separate physical network from the public network to avoid congestion.
    • High bandwidth and low latency are crucial here (again, 10GbE recommended).
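In practice the separation is just two different subnets in /etc/pve/ceph.conf, for example (the second subnet below is made up for illustration; yours would be whatever range you put on the second link):

    [global]
        public_network  = 10.11.250.0/24   # clients / VM I/O
        cluster_network = 10.12.250.0/24   # OSD replication and recovery, on its own NICs

After changing this, the OSDs have to be restarted so they bind to the new network.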
 
See attached. I am using the SSDs as DB and WAL only in the Ceph config, based on advice I got from the forum.

So are you saying we should have a totally separate 10Gb network for the cluster and for Ceph?
 

Attachments

  • Screenshot 2025-02-16 215224.png
  • Screenshot 2025-02-16 215236.png
From the screenshots, you split each drive into several partitions (ZFS, Ceph, ...), but for Ceph you are using the one which is on a classic HDD (a partition).

Under Datacenter -> Storage I can see myntie-ceph, which consists of the HDD partitions (see the attached screenshots).

Please share the device details for osd.3 (double-click it).

An example from our lab is attached as well.
 

Attachments

  • 1739744367173.png
  • 1739744312898.png
  • 1739744219231.png
Yes, the majority of this Ceph storage is made up of HDD partitions as explained, but we added the SSDs for DB & WAL.

See attached as requested. OSD 3 is identical to all the others.
 

Attachments

  • Screenshot 2025-02-16 222729.png
  • Screenshot 2025-02-16 222958.png
  • Screenshot 2025-02-16 223005.png
Now I understand (apologies, I had something in my mind and was a little bit blocked).

The potential performance problems:

- The Ceph cluster is running on the same public network
- You are using ZFS and also Ceph on the same HDD device, which is really not a good idea and can be the main cause of the performance issues.
-- What is stored on the ZFS: 79 GB (OS?), 100 GB and 200 GB?
-- Recommendation for Ceph: one dedicated device per OSD

Regarding the WAL / DB, can you share the "Devices" details for OSD.0, OSD.4 and OSD.5 from node PVE?
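If the GUI is awkward for that, the same information can be read on the node with standard commands, for example:

    ceph osd metadata 0 | grep -i -E 'devices|db|wal'
    ceph-volume lvm list    # run on the node that hosts the OSD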
 
There is currently actually nothing running on the ZFS datasets I created; we used these for testing at the start.

Would it be best to simply delete these partitions? I still have OSDs that are only partitions of drives rather than whole disks.

See attached as requested
 

Attachments

  • Screenshot 2025-02-16 230613.png
  • Screenshot 2025-02-16 230605.png
  • Screenshot 2025-02-16 230559.png
  • Screenshot 2025-02-16 230549.png
  • Screenshot 2025-02-16 230541.png
  • Screenshot 2025-02-16 230534.png
  • Screenshot 2025-02-16 230521.png
  • Screenshot 2025-02-16 230512.png
  • Screenshot 2025-02-16 230500.png
to add:

WAL/DB on SSD is NOT a substitute for actually performant OSD devices. HDDs are piss poor for random IO, which is primarily what a virtualization workload demands; moreover, under normal circumstances you would end up bottlenecked by your relatively poor network infrastructure (a single 10Gb link for both public and private traffic; depending on what else is using this link there could be further consequences), but in this case your disks will be the bottleneck.
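A rough back-of-envelope illustrates it (assuming the typical ~100-200 random write IOPS per 7200 rpm spindle): 12 HDD OSDs x ~150 IOPS / 3 replicas ≈ ~600 sustained client write IOPS for the whole cluster, shared by every VM, before any Ceph overhead. A single decent datacenter SSD delivers tens of thousands.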
 
OK, I can see a third potential problem: you are using the same two SSDs (I missed the "+" in the screenshot to expand the list) for the DB and especially the WAL of all OSDs.

ZFS on the same device as Ceph should not be a problem if there are no R/W operations on it, but it is not standard.

Ceph running on the public network can also be a problem.

And the HDDs ... themselves.

Let's wait a few hours for other members' opinions :)
 
Yes, each node has just the two 480GB SSDs, which are used for DB and WAL for all the OSDs on that node only: 80GB of SSD for DB/WAL per OSD.
 
From my perspective

- Move Ceph from the public network to a dedicated cluster network: https://forum.proxmox.com/threads/ceph-changing-public-network.119116/
- Switch from HDD to SSD

- Consider using ZFS; I do not know what the maximum acceptable period of data loss is for your business. For large DBs with heavy R/W we use ZFS in RAID with replication to secondary servers.

The Ceph cluster needs a redesign, as there are bottlenecks: the network (public), the HDDs, and also one SSD for the WAL shared by all OSDs on the same bus, which is not ideal either.
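Whatever you change, a quick baseline before and after will show whether it helped, for example with rados bench (assuming the pool behind your myntie-ceph storage; adjust the pool name if it differs):

    rados bench -p myntie-ceph 60 write -b 4M -t 16 --no-cleanup
    rados bench -p myntie-ceph 60 seq -t 16
    rados -p myntie-ceph cleanup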