Do I need to additionally configure disks for Ceph + Proxmox?

emptness

Greetings!
There are many manuals on the Internet about optimizing Ceph performance, in particular about tuning disk parameters in the OS.
Please tell me: are the default settings of Proxmox VE optimal in general terms, or do they really require optimization at this point in time?

For example, these parameters are recommended:
Code:
echo "deadline" > /sys/block/sd[x]/queue/scheduler   # for HDD
echo "noop" > /sys/block/sd[x]/queue/scheduler       # for SSD

I tried to change them, but the system does not allow it: the scheduler shows none, and there are no other options.
 
The schedulers you mentioned are what older kernels offered and are no longer available. More modern kernels offer multiqueue alternatives to these schedulers.

If you have SSDs and you see none as the scheduler or have HDDs and their scheduler is mq-deadline, you already have the desired configuration described in the recommendations you found.
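
You can check what is currently active for a given disk (sda here is just an example); the scheduler in square brackets is the one in use:
Code:
cat /sys/block/sda/queue/scheduler
[mq-deadline] none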

Of course, you can choose mq-deadline (the multiqueue replacement for deadline) or none (the multiqueue replacement for noop) for any disks/SSDs as you like. If you want to change this and keep the setting across reboots, you probably want to create a udev rule for it. Like:
Code:
vi /etc/udev/rules.d/60-scheduler.rules
(or whatever you like to call the file)
and add
Code:
# override scheduler for drives used by ceph
ACTION=="add|change", KERNEL=="sda", ATTR{queue/scheduler}="none"
ACTION=="add|change", KERNEL=="sdb", ATTR{queue/scheduler}="none"
 
Thank you so much for your reply!
Yes, I have SSDs. Their scheduler is currently set to [mq-deadline] none.
I would like to change the selection to none. How do I apply the setting online, without rebooting?
echo "none" > /sys/block/sd[x]/queue/scheduler ?
Do I understand correctly that it will be better for SSD?
 
echo "none" > /sys/block/sd[x]/queue/scheduler ?
Correct. You can do that on the fly to change the current scheduler.
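For example, to switch several disks at once (sda and sdb are placeholders for your actual Ceph disks; note this does not survive a reboot, hence the udev rule above):
Code:
for disk in sda sdb; do
    echo none > /sys/block/$disk/queue/scheduler
done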

Do I understand correctly that it will be better for SSD?
I actually doubt there will be any performance benefit with regards to I/O. However, you might save some CPU cycles because none does less work than mq-deadline.
It all obviously depends heavily on your specific hardware, but I do not think you will notice any difference, to be honest.
 
Understood.
There are many articles on the Internet about optimizing Ceph and increasing read/write speed. Many of them are very old.
The fact is that my Ceph speed tests show quite low numbers, although the servers are powerful and the HDD, SSD, and NVMe drives are enterprise class.
But the more I study this issue, the more I am convinced that the default settings in Proxmox are optimal.
Do you think this is really the case, or can you recommend some settings?
Most of all, I am confused by the speed of the HDD+NVMe pool: I do not see any increase in performance.
If you are ready to help me, I can provide more detailed characteristics and a description of the settings.
 
First recommendation: do benchmarks. Identify whether you lack IOPS or bandwidth, and whether some storage medium is the problem and, if so, which one; then you have a much better chance of addressing the specific causes.
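
For example, rados bench can give you a baseline for both bandwidth (large blocks) and IOPS (small blocks); <pool> is a placeholder for a test pool of yours:
Code:
# write test: 60 seconds, 4M objects, keep the data for the read tests
rados bench -p <pool> 60 write -b 4M -t 16 --no-cleanup
# sequential and random read tests against the data written above
rados bench -p <pool> 60 seq -t 16
rados bench -p <pool> 60 rand -t 16
# remove the benchmark objects afterwards
rados -p <pool> cleanup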

Also keep in mind that the limiting factor in ceph may not be the storage medium at all, but the network.
My setup, for instance, consists of 3 nodes with 8-10 OSDs each: one node with very fast NVMe SSDs, one with SAS SSDs and the last one with older SATA SSDs (all datacenter/enterprise class; they do deliver higher IOPS, and more consistently).
In my case, right now the 10G network I use is clearly the bottleneck as far as bandwidth is concerned. Each NVMe SSD has much higher bandwidth, and even a single SAS SSD can saturate roughly 75% of a 10G link.
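(For reference: 10 Gbit/s is roughly 1.25 GB/s, so a SAS SSD doing around 900 MB/s of sequential I/O already occupies about 75% of such a link.)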

The network may also be a source of latency issues. As for tuning, search this forum for information on how to disable auth (settings like auth_client_required) and debugging (such as debug_filestore, debug_auth, etc.) in Ceph, and take care to understand the implications before doing so. This may reduce traffic and save some round trips for auth, as well as (slightly) reduce CPU load.
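
A sketch of what such settings could look like in ceph.conf; treat this as an illustration only, and make sure you understand the security trade-off before disabling cephx:
Code:
[global]
    auth_client_required = none
    auth_cluster_required = none
    auth_service_required = none
    debug_auth = 0/0
    debug_filestore = 0/0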

It is also important to understand how replicas are distributed across nodes and OSDs. By default, each OSD gets a weight corresponding to its capacity, so you can easily end up with a pool that favors (because of the higher weight) the slowest but largest drives, etc.
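
You can inspect how weights and data are actually distributed, and adjust a CRUSH weight if needed (osd.3 and the weight 1.0 below are purely illustrative):
Code:
ceph osd df tree                      # capacity, weight and utilization per OSD
ceph osd crush reweight osd.3 1.0     # example: change the CRUSH weight of osd.3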
 
Thanks! I understand you.
I know about the settings you are writing about. But won't disabling
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
lead to a decrease in security?

Can you tell me what speeds your disk system tests show? For example, a RADOS test, so that I can compare.
At the moment, my network consists of 4× 1 Gbit ports in aggregation (that is, up to 4 Gbit at best). I understand perfectly well that this is not enough; I am waiting for 100 Gbit switches to arrive. But even so, tests on my cluster do not reach the maximum possible speed, and this confuses me. For example, the HDD pool speed is sometimes better than the SSD pool.
So I am wondering whether the new switches will fix this.
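
One thing worth checking while waiting for the new switches: with LACP-style aggregation a single TCP stream is typically limited to one link's speed, so it makes sense to benchmark the network itself with parallel streams, for example with iperf3 (<node-ip> is a placeholder):
Code:
# on one node
iperf3 -s
# on another node: single stream, then 4 parallel streams
iperf3 -c <node-ip>
iperf3 -c <node-ip> -P 4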
 
But won't disabling
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
lead to a decrease in security?
That all depends on who your ceph clients are. If you use ceph exclusively as a storage backend for VMs, then your cluster nodes will be the only clients and you can then determine if the network that traffic runs on actually needs the security.
 
I understand you. Thanks!
Will it be necessary to restart ceph on all nodes to apply the settings?
 