Do I need to additionally configure disks for ceph+Proxmox?

emptness

Member
Aug 19, 2022
Greetings!
There are many guides on the Internet about optimizing Ceph, in particular about tuning disk parameters in the OS.
Please tell me: are the default settings of Proxmox VE already optimal in general terms, or is additional tuning really required these days?

For example, these settings are recommended:
Code:
echo "deadline" > /sys/block/sd[x]/queue/scheduler   # for HDD
echo "noop" > /sys/block/sd[x]/queue/scheduler       # for SSD

I tried to change them, but the system does not accept these values. The scheduler shows none, and the old options are not available.
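For reference, this is how I check which schedulers the kernel offers for a disk (the bracketed entry is the active one; sda is just an example device name):
Code:
cat /sys/block/sda/queue/scheduler
# prints the available schedulers, e.g.: [mq-deadline] none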
 
The schedulers you mentioned are what older kernels offered; they are no longer available. Modern kernels provide multi-queue replacements for them.

If your SSDs show none as the scheduler, or your HDDs use mq-deadline, you already have the configuration described in the recommendations you found.

Of course, you can choose mq-deadline (the multi-queue replacement for deadline) or none (the multi-queue replacement for noop) for any disks/SSDs as you like. If you want to change this and keep the setting across reboots, you probably want to create a udev rule for it, like:
Code:
vi /etc/udev/rules.d/60-scheduler.rules
(or whatever you like to call the file)
and add
Code:
# override scheduler for drives used by ceph
ACTION=="add|change", KERNEL=="sda", ATTR{queue/scheduler}="none"
ACTION=="add|change", KERNEL=="sdb", ATTR{queue/scheduler}="none"
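If you do not want to wait for a reboot after adding the rule, you can ask udev to reload and re-apply its rules to the block devices:
Code:
udevadm control --reload-rules
udevadm trigger --subsystem-match=block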
 
Thank you so much for your reply!
Yes, I have SSDs. Their scheduler currently shows [mq-deadline] none.
I would like to switch the selection to none. How do I change the setting online, without rebooting?
echo "none" > /sys/block/sd[x]/queue/scheduler ?
Do I understand correctly that this will be better for an SSD?
 
echo "none" > /sys/block/sd[x]/queue/scheduler ?
Correct. You can do that on the fly to change the current scheduler.
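For example, to set it on several drives at once (a minimal sketch; replace sda and sdb with your actual SSD device names):
Code:
for dev in sda sdb; do
    echo "none" > /sys/block/$dev/queue/scheduler
done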

Do I understand correctly that it will be better for SSD?
I actually doubt there will be any performance benefit with regard to I/O. However, you might save some CPU cycles, because none does less work than mq-deadline.
It all obviously depends heavily on your specific hardware, but I do not think you will notice any difference, to be honest.
 
Understood.
There are many articles on the Internet about optimizing Ceph and increasing read/write speed, and many of them are very old.
The thing is, my Ceph speed tests show rather low numbers, even though the servers are powerful and the HDD, SSD, and NVMe drives are all enterprise class.
But the more I study this question, the more I am convinced that the default settings in Proxmox are optimal.
Do you think this is really the case, or can you recommend some settings?
What confuses me most is the speed of the HDD+NVMe pool: I do not see any performance gain.
If you are willing to help, I can provide more detailed specifications and a description of my settings.
 
First recommendation: run benchmarks. Identify whether you lack IOPS or bandwidth, and whether some storage medium is the problem (and if so, which one); then you have a much better chance of addressing the specific causes.
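For example, a quick RADOS benchmark against a test pool (the pool name here is a placeholder; --no-cleanup keeps the written objects so the read test has data to work with):
Code:
# 60-second write test with 4 MiB objects and 16 parallel ops
rados bench -p testpool 60 write -b 4M -t 16 --no-cleanup
# sequential read test using the objects written above
rados bench -p testpool 60 seq -t 16
# remove the benchmark objects when done
rados -p testpool cleanup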

Also keep in mind that the limiting factor in Ceph may not be the storage medium at all, but the network.
My setup, for instance, consists of 3 nodes with 8-10 OSDs each: one node with very fast NVMe SSDs, one with SAS SSDs, and the last one with older SATA SSDs (all datacenter/enterprise class; they deliver higher IOPS, and do so more consistently).
In my case, the 10G network I use is clearly the bottleneck as far as bandwidth is concerned: each NVMe SSD has much higher bandwidth, and even a single SAS SSD can saturate roughly 75% of a 10G link.

The network may also be the source of latency issues. As for tuning, search this forum for information on how to disable auth (settings like auth_client_required) and debug logging (such as debug_filestore, debug_auth, etc.) in Ceph, and take care to understand the implications before doing so. This may reduce traffic, save some round trips for auth, and (slightly) reduce CPU load.
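As a rough sketch of what that looks like in ceph.conf (on Proxmox the file lives at /etc/pve/ceph.conf; only disable cephx if the Ceph network is fully trusted, and the exact set of debug_* options worth disabling varies by release):
Code:
[global]
    # disable cephx authentication (trusted network only!)
    auth_client_required = none
    auth_cluster_required = none
    auth_service_required = none
    # silence debug logging to save CPU cycles
    debug_filestore = 0/0
    debug_auth = 0/0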

It is also important to understand how replicas are distributed across nodes and OSDs. By default, each OSD gets a weight corresponding to its capacity, so you can easily set up a pool that ends up favoring (because of their higher weight) the slowest but largest drives.
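You can inspect the weights with ceph osd tree and adjust them if needed (osd.3 and the new weight below are just examples):
Code:
# show the CRUSH hierarchy with per-OSD weights
ceph osd tree
# lower the CRUSH weight of a large but slow OSD
ceph osd crush reweight osd.3 1.0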
 
Thanks! I understand you.
I know about the settings you are writing about. And won't disabling
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
lead to a decrease in security?

Can you tell me what speeds your disk system tests show? A RADOS test, for example, so that I can compare.
At the moment, my network consists of four 1 Gbit ports in aggregation (that is, up to 4 Gbit at best). I understand perfectly well that this is not enough, and I am waiting for 100 Gbit switches to arrive. But even so, tests on my cluster do not reach the maximum possible speed, and this confuses me. For example, the HDD pool is sometimes faster than the SSD pool.
So I wonder whether the new switches will fix this.
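For what it's worth, this is how I measure the real throughput between two nodes (iperf3; node1 is a placeholder hostname, and with a 4x1 Gbit bond you need parallel streams to have a chance of using more than one link):
Code:
# on the first node: start the server
iperf3 -s
# on the second node: 30-second test with 4 parallel streams
iperf3 -c node1 -t 30 -P 4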
 
And won't disabling
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
lead to a decrease in security?
That all depends on who your ceph clients are. If you use ceph exclusively as a storage backend for VMs, then your cluster nodes will be the only clients and you can then determine if the network that traffic runs on actually needs the security.
 
I understand you. Thanks!
Will it be necessary to restart ceph on all nodes to apply the settings?
 