Ceph - question before first setup, one pool or two pools

But please be aware that only the overall throughput will increase with the number of users accessing the data. Each individual user will still experience a data rate roughly equal to single-thread performance. I only wanted to make this very clear.
This is not ideal, but still good enough.

I don't know of an option that allows you to partition your disk space in the initial setup of PVE. You will need to install PVE on the NVMe drive, then boot a rescue system, resize the filesystems and the LVM, and then use the remaining space for the other purposes. If you know how to resize LVM and ext4, this shouldn't be a big issue.
I don't think I'll do that; I'm not looking to make any custom changes.

To add to @Ingo S's post: don't install Proxmox VE on the NVMe when it is also used for OSD DB+WAL. These are separate concerns and are hard to control together. If there are enough IOPS left after hooking all the OSDs to the NVMe, you can move the MON DB (/var/lib/ceph/ceph-mon) to a partition on the NVMe to get better latency.
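A rough sketch of how such a move could look. The device name is a placeholder, and the MON store normally lives under /var/lib/ceph/mon, so verify the exact path on your own nodes first; do one MON at a time so the cluster keeps quorum:

```shell
# Sketch only: /dev/nvme0n1p5 is a hypothetical spare NVMe partition.
HOST=$(hostname)
systemctl stop ceph-mon@"$HOST"            # stop only the local monitor
mkfs.ext4 /dev/nvme0n1p5                   # fresh filesystem for the MON store
mount /dev/nvme0n1p5 /mnt
cp -a /var/lib/ceph/mon/. /mnt/            # copy the store, preserving ownership
umount /mnt
echo '/dev/nvme0n1p5 /var/lib/ceph/mon ext4 defaults,noatime 0 2' >> /etc/fstab
mount /var/lib/ceph/mon                    # the NVMe partition now backs the store
systemctl start ceph-mon@"$HOST"
```

Wait for `ceph -s` to show the MON back in quorum before touching the next node.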

What is the optimal install/configuration?
We are planning to get 2U servers, each with 8x 3.5" HDDs and 2 U.2 NVMe drives (we might add two more later due to the high cost).

The servers can take up to 4 PCIe cards:
  1. 2x40GbE network card
  2. Optane 200-300 GB for DB+WAL? Do I really need it? The SSD pool is quite fast (high-end U.2 NVMe) and the HDD pool barely gets write access. But if I do choose to write to it, will the WAL make a large difference? (Should the write speed then be about as fast as the WAL drive?)
  3. spare
  4. spare
12x 32GB RAM DIMMs (out of 24 slots) are used.

What do you think is better?
Can you suggest a good PCIe NVMe adapter for 1 and for 4 M.2 cards?

What is the suggested SSD size for Proxmox only?
 
In discussing this topic, a question has come to my mind:
Why is sequential single-threaded reading that slow? I would assume Ceph prepares reading of sequential blocks from other objects while the previous object is still being read. I'm not sure, but I think this is what is called read-ahead.
Might it be that this "slowness" is caused by the latency of looking up where the data is (on which OSD, which PG and which object)? Can moving the MON DB improve this significantly? I'm not even sure where our MON DB is stored right now...
Hm... the MONs hand out all (5) the maps of the cluster; the quicker they can do that, the better, of course. But I don't think that, in comparison, this would influence one client with a single or multiple threads. Ceph doesn't know locality, so at any time a thread (or multiple threads) needs the time for the round trip to the OSD over the network. This is always a significant penalty. A faster MON will just lower the overall latency spent. On a hyper-converged setup, the CPU will also be a big factor for latency: the higher the frequency and the lower the load (less scheduling), the quicker the MON (actually, any service) can serve its clients.
 
What is the optimal install/configuration?
This is always a good question, with a sad answer: "You need to test it!" Depending on many factors, some workloads will do better or worse with a given setup.

Optane 200-300 GB for DB+WAL? Do I really need it? The SSD pool is quite fast (high-end U.2 NVMe) and the HDD pool barely gets write access. But if I do choose to write to it, will the WAL make a large difference? (Should the write speed then be about as fast as the WAL drive?)
The DB (RocksDB) merges its tables at sizes of 3, 30 and 300 GiB. If the DB doesn't fit into the partition, it will spill over to the data media. To be on the safe side, our tooling assumes the DB+WAL will be 10% of the data device's size and creates the partition accordingly.

What do you think is better?
There will be a good portion of logging going on, besides other reads/writes by services. But in most cases SSDs should be fine. I can't really say anything about the brands above, but I wouldn't use some non-standard disk implementation. If you have free SATA connectors on board, then use SSDs and place them at a good location inside the chassis.
 
The DB (RocksDB) merges its tables at sizes of 3, 30 and 300 GiB. If the DB doesn't fit into the partition, it will spill over to the data media. To be on the safe side, our tooling assumes the DB+WAL will be 10% of the data device's size and creates the partition accordingly.
Can you explain?

There will be a good portion of logging going on, besides other reads/writes by services. But in most cases SSDs should be fine. I can't really say anything about the brands above, but I wouldn't use some non-standard disk implementation. If you have free SATA connectors on board, then use SSDs and place them at a good location inside the chassis.
Now, after double-checking, there are no free SATA ports; I'll have to add a PCIe card / NVMe riser / SATA extender.
 
Should the WAL be 10% of the entire pool, or of a single OSD?
Each OSD has a DB & WAL. The 10% is taken from the disk size, and the DB+WAL partition will be created at that size (e.g. 4TB disk = 400GB DB+WAL).
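So the 10% rule is per OSD data disk, not per pool. A quick sketch of the arithmetic (the function name is just for illustration):

```python
def db_wal_size_gb(disk_size_tb: float, ratio: float = 0.10) -> float:
    """DB+WAL partition size using the 10%-of-the-data-disk rule of thumb."""
    return disk_size_tb * 1000 * ratio  # decimal units, as in '4TB disk = 400GB'

# Per-OSD sizing for a 4 TB and a 6 TB data disk:
print(db_wal_size_gb(4))  # 400.0
print(db_wal_size_gb(6))  # 600.0
```

Note that RocksDB effectively only uses its level sizes (roughly 3/30/300 GiB), which is why the thread mentions those numbers; the 10% is a safety margin, not a hard requirement.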
 
Each OSD has a DB & WAL. The 10% is taken from the disk size, and the DB+WAL partition will be created at that size (e.g. 4TB disk = 400GB DB+WAL).
Is it stored on the OSD itself, or should I have dedicated storage for it?
 
I would put the DB/WAL on the OSDs directly in the case of the fast NVMe SSDs.
For HDDs... it depends on how much space is allocated for DB/WAL relative to the HDD size. If they are really almost read-only, I would try it without a shared device.

As for read-intensive clients: how will they access the HDD pool? NFS? Samba? Ceph? CephFS?
 
If the OSD is an HDD, you should place DB+WAL on an SSD, since writing to the cluster produces lots of IO on the DB. If the DB is on the HDD OSD, overall performance will be much lower, especially during a recovery, e.g. after an HDD failure.

On our cluster we have one 375GB SSD per server, which contains 8 partitions of around 42GB each, one per OSD.

I assume this is not ideal, regarding this:

Alwin said:
The DB (RocksDB) is merging its tables with a size of 3, 30, 300 GiB. If the DB doesn't fit into the partition it will spill over to the data media.
It looks to me as if DB+WAL is spilling over to our HDDs...
Man... designing a good Ceph cluster is really hard. Seems there is still plenty to learn.
 
I would put the DB/WAL on the OSDs directly in the case of the fast NVMe SSDs.
For HDDs... it depends on how much space is allocated for DB/WAL relative to the HDD size. If they are really almost read-only, I would try it without a shared device.

As for read-intensive clients: how will they access the HDD pool? NFS? Samba? Ceph? CephFS?
I think NFS.


Man... designing a good Ceph cluster is really hard. Seems there is still plenty to learn.
Yep, that's why I am asking all those questions.
 
I think NFS.

Hm. I have a 3-node hyper-converged Ceph PVE setup (still testing, 2x SATA SSD OSDs per node) with a VM as NFS server, and performance is... crap. Maybe it's better with LXC? Anyway, it's another performance hit for accessing the HDD pool from clients. Think about it (nfs-ganesha, Ceph clients etc.).
A shared SSD for DB/WAL is suggested in the case of an HDD pool, but it's another point of failure. You need to factor that in.

You could still look at DRBD9; it has some nice features compared to DRBD8. Maybe it will fit better for HDDs (with HW RAID)?
 
Hm. I have a 3-node hyper-converged Ceph PVE setup (still testing, 2x SATA SSD OSDs per node) with a VM as NFS server, and performance is... crap. Maybe it's better with LXC? Anyway, it's another performance hit for accessing the HDD pool from clients. Think about it (nfs-ganesha, Ceph clients etc.).
A shared SSD for DB/WAL is suggested in the case of an HDD pool, but it's another point of failure. You need to factor that in.

You could still look at DRBD9; it has some nice features compared to DRBD8. Maybe it will fit better for HDDs (with HW RAID)?
How much is "crap"?
 
How much is "crap"?

Test scenario: NFS server + NFS client, both in dedicated VMs, on these types of storage:
DRBD9 over HW RAID1 on 2 nodes: fio randread 31k, randwrite 4.8k, randread/write 11k/4k IOPS
Ceph 2/2 on 2 nodes: 11k, 0.8k, 2k/0.6k
 
Test scenario: NFS server + NFS client, both in dedicated VMs, on these types of storage:
DRBD9 over HW RAID1 on 2 nodes: fio randread 31k, randwrite 4.8k, randread/write 11k/4k IOPS
Ceph 2/2 on 2 nodes: 11k, 0.8k, 2k/0.6k
What is the throughput in MB/s?
What is your hardware?
 
Sorry for the double post, but I would like to summarize what I currently have in mind and then ask the relevant questions.

Assuming the following setup of 5 nodes (if everything works well, we might add more of those in the near future), each with:
  • LXC pool: 2x high-performance SSD
  • data pool: 8x 6TB HDD (WD Red Pro)
  • 2 small SATA SSDs for the Proxmox OS
  • 12x 32GB RAM
  • 2x 12-core CPU
  • 2x 40GbE network (one public, one for Ceph)
My questions:
  1. I guess the LXC pool will store the DB on its own SSDs. (Optane storage would have half the latency, but also half the throughput.) Would I gain something from using Optane? (Its cost is relatively high.)
  2. I can get one more high-performance SSD for journaling the data pool. (The question is: the pool is mostly read-only, so I think the journal could also be stored on the pool itself.)
  3. I guess I need dedicated storage for the monitor? What size and category should I get? Would a PCIe NVMe do?
  4. Do I need to add more storage for anything else?

Did I miss anything?
 
It looks to me as if DB+WAL is spilling over to our HDDs...
Ceph 14.2.4 would complain about that; check ceph -s.
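On Nautilus that complaint shows up as a BLUEFS_SPILLOVER health warning. A sketch of how to check for it on a running cluster (osd.0 is just an example ID, and the daemon command has to run on the node hosting that OSD):

```shell
# Cluster-wide: spillover raises a health warning on Nautilus (14.x) and later
ceph health detail | grep -i spillover

# Per OSD: bytes of RocksDB data that ended up on the slow (data) device
ceph daemon osd.0 perf dump bluefs | grep -E 'db_used_bytes|slow_used_bytes'
```

A non-zero slow_used_bytes on an OSD with a separate DB device indicates the DB has outgrown its partition.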

On our cluster we have one 375GB SSD per server, which contains 8 partitions of around 42GB each, one per OSD.
I guess the SSD might become the bottleneck with 8 OSDs. Maybe split them: two SSDs with 4 OSDs each.
 
I guess the LXC pool will store the DB on its own SSDs. (Optane storage would have half the latency, but also half the throughput.) Would I gain something from using Optane? (Its cost is relatively high.)
Optane memory is a good fit for a DB+WAL device, as it delivers consistent IOPS for small writes. Depending on your IO needs, SSDs might be enough.

I can get one more high-performance SSD for journaling the data pool. (The question is: the pool is mostly read-only, so I think the journal could also be stored on the pool itself.)
Again, it depends on your IOPS needs. But a DB+WAL device can also be added later.

I guess I need dedicated storage for the monitor? What size and category should I get? Would a PCIe NVMe do?
Not necessarily, but as above, you can add that later too.

Do I need to add more storage for anything else?
I can't answer that.
 
After a lot of calculation, we are going to do something simpler.
We will start with 3 nodes, each with:
  • 1U
  • 8x 4TB SSDs for the pool
  • 2 small SSDs for the OS (RAID)
  • 2x 40GbE network
  • 2x 12-core CPU
  • optional NVMe/Optane (not at the initial stage)
Each node should give approx. 10TB of usable redundant data.
If everything works and is stable, we will add nodes based on our storage growth.
 
Wow, all of this fits into a 1U server?

This setup sounds reasonable. What controller will you be using for the SSDs? Are you going with an Intel or an AMD based system? Just asking because AMD has made a big leap in performance and seems very good in terms of performance per buck.
 
