Ceph - question before first setup, one pool or two pools

But please be aware that only the overall throughput will increase with the number of users accessing the data. Each individual user will still experience a data rate roughly equal to single-thread performance. I only wanted to make this very clear.
This is not ideal, but still good enough.

I don't know of an option that allows you to partition your disk space in the initial setup of PVE. You will need to install PVE on the NVMe drive, then boot a rescue system, resize the filesystems and the LVM, and then use the remaining space for the other purposes. If you know how to resize LVM and ext4, this shouldn't be a big issue.
I don't think I'll do that; I'm not looking to make any custom changes.

To add to @Ingo S's post: don't install Proxmox VE on the NVMe when it is also used for OSD DB+WAL. These are separate concerns and are hard to control together. If there are enough IOPS left after hooking all the OSDs to the NVMe, you can move the MON DB (/var/lib/ceph/ceph-mon) to a partition on the NVMe to get better latency.
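A rough sketch of how such a move could look. The device name is a placeholder, and the MON store normally lives under /var/lib/ceph/mon, so verify the exact path on your own nodes first; do one MON at a time so the cluster keeps quorum:

```shell
# Sketch only: /dev/nvme0n1p5 is a hypothetical spare NVMe partition.
HOST=$(hostname)
systemctl stop ceph-mon@"$HOST"            # stop only the local monitor
mkfs.ext4 /dev/nvme0n1p5                   # fresh filesystem for the MON store
mount /dev/nvme0n1p5 /mnt
cp -a /var/lib/ceph/mon/. /mnt/            # copy the store, preserving ownership
umount /mnt
echo '/dev/nvme0n1p5 /var/lib/ceph/mon ext4 defaults,noatime 0 2' >> /etc/fstab
mount /var/lib/ceph/mon                    # the NVMe partition now backs the store
systemctl start ceph-mon@"$HOST"
```

Wait for `ceph -s` to show the MON back in quorum before touching the next node.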

What is the optimal install/configuration?
We are planning to get 2U servers, each with 8x 3.5" HDDs and 2 U.2 NVMe drives (we might add two more later due to the high cost).

The servers can take up to 4 PCIe cards:
  1. 2x40GbE network card
  2. Optane 200-300 GB for DB+WAL? Do I really need it? The SSD pool is quite fast (high-end U.2 NVMe) and the HDD pool barely gets write access. But if I do choose to write to it, will the WAL make a large difference? (Should the write speed then be about as fast as the WAL drive?)
  3. spare
  4. spare
12x 32GB RAM DIMMs (out of 24 slots) are used.

What do you think is better?
Can you suggest a good PCIe NVMe adapter for 1 and for 4 M.2 cards?

What is the suggested SSD size for Proxmox only?
 
In discussing this topic, a question has come to my mind:
Why is sequential single-threaded reading that slow? I would assume Ceph prepares reading of sequential blocks from other objects while the previous object is still being read. I'm not sure, but I think this is what is called read-ahead.
Might it be that this "slowness" is caused by the latency of looking up where the data is (on which OSD, which PG and which object)? Can moving the MON DB improve this significantly? I'm not even sure where our MON DB is stored right now...
Hm... the MONs hand out all (5) the maps of the cluster; the quicker they can do that, the better, of course. But I don't think that, in comparison, this would influence one client with a single or multiple threads. Ceph doesn't know locality, so at any time a thread (or multiple threads) needs the time for the round trip to the OSD over the network. This is always a significant penalty. A faster MON will just lower the overall latency spent. On a hyper-converged setup, the CPU will also be a big factor for latency: the higher the frequency and the lower the load (less scheduling), the quicker the MON (actually, any service) can serve its clients.
 
What is the optimal install/configuration?
This is always a good question, with a sad answer: "You need to test it!" Depending on many factors, some workloads will do better or worse with a given setup.

Optane 200-300 GB for DB+WAL? Do I really need it? The SSD pool is quite fast (high-end U.2 NVMe) and the HDD pool barely gets write access. But if I do choose to write to it, will the WAL make a large difference? (Should the write speed then be about as fast as the WAL drive?)
The DB (RocksDB) merges its tables at sizes of 3, 30 and 300 GiB. If the DB doesn't fit into the partition, it will spill over to the data media. To be on the safe side, our tooling assumes the DB+WAL will be 10% of the data device's size and creates the partition accordingly.

What do you think is better?
There will be a good portion of logging going on, besides other reads/writes by services. But in most cases SSDs should be fine. I can't really say anything about the brands above, but I wouldn't use some non-standard disk implementation. If you have free SATA connectors on board, then use SSDs and place them at a good location inside the chassis.
 
The DB (RocksDB) merges its tables at sizes of 3, 30 and 300 GiB. If the DB doesn't fit into the partition, it will spill over to the data media. To be on the safe side, our tooling assumes the DB+WAL will be 10% of the data device's size and creates the partition accordingly.
Can you explain?

There will be a good portion of logging going on, besides other reads/writes by services. But in most cases SSDs should be fine. I can't really say anything about the brands above, but I wouldn't use some non-standard disk implementation. If you have free SATA connectors on board, then use SSDs and place them at a good location inside the chassis.
Now, after double-checking, there are no free SATA ports; I'll have to add a PCIe card / NVMe riser / SATA extender.
 
Should the WAL be 10% of the entire pool, or of a single OSD?
Each OSD has a DB & WAL. The 10% is taken from the disk size, and the DB+WAL partition will be created at that size (e.g. 4TB disk = 400GB DB+WAL).
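So the 10% rule is per OSD data disk, not per pool. A quick sketch of the arithmetic (the function name is just for illustration):

```python
def db_wal_size_gb(disk_size_tb: float, ratio: float = 0.10) -> float:
    """DB+WAL partition size using the 10%-of-the-data-disk rule of thumb."""
    return disk_size_tb * 1000 * ratio  # decimal units, as in '4TB disk = 400GB'

# Per-OSD sizing for a 4 TB and a 6 TB data disk:
print(db_wal_size_gb(4))  # 400.0
print(db_wal_size_gb(6))  # 600.0
```

Note that RocksDB effectively only uses its level sizes (roughly 3/30/300 GiB), which is why the thread mentions those numbers; the 10% is a safety margin, not a hard requirement.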
 
Each OSD has a DB & WAL. The 10% is taken from the disk size, and the DB+WAL partition will be created at that size (e.g. 4TB disk = 400GB DB+WAL).
Is it stored on the OSD itself, or should I have dedicated storage for it?
 
I would put the DB/WAL on the OSDs directly in the case of the fast NVMe SSDs.
For HDDs... it depends on how much space is allocated for DB/WAL relative to the HDD size. If they are really almost read-only, I would try it without a shared device.

As for read-intensive clients: how will they access the HDD pool? NFS? Samba? Ceph? CephFS?
 
If the OSD is an HDD, you should place DB+WAL on an SSD, since writing to the cluster produces lots of IO on the DB. If the DB is on the HDD OSD, overall performance will be much lower, especially during a recovery, e.g. after an HDD failure.

On our cluster we have one 375GB SSD per server, which contains 8 partitions of around 42GB each, one per OSD.

I assume this is not ideal, regarding this:

Alwin said:
The DB (RocksDB) is merging its tables with a size of 3, 30, 300 GiB. If the DB doesn't fit into the partition it will spill over to the data media.
It looks to me as if DB+WAL is spilling over to our HDDs...
Man... designing a good Ceph cluster is really hard. Seems there is still plenty to learn.
 
I would put the DB/WAL on the OSDs directly in the case of the fast NVMe SSDs.
For HDDs... it depends on how much space is allocated for DB/WAL relative to the HDD size. If they are really almost read-only, I would try it without a shared device.

As for read-intensive clients: how will they access the HDD pool? NFS? Samba? Ceph? CephFS?
I think NFS.


Man... designing a good Ceph cluster is really hard. Seems there is still plenty to learn.
Yep, that's why I am asking all those questions.
 
I think NFS.

Hm. I have a 3-node hyper-converged Ceph PVE setup (still testing, 2x SATA SSD OSDs per node) with a VM as NFS server, and performance is... crap. Maybe it's better with LXC? Anyway, it's another performance hit for accessing the HDD pool from clients. Think about it (nfs-ganesha, Ceph clients etc.).
A shared SSD for DB/WAL is suggested in the case of an HDD pool, but it's another point of failure. You need to factor that in.

You could still look at DRBD9; it has some nice features compared to DRBD8. Maybe it will fit better for HDDs (with HW RAID)?
 
Hm. I have a 3-node hyper-converged Ceph PVE setup (still testing, 2x SATA SSD OSDs per node) with a VM as NFS server, and performance is... crap. Maybe it's better with LXC? Anyway, it's another performance hit for accessing the HDD pool from clients. Think about it (nfs-ganesha, Ceph clients etc.).
A shared SSD for DB/WAL is suggested in the case of an HDD pool, but it's another point of failure. You need to factor that in.

You could still look at DRBD9; it has some nice features compared to DRBD8. Maybe it will fit better for HDDs (with HW RAID)?
How much is "crap"?
 
How much is "crap"?

Test scenario: NFS server + NFS client, both in dedicated VMs, on these types of storage:
DRBD9 over HW RAID1 on 2 nodes: fio randread 31k, randwrite 4.8k, randread/write 11k/4k IOPS
Ceph 2/2 on 2 nodes: 11k, 0.8k, 2k/0.6k
 
Test scenario: NFS server + NFS client, both in dedicated VMs, on these types of storage:
DRBD9 over HW RAID1 on 2 nodes: fio randread 31k, randwrite 4.8k, randread/write 11k/4k IOPS
Ceph 2/2 on 2 nodes: 11k, 0.8k, 2k/0.6k
What is the throughput in MB/s?
What is your hardware?
 
Sorry for the double post, but I would like to summarize what I currently have in mind and then ask the relevant questions.

Assuming the following setup of 5 nodes (if everything works well, we might add more of those in the near future), each with:
  • LXC pool: 2x high-performance SSD
  • data pool: 8x 6TB HDD (WD Red Pro)
  • 2 small SATA SSDs for the Proxmox OS
  • 12x 32GB RAM
  • 2x 12-core CPU
  • 2x 40GbE network (one public, one for Ceph)
My questions:
  1. I guess the LXC pool will store the DB on its own SSDs. (Optane storage would have half the latency, but also half the throughput.) Would I gain something from using Optane? (Its cost is relatively high.)
  2. I can get one more high-performance SSD for journaling the data pool. (The question is: the pool is mostly read-only, so I think the journal could also be stored on the pool itself.)
  3. I guess I need dedicated storage for the monitor? What size and category should I get? Would a PCIe NVMe do?
  4. Do I need to add more storage for anything else?

Did I miss anything?
 
It looks to me as if DB+WAL is spilling over to our HDDs...
Ceph 14.2.4 would complain about that; check ceph -s.
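On Nautilus that complaint shows up as a BLUEFS_SPILLOVER health warning. A sketch of how to check for it on a running cluster (osd.0 is just an example ID, and the daemon command has to run on the node hosting that OSD):

```shell
# Cluster-wide: spillover raises a health warning on Nautilus (14.x) and later
ceph health detail | grep -i spillover

# Per OSD: bytes of RocksDB data that ended up on the slow (data) device
ceph daemon osd.0 perf dump bluefs | grep -E 'db_used_bytes|slow_used_bytes'
```

A non-zero slow_used_bytes on an OSD with a separate DB device indicates the DB has outgrown its partition.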

On our cluster we have one 375GB SSD per server, which contains 8 partitions of around 42GB each, one per OSD.
I guess the SSD might become the bottleneck with 8 OSDs. Maybe split them: two SSDs with 4 OSDs each.
 
I guess the LXC pool will store the DB on its own SSDs. (Optane storage would have half the latency, but also half the throughput.) Would I gain something from using Optane? (Its cost is relatively high.)
Optane memory is a good fit for a DB+WAL device, as it delivers consistent IOPS for small writes. Depending on your IO needs, SSDs might be enough.

I can get one more high-performance SSD for journaling the data pool. (The question is: the pool is mostly read-only, so I think the journal could also be stored on the pool itself.)
Again, it depends on your IOPS needs. But a DB+WAL device can also be added later.

I guess I need dedicated storage for the monitor? What size and category should I get? Would a PCIe NVMe do?
Not necessarily, but as above, you can add that later too.

Do I need to add more storage for anything else?
I can't answer that.
 
After a lot of calculation, we are going to do something simpler.
We will start with 3 nodes, each with:
  • 1U
  • 8x 4TB SSDs for the pool
  • 2 small SSDs for the OS (RAID)
  • 2x 40GbE network
  • 2x 12-core CPU
  • optional NVMe/Optane (not at the initial stage)
Each node should give approx. 10TB of usable redundant data.
If everything works and is stable, we will add nodes based on our storage growth.
 
Wow, all of this fits into a 1U server?

This setup sounds reasonable. What controller will you be using for the SSDs? Are you going with an Intel or an AMD based system? Just asking because AMD has made a big leap in performance and seems very good in terms of performance per buck.
 
