Need to make sure I have replicas right for PVE Ceph...

nethfel

Member
Dec 26, 2014
Hi All,

I'm in the midst of setting up an environment and planning my pools. Based on another post, I plan to have multiple pools separating my VMs out, to try to protect the data in case a pool gets corrupted - plus that other post made me concerned that I haven't built in enough redundancy.

Now - here's the basic config:

3x PVE servers, each containing 3x500G OSD HDDs (a 4th hdd is set aside for boot)
So, there would be a total of 9 drives

I plan to store no more than 1.25TB out of the ~3.7TB available
Right now - I'm planning on 3 replicas, 1 min.


My goal is to be able to have 1 full system down plus potentially lose 3 other drives and still have the VMs running (basically 6 drives out of the cluster) - unless I've done my math wrong, with the amount of data I'm planning on storing and 3 replicas, I should be able to lose 2 out of every 3 drives and still run - or am I missing something here? (Wouldn't be the first time.)

I understand that there is a very slim chance of a catastrophe like this happening, but I want to understand what the limits are and how much can go unavailable and still have the cluster run even in a degraded state.
 
Hi,
you don't have enough OSDs!
There are two reasons:
1. "Speed" comes from the number of OSDs. With "only" 9, your cluster will not be very fast. You should try the config to see if the performance is good enough for you.

2. You don't have enough space!

To 2.:
Ceph weights each OSD according to its available space - with XFS I got a weight of 3.64 for a 4TB HDD, and 3.58 for an ext4-formatted 4TB HDD (ext4 is much faster for me - 50% lower latency).
In your case, approx. 0.45TB of each HDD is usable. That means 100%, which you should never reach, is 3*0.45 = 1.35TB (one node's worth, since with replica 3 on 3 nodes each node must hold a full copy).
Normally you should use only approx. 60%, because if one node fails, its content will be copied to the remaining nodes - OK, not in your case, because you have a replica of 3 and only 3 nodes. But if you expand to 4 nodes, you should use less than 70%.
The OSDs will also not fill evenly - I have seen up to 20% difference between single OSDs on one node.
And you will get trouble if Ceph OSDs get too full!
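For reference, the fill levels can be watched with the standard Ceph CLI (these commands are in current Ceph releases; verify against your version):

```shell
# cluster-wide and per-pool usage
ceph df

# per-OSD fill level and weight; watch for OSDs drifting
# toward the nearfull ratio (85% by default)
ceph osd df

# overall health - reports nearfull/full OSDs explicitly
ceph health detail
```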

With 1.25TB of data, you should use 3 hosts with 5x 500GB OSDs each.
My goal is to be able to have 1 full system down plus potentially lose 3 other drives and still have the VMs running (basically 6 drives out of the cluster) - unless I've done my math wrong, with the amount of data I'm planning on storing and 3 replicas, I should be able to lose 2 out of every 3 drives and still run - or am I missing something here? (Wouldn't be the first time.)
That "plus" isn't right! You can lose one node - not more!

"Potentially losing 3 other drives" will destroy all your data - unless the "3 other drives" are all on one node!
With a replica of 3 you have three copies - if 1 HDD fails on each of the 3 nodes, your data is gone. In your case (with only 3 nodes), as long as one node remains fully healthy, you don't lose data even with more than 2 failed HDDs. But this changes if you expand to 4 nodes - then 3 dead disks on 3 different nodes can mean data loss.
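The replica math above maps to the pool's size/min_size and the CRUSH rule's failure domain; a quick way to inspect them (the pool name "rbd" is just an example):

```shell
# how many copies the pool keeps, and how many must be up for IO
ceph osd pool get rbd size
ceph osd pool get rbd min_size

# the CRUSH rule decides where copies go; with the default "host"
# failure domain, the 3 copies land on 3 different nodes
ceph osd crush rule dump
```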

Udo
 
Hi Udo,

Thanks for responding.

To #1:
How would you recommend testing the speed of the network? This is a "proxceph" setup as opposed to just a Ceph network, so I'm not sure what tools (if any) I might be missing. I might be able to get another machine to add; the only problem with that would be quorum issues with the Proxmox cluster, which would really require me to add 2 machines. This proxceph cluster won't be running any VMs, only using Proxmox for easy visual management of Ceph.

Unfortunately, 5 OSDs in each host is not an option - I have to work with the equipment I have, which in this case is 1U units with 4 drive slots; one for boot, which only leaves me 3 for storage drives.

To #2:
So if I understand your response to #2, I should be using no more than 0.6 * 1.35TB, so about 810GB? That I can work around by hosting some of the VMs on some of the actual host boxes instead of the cluster.
Also, as I understand what you're saying, if I had to take 1 node down for maintenance (say an HDD failed and I needed to replace it, or a reboot from an update), which would leave me with 2 nodes / 6 drives, then if I lose one more drive I'll lose data (assuming I'm at 60% usage or less)?
 
...
Also, as I understand what you're saying, if I had to take 1 node down for maintenance (say an HDD failed and I needed to replace it, or a reboot from an update), which would leave me with 2 nodes / 6 drives, then if I lose one more drive I'll lose data (assuming I'm at 60% usage or less)?
Hi,
with replica 3 and three nodes, each node contains all the data - so in this case you don't have data loss as long as one node with all its OSDs is OK.

But this only works when nodes = replica. If you expand your cluster to 4 nodes, the same failures will kill your data.

Udo
 
Ok, so as long as I have one fully functioning node with all good OSDs, and at least 2 monitors (well, in my case, 3 monitors normally), I should be able to keep operating until I can replace the damaged/faulty hardware.
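That state (healthy OSDs plus monitor quorum) can be confirmed at a glance, for example:

```shell
# overall cluster state: monitor quorum, OSD up/in counts, PG states
ceph -s

# monitor quorum details
ceph quorum_status --format json-pretty
```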

Now - the speed test - is there a good tutorial out there for how to properly speed test a ceph cluster you can recommend?
 
Now - the speed test - is there a good tutorial out there for how to properly speed test a ceph cluster you can recommend?
Hi,
I don't know of a tutorial.
A good starting point is rados bench on the node. Don't forget to clear the cache on all OSD nodes (and on the client node, if different - not in your case).

To remove the data easily, I use an extra pool named "test", like this:
Code:
ceph osd pool create test 512 512


rados -p test bench 60 write --no-cleanup

# on all nodes
echo 3 > /proc/sys/vm/drop_caches

# read again
rados -p test bench 60 seq --no-cleanup

# flush buffer again
echo 3 > /proc/sys/vm/drop_caches

# you can select the number of threads with -t, like

rados -p test bench 60 seq --no-cleanup -t 1
Inside a VM, I use fio to test the speed.
Here you must clear the cache inside the VM as well, and I assume rbd_cache isn't flushed by this (meaning you must have done the writes long enough before, so the data is no longer in the cache).
And it depends on whether you measure IOPS (4k blocks) or throughput (4M blocks).
Code:
# 4k version

fio --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=80 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest --size=128m

# for 4M I take something like
fio --max-jobs=1 --numjobs=1 --readwrite=write --blocksize=4M --size=4G --direct=1 --name=fiojob
# clear buffers!
fio --max-jobs=1 --numjobs=1 --readwrite=read --blocksize=4M --size=5G --direct=1 --name=fiojob
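The "clear buffers!" step inside the VM works the same way as on the nodes, assuming a Linux guest with root access:

```shell
# flush dirty pages first, then drop page cache, dentries and inodes
sync
echo 3 > /proc/sys/vm/drop_caches
```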
Udo
 
Ok, I'm going to go ahead and do the test. On a single node, the average single-HDD speed is 115 MB/s; with rados bench I see:

Code:
Run on a single node:


Write:
 Total time run:         60.413675
Total writes made:      1396
Write size:             4194304
Bandwidth (MB/sec):     92.429 


Stddev Bandwidth:       15.7697
Max bandwidth (MB/sec): 108
Min bandwidth (MB/sec): 0
Average Latency:        0.69192
Stddev Latency:         0.263073
Max latency:            1.90904
Min latency:            0.21049




Read:
Total time run:        34.173462
Total reads made:     1396
Read size:            4194304
Bandwidth (MB/sec):    163.402 


Average Latency:       0.39121
Max latency:           1.60801
Min latency:           0.040045

Is this within reason for my setup? Do I need to do the write portion on all Ceph nodes simultaneously, or should I be doing it from just one?
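One way I could try the simultaneous version, assuming rados bench's --run-name flag is available in my version, would be to start an instance per node with distinct run names so the benchmark objects don't collide:

```shell
# on node 1
rados -p test bench 60 write --no-cleanup --run-name node1
# on node 2, started at the same time
rados -p test bench 60 write --no-cleanup --run-name node2
# on node 3
rados -p test bench 60 write --no-cleanup --run-name node3
```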
 
Any thoughts about my numbers? Are they within reason for my setup? Would they improve greatly with more nodes and OSDs?
 
Any thoughts about my numbers? Are they within reason for my setup? Would they improve greatly with more nodes and OSDs?

Hi,
I have no experience with so few OSDs...
I assume you use the journal on the disks - and due to syncing and writing three times, you get lower write than read bandwidth.

I assume your values are normal, but someone with a comparable setup would have to verify that.
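That effect can be sketched with back-of-envelope arithmetic (hypothetical numbers: 115 MB/s per disk, 9 OSDs, replica 3, and a 2x penalty for a filestore journal on the same spindle):

```shell
# rough client write-bandwidth ceiling: aggregate disk bandwidth,
# divided by the replica count and by the journal double-write
DISK_MBS=115; OSDS=9; REPLICA=3; JOURNAL_PENALTY=2
echo $(( DISK_MBS * OSDS / (REPLICA * JOURNAL_PENALTY) ))
# prints 172 (MB/s) - an upper bound consistent with the 92 MB/s measured above
```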

Code:
rados -p test bench 60 seq --no-cleanup
...
 Total time run:        39.814489
Total reads made:     6155
Read size:            4194304
Bandwidth (MB/sec):    618.368 

Average Latency:       0.102736
Max latency:           3.30975
Min latency:           0.016923

rados -p test bench 60 write --no-cleanup
...
 Total time run:         60.155694
Total writes made:      7074
Write size:             4194304
Bandwidth (MB/sec):     470.379 

Stddev Bandwidth:       68.5092
Max bandwidth (MB/sec): 524
Min bandwidth (MB/sec): 0
Average Latency:        0.136025
Stddev Latency:         0.0752764
Max latency:            0.78294
Min latency:            0.032559
These results are with 64 OSDs.

Udo
 
Wow, that's some nice read/write bandwidth!

You are right, I am using the journal on the disks; getting enterprise-class SSDs is out of the question - completely aside from the fact that there isn't enough space in the servers I have available to put them in.

I know I won't get to set up that many OSDs (I work for a school, no real budget to get great stuff ;) ) - in terms of that setup, what bandwidth do you have per node? How many actual nodes is that?

Currently I'm using 3x gigabit ethernet in an LACP LAG for the ceph network.
 
Hi,
I now have 6 OSD nodes active. All nodes have 12 OSDs (4TB) plus a journal SSD (and a cache-tier SSD).
The sixth node has been in the cluster for two days, and right now the remaining disks are being filled.
After that, a seventh OSD node will follow, to bring enough free space to use all pools with replica 3.
The nodes are connected with 2x 10Gb Ethernet (one network for Ceph, one for PVE).

I'm not done tuning Ceph (scrubbing and deep scrubs still cause IO trouble), but it's usable now.

Udo
 
Are you running Proxmox on those nodes and hosting VMs as well, or are you using separate machines for the Proxmox cluster?
 
Ahh, so you have no OSDs within the PVE cluster then? My original plan was (based on equipment availability) to have 3x Ceph and 3x PVE: the Ceph cluster just running OSDs and monitors, and the PVE cluster mapping to the Ceph cluster but running the VMs. My Ceph cluster is installed through Proxmox, just not hosting VMs. Would I be better off doing a cluster of 5 machines, all with OSDs, letting 2 of them also host VMs, and keeping my 6th one as an emergency spare? I'd end up with the 15 drives that you recommend - I just don't know if it'd be wise hosting the VMs on machines that are dual-purposing as Ceph OSD nodes... They are all 2x quad-core Xeons @ 2.5GHz; the current Ceph group each have 8GB of RAM, and the group I was going to use for VMs each have 16GB. (We don't run too many VMs, but 3 @ 16GB is a lot more than the 4 VMware boxes we have running with between 4 and 8GB each...)
 
I tried my hand at putting VMs on the same Proxmox+Ceph nodes. Too much stress on the equipment, especially when recovering. It could be because my hardware was not at the higher end of the spectrum. With the same type of hardware, after separating the Proxmox hosts and Ceph nodes, it was OK.
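For what it's worth, recovery stress on co-located nodes can sometimes be reduced by throttling backfill/recovery; these options exist in filestore-era Ceph, but verify the names and defaults against your version:

```shell
# limit concurrent backfills and recovery ops per OSD (runtime change,
# reverts on OSD restart unless also set in ceph.conf)
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
```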
 
Yeah, I really don't want to run the VMs on the boxes that are Ceph OSD nodes if I can avoid it. I just know I don't have access to any more machines. I might be able to re-purpose one of the old VM hosts as a Ceph OSD node, but I wouldn't be able to add it until after I had the entire thing running and the VMs migrated off that box. It's a 6-core AMD; I'm not sure how well AMD units work as Ceph OSD nodes...
 
