Proxmox VE Ceph Benchmark 2018/02

Chicken76

Member
Jun 26, 2017
34
1
8
40
Hi,
from my point of view it's not true - you can build a 3-node Ceph cluster without issues.
One node can fail without data loss.

But the downtime of the failed node should not be too long, because Ceph can't remap the data to other OSDs to reach the replica count of three again.
But this depends on the amount of data. Often, in much bigger Ceph setups, it doesn't really make sense to remap all data to other nodes, because it is faster to bring the failed node back (spare server...). E.g. if one node has 10 4-TB OSDs, you need a long time to rebalance the data across the other nodes.
And you need the free space on the other nodes, of course!

But Ceph wins with more nodes (more speed, less trouble during rebalance).

Udo
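
To put a rough number on the "10 4-TB OSDs" example, here is a back-of-the-envelope sketch. It is not from the thread: it assumes the OSDs are completely full and that a single network link is the only bottleneck, and it ignores Ceph's recovery throttling and disk speed, so real recovery will take longer. The function name and the `efficiency` parameter are made up for illustration.

```python
def min_recovery_hours(data_tb: float, link_gbit: float, efficiency: float = 1.0) -> float:
    """Lower bound (in hours) for copying data_tb terabytes over one link.

    data_tb    -- amount of data to re-replicate, in TB (10^12 bytes)
    link_gbit  -- link speed in Gbit/s
    efficiency -- assumed usable fraction of the link (1.0 = line rate)
    """
    bits_to_move = data_tb * 1e12 * 8
    seconds = bits_to_move / (link_gbit * 1e9 * efficiency)
    return seconds / 3600


# Ten full 4 TB OSDs on the failed node, copied over 10 Gbit/s and 1 Gbit/s
for speed in (10, 1):
    print(f"{speed:>3} Gbit/s: at least {min_recovery_hours(40, speed):.1f} h")
```

Even at full 10 Gbit/s line rate this comes out at several hours, which is why bringing the failed node back can be the faster option.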
So what happens when one of the three machines with ceph goes down? Can't it function with just two copies, or does it try to create a third copy on the existing free space?
 

Alwin

Proxmox Staff Member
Staff member
Aug 1, 2017
3,286
289
88
It will stay in a degraded state, while the VM/CT can still access the storage in read/write. All copies are separated on host level, hence there is no recovery as no "4th" host is available to do a recovery.
 

Chicken76

I see. So no data loss and no disruption to running VM/CT, right? Can you still create new VM/CT on the degraded cluster? Is migrating existing VM/CT between hosts possible on a degraded (3-host with 1 host down) cluster?

And another question, if you'll permit: if a second host goes down, the cluster will stop working (lack of quorum) but the data is not lost, as there still is a healthy copy of it, and as soon as a second ceph server is spun up, it will start replicating, right?
 

Alwin

So no data loss and no disruption to running VM/CT, right?
Besides those on the failed node.

Can you still create new VM/CT on the degraded cluster? Is migrating existing VM/CT between hosts possible on a degraded (3-host with 1 host down) cluster?
Yes, between the remaining hosts.

And another question, if you'll permit: if a second host goes down, the cluster will stop working (lack of quorum) but the data is not lost, as there still is a healthy copy of it, and as soon as a second ceph server is spun up, it will start replicating, right?
From Ceph's view, as long as there is a copy and a working monitor, the recovery should happen. The Ceph pool(s) will block I/O as long as min_size (default = 2) is not met. But you cannot join a new node to the Proxmox cluster, as it is out of quorum and goes into read-only. Because of that read-only state, the VM/CT will not run either.

So in short, never let it get there. Saves you from a lot of trouble and sleepless nights.
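
The cases discussed so far can be summed up in a small sketch. This is a simplified model and not how Ceph or Proxmox decide anything internally: it assumes one replica per host (host-level failure domain), one quorum vote per node, and the pool values size = 3 and min_size = 2 from the thread. The function names and state labels are made up for illustration.

```python
def pve_has_quorum(nodes_up: int, nodes_total: int) -> bool:
    """Strict majority of votes, assuming one vote per node."""
    return 2 * nodes_up > nodes_total


def ceph_pool_state(hosts_up: int, size: int = 3, min_size: int = 2) -> str:
    """State of a replicated pool once any possible recovery has finished.

    With a host-level failure domain each live host holds at most one copy,
    so the number of copies can never exceed the number of live hosts.
    """
    copies = min(size, hosts_up)
    if copies < min_size:
        return "below_min_size"
    if copies < size:
        return "degraded"
    return "healthy"


# (total hosts, hosts still up)
for total, up in [(3, 3), (3, 2), (3, 1), (4, 3)]:
    print(f"{up}/{total} hosts up: pool={ceph_pool_state(up)}, "
          f"pve_quorum={pve_has_quorum(up, total)}")
```

The last case (4 hosts, 1 down) is the "+1" node: there is still a third host to recover onto, so the pool can get back to three copies.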
 

Chicken76

Is there no manual method to add another machine to a cluster after 2 out of 3 go down? It is not impossible for a lightning storm to cause a power spike that blows electronic equipment. In case you have unfixable damage to 2 servers out of 3, if you have spare machines (that were not plugged in) or can get new ones fast, is there no way to restore the cluster?
 

alexskysilk

Well-Known Member
Oct 16, 2015
606
64
48
Chatsworth, CA
www.skysilk.com
Why is that? Can you explain? I was contemplating a 3-node cluster and started doing the necessary reading when I stumbled upon your post. Why do you need "replication count" + 1?
Data integrity in a replicated environment is verified using a "democratic" process, which is to say that a piece of data is considered to be correct if the majority of "votes" agree that it is so. As a consequence, the minimum number of "votes" for quorum is 2. With a 3-node cluster, a node outage leaves the survivors unable to form a quorum, which means that if a piece of data has different values on the surviving PGs, that data will be considered dirty. And no one wants that ;)
 

Chicken76

So if a block differs between the 2 remaining copies, there are two possibilities:
  1. one of them is corrupted, and this should be sorted out by the checksums (there are checksums for every block, à la ZFS, right?)
  2. one of them has newer data than the other one. Are there timestamps or write logs that could show which is the most recent copy?
 

PigLover

Well-Known Member
Apr 8, 2013
102
33
48
Why is that? Can you explain? I was contemplating a 3-node cluster and started doing the necessary reading when I stumbled upon your post. Why do you need "replication count" + 1?
@udo nailed it. It's not a matter of 3-node clusters not working - it's a matter of how you want them to work when there is a failure. You need the "+1" node in order to bring the cluster back to a stable operating state. It should continue to work without it, but you don't want it to stay this way very long.

This is just a statement about minimal configs. In larger clusters you almost always have many more nodes than your replica set - so no worries.
 

Chicken76

OK guys, thank you very much for your explanations. The conclusion I take from this is that it doesn't function like a ZFS mirror with 1 drive down out of 3. While with ZFS you would be very much OK in this scenario, with ceph being distributed among separate machines things are much more complicated and consensus is much harder to achieve.
 

alexskysilk

The conclusion I take from this is that it doesn't function like a ZFS mirror with 1 drive down out of 3. While with ZFS you would be very much OK in this scenario,
A ZFS mirror is very much subject to the same limitations; when down to only one copy you do not have any parity, and any disk fault can and will cause corruption or worse. The reason that a ZFS mirror is acceptable while a Ceph cluster is not is that, in the scenario you're describing, Ceph has a node failure domain while ZFS has a disk failure domain. Plainly put, there is no potential to operate your ZFS file system with a third of its disks missing as part of its normal duty cycle.
 

dmulk

Member
Jan 24, 2017
74
4
13
45
Just want to confirm that the latency numbers in the Benchmark document are given in seconds and that we would need to multiply by 1000 to get ms...

Example: 0,704943 = approx 70ms latency

This seems a bit high for SSDs...
 

tom

Proxmox Staff Member
Staff member
Aug 29, 2006
13,973
470
103
Do you get better numbers?

A fully replicated network storage depends a lot on the network latency. Compare the numbers of the 10 Gbit and 100 Gbit examples in the PDF.
And a fast CPU will also help here.
 

dmulk

Sorry, this is what I'm trying to confirm: the numbers in the report with commas are throwing me off... what are these in ms?

700ms? 70ms?

Thanks,
<D>
 

Alwin

The decimal and thousands separators are "reversed".
Code:
German      4.333.222.111,00
US-English  4,333,222,111.00
The rados bench writes the latency in seconds, hence the ~700ms on the chart.
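
As a small illustration of the conversion (my own sketch, not part of the benchmark tooling): the hypothetical helper below takes a value written with German separators, normalizes it, and converts seconds to milliseconds. It assumes "." or a space is only ever used as a thousands separator in the input.

```python
def german_seconds_to_ms(text: str) -> float:
    """Parse a German-formatted number of seconds and return milliseconds.

    "." and " " are treated as thousands separators, "," as the decimal mark.
    """
    normalized = text.replace(".", "").replace(" ", "").replace(",", ".")
    return float(normalized) * 1000.0


# The value from the question above
print(f"{german_seconds_to_ms('0,704943'):.1f} ms")
```

So 0,704943 s is roughly 705 ms, not 70 ms.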
 

dmulk

Ah....thank you Alwin. Yeah....this seems worse now...700ms seems extremely high...if that were ns that would be fantastic. I'll need to run the tests in my environment and compare. I suspect I'm seeing better numbers because if I were at 700ms latency I would certainly be "hearing about it" from my devs.
 

Alwin

Code:
1 Gb/s   = 1 s / 1 Gb   = 0.000000001 s for 1 bit
10 Gb/s  = 1 s / 10 Gb  = 0.0000000001 s for 1 bit
100 Gb/s = 1 s / 100 Gb = 0.00000000001 s for 1 bit
This is only the theoretical difference in how long 1 bit takes to travel. This doesn't account for any hardware, media, hops in between, or the software stack.
https://en.wikipedia.org/wiki/OSI_model
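
The same arithmetic as a short sketch, extended to a whole 4 KiB block. The block size is my own choice for illustration, and like the numbers above this is only the theoretical time on the wire, without hardware, hops or software stack. The function names are made up.

```python
def seconds_per_bit(gbit_per_s: float) -> float:
    """Theoretical time to put one bit on the wire at the given link speed."""
    return 1.0 / (gbit_per_s * 1e9)


def wire_time_us(n_bytes: int, gbit_per_s: float) -> float:
    """Theoretical time in microseconds to put n_bytes on the wire."""
    return n_bytes * 8 * seconds_per_bit(gbit_per_s) * 1e6


for speed in (1, 10, 100):
    print(f"{speed:>3} Gb/s: {seconds_per_bit(speed):.0e} s per bit, "
          f"{wire_time_us(4096, speed):.2f} us per 4 KiB block")
```

The wire time shrinks by a factor of 10 per step, but it is only one part of the total latency a client sees.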
 

fips

Active Member
May 5, 2014
153
5
38
second round of my ssd benchmark testing:

Code:
SSD                                      Controller                        BW          IOPS
HP SSDF S700 240GB                       -                                 5171 KB/s   1292
Hynix Canvas SL300 240GB                 -                                 5166 KB/s   1291
Intel DC S3520 240GB                     Intel                             83085 KB/s  20771
Plextor PX-256S3C 256GB                  Silicon Motion SM2254 / TLC       5045 KB/s   1261
Seagate XF1230 240GB                     LAMD eMLC                         86084 KB/s  21521
Intel 520 240GB                          -                                 400 KB/s    97
Kingston DC400 480GB                     Phison PS3110-S10 / MLC           1733 KB/s   433
Intel DC S3520 1.6TB                     Intel                             80301 KB/s  20075
Kingston Fury SHFS37A480G 480GB          SandForce SF-2281 / MLC           78618 KB/s  19654
Sandisk Cloudspeed 2 ECO SDLF1DAR 960GB  Toshiba / MLC                     52193 KB/s  13048
Kingston SHSS37A480G 480GB               Phison PS3110-S10 / MLC           1745 KB/s   436
SP Velox V70 480GB                       SandForce SF-2281 / MLC           1713 KB/s   428
Sandisk Cloudspeed Eco SDLFNDAR 480GB    Marvell 88SS9187 8-channel / MLC  81467 KB/s  20366
Sandisk X600 1TB                         Toshiba / TLC                     8623 KB/s   2155
Sandisk Cloudspeed 2 ECO SDLF1DAR 480GB  Toshiba / MLC                     50846 KB/s  12711
Sandisk Cloudspeed SDLFODAR 240GB        Marvell 88SS9187 8-channel / MLC  42968 KB/s  10741
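
The post does not state the fio parameters, but dividing bandwidth by IOPS gives the block size the numbers imply. A quick sanity-check sketch over a few rows copied from the list above (the helper name is made up):

```python
# (bandwidth in KB/s, IOPS) for a few of the SSDs listed above
results = {
    "HP SSDF S700 240GB": (5171, 1292),
    "Intel DC S3520 240GB": (83085, 20771),
    "Seagate XF1230 240GB": (86084, 21521),
    "Intel 520 240GB": (400, 97),
    "Kingston DC400 480GB": (1733, 433),
}


def implied_block_kb(bw_kb_s: float, iops: float) -> float:
    """Average I/O size in KB implied by a bandwidth/IOPS pair."""
    return bw_kb_s / iops


for name, (bw, iops) in results.items():
    print(f"{name:<24} {implied_block_kb(bw, iops):.2f} KB per I/O")
```

Every row works out to about 4 KB per I/O, so the big spread between the drives is in how many small operations per second they sustain, not in the request size.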
 

flaf

New Member
May 20, 2018
3
0
1
41
Hi,

I have set up a Proxmox cluster, version 5.2, with Ceph storage (internal to Proxmox), and I have run a benchmark in a VM with fio. My problem is:

1. I'm not an expert at all in benchmarks (or in fio) and I'm very cautious concerning benchmarks; maybe my benchmark is not relevant at all.
2. I have absolutely no idea if my results are good or not (with regard to my conf, see below).

So I would appreciate some help and explanations 1. to make a relevant benchmark and 2. to know if the results are good or not (with regard to my conf).

Thanks a lot in advance for your help. :)
Regards.

PS: message edited (MTU in the ceph cluster network added).

Here is my configuration :

Code:
- I'm using Proxmox 5.2 with an internal ceph storage (Luminous) on 3 physical nodes.
- All nodes are strictly identical.
- Each is a 1U DELL PowerEdge R639 server: 2 CPUs Intel Xeon E5-2650, 256GB RAM, 10 disks
  (see below for the storage conf).
- A RAID controller PERC H730P which is set in HBA mode (no RAID).
- Among the 10 disks, there are 2 SSD 200GB Intel S3710 2.5" in RAID1 ZFS dedicated to the OS.
- Among the 10 disks, there are 8 SSD 800GB Intel S3520 2.5" dedicated to the ceph storage
  (one disk = one OSD, so 8x3=24 OSDs in all).
- There is one network card 2x10Gbps SFP+ strictly dedicated to the ceph cluster network.
  I have set a bonding on the two interfaces in active-backup mode with MTU=9000 like this:

auto bond1
iface bond1 inet static
    address      10.123.123.21/24
    slaves       ex0 ex1
    bond_miimon  100
    bond_mode    active-backup
    bond_primary ex0
    mtu          9000
In the cluster, I have created only one Ceph pool, "ceph-vm", with these settings (I haven't changed the CRUSH map):

Code:
- size = 3
- min_size = 2
- pg_num = 1024
- rule_name = replicated_rule
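
For reference, a quick sketch of what these pool values mean per OSD. This is my own arithmetic on the numbers given in the post (24 OSDs, size = 3, pg_num = 1024), and the helper name is made up. As far as I know the Ceph documentation's usual rule of thumb is on the order of 100 PGs per OSD, but treat that as a rough guide only.

```python
def pg_replicas_per_osd(pg_num: int, size: int, osd_count: int) -> float:
    """Average number of PG replicas each OSD has to hold for one pool."""
    return pg_num * size / osd_count


# Values from the post: one pool, 3 nodes x 8 OSDs
print(pg_replicas_per_osd(pg_num=1024, size=3, osd_count=24))
```

With a single pool this gives 128 PG replicas per OSD.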
When I created the _proxmox_ storage "ceph-vm" (the Proxmox storage which uses the Ceph pool "ceph-vm"), I set "--krbd=false". After that, I installed a small VM, a small Debian Stretch with 512MB RAM, and ran a fio bench in the VM:

Code:
### TL;DR => read iops ~ 7700 and write iops ~ 2550 per job (4 jobs) ###

root@test:~# cat fio
[rwjob]
readwrite=randrw
rwmixread=75
gtod_reduce=1
bs=4k
size=4G
ioengine=libaio
iodepth=16
direct=1
filename_format=$jobname.$jobnum.$filenum
numjobs=4


root@test:~# fio fio
rwjob: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=16
...
fio-2.16
Starting 4 processes
Jobs: 4 (f=4): [m(4)] [100.0% done] [103.9MB/35007KB/0KB /s] [26.6K/8751/0 iops] [eta 00m:00s]
rwjob: (groupid=0, jobs=1): err= 0: pid=1275: Thu May 17 15:39:01 2018
  read : io=3070.4MB, bw=30821KB/s, iops=7705, runt=102007msec
  write: io=1025.8MB, bw=10297KB/s, iops=2574, runt=102007msec
  cpu          : usr=5.03%, sys=13.49%, ctx=555568, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=785996/w=262580/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=16
rwjob: (groupid=0, jobs=1): err= 0: pid=1276: Thu May 17 15:39:01 2018
  read : io=3072.5MB, bw=30767KB/s, iops=7691, runt=102257msec
  write: io=1023.7MB, bw=10250KB/s, iops=2562, runt=102257msec
  cpu          : usr=5.00%, sys=13.52%, ctx=550809, majf=0, minf=8
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=786533/w=262043/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=16
rwjob: (groupid=0, jobs=1): err= 0: pid=1277: Thu May 17 15:39:01 2018
  read : io=3073.2MB, bw=30869KB/s, iops=7717, runt=101942msec
  write: io=1022.1MB, bw=10275KB/s, iops=2568, runt=101942msec
  cpu          : usr=4.94%, sys=13.58%, ctx=556677, majf=0, minf=7
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=786716/w=261860/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=16
rwjob: (groupid=0, jobs=1): err= 0: pid=1278: Thu May 17 15:39:01 2018
  read : io=3071.6MB, bw=30763KB/s, iops=7690, runt=102243msec
  write: io=1024.5MB, bw=10260KB/s, iops=2565, runt=102243msec
  cpu          : usr=4.94%, sys=13.60%, ctx=552381, majf=0, minf=7
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=786320/w=262256/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: io=12287MB, aggrb=123045KB/s, minb=30762KB/s, maxb=30869KB/s, mint=101942msec, maxt=102257msec
  WRITE: io=4096.7MB, aggrb=41023KB/s, minb=10250KB/s, maxb=10296KB/s, mint=101942msec, maxt=102257msec

Disk stats (read/write):
  vda: ios=3144433/1048390, merge=0/28, ticks=3463440/2746060, in_queue=6209176, util=100.00%
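
One thing to keep in mind when reading this output: the iops values are reported per job, and the run used numjobs=4. A small sketch that adds up the per-job values copied from the output above and cross-checks them against the aggregate bandwidth (aggrb) with bs=4k:

```python
# (read iops, write iops) per job, copied from the fio output above
jobs = [(7705, 2574), (7691, 2562), (7717, 2568), (7690, 2565)]

total_read = sum(read for read, _ in jobs)
total_write = sum(write for _, write in jobs)

# Cross-check: aggregate bandwidth in KB/s divided by the 4 KB block size
BLOCK_KB = 4
read_from_bw = 123045 / BLOCK_KB
write_from_bw = 41023 / BLOCK_KB

print(f"read : {total_read} iops summed, {read_from_bw:.0f} iops from aggrb")
print(f"write: {total_write} iops summed, {write_from_bw:.0f} iops from aggrb")
```

Both ways give roughly 30.8K read and 10.3K write IOPS for the VM as a whole. The small difference comes from the jobs not all running for exactly the same time.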
 

Alwin

What do you expect to be the outcome? And what do you want to test?

In general, you need a baseline to compare your tests against. A usual starting point is a test of the capabilities of the underlying hardware on its own (e.g. fio or iperf). In your case, e.g. a fio test against the Intel S3520.
 
