Proxmox VE Ceph Benchmark 2018/02

Chicken76 · Mar 23, 2018

udo said:
Hi,
from my point of view it's not true - you can build a 3-node Ceph cluster without issues.
One node can fail without data loss.

But the downtime of the failed node should not be too long, because Ceph can't remap the data to other OSDs to reach the replica count of three again.
But this depends on the amount of data. Often, in much bigger Ceph setups, it doesn't really make sense to remap all data to other nodes, because it is faster to bring the failed node back (spare server...). E.g. if one node has 10 4-TB OSDs, it takes a long time to rebalance the data across the other nodes.
And you need the free space on the other nodes, of course!

But Ceph wins with more nodes (more speed, less trouble during rebalance).

Udo

So what happens when one of the three machines with ceph goes down? Can't it function with just two copies, or does it try to create a third copy on the existing free space?

Alwin · Mar 23, 2018

It will stay in a degraded state, while the VM/CT can still access the storage in read/write. All copies are separated on host level, hence there is no recovery as no "4th" host is available to do a recovery.

Chicken76 · Mar 23, 2018

Alwin said:
It will stay in a degraded state, while the VM/CT can still access the storage in read/write. All copies are separated on host level, hence there is no recovery as no "4th" host is available to do a recovery.

I see. So no data loss and no disruption to running VM/CT, right? Can you still create new VM/CT on the degraded cluster? Is migrating existing VM/CT between hosts possible on a degraded (3-host with 1 host down) cluster?

And another question, if you'll permit: if a second host goes down, the cluster will stop working (lack of quorum) but the data is not lost, as there still is a healthy copy of it, and as soon as a second ceph server is spun up, it will start replicating, right?

Alwin · Mar 23, 2018

Chicken76 said:
So no data loss and no disruption to running VM/CT, right?

Besides those on the failed node.

Chicken76 said:
Can you still create new VM/CT on the degraded cluster? Is migrating existing VM/CT between hosts possible on a degraded (3-host with 1 host down) cluster?

Yes, between the remaining hosts.

Chicken76 said:
And another question, if you'll permit: if a second host goes down, the cluster will stop working (lack of quorum) but the data is not lost, as there still is a healthy copy of it, and as soon as a second ceph server is spun up, it will start replicating, right?

From Ceph's point of view, as long as there is a copy and a working monitor, the recovery should happen. The Ceph pool(s) stay read-only as long as the min_size (default = 2) is not met. But you cannot join a new node to the Proxmox cluster, as it is out of quorum and goes into read-only mode. Because of that read-only state, the VM/CT will not run either.

So in short, never let it get there. It saves you from a lot of trouble and sleepless nights.
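
A rough way to picture the rules described above is the following Python sketch. It is only a simplified model under stated assumptions (one copy per host, one vote per Proxmox node), not Ceph's or Proxmox's actual logic; the function names are made up for illustration.

```python
def ceph_pool_state(size, min_size, hosts_up):
    """Simplified model of a replicated pool with a host-level failure domain."""
    # Assumption: each host can hold at most one copy of any object.
    copies_available = min(size, hosts_up)
    if copies_available >= size:
        return "healthy"
    if copies_available >= min_size:
        # Still usable, but no spare host exists to recreate the missing copy.
        return "degraded, read/write"
    # The thread describes the pool as read-only in this state.
    return "below min_size"


def pve_has_quorum(votes_total, votes_up):
    """Majority rule: strictly more than half of all votes must be present."""
    return 2 * votes_up > votes_total


for hosts_up in (3, 2, 1):
    print(hosts_up, ceph_pool_state(3, 2, hosts_up), pve_has_quorum(3, hosts_up))
```

With size = 3 and min_size = 2 on three hosts, losing one host leaves the pool degraded but usable and the Proxmox cluster quorate; losing a second host drops both below their thresholds.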

Chicken76 · Mar 23, 2018

Alwin said:
Besides those on the failed node.

Yes, between the remaining hosts.

From Ceph's point of view, as long as there is a copy and a working monitor, the recovery should happen. The Ceph pool(s) stay read-only as long as the min_size (default = 2) is not met. But you cannot join a new node to the Proxmox cluster, as it is out of quorum and goes into read-only mode. Because of that read-only state, the VM/CT will not run either.

So in short, never let it get there. It saves you from a lot of trouble and sleepless nights.

Is there no manual method to add another machine to a cluster after 2 out of 3 go down? It is not impossible for a lightning storm to cause a power spike that blows electronic equipment. If you have unfixable damage to 2 servers out of 3, and you have spare machines (that were not plugged in) or can get new ones fast, is there no way to restore the cluster?

alexskysilk · Mar 23, 2018

Chicken76 said:
Why is that? Can you explain? I was contemplating a 3-node cluster and started doing the necessary reading when I stumbled upon your post. Why do you need "replication count" + 1?

Data integrity in a replicated environment is verified using a "democratic" process, which is to say that a piece of data is considered correct if the majority of "votes" agree that it is so. As a consequence, the minimum number of "votes" for quorum is 2. With a 3-node cluster, a node outage leaves the survivors unable to form a quorum, which means that if a piece of data has different values on the surviving PGs, that data will be considered dirty, and no one wants that.

Chicken76 · Mar 24, 2018

alexskysilk said:
Data integrity in a replicated environment is verified using a "democratic" process, which is to say that a piece of data is considered correct if the majority of "votes" agree that it is so. As a consequence, the minimum number of "votes" for quorum is 2. With a 3-node cluster, a node outage leaves the survivors unable to form a quorum, which means that if a piece of data has different values on the surviving PGs, that data will be considered dirty, and no one wants that.

So if a block differs between the 2 remaining copies, there are two possibilities:

- one of them is corrupted, and this should be sorted out by the checksums (there are checksums for every block à la ZFS, right?)
- one of them has newer data than the other one. Are there timestamps or write logs that could show which is the most recent copy?

PigLover · Mar 26, 2018

Chicken76 said:
Why is that? Can you explain? I was contemplating a 3-node cluster and started doing the necessary reading when I stumbled upon your post. Why do you need "replication count" + 1?

@udo nailed it. It's not a matter of 3-node clusters not working - it's a matter of how you want them to work when there is a failure. You need the "+1" node in order to bring the cluster back to a stable operating state. It should continue to work without it, but you don't want it to stay this way very long.

This is just a statement about minimal configs. In larger clusters you almost always have many more nodes than your replica set - so no worries.

Chicken76 · Mar 26, 2018

OK guys, thank you very much for your explanations. The conclusion I take from this is that it doesn't function like a ZFS mirror with 1 drive down out of 3. While with ZFS you would be very much OK in this scenario, with ceph being distributed among separate machines things are much more complicated and consensus is much harder to achieve.

alexskysilk · Mar 28, 2018

Chicken76 said:
The conclusion I take from this is that it doesn't function like a ZFS mirror with 1 drive down out of 3. While with ZFS you would be very much OK in this scenario,

A ZFS mirror is very much subject to the same limitations; when down to only one copy you do not have any parity, and any disk fault can and will cause corruption or worse. The reason that a ZFS mirror is acceptable while a Ceph cluster is not is that, in the scenario you're describing, Ceph has a node failure domain while ZFS has a disk failure domain. Plainly put, there is no potential to operate your ZFS file system with a third of its disks missing as part of its normal duty cycle.

dmulk · Apr 10, 2018

Just want to confirm that the latency numbers in the Benchmark document are given in "Seconds" and that we would need to multiply by 1000 to convert to ms...

Example: 0,704943 = approx 70ms latency

This seems a bit high for SSDs...

tom · Apr 10, 2018

dmulk said:
Just want to confirm that the latency numbers in the Benchmark document are given in "Seconds" and that we would need to multiply by 1000 to convert to ms...

Example: 0,704943 = approx 70ms latency

This seems a bit high for SSDs...

Do you get better numbers?

Fully replicated network storage depends a lot on the network latency. Compare the numbers of the 10 Gbit and 100 Gbit examples in the PDF.
A fast CPU will also help here.

Alwin · Apr 10, 2018

Seconds to milliseconds, 0.7 s are 700 ms.

dmulk · Apr 10, 2018

tom said:
Do you get better numbers?

Fully replicated network storage depends a lot on the network latency. Compare the numbers of the 10 Gbit and 100 Gbit examples in the PDF.
A fast CPU will also help here.

Sorry, this is what I'm trying to confirm: the numbers in the report with commas are throwing me off... what are these in ms?

700ms? 70ms?

Thanks,
<D>

Alwin · Apr 10, 2018

The decimal and thousands separators are "reversed".

Code:

German      4.333.222.111,00
US-English  4,333,222,111.00

The rados bench writes the latency in seconds, hence the ~700ms on the chart.
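
As a small illustration of the conversion discussed above, this Python sketch parses a value written with a decimal comma and converts seconds to milliseconds. The helper names are made up for this example, and it assumes the input contains no thousands separators.

```python
def parse_decimal_comma(text):
    """Parse a number written with a decimal comma, e.g. '0,704943'."""
    # Assumption: the string has no thousands separators.
    return float(text.replace(",", "."))


def seconds_to_ms(seconds):
    """Convert seconds to milliseconds."""
    return seconds * 1000.0


latency_s = parse_decimal_comma("0,704943")
latency_ms = seconds_to_ms(latency_s)
print(f"{latency_s} s = {latency_ms:.1f} ms")
```

So the value 0,704943 from the report reads as roughly 705 ms, not 70 ms.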

dmulk · Apr 10, 2018

Alwin said:
The decimal and thousands separators are "reversed".

Code:

German      4.333.222.111,00
US-English  4,333,222,111.00

The rados bench writes the latency in seconds, hence the ~700ms on the chart.

Ah....thank you Alwin. Yeah....this seems worse now...700ms seems extremely high...if that were ns that would be fantastic. I'll need to run the tests in my environment and compare. I suspect I'm seeing better numbers because if I were at 700ms latency I would certainly be "hearing about it" from my devs.

Alwin · Apr 10, 2018

Code:

1 Gb/s   = 1 s / 1 Gb   = 0.000000001 s for 1 bit
10 Gb/s  = 1 s / 10 Gb  = 0.0000000001 s for 1 bit
100 Gb/s = 1 s / 100 Gb = 0.00000000001 s for 1 bit

This is only the theoretical difference in how long 1 bit takes to travel. It doesn't account for any hardware, media, hops in between, or the software stack.
https://en.wikipedia.org/wiki/OSI_model
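
The arithmetic above can be reproduced with a short Python sketch. It only computes the theoretical per-bit time and, as an added illustration, the time to put a 4 KiB payload on the wire; the 4 KiB payload size is an assumption chosen for the example, and nothing here models hardware, hops, or the software stack.

```python
def bit_time_seconds(rate_gbit_per_s):
    """Theoretical time to transmit one bit at the given line rate."""
    return 1.0 / (rate_gbit_per_s * 1e9)


def payload_time_seconds(payload_bytes, rate_gbit_per_s):
    """Theoretical time to transmit a payload, ignoring all overhead."""
    return payload_bytes * 8 * bit_time_seconds(rate_gbit_per_s)


for rate in (1, 10, 100):
    per_bit_ns = bit_time_seconds(rate) * 1e9
    per_block_us = payload_time_seconds(4096, rate) * 1e6
    print(f"{rate:>3} Gb/s: {per_bit_ns:.2f} ns per bit, {per_block_us:.2f} us per 4 KiB")
```

Even at 1 Gb/s a 4 KiB payload needs only tens of microseconds on the wire, so the line rate alone does not explain latencies in the hundreds of milliseconds.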

fips · Apr 17, 2018

second round of my ssd benchmark testing:

Code:

SSD                                      Controller                        BW          IOPS
HP SSDF S700 240GB                       -                                  5171 KB/s   1292
Hynix Canvas SL300 240GB                 -                                  5166 KB/s   1291
Intel DC S3520 240GB                     Intel                             83085 KB/s  20771
Plextor PX-256S3C 256GB                  Silicon Motion SM2254 / TLC        5045 KB/s   1261
Seagate XF1230 240GB                     LAMD eMLC                         86084 KB/s  21521
Intel 520 240GB                          -                                   400 KB/s     97
Kingston DC400 480GB                     Phison PS3110-S10 / MLC            1733 KB/s    433
Intel DC S3520 1.6TB                     Intel                             80301 KB/s  20075
Kingston Fury SHFS37A480G 480GB          SandForce SF-2281 / MLC           78618 KB/s  19654
Sandisk Cloudspeed 2 ECO SDLF1DAR 960GB  Toshiba / MLC                     52193 KB/s  13048
Kingston SHSS37A480G 480GB               Phison PS3110-S10 / MLC            1745 KB/s    436
SP Velox V70 480GB                       SandForce SF-2281 / MLC            1713 KB/s    428
Sandisk Cloudspeed Eco SDLFNDAR 480GB    Marvell 88SS9187 8-channel / MLC  81467 KB/s  20366
Sandisk X600 1TB                         Toshiba / TLC                      8623 KB/s   2155
Sandisk Cloudspeed 2 ECO SDLF1DAR 480GB  Toshiba / MLC                     50846 KB/s  12711
Sandisk Cloudspeed SDLFODAR 240GB        Marvell 88SS9187 8-channel / MLC  42968 KB/s  10741
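
The table does not state the block size used. As a hedged sanity check, the Python sketch below assumes 4 KB per I/O (an assumption, not something stated in the post) and shows that the reported IOPS for a few rows then follow directly from the bandwidth column.

```python
BLOCK_KB = 4  # assumption: 4 KB per I/O; the table itself does not say


def iops_from_bandwidth(bw_kb_per_s, block_kb=BLOCK_KB):
    """Estimate IOPS from bandwidth for a fixed block size (rounded down)."""
    return bw_kb_per_s // block_kb


# (bandwidth in KB/s, reported IOPS) copied from three rows of the table
rows = {
    "HP SSDF S700 240GB": (5171, 1292),
    "Intel DC S3520 240GB": (83085, 20771),
    "Seagate XF1230 240GB": (86084, 21521),
}

for name, (bw, reported) in rows.items():
    print(f"{name}: estimated {iops_from_bandwidth(bw)}, reported {reported}")
```

Under that assumption the two columns carry the same information, so either one can be used to compare the drives.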

flaf · May 20, 2018

Hi,

I have built a Proxmox 5.2 cluster with Ceph storage (internal to Proxmox) and I have run a benchmark in a VM with fio. My problem is:

1. I'm not an expert at all in benchmarks (or in fio) and I'm very cautious about benchmarks; maybe my benchmark is not relevant at all.
2. I have absolutely no idea whether my results are good or not (given my configuration, see below).

So I would appreciate some help and explanations 1. to make a relevant benchmark and 2. to know whether the results are good or not (given my configuration).

Thanks a lot in advance for your help.

Regards.

PS: message edited (MTU in the ceph cluster network added).

Here is my configuration:

Code:

- I'm using Proxmox 5.2 with an internal ceph storage (Luminous) on 3 physical nodes.
- All nodes are strictly identical.
- Each one is a DELL PowerEdge 1U R639 server: 2 CPUs Intel Xeon E5-2650, 256GB RAM, 10 disks
  (see below for the storage conf).
- A RAID controller PERC H730P which is set in HBA mode (no RAID).
- Among the 10 disks, there are 2 SSD 200GB Intel S3710 2.5" in RAID1 ZFS dedicated to the OS.
- Among the 10 disks, there are 8 SSD 800GB Intel S3520 2.5" dedicated to the ceph storage
  (one disk = one OSD, so 8x3=24 OSDs in all).
- There is one network card 2x10Gbps SFP+ strictly dedicated to the ceph cluster network.
  I have set a bonding on the two interfaces in active-backup mode with MTU=9000 like this:

auto bond1
iface bond1 inet static
    address      10.123.123.21/24
    slaves       ex0 ex1
    bond_miimon  100
    bond_mode    active-backup
    bond_primary ex0
    mtu          9000

In the cluster, I have set up only one Ceph pool, "ceph-vm", with the following settings (I haven't changed the CRUSH map):

Code:

- size = 3
- min_size = 2
- pg_num = 1024
- rule_name = replicated_rule
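
For orientation, the pool settings above can be related to the number of OSDs with a quick calculation. This Python sketch is only an estimate that assumes placement groups are spread evenly over all 24 OSDs; the function name is made up for illustration.

```python
def pg_replicas_per_osd(pg_num, size, osd_count):
    """Average number of PG replicas each OSD holds, assuming even distribution."""
    return pg_num * size / osd_count


# Values from the configuration above: pg_num = 1024, size = 3, 8 x 3 = 24 OSDs
print(pg_replicas_per_osd(1024, 3, 24))
```

This gives 128 PG replicas per OSD, which is in the region of the roughly 100 per OSD commonly recommended for Ceph.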

When I created the _proxmox_ storage "ceph-vm" (the Proxmox storage which uses the Ceph pool "ceph-vm"), I set "--krbd=false". After that, I installed a small VM (Debian Stretch with 512MB RAM) and ran a fio benchmark in the VM:

Code:

### TL;DR => per job (4 jobs in parallel): read iops ~ 7700 and write iops ~ 2550 ###

root@test:~# cat fio
[rwjob]
readwrite=randrw
rwmixread=75
gtod_reduce=1
bs=4k
size=4G
ioengine=libaio
iodepth=16
direct=1
filename_format=$jobname.$jobnum.$filenum
numjobs=4


root@test:~# fio fio
rwjob: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=16
...
fio-2.16
Starting 4 processes
Jobs: 4 (f=4): [m(4)] [100.0% done] [103.9MB/35007KB/0KB /s] [26.6K/8751/0 iops] [eta 00m:00s]
rwjob: (groupid=0, jobs=1): err= 0: pid=1275: Thu May 17 15:39:01 2018
  read : io=3070.4MB, bw=30821KB/s, iops=7705, runt=102007msec
  write: io=1025.8MB, bw=10297KB/s, iops=2574, runt=102007msec
  cpu          : usr=5.03%, sys=13.49%, ctx=555568, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=785996/w=262580/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=16
rwjob: (groupid=0, jobs=1): err= 0: pid=1276: Thu May 17 15:39:01 2018
  read : io=3072.5MB, bw=30767KB/s, iops=7691, runt=102257msec
  write: io=1023.7MB, bw=10250KB/s, iops=2562, runt=102257msec
  cpu          : usr=5.00%, sys=13.52%, ctx=550809, majf=0, minf=8
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=786533/w=262043/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=16
rwjob: (groupid=0, jobs=1): err= 0: pid=1277: Thu May 17 15:39:01 2018
  read : io=3073.2MB, bw=30869KB/s, iops=7717, runt=101942msec
  write: io=1022.1MB, bw=10275KB/s, iops=2568, runt=101942msec
  cpu          : usr=4.94%, sys=13.58%, ctx=556677, majf=0, minf=7
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=786716/w=261860/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=16
rwjob: (groupid=0, jobs=1): err= 0: pid=1278: Thu May 17 15:39:01 2018
  read : io=3071.6MB, bw=30763KB/s, iops=7690, runt=102243msec
  write: io=1024.5MB, bw=10260KB/s, iops=2565, runt=102243msec
  cpu          : usr=4.94%, sys=13.60%, ctx=552381, majf=0, minf=7
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=786320/w=262256/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: io=12287MB, aggrb=123045KB/s, minb=30762KB/s, maxb=30869KB/s, mint=101942msec, maxt=102257msec
  WRITE: io=4096.7MB, aggrb=41023KB/s, minb=10250KB/s, maxb=10296KB/s, mint=101942msec, maxt=102257msec

Disk stats (read/write):
  vda: ios=3144433/1048390, merge=0/28, ticks=3463440/2746060, in_queue=6209176, util=100.00%
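
Since the output above reports each of the 4 jobs separately, a short Python sketch can add them up. The numbers are copied from the per-job lines of the fio output; the variable names are made up for this example, and the simple sums are not expected to match fio's own aggrb figures exactly.

```python
# Per-job results copied from the fio output above
jobs = [
    {"read_iops": 7705, "write_iops": 2574, "read_kb_s": 30821, "write_kb_s": 10297},
    {"read_iops": 7691, "write_iops": 2562, "read_kb_s": 30767, "write_kb_s": 10250},
    {"read_iops": 7717, "write_iops": 2568, "read_kb_s": 30869, "write_kb_s": 10275},
    {"read_iops": 7690, "write_iops": 2565, "read_kb_s": 30763, "write_kb_s": 10260},
]

total_read_iops = sum(job["read_iops"] for job in jobs)
total_write_iops = sum(job["write_iops"] for job in jobs)
read_share = total_read_iops / (total_read_iops + total_write_iops)

print(f"total read IOPS:  {total_read_iops}")
print(f"total write IOPS: {total_write_iops}")
print(f"read share:       {read_share:.1%}")
```

Summed over all jobs, the VM saw about 30,800 read and 10,300 write IOPS, and the read share of about 75% matches rwmixread=75 in the job file.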

Alwin · May 22, 2018

What do you expect the outcome to be? And what do you want to test?

In general, you need a baseline to compare your tests against. A usual starting point is a test of the capabilities of the underlying hardware on its own (e.g. with fio or iperf). In your case, e.g. a fio test against the Intel S3520.
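
One possible shape for such a baseline test is sketched below in Python. It only builds and prints a fio command line for a single-job 4k synchronous write test; the device path /dev/sdX and the chosen parameters are placeholders and assumptions, not settings taken from this thread. Writing to a raw device destroys the data on it, so this should only ever be pointed at an empty disk.

```python
import shlex


def baseline_fio_command(device, runtime_s=60):
    """Build a fio command for a single-job 4k sync write test (illustrative only)."""
    return [
        "fio",
        "--name=baseline",
        f"--filename={device}",  # WARNING: a raw device will be overwritten
        "--ioengine=libaio",
        "--direct=1",
        "--sync=1",
        "--rw=write",
        "--bs=4k",
        "--numjobs=1",
        "--iodepth=1",
        f"--runtime={runtime_s}",
        "--time_based",
        "--group_reporting",
    ]


# /dev/sdX is a placeholder, not a real device name
command = baseline_fio_command("/dev/sdX")
print(shlex.join(command))
```

The printed command can be reviewed before running it by hand; comparing its result with the in-VM numbers shows how much the network and Ceph layers cost on top of the raw disk.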
