Proxmox, Ceph and local storage performance

Hello,
In our environment I see some performance issues; maybe someone can help me find where the problem is.
We have 6 servers on PVE 4.4 with about 200 VMs (Windows and Linux). All VM disks (rbd) are stored on a separate Ceph cluster (10 servers, 20 SSD OSDs as a cache tier and 48 HDD OSDs).
I ran some IO tests using fio from a Linux VM (Linux test01 3.13.0-110-generic #157-Ubuntu SMP Mon Feb 20 11:54:05 UTC 2017 x86_64 x86_64 x86_64):
  • VM disk stored on ceph:
    • READ: io=540704KB, aggrb=6199KB/s, minb=481KB/s, maxb=3018KB/s, mint=34097msec, maxt=87216msec
    • WRITE: io=278496KB, aggrb=8167KB/s, minb=479KB/s, maxb=14077KB/s, mint=18622msec, maxt=34097msec
  • VM disk stored on one (raid0) SATA drive
    • READ: io=540704KB, aggrb=736KB/s, minb=333KB/s, maxb=361KB/s, mint=49232msec, maxt=733843msec
    • WRITE: io=278496KB, aggrb=5656KB/s, minb=332KB/s, maxb=11234KB/s, mint=23334msec, maxt=49232msec
  • VM disk stored on one (raid0) SAS drive (15k)
    • READ: io=540704KB, aggrb=1597KB/s, minb=498KB/s, maxb=782KB/s, mint=32905msec, maxt=338542msec
    • WRITE: io=278496KB, aggrb=8463KB/s, minb=496KB/s, maxb=39390KB/s, mint=6655msec, maxt=32905msec

VM config is:
agent: 1
balloon: 0
boot: c
bootdisk: virtio0
cores: 4
cpu: host
hotplug: 0
ide2: none,media=cdrom
memory: 8192
name: test
net0: virtio=32:65:61:xx:xx:xx,bridge=vmbr0,tag=2027
numa: 1
ostype: l26
virtio0: ceph01:vm-2027003-disk-3,cache=none,size=10G
virtio1: ceph01:vm-2027003-disk-2,cache=none,size=10G
scsihw: virtio-scsi
smbios1: uuid=8c947036-c62c-4e72-8e4f-f8d1xxxxxxxx
sockets: 2
Proxmox Version (pveversion -v):
proxmox-ve: 4.4-82 (running kernel: 4.4.40-1-pve)
pve-manager: 4.4-12 (running version: 4.4-12/e71b7a74)
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.4.16-1-pve: 4.4.16-64
pve-kernel-4.4.40-1-pve: 4.4.40-82
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-109
pve-firmware: 1.1-10
libpve-common-perl: 4.0-92
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-1
pve-docs: 4.4-3
pve-qemu-kvm: 2.7.1-4
pve-container: 1.0-94
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-3
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80
ceph: 9.2.1-1~bpo80+1

Network interfaces (1 Gbit/s) are never utilized at more than 30%.
Is this an issue with PVE or a mistake in the Ceph configuration? What config should I post here to give you more info?
Best Regards
Mateusz
 
Well, it would be better to post the fio result lines containing "read" or "write" and "iops=". Anyway, I am thinking of a problem on the PVE side, because the "maxb" READ on every storage (especially local!) is slower than the WRITE. It's crazy to have read << write performance.

What does fio show on the hypervisor side? Is there any tuning setup affecting the read cache?
 
There is no tuning on Proxmox; we upgraded from 4.0 last week (but 4.0 showed the same symptoms).
fio 2.1.11 on local storage (SATA) from the hypervisor:

bgwriter: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
queryA: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=mmap, iodepth=1
queryB: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=mmap, iodepth=1
bgupdater: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=16
fio-2.1.11
Starting 4 processes
queryA: Laying out IO file(s) (1 file(s) / 256MB)
queryB: Laying out IO file(s) (1 file(s) / 256MB)
bgupdater: Laying out IO file(s) (1 file(s) / 32MB)
Jobs: 1 (f=1): [_(2),r(1),_(1)] [99.4% done] [3452KB/0KB/0KB /s] [863/0/0 iops] [eta 00m:02s]
bgwriter: (groupid=0, jobs=1): err= 0: pid=30387: Mon Mar 13 15:16:48 2017
write: io=262144KB, bw=2355.4KB/s, iops=588, runt=111297msec
slat (usec): min=7, max=183, avg=24.83, stdev= 4.40
clat (msec): min=1, max=1503, avg=54.31, stdev=52.23
lat (msec): min=1, max=1503, avg=54.34, stdev=52.23
clat percentiles (msec):
| 1.00th=[ 4], 5.00th=[ 7], 10.00th=[ 10], 20.00th=[ 16],
| 30.00th=[ 23], 40.00th=[ 30], 50.00th=[ 39], 60.00th=[ 49],
| 70.00th=[ 63], 80.00th=[ 84], 90.00th=[ 120], 95.00th=[ 157],
| 99.00th=[ 245], 99.50th=[ 285], 99.90th=[ 388], 99.95th=[ 433],
| 99.99th=[ 562]
bw (KB /s): min= 1181, max= 2648, per=100.00%, avg=2358.99, stdev=154.17
lat (msec) : 2=0.01%, 4=1.09%, 10=8.97%, 20=16.32%, 50=34.62%
lat (msec) : 100=24.53%, 250=13.54%, 500=0.90%, 750=0.02%, 2000=0.01%
cpu : usr=0.75%, sys=2.28%, ctx=63568, majf=0, minf=147
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=65536/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32
queryA: (groupid=0, jobs=1): err= 0: pid=30388: Mon Mar 13 15:16:48 2017
read : io=262144KB, bw=832342B/s, iops=203, runt=322506msec
clat (usec): min=100, max=216863, avg=4912.76, stdev=7368.91
lat (usec): min=100, max=216864, avg=4913.08, stdev=7368.88
clat percentiles (usec):
| 1.00th=[ 274], 5.00th=[ 338], 10.00th=[ 1224], 20.00th=[ 1832],
| 30.00th=[ 2384], 40.00th=[ 2928], 50.00th=[ 3472], 60.00th=[ 3984],
| 70.00th=[ 4512], 80.00th=[ 5280], 90.00th=[ 8512], 95.00th=[12864],
| 99.00th=[38144], 99.50th=[51456], 99.90th=[88576], 99.95th=[102912],
| 99.99th=[150528]
bw (KB /s): min= 113, max= 2751, per=49.14%, avg=815.66, stdev=472.12
lat (usec) : 250=0.49%, 500=6.15%, 750=0.31%, 1000=0.82%
lat (msec) : 2=15.31%, 4=36.99%, 10=32.80%, 20=4.20%, 50=2.40%
lat (msec) : 100=0.46%, 250=0.06%
cpu : usr=0.28%, sys=0.75%, ctx=65554, majf=65536, minf=46
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=65536/w=0/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1
queryB: (groupid=0, jobs=1): err= 0: pid=30389: Mon Mar 13 15:16:48 2017
read : io=262144KB, bw=824210B/s, iops=201, runt=325688msec
clat (usec): min=97, max=335631, avg=4959.28, stdev=8171.98
lat (usec): min=98, max=335631, avg=4959.74, stdev=8171.98
clat percentiles (usec):
| 1.00th=[ 241], 5.00th=[ 306], 10.00th=[ 1208], 20.00th=[ 1816],
| 30.00th=[ 2384], 40.00th=[ 2896], 50.00th=[ 3440], 60.00th=[ 3952],
| 70.00th=[ 4448], 80.00th=[ 5152], 90.00th=[ 8384], 95.00th=[12864],
| 99.00th=[40704], 99.50th=[58112], 99.90th=[100864], 99.95th=[119296],
| 99.99th=[164864]
bw (KB /s): min= 78, max= 3417, per=48.69%, avg=808.28, stdev=498.22
lat (usec) : 100=0.01%, 250=1.50%, 500=4.93%, 750=0.32%, 1000=1.17%
lat (msec) : 2=15.51%, 4=37.55%, 10=32.05%, 20=4.02%, 50=2.26%
lat (msec) : 100=0.58%, 250=0.10%, 500=0.01%
cpu : usr=0.31%, sys=0.72%, ctx=65543, majf=65536, minf=35
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=65536/w=0/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1
bgupdater: (groupid=0, jobs=1): err= 0: pid=30390: Mon Mar 13 15:16:48 2017
read : io=16416KB, bw=112303B/s, iops=27, runt=149684msec
slat (usec): min=8, max=95, avg=21.72, stdev= 5.18
clat (usec): min=152, max=183931, avg=6610.63, stdev=12333.95
lat (usec): min=169, max=183957, avg=6632.88, stdev=12334.56
clat percentiles (usec):
| 1.00th=[ 219], 5.00th=[ 282], 10.00th=[ 924], 20.00th=[ 1640],
| 30.00th=[ 2352], 40.00th=[ 3024], 50.00th=[ 3696], 60.00th=[ 4256],
| 70.00th=[ 5024], 80.00th=[ 7072], 90.00th=[11840], 95.00th=[23168],
| 99.00th=[68096], 99.50th=[84480], 99.90th=[125440], 99.95th=[158720],
| 99.99th=[183296]
bw (KB /s): min= 4, max= 468, per=7.66%, avg=127.18, stdev=149.98
write: io=16352KB, bw=111865B/s, iops=27, runt=149684msec
slat (usec): min=9, max=94, avg=24.15, stdev= 4.39
clat (msec): min=1, max=1308, avg=29.98, stdev=75.57
lat (msec): min=1, max=1308, avg=30.00, stdev=75.57
clat percentiles (msec):
| 1.00th=[ 3], 5.00th=[ 4], 10.00th=[ 4], 20.00th=[ 5],
| 30.00th=[ 6], 40.00th=[ 7], 50.00th=[ 8], 60.00th=[ 9],
| 70.00th=[ 12], 80.00th=[ 17], 90.00th=[ 77], 95.00th=[ 174],
| 99.00th=[ 363], 99.50th=[ 482], 99.90th=[ 709], 99.95th=[ 824],
| 99.99th=[ 1303]
bw (KB /s): min= 4, max= 469, per=6.57%, avg=122.25, stdev=151.44
lat (usec) : 250=1.79%, 500=2.27%, 750=0.50%, 1000=1.01%
lat (msec) : 2=7.14%, 4=22.09%, 10=42.26%, 20=11.44%, 50=4.55%
lat (msec) : 100=2.58%, 250=2.94%, 500=1.18%, 750=0.20%, 1000=0.02%
lat (msec) : 2000=0.01%
cpu : usr=0.27%, sys=0.20%, ctx=8197, majf=0, minf=9
IO depths : 1=99.8%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=4104/w=4088/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
READ: io=540704KB, aggrb=1660KB/s, minb=109KB/s, maxb=812KB/s, mint=149684msec, maxt=325688msec
WRITE: io=278496KB, aggrb=1860KB/s, minb=109KB/s, maxb=2355KB/s, mint=111297msec, maxt=149684msec

Disk stats (read/write):
dm-0: ios=135160/70370, merge=0/0, ticks=669476/3928668, in_queue=4601776, util=100.00%, aggrios=135176/70297, aggrmerge=0/73, aggrticks=669364/3901808, aggrin_queue=4571056, aggrutil=100.00%
sda: ios=135176/70297, merge=0/73, ticks=669364/3901808, in_queue=4571056, util=100.00%
 
Can you test fio from a non-PVE live Linux? Is the problem on all hypervisors (same HW, basic info)?

Your results are hard to read, but compare yours:
Code:
read : io=262144KB, bw=832342B/s, iops=203, runt=322506msec
read : io=262144KB, bw=824210B/s, iops=201, runt=325688msec
read : io=16416KB, bw=112303B/s, iops=27, runt=149684msec

write: io=262144KB, bw=2355.4KB/s, iops=588, runt=111297msec
write: io=16352KB, bw=111865B/s, iops=27, runt=149684msec

with (SAS 10k 2x300GB, P410i, HP DL1xx G6, pve-manager/4.4-12/e71b7a74 (running kernel: 4.4.44-1-pve)):
Code:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75

  read : io=3071.7MB, bw=3780.6KB/s, iops=945, runt=831985msec
  write: io=1024.4MB, bw=1260.8KB/s, iops=315, runt=831985msec

What are your iowait and average load when you run the read tests?
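For reference, a minimal way to capture this while the fio job runs might be (a sketch; the intervals are arbitrary):
Code:
# run in a second shell on the hypervisor while fio is running
iostat -x 5      # %iowait plus per-device await, avgqu-sz and %util
vmstat 5         # the 'wa' column shows iowait, 'r'/'b' the run/blocked queues
uptime           # 1/5/15-minute load averages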
 
Hello,
In our environment I see some performance issues; maybe someone can help me find where the problem is.
We have 6 servers on PVE 4.4 with about 200 VMs (Windows and Linux).
Hi,
approx. 30 VMs per server, with a 1 Gbit connection to a Ceph cluster?
All VM disks (rbd) are stored on a separate Ceph cluster (10 servers, 20 SSD OSDs as a cache tier and 48 HDD OSDs).
Does this mean an EC pool - with the journals on the SSDs too, or are the SSDs only for the cache tier?
EC pools are not the fastest ones... especially if the data isn't in the cache...
How IO-saturated is your Ceph cluster?
I ran some IO tests using fio from a Linux VM (Linux test01 3.13.0-110-generic #157-Ubuntu SMP Mon Feb 20 11:54:05 UTC 2017 x86_64 x86_64 x86_64):
  • VM disk stored on ceph:
    • READ: io=540704KB, aggrb=6199KB/s, minb=481KB/s, maxb=3018KB/s, mint=34097msec, maxt=87216msec
    • WRITE: io=278496KB, aggrb=8167KB/s, minb=479KB/s, maxb=14077KB/s, mint=18622msec, maxt=34097msec
  • VM disk stored on one (raid0) SATA drive
    • READ: io=540704KB, aggrb=736KB/s, minb=333KB/s, maxb=361KB/s, mint=49232msec, maxt=733843msec
    • WRITE: io=278496KB, aggrb=5656KB/s, minb=332KB/s, maxb=11234KB/s, mint=23334msec, maxt=49232msec
  • VM disk stored on one (raid0) SAS drive (15k)
    • READ: io=540704KB, aggrb=1597KB/s, minb=498KB/s, maxb=782KB/s, mint=32905msec, maxt=338542msec
    • WRITE: io=278496KB, aggrb=8463KB/s, minb=496KB/s, maxb=39390KB/s, mint=6655msec, maxt=32905msec
How does the test look with a 4 MB block size?

That write is faster than read is IMHO quite normal - AFAIK the rbd driver combines small writes into bigger ones.

Udo
 
Can you test fio from a non-PVE live Linux? Is the problem on all hypervisors (same HW, basic info)?
I tested from a live CD on my laptop, using fio and this config:
Code:
[global]
ioengine=rbd 
clientname=admin
pool=sata
rbdname=fio_test
invalidate=0    # mandatory
rw=randwrite
bs=4k

[rbd_iodepth32]
iodepth=32

Result:
Code:
write: io=2048.0MB, bw=7717.5KB/s, iops=1929, runt=271742msec

So it looks like a Ceph problem.
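Another way to confirm this at the RADOS level, bypassing librbd entirely, would be a raw rados bench against the same pool (a hedged sketch; pool name, runtime and concurrency are examples):
Code:
# 4k writes with 16 concurrent ops against the 'sata' pool
rados bench -p sata 60 write -b 4096 -t 16 --no-cleanup
# sequential reads of the objects written above
rados bench -p sata 60 seq -t 16
# remove the benchmark objects afterwards
rados -p sata cleanup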
What are your iowait and average load when you run the read tests?
Next I ran the tests again on local storage on the main hypervisors.
Command for bs=4M
Code:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=/var/lib/vz/images/fio_4M_test --bs=4M --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
Command for bs=4K
Code:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=/var/lib/vz/images/fio_4k_test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
Results are in the attached files libaio_bs_4k_size_2G_randrw.png and libaio_bs_4M_size_8G_randrw.png.
The values for iowait, avgqu-sz and await are from the command iostat -x 10 5.

Finally, I ran a bs=4M test inside the guest with fio:
Code:
[global]
rw=randread
size=256m
directory=/root/fio-testing/data
ioengine=libaio
iodepth=4
invalidate=1
direct=1
bs=4M

[bgwriter]
rw=randwrite
iodepth=32

[queryA]
iodepth=1
ioengine=mmap
direct=0
thinktime=3

[queryB]
iodepth=1
ioengine=mmap
direct=0
thinktime=5

[bgupdater]
rw=randrw
iodepth=16
thinktime=40
size=32m
and the result:
Code:
bgwriter: write: io=262144KB, bw=45104KB/s, iops=11, runt=  5812msec
queryA:   read : io=262144KB, bw=1741.9KB/s, iops=0, runt=150499msec
queryB:   read : io=262144KB, bw=1726.6KB/s, iops=0, runt=151829msec
bgupdater:   read : io=16384KB, bw=6246.3KB/s, iops=1, runt=  2623msec
The write speed for bs=4M inside the guest is OK, so the problem is with bs=4k.
How can I improve the write speed for 4k writes on PVE?
 

Hi,
approx. 30 VMs per server, with a 1 Gbit connection to a Ceph cluster?
Yes, but network device usage is at most 30%.
Does this mean an EC pool - with the journals on the SSDs too, or are the SSDs only for the cache tier?
EC pools are not the fastest ones... especially if the data isn't in the cache...
This is a replicated (replica 3) pool with a cache tier and journals on SSD. All SSD drives are Intel SSDSC2BX200G4.
How IO-saturated is your Ceph cluster?
How can I check this?
How does the test look with a 4 MB block size?
A 4 MB block size on local storage gets 69 MB/s read and 20 MB/s write on SATA.

That write is faster than read is IMHO quite normal - AFAIK the rbd driver combines small writes into bigger ones.

Udo
OK
 
Yes, but network device usage is at most 30%.

This is a replicated (replica 3) pool with a cache tier and journals on SSD. All SSD drives are Intel SSDSC2BX200G4.
i.e. the SSD DC S3610 - these should be OK.
Replica 3 with a cache tier?? Are you sure? That sounds to me like an EC pool with a cache tier. But this shouldn't change anything for your speed tests.
How can I check this?
atop is a nice tool
A 4 MB block size on local storage gets 69 MB/s read and 20 MB/s write on SATA.
Yes - but what values do you get on Ceph?

Because of the 4k access: with small blocks, latency has a much higher impact - and 1 Gbit Ethernet has much higher latency than 10 Gbit Ethernet.
This is one reason why many people use 10 Gbit NICs for Ceph/iSCSI and so on.
And normally the test is 4k for IOPS and 4M for throughput.
Even when you buy an SSD, the spec sheet says "> 500 MB/s + 50k IOPS" - that means more than 500 MB/s with 4 MB blocks (which is > 125 IOPS), but with 4K blocks and 50k IOPS you get about 195 MB/s.
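To spell out that arithmetic (just the datasheet numbers from the example above):
Code:
# throughput = IOPS x block size
#   4M blocks:  500 MB/s / 4 MB      ->  ~125 IOPS
#   4K blocks:  50,000 IOPS x 4 KiB  ->  ~195 MiB/s
echo "$(( 50000 * 4 / 1024 )) MiB/s"   # prints: 195 MiB/s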

You can watch with "ceph -w" how much data your Ceph cluster is serving - or use ceph-dash as a GUI (or other tools).
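A few concrete ways to watch the saturation (a sketch; run on a node with an admin keyring):
Code:
ceph -s          # cluster health plus current client IO (rd/wr bandwidth, op/s)
ceph -w          # the same, but streaming
ceph osd perf    # per-OSD commit/apply latency - slow OSDs stand out here
atop             # on the OSD nodes: per-disk busy% and per-process IO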

Udo
 
Can you post the following, please?
  • Ceph server specs (cpu, ram, networking)
  • Ceph config (including networking)
  • Do you use the 20 cache-tier SSD's as pure cache or also as a journal?
  • Ceph crush map.
preferably encapsulated by code/quote bb-code.
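In case it helps, the commands that typically produce that information are roughly (a sketch; output paths are just examples):
Code:
cat /etc/ceph/ceph.conf                               # ceph config (on the ceph nodes)
ceph osd tree                                         # OSD/host/rack layout with weights
ceph osd pool ls detail                               # pool and cache-tier settings
ceph osd getcrushmap -o /tmp/crushmap.bin             # dump the crush map...
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt   # ...and decompile it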

Replica 3 with a cache tier?? Are you sure? That sounds to me like an EC pool with a cache tier.
[...]
Udo

You can put a pool as cache on any other pool, even a cache-pool.
 
Can you post the following, please?
  • Ceph server specs (cpu, ram, networking)
Code:
ceph10: 2x E5504  @ 2.00GHz, 32GB RAM, 4x NetXtreme II BCM5709  Gigabit Ethernet (2 active)
ceph15: 2x E5504  @ 2.00GHz, 32GB RAM, 4x NetXtreme II BCM5709  Gigabit Ethernet (2 active)
ceph20: 2x E5410  @ 2.33GHz, 32GB RAM, 4x 82571EB Gigabit Ethernet Controller (2 active)
ceph25: 2x E5620  @ 2.40GHz, 32GB RAM, 4x NetXtreme II BCM5709 Gigabit Ethernet (2 active)
ceph30: 1x E5530  @ 2.40GHz, 32GB RAM, 2x 82571EB Gigabit Ethernet Controller (1 active), 4x  NetXtreme II BCM5709 Gigabit Ethernet (1 active)
ceph35: 2x E5540  @ 2.53GHz, 32GB RAM, 4x NetXtreme II BCM5709 Gigabit Ethernet (2 active)
ceph40: 2x X5670  @ 2.93GHz, 32GB RAM, 4x NetXtreme II BCM5709 Gigabit Ethernet (2 active)
ceph45: 2x X5670  @ 2.93GHz, 32GB RAM, 4x NetXtreme II BCM5709 Gigabit Ethernet (2 active)
ceph50: 2x X5670  @ 2.93GHz, 32GB RAM, 4x NetXtreme II BCM5709 Gigabit Ethernet (2 active)
ceph55: 2x X5670  @ 2.93GHz, 32GB RAM, 4x NetXtreme II BCM5709 Gigabit Ethernet (2 active)
  • Ceph config (including networking)
Network configuration:
Code:
#ceph10:
 auto lo
 iface lo inet loopback
 auto em1
 iface em1 inet static
         address 10.20.8.10
         netmask 255.255.252.0
         network 10.20.8.0
         broadcast 10.20.11.255
         gateway 10.20.8.1
         dns-nameservers 8.8.8.8
 auto em4
 iface em4 inet static
         address 10.20.4.10
         netmask 255.255.252.0
         network 10.20.4.0
         broadcast 10.20.7.255

#ceph15:
 auto lo
 iface lo inet loopback
 auto em1
 iface em1 inet static
         address 10.20.8.15
         netmask 255.255.252.0
         network 10.20.8.0
         broadcast 10.20.11.255
         gateway 10.20.8.1
         dns-nameservers 8.8.8.8 8.8.4.4
 auto em4
 iface em4 inet static
         address 10.20.4.15
         netmask 255.255.252.0
         network 10.20.4.0
         broadcast 10.20.7.255
#ceph20:
 auto lo
 iface lo inet loopback
 auto eth0
 iface eth0 inet static
         address 10.20.8.20
         netmask 255.255.252.0
         network 10.20.8.0
         broadcast 10.20.11.255
         gateway 10.20.8.1
         dns-nameservers 8.8.8.8 8.8.4.4
 auto eth2
 iface eth2 inet static
         address 10.20.4.20
         netmask 255.255.252.0
         network 10.20.4.0
         broadcast 10.20.7.255

#ceph25:
 auto lo
 iface lo inet loopback
 auto em1
 iface em1 inet static
         address 10.20.8.25
         netmask 255.255.252.0
         network 10.20.8.0
         broadcast 10.20.11.255
         gateway 10.20.8.1
         dns-nameservers 8.8.8.8 8.8.4.4
 auto em4
 iface em4 inet static
         address 10.20.4.25
         netmask 255.255.252.0
         network 10.20.4.0
         broadcast 10.20.7.255

#ceph30:
 auto lo
 iface lo inet loopback
 auto em1
 iface em1 inet static
         address 10.20.8.30
         netmask 255.255.252.0
         network 10.20.8.0
         broadcast 10.20.11.255
         gateway 10.20.8.1
         dns-nameservers 8.8.8.8 8.8.4.4
 auto p4p2
 iface p4p2 inet static
         address 10.20.4.30
         netmask 255.255.252.0
         network 10.20.4.0
         broadcast 10.20.7.255

#ceph35:
 auto lo
 iface lo inet loopback
 auto em1
 iface em1 inet static
         address 10.20.8.35
         netmask 255.255.252.0
         network 10.20.8.0
         broadcast 10.20.11.255
         gateway 10.20.8.1
         dns-nameservers 8.8.8.8 8.8.4.4
 auto em4
 iface em4 inet static
         address 10.20.4.35
         netmask 255.255.252.0
         network 10.20.4.0
         broadcast 10.20.7.255

#ceph40:
 auto lo
 iface lo inet loopback
 auto em1
 iface em1 inet static
         address 10.20.8.40
         netmask 255.255.252.0
         network 10.20.8.0
         broadcast 10.20.11.255
         gateway 10.20.8.1
         dns-nameservers 8.8.8.8
 auto em3
 iface em3 inet static
         address 10.20.4.40
         netmask 255.255.252.0
         network 10.20.4.0
         broadcast 10.20.7.255

#ceph45
 auto lo
 iface lo inet loopback
 auto em1
 iface em1 inet static
         address 10.20.8.45
         netmask 255.255.252.0
         network 10.20.8.0
         broadcast 10.20.11.255
         gateway 10.20.8.1
         dns-nameservers 8.8.8.8
 auto em3
 iface em3 inet static
         address 10.20.4.45
         netmask 255.255.252.0
         network 10.20.4.0
         broadcast 10.20.7.255

#ceph50:
 auto lo
 iface lo inet loopback
 auto em1
 iface em1 inet static
         address 10.20.8.50
         netmask 255.255.252.0
         network 10.20.8.0
         broadcast 10.20.11.255
         gateway 10.20.8.1
         dns-nameservers 8.8.8.8
 auto em3
 iface em3 inet static
         address 10.20.4.50
         netmask 255.255.252.0
         network 10.20.4.0
         broadcast 10.20.7.255

#ceph55:
 auto lo
 iface lo inet loopback
 auto em1
 iface em1 inet static
         address 10.20.8.55
         netmask 255.255.252.0
         network 10.20.8.0
         broadcast 10.20.11.255
         gateway 10.20.8.1
         dns-nameservers 8.8.8.8
 auto em3
 iface em3 inet static
         address 10.20.4.55
         netmask 255.255.252.0
         network 10.20.4.0
         broadcast 10.20.7.255
Ceph.conf:
Code:
[global]

fsid=some_uuid

mon initial members =ceph55, ceph50, ceph45, ceph40, ceph35, ceph30, ceph25, ceph20, ceph15, ceph10
mon host = 10.20.8.55, 10.20.8.50, 10.20.8.45, 10.20.8.40, 10.20.8.35, 10.20.8.30, 10.20.8.25, 10.20.8.20, 10.20.8.15, 10.20.8.10


public network = 10.20.8.0/22
cluster network = 10.20.4.0/22

filestore xattr use omap = true
filestore max sync interval = 30


osd journal size = 10240
osd mount options xfs = "rw,noatime,inode64,allocsize=4M"
osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 2048
osd pool default pgp num = 2048
osd disk thread ioprio class = idle
osd disk thread ioprio priority = 7
osd crush update on start = false

osd crush chooseleaf type = 1
osd recovery max active = 1
osd recovery op priority = 1
osd max backfills = 1

auth cluster required = cephx
auth service required = cephx
auth client required = cephx

rbd default format = 2

##ceph35 osds
[osd.0]
cluster addr = 10.20.4.35
public addr = 10.20.8.35
[osd.1]
cluster addr = 10.20.4.35
public addr = 10.20.8.35
[osd.2]
cluster addr = 10.20.4.35
public addr = 10.20.8.35
[osd.3]
cluster addr = 10.20.4.35
public addr = 10.20.8.35
[osd.4]
cluster addr = 10.20.4.35
[osd.5]
cluster addr = 10.20.4.35
[osd.66]
cluster addr = 10.20.4.35
[osd.67]
cluster addr = 10.20.4.35

##ceph25 osds
[osd.6]
cluster addr = 10.20.4.25
[osd.7]
cluster addr = 10.20.4.25
[osd.8]
cluster addr = 10.20.4.25
[osd.9]
cluster addr = 10.20.4.25
[osd.10]
cluster addr = 10.20.4.25
[osd.11]
cluster addr = 10.20.4.25
[osd.62]
cluster addr = 10.20.4.25
[osd.63]
cluster addr = 10.20.4.25

##ceph15 osds
[osd.12]
cluster addr = 10.20.4.15
[osd.13]
cluster addr = 10.20.4.15
[osd.14]
cluster addr = 10.20.4.15
[osd.15]
cluster addr = 10.20.4.15
[osd.58]
cluster addr = 10.20.4.15
[osd.59]
cluster addr = 10.20.4.15


##ceph30 osds
[osd.16]
cluster addr = 10.20.4.30
[osd.17]
cluster addr = 10.20.4.30
[osd.18]
cluster addr = 10.20.4.30
[osd.19]
cluster addr = 10.20.4.30
[osd.20]
cluster addr = 10.20.4.30
[osd.21]
cluster addr = 10.20.4.30
[osd.64]
cluster addr = 10.20.4.30
[osd.65]
cluster addr = 10.20.4.30

##ceph20 osds
[osd.22]
cluster addr = 10.20.4.20
[osd.23]
cluster addr = 10.20.4.20
[osd.24]
cluster addr = 10.20.4.20
[osd.25]
cluster addr = 10.20.4.20
[osd.26]
cluster addr = 10.20.4.20
[osd.27]
cluster addr = 10.20.4.20
[osd.60]
cluster addr = 10.20.4.20
[osd.61]
cluster addr = 10.20.4.20

##ceph10 osd
[osd.28]
cluster addr = 10.20.4.10
[osd.29]
cluster addr = 10.20.4.10
[osd.30]
cluster addr = 10.20.4.10
[osd.31]
cluster addr = 10.20.4.10
[osd.56]
cluster addr = 10.20.4.10
[osd.57]
cluster addr = 10.20.4.10

#ceph40 osd
[osd.32]
cluster addr = 10.20.4.40
[osd.33]
cluster addr = 10.20.4.40
[osd.34]
cluster addr = 10.20.4.40
[osd.35]
cluster addr = 10.20.4.40
[osd.36]
cluster addr = 10.20.4.40
[osd.52]
cluster addr = 10.20.4.40

#ceph45 osd
[osd.37]
cluster addr = 10.20.4.45
[osd.38]
cluster addr = 10.20.4.45
[osd.39]
cluster addr = 10.20.4.45
[osd.40]
cluster addr = 10.20.4.45
[osd.41]
cluster addr = 10.20.4.45
[osd.54]
cluster addr = 10.20.4.45

#ceph50 osd
[osd.42]
cluster addr = 10.20.4.50
[osd.43]
cluster addr = 10.20.4.50
[osd.44]
cluster addr = 10.20.4.50
[osd.45]
cluster addr = 10.20.4.50
[osd.46]
cluster addr = 10.20.4.50
[osd.53]
cluster addr = 10.20.4.50

#ceph55 osd
[osd.47]
cluster addr = 10.20.4.55
[osd.48]
cluster addr = 10.20.4.55
[osd.49]
cluster addr = 10.20.4.55
[osd.50]
cluster addr = 10.20.4.55
[osd.51]
cluster addr = 10.20.4.55
[osd.55]
cluster addr = 10.20.4.55



[mon.ceph35]
host = ceph35
mon addr = 10.20.8.35:6789
[mon.ceph30]
host = ceph30
mon addr = 10.20.8.30:6789
[mon.ceph20]
host = ceph20
mon addr = 10.20.8.20:6789
[mon.ceph15]
host = ceph15
mon addr = 10.20.8.15:6789
[mon.ceph25]
mon addr = 10.20.8.25:6789
[mon.ceph10]
host = ceph10
mon addr = 10.20.8.10:6789
[mon.ceph40]
host = ceph40
mon addr = 10.20.8.40:6789
[mon.ceph45]
host = ceph45
mon addr = 10.20.8.45:6789
[mon.ceph50]
host = ceph50
mon addr = 10.20.8.50:6789
[mon.ceph55]
host = ceph55
mon addr = 10.20.8.55:6789
  • Do you use the 20 cache-tier SSD's as pure cache or also as a journal?
The disks are also used as journals. The system disk is an SSD with one partition for the system, 6 or 8 partitions (10 GB each) for OSD journals, and the remaining free space as an OSD in the cache-tier pool (~100 GB); the second SSD is used only as an OSD for the cache tier.
  • Ceph crush map.
Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
device 22 osd.22
device 23 osd.23
device 24 osd.24
device 25 osd.25
device 26 osd.26
device 27 osd.27
device 28 osd.28
device 29 osd.29
device 30 osd.30
device 31 osd.31
device 32 osd.32
device 33 osd.33
device 34 osd.34
device 35 osd.35
device 36 osd.36
device 37 osd.37
device 38 osd.38
device 39 osd.39
device 40 osd.40
device 41 osd.41
device 42 osd.42
device 43 osd.43
device 44 osd.44
device 45 osd.45
device 46 osd.46
device 47 osd.47
device 48 osd.48
device 49 osd.49
device 50 osd.50
device 51 osd.51
device 52 osd.52
device 53 osd.53
device 54 osd.54
device 55 osd.55
device 56 osd.56
device 57 osd.57
device 58 osd.58
device 59 osd.59
device 60 osd.60
device 61 osd.61
device 62 osd.62
device 63 osd.63
device 64 osd.64
device 65 osd.65
device 66 osd.66
device 67 osd.67

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host ceph30 {
        id -5           # do not change unnecessarily
        # weight 5.000
        alg straw
        hash 0  # rjenkins1
        item osd.19 weight 0.910
        item osd.17 weight 0.910
        item osd.20 weight 0.680
        item osd.16 weight 0.680
        item osd.18 weight 0.910
        item osd.21 weight 0.910
}
host ceph20 {
        id -6           # do not change unnecessarily
        # weight 3.890
        alg straw
        hash 0  # rjenkins1
        item osd.22 weight 0.550
        item osd.24 weight 0.680
        item osd.25 weight 0.680
        item osd.26 weight 0.680
        item osd.27 weight 0.680
        item osd.23 weight 0.620
}
host ceph10 {
        id -7           # do not change unnecessarily
        # weight 3.639
        alg straw
        hash 0  # rjenkins1
        item osd.28 weight 0.910
        item osd.30 weight 0.910
        item osd.31 weight 0.910
        item osd.29 weight 0.909
}
rack skwer {
        id -10          # do not change unnecessarily
        # weight 12.529
        alg straw
        hash 0  # rjenkins1
        item ceph30 weight 5.000
        item ceph20 weight 3.890
        item ceph10 weight 3.639
}
host ceph35 {
        id -2           # do not change unnecessarily
        # weight 5.410
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 0.900
        item osd.1 weight 0.900
        item osd.2 weight 0.900
        item osd.3 weight 0.900
        item osd.5 weight 0.900
        item osd.4 weight 0.910
}
host ceph25 {
        id -3           # do not change unnecessarily
        # weight 4.310
        alg straw
        hash 0  # rjenkins1
        item osd.6 weight 0.680
        item osd.7 weight 0.680
        item osd.8 weight 0.680
        item osd.9 weight 0.680
        item osd.11 weight 0.680
        item osd.10 weight 0.910
}
host ceph15 {
        id -4           # do not change unnecessarily
        # weight 3.640
        alg straw
        hash 0  # rjenkins1
        item osd.12 weight 0.910
        item osd.13 weight 0.910
        item osd.14 weight 0.910
        item osd.15 weight 0.910
}
rack nzoz {
        id -20          # do not change unnecessarily
        # weight 13.360
        alg straw
        hash 0  # rjenkins1
        item ceph35 weight 5.410
        item ceph25 weight 4.310
        item ceph15 weight 3.640
}
root default {
        id -1           # do not change unnecessarily
        # weight 25.889
        alg straw
        hash 0  # rjenkins1
        item skwer weight 12.529
        item nzoz weight 13.360
}
host ceph40-ssd {
        id -16          # do not change unnecessarily
        # weight 0.296
        alg straw
        hash 0  # rjenkins1
        item osd.32 weight 0.171
        item osd.52 weight 0.125
}
host ceph50-ssd {
        id -19          # do not change unnecessarily
        # weight 0.296
        alg straw
        hash 0  # rjenkins1
        item osd.42 weight 0.171
        item osd.53 weight 0.125
}
rack skwer-ssd {
        id -9           # do not change unnecessarily
        # weight 0.592
        alg straw
        hash 0  # rjenkins1
        item ceph40-ssd weight 0.296
        item ceph50-ssd weight 0.296
}
host ceph45-ssd {
        id -17          # do not change unnecessarily
        # weight 0.296
        alg straw
        hash 0  # rjenkins1
        item osd.37 weight 0.171
        item osd.54 weight 0.125
}
host ceph55-ssd {
        id -22          # do not change unnecessarily
        # weight 0.296
        alg straw
        hash 0  # rjenkins1
        item osd.47 weight 0.171
        item osd.55 weight 0.125
}
rack nzoz-ssd {
        id -11          # do not change unnecessarily
        # weight 0.592
        alg straw
        hash 0  # rjenkins1
        item ceph45-ssd weight 0.296
        item ceph55-ssd weight 0.296
}
root ssd {
        id -8           # do not change unnecessarily
        # weight 1.184
        alg straw
        hash 0  # rjenkins1
        item skwer-ssd weight 0.592
        item nzoz-ssd weight 0.592
}
host ceph40-sata {
        id -15          # do not change unnecessarily
        # weight 7.272
        alg straw
        hash 0  # rjenkins1
        item osd.33 weight 1.818
        item osd.34 weight 1.818
        item osd.35 weight 1.818
        item osd.36 weight 1.818
}
host ceph50-sata {
        id -21          # do not change unnecessarily
        # weight 7.272
        alg straw
        hash 0  # rjenkins1
        item osd.43 weight 1.818
        item osd.44 weight 1.818
        item osd.45 weight 1.818
        item osd.46 weight 1.818
}
rack skwer-sata {
        id -13          # do not change unnecessarily
        # weight 14.544
        alg straw
        hash 0  # rjenkins1
        item ceph40-sata weight 7.272
        item ceph50-sata weight 7.272
}
host ceph45-sata {
        id -18          # do not change unnecessarily
        # weight 7.272
        alg straw
        hash 0  # rjenkins1
        item osd.38 weight 1.818
        item osd.39 weight 1.818
        item osd.40 weight 1.818
        item osd.41 weight 1.818
}
host ceph55-sata {
        id -23          # do not change unnecessarily
        # weight 7.272
        alg straw
        hash 0  # rjenkins1
        item osd.48 weight 1.818
        item osd.49 weight 1.818
        item osd.50 weight 1.818
        item osd.51 weight 1.818
}
rack nzoz-sata {
        id -14          # do not change unnecessarily
        # weight 14.544
        alg straw
        hash 0  # rjenkins1
        item ceph45-sata weight 7.272
        item ceph55-sata weight 7.272
}
root sata {
        id -12          # do not change unnecessarily
        # weight 29.088
        alg straw
        hash 0  # rjenkins1
        item skwer-sata weight 14.544
        item nzoz-sata weight 14.544
}
host ceph10-ssd {
        id -27          # do not change unnecessarily
        # weight 0.296
        alg straw
        hash 0  # rjenkins1
        item osd.56 weight 0.171
        item osd.57 weight 0.125
}
host ceph20-ssd {
        id -29          # do not change unnecessarily
        # weight 0.278
        alg straw
        hash 0  # rjenkins1
        item osd.60 weight 0.175
        item osd.61 weight 0.103
}
host ceph30-ssd {
        id -31          # do not change unnecessarily
        # weight 0.278
        alg straw
        hash 0  # rjenkins1
        item osd.64 weight 0.175
        item osd.65 weight 0.103
}
rack rbd-cache-skwer {
        id -25          # do not change unnecessarily
        # weight 0.852
        alg straw
        hash 0  # rjenkins1
        item ceph10-ssd weight 0.296
        item ceph20-ssd weight 0.278
        item ceph30-ssd weight 0.278
}
host ceph15-ssd {
        id -28          # do not change unnecessarily
        # weight 0.296
        alg straw
        hash 0  # rjenkins1
        item osd.58 weight 0.171
        item osd.59 weight 0.125
}
host ceph25-ssd {
        id -30          # do not change unnecessarily
        # weight 0.278
        alg straw
        hash 0  # rjenkins1
        item osd.62 weight 0.175
        item osd.63 weight 0.103
}
host ceph35-ssd {
        id -32          # do not change unnecessarily
        # weight 0.278
        alg straw
        hash 0  # rjenkins1
        item osd.66 weight 0.175
        item osd.67 weight 0.103
}
rack rbd-cache-nzoz {
        id -26          # do not change unnecessarily
        # weight 0.852
        alg straw
        hash 0  # rjenkins1
        item ceph15-ssd weight 0.296
        item ceph25-ssd weight 0.278
        item ceph35-ssd weight 0.278
}
root rbd-cache {
        id -24          # do not change unnecessarily
        # weight 1.704
        alg straw
        hash 0  # rjenkins1
        item rbd-cache-skwer weight 0.852
        item rbd-cache-nzoz weight 0.852
}

# rules
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 2 type rack
        step chooseleaf firstn 2 type host
        step emit
        step take default
        step chooseleaf firstn -2 type osd
        step emit
}
rule ssd {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take ssd
        step choose firstn 2 type rack
        step chooseleaf firstn 2 type host
        step emit
        step take ssd
        step chooseleaf firstn -2 type osd
        step emit
}
rule sata {
        ruleset 2
        type replicated
        min_size 1
        max_size 10
        step take sata
        step choose firstn 2 type rack
        step chooseleaf firstn 2 type host
        step emit
        step take sata
        step chooseleaf firstn -2 type osd
        step emit
}
rule rbd-cache {
        ruleset 3
        type replicated
        min_size 1
        max_size 10
        step take rbd-cache
        step choose firstn 2 type rack
        step chooseleaf firstn 2 type host
        step emit
        step take rbd-cache
        step chooseleaf firstn -2 type osd
        step emit
}

# end crush map

Pool "rbd-cache" is set as cache tier for pool "rbd", pool "ssd" is set as cache tier for pool "sata".

Code:
 ceph osd pool ls detail
pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 81958 lfor 70295 flags hashpspool tiers 6 read_tier 6 write_tier 6 min_read_recency_for_promote 3 min_write_recency_for_promote 3 stripe_width 0
        removed_snaps [1~2,4~12,17~2e,46~121,16b~a]
pool 4 'ssd' replicated size 3 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 128 pgp_num 128 last_change 81967 flags hashpspool,incomplete_clones tier_of 5 cache_mode readforward target_bytes 298195056179 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 600s x6 min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0
        removed_snaps [1~14d,150~1e,16f~8]
pool 5 'sata' replicated size 3 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 512 pgp_num 512 last_change 81967 lfor 66807 flags hashpspool tiers 4 read_tier 4 write_tier 4 stripe_width 0
        removed_snaps [1~14d,150~1e,16f~8]
pool 6 'rbd-cache' replicated size 3 min_size 1 crush_ruleset 3 object_hash rjenkins pg_num 512 pgp_num 512 last_change 81958 flags hashpspool,incomplete_clones tier_of 2 cache_mode readforward target_bytes 429496729600 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 600s x6 min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0
        removed_snaps [1~2,4~12,17~2e,46~121,16b~a]

Servers ceph40, ceph45, ceph50 and ceph55 have better SATA disks (WDC WD2004FBYZ). The pools "ssd" and "sata" are placed on these servers. The RAID controller is an H700 with 512 MB cache; all disks are configured as single-drive RAID 0.
 
i.e. the SSD DC S3610 - these should be OK.
Replica 3 with a cache tier?? Are you sure? That sounds to me like an EC pool with a cache tier. But this shouldn't change anything for your speed tests.
Yes, I'm sure; please look at the output of ceph osd pool ls detail above.
atop is a nice tool
Did you mean avq and avio?
Yes - but what values do you get on Ceph?
Please look at the ceph_reads and ceph_writes charts. They are parsed from ceph -w output from today.
 
okay, let me recap this based on the information you provided:

You have 10 Servers
  • each has a 1G link for Public Network
  • each has a 1G link for Cluster Network (exception node Ceph30 - shared with public network)
  • each Server acts as a MON (10 Mons total)
  • you have split the spinners from the SSDs using some sort of CRUSH location hook script.

OSD spread:
  • Ceph 10
    • 2x SSD (2 separate weights)
    • 4x Sata (uniform weight)
  • Ceph 15
    • 2x SSD (2 separate weights)
    • 4x Sata (uniform weight)
  • Ceph 20
    • 2x SSD (2 separate weights)
    • 6x Sata (3 separate weights)
  • Ceph 25
    • 2x SSD (2 separate weights)
    • 6x Sata (2 separate weights)
  • Ceph 30
    • 2x SSD (2 separate weights)
    • 6x Sata (2 separate weights)
  • Ceph 35
    • 2x SSD (2 separate weights)
    • 6x Sata (2 separate weights)
  • Ceph 40
    • 2x SSD (2 separate weights)
    • 4x Sata (uniform weight)
  • Ceph 45
    • 2x SSD (2 separate weights)
    • 4x Sata (uniform weight)
  • Ceph 50
    • 2x SSD (2 separate weights)
    • 4x Sata (uniform weight)
  • Ceph 55
    • 2x SSD (2 separate weights)
    • 4x Sata (uniform weight)

Bucket-Type spread:
  • root rbd-cache
    • Rack rbd-cache-skwer
      • host Ceph10-ssd
      • host Ceph20-ssd
      • host Ceph30-ssd
    • rack rbd-cache-nzoz
      • host Ceph15-ssd
      • host Ceph25-ssd
      • host Ceph35-ssd
  • root default
    • Skwer
      • host Ceph30
      • host Ceph20
      • host Ceph10
    • nzoz
      • host Ceph35
      • host Ceph25
      • host Ceph15
  • root ssd
    • skwer-ssd
      • host Ceph40-ssd
      • host Ceph50-ssd
    • Nzoz-SSD
      • host Ceph45-ssd
      • host Ceph55-ssd
  • root sata
    • rack Skwer-sata
      • host Ceph40-sata
      • host Ceph50-sata
    • rack Nzoz-sata
      • host Ceph45-sata
      • host Ceph55-sata

Crush rule / pool config :

  • pool rbd
      • Crush-rule: default
      • root: default
    • Cache-pool rbd-cache
      • Crush-rule: rbd-cache
      • root: rbd-cache
  • pool sata
      • Crush-rule: sata
      • root: sata
    • Cache-pool ssd
      • Crush-rule: ssd
      • root: ssd



Questions I still have:

Q1: SSDs: Your crush map has them added at different weights. This leads me to believe that different amounts of SSD space have been allocated to these SSD OSDs. Can you shed some light on the exact config of these SSDs?

Q2: You seem to be using differently weighted HDD-based OSDs on the same node and cluster. Any chance these have different performance characteristics? It looks like you use at least 4 different types of HDDs in the cluster.

Q3: You mentioned raid devices ... where and how do you use Raid ?


Things I can already say are this:
(in no particular order)

A1. You effectively have 2 logical Clusters under the physical Cluster (or management engine).

Cluster 1: hosts pool rbd (and its caching pool). It is separated into 2 racks with 3 nodes each, 6 backing OSDs per node.
Cluster 2: hosts pool sata (and its caching pool). It is separated into 2 racks with 2 nodes each, 4 backing OSDs per node.

If I were running this setup, I'd have separated these into 2 physical clusters. That said, there is no need to change this now.
You just have to be aware that, performance-wise, you basically have a 32-OSD cluster AND a 16-OSD cluster.

A2. Too many monitors (mons)
compare http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/
You basically run more mons than you need to. Given your rack config (assuming it mirrors the physical rack configuration), I'd run an odd number of mons, at least 1 per logical rack. In your situation I'd run 5 mons. Better yet, set up 3 dedicated mons with more network links for client/cluster network communication (removal sketch below).

compare : http://docs.ceph.com/docs/jewel/start/hardware-recommendations/
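For reference, retiring a surplus monitor generally looks like this (a hedged sketch; "ceph20" is just an example mon, and the service command depends on your init system):
Code:
ceph mon remove ceph20                 # remove the monitor from the monmap
# then stop the mon daemon on that node, e.g.:
service ceph stop mon.ceph20           # or: systemctl stop ceph-mon@ceph20
# and drop the [mon.ceph20] section / "mon initial members" entry from ceph.conf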

A3. You are most likely (severely) network-bottlenecked:
You can test this by running your benchmarks while simultaneously monitoring your network links with something like e.g. nload.
Given your already deployed resources, I'd use 2x 1G for the public and 2x 1G for the cluster network (bonding sketch below).
Compare: http://docs.ceph.com/docs/master/rados/configuration/network-config-ref/
If that is still not enough to alleviate your network bottleneck (which I suspect it won't be), I'd put the constrained network on a 10G link and use 4x 1G for the other, or even do 10G and 10G.
But you basically have to benchmark this.
edit: given the fact that you use 2 logical clusters, you might not need to upgrade all nodes to 10G links, seeing how one is a 4-node cluster and one is a 6-node cluster with 3 times the OSDs, and you probably have different performance requirements for the two clusters.
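A bonded 2x 1G setup in a Debian-style /etc/network/interfaces could look roughly like this (a sketch for one node's public network; the address is borrowed from ceph10, em2 is assumed to be a currently unused port, and 802.3ad assumes LACP support on the switch - otherwise balance-alb is an option):
Code:
auto bond0
iface bond0 inet static
        address 10.20.8.10
        netmask 255.255.252.0
        gateway 10.20.8.1
        bond-slaves em1 em2
        bond-mode 802.3ad
        bond-miimon 100
        bond-xmit-hash-policy layer3+4
Keep in mind that a single TCP stream still tops out at ~1 Gbit/s with bonding; the gain comes from the many OSD/client connections being spread across the links.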

A4. That Ceph30 node with a single 1G network link can't really be helping matters.
I'd do the network constraint test on that one first.

edit: Typos & added Question 3
edit2: expanded on A3
 
  • each has a 1G link for Cluster Network (exception node Ceph30 - shared with public network)
Ceph30 also has a separate 1G interface for the cluster network.

Questions I still have:

Q1: SSDs: Your crush map has them added at different weights. This leads me to believe that different amounts of SSD space have been allocated to these SSD OSDs. Can you shed some light on the exact config of these SSDs?
As I wrote before, all SSD drives are identical. There are 2 SSDs in each server; please look at the partition tables for these drives:
  • First with system, journals, and osd
Code:
 parted /dev/sda
Disk /dev/sda: 199GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End     Size    Type      File system  Flags
 1      1049kB  256MB   255MB   primary   ext2         boot
 2      256MB   12.3GB  12.0GB  primary                lvm
 3      12.3GB  199GB   187GB   extended               lba
 5      12.6GB  24.7GB  12.1GB  logical
 6      24.7GB  36.8GB  12.1GB  logical
 7      36.8GB  48.9GB  12.1GB  logical
 8      48.9GB  60.9GB  12.1GB  logical
 9      60.9GB  199GB   139GB   logical   xfs
  • And second only with osd
Code:
 parted /dev/sdf
Disk /dev/sdf: 199GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name          Flags
 2      1049kB  10.7GB  10.7GB               ceph journal
 1      10.7GB  199GB   189GB   xfs          ceph data
I think it isn't a recommended configuration to have the system, journals and an OSD on the same physical drive, but I have no idea how to improve this config and utilize the SSDs to the maximum.
Q2: You seem to be using differently weighted HDD-based OSDs on the same node and cluster. Any chance these have different performance characteristics? It looks like you use at least 4 different types of HDDs in the cluster.
Yes, the HDD drives on ceph10-ceph35 are a mix of 5.2k and 7.2k; unfortunately it's an old cluster and drives were replaced without unification.
Servers ceph40-ceph55 have uniform drives.
Q3: You mentioned raid devices ... where and how do you use Raid ?
Because the Dell H700i doesn't have a JBOD setting, I set up RAID 0 on each individual drive.

Things I can already say are this:
(in no particular order)

A1. You effectively have 2 logical Clusters under the physical Cluster (or management engine).

Cluster 1: hosts pool rbd (and its caching pool). It is separated into 2 racks with 3 nodes each, 6 backing OSDs per node.
Cluster 2: hosts pool sata (and its caching pool). It is separated into 2 racks with 2 nodes each, 4 backing OSDs per node.

If I were running this setup, I'd have separated these into 2 physical clusters. That said, there is no need to change this now.
You just have to be aware that, performance-wise, you basically have a 32-OSD cluster AND a 16-OSD cluster.
Yes, it is because of the difference in performance. The servers in the 16-OSD part have better hardware and are new; the 32-OSD part of the cluster will be removed in the future.
A2. Too many monitors (mons)
compare http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/
You basically run more mons than you need to. Given your rack config (assuming it mirrors the physical rack configuration), I'd run an odd number of mons, at least 1 per logical rack. In your situation I'd run 5 mons. Better yet, set up 3 dedicated mons with more network links for client/cluster network communication.

compare : http://docs.ceph.com/docs/jewel/start/hardware-recommendations/
Thank you for this.
A3. You are most likely (severely) network-bottlenecked:
You can test this by running your benchmarks while simultaneously monitoring your network links with something like e.g. nload.
Given your already deployed resources, I'd use 2x 1G for the public and 2x 1G for the cluster network.
Compare: http://docs.ceph.com/docs/master/rados/configuration/network-config-ref/
If that is still not enough to alleviate your network bottleneck (which I suspect it won't be), I'd put the constrained network on a 10G link and use 4x 1G for the other, or even do 10G and 10G.
But you basically have to benchmark this.
edit: given the fact that you use 2 logical clusters, you might not need to upgrade all nodes to 10G links, seeing how one is a 4-node cluster and one is a 6-node cluster with 3 times the OSDs, and you probably have different performance requirements for the two clusters.
For a 1-hour test with nload, the average incoming and outgoing transfer is ~150 Mbit/s on each server. The maximum transfer is ~400 Mbit/s.
A4. That Ceph30 node with a single 1G network link can't really be helping matters.
I'd do the network constraint test on that one first.
Ceph30 has two 1G interfaces; em1 and p4p2 are on different network adapters.
A test with iperf from ceph40 to ceph45 gives me ~930 Mbit/s.
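It may also be worth checking aggregate throughput with several parallel streams, since Ceph traffic is many connections rather than one (a sketch; IP and stream count are examples):
Code:
# on ceph45 (cluster network address):
iperf -s
# on ceph40:
iperf -c 10.20.4.45 -P 4 -t 30      # 4 parallel streams for 30 seconds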
 
As I wrote before, all SSD drives are identical. There are 2 SSDs in each server; please look at the partition tables for these drives:
  • First with system, journals, and osd
Code:
parted /dev/sda
Disk /dev/sda: 199GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End     Size    Type      File system  Flags
 1      1049kB  256MB   255MB   primary   ext2         boot
 2      256MB   12.3GB  12.0GB  primary                lvm
 3      12.3GB  199GB   187GB   extended               lba
 5      12.6GB  24.7GB  12.1GB  logical
 6      24.7GB  36.8GB  12.1GB  logical
 7      36.8GB  48.9GB  12.1GB  logical
 8      48.9GB  60.9GB  12.1GB  logical
 9      60.9GB  199GB   139GB   logical   xfs
  • And second only with osd
Code:
parted /dev/sdf
Disk /dev/sdf: 199GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name          Flags
 2      1049kB  10.7GB  10.7GB               ceph journal
 1      10.7GB  199GB   189GB   xfs          ceph data
I think it isn't a recommended configuration to have the system, journals and an OSD on the same physical drive, but I have no idea how to improve this config and utilize the SSDs to the maximum.

That is one of your bottlenecks right there.

If you use one SSD as journal + OSD and the other only as an OSD, and their weights are similar, then statistically they are equally likely to be used as the primary OSD.
This "multipurpose SSD" handles the following writes:
  • 4x Journal Writes for the 4x HDD-OSD's attached to it.
  • 1x Journal Write for its own OSD
  • 1x Write to its OSD
  • Writes done by the System
Which most likely makes it significantly slower than the OSD-only SSD.

You basically tried to maximize the space your cache can hold, but in the process you slowed your cache performance way down. It also leaves the other SSD underutilised in this scenario.
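One way to see the difference is to measure small sync-write latency on both SSDs, for example against a test file on each cache-tier OSD's filesystem (a sketch; the mount path is a placeholder, and this adds load to a production OSD, so run it carefully):
Code:
fio --name=ssd-latency --filename=/var/lib/ceph/osd/ceph-NN/fio-test \
    --size=1G --direct=1 --fsync=1 --rw=write --bs=4k \
    --iodepth=1 --runtime=60 --time_based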

You have 4 options here:

  • Option 1:
    • SSD-0: Buy a "cheap" OS-SSD
    • SSD-1: Journals
    • SSD-2: Cache
  • Option 2:
    • SSD-1: OS + Journals
    • SSD-2: Cache
  • Option 3:
    • SSD-1: OS + Cache
    • SSD-2: Journals (this one will most likely reach its end first)
  • Option 4:
    • SSD-1: OS + 2x Journals + cache
    • SSD-2: 2x Journals + cache

Option 1: gives you the best cache performance and half the cache size (will wear out the OS disk first)
Option 2: second-best cache performance (your journal SSD will wear out first)
Option 3: slowest cache performance, maximum utilisation of cache size (both will wear out at the same rate)


Q2: You seem to be using differently weighted HDD-based OSDs on the same node and cluster. Any chance these have different performance characteristics? It looks like you use at least 4 different types of HDDs in the cluster.
Yes, the HDD drives on ceph10-ceph35 are a mix of 5.2k and 7.2k; unfortunately it's an old cluster and drives were replaced without unification.
Servers ceph40-ceph55 have uniform drives.

Did you manually set the weights of these drives? If not, the weight is based on the disk size (which then must also be different).

You have another bottleneck right there.
Let's take this example from Ceph30:
item osd.16 weight 0.680
item osd.18 weight 0.910

OSD.16 is roughly 25% less likely to be selected as an OSD compared to OSD.18. Unless your OSD.16 is exactly 25% less performant, you are leaving large chunks of performance on the table, because OSD.18 is selected more often.

Now you could figure out exactly how big the difference is and manually adjust this weight, but quite honestly, you'll never account for all the factors that go into this equation, nor is it worth the time.
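If you did want to experiment, the knob itself is simply (a sketch; the weight value is purely illustrative, not a recommendation):
Code:
ceph osd crush reweight osd.16 0.55    # lower osd.16's crush weight so it receives fewer PGs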

You could also remove the 5.2k drives, and it MIGHT be worth it, but my gut tells me that removing 30% of a node's OSDs is not worth the performance jump you get from pure 4x 7.2k drives. Unless we are talking about really, really old drives, at which point they should probably not be used anyway.

In short:
Best performance is maintained when using same-speed, same-capacity drives as OSDs.



Q3: You mentioned raid devices ... where and how do you use Raid ?
Because the Dell H700i doesn't have a JBOD setting, I set up RAID 0 on each individual drive.
Let me rephrase that.

On which node(s) did you raid-0 which drives ?

Or are you saying you "raid-0" each drive into its own volume ?
For a 1-hour test with nload, the average incoming and outgoing transfer is ~150 Mbit/s on each server. The maximum transfer is ~400 Mbit/s.

What type of test did you run?
Did you load up your VMs and create a benchmark on said VMs?
You should be maxing out at least one 1G network link - and by that I mean at least the 1G link your VM accesses its storage over.
 
Hi! I'm doing a comparative study and I need to know what the maximum storage capacity of Ceph is. Do you know where I can find that information from a reliable source?
 
Hi! I'm doing a comparative study and I need to know what the maximum storage capacity of Ceph is. Do you know where I can find that information from a reliable source?
Harhar,
good joke!

Some time ago Ceph described itself as petabyte storage... so I'm sure it can provide more space than you need.

Look at ceph.com - the Ceph block device page states: images up to 16 exabytes.

Udo
 
