Proxmox CEPH Cluster's Performance

Dave Wood · Feb 28, 2017

Hi,

I need your help. I’m getting very poor performance.
I have 3 nodes Proxmox Cluster setup with HP DL580 g7 Server. Each server has dual port 10 Gbps NIC.
Each node has 4 x 15K 600 2.5 SAS and 4 X 1 TB 7.2k SATA

Each node has following Partitions ( I'm use Logical Volume as OSD):

Node 1
100GB for Proxmox (7.2 K SATA)
2.63 TB (7.2K) OSD.3
1.63 TB (15K) OSD.0

Node 2
100GB for Proxmox (7.2 K SATA)
2.63 TB (7.2K) OSD.4
1.63 TB (15K) OSD.1

Node 3
100GB for Proxmox (7.2 K SATA)
2.63 TB (7.2K) OSD.5
1.63 TB (15K) OSD.2

I create two runsets and two pools

Runset 0
osd.0, osd.1, osd.2 and
Pool ceph-performance is using runset 0

Runset 1
osd.3, osd.4, osd.5
Pool ceph-capacity is using runset 1

Proxmox and CEPH versions:
pve-manager/4.4-1/eb2d6f1e (running kernel: 4.4.35-1-pve)
ceph version 0.94.10

Performance:
LXC VM Running on ceph-capacity Storage
[root@TEST01 /]# dd if=/dev/zero of=here bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 15.9393 s, 67.4 MB/s

LXC VM Running on 7.2 K Storage without CEPH
[root@TEST02 /]# dd if=/dev/zero of=here bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 2.12752 s, 505 MB/s

root@XXXXXXXX:/# rados -p ceph-capacity bench 10 write --no-cleanup
Maintaining 16 concurrent writes of 4194304 bytes for up to 10 seconds or 0 objects
Object prefix: benchmark_data_xxxxxxx_65731
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)

0 0 0 0 0 0 - 0
1 16 46 30 119.951 120 0.694944 0.389095
2 16 76 60 119.959 120 0.547648 0.462089
3 16 80 64 85.3068 16 0.610866 0.468057
4 16 96 80 79.9765 64 1.83572 0.701601
5 16 107 91 72.7796 44 1.25032 0.774318
6 16 122 106 70.6472 60 0.799959 0.81822
7 16 133 117 66.8386 44 1.51327 0.8709
8 16 145 129 64.4822 48 1.11328 0.913094
9 16 158 142 63.0938 52 0.683712 0.917917
10 16 158 142 56.7846 0 - 0.917917

Total time run: 10.390136
Total writes made: 159
Write size: 4194304
Bandwidth (MB/sec): 61.2119
Stddev Bandwidth: 40.3764
Max bandwidth (MB/sec): 120
Min bandwidth (MB/sec): 0
Average IOPS: 15
Average Latency(s): 1.02432
Stddev Latency(s): 0.567672
Max latency(s): 2.57507
Min latency(s): 0.135008

Any help will be really appreciated.

Thank You,
Dave

Ashley · Feb 28, 2017

In the current version of CEPH the underlying storage is a system called filestore, this means that every write is also wrote to a journal, this causes a thing called a double write penalty, basically for every write if it wrote twice, causing filestore to max out at 1/2 the speed of the physical disk.

This is why your see most people talking about placing the OSD journal on a SSD, this removes the double write and allows the disk to just focus on the writing of the physical data, the speeds you are seeing above is what I would expect to see for this form of setup.

Dave Wood · Feb 28, 2017

Ashley,
Thank you for your response. You mentioned 1/2 the speed of the physical disk, but I'm getting 1/7 th of the physical drive speed.

Ashley · Feb 28, 2017

A standard 7.5K disk will give you around 150, the 505 result you got will be due to a Raid Cache or others means.

Dave Wood · Feb 28, 2017

If I use individual disk as OSD and add SSD for journal, what is the highest read/write I will get?

Ashley · Feb 28, 2017

In your current setup (amount of OSD's) the most you will get the is the max rating of a single OSD disk, however you will never hit 100% due to overheads and different performance ratings at different queue depths and I/O type.

Dave Wood · Feb 28, 2017

Do you have any suggestion? This Proxmox and CEPH cluster is not in production. I can make any changes.

Ashley · Feb 28, 2017

If you can get 6 OSD's of the same type and then one or two SSD's per a server for the Journal then your be able to improve performance, with the current hardware you have and without any changes not too much you can do.

Dave Wood · Feb 28, 2017

Thank you Ashley. I appreciate your help. How large an SSD do I need?

Ashley · Feb 28, 2017

Journal only needs to be 5-10GB, however you need a decent SSD for the job, something like : http://www.samsung.com/us/business/computing/solid-state-drives/MZ-7KM240E

Dave Wood · Feb 28, 2017

Thank you

udo · Feb 28, 2017

Dave Wood said:
Hi,

I need your help. I’m getting very poor performance.
I have 3 nodes Proxmox Cluster setup with HP DL580 g7 Server. Each server has dual port 10 Gbps NIC.
Each node has 4 x 15K 600 2.5 SAS and 4 X 1 TB 7.2k SATA

Each node has following Partitions ( I'm use Logical Volume as OSD):

Node 1
100GB for Proxmox (7.2 K SATA)
2.63 TB (7.2K) OSD.3
1.63 TB (15K) OSD.0

Hi Dave,
this shows, that you use the OSDs as RAID-5... this is realy not recommend by ceph.
Depends on your Backplane/raidcontroller, can you switch to IT-mode to access the single disks?

LXC VM Running on 7.2 K Storage without CEPH
[root@TEST02 /]# dd if=/dev/zero of=here bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 2.12752 s, 505 MB/s

I assume you measure caching here.

root@XXXXXXXX:/# rados -p ceph-capacity bench 10 write --no-cleanup
Maintaining 16 concurrent writes of 4194304 bytes for up to 10 seconds or 0 objects
Object prefix: benchmark_data_xxxxxxx_65731
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)

0 0 0 0 0 0 - 0
1 16 46 30 119.951 120 0.694944 0.389095
2 16 76 60 119.959 120 0.547648 0.462089
3 16 80 64 85.3068 16 0.610866 0.468057
4 16 96 80 79.9765 64 1.83572 0.701601
5 16 107 91 72.7796 44 1.25032 0.774318
6 16 122 106 70.6472 60 0.799959 0.81822
7 16 133 117 66.8386 44 1.51327 0.8709
8 16 145 129 64.4822 48 1.11328 0.913094
9 16 158 142 63.0938 52 0.683712 0.917917
10 16 158 142 56.7846 0 - 0.917917

Total time run: 10.390136
Total writes made: 159
Write size: 4194304
Bandwidth (MB/sec): 61.2119
Stddev Bandwidth: 40.3764
Max bandwidth (MB/sec): 120
Min bandwidth (MB/sec): 0
Average IOPS: 15
Average Latency(s): 1.02432
Stddev Latency(s): 0.567672
Max latency(s): 2.57507
Min latency(s): 0.135008

10 seconds are much to less to say anything, but due that the writespeed get worser and worser, it's looks that your raid-cache is filled and/or you don't have any BBU?

Raid-5 is not realy good for writes - this is normaly covered by an raid-controller write-cache.

I would split the disks to single OSDs - additional an SSD-journal, like Ashley wrote, help a lot too for writes (with an good Enterprise SSD).

How looks your ceph.conf - perhaps there are some tunings possible.

Udo

Dave Wood · Feb 28, 2017

Thank You Udo,
I'm going to add SSD for journal and going to split the disks to single OSDs.

My ceph.conf
[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 10.0.0.0/24
filestore xattr use omap = true
fsid = 8cf9eac7-ad29-4c4c-94fe-9c6dab145740
keyring = /etc/pve/priv/$cluster.$name.keyring
osd crush update on start = false
osd journal size = 5120
osd pool default min size = 1
public network = 10.0.0.0/24

[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.2]
host = XXXX01
mon addr = 10.0.0.16:6789

[mon.1]
host = XXXX02
mon addr = 10.0.0.14:6789

[mon.0]
host = XXXX03
mon addr = 10.0.0.12:6789

Crush Map
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host XXXXPVE01 {
id -2 # do not change unnecessarily
# weight 1.630
alg straw
hash 0 # rjenkins1
item osd.0 weight 1.630
}
host XXXXPVE02 {
id -3 # do not change unnecessarily
# weight 1.630
alg straw
hash 0 # rjenkins1
item osd.1 weight 1.630
}
host XXXXPVE03 {
id -4 # do not change unnecessarily
# weight 1.630
alg straw
hash 0 # rjenkins1
item osd.2 weight 1.630
}
root default {
id -1 # do not change unnecessarily
# weight 4.890
alg straw
hash 0 # rjenkins1
item XXXXPVE01 weight 1.630
item XXXXPVE02 weight 1.630
item XXXXPVE03 weight 1.630
}
host PVE01-72KDRIVES {
id -6 # do not change unnecessarily
# weight 1.000
alg straw
hash 0 # rjenkins1
item osd.3 weight 1.000
}
host PVE02-72KDRIVES {
id -7 # do not change unnecessarily
# weight 1.000
alg straw
hash 0 # rjenkins1
item osd.4 weight 1.000
}
host PVE03-72KDRIVES {
id -8 # do not change unnecessarily
# weight 1.000
alg straw
hash 0 # rjenkins1
item osd.5 weight 1.000
}
root 72KDRIVES {
id -5 # do not change unnecessarily
# weight 3.000
alg straw
hash 0 # rjenkins1
item PVE01-72KDRIVES weight 1.000
item PVE02-72KDRIVES weight 1.000
item PVE03-72KDRIVES weight 1.000
}

# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule 72KDRIVES_replicated_ruleset {
ruleset 1
type replicated
min_size 1
max_size 10
step take 72KDRIVES
step chooseleaf firstn 0 type host
step emit
}

# end crush map

Dave Wood · Mar 8, 2017

I would like to thank you guys and give you update.
I have reconfigured CEPH and I'm happy with the performance. I'm still tweaking.

I have added 1 x HP SAS Enterprise Performance 200GB SSD to each node for Journal.
Configured SAS SSD single drive RAID 0 with Array Acceleration Disable
Each node has 4 x 15K 600 SAS drives. I have configured each dirve as a RADI0 logical volumes with Array Acceleration Disable. 4 logical drive and 4 OSDs each node

root@XXXXXX:~# rados -p test bench 10 write --no-cleanup
Maintaining 16 concurrent writes of 4194304 bytes for up to 10 seconds or 0 objects
Object prefix: benchmark_data_XXXXX_28182
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 137 121 483.757 484 0.136917 0.12519
2 16 268 252 503.809 524 0.0959942 0.123927
3 16 400 384 511.824 528 0.0687946 0.122819
4 16 532 516 515.834 528 0.167677 0.122263
5 16 663 647 517.438 524 0.107319 0.121523
6 16 795 779 519.175 528 0.115925 0.122049
7 16 921 905 516.987 504 0.179843 0.122662
8 16 1052 1036 517.847 524 0.0806617 0.122769
9 16 1171 1155 513.185 476 0.0876048 0.123064
10 16 1245 1229 491.459 296 0.0784057 0.127504
Total time run: 10.249352
Total writes made: 1245
Write size: 4194304
Bandwidth (MB/sec): 485.884
Stddev Bandwidth: 162.954
Max bandwidth (MB/sec): 528
Min bandwidth (MB/sec): 0
Average IOPS: 121
Average Latency(s): 0.130875
Stddev Latency(s): 0.0702443
Max latency(s): 0.732217
Min latency(s): 0.0329304

root@XXXXXX:~# rados -p test bench 10 seq --no-cleanup
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 2 2 0 0 0 - 0
1 15 373 358 1430.86 1432 0.0579409 0.0423946
2 15 754 739 1477.18 1524 0.00717156 0.0414042
3 10 1087 1077 1435.34 1352 0.0563994 0.0425022
Total time run: 3.018737
Total reads made: 1087
Read size: 4194304
Bandwidth (MB/sec): 1440.34
Average IOPS: 360
Average Latency(s): 0.0429984
Max latency(s): 0.176286
Min latency(s): 0.00574725

udo · Mar 9, 2017

Dave Wood said:
I would like to thank you guys and give you update.
I have reconfigured CEPH and I'm happy with the performance. I'm still tweaking.

I have added 1 x HP SAS Enterprise Performance 200GB SSD to each node for Journal.

Configured SAS SSD single drive RAID 0 with Array Acceleration Disable

Each node has 4 x 15K 600 SAS drives. I have configured each dirve as a RADI0 logical volumes with Array Acceleration Disable. 4 logical drive and 4 OSDs each node

Hi Dave,
perhaps (depends on your Raid-Controller and if you have an BBU) you can speed up the IO in production a little bit with "Array Acceleration" enabled for the SAS-Drives - I guess HP mean with Array Acceleration write-cache?!

root@XXXXXX:~# rados -p test bench 10 write --no-cleanup

like I wrote before - 10 seconds are to less for an test - use 60 sec.

root@XXXXXX:~# rados -p test bench 10 seq --no-cleanup
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 2 2 0 0 0 - 0
1 15 373 358 1430.86 1432 0.0579409 0.0423946
2 15 754 739 1477.18 1524 0.00717156 0.0414042
3 10 1087 1077 1435.34 1352 0.0563994 0.0425022
Total time run: 3.018737
Total reads made: 1087
Read size: 4194304
Bandwidth (MB/sec): 1440.34
Average IOPS: 360
Average Latency(s): 0.0429984
Max latency(s): 0.176286
Min latency(s): 0.00574725

if you don't cleared the cache before on all nodes, you measure caching - not an real performance.

Clear the cache on each node:

Code:

sync; echo 3 > /proc/sys/vm/drop_caches

one point: you can optimize the ceph.conf with disabling debug-information (speed up a little bit):

Code:

debug asok = 0/0
debug auth = 0/0
debug buffer = 0/0
debug client = 0/0
debug context = 0/0
debug crush = 0/0
debug filer = 0/0
debug filestore = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug journal = 0/0
debug journaler = 0/0
debug lockdep = 0/0
debug mds = 0/0
debug mds balancer = 0/0
debug mds locker = 0/0
debug mds log = 0/0
debug mds log expire = 0/0
debug mds migrator = 0/0
debug mon = 0/0
debug monc = 0/0
debug ms = 0/0
debug objclass = 0/0
debug objectcacher = 0/0
debug objecter = 0/0
debug optracker = 0/0
debug osd = 0/0
debug paxos = 0/0
debug perfcounter = 0/0
debug rados = 0/0
debug rbd = 0/0
debug rgw = 0/0
debug throttle = 0/0
debug timer = 0/0
debug tp = 0/0

Udo

Dave Wood · Mar 9, 2017

Udo,

Going to apply your tweaks and update you soon.

Dave Wood · Mar 9, 2017

Udo,

Thank you so much. I have build a Windows 2012 R2 Server using Virtio iSCSI HDD with writeback and IO Thread enabled HDD. But I'm not getting performance.

Crystal Disk Mark 1 GB 514 MB/sec (Read) 260 MB/sec (Write). Any suggestions?

Find below my update Ceph's update

Added debug config in the ceph.conf
We have 1 GB array cache in our server. I have enabled Array Accelerator for all OSDs.
I did not enable Journal SAS SSD's Array Accelerator. According to HP better performance if I disable it. http://h20566.www2.hpe.com/hpsc/doc...955644&docId=emr_na-c02963968&docLocale=en_US
I have cleared all nodes's cache.

root@XXXXXX:~# rados -p test bench 60 write --no-cleanup
.
.
.

2017-03-09 09:49:30.612012 min lat: 0.0218557 max lat: 1.1251 avg lat: 0.241465
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
60 16 3978 3962 264.062 244 0.617707 0.241465
Total time run: 60.460085
Total writes made: 3979
Write size: 4194304
Bandwidth (MB/sec): 263.248
Stddev Bandwidth: 85.2774
Max bandwidth (MB/sec): 552
Min bandwidth (MB/sec): 0
Average IOPS: 65
Average Latency(s): 0.242892
Stddev Latency(s): 0.263076
Max latency(s): 1.1251
Min latency(s): 0.0218557

root@XXXXXX:~# rados -p test bench 60 seq --no-cleanup
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 1 1 0 0 0 - 0
1 16 349 333 1328.38 1332 0.082535 0.0446865
2 16 678 662 1322.02 1316 0.00935964 0.0465674
3 16 1008 992 1320.88 1320 0.0579302 0.046662
4 16 1347 1331 1329.56 1356 0.126847 0.0465424
5 16 1658 1642 1312.41 1244 0.00807041 0.047201
6 16 1984 1968 1310.96 1304 0.0359098 0.0473376
7 15 2303 2288 1306.5 1280 0.04179 0.0475998
8 16 2630 2614 1306.01 1304 0.0128456 0.0476415
9 16 2952 2936 1303.97 1288 0.00791568 0.0477699
10 16 3283 3267 1305.94 1324 0.0376555 0.0477338
11 16 3632 3616 1314.09 1396 0.0192136 0.0474186
12 16 3957 3941 1312.89 1300 0.0337602 0.0474839
Total time run: 12.152072
Total reads made: 3979
Read size: 4194304
Bandwidth (MB/sec): 1309.74
Average IOPS: 327
Average Latency(s): 0.0477081
Max latency(s): 0.256799
Min latency(s): 0.00559258

Dave Wood · Mar 9, 2017

I had config error. I adjusted

2017-03-09 10:58:34.165841 min lat: 0.0274781 max lat: 0.648084 avg lat: 0.136532
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
60 16 7037 7021 467.95 476 0.118431 0.136532
Total time run: 60.184267
Total writes made: 7037
Write size: 4194304
Bandwidth (MB/sec): 467.697
Stddev Bandwidth: 71.7206
Max bandwidth (MB/sec): 528
Min bandwidth (MB/sec): 0
Average IOPS: 116
Average Latency(s): 0.136704
Stddev Latency(s): 0.0872827
Max latency(s): 0.648084
Min latency(s): 0.0274781

udo · Mar 9, 2017

Dave Wood said:
Udo,

Thank you so much. I have build a Windows 2012 R2 Server using Virtio iSCSI HDD with writeback and IO Thread enabled HDD. But I'm not getting performance.

Crystal Disk Mark 1 GB 514 MB/sec (Read) 260 MB/sec (Write). Any suggestions?

Hi,
why "iSCSI-HDD"?? I think you mean scsi-hdd with virtio-scsi driver?!

About Performance: for an single thread, which one VM-Disk is, is the value are not too bad. ceph like to serve many simultanous threads.

Find below my update Ceph's update

Added debug config in the ceph.conf

We have 1 GB array cache in our server. I have enabled Array Accelerator for all OSDs.

I did not enable Journal SAS SSD's Array Accelerator. According to HP better performance if I disable it. http://h20566.www2.hpe.com/hpsc/doc...955644&docId=emr_na-c02963968&docLocale=en_US

I have cleared all nodes's cache.

to top 1: after adding the debug-stuff you must restart the OSDs to activate the settings (or use injectargs).
to top 4: cache clearing must be done between rados bench write and rados bench read!

Udo

Dave Wood · Mar 9, 2017

Yes, I meant virtio-scsi driver.
After clear cache, read drops from 1309 MB/sec to 1042 MB/sec

Proxmox CEPH Cluster's Performance

Active Member

Member

Active Member

Member

Active Member

Member

Active Member

Member

Active Member

Member

Active Member

Distinguished Member

Active Member

Active Member

Distinguished Member

Active Member

Active Member

Active Member

Distinguished Member

Active Member