Small 3-node Ceph Bluestore cluster: how to use the NVMe? Config recommendations?

high_performer

Jul 16, 2018
Hi folks,
given are 3 nodes:
each node: 10 Gbit network
each node: 8 enterprise spinners (4 TB each)
each node: 1 enterprise NVMe (1 TB)
each node: 64 GB RAM
each node: 4-core CPU -> 8 threads, up to 3.2 GHz
pveperf of the CPU:
CPU BOGOMIPS: 47999.28
REGEX/SECOND: 2721240
each node: latest Proxmox, of course
each node: 2 more slots available for disks
each node's OS is on a SuperDOM SSD
pveperf of the SuperDOM:
BUFFERED READS: 247.56 MB/sec
AVERAGE SEEK TIME: 0.11 ms
FSYNCS/SECOND: 322.70

We want to use Bluestore, with 3/2 (size/min_size) in the pools.
I know - it's a small system, spinners are slow, latencies, etc.
But how can I get the best Ceph performance?

Use the NVMe as a "normal" OSD with the WAL on itself (same as the HDDs)?
Use the NVMe as a WAL device for the HDDs? (Problem: if the NVMe breaks, are all 8 HDD-OSDs of that node lost?)
Is it worth adding 2 more SSDs per node for the WAL (so 1 SSD per 4 HDDs), or should we use them as OSDs as well?
Do we need more RAM?
1 monitor is running on each node (I could move them to other machines, if recommended).
Would a fourth node give considerably better performance?
Still no jumbo frames - is that really bad?
And still no link aggregation, though it would be possible (see the network sketch below).
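
Regarding jumbo frames and link aggregation: a minimal sketch of what that could look like in /etc/network/interfaces, assuming ens15f1 is the dedicated Ceph NIC and 192.168.100.0/24 is a hypothetical cluster network (every switch port in the path must also be configured for MTU 9000):

auto ens15f1
iface ens15f1 inet static
        address 192.168.100.1
        netmask 255.255.255.0
        mtu 9000

For link aggregation you would instead enslave both NICs into an 802.3ad bond and set the MTU on the bond; keep in mind that a single Ceph client stream still only uses one physical link of a LACP bond.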

When I start rados bench and watch the node with atop, the NIC goes up to 50% utilization and several disks vary up to 90% usage - knowing that the WAL is on the same disk.
But the WAL is "small" - and isn't Luminous made for keeping the WAL on the same disk!?
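
With Bluestore there is no separate journal as in Filestore; the WAL lives inside the RocksDB metadata store. If you want the NVMe to help the HDDs, the usual approach is to put the whole block.db (which contains the WAL) on it, one partition per OSD. A sketch with hypothetical device names, assuming the 1 TB NVMe is pre-partitioned into 8 slices of roughly 120 GB:

# one DB/WAL partition on the NVMe per HDD-backed OSD
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1
ceph-volume lvm create --bluestore --data /dev/sdc --block.db /dev/nvme0n1p2
# ... and so on for the remaining HDDs

And yes, the feared failure mode is real: if that NVMe dies, all 8 OSDs whose DB lives on it go down with it, and Ceph has to recover their data from the other nodes.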

Would CRUSH take the different performance (device classes hdd, ssd, nvme) into consideration when placing data?
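
For what it's worth: since Luminous, CRUSH does track device classes, but it does not automatically prefer faster devices; you have to create class-restricted rules and point each pool at one. A sketch (rule names are made up; the pool name is the one from the backup log further down):

# one rule per device class
ceph osd crush rule create-replicated replicated_hdd default host hdd
ceph osd crush rule create-replicated replicated_nvme default host nvme
# pin a pool to the HDD-only rule
ceph osd pool set ceph_pool1_vm crush_rule replicated_hdd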


Recommendations welcome.
 
Update
Some performance tests:

pveperf
CPU BOGOMIPS: 47995.04
REGEX/SECOND: 3133624
HD SIZE: 7.07 GB (/dev/mapper/pve-root)
BUFFERED READS: 235.98 MB/sec
AVERAGE SEEK TIME: 0.10 ms
FSYNCS/SECOND: 322.44
DNS EXT: 90.98 ms
DNS INT: 0.49 ms




rados bench write:
Total time run: 10.115603
Total writes made: 1088
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 430.226
Stddev Bandwidth: 16.7013
Max bandwidth (MB/sec): 448
Min bandwidth (MB/sec): 400
Average IOPS: 107
Stddev IOPS: 4
Max IOPS: 112
Min IOPS: 100
Average Latency(s): 0.148661
Stddev Latency(s): 0.0525165
Max latency(s): 0.421106
Min latency(s): 0.0546044



rados bench read seq:
Total time run: 3.660390
Total reads made: 1088
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1188.94
Average IOPS: 297
Stddev IOPS: 30
Max IOPS: 318
Min IOPS: 261
Average Latency(s): 0.0530154
Max latency(s): 0.31674
Min latency(s): 0.0198092



rados bench read rand:
Total time run: 10.059865
Total reads made: 3847
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1529.64
Average IOPS: 382
Stddev IOPS: 15
Max IOPS: 402
Min IOPS: 352
Average Latency(s): 0.0412249
Max latency(s): 0.202156
Min latency(s): 0.00134031
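
The exact invocations are not shown above; presumably the standard form was used, something like this (pool name hypothetical; --no-cleanup keeps the benchmark objects around so the seq/rand runs have something to read):

rados bench -p testpool 10 write --no-cleanup
rados bench -p testpool 10 seq
rados bench -p testpool 10 rand
rados -p testpool cleanup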



Bench of the block device:
rbd bench --io-type write image01 --pool=rbdbench

bench type write io_size 4096 io_threads 16 bytes 1073741824 pattern sequential
SEC OPS OPS/SEC BYTES/SEC
1 30789 30430.65 124643946.07
2 58828 29362.18 120267499.80
3 85629 28537.63 116890129.29
4 112631 28161.99 115351497.18
5 138629 27729.13 113578533.54
6 165328 26941.31 110351610.64
7 190118 26254.89 107540013.63
8 215311 25942.24 106259420.45
9 241134 25700.58 105269582.61
elapsed: 10 ops: 262144 ops/sec: 25916.28 bytes/sec: 106153084.70
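
Note that this run does 4K sequential writes with 16 threads, which Bluestore can coalesce nicely; random-pattern and read runs would paint a fuller picture. Assuming the same image and an rbd version that supports the pattern/size flags, something like:

rbd bench --io-type write --io-pattern rand --io-size 4096 --io-threads 16 image01 --pool=rbdbench
rbd bench --io-type read --io-pattern rand --io-size 4096 --io-threads 16 image01 --pool=rbdbench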


############
Somewhat poor performance in my test VM (Windows 7 with VirtIO for disk and net), measured with:

CrystalDiskMark 5.1.2 x64 (C) 2007-2016 hiyohiyo
Crystal Dew World :
-----------------------------------------------------------------------
* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]
* KB = 1000 bytes, KiB = 1024 bytes

Sequential Read (Q= 32,T= 1) : 522.022 MB/s
Sequential Write (Q= 32,T= 1) : 32.805 MB/s
Random Read 4KiB (Q= 32,T= 1) : 113.093 MB/s [ 27610.6 IOPS]
Random Write 4KiB (Q= 32,T= 1) : 2.267 MB/s [ 553.5 IOPS]
Sequential Read (T= 1) : 439.588 MB/s
Sequential Write (T= 1) : 21.169 MB/s
Random Read 4KiB (Q= 1,T= 1) : 5.258 MB/s [ 1283.7 IOPS]
Random Write 4KiB (Q= 1,T= 1) : 0.190 MB/s [ 46.4 IOPS]

Test : 2048 MiB [C: 55.6% (111.1/199.9 GiB)] (x4) [Interval=5 sec]
Date : 2018/07/16 9:18:24
OS : Windows 7 Professional SP1 [6.1 Build 7601] (x64)
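
The QD1 random-write number is the painful one, but it is expected on spinner-backed Ceph: every single write must be acknowledged by 3 replicas across the network before the guest sees it complete. One common mitigation (at the cost of a small data-loss window on a crash) is writeback caching on the virtual disk; assuming this is the VM from the backup log below, something like:

qm set 145 --virtio0 ceph_pool1_vm:vm-145-disk-2,cache=writeback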

####################

Really poor vzdump performance:
INFO: starting new backup job: vzdump 145 --remove 0 --storage NFSSTORAGE --node xxxxxxxx --compress lzo --mode snapshot
INFO: Starting Backup of VM 145 (qemu)
INFO: status = running
INFO: update VM 145: -lock backup
INFO: VM Name: NameOfVM
INFO: include disk 'ide3' 'ceph_pool1_vm:vm-145-disk-3' 32G
INFO: include disk 'virtio0' 'ceph_pool1_vm:vm-145-disk-2' 200G
INFO: include disk 'virtio1' 'ceph_pool1_vm:vm-145-disk-1' 1G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/mnt/pve/NFSSTORAGE/dump/vzdump-qemu-145-2018_07_19-09_01_38.vma.lzo'
INFO: started backup task '51e7047d-e8d6-4bf9-84cf-e6f1027db6c1'
INFO: status: 1% (2541748224/250181844992), sparse 0% (4984832), duration 41, read/write 55/55 MB/s
INFO: status: 2% (5049942016/250181844992), sparse 0% (233963520), duration 84, read/write 58/53 MB/s

vzdump jobs from other Proxmox machines can write 300 MB/s to the same NFSSTORAGE.

-----------------
atop -d while a vzdump is running

PRC | sys 0.38s | user 0.56s | #proc 319 | #trun 1 | #tslpi 724 | #tslpu 0 | #zombie 0 | #exit 32 |
CPU | sys 4% | user 5% | irq 1% | idle 767% | wait 24% | guest 0% | curf 2.70GHz | curscal 77% |
cpu | sys 1% | user 1% | irq 0% | idle 93% | cpu007 w 5% | guest 0% | curf 2.83GHz | curscal 80% |
cpu | sys 1% | user 1% | irq 0% | idle 97% | cpu000 w 2% | guest 0% | curf 2.30GHz | curscal 65% |
cpu | sys 1% | user 1% | irq 0% | idle 94% | cpu001 w 5% | guest 0% | curf 2.83GHz | curscal 80% |
cpu | sys 1% | user 1% | irq 0% | idle 95% | cpu006 w 4% | guest 0% | curf 2.69GHz | curscal 76% |
cpu | sys 0% | user 1% | irq 0% | idle 97% | cpu002 w 2% | guest 0% | curf 2.57GHz | curscal 73% |
cpu | sys 0% | user 1% | irq 0% | idle 97% | cpu005 w 2% | guest 0% | curf 2.98GHz | curscal 85% |
cpu | sys 0% | user 0% | irq 0% | idle 96% | cpu003 w 3% | guest 0% | curf 2.40GHz | curscal 68% |
cpu | sys 0% | user 1% | irq 0% | idle 97% | cpu004 w 2% | guest 0% | curf 3.03GHz | curscal 86% |
CPL | avg1 0.45 | avg5 0.32 | avg15 0.21 | | csw 40579 | intr 25455 | | numcpu 8 |
MEM | tot 62.8G | free 50.9G | cache 217.8M | buff 22.6M | slab 196.8M | shmem 73.7M | vmbal 0.0M | hptot 0.0M |
SWP | tot 3.6G | free 3.6G | | | | | vmcom 19.5G | vmlim 35.0G |
LVM | pve-root | busy 6% | read 4 | write 217 | KiB/w 113 | MBr/s 0.0 | MBw/s 2.4 | avio 2.73 ms |
DSK | sda | busy 6% | read 72 | write 58 | KiB/w 5 | MBr/s 2.4 | MBw/s 0.0 | avio 4.80 ms |
DSK | sdi | busy 6% | read 25 | write 159 | KiB/w 155 | MBr/s 0.0 | MBw/s 2.4 | avio 3.28 ms |
DSK | sdh | busy 5% | read 102 | write 24 | KiB/w 5 | MBr/s 2.7 | MBw/s 0.0 | avio 4.19 ms |
DSK | sde | busy 5% | read 39 | write 72 | KiB/w 24 | MBr/s 0.7 | MBw/s 0.2 | avio 4.58 ms |
DSK | sdb | busy 5% | read 50 | write 20 | KiB/w 5 | MBr/s 2.4 | MBw/s 0.0 | avio 7.03 ms |
DSK | sdd | busy 5% | read 56 | write 29 | KiB/w 7 | MBr/s 2.0 | MBw/s 0.0 | avio 5.55 ms |
DSK | sdg | busy 4% | read 59 | write 10 | KiB/w 4 | MBr/s 2.4 | MBw/s 0.0 | avio 6.43 ms |
DSK | sdf | busy 4% | read 63 | write 14 | KiB/w 5 | MBr/s 2.4 | MBw/s 0.0 | avio 5.09 ms |
DSK | sdc | busy 4% | read 22 | write 60 | KiB/w 5 | MBr/s 0.4 | MBw/s 0.0 | avio 4.29 ms |
DSK | sdk | busy 0% | read 20 | write 0 | KiB/w 0 | MBr/s 0.0 | MBw/s 0.0 | avio 0.20 ms |
DSK | sdm | busy 0% | read 20 | write 0 | KiB/w 0 | MBr/s 0.0 | MBw/s 0.0 | avio 0.20 ms |
DSK | sdj | busy 0% | read 20 | write 0 | KiB/w 0 | MBr/s 0.0 | MBw/s 0.0 | avio 0.00 ms |
DSK | sdl | busy 0% | read 20 | write 0 | KiB/w 0 | MBr/s 0.0 | MBw/s 0.0 | avio 0.00 ms |
NFC | rpc 4 | read 0 | write 0 | retxmit 0 | autref 4 | | | |
NET | transport | tcpi 6955 | tcpo 124349 | udpi 1010 | udpo 857 | tcpao 20 | tcppo 13 | tcprs 0 |
NET | network | ipi 7993 | ipo 9878 | ipfrw 0 | deliv 7981 | | icmpi 0 | icmpo 0 |
NET | ens15f1 1% | pcki 9070 | pcko 123656 | sp 10 Gbps | si 4871 Kbps | so 144 Mbps | erri 0 | erro 0 |
NET | ens15f0 0% | pcki 1589 | pcko 1145 | sp 10 Gbps | si 538 Kbps | so 147 Kbps | erri 0 | erro 0 |
NET | lo ---- | pcki 406 | pcko 406 | sp 0 Mbps | si 369 Kbps | so 369 Kbps | erri 0 | erro 0 |
NET | vmbr0 ---- | pcki 1400 | pcko 1144 | sp 0 Mbps | si 511 Kbps | so 147 Kbps | erri 0 | erro 0 |

PID TID RDDSK WRDSK WCANCL DSK CMD 1/7
3592 - 28052K 128K 0K 15% ceph-osd
3996 - 24640K 328K 0K 13% ceph-osd
3404 - 24584K 100K 0K 13% ceph-osd
2983 - 24584K 72K 0K 13% ceph-osd
3218 - 24544K 48K 0K 13% ceph-osd
1828 - 16K 23908K 4K 13% ceph-mon
3100 - 20504K 220K 0K 11% ceph-osd
2862 - 6892K 1784K 0K 5% ceph-osd
4117 - 4100K 316K 0K 2% ceph-osd

The CPU clock sometimes goes up to 3.3 or 3.5 GHz.

IO-wait?
Too many context switches? Too few CPU cores?
No significant network usage (1%).
The HDDs could do more than ~2.5 MB/s.
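
To narrow down whether it is the disks or the CPU, Ceph's own counters are a quick first check, e.g.:

ceph osd perf      # per-OSD commit/apply latency in ms
ceph -s            # overall health plus current client throughput/IOPS
ceph osd df tree   # per-OSD utilization and PG count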
 