small 3-node Ceph BlueStore cluster: how to use the NVMe? Config recommendations?

high_performer
Jul 16, 2018
Hi folks,
we have 3 nodes:
each node: 10 GbE network
each node: 8 enterprise spinners (4 TB each)
each node: 1 enterprise NVMe (1 TB)
each node: 64 GB RAM
each node: 4-core CPU -> 8 threads, up to 3.2 GHz
pveperf of the CPU:
CPU BOGOMIPS: 47999.28
REGEX/SECOND: 2721240
each node: latest Proxmox, of course
each node: 2 more disk slots available
each node's OS is on a SuperDOM SSD
pveperf of the SuperDOM:
BUFFERED READS: 247.56 MB/sec
AVERAGE SEEK TIME: 0.11 ms
FSYNCS/SECOND: 322.70

We want to use BlueStore, with size 3 / min_size 2 on the pools.
I know - it's a small system, spinners are slow, latencies are high, etc.
But how can I get the best Ceph performance out of it?

Use the NVMe as a "normal" OSD with its WAL on itself (same as the HDDs)?
Use the NVMe as a WAL/DB device for the HDDs? (Problem: if the NVMe dies, are all 8 HDD-OSDs of that node lost?) See the sketch after this list.
Is it worth adding 2 more SSDs per node for the WAL (so 1 SSD per 4 HDDs), or should we use them as OSDs as well?
Do we need more RAM?
1 monitor is running on each node (I could move them to other machines, if recommended).
Would a fourth node give considerably better performance?
Still no jumbo frames - is that really bad?
And still no link aggregation, but it would be possible.
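
A minimal sketch of the WAL/DB-on-NVMe variant with ceph-volume under Luminous - the device names and DB partition sizes are assumptions, not taken from this setup:

# one DB partition per HDD-OSD on the NVMe, e.g. ~100 GB each for 8 OSDs on a 1 TB device
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1
ceph-volume lvm create --bluestore --data /dev/sdc --block.db /dev/nvme0n1p2
# ...repeat for the remaining HDDs
# without a separate --block.wal, the WAL is placed on the (faster) DB device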

If I start rados bench and watch the node with atop, the NIC goes up to 50% utilization and several disks vary up to 90% usage - keeping in mind that the WAL is on the same disks.
But the WAL is "small" - and BlueStore in Luminous is supposed to handle the WAL on the same disk!?

Would CRUSH take the different device classes (hdd, ssd, nvme) into consideration when optimizing?
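
For reference: CRUSH itself only places data by weight and hierarchy, but since Luminous a pool can be pinned to a device class via a replicated rule - a sketch with made-up rule/pool names:

ceph osd crush rule create-replicated nvme-rule default host nvme
ceph osd pool set mypool crush_rule nvme-rule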


Recommendations welcome.
 
Update
some performance tests

pveperf
CPU BOGOMIPS: 47995.04
REGEX/SECOND: 3133624
HD SIZE: 7.07 GB (/dev/mapper/pve-root)
BUFFERED READS: 235.98 MB/sec
AVERAGE SEEK TIME: 0.10 ms
FSYNCS/SECOND: 322.44
DNS EXT: 90.98 ms
DNS INT: 0.49 ms
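
For reference, rados bench runs of this kind look like the following - the pool name and runtime here are placeholders, and the write pass needs --no-cleanup so the read tests have objects to read:

rados bench -p testpool 10 write --no-cleanup
rados bench -p testpool 10 seq
rados bench -p testpool 10 rand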




rados bench write:
Total time run: 10.115603
Total writes made: 1088
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 430.226
Stddev Bandwidth: 16.7013
Max bandwidth (MB/sec): 448
Min bandwidth (MB/sec): 400
Average IOPS: 107
Stddev IOPS: 4
Max IOPS: 112
Min IOPS: 100
Average Latency(s): 0.148661
Stddev Latency(s): 0.0525165
Max latency(s): 0.421106
Min latency(s): 0.0546044



rados bench read seq:
Total time run: 3.660390
Total reads made: 1088
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1188.94
Average IOPS: 297
Stddev IOPS: 30
Max IOPS: 318
Min IOPS: 261
Average Latency(s): 0.0530154
Max latency(s): 0.31674
Min latency(s): 0.0198092



rados bench read rand:
Total time run: 10.059865
Total reads made: 3847
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1529.64
Average IOPS: 382
Stddev IOPS: 15
Max IOPS: 402
Min IOPS: 352
Average Latency(s): 0.0412249
Max latency(s): 0.202156
Min latency(s): 0.00134031



block device bench:
rbd bench --io-type write image01 --pool=rbdbench

bench type write io_size 4096 io_threads 16 bytes 1073741824 pattern sequential
SEC OPS OPS/SEC BYTES/SEC
1 30789 30430.65 124643946.07
2 58828 29362.18 120267499.80
3 85629 28537.63 116890129.29
4 112631 28161.99 115351497.18
5 138629 27729.13 113578533.54
6 165328 26941.31 110351610.64
7 190118 26254.89 107540013.63
8 215311 25942.24 106259420.45
9 241134 25700.58 105269582.61
elapsed: 10 ops: 262144 ops/sec: 25916.28 bytes/sec: 106153084.70
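
This run used the default sequential pattern; for numbers closer to a VM workload, the random pattern variant of the same tool might be worth a run:

rbd bench --io-type write --io-pattern rand image01 --pool=rbdbench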


############
Somewhat poor performance in my test VM (Win7 with VirtIO for disk and net), measured with:

CrystalDiskMark 5.1.2 x64 (C) 2007-2016 hiyohiyo
Crystal Dew World :
-----------------------------------------------------------------------
* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]
* KB = 1000 bytes, KiB = 1024 bytes

Sequential Read (Q= 32,T= 1) : 522.022 MB/s
Sequential Write (Q= 32,T= 1) : 32.805 MB/s
Random Read 4KiB (Q= 32,T= 1) : 113.093 MB/s [ 27610.6 IOPS]
Random Write 4KiB (Q= 32,T= 1) : 2.267 MB/s [ 553.5 IOPS]
Sequential Read (T= 1) : 439.588 MB/s
Sequential Write (T= 1) : 21.169 MB/s
Random Read 4KiB (Q= 1,T= 1) : 5.258 MB/s [ 1283.7 IOPS]
Random Write 4KiB (Q= 1,T= 1) : 0.190 MB/s [ 46.4 IOPS]

Test : 2048 MiB [C: 55.6% (111.1/199.9 GiB)] (x4) [Interval=5 sec]
Date : 2018/07/16 9:18:24
OS : Windows 7 Professional SP1 [6.1 Build 7601] (x64)

####################

Really poor vzdump performance:
INFO: starting new backup job: vzdump 145 --remove 0 --storage NFSSTORAGE --node xxxxxxxx --compress lzo --mode snapshot
INFO: Starting Backup of VM 145 (qemu)
INFO: status = running
INFO: update VM 145: -lock backup
INFO: VM Name: NameOfVM
INFO: include disk 'ide3' 'ceph_pool1_vm:vm-145-disk-3' 32G
INFO: include disk 'virtio0' 'ceph_pool1_vm:vm-145-disk-2' 200G
INFO: include disk 'virtio1' 'ceph_pool1_vm:vm-145-disk-1' 1G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/mnt/pve/NFSSTORAGE/dump/vzdump-qemu-145-2018_07_19-09_01_38.vma.lzo'
INFO: started backup task '51e7047d-e8d6-4bf9-84cf-e6f1027db6c1'
INFO: status: 1% (2541748224/250181844992), sparse 0% (4984832), duration 41, read/write 55/55 MB/s
INFO: status: 2% (5049942016/250181844992), sparse 0% (233963520), duration 84, read/write 58/53 MB/s

vzdump jobs from other Proxmox machines can write with 300 MB/s to the same NFSSTORAGE.
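
vzdump reads the source disks with little parallelism, so a single-threaded sequential bench should show whether per-request Ceph read latency is the limit - a sketch (pool name taken from the backup log above; the bench writes objects into that pool, so a scratch pool would be cleaner):

rados bench -p ceph_pool1_vm 30 write --no-cleanup -t 1
rados bench -p ceph_pool1_vm 30 seq -t 1
rados -p ceph_pool1_vm cleanup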

-----------------
atop -d while a vzdump is running

PRC | sys 0.38s | user 0.56s | #proc 319 | #trun 1 | #tslpi 724 | #tslpu 0 | #zombie 0 | #exit 32 |
CPU | sys 4% | user 5% | irq 1% | idle 767% | wait 24% | guest 0% | curf 2.70GHz | curscal 77% |
cpu | sys 1% | user 1% | irq 0% | idle 93% | cpu007 w 5% | guest 0% | curf 2.83GHz | curscal 80% |
cpu | sys 1% | user 1% | irq 0% | idle 97% | cpu000 w 2% | guest 0% | curf 2.30GHz | curscal 65% |
cpu | sys 1% | user 1% | irq 0% | idle 94% | cpu001 w 5% | guest 0% | curf 2.83GHz | curscal 80% |
cpu | sys 1% | user 1% | irq 0% | idle 95% | cpu006 w 4% | guest 0% | curf 2.69GHz | curscal 76% |
cpu | sys 0% | user 1% | irq 0% | idle 97% | cpu002 w 2% | guest 0% | curf 2.57GHz | curscal 73% |
cpu | sys 0% | user 1% | irq 0% | idle 97% | cpu005 w 2% | guest 0% | curf 2.98GHz | curscal 85% |
cpu | sys 0% | user 0% | irq 0% | idle 96% | cpu003 w 3% | guest 0% | curf 2.40GHz | curscal 68% |
cpu | sys 0% | user 1% | irq 0% | idle 97% | cpu004 w 2% | guest 0% | curf 3.03GHz | curscal 86% |
CPL | avg1 0.45 | avg5 0.32 | avg15 0.21 | | csw 40579 | intr 25455 | | numcpu 8 |
MEM | tot 62.8G | free 50.9G | cache 217.8M | buff 22.6M | slab 196.8M | shmem 73.7M | vmbal 0.0M | hptot 0.0M |
SWP | tot 3.6G | free 3.6G | | | | | vmcom 19.5G | vmlim 35.0G |
LVM | pve-root | busy 6% | read 4 | write 217 | KiB/w 113 | MBr/s 0.0 | MBw/s 2.4 | avio 2.73 ms |
DSK | sda | busy 6% | read 72 | write 58 | KiB/w 5 | MBr/s 2.4 | MBw/s 0.0 | avio 4.80 ms |
DSK | sdi | busy 6% | read 25 | write 159 | KiB/w 155 | MBr/s 0.0 | MBw/s 2.4 | avio 3.28 ms |
DSK | sdh | busy 5% | read 102 | write 24 | KiB/w 5 | MBr/s 2.7 | MBw/s 0.0 | avio 4.19 ms |
DSK | sde | busy 5% | read 39 | write 72 | KiB/w 24 | MBr/s 0.7 | MBw/s 0.2 | avio 4.58 ms |
DSK | sdb | busy 5% | read 50 | write 20 | KiB/w 5 | MBr/s 2.4 | MBw/s 0.0 | avio 7.03 ms |
DSK | sdd | busy 5% | read 56 | write 29 | KiB/w 7 | MBr/s 2.0 | MBw/s 0.0 | avio 5.55 ms |
DSK | sdg | busy 4% | read 59 | write 10 | KiB/w 4 | MBr/s 2.4 | MBw/s 0.0 | avio 6.43 ms |
DSK | sdf | busy 4% | read 63 | write 14 | KiB/w 5 | MBr/s 2.4 | MBw/s 0.0 | avio 5.09 ms |
DSK | sdc | busy 4% | read 22 | write 60 | KiB/w 5 | MBr/s 0.4 | MBw/s 0.0 | avio 4.29 ms |
DSK | sdk | busy 0% | read 20 | write 0 | KiB/w 0 | MBr/s 0.0 | MBw/s 0.0 | avio 0.20 ms |
DSK | sdm | busy 0% | read 20 | write 0 | KiB/w 0 | MBr/s 0.0 | MBw/s 0.0 | avio 0.20 ms |
DSK | sdj | busy 0% | read 20 | write 0 | KiB/w 0 | MBr/s 0.0 | MBw/s 0.0 | avio 0.00 ms |
DSK | sdl | busy 0% | read 20 | write 0 | KiB/w 0 | MBr/s 0.0 | MBw/s 0.0 | avio 0.00 ms |
NFC | rpc 4 | read 0 | write 0 | retxmit 0 | autref 4 | | | |
NET | transport | tcpi 6955 | tcpo 124349 | udpi 1010 | udpo 857 | tcpao 20 | tcppo 13 | tcprs 0 |
NET | network | ipi 7993 | ipo 9878 | ipfrw 0 | deliv 7981 | | icmpi 0 | icmpo 0 |
NET | ens15f1 1% | pcki 9070 | pcko 123656 | sp 10 Gbps | si 4871 Kbps | so 144 Mbps | erri 0 | erro 0 |
NET | ens15f0 0% | pcki 1589 | pcko 1145 | sp 10 Gbps | si 538 Kbps | so 147 Kbps | erri 0 | erro 0 |
NET | lo ---- | pcki 406 | pcko 406 | sp 0 Mbps | si 369 Kbps | so 369 Kbps | erri 0 | erro 0 |
NET | vmbr0 ---- | pcki 1400 | pcko 1144 | sp 0 Mbps | si 511 Kbps | so 147 Kbps | erri 0 | erro 0 |

PID TID RDDSK WRDSK WCANCL DSK CMD 1/7
3592 - 28052K 128K 0K 15% ceph-osd
3996 - 24640K 328K 0K 13% ceph-osd
3404 - 24584K 100K 0K 13% ceph-osd
2983 - 24584K 72K 0K 13% ceph-osd
3218 - 24544K 48K 0K 13% ceph-osd
1828 - 16K 23908K 4K 13% ceph-mon
3100 - 20504K 220K 0K 11% ceph-osd
2862 - 6892K 1784K 0K 5% ceph-osd
4117 - 4100K 316K 0K 2% ceph-osd

The CPU clock sometimes goes up to 3.3 or 3.5 GHz.

IO wait?
Too many context switches? Too few CPU cores?
No significant network usage (1%).
The HDDs could do more than ~2.5 MB/s.
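
To check whether individual slow OSDs/HDDs are the brake, the per-OSD latencies can be watched while the backup runs - plain status commands, nothing cluster-specific assumed:

ceph osd perf    # commit/apply latency per OSD in ms
ceph osd tree    # which OSD sits on which host / device class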
 