I have a cluster of 6 nodes, each containing 8x Intel SSDSC2BB016T7R SSDs, for a total of 48 OSDs. Each node has 384GB RAM and 40 logical CPUs. For some reason, this cluster's performance is really low compared to my other deployments; deploying the GitLab template took well over 5 minutes:
Code:
extracting archive '/mnt/pve/template/template/cache/debian-8-turnkey-gitlab_14.2-1_amd64.tar.gz'
Total bytes read: 2216007680 (2.1GiB, 6.4MiB/s)
To eliminate the template source as the bottleneck, I executed tar -tf on the template file:
Code:
time tar tf /mnt/pve/template/template/cache/debian-8-turnkey-gitlab_14.2-1_amd64.tar.gz
real 0m14.582s
user 0m17.900s
sys 0m5.854s
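Since tar -tf has to gunzip the whole archive (note that user CPU time above exceeds wall time), a raw sequential read would separate storage throughput from decompression cost. A minimal sketch, using iflag=direct to bypass the page cache:
Code:
time dd if=/mnt/pve/template/template/cache/debian-8-turnkey-gitlab_14.2-1_amd64.tar.gz of=/dev/null bs=4M iflag=direct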
Next, I benchmarked the setup using rados bench:
Code:
rados bench -p scbench 10 seq
hints = 1
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 177 161 643.884 644 0.070158 0.0905813
2 16 344 328 655.885 668 0.0706853 0.0922631
3 16 508 492 655.887 656 0.0812616 0.094763
4 16 684 668 667.891 704 0.0619202 0.0935216
5 16 842 826 660.697 632 0.0662235 0.0950198
6 16 1010 994 662.567 672 0.132235 0.0950342
Total time run: 6.819637
Total reads made: 1134
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 665.138
Average IOPS: 166
Stddev IOPS: 6
Max IOPS: 176
Min IOPS: 158
Average Latency(s): 0.095415
Max latency(s): 0.477642
Min latency(s): 0.0129592
The bandwidth looks reasonable, but the IOPS are roughly a tenth of what my other clusters achieve.
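Since seq reads of 4MB objects mostly measure bandwidth, a small-block run along these lines should expose the IOPS gap more directly (a sketch against the same scbench test pool; duration and thread count are arbitrary):
Code:
# write 4K objects first so there is something to read back
rados bench -p scbench 60 write -b 4096 -t 16 --no-cleanup
# random 4K reads against those objects
rados bench -p scbench 60 rand -t 16
# remove the benchmark objects afterwards
rados -p scbench cleanup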
OSD latency is 1ms or less.
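Per-OSD commit/apply latency can be double-checked at any time with:
Code:
ceph osd perf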
Code:
pveversion -v
proxmox-ve: 5.1-42 (running kernel: 4.15.15-1-pve)
pve-manager: 5.1-49 (running version: 5.1-49/1e427a54)
pve-kernel-4.13: 5.1-44
pve-kernel-4.15: 5.1-3
pve-kernel-4.15.15-1-pve: 4.15.15-6
pve-kernel-4.13.16-2-pve: 4.13.16-47
ceph: 12.2.4-pve1
corosync: 2.4.2-pve3
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-4
libpve-common-perl: 5.0-30
libpve-guest-common-perl: 2.0-14
libpve-http-server-perl: 2.0-8
libpve-storage-perl: 5.0-18
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-2
lxcfs: 3.0.0-1
novnc-pve: 0.6-4
openvswitch-switch: 2.7.0-2
proxmox-widget-toolkit: 1.0-14
pve-cluster: 5.0-24
pve-container: 2.0-21
pve-docs: 5.1-17
pve-firewall: 3.0-7
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-4
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.1-5
pve-xtermjs: 1.0-2
qemu-server: 5.0-24
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.7-pve1~bpo9
Code:
ceph.conf
[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 10.18.21.0/24
fsid = a4b0bc0a-cf15-44f3-8410-f3816c155685
keyring = /etc/pve/priv/$cluster.$name.keyring
mon allow pool delete = true
osd journal size = 5120
osd pool default min size = 2
osd pool default size = 3
public network = 10.18.21.0/24
[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring
[mon.sky36]
host = sky36
mon addr = 10.18.21.36:6789
[mon.sky32]
host = sky32
mon addr = 10.18.21.32:6789
[mon.sky31]
host = sky31
mon addr = 10.18.21.31:6789
[mon.sky33]
host = sky33
mon addr = 10.18.21.33:6789
[client]
rbd cache = true
rbd cache size = 268435456
rbd cache max dirty = 134217728
rbd cache max dirty age = 5
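One more thought: rados bench talks to RADOS directly, so the [client] rbd cache settings above never come into play during it. To exercise the same librbd path the guests use, something like rbd bench on a throwaway image might be more telling (bench-test is just a hypothetical scratch image):
Code:
rbd create scbench/bench-test --size 10G
rbd bench --io-type write scbench/bench-test --io-size 4096 --io-pattern rand --io-total 1G
rbd rm scbench/bench-test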
What am I missing?