Hi, I am building a Ceph cluster with Proxmox 6.1 and I am experiencing low performance. I hope you can help me identify where my bottleneck is.
At this moment I am using 3 nodes, with 5 OSDs per node (all SSD).
Specs per node:
Supermicro FatTwin SYS-F618R2-RT+
128 GB DDR4
1x Intel Xeon E5-1630 v4
5x Intel S3510 800 GB for Ceph, connected to the motherboard SATA ports (single-drive fio sketch below)
1x 80 GB SSD for Proxmox
2x Gigabit NIC (only one used, for WAN)
1x Mellanox MT27500 [ConnectX-3] InfiniBand QDR 40 Gb/s (for Ceph) - MTU 65520
No separate journal device
Ceph network switch: Voltaire 4036
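To check whether the drives themselves could be the limit, I can also benchmark a single S3510 with fio. This is just a sketch: /dev/sdX is a placeholder for a spare, unused disk (the test is destructive on a raw device), and single-threaded O_DSYNC 4K writes are the pattern Ceph write latency depends on:
Code:
# DESTRUCTIVE on /dev/sdX - use a spare disk, never a live OSD
fio --name=ssd-sync-write --filename=/dev/sdX \
    --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based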
# ceph osd pool create scbench 100 100
# rados bench -p scbench 10 write --no-cleanup
Code:
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_ceph-test3_1776125
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 54 38 151.997 152 0.0386031 0.264344
2 16 99 83 165.99 180 0.0197462 0.317243
3 16 142 126 167.989 172 0.0383428 0.335559
4 16 186 170 169.987 176 1.44781 0.341263
5 16 244 228 182.386 232 0.269771 0.321895
6 16 281 265 176.653 148 1.2315 0.339699
7 16 318 302 172.557 148 0.314595 0.34656
8 16 353 337 168.486 140 0.0184961 0.357753
9 16 392 376 167.097 156 0.622394 0.359017
10 16 435 419 167.585 172 0.365768 0.358431
Total time run: 10.504
Total writes made: 436
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 166.032
Stddev Bandwidth: 26.4961
Max bandwidth (MB/sec): 232
Min bandwidth (MB/sec): 140
Average IOPS: 41
Stddev IOPS: 6.62403
Max IOPS: 58
Min IOPS: 35
Average Latency(s): 0.38319
Stddev Latency(s): 0.438058
Max latency(s): 2.14432
Min latency(s): 0.0179365
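For context, my rough math on what this implies per drive, assuming the scbench pool picked up the default size 3 from ceph.conf:
Code:
166 MB/s client writes x 3 replicas ≈ 498 MB/s written cluster-wide
498 MB/s / 15 OSDs                  ≈ 33 MB/s per SSD on average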
# rados bench -p scbench 10 seq
Code:
hints = 1
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 58 42 167.976 168 0.626339 0.265368
2 16 102 86 171.981 176 0.0130061 0.319466
3 16 141 125 166.649 156 0.0164162 0.340917
4 16 186 170 169.983 180 1.46721 0.344456
5 16 244 228 182.383 232 0.529887 0.322279
6 16 280 264 175.983 144 0.248866 0.339829
7 16 320 304 173.698 160 0.0182624 0.3492
8 16 353 337 168.485 132 0.789411 0.362284
9 16 392 376 167.096 156 0.280278 0.363076
10 16 436 420 167.984 176 0.0186207 0.364722
Total time run: 10.5163
Total reads made: 436
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 165.837
Average IOPS: 41
Stddev IOPS: 6.76593
Max IOPS: 58
Min IOPS: 33
Average Latency(s): 0.385245
Max latency(s): 1.81629
Min latency(s): 0.0124946
# rados bench -p scbench 10 rand
Code:
hints = 1
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 66 50 199.97 200 0.441923 0.198354
2 16 101 85 169.98 140 0.207021 0.306616
3 16 142 126 167.981 164 0.562973 0.329745
4 16 178 162 161.983 144 0.00239542 0.365981
5 16 220 204 163.183 168 0.509352 0.353979
6 16 273 257 171.316 212 0.355821 0.34913
7 16 315 299 170.84 168 0.00242466 0.350936
8 16 354 338 168.983 156 0.394233 0.363754
9 16 394 378 167.983 160 0.482949 0.361794
10 16 435 419 167.584 164 0.00235319 0.361498
Total time run: 10.492
Total reads made: 436
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 166.222
Average IOPS: 41
Stddev IOPS: 5.62633
Max IOPS: 53
Min IOPS: 35
Average Latency(s): 0.382806
Max latency(s): 1.91117
Min latency(s): 0.0022631
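If small-block numbers would help, I can also rerun the write benchmark with 4K objects instead of the default 4M (sketch, same pool):
Code:
rados bench -p scbench 10 write -b 4096 --no-cleanup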
# pveversion -v
Code:
proxmox-ve: 6.1-2 (running kernel: 5.3.18-1-pve)
pve-manager: 6.1-7 (running version: 6.1-7/13e58d5e)
pve-kernel-5.3: 6.1-4
pve-kernel-helper: 6.1-4
pve-kernel-5.3.18-1-pve: 5.3.18-1
pve-kernel-5.3.13-2-pve: 5.3.13-2
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph: 14.2.6-pve1
ceph-fuse: 14.2.6-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.14-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-12
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-4
libpve-storage-perl: 6.1-4
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-19
pve-docs: 6.1-4
pve-edk2-firmware: 2.20191127-1
pve-firewall: 4.0-10
pve-firmware: 3.0-5
pve-ha-manager: 3.0-8
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-5
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
# cat /etc/pve/ceph.conf
Code:
[global]
        auth_client_required = cephx
        auth_cluster_required = cephx
        auth_service_required = cephx
        cluster_network = 10.12.12.0/24
        fsid = 459c3e1d-06bc-4525-9f95-3e8fb62e2d77
        mon_allow_pool_delete = true
        mon_host = 185.47.xxx.23 185.47.xxx.25 185.47.xxx.27
        osd_pool_default_min_size = 2
        osd_pool_default_size = 3
        public_network = 185.47.xxx.0/24

[client]
        keyring = /etc/pve/priv/$cluster.$name.keyring
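Not sure whether it matters, but public_network sits on the 1 GbE subnet while only cluster_network uses InfiniBand. To confirm which addresses each OSD actually binds to, I believe this lists both per OSD:
Code:
# each osd.N line shows its public and cluster address
ceph osd dump | grep "^osd\."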
# cat /etc/network/interfaces
Code:
auto lo
iface lo inet loopback

iface eno1 inet manual

auto eno1.200
iface eno1.200 inet manual
        vlan-raw-device eno1

auto vmbr0
iface vmbr0 inet static
        address 185.47.xxx.23
        netmask 255.255.255.0
        gateway 185.47.xxx.1
        bridge_ports eno1.200
        bridge_stp off
        bridge_fd 0

iface eno2 inet manual

auto ibs7
iface ibs7 inet static
        address 10.12.12.13
        netmask 255.255.255.0
        pre-up modprobe ib_ipoib
        pre-up modprobe mlx4_ib
        pre-up modprobe ib_umad
        pre-up echo connected > /sys/class/net/ibs7/mode
        mtu 65520
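To make sure connected mode and the large MTU really apply after boot, I verify with:
Code:
cat /sys/class/net/ibs7/mode   # should print "connected"
ip link show ibs7              # should report mtu 65520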
Switch Voltaire 4036:
4036-46EC# module-firmware show
Code:
Module No. Type Node GUID LID FW Version SW Version
---------- ---- --------- --- ---------- ----------
4036/2036 3.9.1-985
---------
CPLD 1 0xa
IS4 1 0x0008f105002046ec 0 7.4.2200 VLT1210032201
Infiniband card:
# ibstat
Code:
CA 'mlx4_0'
        CA type: MT4099
        Number of ports: 1
        Firmware version: 2.33.5100
        Hardware version: 1
        Node GUID: 0x002590ffff907508
        System image GUID: 0x002590ffff90750b
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 2
                LMC: 0
                SM lid: 2
                Capability mask: 0x0251486a
                Port GUID: 0x002590ffff907509
                Link layer: InfiniBand
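To rule out the fabric itself, raw RDMA bandwidth between two nodes can be measured with ib_write_bw from the perftest package (a sketch; 10.12.12.13 is the target node's IPoIB address, as above):
Code:
# on the target node (server side)
ib_write_bw
# on another node (client side), pointing at the server's IPoIB address
ib_write_bw 10.12.12.13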
# iperf -c 10.12.12.13
Code:
------------------------------------------------------------
Client connecting to 10.12.12.13, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[ 3] local 10.12.12.15 port 58698 connected with 10.12.12.13 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 31.6 GBytes 27.2 Gbits/sec
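Since a single IPoIB stream can be limited by one CPU core, I can also retest with parallel streams (same iperf as above):
Code:
iperf -c 10.12.12.13 -P 4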
Any ideas? Any help would be appreciated. Thank you in advance.