Very Poor Ceph Performance

Jun 26, 2024
  • Hosts
    • 3x HP DL380 Gen8's
    • Each with 2x Intel Xeon CPU E5-2650 v2 @ 2.60GHz
    • Each with 128GB RAM
  • Networking
    • Ceph cluster network: 10Gbps direct attach (host-to-host)
    • Ceph public network: 1Gbps via a switch
  • Disks
    • "VMs" pool
      • 36x [12 per host] 900GB 10k SAS spinners (HP EG0900FBVFQ) for data
      • 18x [6 per host] 400GB enterprise SSD (Hitachi SSD400M) for DB and WAL
      • Each SSD performs DB and WAL operations for two HDDs
    • "SSD_ONLY" test pool
      • 3x [1 per host] 400GB enterprise SSD (Hitachi SSD400M)
  • Bench tests
    • Testing with "rados bench" renders >600MBps read and write
    • Testing on a Linux VM with "dd" renders 70-120MBps read, ~140MBps write
    • Testing with CrystalDisk on Windows gives ~140MBps read and 10-20MBps write
      • Task manager shows 100% active time
      • 500-900ms average response time when performing the write tests; sometimes this spikes to 5000-7000ms!
      • 75-150ms average response time when performing the read tests
I've been troubleshooting poor disk performance on this cluster [on and off] for months now. The Windows VMs are practically unusable when on the main "VMs" storage pool. I've tried playing with settings on the VM disks and storage adapters and none of them make any difference.

What troubleshooting steps can I take to help narrow down this problem?
 
This is most likely related to having only 1Gbit/s on the Ceph public network and only 10Gbit/s on the Ceph cluster network. These days it's no longer best practice to separate those services; I would recommend using the 10Gbit bond for both the Ceph public and cluster networks.

Edit: at 10Gbit, separating the public and cluster networks is fine, but above 10Gbit it's not worth it (switch port costs, etc.).
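
For reference, combining both Ceph networks on the 10G links roughly comes down to pointing both options at the same subnet in the Ceph config. The 10.10.10.0/24 subnet below is just a placeholder for whatever your 10G links actually use:

Code:
# /etc/pve/ceph.conf (excerpt) - 10.10.10.0/24 is a placeholder for the
# subnet of the 10G links; the monitors must also be reachable on it
[global]
    cluster_network = 10.10.10.0/24
    public_network = 10.10.10.0/24

After changing these, the OSDs (and the monitors, if their addresses move) need to be restarted for the new networks to take effect.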
 
Hello,

As mentioned already, having a 1G network for either Ceph network will heavily bottleneck the cluster.

These days it's no longer best practice to separate those services
I can see a use for putting the `cluster_network` and `public_network` Ceph networks on separate NICs if you only have two 10G cards. Our Ceph benchmarks [1] show how quickly a single 10G NIC can be overwhelmed by Ceph. But in general, one shouldn't use separate networks without a clear reason for it.


Regarding the disks (SSD400M), I didn't find any info on whether they have power-loss protection (PLP) or not. This is the most important factor for reliable storage performance; see [2].


[1] https://www.proxmox.com/images/download/pve/docs/Proxmox-VE-Ceph-Benchmark-202312-rev0.pdf
[2] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_system_requirements
 
1] dd tests are useless; use fio instead. In any case, please post the benchmark results each time.
2] Use the 1Gbit link for VM traffic and move the public_network to 10G (if possible, since the public_network is usually also the Proxmox management network), or get 10G switches (eBay, etc.).
3] How is the backplane connected to the controller(s)? I count 18 disks per server. I have no experience with these G8s, but:
- if you don't have a pure HBA card, you take a performance hit from the Pxxx Smart Array cards
- if the backplane is multiplexed between it and the controller, that's another problem - for example, the 10x 2.5" DL360 G8 backplane is multiplexed, and the 12x 3.5" DL380 G8 is multiplexed
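
To answer point 3, something along these lines should show how the disks are attached. ssacli is the HP Smart Array CLI and may not be installed; the grep pattern is just a guess at the relevant controllers:

Code:
# Identify the storage controller(s)
lspci | grep -iE 'raid|sas|storage'
# Show each disk's model, whether it spins (ROTA) and its transport (TRAN)
lsblk -o NAME,MODEL,SIZE,ROTA,TRAN
# On HP Smart Array controllers, ssacli (if installed) shows whether the
# drives sit behind a RAID personality or are exposed in HBA mode
ssacli ctrl all show config detail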
 
So I checked and found that I am using my 10G links for both public and cluster traffic; apologies for the mistake in my original write-up.

I tried fio for testing inside the Linux VM and it gave very similar results to Windows: about 10MBps. I also ran some rados bench tests and watched network traffic on the 10G links while they ran. The results looked like this:
  • rados bench: Consistent 585MBps writes (sometimes bumps above 600MBps). Network traffic sits steady at about 3.5Gbps
  • fio on a Linux VM: Starts at over 100MBps and quickly drops to 10MBps, stays consistent there
Another thing I tested was running iperf between hosts using the Ceph cluster IPs. There I get a solid 9.5Gbps for as long as the tests run.
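
For completeness, these are roughly the commands behind the numbers above; the pool name, run length, and the use of iperf3 are assumptions on my part:

Code:
# Cluster-level benchmark against the "VMs" pool: 60s of writes, then a
# sequential read pass over the objects kept by --no-cleanup
rados bench -p VMs 60 write --no-cleanup
rados bench -p VMs 60 seq
rados -p VMs cleanup

# Raw throughput between two hosts on the Ceph cluster IPs
iperf3 -s                               # on host A
iperf3 -c <host-A-cluster-ip> -t 60     # on host B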

What can explain the astronomical difference between the two types of tests I'm running?
 

Attachments

  • fio.jpg
  • net.png
Hello,

Have you measured the disks used by the OSDs directly? When measuring from within the VM there are multiple layers of overhead, so you need a baseline. Make sure that you benchmark the disks for 10 minutes.
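
If it helps, a minimal sketch of such a direct-device test is below. It is destructive, so only point it at a spare disk or at an OSD that has already been taken out and destroyed; /dev/sdX is a placeholder:

Code:
# 4k sync random writes straight to the block device for 10 minutes,
# bypassing the page cache - this is the baseline the VM numbers sit on top of
fio --name=osd-baseline --filename=/dev/sdX \
    --ioengine=libaio --direct=1 --sync=1 \
    --rw=randwrite --bs=4k --numjobs=1 --iodepth=1 \
    --runtime=600 --time_based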

Nevertheless, you can try using KRBD (you can set this on the RBD storage in the web UI at Datacenter->Storage->{The RBD Storage}), or set the VMs to use the writeback cache from their Hard Disk options in the web UI at Datacenter->{Node}->{Guest}->Hardware->{Hard Disk}.
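
Both settings can also be applied from the CLI; the storage name, VM ID, and disk below are only examples:

Code:
# Let the host kernel map the RBD images instead of librbd inside QEMU
pvesm set my-rbd-storage --krbd 1
# Re-specify the disk with the writeback cache mode
qm set 100 --scsi0 my-rbd-storage:vm-100-disk-0,cache=writeback

A VM typically has to be fully stopped and started again before the KRBD change applies to its disks.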
 
Below are three tests I ran on three different storage systems. All tests were run from the same LXC container on the same compute host; I just migrated the LXC disk to the different storage locations. The SSDs involved in each test are the same model.

The fact that the ZFS-based test gives much better performance makes me think I might have some kind of Ceph performance issue. What do you all think?

#1 - "VMs" pool [36x (12 per host) 900GB 10k SAS spinners, 18x (6 per host) 400GB enterprise SSD]

Code:
root@DISKTEST:~#
root@DISKTEST:~#
root@DISKTEST:~# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.28
Starting 1 process
random-write: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [F(1)][100.0%][eta 00m:00s]                         
random-write: (groupid=0, jobs=1): err= 0: pid=792: Tue Jul  9 13:25:03 2024
  write: IOPS=3772, BW=14.7MiB/s (15.5MB/s)(1184MiB/80360msec); 0 zone resets
    slat (nsec): min=854, max=265835, avg=4021.27, stdev=4242.24
    clat (nsec): min=313, max=282106k, avg=192360.71, stdev=1540685.34
     lat (usec): min=12, max=282112, avg=196.38, stdev=1540.73
    clat percentiles (nsec):
     |  1.00th=[     724],  5.00th=[   13376], 10.00th=[   14016],
     | 20.00th=[   14656], 30.00th=[   15296], 40.00th=[   15808],
     | 50.00th=[   16320], 60.00th=[   17536], 70.00th=[   20352],
     | 80.00th=[   24960], 90.00th=[   30080], 95.00th=[   37632],
     | 99.00th=[ 9109504], 99.50th=[10682368], 99.90th=[16449536],
     | 99.95th=[19529728], 99.99th=[34865152]
   bw (  KiB/s): min=11200, max=184824, per=100.00%, avg=20258.29, stdev=27919.67, samples=119
   iops        : min= 2800, max=46206, avg=5064.59, stdev=6979.91, samples=119
  lat (nsec)   : 500=0.60%, 750=0.43%, 1000=0.27%
  lat (usec)   : 2=0.12%, 4=0.01%, 10=0.01%, 20=67.15%, 50=29.03%
  lat (usec)   : 100=0.73%, 250=0.06%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.85%, 20=0.70%, 50=0.04%
  lat (msec)   : 100=0.01%, 500=0.01%
  cpu          : usr=1.62%, sys=2.05%, ctx=366907, majf=0, minf=216
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,303144,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=14.7MiB/s (15.5MB/s), 14.7MiB/s-14.7MiB/s (15.5MB/s-15.5MB/s), io=1184MiB (1242MB), run=80360-80360msec

Disk stats (read/write):
  rbd0: ios=0/243508, merge=0/10298, ticks=0/4813583, in_queue=4813584, util=97.89%
root@DISKTEST:~#
root@DISKTEST:~#



#2 - "SSD_ONLY" pool [3x (1 per host) 400GB enterprise SSD]

Code:
root@DISKTEST:~#
root@DISKTEST:~#
root@DISKTEST:~# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.28
Starting 1 process
Jobs: 1 (f=1): [F(1)][100.0%][eta 00m:00s]                         
random-write: (groupid=0, jobs=1): err= 0: pid=312: Tue Jul  9 13:32:57 2024
  write: IOPS=3636, BW=14.2MiB/s (14.9MB/s)(1164MiB/81937msec); 0 zone resets
    slat (nsec): min=898, max=306294, avg=4091.74, stdev=4187.76
    clat (nsec): min=337, max=39767k, avg=195584.39, stdev=1373822.12
     lat (usec): min=13, max=39770, avg=199.68, stdev=1373.86
    clat percentiles (nsec):
     |  1.00th=[     596],  5.00th=[   13888], 10.00th=[   14400],
     | 20.00th=[   14912], 30.00th=[   15296], 40.00th=[   15808],
     | 50.00th=[   16512], 60.00th=[   17792], 70.00th=[   20864],
     | 80.00th=[   24704], 90.00th=[   29824], 95.00th=[   37120],
     | 99.00th=[ 9240576], 99.50th=[10158080], 99.90th=[13434880],
     | 99.95th=[15269888], 99.99th=[34340864]
   bw (  KiB/s): min= 5744, max=171752, per=100.00%, avg=19925.85, stdev=25597.90, samples=119
   iops        : min= 1436, max=42938, avg=4981.46, stdev=6399.48, samples=119
  lat (nsec)   : 500=0.76%, 750=0.41%, 1000=0.25%
  lat (usec)   : 2=0.08%, 4=0.01%, 10=0.01%, 20=65.86%, 50=30.02%
  lat (usec)   : 100=0.63%, 250=0.07%, 500=0.01%, 750=0.01%, 1000=0.04%
  lat (msec)   : 2=0.14%, 4=0.01%, 10=1.04%, 20=0.65%, 50=0.02%
  cpu          : usr=1.65%, sys=1.91%, ctx=366463, majf=0, minf=176
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,297952,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=14.2MiB/s (14.9MB/s), 14.2MiB/s-14.2MiB/s (14.9MB/s-14.9MB/s), io=1164MiB (1220MB), run=81937-81937msec

Disk stats (read/write):
  rbd0: ios=554/239847, merge=0/74, ticks=442/5126951, in_queue=5127394, util=97.89%
root@DISKTEST:~#
root@DISKTEST:~#


#3 - "local-zfs" [Local 2x 400GB enterprise SSD ZFS pool]

Code:
root@DISKTEST:~#
root@DISKTEST:~#
root@DISKTEST:~# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.28
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=31.4MiB/s][w=8043 IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=313: Tue Jul  9 13:58:13 2024
  write: IOPS=9585, BW=37.4MiB/s (39.3MB/s)(2250MiB/60096msec); 0 zone resets
    slat (nsec): min=1088, max=251748, avg=2316.81, stdev=2564.04
    clat (nsec): min=640, max=114355k, avg=100531.36, stdev=1723836.72
     lat (usec): min=16, max=114359, avg=102.85, stdev=1723.83
    clat percentiles (usec):
     |  1.00th=[   21],  5.00th=[   29], 10.00th=[   54], 20.00th=[   58],
     | 30.00th=[   61], 40.00th=[   63], 50.00th=[   65], 60.00th=[   69],
     | 70.00th=[   73], 80.00th=[   77], 90.00th=[   85], 95.00th=[   92],
     | 99.00th=[  113], 99.50th=[  125], 99.90th=[  322], 99.95th=[  578],
     | 99.99th=[93848]
   bw (  KiB/s): min=29560, max=41616, per=100.00%, avg=38402.58, stdev=2452.65, samples=120
   iops        : min= 7390, max=10404, avg=9600.64, stdev=613.17, samples=120
  lat (nsec)   : 750=0.01%, 1000=0.04%
  lat (usec)   : 2=0.01%, 4=0.01%, 20=0.52%, 50=6.40%, 100=90.65%
  lat (usec)   : 250=2.25%, 500=0.07%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 100=0.03%, 250=0.01%
  cpu          : usr=4.52%, sys=3.35%, ctx=583281, majf=17, minf=230
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,576049,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=37.4MiB/s (39.3MB/s), 37.4MiB/s-37.4MiB/s (39.3MB/s-39.3MB/s), io=2250MiB (2359MB), run=60096-60096msec
root@DISKTEST:~#
root@DISKTEST:~#
 
These days it's no longer best practice to separate those services
That's not accurate. While it's true that under NORMAL circumstances a separate private network pipe doesn't really matter as long as the public interface has sufficient bandwidth, this changes with a rebalance storm, and that DOES happen. The Ceph documentation suggests that a separate interface adds complexity, but it doesn't say outright not to use one. Having said that, with only three nodes it serves very little purpose, since rebalancing across nodes is impossible.

The fact that the ZFS-based test gives much better performance makes me think I might have some kind of Ceph performance issue. What do you all think?
There are two things to consider here:
1. Ceph is not a local file system; you have a lot more potential bottlenecks. In your case there are a BUNCH - I'll point out two that I see right off the bat:
- You have 18 OSDs per node with a total of 16 cores each. If your setup is hyperconverged, your OSDs will fight your VMs for CPU.
- LXC is not true isolation; it controls processing by limiting CPU sets. Try increasing the "cpus" for your containers to have a better chance that I/O completes without having to hop nodes, etc.

2. Ceph really wants (and benefits from) scale; three nodes will never be as performant as, say, 10 nodes. Also, in your setup I'd probably set up two pools (HDD and SSD) instead of using DB/WAL devices (see the command sketch below) - you have too much SSD for the pool, which could be better utilized.
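
If you go that route, the usual approach is one CRUSH rule per device class, with each pool pointed at the matching rule; the pool names below are just the ones from this thread:

Code:
# Replicated CRUSH rules restricted to one device class each
ceph osd crush rule create-replicated replicated_hdd default host hdd
ceph osd crush rule create-replicated replicated_ssd default host ssd
# Assign the pools to the matching rule (data moves once the rule changes)
ceph osd pool set VMs crush_rule replicated_hdd
ceph osd pool set SSD_ONLY crush_rule replicated_ssd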
 
