[SOLVED] ceph atrocious performance inside LXC/VM

Eddy Buhler

Member
Jan 17, 2019
I have set up a three-node Proxmox cluster with ceph and a pool (size 3, min_size 2) for testing. The OSDs are two NVMe disks in each server with a capacity of 3.5 TiB each. ceph has one 10G network connection each for the public and the private network; the public network is also the LAN bridge for my VMs.

Code:
# hdparm -tT --direct /dev/nvme1n1                                                                                                        
                                                                                                                                                           
/dev/nvme1n1:                                                                                                                                              
 Timing O_DIRECT cached reads:   5000 MB in  2.00 seconds = 2500.62 MB/sec                                                                                  
 Timing O_DIRECT disk reads: 7936 MB in  3.00 seconds = 2645.14 MB/sec

The OSDs deliver, too:
Code:
# ceph tell osd.0 bench                                                                                                                    
{                                                                                                                                                          
    "bytes_written": 1073741824,                                                                                                                            
    "blocksize": 4194304,                                                                                                                                  
    "elapsed_sec": 0.30341430800000002,
    "bytes_per_sec": 3538863513.3185611,
    "iops": 843.73081047977473
}

I created pools with various PG settings, and performance-wise ended up sticking with size 3, min_size 2, autoscale off, PGs 128/128 with the default replicated_rule.

The rados benchmarks I ran gave me the following throughput (10G network saturated in all non-4K-blocksize tests, as expected):

Code:
rados bench -p bench 120 write --no-cleanup -> 1092 MiB/s
rados bench -p bench 60 seq  -> 1529 MiB/s
rados bench -p bench 60 rand -> 1524 MiB/s


I then cleaned the pool before testing 4KB blocks.

Code:
rados -p bench cleanup

And measured with 4K blocks (network never saturates here):

Code:
rados bench -p bench 120 write --no-cleanup -b 4K -> 65 MiB/s
rados bench -p bench 60 seq  -> 258 MiB/s
rados bench -p bench 60 rand -> 271 MiB/s

So far, so good. Next I created both an Ubuntu 20.04 LXC container and an Ubuntu 20.04 VM; neither has a network configured. The root disks were created on the ceph RBD pool with default settings. Inside both, I ran simple dd commands:

LXC

Code:
# dd if=/dev/zero of=test.data bs=4M count=1000 oflag=direct                                                                                  
1000+0 records in                                                                                                                                          
1000+0 records out                                                                                                                                          
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 15.7207 s, 267 MB/s


# dd if=/dev/zero of=test.data bs=4K count=100000 oflag=direct                                                                                  
100000+0 records in                                                                                                                                        
100000+0 records out                                                                                                                                        
409600000 bytes (410 MB, 391 MiB) copied, 51.1674 s, 8.0 MB/s

VM
[screenshot: dd output from inside the VM]

For one, I'm not sure why the VM gets better 4M write performance than the LXC, but more importantly: these values are a long, long throw away from the rados bench values my ceph setup produced before, and the 4K blocksize performance in particular is terrible. None of the dd runs saturated either of the network connections, so I'm guessing it's not ceph itself that's the bottleneck here.

What can I do to improve especially small file performance inside my virtual machines/containers? How am I getting only about 1/10th of the rados bench performance from inside them at 4K blocksize?
 

Eddy Buhler

Well, I'd expect dd to show inflated values if things were cached, but in this case I am seeing much too low throughput, and I don't see how dd might be "obscurely slow" here.


Anyway, here you go, this is from inside an LXC container:

Code:
# fio --name=test.data --ioengine=libaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1 --rwmixwrite 100 --direct=1

test.data: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.16
clock setaffinity failed: Invalid argument
clock setaffinity failed: Invalid argument
clock setaffinity failed: Invalid argument
clock setaffinity failed: Invalid argument
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=6154KiB/s][w=1538 IOPS][eta 00m:00s]
test.data: (groupid=0, jobs=1): err= 0: pid=943: Tue Aug 31 12:45:01 2021
  write: IOPS=1485, BW=5940KiB/s (6083kB/s)(348MiB/60002msec); 0 zone resets
    slat (usec): min=2, max=128, avg= 6.80, stdev= 2.30
    clat (usec): min=469, max=5404, avg=664.47, stdev=133.69
     lat (usec): min=474, max=5532, avg=671.47, stdev=133.84
    clat percentiles (usec):
     |  1.00th=[  537],  5.00th=[  570], 10.00th=[  586], 20.00th=[  611],
     | 30.00th=[  627], 40.00th=[  644], 50.00th=[  660], 60.00th=[  668],
     | 70.00th=[  685], 80.00th=[  693], 90.00th=[  717], 95.00th=[  734],
     | 99.00th=[ 1631], 99.50th=[ 1860], 99.90th=[ 1991], 99.95th=[ 2024],
     | 99.99th=[ 2180]
   bw (  KiB/s): min= 4984, max= 6248, per=99.99%, avg=5939.26, stdev=127.74, samples=119
   iops        : min= 1246, max= 1562, avg=1484.82, stdev=31.93, samples=119
  lat (usec)   : 500=0.05%, 750=97.46%, 1000=1.42%
  lat (msec)   : 2=0.99%, 4=0.08%, 10=0.01%
  cpu          : usr=0.68%, sys=1.62%, ctx=89411, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,89103,0,1 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=5940KiB/s (6083kB/s), 5940KiB/s-5940KiB/s (6083kB/s-6083kB/s), io=348MiB (365MB), run=60002-60002msec

Disk stats (read/write):
  rbd2: ios=0/88964, merge=0/11, ticks=0/58230, in_queue=58230, util=99.81%
 

Eddy Buhler

Sure:

Code:
# smartctl -a /dev/nvme0n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.11.22-3-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZQL23T8HCLS-00A07
Serial Number:                      S64HNE0R503229
Firmware Version:                   GDC5302Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 3,840,755,982,336 [3.84 TB]

They are advertised as "datacenter edition SSDs", so I expect they do have power loss protection.

Since performance is terrible in LXC, too, I think we can set aside the particular VM disk config for now. The only thing I set for my LXC disk was "noatime", which should speed things up, not slow them down.
 

spirit

Famous Member
Apr 2, 2010
www.odiso.com
With iodepth=1 in a VM, I'm able to reach around 1000 IOPS randwrite 4K and 5000 IOPS randread 4K (with low-latency switches, 3 GHz CPUs on the hypervisor and ceph nodes, NVMe datacenter drives, and replication).

This is because of latency: network latency (client->osd, plus osd->osd for the replication) and CPU latency on client and server.

(Writes are slower because of the replication, and they use more CPU cycles too, so figure around 1 ms per I/O.)


Note that rados bench does something like iodepth=16 or 32.


Whether you do 4K or 4M I/Os, each operation costs almost the same CPU; that's why you can reach big throughput (at low IOPS) with the 4M bench.
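As a back-of-the-envelope check (a sketch only, plugging in the ~0.67 ms average completion latency from the fio run posted earlier in this thread):

```shell
# With one outstanding I/O, throughput is bounded by 1/latency.
# lat_ms is taken from the fio clat above; nothing here is measured.
awk 'BEGIN {
  lat_ms = 0.67                 # avg completion latency per 4K write
  iops   = 1000 / lat_ms        # ops per second at iodepth=1
  mibps  = iops * 4 / 1024      # 4 KiB per op -> MiB/s
  printf "%.0f IOPS, %.1f MiB/s\n", iops, mibps
}'
```

That lands right on the ~5.8 MiB/s (5940 KiB/s) the fio run reported, so the numbers are consistent with latency alone being the limit.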
 

Eddy Buhler

Okay, so if I have an application that queues up several write requests (iodepth=16 would queue up 16 4K blocks for writing at once), I get the higher speed - at iodepth=16, I do get ~80 MB/s from inside LXC.

But a single-threaded app writing 4K blocks one at a time (such as dd with bs=4K) would still crawl along slowly because of latency? (But I could run 10 of them in parallel and each would probably get near its "full" slow speed?)
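For what it's worth, something like this untested sketch (file names arbitrary) should show whether parallel writers scale:

```shell
# Run 10 dd writers in parallel and collect each one's stats line;
# if latency is the limit, each should still see roughly the
# single-writer speed, so the aggregate should be ~10x.
for i in $(seq 1 10); do
  dd if=/dev/zero of=test.$i.data bs=4K count=1000 oflag=direct 2>&1 \
    | tail -n1 &
done
wait
```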
 

Felix.

Active Member
May 13, 2018
When using VMs, cache=writeback (not the (unsafe) one) could possibly help a little bit.
Also, the VirtIO SCSI single controller and iothread=1 for the disks could help when using multiple disks.
Maybe jumbo frames and some QoS configuration for the ceph traffic can help with the latency.
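On the Proxmox side that would look roughly like this (a sketch; VM ID 100 and the storage/disk names are placeholders, adjust to your setup):

```shell
# Switch the VM to the VirtIO SCSI single controller
qm set 100 --scsihw virtio-scsi-single

# Re-attach the disk with writeback cache and a dedicated iothread
# ("ceph-rbd" and the disk name are placeholders for your storage)
qm set 100 --scsi0 ceph-rbd:vm-100-disk-0,cache=writeback,iothread=1
```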
 

spirit

Hi,
I have done another test on a different cluster, and I'm able to reach 7000 IOPS with 4K writes (and the same for reads).

I hadn't noticed, but my previous test cluster was empty, and the pg autoscaler had reduced pg_num to 32.

On the other cluster (3 nodes with 8 NVMe each), I have pg_num = 512.

So maybe check that too.
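Checking is quick (pool name "bench" taken from earlier in the thread; 512 is just the value on my cluster, not a universal recommendation):

```shell
# See what the autoscaler left you with
ceph osd pool get bench pg_num

# Raise it if it got scaled down; recent ceph releases adjust
# pgp_num along with pg_num automatically
ceph osd pool set bench pg_num 512
```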


qemu bench, virtio-scsi, no iothread, cache=none, no writeback, with sync io.

Code:
fio --filename=/dev/sdb --sync=1 --direct=1 --rw=randwrite --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=test --ioengine=io_uring

test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=1
fio-3.25
Starting 1 process
^Cbs: 1 (f=2): [w(1)][38.3%][w=27.9MiB/s][w=7148 IOPS][eta 00m:37s]
fio: terminating on signal 2

test: (groupid=0, jobs=1): err= 0: pid=56303: Wed Sep  1 08:40:11 2021
  write: IOPS=6668, BW=26.0MiB/s (27.3MB/s)(620MiB/23798msec); 0 zone resets
    slat (usec): min=2, max=277, avg= 9.76, stdev= 1.35
    clat (nsec): min=450, max=2495.6k, avg=138942.02, stdev=19877.33
     lat (usec): min=129, max=2506, avg=148.85, stdev=20.01
    clat percentiles (usec):
     |  1.00th=[  124],  5.00th=[  127], 10.00th=[  128], 20.00th=[  133],
     | 30.00th=[  137], 40.00th=[  139], 50.00th=[  141], 60.00th=[  141],
     | 70.00th=[  141], 80.00th=[  143], 90.00th=[  145], 95.00th=[  151],
     | 99.00th=[  165], 99.50th=[  172], 99.90th=[  258], 99.95th=[  351],
     | 99.99th=[ 1106]
   bw (  KiB/s): min=25240, max=28656, per=99.95%, avg=26661.96, stdev=1003.26, samples=47
   iops        : min= 6310, max= 7164, avg=6665.49, stdev=250.81, samples=47
  lat (nsec)   : 500=0.01%
  lat (usec)   : 100=0.01%, 250=99.89%, 500=0.07%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%
  cpu          : usr=2.62%, sys=3.21%, ctx=317434, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,158706,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=26.0MiB/s (27.3MB/s), 26.0MiB/s-26.0MiB/s (27.3MB/s-27.3MB/s), io=620MiB (650MB), run=23798-23798msec

Disk stats (read/write):
  sdb: ios=104/157607, merge=0/0, ticks=23/22656, in_queue=30998, util=99.71%
 

Eddy Buhler

Member
Jan 17, 2019
16
1
8
47
I did. I created pools with PG settings from 32 all the way up to 1024, and rados gave me the most comfortable performance mix out of them all at 128 PGs.

I also experimented with splitting the drives up into four logical OSDs each, but that led to the ceph cluster occasionally showing degraded PGs, including the occasional PG going into recovery, which I deemed unacceptable, so I went back to single-disk OSDs. I went back down from ceph 16 to 15, but haven't tried splitting the OSDs again - it's time to get this cluster into operation now, so I'm going to stick with what I have.

As mentioned above, if I set iodepth=16 in fio, I get the full 4K write speed that rados shows, too, so apart from improving network latency, there is probably little more to be done. I'm going to set jumbo frames on the ceph network links and see if that helps, but otherwise I'm starting to move the first (and least mission-critical) servers over to the new cluster now.
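For the jumbo-frame change, my plan looks roughly like this (interface name and peer address are placeholders; the switch ports need MTU 9000 as well):

```shell
# Raise the MTU on the ceph-facing interface on every node
ip link set ens1f0 mtu 9000

# Verify end to end with a non-fragmenting ping:
# 9000 bytes minus 28 bytes of IP+ICMP headers = 8972 payload
ping -M do -s 8972 -c 3 10.0.0.2
```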
 

spirit

Great :)
If your workload doesn't use synchronous writes, you can enable writeback in a VM; it helps a lot (small I/Os are aggregated into bigger ones). But I don't know how to enable it in LXC.
 

Eddy Buhler

Member
Jan 17, 2019
16
1
8
47
I'm marking this solved, as the original issue stems mostly from a misconception on my part.
 
