Ceph + LXC + SSD abnormal performance

majorchen

New Member
May 27, 2019
I was deploying a small setup and found that write performance from inside an LXC container is very poor. My network is only 1G, but it shouldn't be this bad. Is there anything I can check?


My config:
Nodes: 4
CPU: i5-8250U
RAM: 32G
Disk 1: 128G SSD
Disk 2: 480G SSD

PVE version:
Code:
root@tpa-pve1:~# pveversion
pve-manager/5.4-3/0a6eaa62 (running kernel: 4.15.18-12-pve)

ceph config:
Code:
[global]
     auth client required = cephx
     auth cluster required = cephx
     auth service required = cephx
     cluster network = 172.16.1.0/24
     fsid = c244e68c-82e2-4bca-a66c-f3eab7d589d5
     keyring = /etc/pve/priv/$cluster.$name.keyring
     mon allow pool delete = true
     osd journal size = 5120
     osd pool default min size = 2
     osd pool default size = 3
     public network = 172.16.1.0/24

[mds]
     keyring = /var/lib/ceph/mds/ceph-$id/keyring

[osd]
     keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mds.tpa-pve4]
     host = tpa-pve4
     mds standby for name = pve

[mds.tpa-pve2]
     host = tpa-pve2
     mds standby for name = pve

[mon.tpa-pve2]
     host = tpa-pve2
     mon addr = 172.16.1.136:6789

[mon.tpa-pve1]
     host = tpa-pve1
     mon addr = 172.16.1.135:6789

[mon.tpa-pve3]
     host = tpa-pve3
     mon addr = 172.16.1.137:6789

OSD tree:
Code:
root@tpa-pve4:~# ceph osd tree
ID CLASS WEIGHT  TYPE NAME         STATUS REWEIGHT PRI-AFF
-1       1.74640 root default                             
-3       0.43660     host tpa-pve1                         
 0   ssd 0.43660         osd.0         up  1.00000 1.00000
-5       0.43660     host tpa-pve2                         
 1   ssd 0.43660         osd.1         up  1.00000 1.00000
-7       0.43660     host tpa-pve3                         
 2   ssd 0.43660         osd.2         up  1.00000 1.00000
-9       0.43660     host tpa-pve4                         
 3   ssd 0.43660         osd.3         up  1.00000 1.00000
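
Would it also make sense to check the OSDs directly? For example (osd.0 is just an example id):
Code:
# rough write benchmark inside one OSD (writes 1 GiB of test data by default)
ceph tell osd.0 bench
# commit/apply latency currently reported by each OSD
ceph osd perf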


LXC fio test:
Code:
root@sysbench:~# fio -filename=./test -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=psync -bs=4k -size=1G -numjobs=50 -runtime=180 -group_reporting -name=rand_rw_4k
rand_rw_4k: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.8
Starting 50 threads
Jobs: 50 (f=50): [w(50)][100.0%][r=0KiB/s,w=832KiB/s][r=0,w=208 IOPS][eta 00m:00s]
rand_rw_4k: (groupid=0, jobs=50): err= 0: pid=539: Thu Aug  8 12:55:52 2019
  write: IOPS=208, BW=836KiB/s (856kB/s)(147MiB/180237msec)
    clat (msec): min=3, max=6335, avg=239.18, stdev=928.64
     lat (msec): min=3, max=6335, avg=239.18, stdev=928.64
    clat percentiles (msec):
     |  1.00th=[    4],  5.00th=[    4], 10.00th=[    4], 20.00th=[    5],
     | 30.00th=[    5], 40.00th=[    5], 50.00th=[    6], 60.00th=[    6],
     | 70.00th=[    6], 80.00th=[    6], 90.00th=[    6], 95.00th=[ 3037],
     | 99.00th=[ 4530], 99.50th=[ 4799], 99.90th=[ 5537], 99.95th=[ 5738],
     | 99.99th=[ 6208]
   bw (  KiB/s): min=    7, max=  864, per=13.54%, avg=113.10, stdev=152.59, samples=2658
   iops        : min=    1, max=  216, avg=28.22, stdev=38.15, samples=2658
  lat (msec)   : 4=19.59%, 10=73.99%, 20=0.07%, 50=0.03%, 100=0.01%
  lat (msec)   : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.02%
  cpu          : usr=0.00%, sys=0.03%, ctx=75420, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,37652,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=836KiB/s (856kB/s), 836KiB/s-836KiB/s (856kB/s-856kB/s), io=147MiB (154MB), run=180237-180237msec

Disk stats (read/write):
  rbd0: ios=0/37757, merge=0/43, ticks=0/178104, in_queue=178080, util=98.52%
root@sysbench:~# fio -filename=./test -direct=1 -iodepth 1 -thread -rw=randread -ioengine=psync -bs=4k -size=1G -numjobs=50 -runtime=180 -group_reporting -name=rand_rw_4k
rand_rw_4k: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.8
Starting 50 threads
Jobs: 50 (f=50): [r(50)][100.0%][r=153MiB/s,w=0KiB/s][r=39.1k,w=0 IOPS][eta 00m:00s]
rand_rw_4k: (groupid=0, jobs=50): err= 0: pid=600: Thu Aug  8 13:00:07 2019
   read: IOPS=39.0k, BW=152MiB/s (160MB/s)(26.8GiB/180002msec)
    clat (usec): min=84, max=17139, avg=1279.31, stdev=769.22
     lat (usec): min=84, max=17139, avg=1279.46, stdev=769.22
    clat percentiles (usec):
     |  1.00th=[  117],  5.00th=[  133], 10.00th=[  149], 20.00th=[  190],
     | 30.00th=[  338], 40.00th=[ 1647], 50.00th=[ 1713], 60.00th=[ 1762],
     | 70.00th=[ 1811], 80.00th=[ 1844], 90.00th=[ 1909], 95.00th=[ 1958],
     | 99.00th=[ 2089], 99.50th=[ 2474], 99.90th=[ 4490], 99.95th=[ 5342],
     | 99.99th=[ 6587]
   bw (  KiB/s): min= 2664, max= 3592, per=2.00%, avg=3122.04, stdev=97.54, samples=17984
   iops        : min=  666, max=  898, avg=780.49, stdev=24.39, samples=17984
  lat (usec)   : 100=0.02%, 250=26.85%, 500=4.37%, 750=0.57%, 1000=0.54%
  lat (msec)   : 2=65.09%, 4=2.43%, 10=0.13%, 20=0.01%
  cpu          : usr=0.25%, sys=0.66%, ctx=7026106, majf=0, minf=50
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=7025489,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=152MiB/s (160MB/s), 152MiB/s-152MiB/s (160MB/s-160MB/s), io=26.8GiB (28.8GB), run=180002-180002msec

Disk stats (read/write):
  rbd0: ios=7020502/50, merge=0/2, ticks=8921860/80, in_queue=8986844, util=100.00%
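
To rule out the LXC/krbd layer, I could probably also benchmark the pool directly from a node with rados bench, something like this (<pool> is just a placeholder for the actual pool name):
Code:
# write 4K objects straight into the pool for 60s, keep them for the read test
rados bench -p <pool> 60 write -b 4096 -t 16 --no-cleanup
# random reads of the objects written above
rados bench -p <pool> 60 rand -t 16
# remove the benchmark objects afterwards
rados -p <pool> cleanup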


Node SSD fio test:
Code:
root@tpa-pve4:~# fio -filename=/dev/sdb -direct=1 -iodepth 1 -thread -rw=randread -ioengine=psync -bs=4k -size=1G -numjobs=50 -runtime=180 -group_reporting -name=rand_100read_4k
rand_100read_4k: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
...
fio-2.16
Starting 50 threads
Jobs: 50 (f=50): [r(50)] [99.3% done] [359.2MB/0KB/0KB /s] [91.1K/0/0 iops] [eta 00m:01s]
rand_100read_4k: (groupid=0, jobs=50): err= 0: pid=6307: Thu Aug  8 14:30:01 2019
  read : io=51200MB, bw=367377KB/s, iops=91844, runt=142711msec
    clat (usec): min=27, max=9607, avg=543.47, stdev=134.12
     lat (usec): min=27, max=9608, avg=543.51, stdev=134.12
    clat percentiles (usec):
     |  1.00th=[  370],  5.00th=[  382], 10.00th=[  402], 20.00th=[  434],
     | 30.00th=[  466], 40.00th=[  498], 50.00th=[  532], 60.00th=[  564],
     | 70.00th=[  596], 80.00th=[  628], 90.00th=[  684], 95.00th=[  740],
     | 99.00th=[ 1048], 99.50th=[ 1192], 99.90th=[ 1432], 99.95th=[ 1528],
     | 99.99th=[ 1704]
    lat (usec) : 50=0.01%, 250=0.04%, 750=54.53%
    lat (usec) : 1000=3.14%
    lat (msec) : 2=1.33%, 10=0.01%
  cpu          : usr=0.27%, ctx=13107569, majf=13, minf=84
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=13107200/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=51200MB, aggrb=367377KB/s, minb=367377KB/s, maxb=367377KB/s, mint=142711msec, maxt=142711msec

Disk stats (read/write):
  sdb: ios=13082515/0, merge=2776/0, ticks=7032800/0, in_queue=7032400, util=99.93%
root@tpa-pve4:~# fio -filename=/dev/sdb -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=psync -bs=4k -size=1G -numjobs=50 -runtime=180 -group_reporting -name=rand_100write_4k
rand_100write_4k: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
...
fio-2.16
Starting 50 threads
Jobs: 50 (f=50): [w(50)] [100.0% done] [0KB/283.3MB/0KB /s] [0/72.6K/0 iops] [eta 00m:00s]
rand_100write_4k: (groupid=0, jobs=50): err= 0: pid=7741: Thu Aug  8 14:34:36 2019
  write: io=44267MB, bw=251831KB/s, iops=62957, runt=180001msec
    clat (usec): min=33, max=217973, avg=793.36, stdev=2056.59
     lat (usec): min=33, max=217973, avg=793.54, stdev=2056.60
    clat percentiles (usec):
     |  1.00th=[  454],  5.00th=[  474], 10.00th=[  498], 20.00th=[  540],
     | 30.00th=[  580], 40.00th=[  628], 50.00th=[  668], 60.00th=[  708],
     | 70.00th=[  756], 80.00th=[  804], 90.00th=[  876], 95.00th=[  996],
     | 99.00th=[ 1736], 99.50th=[ 4512], 99.90th=[19072], 99.95th=[19584],
     | 99.99th=[58624]
    lat (usec) : 50=0.01%, 250=0.01%, 750=58.75%
    lat (usec) : 1000=25.76%
    lat (msec) : 2=4.20%, 10=0.06%, 50=0.02%
    lat (msec) : 100=0.01%
  cpu          : usr=0.17%, ctx=11333091, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=11332443/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=44267MB, aggrb=251830KB/s, minb=251830KB/s, maxb=251830KB/s, mint=180001msec, maxt=180001msec

Disk stats (read/write):
  sdb: ios=76/11329075, merge=0/1963, ticks=0/8910684, in_queue=8919712, util=100.00%
 
What type of SSDs are they?

Ceph works best with datacenter-grade, very low latency SSDs.

I think there is some documentation on this under PVE --> Help, and in plenty of other places.

Ceph works phenomenally well, but the hardware needs to be suited for a cluster. We learned that the hard way.

There is plenty of advice to follow in other threads and in the PVE documentation.
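
The difference usually shows up with synchronous 4k writes, which is roughly the write pattern Ceph relies on for its journal/WAL. A quick way to see it on a spare disk or test file, for example (the path is a placeholder; consumer SSDs without power-loss protection typically do far worse here than datacenter SSDs):
Code:
# single-threaded 4k sync write test, QD1
fio -filename=/path/to/testfile -size=1G -direct=1 -sync=1 -rw=write -bs=4k -iodepth 1 -numjobs=1 -runtime=60 -time_based -group_reporting -name=sync_write_4k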
 
It's just a consumer-grade SSD (Kingston 480G mSATA). It's only a test environment, so I didn't pay much attention to the hardware requirements. Reading at 150 MB/s is normal, but writing at only ~800 KB/s is not; even an ordinary HDD isn't that slow.
 
If the test environment doesn't pay much attention to the hardware, then why pay attention to the bad result?

For a good result, use good hardware. Consumer-grade SSDs for Ceph or ZFS will usually cause problems.

Get some used Intel DC P3700 SSDs, a 10G network for Ceph, and a separate network for everything else.

Good luck with your experiments!
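
With a dedicated cluster network, the relevant part of ceph.conf would look roughly like this (the 10.10.10.0/24 subnet is just an example for the 10G link):
Code:
[global]
     # clients and monitors stay on the existing LAN
     public network = 172.16.1.0/24
     # OSD replication/heartbeat traffic moves to the dedicated 10G link
     cluster network = 10.10.10.0/24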
 
Thank you for your answer, but there isn't that much money for a test environment, haha! Writing at 50-60 MB/s on redundant shared storage would be enough for me; I just didn't expect it to be this low.