Ceph + LXC + SSD abnormal performance

majorchen

New Member
May 27, 2019
I was deploying a small cluster and found that write performance inside LXC is very poor. My network is only 1G, but it should not be this bad. Is there anywhere I should check?


my config:
nodes: 4
CPU: i5-8250U
RAM: 32G
HDD1: 128G SSD
HDD2: 480G SSD

pve version:
Code:
root@tpa-pve1:~# pveversion
pve-manager/5.4-3/0a6eaa62 (running kernel: 4.15.18-12-pve)

ceph config:
Code:
[global]
     auth client required = cephx
     auth cluster required = cephx
     auth service required = cephx
     cluster network = 172.16.1.0/24
     fsid = c244e68c-82e2-4bca-a66c-f3eab7d589d5
     keyring = /etc/pve/priv/$cluster.$name.keyring
     mon allow pool delete = true
     osd journal size = 5120
     osd pool default min size = 2
     osd pool default size = 3
     public network = 172.16.1.0/24

[mds]
     keyring = /var/lib/ceph/mds/ceph-$id/keyring

[osd]
     keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mds.tpa-pve4]
     host = tpa-pve4
     mds standby for name = pve

[mds.tpa-pve2]
     host = tpa-pve2
     mds standby for name = pve

[mon.tpa-pve2]
     host = tpa-pve2
     mon addr = 172.16.1.136:6789

[mon.tpa-pve1]
     host = tpa-pve1
     mon addr = 172.16.1.135:6789

[mon.tpa-pve3]
     host = tpa-pve3
     mon addr = 172.16.1.137:6789

osd:
Code:
root@tpa-pve4:~# ceph osd tree
ID CLASS WEIGHT  TYPE NAME         STATUS REWEIGHT PRI-AFF
-1       1.74640 root default                             
-3       0.43660     host tpa-pve1                         
 0   ssd 0.43660         osd.0         up  1.00000 1.00000
-5       0.43660     host tpa-pve2                         
 1   ssd 0.43660         osd.1         up  1.00000 1.00000
-7       0.43660     host tpa-pve3                         
 2   ssd 0.43660         osd.2         up  1.00000 1.00000
-9       0.43660     host tpa-pve4                         
 3   ssd 0.43660         osd.3         up  1.00000 1.00000


lxc fio test:
Code:
root@sysbench:~# fio -filename=./test -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=psync -bs=4k -size=1G -numjobs=50 -runtime=180 -group_reporting -name=rand_rw_4k
rand_rw_4k: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.8
Starting 50 threads
Jobs: 50 (f=50): [w(50)][100.0%][r=0KiB/s,w=832KiB/s][r=0,w=208 IOPS][eta 00m:00s]
rand_rw_4k: (groupid=0, jobs=50): err= 0: pid=539: Thu Aug  8 12:55:52 2019
  write: IOPS=208, BW=836KiB/s (856kB/s)(147MiB/180237msec)
    clat (msec): min=3, max=6335, avg=239.18, stdev=928.64
     lat (msec): min=3, max=6335, avg=239.18, stdev=928.64
    clat percentiles (msec):
     |  1.00th=[    4],  5.00th=[    4], 10.00th=[    4], 20.00th=[    5],
     | 30.00th=[    5], 40.00th=[    5], 50.00th=[    6], 60.00th=[    6],
     | 70.00th=[    6], 80.00th=[    6], 90.00th=[    6], 95.00th=[ 3037],
     | 99.00th=[ 4530], 99.50th=[ 4799], 99.90th=[ 5537], 99.95th=[ 5738],
     | 99.99th=[ 6208]
   bw (  KiB/s): min=    7, max=  864, per=13.54%, avg=113.10, stdev=152.59, samples=2658
   iops        : min=    1, max=  216, avg=28.22, stdev=38.15, samples=2658
  lat (msec)   : 4=19.59%, 10=73.99%, 20=0.07%, 50=0.03%, 100=0.01%
  lat (msec)   : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.02%
  cpu          : usr=0.00%, sys=0.03%, ctx=75420, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,37652,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=836KiB/s (856kB/s), 836KiB/s-836KiB/s (856kB/s-856kB/s), io=147MiB (154MB), run=180237-180237msec

Disk stats (read/write):
  rbd0: ios=0/37757, merge=0/43, ticks=0/178104, in_queue=178080, util=98.52%
root@sysbench:~# fio -filename=./test -direct=1 -iodepth 1 -thread -rw=randread -ioengine=psync -bs=4k -size=1G -numjobs=50 -runtime=180 -group_reporting -name=rand_rw_4k
rand_rw_4k: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.8
Starting 50 threads
Jobs: 50 (f=50): [r(50)][100.0%][r=153MiB/s,w=0KiB/s][r=39.1k,w=0 IOPS][eta 00m:00s]
rand_rw_4k: (groupid=0, jobs=50): err= 0: pid=600: Thu Aug  8 13:00:07 2019
   read: IOPS=39.0k, BW=152MiB/s (160MB/s)(26.8GiB/180002msec)
    clat (usec): min=84, max=17139, avg=1279.31, stdev=769.22
     lat (usec): min=84, max=17139, avg=1279.46, stdev=769.22
    clat percentiles (usec):
     |  1.00th=[  117],  5.00th=[  133], 10.00th=[  149], 20.00th=[  190],
     | 30.00th=[  338], 40.00th=[ 1647], 50.00th=[ 1713], 60.00th=[ 1762],
     | 70.00th=[ 1811], 80.00th=[ 1844], 90.00th=[ 1909], 95.00th=[ 1958],
     | 99.00th=[ 2089], 99.50th=[ 2474], 99.90th=[ 4490], 99.95th=[ 5342],
     | 99.99th=[ 6587]
   bw (  KiB/s): min= 2664, max= 3592, per=2.00%, avg=3122.04, stdev=97.54, samples=17984
   iops        : min=  666, max=  898, avg=780.49, stdev=24.39, samples=17984
  lat (usec)   : 100=0.02%, 250=26.85%, 500=4.37%, 750=0.57%, 1000=0.54%
  lat (msec)   : 2=65.09%, 4=2.43%, 10=0.13%, 20=0.01%
  cpu          : usr=0.25%, sys=0.66%, ctx=7026106, majf=0, minf=50
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=7025489,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=152MiB/s (160MB/s), 152MiB/s-152MiB/s (160MB/s-160MB/s), io=26.8GiB (28.8GB), run=180002-180002msec

Disk stats (read/write):
  rbd0: ios=7020502/50, merge=0/2, ticks=8921860/80, in_queue=8986844, util=100.00%


node ssd fio test:
Code:
root@tpa-pve4:~# fio -filename=/dev/sdb -direct=1 -iodepth 1 -thread -rw=randread -ioengine=psync -bs=4k -size=1G -numjobs=50 -runtime=180 -group_reporting -name=rand_100read_4k
rand_100read_4k: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
...
fio-2.16
Starting 50 threads
Jobs: 50 (f=50): [r(50)] [99.3% done] [359.2MB/0KB/0KB /s] [91.1K/0/0 iops] [eta 00m:01s]
rand_100read_4k: (groupid=0, jobs=50): err= 0: pid=6307: Thu Aug  8 14:30:01 2019
  read : io=51200MB, bw=367377KB/s, iops=91844, runt=142711msec
    clat (usec): min=27, max=9607, avg=543.47, stdev=134.12
     lat (usec): min=27, max=9608, avg=543.51, stdev=134.12
    clat percentiles (usec):
     |  1.00th=[  370],  5.00th=[  382], 10.00th=[  402], 20.00th=[  434],
     | 30.00th=[  466], 40.00th=[  498], 50.00th=[  532], 60.00th=[  564],
     | 70.00th=[  596], 80.00th=[  628], 90.00th=[  684], 95.00th=[  740],
     | 99.00th=[ 1048], 99.50th=[ 1192], 99.90th=[ 1432], 99.95th=[ 1528],
     | 99.99th=[ 1704]
    lat (usec) : 50=0.01%, 250=0.04%, 750=58.75%
    lat (usec) : 1000=3.14%
    lat (msec) : 2=1.33%, 10=0.01%
  cpu          : usr=0.27%, ctx=13107569, majf=13, minf=84
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=13107200/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=51200MB, aggrb=367377KB/s, minb=367377KB/s, maxb=367377KB/s, mint=142711msec, maxt=142711msec

Disk stats (read/write):
  sdb: ios=13082515/0, merge=2776/0, ticks=7032800/0, in_queue=7032400, util=99.93%
root@tpa-pve4:~# fio -filename=/dev/sdb -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=psync -bs=4k -size=1G -numjobs=50 -runtime=180 -group_reporting -name=rand_100write_4k
rand_100write_4k: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
...
fio-2.16
Starting 50 threads
Jobs: 50 (f=50): [w(50)] [100.0% done] [0KB/283.3MB/0KB /s] [0/72.6K/0 iops] [eta 00m:00s]
rand_100write_4k: (groupid=0, jobs=50): err= 0: pid=7741: Thu Aug  8 14:34:36 2019
  write: io=44267MB, bw=251831KB/s, iops=62957, runt=180001msec
    clat (usec): min=33, max=217973, avg=793.36, stdev=2056.59
     lat (usec): min=33, max=217973, avg=793.54, stdev=2056.60
    clat percentiles (usec):
     |  1.00th=[  454],  5.00th=[  474], 10.00th=[  498], 20.00th=[  540],
     | 30.00th=[  580], 40.00th=[  628], 50.00th=[  668], 60.00th=[  708],
     | 70.00th=[  756], 80.00th=[  804], 90.00th=[  876], 95.00th=[  996],
     | 99.00th=[ 1736], 99.50th=[ 4512], 99.90th=[19072], 99.95th=[19584],
     | 99.99th=[58624]
    lat (usec) : 50=0.01%, 250=0.01%, 750=58.75%
    lat (usec) : 1000=25.76%
    lat (msec) : 2=4.20%, 10=0.06%, 50=0.02%
    lat (msec) : 100=0.01%
  cpu          : usr=0.17%, ctx=11333091, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=11332443/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=44267MB, aggrb=251830KB/s, minb=251830KB/s, maxb=251830KB/s, mint=180001msec, maxt=180001msec

Disk stats (read/write):
  sdb: ios=76/11329075, merge=0/1963, ticks=0/8910684, in_queue=8919712, util=100.00%
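
To narrow down where the latency is going, it could also help to benchmark the Ceph layer directly, bypassing RBD and the container. A minimal sketch, assuming a throwaway pool named "bench" (the pool name is hypothetical), run from any node:
Code:
# create a small test pool, write 4k objects for 60s, then clean up
ceph osd pool create bench 64 64
rados bench -p bench 60 write -b 4096 -t 16 --no-cleanup
rados -p bench cleanup
# per-OSD commit/apply latency, useful while a benchmark is running
ceph osd perf
If rados bench shows the same few-hundred-IOPS behaviour, the bottleneck is below RBD/LXC, i.e. in the OSDs, their disks, or the network.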
 
What type of SSDs?

Ceph works best with data-center-grade, very low latency SSDs.

I think there is some documentation on this under PVE --> Help, and in plenty of other places.

Ceph works phenomenally well, however the hardware needs to be suited for a cluster. We learned that the hard way.

There is plenty of advice to follow in other threads and in the PVE documentation.
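
One common way to check whether an SSD is suited for Ceph is to benchmark it with synchronous 4k writes, since the OSD forces every write to stable storage. A minimal fio sketch (replace /dev/sdX with an unused disk or a test file; writing to a raw device destroys its data):
Code:
# 4k random writes with O_SYNC at queue depth 1 -- roughly the pattern an OSD journal/WAL produces
fio --name=sync-4k-write --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based
Data-center SSDs with power-loss protection typically sustain tens of thousands of IOPS in this test, while many consumer drives drop to a few hundred, which would line up with the ~208 write IOPS seen inside the LXC container above.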
 
It's just a consumer-grade SSD (Kingston 480G mSATA), and since this is only a test environment I didn't pay much attention to the hardware requirements. Reading at 150 MB/s is normal, but writing at only 800 KB/s is not; even an ordinary HDD is not that slow.
 
If the test environment doesn't pay much attention to the hardware, then why pay attention to the bad result?

For a good result, use good hardware. Consumer-grade SSDs for Ceph or ZFS will usually cause problems.

Get some used Intel DC P3700 SSDs, a 10G network for Ceph, and a separate network for the rest.

Good luck with your experiments!
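
Network latency also adds to every replicated write: with size=3 the client is only acknowledged once the primary OSD and both replicas have committed. A quick sketch for checking the baseline of the existing 1G cluster network (iperf3 is assumed to be installed; the address is the tpa-pve2 cluster IP from the config above):
Code:
# round-trip latency on the ceph network
ping -c 20 172.16.1.136
# raw TCP throughput; run 'iperf3 -s' on tpa-pve2 first
iperf3 -c 172.16.1.136 -t 10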
 
Thank you for your answer, but there isn't much money for the test environment, haha! Writing at 50 MB/s ~ 60 MB/s on redundant shared storage would be enough for me, but I didn't expect it to be this low.
 
