Ceph + LXC poor performance but OK for KVM

jmartinb

Hi,
briefly my setup is as follows:
8 nodes with Proxmox 5.1:
  • 3 nodes for virtualization
  • 5 nodes for Ceph storage, all with monitor+manager, with the following disks:
    • 2 spinners (250 GB) in ZFS RAID1 for the operating system.
    • 4 spinners (1 TB) for data
    • 1 NVMe (256G)
I've tried several setups: bluestore with one OSD on each spinner and the NVMe for journaling, filestore with the same layout, bluestore using the NVMes themselves as OSDs, bluestore with two different crush rules (normal for the spinners and fast for the NVMe), ...
In all configurations I've hit the same problem: the rados benchmarks look correct, but the performance of the LXC containers is really poor compared to KVM. I've also run tests with fio, sysbench and dd, and in every case the LXC containers fall far short of a VM with the same OS.
Also, downloading a big file (~3 GB) from an HTTP server inside the container seems to get stuck every few seconds, while the same download inside a VM completes flawlessly.
I'm pretty new to Ceph and not sure where to begin isolating the problem.
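For anyone wanting to reproduce the layout, a bluestore OSD with its DB on the NVMe can be created roughly like this (a sketch of the Luminous ceph-volume syntax; the device names are placeholders, not necessarily the literal commands used here):
Code:
# one OSD per 1 TB spinner, with its RocksDB/WAL on a partition of the NVMe
# (/dev/sdb and /dev/nvme0n1p1 are placeholders)
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1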
 
5 nodes for Ceph storage, all with monitor+manager, with the following disks:
5 MONs (meant for 1000+ clients/nodes) are not needed for a small cluster. 3 MONs are enough for quorum; any more monitors just add extra communication (latency) and load on those servers.

Test your installation with fio and rados (share the results ;)). Tell us more about your hardware and how you set things up.
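Dropping back to three monitors can be done roughly like this (a sketch assuming the PVE 5.x tooling; <nodename> and <mon-id> are placeholders):
Code:
# for each monitor to be removed (PVE 5.x syntax)
pveceph destroymon <nodename>
# or directly with the Ceph tools
ceph mon remove <mon-id>
# check that the remaining three monitors are in quorum
ceph -s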
 
Thanks Alwin, following your advice I've removed 2 monitors.

All 5 Ceph servers have the same hardware:
  • 4 x Intel(R) Xeon(R) CPU E3-1225 v5 @ 3.30GHz (1 Socket)
  • 8 GB RAM
  • Linux 4.13.16-2-pve #1 SMP PVE 4.13.16-47 (Mon, 9 Apr 2018 09:58:12 +0200), pve-manager/5.1-51/96be5354
  • 2 Seagate/WD 250GB Disks for Proxmox in ZFS RAID1
  • 4 Seagate/WD 1TB Disks for Ceph OSD
  • 1 Samsung NVMe 960 EVO for Ceph OSD or journaling.
I've tried different setups. Right now I have two pools, each using a different crush rule: fast for the NVMes and normal for the HDDs. All the following tests use the fast crush rule.
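For context, these rules are based on Ceph's device classes; creating them looks roughly like this (Luminous syntax, shown for illustration; <pool> is a placeholder):
Code:
# one replicated rule per device class
ceph osd crush rule create-replicated fast default host nvme
ceph osd crush rule create-replicated normal default host hdd
# pools then select a rule, e.g.
ceph osd pool set <pool> crush_rule fast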

Here are the results:

RADOS BENCH
Code:
# ceph osd pool create benchpool 100 100 replicated fast
# rados bench -p benchpool 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_san4_2118499
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       121       105   419.975       420    0.242385    0.141797
    2      16       218       202   403.922       388   0.0210253    0.150809
    3      16       342       326   434.592       496   0.0587976    0.143809
    4      16       450       434   433.935       432   0.0215887    0.142473
    5      16       579       563   450.338       516   0.0250991    0.138846
    6      16       683       667   444.587       416   0.0268286    0.140304
    7      16       797       781   446.208       456    0.215452    0.139732
    8      16       917       901   450.424       480   0.0815878    0.141566
    9      16      1035      1019   452.816       472    0.198583    0.140711
   10      16      1147      1131   452.329       448    0.043896    0.140478
Total time run:         10.182685
Total writes made:      1148
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     450.962
Stddev Bandwidth:       39.5283
Max bandwidth (MB/sec): 516
Min bandwidth (MB/sec): 388
Average IOPS:           112
Stddev IOPS:            9
Max IOPS:               129
Min IOPS:               97
Average Latency(s):     0.14159
Stddev Latency(s):      0.0845846
Max latency(s):         0.464115
Min latency(s):         0.015503

# rados bench -p benchpool 10 seq
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       144       128   511.929       512  0.00663964    0.111327
    2      16       266       250   499.898       488  0.00696572    0.116927
    3      16       346       330   439.907       320  0.00894489     0.12913
    4      16       492       476   475.908       584  0.00910251    0.128387
    5      15       629       614   491.113       552   0.0155886    0.126647
    6      16       720       704   469.257       360   0.0061489    0.128362
    7      16       817       801   457.636       388  0.00622723    0.133388
    8      16       930       914   456.919       452    0.017652    0.137619
    9      16       990       974   432.817       240   0.0114887    0.139314
   10      15      1076      1061   424.324       348    0.409345    0.145274
Total time run:       10.274947
Total reads made:     1077
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   419.272
Average IOPS:         104
Stddev IOPS:          27
Max IOPS:             146
Min IOPS:             60
Average Latency(s):   0.149629
Max latency(s):       0.877641
Min latency(s):       0.00413163

# rados bench -p benchpool 10 rand
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      15       172       157   627.877       628    0.581246   0.0651203
    2      15       439       424   847.854      1068  0.00774446   0.0720255
    3      16       655       639   851.875       860   0.0205858    0.072434
    4      15       921       906   905.877      1068  0.00824748   0.0676324
    5      16      1111      1095   875.861       756  0.00416771   0.0669811
    6      15      1340      1325     883.2       920    0.423153   0.0692408
    7      15      1567      1552   886.727       908   0.0249559   0.0707504
    8      15      1807      1792   895.859       960   0.0418014   0.0700944
    9      16      2012      1996   886.944       816  0.00408739   0.0688534
   10      16      2193      2177   870.645       724   0.0158867   0.0718316
Total time run:       10.059782
Total reads made:     2193
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   871.987
Average IOPS:         217
Stddev IOPS:          35
Max IOPS:             267
Min IOPS:             157
Average Latency(s):   0.0723455
Max latency(s):       0.753745
Min latency(s):       0.00114913

RBD BENCH
Code:
# rbd create image01 --size 1024 --pool benchpool --image-feature layering
# rbd map image01 --pool benchpool --name client.admin
# mkfs.ext4 -m0 /dev/rbd/benchpool/image01
# mkdir /mnt/ceph-block-device
# mount /dev/rbd/benchpool/image01 /mnt/ceph-block-device/
# rbd bench-write image01 --pool=benchpool
rbd: bench-write is deprecated, use rbd bench --io-type write ...
bench  type write io_size 4096 io_threads 16 bytes 1073741824 pattern sequential
  SEC       OPS   OPS/SEC   BYTES/SEC
    1     56443  56466.98  231288770.08
    2    116303  58163.57  238237969.48
    3    176545  58856.35  241075614.13
    4    235837  58965.29  241521833.73
elapsed:     4  ops:   262144  ops/sec: 58346.60  bytes/sec: 238987654.86
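As the deprecation notice suggests, the same benchmark can be run with the newer syntax, something like:
Code:
rbd bench --io-type write image01 --pool=benchpool --io-size 4K --io-threads 16 --io-total 1G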

FIO
Code:
# cat fio/examples/rbd.fio
[global]
ioengine=rbd
clientname=admin
pool=benchpool
rbdname=image01
rw=randwrite
bs=4k
[rbd_iodepth32]
iodepth=32

# /usr/local/bin/fio fio/examples/rbd.fio
rbd_iodepth32: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=32
fio-3.5-109-g4fe72
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][r=0KiB/s,w=28.2MiB/s][r=0,w=7227 IOPS][eta 00m:00s]
rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=2462835: Tue Apr 24 11:49:40 2018
  write: IOPS=6695, BW=26.2MiB/s (27.4MB/s)(1024MiB/39152msec)
    slat (nsec): min=1115, max=982907, avg=4219.44, stdev=7216.44
    clat (usec): min=1347, max=83611, avg=4773.95, stdev=2564.86
     lat (usec): min=1349, max=83616, avg=4778.16, stdev=2564.88
    clat percentiles (usec):
     |  1.00th=[ 2147],  5.00th=[ 2573], 10.00th=[ 2835], 20.00th=[ 3195],
     | 30.00th=[ 3523], 40.00th=[ 3851], 50.00th=[ 4146], 60.00th=[ 4424],
     | 70.00th=[ 4883], 80.00th=[ 5866], 90.00th=[ 7635], 95.00th=[ 9503],
     | 99.00th=[13042], 99.50th=[14353], 99.90th=[23725], 99.95th=[42730],
     | 99.99th=[80217]
   bw (  KiB/s): min=17352, max=32496, per=100.00%, avg=26802.01, stdev=3429.78, samples=78
   iops        : min= 4338, max= 8124, avg=6700.49, stdev=857.43, samples=78
  lat (msec)   : 2=0.46%, 4=44.85%, 10=50.57%, 20=4.00%, 50=0.09%
  lat (msec)   : 100=0.03%
  cpu          : usr=4.20%, sys=2.17%, ctx=145419, majf=10, minf=8454
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,262144,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=26.2MiB/s (27.4MB/s), 26.2MiB/s-26.2MiB/s (27.4MB/s-27.4MB/s), io=1024MiB (1074MB), run=39152-39152msec

# cat fio/examples/rbd.2.fio
[global]
ioengine=rbd
clientname=admin
pool=benchpool
rbdname=image01
rw=write
bs=4M
[rbd_iodepth32]
iodepth=32

# /usr/local/bin/fio fio/examples/rbd.2.fio
rbd_iodepth32: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=rbd, iodepth=32
fio-3.5-109-g4fe72
Starting 1 process
Jobs: 1 (f=1)
rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=2465474: Tue Apr 24 11:50:28 2018
  write: IOPS=101, BW=408MiB/s (428MB/s)(1024MiB/2510msec)
    slat (usec): min=823, max=10543, avg=2687.03, stdev=1322.14
    clat (msec): min=51, max=600, avg=303.36, stdev=141.08
     lat (msec): min=54, max=603, avg=306.05, stdev=141.10
    clat percentiles (msec):
     |  1.00th=[   58],  5.00th=[   78], 10.00th=[   90], 20.00th=[  153],
     | 30.00th=[  230], 40.00th=[  266], 50.00th=[  300], 60.00th=[  363],
     | 70.00th=[  418], 80.00th=[  443], 90.00th=[  481], 95.00th=[  502],
     | 99.00th=[  575], 99.50th=[  600], 99.90th=[  600], 99.95th=[  600],
     | 99.99th=[  600]
   bw (  KiB/s): min=237568, max=466944, per=88.24%, avg=368640.00, stdev=100831.51, samples=5
   iops        : min=   58, max=  114, avg=90.00, stdev=24.62, samples=5
  lat (msec)   : 100=12.11%, 250=24.22%, 500=58.20%, 750=5.47%
  cpu          : usr=18.13%, sys=6.62%, ctx=263, majf=0, minf=88114
  IO depths    : 1=0.4%, 2=0.8%, 4=1.6%, 8=3.1%, 16=6.2%, 32=87.9%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=99.8%, 8=0.0%, 16=0.0%, 32=0.2%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,256,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=408MiB/s (428MB/s), 408MiB/s-408MiB/s (428MB/s-428MB/s), io=1024MiB (1074MB), run=2510-2510msec

I've also run some tests inside a VM and an LXC container with similar configurations, both running on the same Proxmox server:

VM
Code:
bootdisk: scsi0
cores: 2
ide2: local:iso/ubuntu-16.04.4-server-amd64.iso,media=cdrom
memory: 512
name: prova
net0: virtio=92:25:FB:07:CC:45,bridge=vmbr0,tag=305
numa: 0
ostype: l26
scsi0: ceph-fast_vm:vm-105-disk-1,size=8G
scsihw: virtio-scsi-pci
smbios1: uuid=b24099a7-0e59-4a32-9067-98403a7319a0
sockets: 1

LXC
Code:
arch: amd64
cores: 2
hostname: prova
memory: 512
net0: name=eth0,bridge=vmbr0,gw=172.30.5.1,hwaddr=BE:2B:28:2E:EE:93,ip=172.30.5.49/24,tag=305,type=veth
ostype: ubuntu
rootfs: ceph-fast_ct:vm-109-disk-1,size=8G
swap: 512

FIO in the VM
Code:
# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=1G --readwrite=randrw --rwmixread=75
test: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.2.10
Starting 1 process
test: Laying out IO file(s) (1 file(s) / 1024MB)
Jobs: 1 (f=1): [m(1)] [100.0% done] [44423KB/14970KB/0KB /s] [11.2K/3742/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=2907: Tue Apr 24 12:03:20 2018
  read : io=784996KB, bw=47141KB/s, iops=11785, runt= 16652msec
  write: io=263580KB, bw=15829KB/s, iops=3957, runt= 16652msec
  cpu          : usr=14.39%, sys=40.36%, ctx=58036, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=196249/w=65895/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: io=784996KB, aggrb=47141KB/s, minb=47141KB/s, maxb=47141KB/s, mint=16652msec, maxt=16652msec
  WRITE: io=263580KB, aggrb=15828KB/s, minb=15828KB/s, maxb=15828KB/s, mint=16652msec, maxt=16652msec

Disk stats (read/write):
  sda: ios=195410/65609, merge=0/3, ticks=629004/382800, in_queue=1013464, util=99.47%

FIO in the LXC
Code:
# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=1G --readwrite=randrw --rwmixread=75
test: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.2.10

Starting 1 process
test: Laying out IO file(s) (1 file(s) / 1024MB)
Jobs: 1 (f=1): [m(1)] [100.0% done] [55128KB/18592KB/0KB /s] [13.8K/4648/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=1191: Tue Apr 24 12:22:06 2018
  read : io=784996KB, bw=54182KB/s, iops=13545, runt= 14488msec
  write: io=263580KB, bw=18193KB/s, iops=4548, runt= 14488msec
  cpu          : usr=10.64%, sys=29.70%, ctx=163343, majf=0, minf=14
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=196249/w=65895/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: io=784996KB, aggrb=54182KB/s, minb=54182KB/s, maxb=54182KB/s, mint=14488msec, maxt=14488msec
  WRITE: io=263580KB, aggrb=18192KB/s, minb=18192KB/s, maxb=18192KB/s, mint=14488msec, maxt=14488msec

Disk stats (read/write):
  rbd2: ios=195811/65719, merge=0/2, ticks=536064/371008, in_queue=909056, util=98.49%

Results are very similar for both the VM and the LXC. However, downloading an ISO file from the same server, or making a local copy of that file, shows very different performance. I've run this test several times, always with similar results.

VM
Code:
# wget http://172.30.5.2/Programari/Windows%2010/es_windows_10_education_n_x86_dvd_6847579.iso
--2018-04-24 12:37:32--  http://172.30.5.2/Programari/Windows%2010/es_windows_10_education_n_x86_dvd_6847579.iso
Connecting to 172.30.5.2:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2818172928 (2.6G) [application/octet-stream]
Saving to: ‘es_windows_10_education_n_x86_dvd_6847579.iso’

es_windows_10_education_n_x 100%[=========================================>]   2.62G  74.9MB/s    in 34s     

2018-04-24 12:38:06 (78.5 MB/s) - ‘es_windows_10_education_n_x86_dvd_6847579.iso’ saved [2818172928/2818172928]


# time sh -c "cp es_windows_10_education_n_x86_dvd_6847579.iso windows2.iso && sync"
real   0m20.316s
user   0m0.040s
sys   0m5.036s

LXC
Code:
# wget http://172.30.5.2/Programari/Windows%2010/es_windows_10_education_n_x86_dvd_6847579.iso
--2018-04-24 10:38:20--  http://172.30.5.2/Programari/Windows%2010/es_windows_10_education_n_x86_dvd_6847579.iso
Connecting to 172.30.5.2:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2818172928 (2.6G) [application/octet-stream]
Saving to: 'es_windows_10_education_n_x86_dvd_6847579.iso'

es_windows_10_education_n_x 100%[=========================================>]   2.62G   107MB/s    in 2m 10s  

2018-04-24 12:40:31 (20.6 MB/s) - 'es_windows_10_education_n_x86_dvd_6847579.iso' saved [2818172928/2818172928]

(*) In this case the download speed is not constant; it drops every few seconds.

# time sh -c "cp es_windows_10_education_n_x86_dvd_6847579.iso windows2.iso && sync"

real   1m46.540s
user   0m0.057s
sys   1m16.925s
 
Nice to see that it works that nicely with such a small amount of memory.

FIO in the VM:
READ: io=784996KB, aggrb=47141KB/s, minb=47141KB/s, maxb=47141KB/s, mint=16652msec, maxt=16652msec
WRITE: io=263580KB, aggrb=15828KB/s, minb=15828KB/s, maxb=15828KB/s, mint=16652msec, maxt=16652msec

FIO in the LXC:
READ: io=784996KB, aggrb=54182KB/s, minb=54182KB/s, maxb=54182KB/s, mint=14488msec, maxt=14488msec
WRITE: io=263580KB, aggrb=18192KB/s, minb=18192KB/s, maxb=18192KB/s, mint=14488msec, maxt=14488msec
Far off?

Leaving the other hardware aside, there is only 8 GB of RAM available for:
  • ZFS cache, 50% of available free RAM (512 MB ??)
  • 1 GB per 1 TB per OSD (especially during recovery) = 4 GB
  • 1 GB for VM/LXC (I can only see those two mentioned)
  • ~ 2 GB for PVE
  • ...
The rados bench shows around 450 MB/s with 4 MB objects. The fio test uses 4K; you will see much better results if you go to 4 MB there as well. That is how Ceph works: it has rbd caching and combines small writes into bigger ones. Take one of the 1 TB disks and run the 4K fio test against it directly; this will show what a single disk is capable of and put things into perspective.
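A minimal fio job for that raw-disk comparison could look like this (assuming /dev/sdX is a 1 TB disk that carries no OSD — writing to the raw device destroys its contents):
Code:
fio --name=raw-4k --filename=/dev/sdX --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based --group_reporting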

Also check out our Ceph benchmark paper and thread: https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/
 
Hi Alwin, thanks for your help.

I have 5 nodes exclusively for Ceph and 3 nodes exclusively for running VMs/LXCs. I also forgot to mention that I have two networks: a public Gigabit network and a 10 Gigabit network for Ceph.
Both the VM and the LXC are running on a node with dual Xeons and 16 GB of RAM, and they are the only guests active in the cluster.
Finally, I think I have isolated the problem, but I'm not sure how to solve it: all the tests I've run with fio show similar results for LXC and VM. However, a simple wget of an ISO file (2.6 GB) larger than the RAM assigned to the VM/LXC behaves very differently:
  • In the VM (Ubuntu 16.04, 512 MB RAM/swap, 2 cores), the wget runs normally, at a speed limited only by the network bandwidth.
  • In the LXC (Ubuntu 16.04, 512 MB RAM/swap, 2 cores), the wget gets stuck for a few seconds roughly every ~500 MB (about the size of the RAM). Raising the RAM to 1 GB and re-running the test, the process gets stuck roughly every 1 GB. Sometimes the process doesn't finish at all and gets killed, and dmesg shows the following message:
Code:
[85342.414567] Memory cgroup out of memory: Kill process 21188 (bash) score 1 or sacrifice child
[85342.414632] Killed process 25895 (wget) total-vm:24996kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[85342.429006] oom_reaper: reaped process 25895 (wget), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
It seems to me that this is LXC-related and not caused by Ceph, as I didn't see the problem in the VM. Just to be sure, I moved both the LXC and the VM to local storage and repeated the tests, with the same results.
 
  • In the LXC (Ubuntu 16.04, 512 MB RAM/swap, 2 cores), the wget gets stuck for a few seconds roughly every ~500 MB (about the size of the RAM). Raising the RAM to 1 GB and re-running the test, the process gets stuck roughly every 1 GB. Sometimes the process doesn't finish at all and gets killed, and dmesg shows the following message:
This is expected behavior in LXC: if a process tries to use more RAM inside its cgroup than the configured limit, the OOM killer steps in. The page cache generated by the download is charged against the container's memory limit, so the transfer stalls whenever the kernel has to reclaim and write back that cache, and the process can eventually be killed.

https://forum.proxmox.com/threads/lxc-oom-when-low-usage.38726/#post-191944
https://serverfault.com/questions/5...mited-lxc-container-writing-large-files-to-di
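To see the limit being hit, the memory cgroup can be inspected from inside the container, for example (paths assume cgroup v1, which PVE 5.x uses):
Code:
cat /sys/fs/cgroup/memory/memory.limit_in_bytes   # configured RAM limit
cat /sys/fs/cgroup/memory/memory.usage_in_bytes   # current usage, page cache included
grep -E '^(cache|rss) ' /sys/fs/cgroup/memory/memory.stat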
 
