Ceph performance worse after complete reinstall.

Hello.
So we have 3 nodes, with 2x Samsung sm631a 480GB SSDs in each.
Ceph is connected via a 10Gbit switch, and the only thing I have tweaked is:

osd_memory_target = 1073741824

So each OSD only uses 1GB, instead of 4GB.

The pool uses size=3, min_size=2.
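For context, in /etc/pve/ceph.conf the tweak looks roughly like this (the [osd] section shown here is just the usual place for such a setting):

[osd]
     # cap the OSD/BlueStore memory at ~1 GiB instead of the 4 GiB default
     osd_memory_target = 1073741824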


Now, what is so frustrating: before the reinstall, these are the results I got in the VM:
photo_2019-06-17_15-03-49.jpg

Very good results for the lower range, and very responsive VMs.


After the reinstall (it's like it doesn't even want to cache anymore):
Selection_019.png

Now, I don't remember whether I used the non-subscriber repo for the testing or not.

But with the new system I have bought licenses and updated using the subscriber repos.


I don't have a screenshot with KRBD enabled on the old system, but on the new system:
Selection_018.png

Still poor results on the smaller files.

I have enabled jumbo frames on the switch and servers.
I have checked with iperf3; the network works great.
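For anyone wanting to reproduce the check, something like this does it (interface name and address are just examples):

# confirm the jumbo MTU on the storage NIC
ip link show ens1f0 | grep mtu
# iperf3 server on one node, client on another
iperf3 -s
iperf3 -c 10.10.10.2 -P 4 -t 30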

rados -p ceph-pool bench 10 write --no-cleanup
rados -p ceph-pool bench 10 seq

1 SSD per server:
Write: 350-370 MB/s
Read: 1380 MB/s

2 SSDs per server:
Write: 550 MB/s
Read: 1700 MB/s
 
Yes. In the first image, Atto uses a 256MB test file, so we are below cache size.
To clarify for me: was the osd_memory_target of 1 GiB in effect for the test? Just a thought; if, for example, the setting was set but the cache was still ~4 GiB, it would have influenced the test.
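The value a running OSD actually uses can be checked on the node through the admin socket, for example (osd.0 is just an example ID):

ceph daemon osd.0 config get osd_memory_target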

It is interesting that the write was on par with the read values of the test on the old system. As reads can be done in parallel, the write value should be around a third to half of the read value. The second test seems more on par with what I would expect from the SSDs in the 3x2 setup.

The third test clearly shows the effect of caching, as >2 GB/s is more than the theoretical limit of 10 GbE, even in a bond. The results could come from the page cache and the cache mode 'writeback' for the disk images.

What is the configuration (vmid.conf) of the VM? And on what package version (pveversion -v) is the system?
 
It looks like the osd_memory_target does not matter in this case.
The only difference I can think of is that the previous setup used the non-subscriber repo. And one of the servers got a new LSI 9207-8i controller card.

I will get you the info if I can remote into the PVE network from home; otherwise I'll get it tomorrow :)

This is a little bit frustrating, as we had planned to go into production on Thursday; we have done over 3 months of testing already and everything seemed to be working perfectly.
 
proxmox-ve: 5.4-1 (running kernel: 4.15.18-15-pve)
pve-manager: 5.4-6 (running version: 5.4-6/aa7856c5)
pve-kernel-4.15: 5.4-3
pve-kernel-4.15.18-15-pve: 4.15.18-40
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph: 12.2.12-pve1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-10
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-52
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-43
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-28
pve-cluster: 5.0-37
pve-container: 2.0-39
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-22
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-2
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-52
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2

agent: 1
bootdisk: scsi0
cores: 6
ide0: none,media=cdrom
ide2: none,media=cdrom
memory: 4096
name: VM01-DC
net0: virtio=82:F5:EB:23:9B:D2,bridge=vmbr0,firewall=1,tag=10
numa: 0
onboot: 1
ostype: win8
scsi0: ceph-ssd:vm-100-disk-0,size=100G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=559b244d-09cd-4df5-aced-a18eef6428dd
sockets: 1
vmgenid: 7fb0d3fb-a7fb-4cce-88c2-27b9746bbc84
 
It looks like the osd_memory_target does not matter in this case.
How come?

The only difference I can think of is that the previous setup used the non-subscriber repo.
Ceph's version is the same and the kernel is only a couple of minor versions different. Did the setup change from the initial one?

And one of the servers got a new LSI 9207-8i controller card.
An HBA is the preferred hardware, see https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition
Do all the others have RAID controllers? I hope not; while it might not be an issue with 2 OSDs (disks), it very well can become one later on.

scsi0: ceph-ssd:vm-100-disk-0,size=100G,ssd=1
One thing that can help is to set the cache to writeback. When KRBD is in use, KVM will use its own cache; with librbd, Ceph's client cache is used.
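For example, for the disk from your config above it could be set like this (a sketch; adjust VM ID and disk as needed):

qm set 100 --scsi0 ceph-ssd:vm-100-disk-0,cache=writeback,ssd=1

The same can be done in the GUI under the VM's Hardware -> Hard Disk -> Cache.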

Do you have some FIO tests from the sm631a disks? And rados benchmarks?

See the thread below for our Ceph benchmark paper and a comparison of user results.
The paper also contains the actual commands used.
https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/
 
Do the other servers have the same controller? Did you check the FW on all of them?
Could you paste your HW configuration here?
We have not yet upgraded the FW. Will try that tomorrow.




How come?
My bad, the setting works, but it doesn't seem to cache as effectively on smaller files as it used to. I have to backtrack and see if I can find what I did differently.

Ceph's version is the same and the kernel is only a couple of minor versions different. Did the setup change from the initial one?
I am trying to find out what I did differently.

An HBA is the preferred hardware,
The LSI controller is an HBA.

Do you have some FIO tests from the sm631a disks? And rados benchmarks?
The rados read/write results are in the first post. I have no FIO tests; will see what I can do today.
 
The LSI controller is an HBA.
I guess I wrote that a little unclearly; yes, I know that it is an HBA. I meant that it is the preferred hardware for Ceph, and my assumption was that the other nodes have RAID controllers instead of HBAs. Do they?

The rados read/write results are in the first post. I have no FIO tests; will see what I can do today.
Yes, can you please post the full output?
 
When KRBD is in use, KVM will use its own cache; with librbd, Ceph's client cache is used.
Interesting. Is there any difference in how the system operates when using the subscriber repos and when using the non-subscriber repos?

What are the options for Ceph's client cache?
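Edit: from the Ceph docs it looks like the librbd cache is controlled by settings like these in the [client] section of ceph.conf (the values here are only illustrative, not recommendations):

[client]
     rbd_cache = true
     # per-image cache size in bytes
     rbd_cache_size = 33554432
     # amount of dirty data allowed before writeback kicks in
     rbd_cache_max_dirty = 25165824
     # behave as writethrough until the guest sends its first flush
     rbd_cache_writethrough_until_flush = true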
 
root@proxmox1:~# rados -p testpool bench 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_proxmox1_782272
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 168 152 607.928 608 0.0825797 0.0978525
2 16 317 301 601.908 596 0.122983 0.10315
3 16 471 455 606.578 616 0.0276888 0.102139
4 16 632 616 615.912 644 0.0598993 0.102392
5 16 799 783 626.309 668 0.0890948 0.101132
6 16 961 945 629.909 648 0.0635424 0.100755
7 16 1119 1103 630.198 632 0.0726201 0.101091
8 16 1253 1237 618.41 536 0.064339 0.102918
9 16 1389 1373 610.132 544 0.149746 0.104237
10 16 1526 1510 603.912 548 0.0845137 0.104945
Total time run: 10.076330
Total writes made: 1527
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 606.173
Stddev Bandwidth: 47.1405
Max bandwidth (MB/sec): 668
Min bandwidth (MB/sec): 536
Average IOPS: 151
Stddev IOPS: 11
Max IOPS: 167
Min IOPS: 134
Average Latency(s): 0.105496
Stddev Latency(s): 0.0466042
Max latency(s): 0.301975
Min latency(s): 0.0205945



root@proxmox1:~# rados -p testpool bench 10 seq
hints = 1
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 15 467 452 1806.47 1808 0.0193237 0.0338417
2 15 944 929 1856.45 1908 0.036654 0.0333522
3 16 1387 1371 1826.9 1768 0.0713852 0.0339684
Total time run: 3.349775
Total reads made: 1527
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1823.41
Average IOPS: 455
Stddev IOPS: 18
Max IOPS: 477
Min IOPS: 442
Average Latency(s): 0.0341957
Max latency(s): 0.120797
 
I guess I wrote that a little unclearly; yes, I know that it is an HBA. I meant that it is the preferred hardware for Ceph, and my assumption was that the other nodes have RAID controllers instead of HBAs. Do they?


Yes, can you please post the full output?
Can FIO test the block device, or do I have to create a CephFS pool and test using a file?
 
Bandwidth (MB/sec): 606.173
Average Latency(s): 0.105496
Compared to the benchmark in the paper, the values from your test are close to the ones I had. That test setup was 3x nodes with 4x OSDs each.

Bandwidth (MB/sec): 1823.41
Average Latency(s): 0.0341957
Those values exceed the compared test by far. There have been some good improvements between 12.2.2 and 12.2.12.

What do longer rados benchmarks show? Just to make sure it does not drop after a very good burst.
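For example (pool name as in your earlier tests; remember the cleanup afterwards):

rados -p ceph-pool bench 60 write --no-cleanup
rados -p ceph-pool bench 60 seq
rados -p ceph-pool cleanup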

Can FIO test the block device, or do I have to create a CephFS pool and test using a file?
FIO can test a block device, a file, or even go directly through librbd. See the Ceph benchmark paper.
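A minimal sketch with fio's rbd engine, if your fio build includes it (pool and image names are examples; use a throwaway image so no VM disk gets overwritten):

# create a temporary test image
rbd -p ceph-pool create fio-test --size 10240
fio --ioengine=rbd --clientname=admin --pool=ceph-pool --rbdname=fio-test --rw=write --bs=4K --iodepth=1 --numjobs=1 --runtime=60 --time_based --name=fio-rbd
# remove it afterwards
rbd -p ceph-pool rm fio-test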
 
I think this may be something in the kernel? Because it's just the small files that are a problem; the overall performance is great.

If the non-subscriber repo uses a different kernel.

https://forum.proxmox.com/threads/ceph-low-performance-especially-4k.47235/page-2

So if I understand this right: "The user space implementation of the Ceph block device (i.e., librbd) cannot take advantage of the Linux page cache, so it includes its own in-memory caching, called 'RBD caching'."

And: "krbd: Enforce access to rados block devices through the krbd kernel module."

So krbd depends on which kernel is used, and how it is compiled?
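(If I read the docs right, that is the krbd switch on the storage definition in /etc/pve/storage.cfg, something like this, with the storage and pool names from earlier in the thread:)

rbd: ceph-ssd
     pool ceph-pool
     content images
     krbd 1

With krbd 1 the images are mapped through the kernel rbd module; without it QEMU goes through librbd.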


Do you have the complete command to test a Ceph block pool with FIO? The document only shows how to test a physical drive.
 
fio --ioengine=libaio --filename=/dev/sdd --direct=1 --sync=1 --rw=write --bs=4K \
 --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=fio \
 --output-format=terse,json,normal --output=fio.log --bandwidth-log



fio: (groupid=0, jobs=1): err= 0: pid=54938: Tue Jun 18 11:20:39 2019
write: io=5165.7MB, bw=88159KB/s, iops=22039, runt= 60001msec
slat (usec): min=2, max=341, avg= 3.18, stdev= 1.92
clat (usec): min=1, max=1946, avg=41.62, stdev=19.13
lat (usec): min=35, max=1949, avg=44.81, stdev=19.61
clat percentiles (usec):
| 1.00th=[ 36], 5.00th=[ 36], 10.00th=[ 37], 20.00th=[ 37],
| 30.00th=[ 37], 40.00th=[ 37], 50.00th=[ 37], 60.00th=[ 38],
| 70.00th=[ 38], 80.00th=[ 38], 90.00th=[ 42], 95.00th=[ 57],
| 99.00th=[ 143], 99.50th=[ 145], 99.90th=[ 159], 99.95th=[ 159],
| 99.99th=[ 298]
lat (usec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=94.03%
lat (usec) : 100=2.90%, 250=3.04%, 500=0.03%
lat (msec) : 2=0.01%
cpu : usr=7.70%, sys=9.43%, ctx=1322422, majf=9, minf=22
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=1322404/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: io=5165.7MB, aggrb=88158KB/s, minb=88158KB/s, maxb=88158KB/s, mint=60001msec, maxt=60001msec

Disk stats (read/write):
sdd: ios=86/1320925, merge=0/0, ticks=24/52064, in_queue=52024, util=86.70%

Output from one of the disks.
 
I think this may be something in the kernel? Because it's just the small files that are a problem; the overall performance is great.
pve-kernel (4.15.18-41) unstable; urgency=medium
* update sources to Ubuntu-4.15.0-52.56
* backport: KVM: x86: Emulate MSR_IA32_ARCH_CAPABILITIES on AMD hosts
* backport: net: rds: force to destroy connection if t_sock is NULL in rds_tcp_kill_sock()
* bump ABI to 4.15.18-16

-- Proxmox Support Team <support@proxmox.com> Tue, 18 Jun 2019 07:36:54 +0200

pve-kernel (4.15.18-40) unstable; urgency=medium
* update to Ubuntu-4.15.0-51.55
* bump ABI to 4.15.18-15

-- Proxmox Support Team <support@proxmox.com> Tue, 21 May 2019 17:43:20 +0200
That is only one version difference; it depends of course on how up-to-date the node is. I would rather suspect that something was different in the tests that were made before. If in doubt, you can add the pve-no-subscription repository and install the newer packages to compare.

The Samsung sm631a seems to do well in single-thread performance, and the other benchmarks look good too. Tests inside the VM will have even more layers in between.

EDIT: while a little bit older, this blog shows how to use fio with rbd.
http://telekomcloud.github.io/ceph/2014/02/26/ceph-performance-analysis_fio_rbd.html
 
Thanks. I just found out that sometime during the last week I enabled something called flow control on our Unifi XG-16 switch (10Gbps).
With that off again, we are halfway back to the first result.
Selection_034.png
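(For reference, the NIC-side pause/flow-control settings can also be checked with ethtool; the interface name is just an example:)

# show current flow control settings
ethtool -a ens1f0
# disable it on the NIC as well, if needed
ethtool -A ens1f0 rx off tx off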
 
