PVE 5.1-46: Ceph BlueStore poor performance with small files

northe

Hi folks,
I have observed that three Windows servers (2008 R2 and 2016), each with a big 4 TB volume as a second drive, show very poor performance when handling small files on this big drive. Clients have this issue too, of course, but I want to leave them out of scope because that makes troubleshooting easier.

Drive C: 140 GB, NTFS, MBR, 1 partition, VirtIO, driver 0.141, 50% free
Drive F: 4 TB, NTFS, MBR, 1 partition, VirtIO, driver 0.141, 30% free

From F: to F:
- copy 500 MB with 2,500 files to a new directory: 800 KB/s to 5 MB/s, sometimes it drops to 0 MB/s
- applying ACLs to files is extremely slow
- copy one 500 MB file: 20-200 MB/s
- delete 1 GB with 5,000 files on F: takes ~5 min, Windows says 10-20 elements per second!

From F: to C:
- copy 500 MB with 2,500 files to a new directory: 17-50 MB/s
- 2nd run: copy 500 MB with 2,500 files to a new directory: 1-5 MB/s
- copy one 500 MB file: 1 s (I suppose cache)

From C: to F:
- copy 1 GB with 5,000 files: 3-5 MB/s, sometimes 0 KB/s
- copy one 500 MB file: 1 s (I suppose cache)
- delete 1 GB with 5,000 files on C: takes < 1 min, Windows says 1,300-250 elements per second!

No virus scanner is installed.
On the Ceph backend I have seen continuous rebuilds at 500-700 MB/s, with peaks of 1,300 MB/s.
A quick check with iperf gives me about 9,300 Mbit/s.
None of these tests fit the very bad performance inside the virtual machines.

I am running the latest versions of PVE and Ceph on 5 nodes, each a well-equipped system:
Supermicro X10DRI-T, 2x Intel E5-2667 v4, 512 GB DDR4 ECC, Areca 1883IX-12 controller with 8 GB cache and BBU,
2x Intel 240 GB SSD for the OS, 8x HGST 10 TB SAS for data, 2x 10 Gbit BCM57840 dual-port NIC and 1x Intel X540-AT2 dual-port NIC

Network layout:
1x 1 Gbit copper for management
1x 802.3ad bond of 2x 10 Gbit for VM traffic
1x 802.3ad bond of 4x 10 Gbit for Ceph (dedicated HP switch, see the sketch below)
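
For reference, such a 4x 10 Gbit LACP bond looks roughly like this in /etc/network/interfaces (interface names and the address are only placeholders here):

auto bond1
iface bond1 inet static
        address 10.10.10.11
        netmask 255.255.255.0
        bond-slaves ens1f0 ens1f1 ens2f0 ens2f1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
        # dedicated Ceph network; must match the cluster/public network in ceph.conf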

Versions:
proxmox-ve: 5.1-42 (running kernel: 4.13.16-1-pve)
pve-manager: 5.1-46 (running version: 5.1-46/ae8241d4)
pve-kernel-4.13: 5.1-43
pve-kernel-4.13.16-1-pve: 4.13.16-43
pve-kernel-4.13.13-6-pve: 4.13.13-42
pve-kernel-4.13.4-1-pve: 4.13.4-26
ceph: 12.2.4-pve1
corosync: 2.4.2-pve3
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-common-perl: 5.0-28
libpve-guest-common-perl: 2.0-14
libpve-http-server-perl: 2.0-8
libpve-storage-perl: 5.0-17
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 2.1.1-3
lxcfs: 2.0.8-2
novnc-pve: 0.6-4
proxmox-widget-toolkit: 1.0-11
pve-cluster: 5.0-20
pve-container: 2.0-19
pve-docs: 5.1-16
pve-firewall: 3.0-5
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-4
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.9.1-9
pve-xtermjs: 1.0-2
qemu-server: 5.0-22
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.6-pve1~bpo9

Any ideas where to look?
 
Yes, spinning drives are not high performers, but I suppose they are not the reason for 0 MB/s.
OSDs = XFS, data and journal on the same disks.

rados bench -p vm-data 10 write
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_node1708-1_2032047
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 130 114 455.963 456 0.0984345 0.0566341
2 16 163 147 293.966 132 0.0148666 0.0778756
3 16 258 242 322.627 380 0.0143889 0.111563
4 16 319 303 302.964 244 0.0120232 0.125687
5 16 337 321 256.77 72 0.0117889 0.123179
6 16 396 380 253.305 236 0.043694 0.14111
7 16 481 465 265.685 340 0.0166266 0.198632
8 16 571 555 277.47 360 0.07888 0.20555
9 16 616 600 266.638 180 0.0140951 0.193907
10 16 710 694 277.57 376 0.0124721 0.209275
11 16 711 695 252.7 4 3.40224 0.21387
12 13 711 698 232.641 12 1.84889 0.220422
Total time run: 12.254092
Total writes made: 711
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 232.086
Stddev Bandwidth: 153.654
Max bandwidth (MB/sec): 456
Min bandwidth (MB/sec): 4
Average IOPS: 58
Stddev IOPS: 38
Max IOPS: 114
Min IOPS: 1
Average Latency(s): 0.272444
Stddev Latency(s): 0.842613
Max latency(s): 6.35914
Min latency(s): 0.0106514
Cleaning up (deleting benchmark objects)
Removed 711 objects
Clean up completed and total clean up time :4.109677
 
osd = xfs, data and journal
Hi,
the thread title says BlueStore but this posting says XFS...
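
You can check what the OSDs actually use with something like this (OSD 0 as an example):

ceph osd metadata 0 | grep osd_objectstore    # prints "filestore" or "bluestore"
# or for all OSDs at once:
ceph osd metadata | grep -E '"id"|"osd_objectstore"'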

You don't have journal SSDs, only the cache of the Areca SAS controller?

A performance test with 4 MB blocks is useless if you are looking for trouble with small files... use 4K blocks, a longer test time (60 s) and only one thread. That is much closer to what your client is doing.
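
For example, something in this direction (pool name taken from your bench above, adjust as needed):

rados bench -p vm-data 60 write -b 4096 -t 1 --no-cleanup    # 4K blocks, 60 s, one thread
rados bench -p vm-data 60 seq -t 1                           # read the same objects back
rados -p vm-data cleanup                                     # remove the benchmark objects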

VirtIO block driver, or SCSI with virtio-scsi?

Udo
 
Hi Udo,
yes, the controllers are configured for Ceph without any RAID functionality. The cache is available for the attached drives. I had contact with Areca support in advance. Each drive has NVC Quick Cache and 256 MB cache, too.
I use VirtIO. For one test I switched to virtio-scsi but saw no improvement, so I switched back.
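
For anyone who wants to try the same switch, it goes roughly like this (VM ID 100 is just a placeholder):

qm config 100 | grep -E 'scsihw|virtio1'   # check the controller type and the current disk entry
qm set 100 --scsihw virtio-scsi-pci        # switch the SCSI controller type
# then detach the virtio1 disk and re-add it as scsi1 (GUI: Detach -> edit the unused
# disk -> Bus "SCSI"), make sure the vioscsi driver is installed in the guest, and reboot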
It is hard to believe that this cluster, which is nearly idle most of the time, can be brought to a standstill by a 500 GB copy job. Even a 10-year-old notebook with a 5,400 rpm spinner will push the data at over 10 MB/s.

Remember: only operations within the big drive are the problem. The same operations on smaller disks are okay.
 
For one server I have created a new drive, formatted it with a 64 KB allocation unit size and copied the data back onto it. I will test. First impression: a homeopathic remedy.
The other server occupies a lot of space on its drive; I'll need a long weekend to migrate it, but I want to see the behavior of the first one first.
 
Formatting the drives with a 64 KB allocation unit size does not help to get more performance here.
It is weird, because in the evening, when the VMs start their backups, I watch ceph status reporting 280 MB/s (read), while during working hours Ceph reports ~1-10 MB/s (read) and a few KB/s of writes, and the 500 MB copy gets stuck. Even a rebuild of the PGs can speed up to 500 MB/s.
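
For reference, this is roughly what I watch during the copy jobs:

watch -n 1 ceph -s             # overall status incl. client read/write rates
ceph osd pool stats vm-data    # per-pool client I/O
ceph osd perf                  # per-OSD commit/apply latencies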
The car has the horsepower but cannot get it onto the road.

Perhaps it is time to create a new, pure SSD pool and move the busy VM drives to it. Which brands and models are recommended? I have my eye on 20x Micron 5100 MAX 960 GB.
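
If I go down that road, I suppose the device-class approach in Luminous is the way to do it, roughly like this (rule/pool names and PG count are only placeholders):

# CRUSH rule that only selects OSDs with device class "ssd"
ceph osd crush rule create-replicated replicated-ssd default host ssd
# new pool on that rule, then enable it for RBD and add it as a storage in PVE
ceph osd pool create vm-ssd 512 512 replicated replicated-ssd
ceph osd pool application enable vm-ssd rbd
# afterwards the busy disks can be moved online with "Move disk" in the GUI
# or: qm move_disk <vmid> <disk> <new-storage>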