Low disk subsystem performance

ooptimum

New Member
Oct 31, 2022
I've got four HP DL360 G9 servers, which I intend to use in a hyper-converged cluster setup with Ceph. All of them have the same hardware configuration: two sockets with Intel(R) Xeon(R) CPU E5-2699 v4 processors @ 2.20GHz (88 cores per server in total), 768GiB of registered DDR4 RAM (configured memory speed: 1866 MT/s), eight Samsung PM883 (MZ7LH1T9HMLT) SATA SSDs (data-center grade, 1.92TB, rated at 30K IOPS for random writes) connected via an H240ar controller in HBA mode @ 6Gb/s, plus four integrated 1G ports and one additional dual-port 10G network adapter. All the servers are upgraded to the latest firmware and have Proxmox 7.2-11 installed, with all packages upgraded to the current versions from the public repositories. The OS is installed on an internal Kingston DC1000B 240GB NVMe M.2 SSD (also data-center grade) via a PCIe-to-NVMe adapter. All disks are brand new.

With such a configuration, one would assume that disk performance should be fine. Unfortunately, this is not the case: my average three-year-old PC beats these servers several times over on the same drives.

Testing was performed using the fio utility, as follows:
Code:
fio --name=/dev/drive --ioengine=libaio --direct=1 --fsync=1 --readwrite=randwrite --blocksize=4k --runtime=120

Since all the disks were intended for Ceph, the scheduler and write-caching settings on the host were made according to the recommendations from the Ceph documentation: /sys/block/<dev>/queue/scheduler was set to 'none', and write caching was set to 'write through'. I ran 5 tests consecutively; the results below are the averages of the values obtained in those runs.
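
In concrete terms, those host settings were applied roughly like this (a sketch, with /dev/sdb standing in for one of the data SSDs; the real device names differ):
Code:
# assumption: /dev/sdb is one of the PM883 data SSDs; repeated for every data disk
echo none > /sys/block/sdb/queue/scheduler                 # no I/O scheduler for this SSD
echo "write through" > /sys/block/sdb/queue/write_cache    # treat the disk cache as write-through
cat /sys/block/sdb/queue/scheduler                         # verify: should list [none] as active
cat /sys/block/sdb/queue/write_cache                       # verify: should print "write through"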

The best I managed to get when writing to a raw disk device in the host OS is roughly 37 MiB/s at 9,500 IOPS. It's definitely not the 30K IOPS rating, but it'll do. Then I started testing the performance inside the virtual machine.
Debian 11, CLI only, fully upgraded:

Code:
agent: 1
boot: order=scsi0
cores: 2
cpu: host
machine: q35
memory: 8192
meta: creation-qemu=7.0.0,ctime=1667311996
name: test-vm
net0: virtio=2A:*:*:*:65:7B,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: local-lvm:vm-100-disk-0,discard=on,size=32G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=xxxxxxxx-xxxx-4xxx-xxxx-xxxxxxxxxxxx
sockets: 2
It was a long marathon during which I tried almost every possible disk layout: RAID, LVM, LVM-Thin, LVM over RAID, LVM-Thin over RAID, ZFS, qcow2, passing raw devices through into the virtual machine, SCSI controller, VirtIO controller, etc. The best result I got was when I passed a block device through into the virtual machine:
virtio5: /dev/sdc3,backup=0,discard=on,iothread=1,size=640127671808
I got about 13 MiB/s at 3325 IOPS. Not impressive numbers at all. Sorry to mention it here, but an ESXi VM on the same hardware shows triple the performance. Even FreeBSD 13 performs better than Linux when run on such a server with default settings: 45.46 MiB/s at 11.64k IOPS on average (ioengine=psync). So I started to blame the disk controller driver, especially since it already had known regressions.

I had no other controller at hand to test this theory, so I decided to abstract away from the hardware disk subsystem altogether and test the performance of the disk in RAM. I created a small drive in RAM (modprobe brd rd_nr=1 rd_size=16777216) and tested its speed first on the host and then in the virtual machine (qm set 100 -virtio0 /dev/ram0 and then set iothread=1 in the vm config file). These are the values I got:
Host: 866.4 MiB/s at 221.8k IOPS
VM: 13.56 MiB/s at 3572.8 IOPS
I have never been able to get values greater than these, even when trying to write to a raw partition on an NVMe drive. So far it looks like the bottleneck is inside QEMU. But maybe I am mistaken? The disk performance with Ceph is even more miserable (I use a 10G network to synchronize the nodes).
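
For anyone who wants to reproduce the RAM-drive comparison, the sequence was roughly the following (a sketch; here with an explicit --filename, otherwise the same fio options as in the first test):
Code:
# create a 16 GiB RAM-backed block device on the host
modprobe brd rd_nr=1 rd_size=16777216

# benchmark it on the host
fio --name=ramtest --filename=/dev/ram0 --ioengine=libaio --direct=1 --fsync=1 --readwrite=randwrite --blocksize=4k --runtime=120

# pass it through to VM 100 (iothread=1 was then set in the VM config) and repeat the same fio run inside the guest
qm set 100 -virtio0 /dev/ram0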

My inner geek really wants to run a cluster on Proxmox VE, and only these disk performance issues stand in the way. Please help me solve them.
 
Did you test on the ESXi VM with exactly the same command?
Removing the fsync from this benchmark should get your IOPS closer to the 30k. Is there a reason why you use fsync=1?
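I.e. the same command, just without the per-write fsync, roughly:
Code:
fio --name=/dev/drive --ioengine=libaio --direct=1 --readwrite=randwrite --blocksize=4k --runtime=120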
 
Also, regarding the RAM drive, a better comparison would be to create a RAM disk inside the VM. If you pass it through, I think it gets synchronized and whatnot, slowing it down substantially.
 
Did you test on the ESXi VM with exactly the same command?
Yes, I used exactly the same command. Otherwise, the test would have been unfair.

Removing the fsync from this benchmark should get your IOPS closer to the 30k. Is there a reason why you use fsync=1?

As I mentioned before, these servers were intended to be used with Ceph, and this is Ceph's way of measuring performance:

[attached screenshot: the fio benchmark command recommended in the Ceph documentation]

Also, AFAIK, it's very close to how DBMSes work with disks to prevent data loss. So it's quite justified.
 
If I see it right, you are testing the Ceph behaviour (fsync=1) inside the VM. That will be "nested" in some way, I think.
Normal processes in the VM will not normally use fsync. A DBMS may use fsync, but then mostly for sequential writes (redo logs, for example).
 
Also, regarding the RAM drive, a better comparison would be to create a RAM disk inside the VM. If you pass it through, I think it gets synchronized and whatnot, slowing it down substantially.
Since I was evaluating how Qemu works with external storage, there is no point in evaluating the performance of a RAM drive created inside the VM, I suppose. If Qemu treats an external RAM drive this way, why should it treat disks of other types differently? There would be exactly the same delays. In this case Qemu's poor performance with the disk subsystem was clearly shown by the roughly 60-fold difference in speed. In this experiment, all the parts that could also affect performance - the disk controller, its driver, cables, backplane - were taken out of the equation.

And as I mentioned before, I have not been able to get higher performance in a virtual machine with external storage of any type than with the RAM drive. The figures obtained with it are the performance ceiling.

I would be very grateful if you could show me a way to improve the performance of random writes with small blocks, which is a typical DBMS scenario.
 
What's the performance when setting cache=unsafe? I'll have a better look at this tomorrow.
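For a quick test that would be something roughly like this (assuming VM 100 and the scsi0 disk from the config posted above; cache=unsafe ignores flushes, so it is for benchmarking only):
Code:
# benchmarking only: cache=unsafe drops flush/fsync guarantees for this disk
qm set 100 --scsi0 local-lvm:vm-100-disk-0,discard=on,ssd=1,cache=unsafe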
 
If I see it right, you are testing the Ceph behaviour (fsync=1) inside the VM. That will be "nested" in some way, I think.
Normal processes in the VM will not normally use fsync. A DBMS may use fsync, but then mostly for sequential writes (redo logs, for example).
There is probably some truth in what you say, but I started all these tests after my DBA began complaining that the test database (about 1.5TB of data) was very slow in the new cluster with Ceph.

To make the testing fair, I created a disk on the Ceph storage for the test virtual machine, then shut the virtual machine down to eliminate the influence of Qemu entirely, and tested the block device on Ceph with the same tool directly from the Proxmox host. Here are the results I got:

Code:
# fio --name=test --ioengine=rbd --pool=ceph-replicated --rbdname=vm-100-disk-0 --direct=1 --fsync=1 --readwrite=write --blocksize=4k --runtime=120

#1 Jobs: 1 (f=1): [W(1)][100.0%][w=1152KiB/s][w=288 IOPS][eta 00m:00s]
#2 Jobs: 1 (f=1): [W(1)][100.0%][w=1420KiB/s][w=355 IOPS][eta 00m:00s]
#3 Jobs: 1 (f=1): [W(1)][100.0%][w=1344KiB/s][w=336 IOPS][eta 00m:00s]
#4 Jobs: 1 (f=1): [W(1)][100.0%][w=1297KiB/s][w=324 IOPS][eta 00m:00s]
#5 Jobs: 1 (f=1): [W(1)][100.0%][w=1416KiB/s][w=354 IOPS][eta 00m:00s]

Inside the virtual machine, it's even slower.
 
so you are benchmarking ceph?
Then there are thousands of parameters to look at: network settings for Ceph, the Ceph version, Ceph storage/OSD parameters, pool redundancy, the number of nodes, switch settings, and many more aspects.
 
so you are benchmarking ceph?
I started with Ceph, but then I tested all the storage-type options. I came to the conclusion that Qemu performs badly in every case. Of course, I am only referring to the scenario mentioned above.
 
We also use HP G9 servers, but with a different HBA (P840), and we use RAID mode. As far as I remember, we did not have these problems with locally attached storage (e.g. ZFS), but we used the battery-backed write cache of the P840. Without that, most performance figures were "ugly".
I do not remember the details of HP "Smart Path" and similar technologies, but you can tune a lot in this area. Maybe something similar is true for the H240ar.
 
4000-6000 IOPS with 4k blocks and iodepth=1 is expected with RBD.

(You have network latency, but also CPU latency to process the RADOS algorithm on both the client and the server side.)

If you increase iodepth, it should scale. (But yes, Ceph can be bad for random small writes with low concurrency.)
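
For example, roughly the same rbd benchmark as earlier in the thread, but as random writes with more in-flight I/O (pool and image names taken from that test; no per-write fsync here, so the higher queue depth can take effect):
Code:
# same pool/image as in the earlier test, iodepth=32 instead of 1
fio --name=rbd-test --ioengine=rbd --pool=ceph-replicated --rbdname=vm-100-disk-0 --direct=1 --readwrite=randwrite --blocksize=4k --iodepth=32 --runtime=120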

For sequential small-write workloads, you can enable writeback on the VM disk. (It will aggregate small writes and send one big transaction, so there is less CPU work to compute the CRUSH map.)
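
In the VM config that is just the cache option on the disk line, for example (disk name taken from the config posted earlier):
Code:
scsi0: local-lvm:vm-100-disk-0,cache=writeback,discard=on,size=32G,ssd=1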

You can reduce latency/CPU usage by disabling debug logging in ceph.conf (on both the Proxmox nodes and the Ceph nodes):

Code:
/etc/ceph/ceph.conf    
[global]
 debug asok = 0/0
 debug auth = 0/0
 debug buffer = 0/0
 debug client = 0/0
 debug context = 0/0
 debug crush = 0/0
 debug filer = 0/0
 debug filestore = 0/0
 debug finisher = 0/0
 debug heartbeatmap = 0/0
 debug journal = 0/0
 debug journaler = 0/0
 debug lockdep = 0/0
 debug mds = 0/0
 debug mds balancer = 0/0
 debug mds locker = 0/0
 debug mds log = 0/0
 debug mds log expire = 0/0
 debug mds migrator = 0/0
 debug mon = 0/0
 debug monc = 0/0
 debug ms = 0/0
 debug objclass = 0/0
 debug objectcacher = 0/0
 debug objecter = 0/0
 debug optracker = 0/0
 debug osd = 0/0
 debug paxos = 0/0
 debug perfcounter = 0/0
 debug rados = 0/0
 debug rbd = 0/0
 debug rgw = 0/0
 debug throttle = 0/0
 debug timer = 0/0
 debug tp = 0/0
 rbd_skip_partial_discard = true
 perf = true
 mutex_perf_counter = false

You can also disable cephx authentication (this requires restarting all VMs and the Ceph services):

Code:
auth_cluster_required = none
auth_service_required = none
auth_client_required = none


Some bluestore tuning (in the ceph.conf on the nodes where the OSDs are located):

Code:
[osd]
bluestore rocksdb options = compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,max_bytes_for_level_base=536870912,compaction_threads=32,max_bytes_for_level_multiplier=8,flusher_threads=8,compaction_readahead_size=2MB
 
Execute
Code:
ssacli ctrl slot=0 show config
ssacli ctrl slot=0 show
Also check whether the array accelerator and caching are enabled or not.
 
I do not remember the details of HP "Smart Path" and similar technologies, but you can tune a lot in this area. Maybe something similar is true for the H240ar.
Unfortunately, in HBA mode there is nothing to configure; there are simply no settings.

4000-6000 IOPS with 4k blocks and iodepth=1 is expected with RBD.
I'd be happy to get similar numbers, but mine are, to my deepest regret, an order of magnitude lower, even for sequential rather than random writes. Thanks for the recommendations, I will definitely try them.

Execute
Code:
ssacli ctrl slot=0 show config
ssacli ctrl slot=0 show
Code:
# ssacli ctrl slot=0 show config

Smart HBA H240ar in Slot 0 (Embedded)     (sn: PDNLN0BRHAX17N)



   Internal Drive Cage at Port 1I, Box 1, OK



   Internal Drive Cage at Port 2I, Box 1, OK


   Port Name: 1I

   Port Name: 2I


   HBA Drives

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA SSD, 1.9 TB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA SSD, 1.9 TB, OK)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SATA SSD, 1.9 TB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SATA SSD, 1.9 TB, OK)
      physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SATA SSD, 1.9 TB, OK)
      physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SATA SSD, 1.9 TB, OK)
      physicaldrive 2I:1:7 (port 2I:box 1:bay 7, SATA SSD, 1.9 TB, OK)
      physicaldrive 2I:1:8 (port 2I:box 1:bay 8, SATA SSD, 1.9 TB, OK)

   SEP (Vendor ID HP, Model H240ar) 379  (WWID: 5001438043176EC0)

# ssacli ctrl slot=0 show

Smart HBA H240ar in Slot 0 (Embedded)
   Bus Interface: PCI
   Slot: 0
   Serial Number: PDNLN0BRHAX17N
   Cache Serial Number: PDNLN0BRHAX17N
   Controller Status: OK
   Hardware Revision: B
   Firmware Version: 7.00
   Firmware Supports Online Firmware Activation: False
   Cache Board Present: False
   Controller Temperature (C): 45
   Cache Module Temperature (C): 39
   Number of Ports: 2 Internal only
   Driver Name: hpsa
   Driver Version: 3.4.20
   HBA Mode Enabled: True
   PCI Address (Domain:Bus:Device.Function): 0000:03:00.0
   Negotiated PCIe Data Rate: PCIe 3.0 x8 (7880 MB/s)
   Controller Mode: HBA
   Pending Controller Mode: HBA
   Controller Mode Reboot: Not Required
   Port Max Phy Rate Limiting Supported: False
   Current Power Mode: MaxPerformance
   Survival Mode: Enabled
   Host Serial Number: 6CU550WWC1
   Sanitize Erase Supported: True
   Primary Boot Volume: None
   Secondary Boot Volume: None
Also check whether the array accelerator and caching are enabled or not.
I'm afraid I don't quite understand what needs to be checked. Can you explain or give instructions?
 
Although not generally recommended, we use the HP P840 in RAID mode with JBOD-RAID0 (one single-disk RAID0 per drive), so we can use the cache settings and the battery/capacitor-backed cache. You must know what you are doing in this case and check it against your use case (e.g. data security).
Using the HP controller without its features may not give good performance, at least with some of the older models. I do not know yours in detail; this is true for the P840 at least.
 
Although not generally recommended, we use the HP P840 in RAID mode with JBOD-RAID0 (one single-disk RAID0 per drive), so we can use the cache settings and the battery/capacitor-backed cache. You must know what you are doing in this case and check it against your use case (e.g. data security).
Using the HP controller without its features may not give good performance, at least with some of the older models. I do not know yours in detail; this is true for the P840 at least.
The H240 is primarily an HBA controller, but with RAID capability, so it's very natural for it to work as an HBA. Anyway, mine has no backup battery or capacitors. Thank you in any case.
 
Code:
for example:

hdparm -W /dev/sda

/dev/sda:
 write-caching =  1 (on)

sdparm --get WCE /dev/sda
    /dev/sda: ATA       TOSHIBA MG07ACA1  0101
WCE           1  [cha: y]
smartctl -g wcache /dev/sda
smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-305.19.1.el8_4.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

Write cache is:   Enabled

cat /sys/class/scsi_disk/0\:0\:0\:0/cache_type
write back

check disks
 
check disks
Disk caching is disabled; I wrote about it in my original post:
Since all the disks were intended for Ceph, the scheduler and write-caching settings on the host were made according to the recommendations from the Ceph documentation: /sys/block/<dev>/queue/scheduler was set to 'none', and write caching was set to 'write through'.
 