[SOLVED] Ceph - poor write speed - NVME

eliyahuadam (Member, joined Mar 26, 2020)
Hello,

I'm facing poor write performance (IOPS, and TPS as well) on a Linux VM running MongoDB.
Cluster:
Nodes: 3
Hardware: HP Gen11
Disks: 4x NVMe PM1733 Enterprise ## with latest firmware and driver
Network: BCM57414 NetXtreme-E / 25G
PVE version: 8.2.4, kernel 6.8.8-2-pve

Ceph:
Version: 18.2.2 Reef
4 OSDs per node
PG: 512
Replica 2/1 (size 2, min_size 1)
Additional Ceph config:
bluestore_min_alloc_size_ssd = 4096 ## tried also 8K
osd_memory_target = 8G
osd_op_num_threads_per_shard_ssd = 8
OSD disk cache configured as "write through" ## Ceph recommendation for better latency
Apply/commit latency below 1 ms.

Network:
MTU: 8191 ## maximum value supported
TX/RX ring: 2046 ## maximum value supported
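As a quick sanity check on the jumbo-frame setup, the largest unfragmented ICMP payload follows from the MTU, and a ping with the don't-fragment bit set verifies it end to end (a sketch; `CEPH_PEER` is a placeholder for another node's IP):

```shell
# Max unfragmented ICMP payload = MTU - 20 (IP header) - 8 (ICMP header)
mtu=8191
payload=$(( mtu - 28 ))
echo "$payload"    # 8163
# Verify end-to-end with the don't-fragment bit set:
#   ping -M do -s "$payload" -c 3 "$CEPH_PEER"
```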

iperf3 results: ~24 Gbit/s between all Ceph nodes:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 28.3 GBytes 24.3 Gbits/sec 0 sender
[ 5] 0.00-10.00 sec 28.3 GBytes 24.3 Gbits/sec receiver

VM:
Rocky 9 (also tried Ubuntu 22):
boot: order=scsi0
cores: 32
cpu: host
memory: 4096
name: test-fio-2
net0: virtio=BC:24:11:F9:51:1A,bridge=vmbr2
numa: 0
ostype: l26
scsi0: Data-Pool-1:vm-102-disk-0,size=50G ## OS
scsihw: virtio-scsi-pci
smbios1: uuid=5cbef167-8339-4e76-b412-4fea905e87cd
sockets: 2
tags: templatae
virtio0: sa:vm-103-disk-0,backup=0,cache=writeback,discard=on,iothread=1,size=33G ### Local disk - same NVME
virtio2: db-pool:vm-103-disk-0,backup=0,cache=writeback,discard=on,iothread=1,size=34G ### Ceph - same NVME
virtio3: db-pool:vm-104-disk-0,backup=0,cache=unsafe,discard=on,iothread=1,size=35G ### Ceph - same NVME

Disk1: local NVMe, with iothread
Disk2: Ceph disk with write cache, with iothread
Disk3: Ceph disk with write cache unsafe, with iothread

I ran the FIO test in one SSH session and iostat in a second session:

fio --filename=/dev/vda --sync=1 --rw=write --bs=64k --numjobs=1 --iodepth=1 --runtime=15 --time_based --name=fioa

Results:
Disk1 - Local nvme:
WRITE: bw=74.4MiB/s (78.0MB/s), 74.4MiB/s-74.4MiB/s (78.0MB/s-78.0MB/s), io=1116MiB (1170MB), run=15001-15001msec
TPS: 2500
Disk2 - Ceph disk with write cache:
WRITE: bw=18.6MiB/s (19.5MB/s), 18.6MiB/s-18.6MiB/s (19.5MB/s-19.5MB/s), io=279MiB (292MB), run=15002-15002msec
TPS: 550-600
Disk3 - Ceph disk with Write Cache Unsafe:
WRITE: bw=177MiB/s (186MB/s), 177MiB/s-177MiB/s (186MB/s-186MB/s), io=2658MiB (2788MB), run=15001-15001msec
TPS: 5000-8000
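For what it's worth, these numbers are self-consistent: at queue depth 1, fio ops/s is just bandwidth divided by block size, and with --sync=1 each write turns into two device commands (write + flush), which is roughly what iostat counts as TPS. A quick check on the Disk2 figures (shell arithmetic only):

```shell
# Disk2: 18.6 MiB/s at 64k, queue depth 1
bw_kib=$(( 186 * 1024 / 10 ))   # 18.6 MiB/s expressed in KiB/s
fio_ops=$(( bw_kib / 64 ))      # 64k writes completed per second
echo "$fio_ops"                 # 297 -> ~3.4 ms per sync write
echo $(( fio_ops * 2 ))         # 594, in line with the 550-600 TPS from iostat
```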

The VM disk cache is configured as "Write Cache".
The queue scheduler is set to "none" (on the Ceph OSD disks as well).


Any suggestions on how to improve the write speed (write cache or none)?
How can I find the bottleneck?


I will gladly add more information as needed.



Many Thanks.
 
Have you tried to attach the virtual disks as SCSI disks with the virtio-scsi-single controller?
Hi @gurubert



Same results with scsi and the virtio-scsi-single controller:
WRITE: bw=18.8MiB/s (19.7MB/s), 18.8MiB/s-18.8MiB/s (19.7MB/s-19.7MB/s), io=282MiB (296MB), run=15003-15003msec




 
Hi @LnxBil


I never tried to do it.

However, I have another Proxmox cluster with an identical configuration.
The other cluster uses PM1735 disks (almost the same write speed as the PM1733: 3800 MB/s vs. 3700 MB/s).
There, the write speed (and IOWAIT) values are very good, 2.5x better.

The NVMe disks are upgraded to the latest firmware.

I ran FIO tests with the same VM on the different clusters.
VM config: writeback on the Proxmox side and in the VM, iothread, discard.
I've also tried virtio-blk and scsi.


Results on the good cluster:

WRITE: bw=49.2MiB/s (51.6MB/s), 49.2MiB/s-49.2MiB/s (51.6MB/s-51.6MB/s), io=738MiB (773MB), run=15001-15001msec

Results on the bad cluster:
WRITE: bw=16.1MiB/s (16.9MB/s), 16.1MiB/s-16.1MiB/s (16.9MB/s-16.9MB/s), io=241MiB (253MB), run=15001-15001msec

On the bad cluster, I removed one of the OSDs and ran the same FIO test directly on the Proxmox host:
WRITE: bw=1154MiB/s (1210MB/s), 1154MiB/s-1154MiB/s (1210MB/s-1210MB/s), io=16.9GiB (18.1GB), run=15001-15001msec

I suspect the bottleneck may be related to the network adapter, even though I've optimized the configuration (buffers, MTU, TX and RX rings) and iperf3 shows 24.6 Gbit/s between the Proxmox nodes.
 
Another thing: if I remove the --sync flag from the FIO test, the write speeds are almost identical on both clusters:
fio --filename=/dev/vda --rw=write --bs=64k --numjobs=1 --iodepth=1 --runtime=15 --time_based --name=fioa

WRITE: bw=1363MiB/s (1429MB/s), 1363MiB/s-1363MiB/s (1429MB/s-1429MB/s), io=19.0GiB (21.4GB), run=15003-15003msec

WRITE: bw=1382MiB/s (1450MB/s), 1382MiB/s-1382MiB/s (1450MB/s-1450MB/s), io=20.3GiB (21.7GB), run=15001-15001msec
 
Hello,

One thing you might use to detect config differences between both hosts is to run our `pvereport > $(hostname)-report.txt` tool to generate a system report and use `diff -u` to spot differences in how the clusters are configured.
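The report/diff step might look like this (a sketch; `pvereport` only exists on PVE hosts, so it is shown commented, and the node filenames are examples):

```shell
# One report per node, named after the host so the files can be diffed later:
report="$(hostname)-report.txt"
#   pvereport > "$report"                       # run this on each node
# Copy both files to one machine, then:
#   diff -u nodeA-report.txt nodeB-report.txt | less
echo "$report"
```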

Some users have reported better IO when enabling KRBD on the Ceph storage (configured in `/etc/pve/storage.cfg`); you might try that.

I would recommend performing benchmarks directly against the disks (some enterprise disks will under-report how fast they can perform IO), then against the RBD layer [1], e.g. with

Code:
rados bench 300 write --no-cleanup  -p POOL_NAME

before troubleshooting the performance inside of a guest.

[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#ceph_rados_block_devices
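Note that rados bench defaults to 4 MiB objects with 16 concurrent writers, which measures throughput rather than the queue-depth-1 sync latency that `fio --sync=1` exercises. A closer analogue (POOL_NAME again a placeholder) is a single-threaded small-block run, whose bandwidth is bound by per-op latency:

```shell
# Single-threaded 4k writes; latency-bound, not throughput-bound:
#   rados bench 60 write -b 4096 -t 1 -p POOL_NAME --no-cleanup
# At queue depth 1, bandwidth = block size / per-op latency, e.g. at 1.7 ms/op:
awk 'BEGIN { printf "%.0f KiB/s\n", 4 / 0.0017 }'   # 2353 KiB/s (~2.3 MiB/s)
```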
 
Hi @Maximiliano

The main difference from the "good" cluster is the Ceph version; the Ceph parameters are identical.
"Good" cluster: Quincy 17.2.7
"Bad" cluster: Reef 18.2.2

I've tried enabling KRBD on one of the Ceph pools but didn't see any improvement in write speed. Do I need to recreate the VM, the disks, or both after enabling KRBD?


I ran rados bench on a test pool, and an FIO test against the same NVMe disk model as the OSDs:
testpool: replica 2/1, PG 128, OSD 11
rados bench 300 write --no-cleanup -p testpool
Total time run: 300.017
Total writes made: 283288
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 3776.95
Stddev Bandwidth: 99.0029
Max bandwidth (MB/sec): 3940
Min bandwidth (MB/sec): 2536
Average IOPS: 944
Stddev IOPS: 24.7507
Max IOPS: 985
Min IOPS: 634
Average Latency(s): 0.0169354
Stddev Latency(s): 0.00480476
Max latency(s): 0.225156
Min latency(s): 0.00701366

FIO test 64k:
fio --filename=/dev/nvme4n1 --sync=1 --rw=write --bs=64k --numjobs=1 --iodepth=1 --runtime=15 --time_based --name=fioaa
WRITE: bw=1154MiB/s (1210MB/s), 1154MiB/s-1154MiB/s (1210MB/s-1210MB/s), io=16.9GiB (18.1GB), run=15001-15001msec

FIO test 4k:
fio --filename=/dev/nvme4n1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=15 --time_based --name=fioaa
WRITE: bw=321MiB/s (336MB/s), 321MiB/s-321MiB/s (336MB/s-336MB/s), io=4811MiB (5045MB), run=15001-15001msec


In summary, write performance is very good in the tests on the Proxmox host (both rados bench and FIO) compared to the tests inside the VM guest.
 
If a VM with one or more disks in the Ceph pool is currently running you will need to "Stop" it and then "Start" it again for the change to be applied to this specific VM.
 
Hi @Maximiliano

I already tried stop/start, and even a new VM; same results.
I'm really frustrated by this.

In addition, not sure if it indicates anything, but without iothread the results are 15% better (scsi and virtio-blk).

Thanks.
 
Did you enable KRBD on the RBD storage? Please share with us the contents of


Code:
cat /etc/pve/storage.cfg
 
Hi @Maximiliano

I've tried with and without KRBD (pool name db-pool):
root@proxmox-cluster02-apc-1:~# cat /etc/pve/storage.cfg
dir: local
path /var/lib/vz
content vztmpl,backup,iso

lvmthin: local-lvm
thinpool data
vgname pve
content images,rootdir

rbd: Data-Pool-1
content rootdir,images
krbd 0
pool Data-Pool-1

nfs: Content_Backup-afq02
export /nfs
path /mnt/pve/Content_Backup-afq02
server 10.101.4.14
content vztmpl,snippets,rootdir,images,backup,iso
prune-backups keep-all=1

rbd: db-pool
content rootdir,images
krbd 1
pool db-pool

rbd: testpool
content images
krbd 1
pool testpool

I'm also sharing rados bench results:
rados bench -p testpool 30 write --no-cleanup
Total time run: 30.0137
Total writes made: 28006
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 3732.42
Stddev Bandwidth: 166.574
Max bandwidth (MB/sec): 3892
Min bandwidth (MB/sec): 2900
Average IOPS: 933
Stddev IOPS: 41.6434
Max IOPS: 973
Min IOPS: 725
Average Latency(s): 0.0171387
Stddev Latency(s): 0.00626496
Max latency(s): 0.133125
Min latency(s): 0.00645552

I've also removed one of the OSDs and ran an FIO test on it:

fio --filename=/dev/nvme4n1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=20 --time_based --name=fioaa
WRITE: bw=297MiB/s (312MB/s), 297MiB/s-297MiB/s (312MB/s-312MB/s), io=5948MiB (6237MB), run=20001-20001msec

Very good results.

Please let me know if more information is needed.

Many Thanks.
 
Hi All,

I've managed to solve the problem.

It was a CPU C-state issue; I added the following to the GRUB kernel command line:

intel_idle.max_cstate=0 processor.max_cstate=0

The servers are configured with the "high performance" profile in the BIOS, but the C-state settings were not reflected in the OS.
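For reference, a sketch of what that change looks like in /etc/default/grub on a node that boots via GRUB (your existing command-line options will differ):

```shell
# /etc/default/grub - append the C-state options to the kernel command line:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_idle.max_cstate=0 processor.max_cstate=0"
# Apply and reboot:
#   update-grub && reboot
# Verify after reboot:
#   cat /sys/module/intel_idle/parameters/max_cstate   # should print 0
```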

After rebooting the servers, the ping latency dropped from 0.2-3 ms to 0.02 ms.
Write IOPS are 3.5x better.
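That magnitude fits a simple back-of-the-envelope model: at queue depth 1, every sync write waits out the full network round trip, so the IOPS ceiling is roughly 1 / per-op latency. Plugging in the two ping latencies above (a rough sketch; it ignores NVMe and Ceph processing time):

```shell
# qd1 IOPS ceiling from the network round-trip latency alone:
awk 'BEGIN { printf "%.0f vs %.0f ops/s\n", 1/0.0002, 1/0.00002 }'   # 5000 vs 50000 ops/s
```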

Thanks everyone for the help.
 