Performance: Ceph cluster on NVMe too slow, 130 MB/s instead of 900 MB/s (6 times slower)

pille99 · Oct 9, 2022

Hello guys,

this is really becoming a huge problem. I run 20 VMs on one cluster and it is starting to lag (I am not sure the disks are the problem, but it is the only issue I can see right now).

The situation is as follows:
I upgraded Ceph from Pacific to Quincy (no improvement).
The speed on Proxmox remains slow: 130 MB/s instead of 900 MB/s read and 6000 MB/s write, which are the numbers of the NVMe drive.
I installed the cluster exactly as shown in the video on the Proxmox site.
What is very strange: the speed inside the VM is what I expect (900 MB/s write and 6 GB/s read). However, the speed test with rados bench ... shows 130 MB/s, and a copy from one OSD to another (in the same server) moved 8.4 GB in 1 minute, which is roughly 140 MB/s. That is absolutely not acceptable.

The config is nothing out of the ordinary.

ceph.conf

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.10.10.10/24
fsid = a8939e3c-7fee-484c-826f-29875927cf43
mon_allow_pool_delete = true
mon_host = 10.10.11.10 10.10.11.11 10.10.11.12 10.10.11.13
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.10.11.10/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.hvirt01]
public_addr = 10.10.11.10

[mon.hvirt02]
public_addr = 10.10.11.11

[mon.hvirt03]
public_addr = 10.10.11.12

[mon.hvirt04]
public_addr = 10.10.11.13
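
To see which of the two networks an address from this ceph.conf falls into, the values can be checked with Python's standard ipaddress module. This is only a sketch: the variable and function names are made up for illustration, and strict=False is needed because the file gives host addresses (10.10.10.10/24) rather than network addresses.

```python
import ipaddress

# Values copied from the ceph.conf above. strict=False lets ip_network()
# accept a host address such as 10.10.10.10/24 and mask it down to the
# network address (10.10.10.0/24).
cluster_network = ipaddress.ip_network("10.10.10.10/24", strict=False)
public_network = ipaddress.ip_network("10.10.11.10/24", strict=False)

mon_hosts = ["10.10.11.10", "10.10.11.11", "10.10.11.12", "10.10.11.13"]


def classify(addr):
    """Return which configured Ceph network an address belongs to."""
    ip = ipaddress.ip_address(addr)
    if ip in public_network:
        return "public"
    if ip in cluster_network:
        return "cluster"
    return "neither"


placement = {addr: classify(addr) for addr in mon_hosts}
for addr, net in placement.items():
    print(f"{addr} -> {net} network")
```

All four monitor addresses land in the public network, so whatever link carries 10.10.11.0/24 is the one the monitors and the clients talk over.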

root@hvirt01:~# ceph status
  cluster:
    id:     a8939e3c-7fee-484c-826f-29875927cf43
    health: HEALTH_OK

  services:
    mon: 4 daemons, quorum hvirt01,hvirt02,hvirt03,hvirt04 (age 24m)
    mgr: hvirt01(active, since 23m), standbys: hvirt02, hvirt04, hvirt03
    osd: 8 osds: 8 up (since 20m), 8 in (since 8d)

  data:
    pools:   3 pools, 65 pgs
    objects: 102.42k objects, 377 GiB
    usage:   1.1 TiB used, 27 TiB / 28 TiB avail
    pgs:     65 active+clean

  io:
    client:   0 B/s rd, 181 KiB/s wr, 0 op/s rd, 17 op/s wr

I don't understand why this number changes all the time: pgs: 65 active+clean. I entered 254 (as far as I remember), like in the video.
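
One possible explanation for the changing number, stated here as an assumption because the thread never confirms it: recent Ceph releases enable the PG autoscaler for new pools by default, and it adjusts pg_num on its own. The value from the video was also probably 256 rather than 254, since PG counts are normally powers of two. The sketch below applies the common rule of thumb of about 100 PGs per OSD divided by the replica count; the function name and the target of 100 are illustrative and not taken from this cluster.

```python
import math


def suggested_pg_count(num_osds, replica_size, target_per_osd=100):
    """Rule-of-thumb total PG count, rounded to the nearest power of two (in log2 terms)."""
    raw = num_osds * target_per_osd / replica_size
    return 2 ** round(math.log2(raw))


# 8 OSDs (from `ceph status`) and osd_pool_default_size = 3 (from ceph.conf).
pgs = suggested_pg_count(num_osds=8, replica_size=3)
print(pgs)
```

`ceph osd pool autoscale-status` shows what the autoscaler currently targets for each pool, which is the place to check whether it is the one changing the count.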

The Ceph cluster network is connected at 10 Gbit and the Ceph public network at 1 Gbit (but that should not matter, I guess).

There is nothing special to see in the Ceph logs:
2022-10-09T00:02:45.454941+0200 mgr.hvirt01 (mgr.2654568) 1071 : cluster [DBG] pgmap v846: 65 pgs: 65 active+clean; 377 GiB data, 1.1 TiB used, 27 TiB / 28 TiB avail; 2.7 KiB/s rd, 179 KiB/s wr, 19 op/s
2022-10-09T00:02:47.455200+0200 mgr.hvirt01 (mgr.2654568) 1073 : cluster [DBG] pgmap v847: 65 pgs: 65 active+clean; 377 GiB data, 1.1 TiB used, 27 TiB / 28 TiB avail; 0 B/s rd, 145 KiB/s wr, 18 op/s
2022-10-09T00:02:49.455464+0200 mgr.hvirt01 (mgr.2654568) 1074 : cluster [DBG] pgmap v848: 65 pgs: 65 active+clean; 377 GiB data, 1.1 TiB used, 27 TiB / 28 TiB avail; 0 B/s rd, 131 KiB/s wr, 17 op/s
2022-10-09T00:02:51.455786+0200 mgr.hvirt01 (mgr.2654568) 1076 : cluster [DBG] pgmap v849: 65 pgs: 65 active+clean; 377 GiB data, 1.1 TiB used, 27 TiB / 28 TiB avail; 0 B/s rd, 135 KiB/s wr, 18 op/s
2022-10-09T00:02:53.455985+0200 mgr.hvirt01 (mgr.2654568) 1078 : cluster [DBG] pgmap v850: 65 pgs: 65 active+clean; 377 GiB data, 1.1 TiB used, 27 TiB / 28 TiB avail; 0 B/s rd, 116 KiB/s wr, 16 op/s
2022-10-09T00:02:55.456249+0200 mgr.hvirt01 (mgr.2654568) 1079 : cluster [DBG] pgmap v851: 65 pgs: 65 active+clean; 377 GiB data, 1.1 TiB used, 27 TiB / 28 TiB avail; 0 B/s rd, 149 KiB/s wr, 19 op/s
2022-10-09T00:02:57.456490+0200 mgr.hvirt01 (mgr.2654568) 1081 : cluster [DBG] pgmap v852: 65 pgs: 65 active+clean; 377 GiB data, 1.1 TiB used, 27 TiB / 28 TiB avail; 0 B/s rd, 118 KiB/s wr, 12 op/s
2022-10-09T00:02:59.456755+0200 mgr.hvirt01 (mgr.2654568) 1083 : cluster [DBG] pgmap v853: 65 pgs: 65 active+clean; 377 GiB data, 1.1 TiB used, 27 TiB / 28 TiB avail; 2.7 KiB/s rd, 115 KiB/s wr, 9 op/s
2022-10-09T00:03:01.457058+0200 mgr.hvirt01 (mgr.2654568) 1084 : cluster [DBG] pgmap v854: 65 pgs: 65 active+clean; 377 GiB data, 1.1 TiB used, 27 TiB / 28 TiB avail; 2.7 KiB/s rd, 139 KiB/s wr, 10 op/s
2022-10-09T00:03:03.457255+0200 mgr.hvirt01 (mgr.2654568) 1086 : cluster [DBG] pgmap v855: 65 pgs: 65 active+clean; 377 GiB data, 1.1 TiB used, 27 TiB / 28 TiB avail; 2.7 KiB/s rd, 120 KiB/s wr, 8 op/s
2022-10-09T00:03:05.457539+0200 mgr.hvirt01 (mgr.2654568) 1087 : cluster [DBG] pgmap v856: 65 pgs: 65 active+clean; 377 GiB data, 1.1 TiB used, 27 TiB / 28 TiB avail; 2.7 KiB/s rd, 137 KiB/s wr, 10 op/s
2022-

This is becoming a serious issue; the performance is just awful.
Any suggestions?
(I couldn't find any proper post about it.)
Please let me know if you need more data.
Thanks

Dunuin · Oct 9, 2022

Do you use consumer NVMes instead of the recommended enterprise/datacenter SSDs?
Which NIC models is Ceph using? How did you set up the network?

pille99 · Oct 9, 2022

The following SSD is in the servers (2x): MZQL23T8HCLS
https://semiconductor.samsung.com/ssd/datacenter-ssd/pm9a3/mzql23t8hcls-00a07/
This drive is listed on the Samsung website as a datacenter SSD.

The NICs are Intel models.
1x 10 Gbit for Ceph cluster = connected to each other on their own switch.
1x 1 Gbit for Ceph public = connected to each other on their own switch.
1x 1 Gbit for corosync = connected to each other on their own switch.
1x 1 Gbit uplink

pille99 · Oct 10, 2022

I still couldn't find a solution.

dcsapak · Oct 10, 2022

Without analyzing this too deeply, I guess this is limited by your

pille99 said:
1x 1 Gbit for Ceph public = connected to each other on their own switch.

The Ceph public network is the network over which your clients communicate with the Ceph cluster (i.e. all disk traffic).
See https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/

This is a 1 Gbit network => ~125 MB/s at line rate, which is right where your ~130 MB/s sits.
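
The arithmetic behind that estimate, as a small sketch. It only converts a nominal link speed to bytes per second; Ethernet, IP and TCP overhead are ignored, so real throughput sits a little lower. The helper name is made up for illustration.

```python
def line_rate(gbit_per_s):
    """Convert a nominal link speed in Gbit/s to (MB/s, MiB/s), ignoring protocol overhead."""
    bytes_per_s = gbit_per_s * 1e9 / 8
    return bytes_per_s / 1e6, bytes_per_s / 2**20


for speed in (1, 10):
    mb, mib = line_rate(speed)
    print(f"{speed} Gbit/s = {mb:.0f} MB/s = {mib:.0f} MiB/s")
```

A 1 Gbit link therefore tops out at 125 MB/s (about 119 MiB/s), and a 10 Gbit link at ten times that.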

pille99 · Oct 10, 2022

dcsapak said:
Without analyzing this too deeply, I guess this is limited by your

The Ceph public network is the network over which your clients communicate with the Ceph cluster (i.e. all disk traffic).
See https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/

This is a 1 Gbit network => ~125 MB/s at line rate, which is right where your ~130 MB/s sits.

So this needs to be configured exactly the other way around?

1x 10 Gbit for Ceph cluster = connected to each other on their own switch.
1x 1 Gbit for Ceph public = connected to each other on their own switch.

I really understood it completely the other way around: the cluster network should be the 10 Gbit one for sync, and the public one is for other jobs.

Can I change it easily?

Does Ceph write twice in real time (once for the VM's own copy, and then again for the transfer to the other OSDs)?

B.Otto · Oct 10, 2022

Do note that separating Ceph public and Ceph private networks is optional. If you set up a Ceph Cluster via the Proxmox GUI and only define the public network, the private ceph traffic will also use the public network. For a (minimum) ceph cluster with 3 nodes it is not necessarily better to separate them.

dcsapak · Oct 10, 2022

pille99 said:
I really understood it completely the other way around: the cluster network should be the 10 Gbit one for sync, and the public one is for other jobs.

The public network carries traffic between the Ceph cluster and Ceph clients; the cluster network carries inter-OSD replication traffic. But if you measure with e.g. rados bench, that will use the public network to communicate with Ceph.

As @B.Otto said, in smaller setups it's not necessarily better to separate the two.
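
To make the split concrete, here is a rough model of how many bytes one client write puts on each network. It assumes a replicated pool where the client sends the data once to the primary OSD over the public network and the primary forwards one copy per additional replica over the cluster network. Metadata, heartbeats and recovery traffic are ignored, and the function name is hypothetical.

```python
def bytes_per_network(write_bytes, replica_size):
    """Bytes carried by each Ceph network for one replicated client write."""
    return {
        "public": write_bytes,                        # client -> primary OSD
        "cluster": write_bytes * (replica_size - 1),  # primary -> other replicas
    }


GiB = 2**30
# A 32 GiB disk image written to a pool with size = 3 (osd_pool_default_size in ceph.conf).
traffic = bytes_per_network(32 * GiB, replica_size=3)
for net, n in traffic.items():
    print(f"{net}: {n / GiB:.0f} GiB")
```

With size = 3 the cluster network carries twice the client's write volume in this model, so a slow link on either side can hold writes back. Reads only touch the public network, and when the client and the primary OSD sit on the same node, that first hop never leaves the host.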

pille99 · Oct 10, 2022

B.Otto said:
Do note that separating Ceph public and Ceph private networks is optional. If you set up a Ceph Cluster via the Proxmox GUI and only define the public network, the private ceph traffic will also use the public network. For a (minimum) ceph cluster with 3 nodes it is not necessarily better to separate them.

I separated it:
1x for Ceph public
1x for Ceph cluster
1x for corosync
I just made the mistake of connecting the 10 Gbit link to the cluster network instead of the public one. I need to watch the video on the Proxmox site again; I got it completely wrong, but I will check. Otherwise I will just swap them. The question is: is that possible without big issues?

Neobin · Oct 10, 2022

https://forum.proxmox.com/threads/performance-ceph-vs-linu-vm.115238

pille99 · Oct 11, 2022

I just watched the video: both 10 Gbit NICs are used for the Ceph cluster.
Another quote: the second NIC is used for the Proxmox VE cluster communication.

pille99 · Oct 11, 2022

It should be OK if I just change the IPs on the NICs, to go the easiest way!

Right now it is like this:
10.10.10.10, 11, 12 and 13 = 10 Gbit
10.10.11.10, 11, 12 and 13 = 1 Gbit

Just swap the networks and reboot the cluster. That should do the job! Can somebody confirm it?

pille99 · Oct 12, 2022

pille99 said:
It should be OK if I just change the IPs on the NICs, to go the easiest way!

Right now it is like this:
10.10.10.10, 11, 12 and 13 = 10 Gbit
10.10.11.10, 11, 12 and 13 = 1 Gbit

Just swap the networks and reboot the cluster. That should do the job! Can somebody confirm it?

I changed the config. It didn't get better.

Copy from one NVMe to another in the same server (a VM file):
drive-virtio0: transferred 21.0 GiB of 32.0 GiB (65.59%) in 3m
drive-virtio0: transferred 21.1 GiB of 32.0 GiB (65.96%) in 3m 1s
drive-virtio0: transferred 21.2 GiB of 32.0 GiB (66.30%) in 3m 2s
drive-virtio0: transferred 21.3 GiB of 32.0 GiB (66.62%) in 3m 3s
drive-virtio0: transferred 21.4 GiB of 32.0 GiB (66.98%) in 3m 4s
drive-virtio0: transferred 21.5 GiB of 32.0 GiB (67.31%) in 3m 5s
drive-virtio0: transferred 21.7 GiB of 32.0 GiB (67.72%) in 3m 6s
drive-virtio0: transferred 21.8 GiB of 32.0 GiB (68.09%) in 3m 7s
drive-virtio0: transferred 21.9 GiB of 32.0 GiB (68.46%) in 3m 8s
drive-virtio0: transferred 22.0 GiB of 32.0 GiB (68.85%) in 3m 9s

which comes down to roughly the same 130 MB/s
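
A short sketch of how that rate can be read off the progress lines. The regular expression is written for the exact line format shown above and is an assumption about it, not a documented format; only the first and last line of the excerpt are included.

```python
import re

LOG = """\
drive-virtio0: transferred 21.0 GiB of 32.0 GiB (65.59%) in 3m
drive-virtio0: transferred 22.0 GiB of 32.0 GiB (68.85%) in 3m 9s
"""

# Captures the transferred GiB plus the optional minutes and seconds parts.
PATTERN = re.compile(
    r"transferred ([\d.]+) GiB of [\d.]+ GiB \([\d.]+%\) in (?:(\d+)m)?\s*(?:(\d+)s)?"
)


def parse(line):
    """Return (GiB transferred so far, elapsed seconds) for one progress line."""
    m = PATTERN.search(line)
    gib = float(m.group(1))
    seconds = int(m.group(2) or 0) * 60 + int(m.group(3) or 0)
    return gib, seconds


samples = [parse(line) for line in LOG.splitlines()]
for gib, seconds in samples:
    mib_s = gib * 1024 / seconds
    mb_s = gib * 2**30 / seconds / 1e6
    print(f"{gib} GiB in {seconds}s = {mib_s:.0f} MiB/s = {mb_s:.0f} MB/s")
```

For the last line this gives about 119 MiB/s, or 125 MB/s, which is almost exactly what a 1 Gbit link can carry.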

pille99 · Oct 12, 2022

I have seen in bmon that the traffic goes over the 10 Gbit network,
but it still doesn't use the whole bandwidth; it only uses about as much as a 1 Gbit link.

Dunuin · Oct 12, 2022

Are the other nodes still running on Gbit NICs?

pille99 · Oct 13, 2022

All 4 nodes are connected to the 10 Gbit network. I am having the switch they are connected to checked; it should be a 10 Gbit switch, but just to make sure.
I changed the following:
10 Gbit is now Ceph public
1 Gbit is now the cluster network (monitors are here)
1 Gbit is corosync
1 Gbit is uplink

Neobin · Oct 13, 2022

pille99 said:
I changed the following:
10 Gbit is now Ceph public
1 Gbit is now the cluster network (monitors are here)

Read the post from @aaron : [1]

I have no practical experience with Ceph, but as I understand it, you want fast links for both networks (public and private (= cluster)).
If you have only one fast link, do not split the two networks; instead let both go over the same fast link and separate them e.g. with VLANs.

This is at least how I understand it...

[1] https://forum.proxmox.com/threads/performance-ceph-vs-linu-vm.115238/#post-503491

pille99 · Oct 13, 2022

The network looks brilliant:

root@hvirt01:~# iperf -c 10.10.11.11   # the 10 Gbit network
------------------------------------------------------------
Client connecting to 10.10.11.11, TCP port 5001
TCP window size: 748 KByte (default)
------------------------------------------------------------
[ 3] local 10.10.11.10 port 51320 connected with 10.10.11.11 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0000-10.0006 sec 11.5 GBytes 9.90 Gbits/sec
root@hvirt01:~# iperf -c 10.10.10.11   # the 1 Gbit network
------------------------------------------------------------
Client connecting to 10.10.10.11, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 3] local 10.10.10.10 port 36374 connected with 10.10.10.11 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0000-10.0028 sec 1.10 GBytes 942 Mbits/sec

Neobin · Oct 13, 2022

pille99 said:
root@hvirt01:~# iperf -c 10.10.10.11   # the 1 Gbit network
------------------------------------------------------------
Client connecting to 10.10.10.11, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 3] local 10.10.10.10 port 36374 connected with 10.10.10.11 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0000-10.0028 sec 1.10 GBytes 942 Mbits/sec

Either you do not want to hear(/read) the facts that several nice and helpful people and I in this and your other thread told you or you simply ignore them...

Anyway, I wish you good luck.

pille99 · Oct 13, 2022

Neobin said:
Either you do not want to hear(/read) the facts that several nice and helpful people and I in this and your other thread told you or you simply ignore them...

Anyway, I wish you good luck.

My friend, no, I am not ignoring what you are saying, and I will do it next. It just doesn't make sense to me, to be honest.
The network has plenty of resources left, and I expect to see some change when I change something. Anyway, I am changing it right now and will see if anything changes at all.
It makes perfect sense that if I put the "public Ceph" network on 1 Gbit (the public network is for the data replication), the bandwidth is no more than 1 Gbit, which was reflected in the numbers (130 MB/s). The cluster network, on the other hand, carries monitoring, status and similar traffic, which needs much less bandwidth. I expected to see an increase in bandwidth, but there was none. It stayed exactly the same: 130 MB/s. That makes me believe it is not network related. But I will change it and put both cluster and public on 10 Gbit, and if that tests successfully, I will put a dual 10 Gbit NIC in (at the datacenter where the servers are).
