Hello
I have some performance issues with my Proxmox cluster.
I use 3× Dell R620 servers with 2× 2.6 GHz Xeons each. Every server has its own dual-port 10G Mellanox 3rd-gen NIC for Ceph with an MTU of 9000, and the three servers are directly connected (no 10G switch for Ceph). The setup is simple (used as a test setup): each server has one Samsung PM883 480 GB SSD as OSD, attached to a Dell H710P Mini RAID controller as a single-disk RAID-0. Proxmox itself is freshly installed at the latest version, and the firmware of the NICs is the latest available. Iperf3 delivers a nice throughput of around 9.9 Gbit/s.
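For reference, a minimal sketch of how the link and jumbo frames can be verified between two of the nodes; the cluster-network IPs used here (172.16.1.1 and 172.16.1.2) are assumptions, not necessarily the exact addresses:
Code:
# On node 1: start an iperf3 server on the Ceph cluster link
iperf3 -s

# On node 2: 30 second throughput test against node 1 (IP is an assumption)
iperf3 -c 172.16.1.1 -t 30

# Check that MTU 9000 really works end to end:
# 8972 byte payload = 9000 MTU minus 28 bytes IP/ICMP header; -M do forbids fragmentation
ping -M do -s 8972 -c 4 172.16.1.1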
Problem:
A simple dd or a copy from VM 1 to VM 2 struggles at around 33 MB/s, which is not an acceptable speed. Although I have read a lot about Ceph and also run it with Mimic on a different cluster, I'm clueless at the moment. Of course, I first went through the obvious blog articles and tuning guides, but any input is highly appreciated.
Latency stays low, between 0 and 1 ms, while copying.
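To make clear what "a simple dd" means here, the test inside the VM looked roughly like the sketch below; the file path, size and target host name are placeholders, not the exact commands (without oflag=direct the dd would mostly measure the guest page cache):
Code:
# Sequential 1 GiB write inside the VM, bypassing the guest page cache
dd if=/dev/zero of=/root/ddtest bs=1M count=1024 oflag=direct

# Copy from VM 1 to VM 2 (host name is a placeholder)
scp /root/ddtest vm2:/root/
Below are my ceph.conf, the decompiled CRUSH map and the ceph osd perf output.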
Code:
[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 172.16.1.0/28
fsid = d1222c67-c6e9-4b0b-b8b9-0abe53e5f590
keyring = /etc/pve/priv/$cluster.$name.keyring
mon allow pool delete = true
osd journal size = 5120
osd pool default min size = 2
osd pool default size = 3
public network = 10.0.10.0/24
[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring
[mon.server-3]
host = server-3
mon addr = 10.0.10.3:6789
[mon.server-1]
host = server-1
mon addr = 10.0.10.1:6789
[mon.server-2]
host = server-2
mon addr = 10.0.10.2:6789
Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54
# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
# buckets
host server-1 {
	id -3		# do not change unnecessarily
	id -2 class ssd		# do not change unnecessarily
	# weight 0.436
	alg straw2
	hash 0	# rjenkins1
	item osd.0 weight 0.436
}
host server-2 {
	id -5		# do not change unnecessarily
	id -4 class ssd		# do not change unnecessarily
	# weight 0.436
	alg straw2
	hash 0	# rjenkins1
	item osd.1 weight 0.436
}
host server-3 {
	id -7		# do not change unnecessarily
	id -6 class ssd		# do not change unnecessarily
	# weight 0.436
	alg straw2
	hash 0	# rjenkins1
	item osd.2 weight 0.436
}
root default {
	id -1		# do not change unnecessarily
	id -8 class ssd		# do not change unnecessarily
	# weight 1.308
	alg straw2
	hash 0	# rjenkins1
	item server-1 weight 0.436
	item server-2 weight 0.436
	item server-3 weight 0.436
}
# rules
rule replicated_rule {
	id 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type host
	step emit
}
# end crush map
Code:
ceph osd perf
osd  commit_latency(ms)  apply_latency(ms)
  0                   0                 0
  2                   0                 0
  1                   0                 0
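Since ceph osd perf only shows the (idle) commit/apply latencies, here is a minimal sketch of how the raw cluster write/read throughput could be measured directly on a node, independent of the VM layer; the pool name "testpool", runtime and thread count are assumptions:
Code:
# 60 second 4 MiB sequential write benchmark (pool name is an assumption)
rados bench -p testpool 60 write -b 4M -t 16 --no-cleanup

# Read the written objects back sequentially, then remove the benchmark objects
rados bench -p testpool 60 seq -t 16
rados -p testpool cleanup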