Proxmox VE Ceph Benchmark 2018/02

tuonoazzurro

I installed Proxmox on the first drive and then manually installed the bootloader onto a USB stick.
Then I configured the server to boot from the USB stick since it cannot boot from any drive on the controller when in HBA mode.

This was a lab setup so I didn't bother with Software RAID1 for proxmox installation.
I've been looking for this solution for about a year but never understood how to do it, so I made a RAID 0 of every single disk.
Can you please explain how you did it?
Thanks
 

Cha0s

To be honest, I don't remember how I did it. It's been quite a while since then, and I've dismantled the lab, so I can't look it up for you.

Looking into my browser history I see I've visited these two links
https://unix.stackexchange.com/questions/665/installing-grub-2-on-a-usb-flash-drive
https://unix.stackexchange.com/questions/28506/how-do-you-install-grub2-on-a-usb-stick

These should get you started.
Obviously it won't work just by blindly following those answers. It needs some customizing to work with Proxmox.

Now that I think of it, I may have installed a vanilla Debian and then proxmox on top of it.
I really don't remember, sorry. I tried a lot during that period of time to finally get it working.
 

Cha0s

I managed to find some notes I kept back then.
# RAID controller must be in HBA mode. All drives should be exposed to the OS directly.
# No RAID cache is needed.

# Install Debian9.x Netinstall - minimum installation (standard utils + ssh server).

# /boot & bootloader MUST be on USB stick.
....
So I did install a vanilla Debian 9. Judging from my vague notes, I probably had the USB stick inserted during installation, used it as the /boot mount, and then selected it as the bootloader target in the final installation steps.

Then I continued with installing proxmox on debian.

# Setup Proxmox Repo
echo "deb http://download.proxmox.com/debian/pve stretch pve-no-subscription" > /etc/apt/sources.list.d/pve-install-repo.list
wget http://download.proxmox.com/debian/proxmox-ve-release-5.x.gpg -O /etc/apt/trusted.gpg.d/proxmox-ve-release-5.x.gpg


# Update repos and system
apt update && apt dist-upgrade


# Install Proxmox
apt install proxmox-ve postfix open-iscsi pve-headers
These final steps may be outdated. Better check the official installation guides.
 

Runestone

Greetings!

We are looking at building a 4-node HA cluster with Ceph storage on all 4 nodes and have some questions about items in the FAQ. My idea is to install the OS on pro-sumer SSDs, put the main OSDs on enterprise SSDs, and add extra-storage OSDs on spinners for low-use servers and backups. I may not be understanding the context of the FAQ entries below, so if someone could help me understand whether my idea is workable, that would be great.

Can I create a fast pool with NVMe SSDs, a semi fast pool with SSDs, and a slow pool with spinning disks?
Yes, building several pools can help in situations where budget is limited but big storage is needed.
This answer leads me to believe spinners would be fine if big storage is needed with the caveat that it will be slow.

Can I only use spinning disks in such a small setup (for example 3 nodes)?
No, the performance is very low.
This answer leads me to believe that it is not acceptable to use spinners.

Can I use consumer or pro-sumer SSDs, as these are much cheaper than enterprise class SSDs?
No. Never. These SSDs won't provide the needed performance, nor reliability and endurance. See the fio results from above and/or run your own fio tests.
And this answer leads me to believe that nothing less than an enterprise SSD should be used, which rules out consumer and pro-sumer SSDs as well as spinners.


Thanks for the help.
 

Alwin

Proxmox Staff Member
@Runestone, all of this has to be seen in the context of VM/CT hosting, where usually high IO/s is needed to run the infrastructure.
 

afrugone

Hi, I've configured a 3-server Ceph cluster using InfiniBand/IPoIB. iperf shows about 20 Gbit/s, but the rados test performs as if it were on a 1 Gbit link. How can I force Ceph traffic to use the InfiniBand network?
 

udo

Hi,
put the public network (and the mon IPs) inside the InfiniBand network. If you have two networks, use the second one as a separate cluster network (traffic between OSDs).
Code:
public_network = 192.168.2.0/24
cluster_network = 192.168.3.0/24

[mon.0]
host = pve01
mon_addr = 192.168.2.11:6789
Udo
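
A quick way to sanity-check a layout like the one above is to test which configured subnet each monitor address falls into. The sketch below uses the addresses from the snippet; the helper name `network_for` is made up for illustration and is not part of any Ceph tooling.

```python
import ipaddress


def network_for(addr, networks):
    """Return the name of the first network containing addr, or None."""
    ip = ipaddress.ip_address(addr)
    for name, cidr in networks.items():
        if ip in ipaddress.ip_network(cidr):
            return name
    return None


# Values copied from the ceph.conf fragment above.
networks = {
    "public_network": "192.168.2.0/24",
    "cluster_network": "192.168.3.0/24",
}

# The monitor address should land in the public network.
mon_network = network_for("192.168.2.11", networks)
print("mon.0 is in:", mon_network)
```

If the monitor address resolves to `None` or to the cluster network, the config does not match the advice above.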
 

afrugone

Many thanks for your answer. I configured Ceph from the GUI, and the ceph.conf is as shown below.
ceph.conf
[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 172.27.111.0/24
fsid = 6a128c72-3400-430e-9240-9b75b0936015
keyring = /etc/pve/priv/$cluster.$name.keyring
mon allow pool delete = true
osd journal size = 5120
osd pool default min size = 2
osd pool default size = 3
public network = 172.27.111.0/24

[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring
[mon.STO1001]
host = STO1001
mon addr = 172.27.111.141:6789
[mon.STO1002]
host = STO1002
mon addr = 172.27.111.142:6789
[mon.STO1003]
host = STO1003
mon addr = 172.27.111.143:6789

The InfiniBand is in a separate network, 10.10.111.0/24, and the public network is 172.27.111.0/24, so do I have to use the following?

cluster network = 10.10.111.0/24
public network = 172.27.111.0/24
host = STO1001
mon addr = 172.27.111.141:6789
host = STO1002
mon addr = 172.27.111.142:6789
host = STO1003
mon addr = 172.27.111.143:6789

With this modification, the benchmark result is as follows:

rados bench -p SSDPool 60 write --no-cleanup
Total time run: 60.470899
Total writes made: 2858
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 189.05
Stddev Bandwidth: 24.8311
Max bandwidth (MB/sec): 244
Min bandwidth (MB/sec): 144
Average IOPS: 47
Stddev IOPS: 6
Max IOPS: 61
Min IOPS: 36
Average Latency(s): 0.338518
Stddev Latency(s): 0.418556
Max latency(s): 2.9173
Min latency(s): 0.0226615
 

udo

Hi,
you don't have two Ceph networks!
Don't use a cluster network; use 10.10.111.0/24 as the public network. The mons must also be part of this network!

Udo
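
To illustrate the point about the monitors, here is a small check, assuming the monitor addresses and the InfiniBand subnet posted earlier in this thread. It lists the monitors whose address lies outside the proposed public network; the variable names are only illustrative.

```python
import ipaddress

# InfiniBand subnet proposed as the Ceph public network.
public_network = ipaddress.ip_network("10.10.111.0/24")

# Monitor addresses as posted in the ceph.conf above.
mon_addrs = {
    "STO1001": "172.27.111.141",
    "STO1002": "172.27.111.142",
    "STO1003": "172.27.111.143",
}

# Monitors that would have to move to an address inside public_network.
outside = sorted(
    host
    for host, addr in mon_addrs.items()
    if ipaddress.ip_address(addr) not in public_network
)
print("mons outside the public network:", outside)
```

With the posted values every monitor ends up in the list, which is why changing only the network lines is not enough.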
 

afrugone

Sorry, but I'm a little confused by the network configuration. My network is as shown below: bond1 is gigabit and bond0 is InfiniBand with 40 Gb interfaces, and I'm trying to make the storage communicate through the InfiniBand (bond0) interfaces.

auto lo
iface lo inet loopback
iface eno3 inet manual
iface enp64s0f1 inet manual
iface eno1 inet manual
iface enp136s0f1 inet manual

auto ib0
iface ib0 inet manual

auto ib1
iface ib1 inet manual

auto bond1
iface bond1 inet manual
slaves eno1 eno3
bond_miimon 100
bond_mode active-backup

auto bond0
iface bond0 inet static
address 10.10.111.111
netmask 255.255.255.0
slaves ib0 ib1
bond_miimon 100
bond_mode active-backup
pre-up modprobe ib_ipoib
pre-up echo connected > /sys/class/net/ib0/mode
pre-up echo connected > /sys/class/net/ib1/mode
pre-up modprobe bond0
mtu 65520

auto vmbr0
iface vmbr0 inet static
address 172.27.111.141
netmask 255.255.252.0
gateway 172.27.110.252
bridge_ports bond1
bridge_stp off
bridge_fd 0
 

udo

Hi,
you should open a new thread, because this has nothing to do with Ceph benchmarking...

Udo
 

chrone

Will there be a fio synchronous write benchmark inside a VM running on top of Proxmox and Ceph? I would love to compare numbers.

Is 212 IOPS for a synchronous fio 4k write test in a VM acceptable? I know the Samsung SM863a SSD can push 6k IOPS as local storage.
 

frantek

My Setup:

Initially set up with PVE 4, Ceph Hammer and a 10 GbE mesh network. Upgraded to 5.3. OSDs are 500 GB spinning disks.

proxmox-ve: 5.3-1 (running kernel: 4.15.18-10-pve)
pve-manager: 5.3-8 (running version: 5.3-8/2929af8e)
pve-kernel-4.15: 5.3-1
pve-kernel-4.15.18-10-pve: 4.15.18-32
ceph: 12.2.10-pve1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-43
libpve-guest-common-perl: 2.0-19
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-36
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-33
pve-container: 2.0-33
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-17
pve-firmware: 2.0-6
pve-ha-manager: 2.0-6
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 3.10.1-1
qemu-server: 5.0-45
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.12-pve1~bpo1

[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 10.15.15.0/24
filestore xattr use omap = true
fsid = e9a07274-cba6-4c72-9788-a7b65c93e477
keyring = /etc/pve/priv/$cluster.$name.keyring
osd journal size = 5120
osd pool default min size = 1
public network = 10.15.15.0/24
mon allow pool delete = true
[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring
[mon.2]
host = pve03
mon addr = 10.15.15.7:6789
[mon.1]
host = pve02
mon addr = 10.15.15.6:6789
[mon.0]
host = pve01
mon addr = 10.15.15.5:6789

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54
# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
# buckets
host pve01 {
id -2 # do not change unnecessarily
id -5 class hdd # do not change unnecessarily
# weight 2.700
alg straw
hash 0 # rjenkins1
item osd.0 weight 0.450
item osd.1 weight 0.450
item osd.2 weight 0.450
item osd.3 weight 0.450
item osd.16 weight 0.450
item osd.17 weight 0.450
}
host pve03 {
id -3 # do not change unnecessarily
id -6 class hdd # do not change unnecessarily
# weight 2.700
alg straw
hash 0 # rjenkins1
item osd.4 weight 0.450
item osd.5 weight 0.450
item osd.7 weight 0.450
item osd.14 weight 0.450
item osd.15 weight 0.450
item osd.6 weight 0.450
}
host pve02 {
id -4 # do not change unnecessarily
id -7 class hdd # do not change unnecessarily
# weight 2.700
alg straw
hash 0 # rjenkins1
item osd.8 weight 0.450
item osd.9 weight 0.450
item osd.11 weight 0.450
item osd.12 weight 0.450
item osd.13 weight 0.450
item osd.10 weight 0.450
}
root default {
id -1 # do not change unnecessarily
id -8 class hdd # do not change unnecessarily
# weight 8.100
alg straw
hash 0 # rjenkins1
item pve01 weight 2.700
item pve03 weight 2.700
item pve02 weight 2.700
}
# rules
rule replicated_ruleset {
id 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
# end crush map

Data:

rados bench -p rbd 60 write -b 4M -t 16 --no-cleanup

Code:
Total time run:         60.752370
Total writes made:      1659
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     109.23
Stddev Bandwidth:       35.3805
Max bandwidth (MB/sec): 236
Min bandwidth (MB/sec): 28
Average IOPS:           27
Stddev IOPS:            8
Max IOPS:               59
Min IOPS:               7
Average Latency(s):     0.585889
Stddev Latency(s):      0.286079
Max latency(s):         1.6641
Min latency(s):         0.0752661
rados bench 60 rand -t 16 -p rbd

Code:
Total time run:       60.032432
Total reads made:     25108
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1672.96
Average IOPS:         418
Stddev IOPS:          19
Max IOPS:             465
Min IOPS:             376
Average Latency(s):   0.0362159
Max latency(s):       0.239073
Min latency(s):       0.00460308
Any suggestions for getting better write performance, other than using SSDs?
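
As a rough cross-check of the figures above, the bandwidth and IOPS lines can be recomputed from the operation count, the 4194304-byte object size and the run time. This is only a sketch; `bench_summary` is a made-up helper name, and it assumes the "MB/sec" figure is computed with 1024 x 1024 bytes per MB, which is what the posted numbers suggest.

```python
def bench_summary(ops, seconds, object_bytes=4194304):
    """Return (bandwidth in MiB/s, whole IOPS) for a rados bench run."""
    mib_total = ops * object_bytes / (1024 * 1024)
    return mib_total / seconds, int(ops / seconds)


# Numbers copied from the write and rand runs above.
write_bw, write_iops = bench_summary(1659, 60.752370)
read_bw, read_iops = bench_summary(25108, 60.032432)

print(f"write: {write_bw:.2f} MiB/s, {write_iops} IOPS")
print(f"read:  {read_bw:.2f} MiB/s, {read_iops} IOPS")
```

With 4 MiB objects the IOPS figure is simply the bandwidth divided by four, so for this test the two numbers carry the same information.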
 

chrone


Converting from FileStore to BlueStore might help by reducing the double-write penalty.
 

fips

Recently I benchmarked Samsung's enterprise SSD 860 DCT (960 GB) with my usual benchmark setup, and the result was just horrible:
FIO Command:
Code:
fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test
Result:
BW: 1030KB/s IOPS: 257

compared with the SM863a:
BW: 67MB/s IOPS: 17.4k

It seems not every enterprise SSD is a good choice for a ceph setup...
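
For a sense of scale, the two results can be put side by side with a little arithmetic. The sketch below assumes the posted bandwidth is in KiB/s and uses the rounded 17.4k figure, so the outputs are approximate; the helper name is only illustrative. Because the test runs with `--numjobs=1 --iodepth=1`, the average time per write is roughly the inverse of the IOPS.

```python
BLOCK_KIB = 4  # --bs=4k in the fio command above


def iops_from_bandwidth(kib_per_s, block_kib=BLOCK_KIB):
    """IOPS implied by a bandwidth figure at a fixed block size."""
    return kib_per_s / block_kib


dct_iops = iops_from_bandwidth(1030)  # 860 DCT, from "BW: 1030KB/s"
sm863a_iops = 17400                   # SM863a, from "IOPS: 17.4k" (rounded)

# With one job and queue depth 1, only one write is in flight at a time.
dct_latency_ms = 1000 / dct_iops
sm863a_latency_ms = 1000 / sm863a_iops
speedup = sm863a_iops / dct_iops

print(f"860 DCT: {dct_iops:.0f} IOPS, about {dct_latency_ms:.2f} ms per write")
print(f"SM863a:  {sm863a_iops} IOPS, about {sm863a_latency_ms:.3f} ms per write")
print(f"ratio:   about {speedup:.0f}x")
```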
 

sg90

According to their sales blurb, they are read-intensive disks. To be honest, it looks like a standard 860 with some extra DC features, mainly designed for write-once, read-a-lot workloads; it has a very low TBW as well.
 

victorhooi

We have a:
  • 3 node cluster running Proxmox/Ceph
  • Node 1 has 48 GB of RAM, and Node 2 and 3 have 32 GB of RAM
  • Ceph drives are Intel Optane 900p (480GB) NVMe.
  • 4 OSDs per node (total of 12 OSDs)
  • NICs are Intel X520-DA2, with 10GBASE-LR going to a Unifi US-XG-16.
  • The first 10 Gb port is for Proxmox VM traffic; the second 10 Gb port is for Ceph traffic.
I created a new pool to store VMs with 512 PGs. When I copy from a local LVM store to Rados - I'm seeing writes stall at around 318 MiB/s:



I then created a second pool with 128 PGs for benchmarking.

Write results:
Code:
root@vwnode1:~# rados bench -p benchmarking 60 write -b 4M -t 16 --no-cleanup
....
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   60      16     12258     12242   816.055       788   0.0856726   0.0783458
Total time run:         60.069008
Total writes made:      12258
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     816.261
Stddev Bandwidth:       17.4584
Max bandwidth (MB/sec): 856
Min bandwidth (MB/sec): 780
Average IOPS:           204
Stddev IOPS:            4
Max IOPS:               214
Min IOPS:               195
Average Latency(s):     0.0783801
Stddev Latency(s):      0.0468404
Max latency(s):         0.437235
Min latency(s):         0.0177178
Sequential read results - I don't know why this only ran for 32 seconds?
Code:
root@vwnode1:~# rados bench -p benchmarking 60 seq -t 16
....
Total time run:       32.608549
Total reads made:     12258
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1503.65
Average IOPS:         375
Stddev IOPS:          22
Max IOPS:             410
Min IOPS:             326
Average Latency(s):   0.0412777
Max latency(s):       0.498116
Min latency(s):       0.00447062
Random read results:
Code:
root@vwnode1:~# rados bench -p benchmarking 60 rand -t 16
....
Total time run:       60.066384
Total reads made:     22819
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1519.59
Average IOPS:         379
Stddev IOPS:          21
Max IOPS:             424
Min IOPS:             320
Average Latency(s):   0.0408697
Max latency(s):       0.662955
Min latency(s):       0.00172077
I then cleaned-up with:
Code:
root@vwnode1:~# rados -p benchmarking cleanup
Removed 12258 objects
I then tested with the normal Ceph pool, which has 512 PGs (instead of the 128 PGs in the benchmarking pool).

Write result:
Code:
root@vwnode1:~# rados bench -p proxmox_vms 60 write -b 4M -t 16 --no-cleanup
....
Total time run:         60.041712
Total writes made:      12132
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     808.238
Stddev Bandwidth:       20.7444
Max bandwidth (MB/sec): 860
Min bandwidth (MB/sec): 744
Average IOPS:           202
Stddev IOPS:            5
Max IOPS:               215
Min IOPS:               186
Average Latency(s):     0.0791746
Stddev Latency(s):      0.0432707
Max latency(s):         0.42535
Min latency(s):         0.0200791
Sequential read result - once again, only ran for 32 seconds:
Code:
root@vwnode1:~# rados bench -p proxmox_vms 60 seq -t 16
....
Total time run:       31.249274
Total reads made:     12132
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1552.93
Average IOPS:         388
Stddev IOPS:          30
Max IOPS:             460
Min IOPS:             320
Average Latency(s):   0.0398702
Max latency(s):       0.481106
Min latency(s):       0.00461585
Random read result:
Code:
root@vwnode1:~# rados bench -p proxmox_vms 60 rand -t 16
....
Total time run:       60.088822
Total reads made:     23626
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1572.74
Average IOPS:         393
Stddev IOPS:          25
Max IOPS:             432
Min IOPS:             322
Average Latency(s):   0.0392854
Max latency(s):       0.693123
Min latency(s):       0.00178545
Code:
root@vwnode1:~# rados -p proxmox_vms cleanup
Removed 12132 objects
root@vwnode1:~# rados df
POOL_NAME   USED   OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS RD     WR_OPS WR
proxmox_vms 169GiB   43396      0 130188                  0       0        0 909519 298GiB 619697 272GiB

total_objects    43396
total_used       564GiB
total_avail      768GiB
total_space      1.30TiB/
Any ideas on why the original transfer from LVM to Ceph stalled at 371 MiB/s?

And are the above rados bench results in line with what you might expect with this hardware?
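
One possible reading of the 32-second sequential runs, offered as an assumption rather than a confirmed answer: in both pools the number of reads made equals the number of objects left behind by the preceding write run (12258 and 12132), so the seq test may simply have ended once every object had been read. The sketch below checks that the reported run times match that reading; the function name is made up for illustration.

```python
OBJECT_MIB = 4  # 4194304-byte objects, as reported by rados bench


def time_to_read_all(objects, bandwidth_mib_s, object_mib=OBJECT_MIB):
    """Seconds needed to read every object once at the given bandwidth."""
    return objects * object_mib / bandwidth_mib_s


# Objects written by the earlier write runs, and the seq read bandwidths.
benchmarking_secs = time_to_read_all(12258, 1503.65)
proxmox_vms_secs = time_to_read_all(12132, 1552.93)

print(f"benchmarking: {benchmarking_secs:.2f} s (reported 32.608549)")
print(f"proxmox_vms:  {proxmox_vms_secs:.2f} s (reported 31.249274)")
```

If that reading is right, a longer write run beforehand would give the seq test more objects to read and a longer run time.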
 
