I can't seem to figure out what is going on here.
9-node cluster of Cisco UCS C240 M3s. (2) x Xeon E5-2667 @ 3.30 GHz. 128 GB RAM per host. PVE 8.13, Ceph Reef 18.2.0.
Nodes 1 through 5 each have (10) Seagate rotational drives, some SATA, some SAS, ranging from 10 to 16 TB. As I've gone down the testing rabbit hole, I've made two pools of these drives using device classes: hdd = hosts 2 through 5, 40 drives; hdd2 = host 1, 10 drives. I've changed the CRUSH map to allow replication 3 on the same host.
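For reference, the device-class and CRUSH changes were done roughly like this (the rule name is a placeholder, and osd.5 stands in for each of node 1's OSD ids):
Code:
# move node 1's OSDs into their own device class (repeated per OSD id)
ceph osd crush rm-device-class osd.5
ceph osd crush set-device-class hdd2 osd.5
# replicated rule whose failure domain is "osd", so all 3 copies may land on one host
ceph osd crush rule create-replicated hdd2_osd default osd hdd2
# point the pool at that rule
ceph osd pool set pool_hdd2 crush_rule hdd2_osd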
Nodes 6 through 9 each have (22) 1.92 TB SSDs in a device class named "ssd". Monitors are on nodes 1, 3, 5, 7, and 9.
Networking is dual 40 Gb Ethernet (LAG/MLAG to a pair of Arista 40 Gb switches).
This all started with me migrating a Hyper-V VM to this cluster. I mounted that Hyper-V server using mount.cifs and started a qm import to the hdd pool. That part works fine, but it is very slow: less than 100 MB/s. So I've spent the last 3 hours testing every component, working my way up, and I'm still not sure what to make of it.
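The import itself was along these lines (server name, share, VMID, file name, and storage ID are placeholders):
Code:
# mount the Hyper-V host's export share over CIFS
mount.cifs //hyperv-host/exports /mnt/hyperv -o username=administrator
# import the exported disk into an existing VM, targeting the hdd-backed RBD storage
qm importdisk 101 /mnt/hyperv/vm-disk.vhdx pool_hdd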
I am focused on the hdd2 class, which again is the (10) rotational drives on node 1. I tested all ten drives with fio as direct devices:
Code:
fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=64k --size=256m --numjobs=16 --iodepth=16 --runtime=60 --time_based --end_fsync=1
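Each run was aimed at one of the raw disks; per drive, the invocation was roughly the following (the device path is just an example):
Code:
# same job, targeting a single raw drive and bypassing the page cache
fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=64k --size=256m \
    --numjobs=16 --iodepth=16 --runtime=60 --time_based --end_fsync=1 \
    --filename=/dev/sdb --direct=1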
All 10 of them return the following, plus/minus a couple percent:
Code:
WRITE: bw=143MiB/s (150MB/s), 8653KiB/s-10.7MiB/s (8861kB/s-11.2MB/s), io=11.8GiB (12.6GB), run=60598-84202msec
So the drives are fast and consistent. I make OSDs from them and run "ceph tell osd.N bench" on all of them. Again, all come in around 75 MB/s (plus/minus 3 MB/s) and around 18 IOPS. No outliers. Example:
Code:
root@ceph1-hyp:/mnt/pve/cephfs_hdd2# ceph tell osd.5 bench
{
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 14.066783756,
    "bytes_per_sec": 76331721.779828295,
    "iops": 18.198900647122453
}
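To cover the whole node in one go, the same bench can be looped over every OSD under node 1's CRUSH host bucket, something like:
Code:
# run the built-in OSD bench on every OSD under the ceph1-hyp host bucket
for id in $(ceph osd ls-tree ceph1-hyp); do
    ceph tell osd.$id bench
done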
So I create some CephFS pools, mount them on the local hypervisor, and do more fio:
ssd  (88 SSDs) | WRITE: bw=12.6GiB/s (13.5GB/s), 721MiB/s-916MiB/s (756MB/s-960MB/s), io=777GiB (834GB), run=61731-61782msec
hdd  (40 HDDs) | WRITE: bw=11.5GiB/s (12.3GB/s), 482MiB/s-881MiB/s (506MB/s-923MB/s), io=735GiB (789GB), run=63391-63967msec
hdd2 (10 HDDs) | WRITE: bw=10.5GiB/s (11.3GB/s), 277MiB/s-1144MiB/s (291MB/s-1200MB/s), io=790GiB (848GB), run=72933-74883msec
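(Those were the same fio job as above, just aimed at each CephFS mount; for the hdd2 one, roughly:)
Code:
# identical job profile, writing into the CephFS mount backed by the hdd2 OSDs
fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=64k --size=256m \
    --numjobs=16 --iodepth=16 --runtime=60 --time_based --end_fsync=1 \
    --directory=/mnt/pve/cephfs_hdd2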
Pretty zippy. So then I create RBD pools, create a VM, and plop it on the hdd2 pool (the 10 drives on node 1). Debian base install, then install fio:
Code:
root@junkbox2:~# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=64k --size=256m --numjobs=16 --iodepth=16 --runtime=60 --time_based --end_fsync=1
[...]
Run status group 0 (all jobs):
WRITE: bw=30.0MiB/s (31.5MB/s), 1726KiB/s-2303KiB/s (1768kB/s-2359kB/s), io=2451MiB (2570MB), run=71822-81624msec
Wow, slow! So then I follow an example online and run a rados bench on hdd2:
Code:
rados bench -p pool_hdd2 30 write --no-cleanup
Total time run: 30.3039
Total writes made: 1506
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 198.786
Stddev Bandwidth: 21.2007
Max bandwidth (MB/sec): 236
Min bandwidth (MB/sec): 132
Average IOPS: 49
Stddev IOPS: 5.30018
Max IOPS: 59
Min IOPS: 33
Average Latency(s): 0.320658
Stddev Latency(s): 0.190432
Max latency(s): 1.76071
Min latency(s): 0.0456256
and then an rbd bench:
Code:
root@ceph1-hyp:/mnt/rbd-mounts# rbd bench --io-type write image01 --pool=pool_hdd2
bench type write io_size 4096 io_threads 16 bytes 1073741824 pattern sequential
SEC OPS OPS/SEC BYTES/SEC
1 21456 21132.1 83 MiB/s
2 41168 20631.5 81 MiB/s
3 58288 19355.6 76 MiB/s
4 74544 18657.1 73 MiB/s
5 90384 18093 71 MiB/s
6 106416 17004.2 66 MiB/s
7 119824 15717.3 61 MiB/s
8 135696 15505.1 61 MiB/s
9 151568 15342.1 60 MiB/s
10 170160 15865 62 MiB/s
11 188576 16483.4 64 MiB/s
12 204352 16917.7 66 MiB/s
13 220832 17012.2 66 MiB/s
14 236800 17113.4 67 MiB/s
15 254256 16912.5 66 MiB/s
elapsed: 15 ops: 262144 ops/sec: 16627.9 bytes/sec: 65 MiB/s
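(For comparison, rbd bench defaults to 4K sequential writes; a run matched to the 64k random-write fio job from inside the VM would be something like this:)
Code:
# rough equivalent of the in-VM fio job: 64k random writes, 16 threads, 1 GiB total
rbd bench --io-type write --io-size 64K --io-pattern rand --io-threads 16 \
    --io-total 1G image01 --pool=pool_hdd2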
I cannot figure out for the life of me why fio is so slow INSIDE the VM, and why qm import is so slow on the same pool that supports orders-of-magnitude faster writes in other respects. It's like I am missing something ridiculously basic.