Cluster getting really ssssllllloooooowwwwww :-(((((((((((((((((((((((((

proxwolfe

Hi,

I have a small 3-node PVE cluster with Ceph, using a dedicated 10 GbE network each for corosync and Ceph.

I used to have one OSD (NVMe) in each node. Everything was nice and fast.

Then I replaced each NVMe with two SSDs (since you are not supposed to have so few OSDs, and each OSD was already beyond the maximum PG count). And now my cluster is super slow.

Obviously, SATA SSDs are slower than NVMe drives, but this is ridiculous (iodelay goes up to 35%).

So now I am wondering what the issue might be and how to rectify it. I believe that before I set up Ceph originally, I read (somewhere) that if you only have one OSD in each node, Ceph will make sure that all your data is present on every node (given a target of 3 replicas).

If this is correct, does this then mean that by switching to two OSDs per node I broke this allocation rule, and not every node may have all data available on its "own" OSDs anymore? If data has to be loaded via the network, this could explain the high delays.

And if so, is there a way to tell Ceph to keep all data on every node so that it does not need to load via the network from another node?

Thanks!
 
you didn't write what kind of SSDs you got, my first guess would be your SSDs are slow consumer ones that cannot handle the kind of load Ceph generates..
 
you didn't write what kind of SSDs you got, my first guess would be your SSDs are slow consumer ones that cannot handle the kind of load Ceph generates..
Yes, got me :) these are slow/standard (500MB/s) consumer grade SSDs (the NVMEs were consumer grade as well, albeit faster).

This will depend on your SSD model. You need to use DC grade SSD to get decent perf with Ceph
Okay, but say I buy

Samsung OEM Datacenter SSD PM893 / Enterprise SSD PM893

They are not (or only minimally) faster at 550 MB/s.

So what kind of load does Ceph generate?

And what speeds would Ceph OSDs need to handle for the cluster to become usable again?

And is it a matter of speed only, or are there other criteria as well that need to be considered? (Regarding speed, I am aware of the durability argument.)

And what about the network? Is it relevant in my case, because data might need to be transferred over the network when a VM loads data (disregarding Ceph's need to make replicas over the network)? In other words: is a full copy of my data still on each node, or could parts be on other nodes only?

Thanks!
 
unless you messed up your CRUSH rules, in general Ceph should still distribute data so that each node has a single copy even with multiple OSDs per node (for obvious reliability reasons ;)).
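To double-check this on a cluster, you can dump the pool's CRUSH rule and look at the failure domain (a sketch; it assumes the default rule name `replicated_rule` and that the `ceph` CLI is available on the node):

```shell
#!/bin/sh
# Show the CRUSH rule to confirm replicas are spread across hosts, not OSDs.
# Guarded so the snippet is harmless on a machine without Ceph installed.
if command -v ceph >/dev/null 2>&1; then
    ceph osd crush rule dump replicated_rule
else
    echo "ceph CLI not installed on this machine"
fi
# In the JSON output, look for a step like:
#   "op": "chooseleaf_firstn", ... "type": "host"
# "type": "host" means each replica lands on a different node, so with
# size=3 on 3 nodes every node still holds a full copy of the data.
```

If the rule says `"type": "host"`, your two-OSDs-per-node change did not break locality; reads are still served from the primary OSD, though, which may be on another node.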

the main difference between enterprise and consumer SSDs (and NVMEs) is that the enterprise variant will have power loss protection, which makes sustained loads and (fast) sync I/O possible. consumer products often have a rather small faster cache (so burst I/O will still be fast), but once that cache is full/saturated, performance will tank.
 
Sequential read/write numbers aren't that relevant. IOPS and, even more, sync writes matter.
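You can measure exactly this with the well-known 4k sync-write fio test (a sketch; the scratch-file path and runtime are arbitrary, and it is guarded so it degrades gracefully where fio is missing or the filesystem rejects O_DIRECT):

```shell
#!/bin/sh
# 4k sync writes at queue depth 1: the access pattern Ceph's write path is
# most sensitive to. Enterprise SSDs with power-loss protection typically
# sustain thousands of these IOPS; consumer drives often drop to the low
# hundreds once their cache saturates.
# Non-destructive (uses a scratch file); point --filename at an EMPTY raw
# device for more realistic numbers.
if command -v fio >/dev/null 2>&1; then
    fio --name=journal-test --filename=/tmp/fio-synctest --size=256M \
        --rw=write --bs=4k --iodepth=1 --numjobs=1 \
        --direct=1 --sync=1 --runtime=10 --time_based \
        --group_reporting --unlink=1 \
    || echo "fio run failed (filesystem may not support O_DIRECT)"
else
    echo "fio not installed"
fi
```

Compare the reported write IOPS between your consumer SSDs and a DC-grade drive; the gap under this workload is usually far larger than the sequential MB/s numbers suggest.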
the main difference between enterprise and consumer SSDs (and NVMEs) is that the enterprise variant will have power loss protection, which makes sustained loads and (fast) sync I/O possible.
Okay, so is

Samsung OEM Datacenter SSD PM893 / Enterprise SSD PM893

a suitable drive for Ceph? Will this restore my cluster to its old glory?
 
Okay, thanks.

So I have ordered some of these SSDs to test. In the meantime, I have switched off all non-essential VMs, which brought down the iodelays substantially and rendered the remaining VMs usable again.

But I still notice iodelay spikes in the GUI every now and then. Is it possible to trace these back to individual VMs?

Thanks!
 
cat /sys/fs/cgroup/qemu.slice/*.scope/io.pressure

(where * is the vmid)
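Since the wildcard form concatenates all VMs' pressure files without saying which lines belong to which VM, a small loop can label them (a sketch; the function name and the overridable `CGROUP_ROOT` variable are mine, added so the path can be pointed elsewhere for testing):

```shell
#!/bin/sh
# Print each VM's io.pressure, labelled with its vmid.
CGROUP_ROOT=${CGROUP_ROOT:-/sys/fs/cgroup}

list_vm_io_pressure() {
    for f in "$CGROUP_ROOT"/qemu.slice/*.scope/io.pressure; do
        [ -e "$f" ] || continue   # glob stays literal when no VM is running
        vmid=$(basename "$(dirname "$f")" .scope)
        echo "--- VM $vmid ---"
        cat "$f"
    done
}

list_vm_io_pressure
```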
 
Dammit!

Today I received my first pair of PM983 (U.2) drives.

The U.2 interface is completely new to me (I had never heard of it before), but before I ordered I looked it up, obviously. I found that the connector is designed to work with SATA, SAS, and PCI Express SSDs, and that one just needs a suitable controller. So I thought I could replace my SATA SSDs with the PM983 (U.2). While it fits mechanically, the PM983 is not being recognized. And while I still don't get what U.2 is and what it is not, I am getting the feeling that this is more complicated than expected.

Can someone explain to me

1) why does the PM983 (U.2) not work on my controller (which works with SAS and SATA)?
2) which enterprise-grade SSDs (with enough IOPS for Ceph) will work on my controller?

Thanks!
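A quick way to see whether the system detects the drive at all (a sketch; `nvme list` comes from the `nvme-cli` package, and both commands are guarded in case they are not installed): the U.2 connector can physically carry SATA, SAS, or PCIe signals, but the PM983 is an NVMe-only drive, so it needs PCIe lanes behind the port. A SAS/SATA HBA provides none, which is the usual reason a U.2 NVMe drive fits mechanically but never shows up.

```shell
#!/bin/sh
# Check whether any NVMe device is visible to the system.
command -v nvme >/dev/null 2>&1 && nvme list || echo "nvme-cli not installed"
# NVMe controllers appear on the PCIe bus even before a namespace is usable:
command -v lspci >/dev/null 2>&1 && { lspci | grep -i nvme || true; }
```

If neither command shows the drive, the backplane/controller simply isn't wired for NVMe, and no driver tweak will help.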
 
cat /sys/fs/cgroup/qemu.slice/*.scope/io.pressure

(where * is the vmid)
Thanks, unfortunately, I don't know how to interpret the output:

Code:
some avg10=0.00 avg60=0.00 avg300=0.00 total=19263172693
full avg10=0.00 avg60=0.00 avg300=0.00 total=19236635844
some avg10=0.00 avg60=0.00 avg300=0.00 total=2112331
full avg10=0.00 avg60=0.00 avg300=0.00 total=2110601
some avg10=0.00 avg60=0.00 avg300=0.00 total=101755727
full avg10=0.00 avg60=0.00 avg300=0.00 total=100263167
some avg10=0.00 avg60=0.00 avg300=0.00 total=21366572
full avg10=0.00 avg60=0.00 avg300=0.00 total=20396716
some avg10=0.00 avg60=0.00 avg300=0.00 total=3621805289
full avg10=0.00 avg60=0.00 avg300=0.00 total=3587182270
some avg10=0.00 avg60=0.00 avg300=0.00 total=776538446
full avg10=0.00 avg60=0.00 avg300=0.00 total=772064897

I ran it with the '*', which gave me the info for six VMs. If I replace the '*' with a VM's number, it only gives me two lines. But what does the output mean?

Thanks!
 
Having a strange situation again: this time I replaced the three enterprise-grade SSDs with three other, larger enterprise-grade SSDs. Two nodes show low iodelay (2%) while one node shows very high iodelay (25%). The three nodes are basically identical (make, model, CPU, memory), and the (old as well as the new) enterprise-grade SSDs are identical (make, model, size). The only difference is the CPU and memory load, which is lower (!) on the node with the high iodelay than on the other two nodes.

So in order to investigate, I came back to this thread to look up how to find the culprit VM, but I still don't know how to interpret the output. What do "some" and "full", "avg" and "total" mean? How do I identify the VM(s) producing the high iodelay based on this (new) info:

Code:
some avg10=0.00 avg60=0.00 avg300=0.00 total=512013
full avg10=0.00 avg60=0.00 avg300=0.00 total=472429
some avg10=0.00 avg60=0.00 avg300=0.00 total=554814
full avg10=0.00 avg60=0.00 avg300=0.00 total=527242
some avg10=0.00 avg60=0.00 avg300=0.00 total=13307
full avg10=0.00 avg60=0.00 avg300=0.00 total=13287
some avg10=0.00 avg60=0.00 avg300=0.00 total=237323
full avg10=0.00 avg60=0.00 avg300=0.00 total=237051
some avg10=0.00 avg60=0.00 avg300=0.00 total=2057
full avg10=0.00 avg60=0.00 avg300=0.00 total=2028

Thanks!
 
Hmm. So I just migrated / shut down the VMs on the node with high iodelay. Turns out, even with zero VMs running, the iodelay remains (almost) the same. Probably has nothing to do with the VMs then. I'll open a new thread for this topic.

But I'd still be interested how to interpret the output above.

Thanks!
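For the record, the fields are the kernel's pressure stall information (PSI): "some" means at least one task in the VM was stalled waiting on I/O, "full" means all tasks were stalled at once, avg10/avg60/avg300 are the percentage of time stalled averaged over the last 10/60/300 seconds, and total is the cumulative stall time in microseconds since the VM started. To find the worst offender, you can rank VMs by that cumulative total (a sketch; the function name and the overridable `CGROUP_ROOT` variable are mine, and note that totals accumulate since VM start, so sample twice and compare deltas for "right now" numbers):

```shell
#!/bin/sh
# Rank VMs by cumulative I/O stall time (the "some" line's total field).
CGROUP_ROOT=${CGROUP_ROOT:-/sys/fs/cgroup}

rank_vms_by_io_stall() {
    for f in "$CGROUP_ROOT"/qemu.slice/*.scope/io.pressure; do
        [ -e "$f" ] || continue
        vmid=$(basename "$(dirname "$f")" .scope)
        # field 5 of the "some" line is total=<microseconds>; print seconds
        awk -v id="$vmid" '/^some/ {
            split($5, kv, "=")
            printf "%s %.1f\n", id, kv[2] / 1e6
        }' "$f"
    done | sort -k2 -rn   # most-stalled VM first
}

rank_vms_by_io_stall
```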
 
Agreed. Originally I thought it was the same topic, so I posted the questions here. But then I realized that it is a different issue, so I started a new thread.
 