Cluster getting really ssssllllloooooowwwwww :-(((((((((((((((((((((((((

proxwolfe

Hi,

I have a small 3-node PVE cluster with Ceph, using a dedicated 10 GbE network each for corosync and Ceph.

I used to have one OSD (NVMe) in each node. Everything was nice and fast.

Then I replaced each NVMe with two SSDs (since you are not supposed to have so few OSDs, and each OSD was already beyond the maximum PG count). And now my cluster is super slow.

Obviously, SATA SSDs are slower than NVMe drives, but this is ridiculous (iodelay goes up to 35%).

So now I am wondering what the issue might be and how to rectify it. I believe that before I set up Ceph originally, I read (somewhere) that if you only have one OSD in each node, Ceph will make sure that all your data is present on every node (given a target of 3 replicas).

If this is correct, does this then mean that by switching to two OSDs per node I broke this allocation rule, and not every node may have all data available on its "own" OSDs anymore? If data has to be loaded via the network, this could explain the high delays.

And if so, is there a way to tell Ceph to keep all data on every node so that it does not need to load via the network from another node?

Thanks!
 
you didn't write what kind of SSDs you got, my first guess would be your SSDs are slow consumer ones that cannot handle the kind of load Ceph generates..
 
you didn't write what kind of SSDs you got, my first guess would be your SSDs are slow consumer ones that cannot handle the kind of load Ceph generates..
Yes, got me :) these are slow/standard (500MB/s) consumer grade SSDs (the NVMEs were consumer grade as well, albeit faster).

This will depend on your SSD model. You need to use DC grade SSD to get decent perf with Ceph
Okay, but say I buy

Samsung OEM Datacenter SSD PM893 / Enterprise SSD PM893

They are not (or only minimally) faster at 550 MB/s.

So what kind of load does Ceph generate?

And what speeds would Ceph OSDs need to handle for the cluster to become usable again?

And is it a matter of speed only, or are there other criteria as well that need to be considered? (Regarding speed, I am aware of the durability argument.)

And what about the network? Is it relevant in my case, because data might need to be transferred over the network when a VM loads data (disregarding Ceph's need to make replicas over the network)? In other words: is a full copy of my data still on each node, or could parts be on other nodes only?

Thanks!
 
unless you messed up your CRUSH rules, in general Ceph should still distribute data so that each node has a single copy even with multiple OSDs per node (for obvious reliability reasons ;)).
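To double-check this on a cluster, you can dump the pool's CRUSH rule and look at the failure domain (a sketch; it assumes the default rule name `replicated_rule` and that the `ceph` CLI is available on the node):

```shell
#!/bin/sh
# Show the CRUSH rule to confirm replicas are spread across hosts, not OSDs.
# Guarded so the snippet is harmless on a machine without Ceph installed.
if command -v ceph >/dev/null 2>&1; then
    ceph osd crush rule dump replicated_rule
else
    echo "ceph CLI not installed on this machine"
fi
# In the JSON output, look for a step like:
#   "op": "chooseleaf_firstn", ... "type": "host"
# "type": "host" means each replica lands on a different node, so with
# size=3 on 3 nodes every node still holds a full copy of the data.
```

If the rule says `"type": "host"`, your two-OSDs-per-node change did not break locality; reads are still served from the primary OSD, though, which may be on another node.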

the main difference between enterprise and consumer SSDs (and NVMEs) is that the enterprise variant will have power loss protection, which makes sustained loads and (fast) sync I/O possible. consumer products often have a rather small faster cache (so burst I/O will still be fast), but once that cache is full/saturated, performance will tank.
 
Sequential read/write numbers aren't that relevant. IOPS and, even more, sync writes matter.
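You can measure exactly this with the well-known 4k sync-write fio test (a sketch; the scratch-file path and runtime are arbitrary, and it is guarded so it degrades gracefully where fio is missing or the filesystem rejects O_DIRECT):

```shell
#!/bin/sh
# 4k sync writes at queue depth 1: the access pattern Ceph's write path is
# most sensitive to. Enterprise SSDs with power-loss protection typically
# sustain thousands of these IOPS; consumer drives often drop to the low
# hundreds once their cache saturates.
# Non-destructive (uses a scratch file); point --filename at an EMPTY raw
# device for more realistic numbers.
if command -v fio >/dev/null 2>&1; then
    fio --name=journal-test --filename=/tmp/fio-synctest --size=256M \
        --rw=write --bs=4k --iodepth=1 --numjobs=1 \
        --direct=1 --sync=1 --runtime=10 --time_based \
        --group_reporting --unlink=1 \
    || echo "fio run failed (filesystem may not support O_DIRECT)"
else
    echo "fio not installed"
fi
```

Compare the reported write IOPS between your consumer SSDs and a DC-grade drive; the gap under this workload is usually far larger than the sequential MB/s numbers suggest.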
the main difference between enterprise and consumer SSDs (and NVMEs) is that the enterprise variant will have power loss protection, which makes sustained loads and (fast) sync I/O possible.
Okay, so is

Samsung OEM Datacenter SSD PM893 / Enterprise SSD PM893

a suitable drive for Ceph? Will this restore my cluster to its old glory?
 
Okay, thanks.

So I have ordered some of these SSDs to test. In the meantime, I have switched off all non-essential VMs, which brought down the iodelays substantially and rendered the remaining VMs usable again.

But I still notice iodelay spikes in the GUI every now and then. Is it possible to trace these back to individual VMs?

Thanks!
 
cat /sys/fs/cgroup/qemu.slice/*.scope/io.pressure

(where * is the vmid)
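Since the wildcard form concatenates all VMs' pressure files without saying which lines belong to which VM, a small loop can label them (a sketch; the function name and the overridable `CGROUP_ROOT` variable are mine, added so the path can be pointed elsewhere for testing):

```shell
#!/bin/sh
# Print each VM's io.pressure, labelled with its vmid.
CGROUP_ROOT=${CGROUP_ROOT:-/sys/fs/cgroup}

list_vm_io_pressure() {
    for f in "$CGROUP_ROOT"/qemu.slice/*.scope/io.pressure; do
        [ -e "$f" ] || continue   # glob stays literal when no VM is running
        vmid=$(basename "$(dirname "$f")" .scope)
        echo "--- VM $vmid ---"
        cat "$f"
    done
}

list_vm_io_pressure
```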
 
Dammit!

Today I received my first pair of PM983 (U.2) drives.

The U.2 interface is completely new to me (I had never heard of it before), but before I ordered I looked it up, obviously. I found that the connector is designed to work with SATA, SAS, and PCI Express SSDs, and that one just needs a suitable controller. So I thought I could replace my SATA SSDs with the PM983 (U.2). While it fits mechanically, the PM983 is not being recognized. And while I still don't get what U.2 is and what it is not, I am getting the feeling that this is more complicated than expected.

Can someone explain to me

1) why does the PM983 (U.2) not work on my controller (which works with SAS and SATA)?
2) which enterprise-grade SSDs (with enough IOPS for Ceph) will work on my controller?

Thanks!
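A quick way to see whether the system detects the drive at all (a sketch; `nvme list` comes from the `nvme-cli` package, and both commands are guarded in case they are not installed): the U.2 connector can physically carry SATA, SAS, or PCIe signals, but the PM983 is an NVMe-only drive, so it needs PCIe lanes behind the port. A SAS/SATA HBA provides none, which is the usual reason a U.2 NVMe drive fits mechanically but never shows up.

```shell
#!/bin/sh
# Check whether any NVMe device is visible to the system.
command -v nvme >/dev/null 2>&1 && nvme list || echo "nvme-cli not installed"
# NVMe controllers appear on the PCIe bus even before a namespace is usable:
command -v lspci >/dev/null 2>&1 && { lspci | grep -i nvme || true; }
```

If neither command shows the drive, the backplane/controller simply isn't wired for NVMe, and no driver tweak will help.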
 
cat /sys/fs/cgroup/qemu.slice/*.scope/io.pressure

(where * is the vmid)
Thanks, unfortunately, I don't know how to interpret the output:

Code:
some avg10=0.00 avg60=0.00 avg300=0.00 total=19263172693
full avg10=0.00 avg60=0.00 avg300=0.00 total=19236635844
some avg10=0.00 avg60=0.00 avg300=0.00 total=2112331
full avg10=0.00 avg60=0.00 avg300=0.00 total=2110601
some avg10=0.00 avg60=0.00 avg300=0.00 total=101755727
full avg10=0.00 avg60=0.00 avg300=0.00 total=100263167
some avg10=0.00 avg60=0.00 avg300=0.00 total=21366572
full avg10=0.00 avg60=0.00 avg300=0.00 total=20396716
some avg10=0.00 avg60=0.00 avg300=0.00 total=3621805289
full avg10=0.00 avg60=0.00 avg300=0.00 total=3587182270
some avg10=0.00 avg60=0.00 avg300=0.00 total=776538446
full avg10=0.00 avg60=0.00 avg300=0.00 total=772064897

I ran it with the '*', which gave me the info for six VMs. If I replace the '*' with a VM's number, it only gives me two lines. But what does the output mean?

Thanks!
 
Having a strange situation again: this time I replaced the three enterprise-grade SSDs with three other, larger enterprise-grade SSDs. Two nodes show low iodelay (2%) while one node shows very high iodelay (25%). The three nodes are basically identical (make, model, CPU, memory), and the (old as well as the new) enterprise-grade SSDs are identical (make, model, size). The only difference is the CPU and memory load, which is lower (!) on the node with the high iodelay than on the other two nodes.

So in order to investigate, I came back to this thread to look up how to find the culprit VM, but I still don't know how to interpret the output. What do "some" and "full", "avg" and "total" mean? How do I identify the VM(s) producing the high iodelay based on this (new) info:

Code:
some avg10=0.00 avg60=0.00 avg300=0.00 total=512013
full avg10=0.00 avg60=0.00 avg300=0.00 total=472429
some avg10=0.00 avg60=0.00 avg300=0.00 total=554814
full avg10=0.00 avg60=0.00 avg300=0.00 total=527242
some avg10=0.00 avg60=0.00 avg300=0.00 total=13307
full avg10=0.00 avg60=0.00 avg300=0.00 total=13287
some avg10=0.00 avg60=0.00 avg300=0.00 total=237323
full avg10=0.00 avg60=0.00 avg300=0.00 total=237051
some avg10=0.00 avg60=0.00 avg300=0.00 total=2057
full avg10=0.00 avg60=0.00 avg300=0.00 total=2028

Thanks!
 
Hmm. So I just migrated / shut down the VMs on the node with high iodelay. Turns out, even with zero VMs running, the iodelay remains (almost) the same. Probably has nothing to do with the VMs then. I'll open a new thread for this topic.

But I'd still be interested how to interpret the output above.

Thanks!
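For the record, the fields are the kernel's pressure stall information (PSI): "some" means at least one task in the VM was stalled waiting on I/O, "full" means all tasks were stalled at once, avg10/avg60/avg300 are the percentage of time stalled averaged over the last 10/60/300 seconds, and total is the cumulative stall time in microseconds since the VM started. To find the worst offender, you can rank VMs by that cumulative total (a sketch; the function name and the overridable `CGROUP_ROOT` variable are mine, and note that totals accumulate since VM start, so sample twice and compare deltas for "right now" numbers):

```shell
#!/bin/sh
# Rank VMs by cumulative I/O stall time (the "some" line's total field).
CGROUP_ROOT=${CGROUP_ROOT:-/sys/fs/cgroup}

rank_vms_by_io_stall() {
    for f in "$CGROUP_ROOT"/qemu.slice/*.scope/io.pressure; do
        [ -e "$f" ] || continue
        vmid=$(basename "$(dirname "$f")" .scope)
        # field 5 of the "some" line is total=<microseconds>; print seconds
        awk -v id="$vmid" '/^some/ {
            split($5, kv, "=")
            printf "%s %.1f\n", id, kv[2] / 1e6
        }' "$f"
    done | sort -k2 -rn   # most-stalled VM first
}

rank_vms_by_io_stall
```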
 
Agreed. Originally I thought it was the same topic, so I posted the questions here. But then I realized that it is a different issue, so I started a new thread.
 