High iodelay on one of three identical nodes

proxwolfe

Hi,

I have a three node PVE cluster with identical nodes. Each node has an SSD that is part of a Ceph pool (I know, I should have more SSDs in the pool). And each node also has two HDDs that are part of another Ceph pool.

I replaced the three enterprise grade SSDs with three other, larger enterprise grade SSDs. Two nodes show low iodelay (2%) while one node shows very high iodelay (25%). The three nodes are basically identical (make, model, CPU, memory), and the old as well as the new enterprise grade SSDs are identical (make, model, size). The only difference is the CPU and memory load, which is lower (!) on the node with the high iodelay than on the other two nodes. I even migrated / shut down all VMs on the offending node but the iodelay is still there.

IIRC, the high iodelay was not there before I replaced the SSDs (but I did not check before). So it could -- somehow -- be the one SSD in the offending node. Or something else I'm missing. How can I figure out where this is coming from?

Thanks!
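
For what it's worth, the iodelay value in the PVE summary is essentially the CPU iowait percentage, so a generic way to narrow it down (assuming the sysstat package is installed) is to watch per-device latency on the slow node and compare it with a healthy one:

# iostat -x 5

The device whose await (or r_await/w_await) and %util values stand out on the slow node is the likely culprit.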
 
Could be a bad SSD indeed.

What is the SSD model?

Regarding your HDDs and SSDs: have you created two different CRUSH rules and set one rule for each pool?

I even migrated / shut down all VMs on the offending node but the iodelay is still there.
VMs read/write to all OSDs in the cluster, not only the local OSDs, so migrating or shutting down the VMs on the local node has no impact (you would need to shut down all VMs in the cluster).
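
If you want to see that for yourself, you can ask Ceph where the objects of an image actually live (the pool and image names below are only examples):

# rbd info <pool>/<vm-disk-image>
# ceph osd map <pool> <block_name_prefix>.0000000000000000

rbd info prints the block_name_prefix of the image, and ceph osd map then shows the acting OSD set for one of its objects, which will usually span several hosts.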
 
VMs read/write to all OSDs in the cluster, not only the local OSDs, so migrating or shutting down the VMs on the local node has no impact (you would need to shut down all VMs in the cluster).
True. I guess there could have been a VM that uses a local drive - but there wasn't.

Regarding your HDDs and SSDs: have you created two different CRUSH rules and set one rule for each pool?
Yes, two separate rules for each pool.
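
For reference, device-class based rules are usually created and assigned roughly like this (the rule and pool names here are placeholders, not necessarily mine):

# ceph osd crush rule create-replicated replicated_ssd default host ssd
# ceph osd crush rule create-replicated replicated_hdd default host hdd
# ceph osd pool set <ssd-pool> crush_rule replicated_ssd
# ceph osd pool set <hdd-pool> crush_rule replicated_hdd
# ceph osd crush rule ls

The last command lets you verify which rules exist.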

What is the SSD model?
It's a Samsung PM863a with 3.84TB capacity.
 
Is your Ceph cluster OK, or is something currently running? Are you sure the I/O delay is coming from the SSDs and not from local storage or the HDDs? Can you maybe take a screenshot from PVE with the overview of the OSDs? And tell me a little more about your hardware, network, layout, PGs, replicas, etc. Everything can have an impact on your performance.
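
Most of that can be collected from any node with something like:

# ceph -s
# ceph osd df tree
# ceph osd pool ls detail
# pveversion -v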
 
Is your Ceph cluster OK, or is something currently running? Are you sure the I/O delay is coming from the SSDs and not from local storage or the HDDs? Can you maybe take a screenshot from PVE with the overview of the OSDs? And tell me a little more about your hardware, network, layout, PGs, replicas, etc. Everything can have an impact on your performance.
I'd say it's definitely not local storage (because there are no VMs running on the node anymore).

It could be the HDDs for sure. But shouldn't that affect the other nodes as well? (They are all practically identical).

I will take a screenshot later and post the technical details.
 
Where/how do you check I/O delay?

Because, with Ceph and QEMU using librbd, you won't be able to see the latency of the VMs from the host (but inside the VM you can see it).


Here is a useful command to see real VM latency from the host:
# rbd perf image iotop
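
If I remember the syntax correctly, there is also a non-interactive variant, and it can be pointed at a single pool (the pool name is a placeholder):

# rbd perf image iostat <pool>

Both commands need a reasonably recent Ceph release (Nautilus or later).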
 
Where/how do you check I/O delay?
The PVE GUI.

Because, with Ceph and QEMU using librbd, you won't be able to see the latency of the VMs from the host.
Yeah, that is something that would be useful.

Here is a useful command to see real VM latency from the host:
# rbd perf image iotop
Cool. Thanks. Will try that.

But in the case at hand, my issue does not seem to be coming from a VM, because there are no VMs left on the node, and the VMs running on the other nodes do not affect those nodes' iodelay nearly as badly.

My guess is that this is coming from one of the drives, likely the recently replaced SSD (which, however, is identical to the recently replaced SSDs on the other nodes). I'd try another SSD, but I don't have another identical one at hand.

So I'm wondering how I could test whether the drive(s) are causing the delay...
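
A few things that might help narrow it down (the OSD id and device name below are placeholders):

# ceph osd perf
# smartctl -a /dev/sdX
# ceph tell osd.N bench

ceph osd perf shows the commit/apply latency of every OSD, so a single slow OSD should stand out immediately; smartctl shows the health and wear level of the suspect SSD; and ceph tell osd.N bench runs a write benchmark against a single OSD, so you can compare the new SSDs across the three nodes.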
 
