Long IOwait on CEPH, rbd device stuck on 100%

Veikko

Hi there!
I think I am having a configuration issue on my cluster. It is a 3-node system with 12 OSDs and a separate DB SSD per node. The iowait on one of the hosts is way up, around 15%, while the others sit at around 2%. Using iostat -x, I see 5 additional devices on that node, named rbd0 through rbd4, and one of them shows a constant 100% utilization.
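For reference, this is roughly what I was watching (device names are examples; yours will differ):

# refresh extended device stats every 2 seconds;
# the stuck rbd device stays pinned at 100 in the %util column
iostat -x 2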

What are these devices? In the WebGUI, all disk views, mounts, and Ceph settings are identical across the nodes. I also ran "rados -p rbd cleanup" to stop any ongoing or stuck performance benchmarks.

I'm using a 2-ring 2x1G cluster network, a 2x1G LAG for the VMs, and a separate 1x10G network for storage.
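For context, the VM-facing LAG is set up roughly like this in /etc/network/interfaces (interface names and the address are examples, not my exact config):

auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-miimon 100
    bond-mode 802.3ad

auto vmbr0
iface vmbr0 inet static
    address 192.0.2.10
    netmask 255.255.255.0
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0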

I am on the newest versions and packages with a Community subscription. I'll post additional information if needed.
 
The iowait on one of the hosts is way up, around 15%, while the others sit at around 2%.
iowait is nothing bad per se; it indicates how much outstanding IO there is on the system. One of your VMs/CTs is doing heavy IO.
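You can see which process is responsible with something like this (assuming iotop is installed):

# accumulated IO per process, only showing active ones
iotop -ao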

What are these devices?
Mapped RBD images. Are there CTs/VMs running on that node?
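To match the devices to guests, something like:

# list mapped RBD images: id, pool, image, device
rbd showmapped

# list containers on this node with their status
pct list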
 
OK, I see. There had to be something wrong, because there were 5 rbd devices mapped but only 2 containers running on the host. It seems that a CT maps its disk as a kernel rbd device, while a VM does not.

One of the containers was hung, so I think that was the reason for the 100% utilization of that one rbd device. I'll follow up if there's anything else, but stopping the CTs and restarting the host resolved the iowait inconsistency.
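In case it helps someone else, roughly the commands involved (CT ID 101 and /dev/rbd4 are just examples):

# try a clean stop of the hung container
pct stop 101

# if the stop fails because a stuck worker holds a lock, clear it and retry
pct unlock 101
pct stop 101

# as a last resort, unmap the stuck device by hand
rbd unmap /dev/rbd4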

Thanks Alwin!
 
OK, I see. There had to be something wrong, because there were 5 rbd devices mapped but only 2 containers running on the host. It seems that a CT maps its disk as a kernel rbd device, while a VM does not.
Yes, containers need krbd to map their disks (hence the multiple mappings); VMs can use krbd too, but by default they use librbd directly.
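If you want VMs to use krbd as well, you can set it per storage in /etc/pve/storage.cfg (storage ID and monitor addresses are examples):

rbd: ceph-vm
    pool rbd
    content images,rootdir
    monhost 10.10.10.1 10.10.10.2 10.10.10.3
    krbd 1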