Ceph Slow Requests

TecScott

I've currently got a 4-node cluster running Ceph on Proxmox 5.1, and I've noticed recently that I'm getting a lot of blocked requests flagged as REQUEST_SLOW.

For example:

2019-02-12 11:47:33 cluster [WRN] Health check failed: 6 slow requests are blocked > 32 sec (REQUEST_SLOW)
2019-02-12 11:47:47 cluster [WRN] Health check update: 4 slow requests are blocked > 32 sec (REQUEST_SLOW)
2019-02-12 11:47:53 cluster [WRN] Health check update: 2 slow requests are blocked > 32 sec (REQUEST_SLOW)
2019-02-12 11:48:03 cluster [INF] Health check cleared: REQUEST_SLOW (was: 2 slow requests are blocked > 32 sec)

There are currently 2 OSDs per node, 4 TB each at 7.2K RPM. These share a journal disk, which is an NVMe SSD.

Latency is always shown as 0 ms for commit and 0-2 ms for apply.

Any suggestions on how to investigate what's causing this? It's causing noticeable performance problems in VMs, particularly Windows Server 2016.
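
For reference, I'm reading those latency numbers from Ceph's per-OSD perf counters (values in ms), so correct me if that's the wrong place to look:

ceph osd perf    # lists commit_latency(ms) and apply_latency(ms) per OSD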
 
Check the Ceph logs under '/var/log/ceph/' to verify which parts of Ceph are involved. And please describe your system in more detail, so we can get a better picture of your cluster.
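
For a first pass, something like this should show which daemons logged the warnings (assuming the default log locations):

grep -ri "slow request" /var/log/ceph/    # which daemon logs mention the slow requests
ceph health detail                        # while the warning is active, this names the OSDs with blocked requests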
 
The logs only show what I've described, really: the main log (ceph.log) shows 'Health check failed: 2 slow requests are blocked > 32 sec (REQUEST_SLOW)', then 1 slow request, then 3, then 4, then the health check clears after around 30 seconds and it's back to healthy.

At the same time, ceph-mon.x.log shows a log_channel(cluster) message with the same detail, and the monitor sending that message to the other monitors.

I don't find anything useful in the OSD logs (just LevelDB compaction noise: 'level-0 table started', so-many 'bytes OK', 'delete type=0').

What other details would you like to know? There are 4 nodes, 2 OSDs per node, an NVMe SSD for the journal, 7.2K disks for storage, and a 10GbE network for Ceph.
 
Did you check all Ceph logs on all hosts? In some of the logs you may find entries that indicate which OSDs were involved with the slow requests.

What other details would you like to know? There are 4 nodes, 2 OSDs per node, an NVMe SSD for the journal, 7.2K disks for storage, and a 10GbE network for Ceph.
How are the disks connected (HBA/RAID)? What CPU and RAM?
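
You can collect that quickly with standard tools, e.g.:

lsblk -o NAME,SIZE,MODEL,ROTA    # disks; ROTA=1 means spinning, 0 means SSD/NVMe
lscpu | head -n 15               # CPU model and core count
free -h                          # RAM and swap usage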
 
I checked all logs on all nodes and there doesn't seem to be anything indicating which OSDs were the cause.

It may be unrelated, but I've noticed that the swap usage on the hosts is pretty high (3 GB+), although RAM usage is only at 40-50%.

The nodes have 80 GB RAM and one Xeon E5-2620 v4 each.

The disks are passed straight through as JBOD (no RAID); each physical disk is an OSD.
 
It may be unrelated, but I've noticed that the swap usage on the hosts is pretty high (3 GB+), although RAM usage is only at 40-50%.
That depends on what was swapped out and whether it was in use at the time. If you have performance monitoring in place, it may show this. It seems like there may have been a resource spike that caused the delay.
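
To see which processes the swapped-out memory belongs to, you can read VmSwap from /proc (a quick sketch; needs root to see all processes):

grep VmSwap /proc/[0-9]*/status | sort -k2 -nr | head    # top swap consumers, in kB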

The disks are passed straight through as JBOD (no RAID); each physical disk is an OSD.
As a side note, passing disks through as JBOD is not the same as using an HBA.
 
The I/O delay on all nodes seems to sit around 10%, too.

The swap usage is also constant (i.e. it's not spiking to 3 GB; it sits consistently around 3 GB) even though RAM usage is 40-50%.
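
In case it helps, this is how I'm watching the disks (iostat from the sysstat package):

iostat -x 5    # watch %util and await on the OSD disks; %util near 100 would point at the spinners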
 
Are you talking about now, or when the slow requests showed up?

But in general, two (good) spinners will do roughly ~160 MB/s, so it could well be that there simply aren't enough OSDs to cope with the load. But for now everything is guesswork.
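
To replace some of the guesswork with numbers, Ceph's built-in benchmarks give a rough idea of what the OSDs can sustain. Note that these write real data, so point rados bench at a test pool (the pool name 'bench-test' below is just a placeholder):

ceph tell osd.0 bench                              # raw write bench on one OSD (writes 1 GiB by default)
rados bench -p bench-test 10 write --no-cleanup    # 10-second cluster-level write bench
rados -p bench-test cleanup                        # remove the benchmark objects afterwards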
 
At all times. The swap usage seems to be a result of the swappiness setting (with the default vm.swappiness of 60, the kernel starts swapping out idle pages well before RAM is full?).
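
If the swapping itself turns out to matter, I'll try lowering it, something like:

sysctl vm.swappiness                             # current value; the default is 60
sysctl -w vm.swappiness=10                       # takes effect until reboot
echo 'vm.swappiness = 10' >> /etc/sysctl.conf    # persist across reboots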

The I/O delay is always around 10% on each of the 4 hosts, however.

Any recommendations (i.e. which logs, debug settings, etc.) to get to the bottom of what's causing this would be appreciated. If we can determine it's down to the slow disks, we can look into getting faster disks, or are you suggesting more OSDs would resolve the issue?
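
In the meantime, the next thing I was planning to try is dumping the slowest recent ops from each OSD's admin socket when the warning fires, assuming that's the right approach:

ceph daemon osd.0 dump_historic_ops     # the slowest recent ops, with per-step timestamps
ceph daemon osd.0 dump_ops_in_flight    # ops currently blocked, if caught while the warning is active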
 
