Hello,
I'm receiving the message "slow requests are blocked" several times a day, and I'm having trouble identifying the root cause. I have tried various troubleshooting guides, such as:
https://access.redhat.com/documenta...ml/troubleshooting_guide/troubleshooting-osds
and
https://forum.proxmox.com/threads/ceph-slow-requests-are-blocked.48955/
but, sadly, running through the tweaks (e.g. tuning for SSD performance) and the network tests hasn't helped me identify or resolve the problem.
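For the SSD side, the check I'm planning to re-run is a simple 4k sync-write test against one of the Crucial drives, roughly like this (/dev/sdb is just a placeholder for whichever drive is being tested):
Code:
# single-job 4k sync-write test to gauge how the MX300s handle journal-style writes.
# WARNING: this writes to the raw device, so only run it on a wiped or spare disk.
fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=ssd-sync-test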
I was on Proxmox 5.3, which did not exhibit the issue (except when heavy workloads were underway). I wiped the servers, installed 5.4, and used the new Ceph wizard to get up and running.
Hardware-wise, I'm using four servers, each with 2 x Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz, 1TB RAM, and Crucial MX300 2TB SSDs (two servers have five drives, two have four), giving 18 OSDs altogether. Networking-wise, there is a 10Gb management network and a 40Gb Ceph network for the OSDs to communicate over. I'll be adding a fifth node later (though not to this spec, just for quorum!). Running the network throughput tests described in the Red Hat guide shows the expected throughput on both the 10Gb and 40Gb networks.
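For reference, the throughput tests were basically iperf3 runs between nodes on each network (the public addresses below come from the mon entries in ceph.conf; the 10.10.10.x cluster-network IP for pve2 is an assumption):
Code:
# on one node, start an iperf3 server:
iperf3 -s
# from another node, test the public (192.168.1.0/24) and Ceph cluster (10.10.10.0/24) networks:
iperf3 -c 192.168.1.102 -P 4 -t 30
iperf3 -c 10.10.10.102 -P 4 -t 30   # pve2's cluster-net address is assumed here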
I'm not running many VMs right now, only six or so to get things going (Windows AD server, PXE server, etc.) and a few test workstation VMs. When the cluster reports HEALTH_OK it works well, but the "slow requests are blocked" message often appears and causes those few VMs to hang.
Leaving the cluster alone for a long while will clear the warning, but I cannot figure out what the root cause is, considering the Proxmox usage tends to be light. I'd rather sort it out now with half a dozen VMs, before hundreds of VMs cause major bottlenecks!
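When the warning fires next, I plan to capture output along these lines and can post it if useful (osd.3 is just an example ID; the daemon commands are run on whichever node hosts the flagged OSD):
Code:
# cluster-wide view of the warning and which OSDs it names:
ceph health detail
ceph -s
# per-OSD commit/apply latency:
ceph osd perf
# on the node hosting a suspect OSD, dump its in-flight and recent slow ops:
ceph daemon osd.3 dump_ops_in_flight
ceph daemon osd.3 dump_historic_ops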
Any advice anyone can provide would be greatly appreciated. My ceph.conf and CRUSH map are below.
Code:
[global]
auth client required = none
auth cluster required = none
auth service required = none
cluster network = 10.10.10.0/24
debug_asok = 0/0
debug_auth = 0/0
debug_buffer = 0/0
debug_client = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_filer = 0/0
debug_filestore = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_journal = 0/0
debug_journaler = 0/0
debug_lockdep = 0/0
debug_mon = 0/0
debug_monc = 0/0
debug_ms = 0/0
debug_objclass = 0/0
debug_objectcacher = 0/0
debug_objecter = 0/0
debug_optracker = 0/0
debug_osd = 0/0
debug_paxos = 0/0
debug_perfcounter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_rgw = 0/0
debug_throttle = 0/0
debug_timer = 0/0
debug_tp = 0/0
fsid = c86a15fc-8c29-42f4-a332-3d7b76822502
keyring = /etc/pve/priv/$cluster.$name.keyring
mon allow pool delete = true
osd journal size = 5120
osd pool default min size = 2
osd pool default size = 2
public network = 192.168.1.0/24
[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring
[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring
[mds.pve2]
host = pve2
mds standby for name = pve
[mds.pve3]
host = pve3
mds standby for name = pve
[mds.pve1]
host = pve1
mds standby for name = pve
[mds.pve4]
host = pve4
mds standby for name = pve
[mon.pve2]
host = pve2
mon addr = 192.168.1.102:6789
[mon.pve4]
host = pve4
mon addr = 192.168.1.104:6789
[mon.pve3]
host = pve3
mon addr = 192.168.1.103:6789
[mon.pve1]
host = pve1
mon addr = 192.168.1.101:6789
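The decompiled CRUSH map is below; I dumped it with the usual getcrushmap/crushtool steps (the file names are just local scratch files):
Code:
# export the compiled CRUSH map and decompile it to the text that follows:
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt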
Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54
# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd
device 6 osd.6 class ssd
device 7 osd.7 class ssd
device 8 osd.8 class ssd
device 9 osd.9 class ssd
device 10 osd.10 class ssd
device 11 osd.11 class ssd
device 12 osd.12 class ssd
device 13 osd.13 class ssd
device 14 osd.14 class ssd
device 15 osd.15 class ssd
device 16 osd.16 class ssd
device 17 osd.17 class ssd
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
# buckets
host pve1 {
    id -3       # do not change unnecessarily
    id -4 class ssd     # do not change unnecessarily
    # weight 9.096
    alg straw2
    hash 0      # rjenkins1
    item osd.0 weight 1.819
    item osd.1 weight 1.819
    item osd.2 weight 1.819
    item osd.4 weight 1.819
    item osd.3 weight 1.819
}
host pve2 {
    id -5       # do not change unnecessarily
    id -6 class ssd     # do not change unnecessarily
    # weight 9.096
    alg straw2
    hash 0      # rjenkins1
    item osd.7 weight 1.819
    item osd.8 weight 1.819
    item osd.9 weight 1.819
    item osd.10 weight 1.819
    item osd.11 weight 1.819
}
host pve3 {
    id -7       # do not change unnecessarily
    id -8 class ssd     # do not change unnecessarily
    # weight 7.277
    alg straw2
    hash 0      # rjenkins1
    item osd.14 weight 1.819
    item osd.15 weight 1.819
    item osd.16 weight 1.819
    item osd.17 weight 1.819
}
host pve4 {
    id -13      # do not change unnecessarily
    id -14 class ssd        # do not change unnecessarily
    # weight 7.277
    alg straw2
    hash 0      # rjenkins1
    item osd.5 weight 1.819
    item osd.6 weight 1.819
    item osd.12 weight 1.819
    item osd.13 weight 1.819
}
root default {
    id -1       # do not change unnecessarily
    id -2 class ssd     # do not change unnecessarily
    # weight 32.747
    alg straw2
    hash 0      # rjenkins1
    item pve1 weight 9.096
    item pve2 weight 9.096
    item pve3 weight 7.277
    item pve4 weight 7.277
}
# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
# end crush map
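If it would help, I'm also happy to run a raw RADOS benchmark to take the VM layer out of the picture; something like this (the pool name is a placeholder for my RBD pool):
Code:
# 60-second write benchmark straight against the pool, keeping the objects for a read pass:
rados bench -p rbd_pool 60 write --no-cleanup
# sequential read of the objects written above, then remove them:
rados bench -p rbd_pool 60 seq
rados -p rbd_pool cleanup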