Ceph error: slow and stuck requests are blocked

klowet

Hi

I'm running a hyper-converged cluster of 4 nodes, each with 3 disks for Ceph. In the Ceph logs, I'm getting these errors. Sometimes the errors disappear for a few hours and then start coming back again.

Does anyone know where these errors are coming from? I have looked in the OSD logs, but they don't contain much more information. All disks are new SSDs.

Code:
2019-04-20 20:46:48.621869 osd.1 osd.1 10.0.15.1:6808/4634 45 : cluster [WRN] slow request 30.781274 seconds old, received at 2019-04-20 20:46:17.840538: osd_op(client.330079.0:2826 1.a5 1:a51d67d6:::rbd_data.cade6b8b4567.0000000000001181:head [write 1635328~524288] snapc 0=[] ondisk+write+known_if_redirected e129) currently sub_op_commit_rec from 6
2019-04-20 20:46:48.621872 osd.1 osd.1 10.0.15.1:6808/4634 46 : cluster [WRN] slow request 30.780455 seconds old, received at 2019-04-20 20:46:17.841356: osd_op(client.330079.0:2827 1.a5 1:a51d67d6:::rbd_data.cade6b8b4567.0000000000001181:head [write 2159616~516096] snapc 0=[] ondisk+write+known_if_redirected e129) currently sub_op_commit_rec from 6
2019-04-20 20:46:48.621874 osd.1 osd.1 10.0.15.1:6808/4634 47 : cluster [WRN] slow request 30.779906 seconds old, received at 2019-04-20 20:46:17.841905: osd_op(client.330079.0:2828 1.a5 1:a51d67d6:::rbd_data.cade6b8b4567.0000000000001181:head [write 2675712~524288] snapc 0=[] ondisk+write+known_if_redirected e129) currently sub_op_commit_rec from 6
2019-04-20 20:46:48.621877 osd.1 osd.1 10.0.15.1:6808/4634 48 : cluster [WRN] slow request 30.779369 seconds old, received at 2019-04-20 20:46:17.842442: osd_op(client.330079.0:2829 1.a5 1:a51d67d6:::rbd_data.cade6b8b4567.0000000000001181:head [write 3200000~524288] snapc 0=[] ondisk+write+known_if_redirected e129) currently sub_op_commit_rec from 6
2019-04-20 20:46:48.621879 osd.1 osd.1 10.0.15.1:6808/4634 49 : cluster [WRN] slow request 30.778854 seconds old, received at 2019-04-20 20:46:17.842957: osd_op(client.330079.0:2830 1.a5 1:a51d67d6:::rbd_data.cade6b8b4567.0000000000001181:head [write 3724288~8192] snapc 0=[] ondisk+write+known_if_redirected e129) currently sub_op_commit_rec from 6
2019-04-20 20:47:18.484178 osd.1 osd.1 10.0.15.1:6808/4634 50 : cluster [WRN] 8 slow requests, 5 included below; oldest blocked for > 2398.468450 secs
2019-04-20 20:47:18.484187 osd.1 osd.1 10.0.15.1:6808/4634 51 : cluster [WRN] slow request 60.643557 seconds old, received at 2019-04-20 20:46:17.840538: osd_op(client.330079.0:2826 1.a5 1:a51d67d6:::rbd_data.cade6b8b4567.0000000000001181:head [write 1635328~524288] snapc 0=[] ondisk+write+known_if_redirected e129) currently sub_op_commit_rec from 6
2019-04-20 20:47:18.484205 osd.1 osd.1 10.0.15.1:6808/4634 52 : cluster [WRN] slow request 60.642739 seconds old, received at 2019-04-20 20:46:17.841356: osd_op(client.330079.0:2827 1.a5 1:a51d67d6:::rbd_data.cade6b8b4567.0000000000001181:head [write 2159616~516096] snapc 0=[] ondisk+write+known_if_redirected e129) currently sub_op_commit_rec from 6
2019-04-20 20:47:18.484214 osd.1 osd.1 10.0.15.1:6808/4634 53 : cluster [WRN] slow request 60.642190 seconds old, received at 2019-04-20 20:46:17.841905: osd_op(client.330079.0:2828 1.a5 1:a51d67d6:::rbd_data.cade6b8b4567.0000000000001181:head [write 2675712~524288] snapc 0=[] ondisk+write+known_if_redirected e129) currently sub_op_commit_rec from 6
2019-04-20 20:47:18.484222 osd.1 osd.1 10.0.15.1:6808/4634 54 : cluster [WRN] slow request 60.641652 seconds old, received at 2019-04-20 20:46:17.842442: osd_op(client.330079.0:2829 1.a5 1:a51d67d6:::rbd_data.cade6b8b4567.0000000000001181:head [write 3200000~524288] snapc 0=[] ondisk+write+known_if_redirected e129) currently sub_op_commit_rec from 6
2019-04-20 20:47:18.484227 osd.1 osd.1 10.0.15.1:6808/4634 55 : cluster [WRN] slow request 60.641138 seconds old, received at 2019-04-20 20:46:17.842957: osd_op(client.330079.0:2830 1.a5 1:a51d67d6:::rbd_data.cade6b8b4567.0000000000001181:head [write 3724288~8192] snapc 0=[] ondisk+write+known_if_redirected e129) currently sub_op_commit_rec from 6
2019-04-20 20:47:44.131137 osd.0 osd.0 10.0.15.1:6804/4458 21 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 480.162172 secs
2019-04-20 20:47:44.131141 osd.0 osd.0 10.0.15.1:6804/4458 22 : cluster [WRN] slow request 480.162172 seconds old, received at 2019-04-20 20:39:43.968933: osd_op(client.317399.0:83947 1.13e 1:7ce8dce9:::rbd_data.cade6b8b4567.000000000000047d:head [write 3903488~8192] snapc 0=[] ondisk+write+known_if_redirected e128) currently sub_op_commit_rec from 8
2019-04-20 20:47:38.969191 osd.5 osd.5 10.0.15.2:6800/4117 47 : cluster [WRN] 3 slow requests, 1 included below; oldest blocked for > 2418.973605 secs
2019-04-20 20:47:38.969206 osd.5 osd.5 10.0.15.2:6800/4117 48 : cluster [WRN] slow request 480.606888 seconds old, received at 2019-04-20 20:39:38.362245: osd_op(client.317399.0:83856 1.b5 1:ad3befa2:::rbd_data.cade6b8b4567.0000000000000aff:head [write 3215360~8192] snapc 0=[] ondisk+write+known_if_redirected e128) currently sub_op_commit_rec from 7
2019-04-20 20:48:10.564216 osd.11 osd.11 10.0.15.4:6800/3880 17 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 480.697435 secs
2019-04-20 20:48:10.564223 osd.11 osd.11 10.0.15.4:6800/3880 18 : cluster [WRN] slow request 480.697435 seconds old, received at 2019-04-20 20:40:09.866755: osd_op(client.317399.0:83959 1.167 1:e68d1179:::rbd_data.cade6b8b4567.0000000000000b03:head [write 1552384~4096] snapc 0=[] ondisk+write+known_if_redirected e128) currently sub_op_commit_rec from 6
2019-04-20 20:48:18.209207 osd.1 osd.1 10.0.15.1:6808/4634 56 : cluster [WRN] 8 slow requests, 5 included below; oldest blocked for > 2458.193526 secs
2019-04-20 20:48:18.209210 osd.1 osd.1 10.0.15.1:6808/4634 57 : cluster [WRN] slow request 120.368633 seconds old, received at 2019-04-20 20:46:17.840538: osd_op(client.330079.0:2826 1.a5 1:a51d67d6:::rbd_data.cade6b8b4567.0000000000001181:head [write 1635328~524288] snapc 0=[] ondisk+write+known_if_redirected e129) currently sub_op_commit_rec from 6
2019-04-20 20:48:18.209213 osd.1 osd.1 10.0.15.1:6808/4634 58 : cluster [WRN] slow request 120.367814 seconds old, received at 2019-04-20 20:46:17.841356: osd_op(client.330079.0:2827 1.a5 1:a51d67d6:::rbd_data.cade6b8b4567.0000000000001181:head [write 2159616~516096] snapc 0=[] ondisk+write+known_if_redirected e129) currently sub_op_commit_rec from 6
2019-04-20 20:48:18.209215 osd.1 osd.1 10.0.15.1:6808/4634 59 : cluster [WRN] slow request 120.367266 seconds old, received at 2019-04-20 20:46:17.841905: osd_op(client.330079.0:2828 1.a5 1:a51d67d6:::rbd_data.cade6b8b4567.0000000000001181:head [write 2675712~524288] snapc 0=[] ondisk+write+known_if_redirected e129) currently sub_op_commit_rec from 6
2019-04-20 20:48:18.209219 osd.1 osd.1 10.0.15.1:6808/4634 60 : cluster [WRN] slow request 120.366728 seconds old, received at 2019-04-20 20:46:17.842442: osd_op(client.330079.0:2829 1.a5 1:a51d67d6:::rbd_data.cade6b8b4567.0000000000001181:head [write 3200000~524288] snapc 0=[] ondisk+write+known_if_redirected e129) currently sub_op_commit_rec from 6
2019-04-20 20:48:18.209222 osd.1 osd.1 10.0.15.1:6808/4634 61 : cluster [WRN] slow request 120.366214 seconds old, received at 2019-04-20 20:46:17.842957: osd_op(client.330079.0:2830 1.a5 1:a51d67d6:::rbd_data.cade6b8b4567.0000000000001181:head [write 3724288~8192] snapc 0=[] ondisk+write+known_if_redirected e129) currently sub_op_commit_rec from 6
2019-04-20 20:48:31.307597 osd.7 osd.7 10.0.15.3:6800/4070 46 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 480.182633 secs
2019-04-20 20:48:31.307617 osd.7 osd.7 10.0.15.3:6800/4070 47 : cluster [WRN] slow request 480.182633 seconds old, received at 2019-04-20 20:40:31.124905: osd_op(client.316949.0:27762 1.20 1:0478ad06:::rbd_data.a41f6b8b4567.000000000000105c:head [write 3428352~4096] snapc 0=[] ondisk+write+known_if_redirected e128) currently sub_op_commit_rec from 2
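
In case it helps anyone narrowing this down, a few standard commands that show where requests like the ones above are stuck; osd.1 is only an example ID, and the "ceph daemon" calls have to be run on the node hosting that OSD:

Code:
# list the implicated OSDs and how long requests have been blocked
ceph health detail
# per-OSD commit/apply latency overview
ceph osd perf
# operations currently in flight on a given OSD (example: osd.1)
ceph daemon osd.1 dump_ops_in_flight
# recently completed operations, including their slowest steps
ceph daemon osd.1 dump_historic_ops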

Code:
proxmox-ve: 5.4-1 (running kernel: 4.15.18-12-pve)
pve-manager: 5.4-3 (running version: 5.4-3/0a6eaa62)
pve-kernel-4.15: 5.3-3
pve-kernel-4.15.18-12-pve: 4.15.18-35
ceph: 12.2.11-pve1

Code:
ID CLASS WEIGHT   TYPE NAME                STATUS REWEIGHT PRI-AFF
-1       20.95789 root default                                
-3        5.23947     host anr1-a-hcz01s01                    
 0   ssd  1.74649         osd.0                up  1.00000 1.00000
 1   ssd  1.74649         osd.1                up  1.00000 1.00000
 2   ssd  1.74649         osd.2                up  1.00000 1.00000
-5        5.23947     host anr1-a-hcz01s02                    
 3   ssd  1.74649         osd.3                up  1.00000 1.00000
 4   ssd  1.74649         osd.4                up  1.00000 1.00000
 5   ssd  1.74649         osd.5                up  1.00000 1.00000
-7        5.23947     host anr1-a-hcz01s03                    
 6   ssd  1.74649         osd.6                up  1.00000 1.00000
 7   ssd  1.74649         osd.7                up  1.00000 1.00000
 8   ssd  1.74649         osd.8                up  1.00000 1.00000
-9        5.23947     host anr1-a-hcz01s04                    
 9   ssd  1.74649         osd.9                up  1.00000 1.00000
10   ssd  1.74649         osd.10               up  1.00000 1.00000
11   ssd  1.74649         osd.11               up  1.00000 1.00000

Code:
HEALTH_ERR 10 slow requests are blocked > 32 sec. Implicated osds 0,7,11; 6 stuck requests are blocked > 4096 sec. Implicated osds 1,5,8
REQUEST_SLOW 10 slow requests are blocked > 32 sec. Implicated osds 0,7,11
    9 ops are blocked > 2097.15 sec
    1 ops are blocked > 262.144 sec
    osds 0,7,11 have blocked requests > 2097.15 sec
REQUEST_STUCK 6 stuck requests are blocked > 4096 sec. Implicated osds 1,5,8
    6 ops are blocked > 4194.3 sec
    osds 1,5,8 have stuck requests > 4194.3 sec

Code:
[global]
    auth client required = cephx
    auth cluster required = cephx
    auth service required = cephx
    cluster network = 10.0.15.0/24
    fsid = 7d745d6c-5523-4c5e-89dd-a19a4306f8f0
    keyring = /etc/pve/priv/$cluster.$name.keyring
    mon allow pool delete = true
    osd journal size = 5120
    osd pool default min size = 2
    osd pool default size = 3
    public network = 10.0.15.0/24

[osd]
    keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.anr1-a-hcz01s02]
    host = anr1-a-hcz01s02
    mon addr = 10.0.15.2:6789

[mon.anr1-a-hcz01s03]
    host = anr1-a-hcz01s03
    mon addr = 10.0.15.3:6789

[mon.anr1-a-hcz01s01]
    host = anr1-a-hcz01s01
    mon addr = 10.0.15.1:6789
 
Code:
root@anr1-a-hcz01s01:/var/log/ceph# ceph osd pool get rbd all
size: 3
min_size: 2
crash_replay_interval: 0
pg_num: 512
pgp_num: 512
crush_rule: replicated_rule
hashpspool: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
auid: 0
fast_read: 0

Code:
root@anr1-a-hcz01s01:/var/log/ceph# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]
 
Is it happening across all nodes, or is it more related to specific nodes / OSDs? Is it occurring during backups or other heavy load? Do you have only one pool with 512 PGs and a replica of 3? Could you send us the output of "ceph -s" (I'm interested in the workload of the cluster) and "ceph pg dump"?

What hardware do you use for the Ceph cluster?
Do you have any metrics (Check_MK, Graphite / Grafana, etc.)?
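
For reference, those outputs can be gathered like this; the pg dump is verbose, so writing it to a file (the path below is just an example) is usually easier:

Code:
# overall cluster status, including current client I/O and recovery activity
ceph -s
# full placement-group dump, redirected to a file because of its size
ceph pg dump > /tmp/pg_dump.txt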
 
I think I may have found the cause and the solution. The nodes are connected to multiple networks: management, backup and Ceph. The Ceph public (and cluster) network is on its own physical network. The management network has the default gateway and a DNS server defined, so if I ping another node by hostname, the reply comes from its management IP. That means that when Ceph looks for its peers, it does not reach them over the Ceph interface but over the management IP. I have now added the Ceph IPs of all nodes to /etc/hosts on every node, so that hostnames resolve to the Ceph interface.
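
For illustration, a minimal sketch of those /etc/hosts entries, using the 10.0.15.x monitor addresses from ceph.conf and the 10.0.15.4 address of the fourth node seen in the OSD log lines above:

Code:
# Ceph-network addresses of all four nodes (example entries)
10.0.15.1    anr1-a-hcz01s01
10.0.15.2    anr1-a-hcz01s02
10.0.15.3    anr1-a-hcz01s03
10.0.15.4    anr1-a-hcz01s04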

Since that change - yesterday evening, just after my first post here - I haven't had any errors.
I'll report back tomorrow or later next week.
 
I have now added the Ceph IPs of all nodes to /etc/hosts on every node, so that hostnames resolve to the Ceph interface.

@klowet I would like more information on your solution, please. I understand your separate networks (we have the same), but if you add the Ceph interface IP -> FQDN -> hostname to /etc/hosts, does this affect corosync or other services that use the FQDN or hostname on another interface, e.g. the ring_addr interface or the management interface? Do you specify the hostname or the IP address in your corosync.conf? What about ceph.conf?

I appreciate you helping us better understand your implementation and solution.
 
