Ceph error: slow and stuck requests are blocked

klowet

Hi

I'm running a hyper-converged cluster of 4 nodes, each with 3 disks for Ceph. In the Ceph logs, I'm getting these errors. Sometimes the errors disappear for a few hours and then start coming back again.

Does anyone know where these errors are coming from? I have looked in the OSD logs, but they don't contain much more information. All disks are new SSDs.

Code:
2019-04-20 20:46:48.621869 osd.1 osd.1 10.0.15.1:6808/4634 45 : cluster [WRN] slow request 30.781274 seconds old, received at 2019-04-20 20:46:17.840538: osd_op(client.330079.0:2826 1.a5 1:a51d67d6:::rbd_data.cade6b8b4567.0000000000001181:head [write 1635328~524288] snapc 0=[] ondisk+write+known_if_redirected e129) currently sub_op_commit_rec from 6
2019-04-20 20:46:48.621872 osd.1 osd.1 10.0.15.1:6808/4634 46 : cluster [WRN] slow request 30.780455 seconds old, received at 2019-04-20 20:46:17.841356: osd_op(client.330079.0:2827 1.a5 1:a51d67d6:::rbd_data.cade6b8b4567.0000000000001181:head [write 2159616~516096] snapc 0=[] ondisk+write+known_if_redirected e129) currently sub_op_commit_rec from 6
2019-04-20 20:46:48.621874 osd.1 osd.1 10.0.15.1:6808/4634 47 : cluster [WRN] slow request 30.779906 seconds old, received at 2019-04-20 20:46:17.841905: osd_op(client.330079.0:2828 1.a5 1:a51d67d6:::rbd_data.cade6b8b4567.0000000000001181:head [write 2675712~524288] snapc 0=[] ondisk+write+known_if_redirected e129) currently sub_op_commit_rec from 6
2019-04-20 20:46:48.621877 osd.1 osd.1 10.0.15.1:6808/4634 48 : cluster [WRN] slow request 30.779369 seconds old, received at 2019-04-20 20:46:17.842442: osd_op(client.330079.0:2829 1.a5 1:a51d67d6:::rbd_data.cade6b8b4567.0000000000001181:head [write 3200000~524288] snapc 0=[] ondisk+write+known_if_redirected e129) currently sub_op_commit_rec from 6
2019-04-20 20:46:48.621879 osd.1 osd.1 10.0.15.1:6808/4634 49 : cluster [WRN] slow request 30.778854 seconds old, received at 2019-04-20 20:46:17.842957: osd_op(client.330079.0:2830 1.a5 1:a51d67d6:::rbd_data.cade6b8b4567.0000000000001181:head [write 3724288~8192] snapc 0=[] ondisk+write+known_if_redirected e129) currently sub_op_commit_rec from 6
2019-04-20 20:47:18.484178 osd.1 osd.1 10.0.15.1:6808/4634 50 : cluster [WRN] 8 slow requests, 5 included below; oldest blocked for > 2398.468450 secs
2019-04-20 20:47:18.484187 osd.1 osd.1 10.0.15.1:6808/4634 51 : cluster [WRN] slow request 60.643557 seconds old, received at 2019-04-20 20:46:17.840538: osd_op(client.330079.0:2826 1.a5 1:a51d67d6:::rbd_data.cade6b8b4567.0000000000001181:head [write 1635328~524288] snapc 0=[] ondisk+write+known_if_redirected e129) currently sub_op_commit_rec from 6
2019-04-20 20:47:18.484205 osd.1 osd.1 10.0.15.1:6808/4634 52 : cluster [WRN] slow request 60.642739 seconds old, received at 2019-04-20 20:46:17.841356: osd_op(client.330079.0:2827 1.a5 1:a51d67d6:::rbd_data.cade6b8b4567.0000000000001181:head [write 2159616~516096] snapc 0=[] ondisk+write+known_if_redirected e129) currently sub_op_commit_rec from 6
2019-04-20 20:47:18.484214 osd.1 osd.1 10.0.15.1:6808/4634 53 : cluster [WRN] slow request 60.642190 seconds old, received at 2019-04-20 20:46:17.841905: osd_op(client.330079.0:2828 1.a5 1:a51d67d6:::rbd_data.cade6b8b4567.0000000000001181:head [write 2675712~524288] snapc 0=[] ondisk+write+known_if_redirected e129) currently sub_op_commit_rec from 6
2019-04-20 20:47:18.484222 osd.1 osd.1 10.0.15.1:6808/4634 54 : cluster [WRN] slow request 60.641652 seconds old, received at 2019-04-20 20:46:17.842442: osd_op(client.330079.0:2829 1.a5 1:a51d67d6:::rbd_data.cade6b8b4567.0000000000001181:head [write 3200000~524288] snapc 0=[] ondisk+write+known_if_redirected e129) currently sub_op_commit_rec from 6
2019-04-20 20:47:18.484227 osd.1 osd.1 10.0.15.1:6808/4634 55 : cluster [WRN] slow request 60.641138 seconds old, received at 2019-04-20 20:46:17.842957: osd_op(client.330079.0:2830 1.a5 1:a51d67d6:::rbd_data.cade6b8b4567.0000000000001181:head [write 3724288~8192] snapc 0=[] ondisk+write+known_if_redirected e129) currently sub_op_commit_rec from 6
2019-04-20 20:47:44.131137 osd.0 osd.0 10.0.15.1:6804/4458 21 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 480.162172 secs
2019-04-20 20:47:44.131141 osd.0 osd.0 10.0.15.1:6804/4458 22 : cluster [WRN] slow request 480.162172 seconds old, received at 2019-04-20 20:39:43.968933: osd_op(client.317399.0:83947 1.13e 1:7ce8dce9:::rbd_data.cade6b8b4567.000000000000047d:head [write 3903488~8192] snapc 0=[] ondisk+write+known_if_redirected e128) currently sub_op_commit_rec from 8
2019-04-20 20:47:38.969191 osd.5 osd.5 10.0.15.2:6800/4117 47 : cluster [WRN] 3 slow requests, 1 included below; oldest blocked for > 2418.973605 secs
2019-04-20 20:47:38.969206 osd.5 osd.5 10.0.15.2:6800/4117 48 : cluster [WRN] slow request 480.606888 seconds old, received at 2019-04-20 20:39:38.362245: osd_op(client.317399.0:83856 1.b5 1:ad3befa2:::rbd_data.cade6b8b4567.0000000000000aff:head [write 3215360~8192] snapc 0=[] ondisk+write+known_if_redirected e128) currently sub_op_commit_rec from 7
2019-04-20 20:48:10.564216 osd.11 osd.11 10.0.15.4:6800/3880 17 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 480.697435 secs
2019-04-20 20:48:10.564223 osd.11 osd.11 10.0.15.4:6800/3880 18 : cluster [WRN] slow request 480.697435 seconds old, received at 2019-04-20 20:40:09.866755: osd_op(client.317399.0:83959 1.167 1:e68d1179:::rbd_data.cade6b8b4567.0000000000000b03:head [write 1552384~4096] snapc 0=[] ondisk+write+known_if_redirected e128) currently sub_op_commit_rec from 6
2019-04-20 20:48:18.209207 osd.1 osd.1 10.0.15.1:6808/4634 56 : cluster [WRN] 8 slow requests, 5 included below; oldest blocked for > 2458.193526 secs
2019-04-20 20:48:18.209210 osd.1 osd.1 10.0.15.1:6808/4634 57 : cluster [WRN] slow request 120.368633 seconds old, received at 2019-04-20 20:46:17.840538: osd_op(client.330079.0:2826 1.a5 1:a51d67d6:::rbd_data.cade6b8b4567.0000000000001181:head [write 1635328~524288] snapc 0=[] ondisk+write+known_if_redirected e129) currently sub_op_commit_rec from 6
2019-04-20 20:48:18.209213 osd.1 osd.1 10.0.15.1:6808/4634 58 : cluster [WRN] slow request 120.367814 seconds old, received at 2019-04-20 20:46:17.841356: osd_op(client.330079.0:2827 1.a5 1:a51d67d6:::rbd_data.cade6b8b4567.0000000000001181:head [write 2159616~516096] snapc 0=[] ondisk+write+known_if_redirected e129) currently sub_op_commit_rec from 6
2019-04-20 20:48:18.209215 osd.1 osd.1 10.0.15.1:6808/4634 59 : cluster [WRN] slow request 120.367266 seconds old, received at 2019-04-20 20:46:17.841905: osd_op(client.330079.0:2828 1.a5 1:a51d67d6:::rbd_data.cade6b8b4567.0000000000001181:head [write 2675712~524288] snapc 0=[] ondisk+write+known_if_redirected e129) currently sub_op_commit_rec from 6
2019-04-20 20:48:18.209219 osd.1 osd.1 10.0.15.1:6808/4634 60 : cluster [WRN] slow request 120.366728 seconds old, received at 2019-04-20 20:46:17.842442: osd_op(client.330079.0:2829 1.a5 1:a51d67d6:::rbd_data.cade6b8b4567.0000000000001181:head [write 3200000~524288] snapc 0=[] ondisk+write+known_if_redirected e129) currently sub_op_commit_rec from 6
2019-04-20 20:48:18.209222 osd.1 osd.1 10.0.15.1:6808/4634 61 : cluster [WRN] slow request 120.366214 seconds old, received at 2019-04-20 20:46:17.842957: osd_op(client.330079.0:2830 1.a5 1:a51d67d6:::rbd_data.cade6b8b4567.0000000000001181:head [write 3724288~8192] snapc 0=[] ondisk+write+known_if_redirected e129) currently sub_op_commit_rec from 6
2019-04-20 20:48:31.307597 osd.7 osd.7 10.0.15.3:6800/4070 46 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 480.182633 secs
2019-04-20 20:48:31.307617 osd.7 osd.7 10.0.15.3:6800/4070 47 : cluster [WRN] slow request 480.182633 seconds old, received at 2019-04-20 20:40:31.124905: osd_op(client.316949.0:27762 1.20 1:0478ad06:::rbd_data.a41f6b8b4567.000000000000105c:head [write 3428352~4096] snapc 0=[] ondisk+write+known_if_redirected e128) currently sub_op_commit_rec from 2
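
In case it helps anyone narrowing this down, a few standard commands that show where requests like the ones above are stuck; osd.1 is only an example ID, and the "ceph daemon" calls have to be run on the node hosting that OSD:

Code:
# list the implicated OSDs and how long requests have been blocked
ceph health detail
# per-OSD commit/apply latency overview
ceph osd perf
# operations currently in flight on a given OSD (example: osd.1)
ceph daemon osd.1 dump_ops_in_flight
# recently completed operations, including their slowest steps
ceph daemon osd.1 dump_historic_ops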

Code:
proxmox-ve: 5.4-1 (running kernel: 4.15.18-12-pve)
pve-manager: 5.4-3 (running version: 5.4-3/0a6eaa62)
pve-kernel-4.15: 5.3-3
pve-kernel-4.15.18-12-pve: 4.15.18-35
ceph: 12.2.11-pve1

Code:
ID CLASS WEIGHT   TYPE NAME                STATUS REWEIGHT PRI-AFF
-1       20.95789 root default                                
-3        5.23947     host anr1-a-hcz01s01                    
 0   ssd  1.74649         osd.0                up  1.00000 1.00000
 1   ssd  1.74649         osd.1                up  1.00000 1.00000
 2   ssd  1.74649         osd.2                up  1.00000 1.00000
-5        5.23947     host anr1-a-hcz01s02                    
 3   ssd  1.74649         osd.3                up  1.00000 1.00000
 4   ssd  1.74649         osd.4                up  1.00000 1.00000
 5   ssd  1.74649         osd.5                up  1.00000 1.00000
-7        5.23947     host anr1-a-hcz01s03                    
 6   ssd  1.74649         osd.6                up  1.00000 1.00000
 7   ssd  1.74649         osd.7                up  1.00000 1.00000
 8   ssd  1.74649         osd.8                up  1.00000 1.00000
-9        5.23947     host anr1-a-hcz01s04                    
 9   ssd  1.74649         osd.9                up  1.00000 1.00000
10   ssd  1.74649         osd.10               up  1.00000 1.00000
11   ssd  1.74649         osd.11               up  1.00000 1.00000

Code:
HEALTH_ERR 10 slow requests are blocked > 32 sec. Implicated osds 0,7,11; 6 stuck requests are blocked > 4096 sec. Implicated osds 1,5,8
REQUEST_SLOW 10 slow requests are blocked > 32 sec. Implicated osds 0,7,11
    9 ops are blocked > 2097.15 sec
    1 ops are blocked > 262.144 sec
    osds 0,7,11 have blocked requests > 2097.15 sec
REQUEST_STUCK 6 stuck requests are blocked > 4096 sec. Implicated osds 1,5,8
    6 ops are blocked > 4194.3 sec
    osds 1,5,8 have stuck requests > 4194.3 sec

Code:
[global]
    auth client required = cephx
    auth cluster required = cephx
    auth service required = cephx
    cluster network = 10.0.15.0/24
    fsid = 7d745d6c-5523-4c5e-89dd-a19a4306f8f0
    keyring = /etc/pve/priv/$cluster.$name.keyring
    mon allow pool delete = true
    osd journal size = 5120
    osd pool default min size = 2
    osd pool default size = 3
    public network = 10.0.15.0/24

[osd]
    keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.anr1-a-hcz01s02]
    host = anr1-a-hcz01s02
    mon addr = 10.0.15.2:6789

[mon.anr1-a-hcz01s03]
    host = anr1-a-hcz01s03
    mon addr = 10.0.15.3:6789

[mon.anr1-a-hcz01s01]
    host = anr1-a-hcz01s01
    mon addr = 10.0.15.1:6789
 
Code:
root@anr1-a-hcz01s01:/var/log/ceph# ceph osd pool get rbd all
size: 3
min_size: 2
crash_replay_interval: 0
pg_num: 512
pgp_num: 512
crush_rule: replicated_rule
hashpspool: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
auid: 0
fast_read: 0

Code:
root@anr1-a-hcz01s01:/var/log/ceph# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]
 
Is it happening across all nodes, or is it more related to specific nodes / OSDs? Is it occurring during backups or other heavy load? Do you have only one pool with 512 PGs and a replica of 3? Could you send us the output of "ceph -s" (I'm interested in the workload of the cluster) and "ceph pg dump"?

What hardware do you use for the Ceph cluster?
Do you have any metrics (Check_MK, Graphite / Grafana, etc.)?
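
For reference, those outputs can be gathered like this; the pg dump is verbose, so writing it to a file (the path below is just an example) is usually easier:

Code:
# overall cluster status, including current client I/O and recovery activity
ceph -s
# full placement-group dump, redirected to a file because of its size
ceph pg dump > /tmp/pg_dump.txt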
 
I think I may have found the cause and the solution. The nodes are connected to multiple networks: management, backup and Ceph. The Ceph public (and cluster) network is on its own physical network. The management network has the default gateway and a DNS server defined, so if I ping another node by hostname, the reply comes from its management IP. That means that when Ceph looks for its peers, it does not reach them over the Ceph interface but over the management IP. I have now added the Ceph IPs of all nodes to /etc/hosts on every node, so that hostnames resolve to the Ceph interface.
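
For illustration, a minimal sketch of those /etc/hosts entries, using the 10.0.15.x monitor addresses from ceph.conf and the 10.0.15.4 address of the fourth node seen in the OSD log lines above:

Code:
# Ceph-network addresses of all four nodes (example entries)
10.0.15.1    anr1-a-hcz01s01
10.0.15.2    anr1-a-hcz01s02
10.0.15.3    anr1-a-hcz01s03
10.0.15.4    anr1-a-hcz01s04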

Since that change - yesterday evening, just after my first post here - I haven't had any errors.
I'll report back tomorrow or later next week.
 
I have now added the Ceph IPs of all nodes to /etc/hosts on every node, so that hostnames resolve to the Ceph interface.

@klowet I would like more information on your solution, please. I understand your separate networks (we have the same), but if you add the Ceph interface IP -> FQDN -> hostname to /etc/hosts, does this affect corosync or other services that use the FQDN or hostname on another interface, e.g. the ring_addr interface or the management interface? Do you specify the hostname or the IP address in your corosync.conf? What about ceph.conf?

I appreciate you helping us better understand your implementation and solution.
 
