CEPH - osd.xxx [WRN] map yyyy wrongly marked me down

pdan

Hello forum,

we have been running a subscribed 5-node Proxmox and Ceph cluster for a while; in the last few hours our mon log has included entries like the following:

Code:
2014-09-30 16:24:20.599818 mon.0 10.10.0.75:6789/0 2198 : [INF] pgmap v5025678: 640 pgs: 640 active+clean; 1401 GB data, 4209 GB used, 11596 GB / 15806 GB avail; 2154 B/s wr, 0 op/s
2014-09-30 16:24:35.083070 osd.9 10.10.0.79:6803/8950 459 : [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.131652 secs
2014-09-30 16:24:35.083077 osd.9 10.10.0.79:6803/8950 460 : [WRN] slow request 30.131652 seconds old, received at 2014-09-30 16:24:04.951384: osd_op(client.3337472.0:47152 rbd_data.4fbcc2ae8944a.0000000000000324 [set-alloc-hint object_size 4194304 write_size 4194304,write 3255808~16384] 2.6f4005e7 ack+ondisk+write e8761) v4 currently waiting for subops from 5,14
2014-09-30 16:24:39.674463 mon.0 10.10.0.75:6789/0 2343 : [INF] osd.12 10.10.0.77:6800/4418 failed (3 reports from 2 peers after 34.001207 >= grace 20.878662)
2014-09-30 16:24:39.674922 mon.0 10.10.0.75:6789/0 2345 : [INF] osd.13 10.10.0.77:6803/4625 failed (3 reports from 2 peers after 34.001424 >= grace 20.878662)
2014-09-30 16:24:39.675364 mon.0 10.10.0.75:6789/0 2347 : [INF] osd.14 10.10.0.77:6806/4869 failed (3 reports from 2 peers after 34.001730 >= grace 20.878662)
2014-09-30 16:24:39.675750 mon.0 10.10.0.75:6789/0 2349 : [INF] osd.15 10.10.0.77:6809/5207 failed (3 reports from 2 peers after 34.001986 >= grace 20.878662)
2014-09-30 16:24:40.426452 mon.0 10.10.0.75:6789/0 2402 : [INF] osdmap e8762: 15 osds: 11 up, 15 in
2014-09-30 16:24:40.432713 mon.0 10.10.0.75:6789/0 2403 : [INF] pgmap v5025679: 640 pgs: 640 active+clean; 1401 GB data, 4209 GB used, 11596 GB / 15806 GB avail; 3962 B/s wr, 0 op/s
2014-09-30 16:24:40.451743 mon.0 10.10.0.75:6789/0 2404 : [INF] pgmap v5025680: 640 pgs: 151 stale+active+clean, 489 active+clean; 1401 GB data, 4209 GB used, 11596 GB / 15806 GB avail; 5781 B/s wr, 0 op/s
2014-09-30 16:24:40.805781 osd.1 10.10.0.75:6800/4528 3 : [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.698548 secs
2014-09-30 16:24:40.805798 osd.1 10.10.0.75:6800/4528 4 : [WRN] slow request 30.698548 seconds old, received at 2014-09-30 16:24:10.107177: osd_op(client.3336927.0:4636 rbd_data.1975eb2ae8944a.0000000000000a14 [set-alloc-hint object_size 4194304 write_size 4194304,write 2609152~8192] 2.d61f0b59 ack+ondisk+write e8761) v4 currently reached pg
2014-09-30 16:24:41.440616 mon.0 10.10.0.75:6789/0 2409 : [INF] osd.12 10.10.0.77:6800/4418 boot
2014-09-30 16:24:41.440780 mon.0 10.10.0.75:6789/0 2410 : [INF] osd.14 10.10.0.77:6806/4869 boot
2014-09-30 16:24:41.440953 mon.0 10.10.0.75:6789/0 2411 : [INF] osdmap e8763: 15 osds: 13 up, 15 in
2014-09-30 16:24:39.085681 osd.0 10.10.0.75:6803/4729 5 : [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.105718 secs
2014-09-30 16:24:39.085693 osd.0 10.10.0.75:6803/4729 6 : [WRN] slow request 30.105718 seconds old, received at 2014-09-30 16:24:08.979886: osd_op(client.3147237.0:1968458 rbd_data.1985d22ae8944a.00000000000002f5 [set-alloc-hint object_size 4194304 write_size 4194304,write 2178560~12288] 2.220c9e08 ack+ondisk+write e8761) v4 currently waiting for subops from 7,8
2014-09-30 16:24:40.777186 osd.12 10.10.0.77:6800/4418 33 : [WRN] map e8762 wrongly marked me down
2014-09-30 16:24:41.446618 mon.0 10.10.0.75:6789/0 2412 : [INF] pgmap v5025681: 640 pgs: 151 stale+active+clean, 489 active+clean; 1401 GB data, 4209 GB used, 11596 GB / 15806 GB avail
2014-09-30 16:24:42.448706 mon.0 10.10.0.75:6789/0 2413 : [INF] osd.15 10.10.0.77:6809/5207 boot
2014-09-30 16:24:42.448999 mon.0 10.10.0.75:6789/0 2414 : [INF] osd.13 10.10.0.77:6803/4625 boot
2014-09-30 16:24:42.449243 mon.0 10.10.0.75:6789/0 2415 : [INF] osdmap e8764: 15 osds: 15 up, 15 in
2014-09-30 16:24:42.453775 mon.0 10.10.0.75:6789/0 2416 : [INF] pgmap v5025682: 640 pgs: 151 stale+active+clean, 489 active+clean; 1401 GB data, 4209 GB used, 11596 GB / 15806 GB avail
2014-09-30 16:24:43.458588 mon.0 10.10.0.75:6789/0 2417 : [INF] osdmap e8765: 15 osds: 15 up, 15 in
2014-09-30 16:24:43.463012 mon.0 10.10.0.75:6789/0 2418 : [INF] pgmap v5025683: 640 pgs: 151 stale+active+clean, 489 active+clean; 1401 GB data, 4209 GB used, 11596 GB / 15806 GB avail
2014-09-30 16:24:40.573054 osd.15 10.10.0.77:6809/5207 44 : [WRN] map e8762 wrongly marked me down
2014-09-30 16:24:41.377720 osd.13 10.10.0.77:6803/4625 38 : [WRN] map e8762 wrongly marked me down
2014-09-30 16:24:40.993859 osd.14 10.10.0.77:6806/4869 78 : [WRN] map e8762 wrongly marked me down
2014-09-30 16:24:47.483629 mon.0 10.10.0.75:6789/0 2429 : [INF] pgmap v5025684: 640 pgs: 75 stale+active+clean, 565 active+clean; 1401 GB data, 4209 GB used, 11596 GB / 15806 GB avail; 15716 B/s wr, 2 op/s

Meanwhile, everything seems OK in both the Proxmox and Ceph clusters:

Code:
root@c00:~# ceph status
    cluster 59127979-f5d9-4eae-ace2-e29dda08525f
     health HEALTH_OK
     monmap e3: 3 mons at {0=10.10.0.77:6789/0,1=10.10.0.75:6789/0,2=10.10.0.76:6789/0}, election epoch 120, quorum 0,1,2 1,2,0
     osdmap e8761: 15 osds: 15 up, 15 in
      pgmap v5025655: 640 pgs, 3 pools, 1401 GB data, 350 kobjects
            4209 GB used, 11596 GB / 15806 GB avail
                 639 active+clean
                   1 active+clean+scrubbing
  client io 3050 B/s rd, 82351 B/s wr, 28 op/s
root@c00:~# pvecm nodes
Node  Sts   Inc   Joined               Name
   1   M    888   2014-09-30 15:53:59  c00
   2   M    892   2014-09-30 15:54:02  c01
   3   M    892   2014-09-30 15:54:02  c02
   4   M    892   2014-09-30 15:54:02  c03
   5   M    908   2014-09-30 16:23:02  c04
root@c00:~# pvecm status
Version: 6.2.0
Config Version: 9
Cluster Name: cctest
Cluster Id: 6430
Cluster Member: Yes
Cluster Generation: 908
Membership state: Cluster-Member
Nodes: 5
Expected votes: 5
Total votes: 5
Node votes: 1
Quorum: 3  
Active subsystems: 5
Flags: 
Ports Bound: 0  
Node name: c00
Node ID: 1
Multicast addresses: 239.192.25.55 
Node addresses: 10.10.0.75

and

Code:
root@c00:~# pveversion -v
proxmox-ve-2.6.32: 3.2-132 (running kernel: 2.6.32-31-pve)
pve-manager: 3.2-4 (running version: 3.2-4/e24a91c1)
pve-kernel-2.6.32-30-pve: 2.6.32-130
pve-kernel-2.6.32-29-pve: 2.6.32-126
pve-kernel-2.6.32-31-pve: 2.6.32-132
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.5-1
pve-cluster: 3.0-15
qemu-server: 3.1-16
pve-firmware: 1.1-3
libpve-common-perl: 3.0-19
libpve-access-control: 3.0-11
libpve-storage-perl: 3.0-23
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.7-8
ksm-control-daemon: 1.1-1
glusterfs-client: 3.4.2-1
root@c00:~#

We are also experiencing a laggy GUI.

Any idea what is causing those [WRN] entries and how to make them disappear?

Thanks for your attention

Pasquale
 
It seems one node was not reachable, or the load was too high:

Code:
2014-09-30 16:24:39.674463 mon.0 10.10.0.75:6789/0 2343 : [INF] osd.12 10.10.0.77:6800/4418 failed (3 reports from 2 peers after 34.001207 >= grace 20.878662)
2014-09-30 16:24:39.674922 mon.0 10.10.0.75:6789/0 2345 : [INF] osd.13 10.10.0.77:6803/4625 failed (3 reports from 2 peers after 34.001424 >= grace 20.878662)
2014-09-30 16:24:39.675364 mon.0 10.10.0.75:6789/0 2347 : [INF] osd.14 10.10.0.77:6806/4869 failed (3 reports from 2 peers after 34.001730 >= grace 20.878662)
2014-09-30 16:24:39.675750 mon.0 10.10.0.75:6789/0 2349 : [INF] osd.15 10.10.0.77:6809/5207 failed (3 reports from 2 peers after 34.001986 >= grace 20.878662)
2014-09-30 16:24:40.426452 mon.0 10.10.0.75:6789/0 2402 : [INF] osdmap e8762: 15 osds: 11 up, 15 in

Please inspect the syslog on that node; there may be more information there.
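
For example, something like this on the affected node (10.10.0.77 in your log), around the time of the incident; the grep patterns are only an illustration, adjust them to your timestamps and paths:

Code:
# syslog around 16:24 on Sep 30
grep 'Sep 30 16:2' /var/log/syslog | less
# OSD logs on that node, looking for heartbeat trouble
grep -iE 'heartbeat_check|wrongly marked' /var/log/ceph/ceph-osd.*.log

If it turns out the OSDs were only marked down because the node was briefly overloaded, one possible workaround is to raise the heartbeat grace period (the "grace 20.878662" in your log is derived from the default osd_heartbeat_grace of 20 seconds). This is just a sketch with an example value of 35; usually the default is fine:

Code:
# in the [global] section of ceph.conf, so both OSDs and monitors see it
# (on Proxmox this is typically /etc/pve/ceph.conf)
osd heartbeat grace = 35
# or inject it at runtime without restarting the daemons:
# ceph tell osd.* injectargs '--osd-heartbeat-grace 35'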
 
In fact, we found some glitches on c04 and tried to fix them yesterday evening:
Code:
root@c04:/var/log/ceph# zcat ceph-osd.9.log.1.gz  | tail -30 
2014-09-30 16:28:09.165872 7ff0dbf05700 -1 osd.9 8775 heartbeat_check: no reply from osd.3 since back 2014-09-30 16:27:13.896151 front 2014-09-30 16:28:07.506472 (cutoff 2014-09-30 16:27:49.165857)
2014-09-30 16:28:09.207832 7ff0c5aaf700 -1 osd.9 8775 heartbeat_check: no reply from osd.2 since back 2014-09-30 16:27:13.896151 front 2014-09-30 16:27:13.896151 (cutoff 2014-09-30 16:27:49.207830)
2014-09-30 16:28:09.207864 7ff0c5aaf700 -1 osd.9 8775 heartbeat_check: no reply from osd.3 since back 2014-09-30 16:27:13.896151 front 2014-09-30 16:28:07.506472 (cutoff 2014-09-30 16:27:49.207830)
2014-09-30 16:28:10.166224 7ff0dbf05700 -1 osd.9 8775 heartbeat_check: no reply from osd.2 since back 2014-09-30 16:28:09.207256 front 2014-09-30 16:27:13.896151 (cutoff 2014-09-30 16:27:50.166222)
2014-09-30 16:28:11.166469 7ff0dbf05700 -1 osd.9 8775 heartbeat_check: no reply from osd.2 since back 2014-09-30 16:28:09.207256 front 2014-09-30 16:27:13.896151 (cutoff 2014-09-30 16:27:51.166466)
2014-09-30 16:28:12.166827 7ff0dbf05700 -1 osd.9 8775 heartbeat_check: no reply from osd.2 since back 2014-09-30 16:28:09.207256 front 2014-09-30 16:27:13.896151 (cutoff 2014-09-30 16:27:52.166825)
2014-09-30 16:28:13.167038 7ff0dbf05700 -1 osd.9 8775 heartbeat_check: no reply from osd.2 since back 2014-09-30 16:28:09.207256 front 2014-09-30 16:27:13.896151 (cutoff 2014-09-30 16:27:53.167035)
2014-09-30 16:28:13.908600 7ff0c5aaf700 -1 osd.9 8775 heartbeat_check: no reply from osd.2 since back 2014-09-30 16:28:09.207256 front 2014-09-30 16:27:13.896151 (cutoff 2014-09-30 16:27:53.908599)
2014-09-30 16:28:27.951349 7ff0c6eb1700  0 log [INF] : 2.1b3 scrub ok
2014-09-30 16:28:34.591142 7ff0c6eb1700  0 log [INF] : 2.1e7 scrub ok
2014-09-30 16:41:59.935199 7ff0b7126700  0 -- 192.168.0.79:6800/1008950 >> 192.168.0.79:6810/2007688 pipe(0x82b9680 sd=163 :50549 s=2 pgs=414 cs=1 l=0 c=0x1ea53080).fault with nothing to send, going to standby
2014-09-30 16:42:01.081065 7ff0b7025700  0 -- 192.168.0.79:6800/1008950 >> 192.168.0.79:6804/2010264 pipe(0x1bfd5b80 sd=236 :57937 s=2 pgs=515 cs=1 l=0 c=0xb180dc0).fault with nothing to send, going to standby
2014-09-30 16:53:45.838058 7ff0c6eb1700  0 log [INF] : 2.151 scrub ok
2014-09-30 17:34:33.281106 7ff0c6eb1700  0 log [INF] : 2.1f6 scrub ok
2014-09-30 17:55:43.342357 7ff0c6eb1700  0 log [INF] : 2.75 scrub ok
2014-09-30 17:58:03.108049 7ff0c6eb1700  0 log [INF] : 2.19e scrub ok
2014-09-30 20:26:44.692520 7ff0c6eb1700  0 log [INF] : 2.71 scrub ok
2014-09-30 20:39:59.360898 7ff0c6eb1700  0 log [INF] : 2.49 scrub ok
2014-09-30 20:50:23.622245 7ff0c6eb1700  0 log [INF] : 2.1e3 scrub ok
2014-09-30 21:05:31.517670 7ff0c6eb1700  0 log [INF] : 2.81 scrub ok
2014-09-30 22:41:49.385233 7ff0c6eb1700  0 log [INF] : 2.13b scrub ok
2014-09-30 22:50:15.894195 7ff0c6eb1700  0 log [INF] : 2.46 scrub ok
2014-09-30 23:01:29.621086 7ff0c6eb1700  0 log [INF] : 2.3f scrub ok
2014-09-30 23:07:51.819962 7ff0c6eb1700  0 log [INF] : 2.27 scrub ok
2014-09-30 23:21:53.277909 7ff0c6eb1700  0 log [INF] : 0.10 scrub ok
2014-09-30 23:28:01.345236 7ff0c6eb1700  0 log [INF] : 1.a scrub ok
2014-10-01 00:23:56.292509 7ff0c6eb1700  0 log [INF] : 2.f2 scrub ok
2014-10-01 00:30:57.849360 7ff0c6eb1700  0 log [INF] : 2.1c3 scrub ok
2014-10-01 00:52:25.593387 7ff0c6eb1700  0 log [INF] : 2.c2 scrub ok
2014-10-01 01:15:56.983984 7ff0c6eb1700  0 log [INF] : 2.106 scrub ok

Since then everything seems to be running fine; we'll continue to monitor the cluster's behavior.
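For completeness, this is roughly how we are keeping an eye on it (the peer address below is only an example; our cluster network uses 192.168.0.x, with c04 at 192.168.0.79 as in the log above):

Code:
# follow the cluster log in real time
ceph -w
# show details if the health status changes
ceph health detail
# rough connectivity check on the cluster network (example peer address)
ping -c 3 192.168.0.77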
Thank you for supporting us!

Pasquale