Hi,
In my three-node Ceph cluster with three OSDs, I get OSD crashes only on one host and always on the same OSD.
The OSD is then "out" and I cannot restart it or take it "in" again.
The only way to heal this is to destroy the OSD and recreate it so it replicates the data again.
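Roughly what I try each time before giving up and recreating it (from memory; the OSD id 1 and the service name are from my setup, yours may differ):
Code:
# check cluster health and which OSD is down
ceph -s
ceph osd tree

# try to restart the OSD daemon on the affected host and mark it "in" again
systemctl restart ceph-osd@1
ceph osd in osd.1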
There are no "ceph crash" entries.
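That is, the crash module shows nothing for that time, for example:
Code:
# no crash reports listed
ceph crash ls
ceph crash ls-new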
My understanding is that the monitor mon.myhost2 is crashing but restarts again.
But osd.1 failed: "cluster [INF] osd.1 failed (root=default,host=myhost2) (2 reporters from different host after 25.000193 >= grace 23.072176)"
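For reference, this is how I pull the daemon logs on myhost2 around that time (unit names as they are on my nodes):
Code:
# monitor log on myhost2
journalctl -u ceph-mon@myhost2 --since "2021-07-26 00:15" --until "2021-07-26 00:25"

# OSD log for osd.1 on myhost2
journalctl -u ceph-osd@1 --since "2021-07-26 00:15" --until "2021-07-26 00:25"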
Can somebody explain to me what is going on here:
Code:
2021-07-26 00:18:15.018481 mgr.myhost5 (mgr.2539233) 141763 : cluster [DBG] pgmap v141823: 128 pgs: 128 active+clean; 219 GiB data, 625 GiB used, 2.0 TiB / 2.6 TiB avail; 8.0 KiB/s rd, 889 KiB/s wr, 69 op/s
2021-07-26 00:18:16.330427 mon.myhost5 (mon.2) 28671 : cluster [INF] mon.myhost5 calling monitor election
2021-07-26 00:18:16.424500 mon.myhost1 (mon.0) 84045 : cluster [INF] mon.myhost1 calling monitor election
2021-07-26 00:18:17.018977 mgr.myhost5 (mgr.2539233) 141764 : cluster [DBG] pgmap v141824: 128 pgs: 128 active+clean; 219 GiB data, 625 GiB used, 2.0 TiB / 2.6 TiB avail; 9.0 KiB/s rd, 1.2 MiB/s wr, 97 op/s
2021-07-26 00:18:19.019247 mgr.myhost5 (mgr.2539233) 141765 : cluster [DBG] pgmap v141825: 128 pgs: 128 active+clean; 219 GiB data, 625 GiB used, 2.0 TiB / 2.6 TiB avail; 1.7 KiB/s rd, 1.0 MiB/s wr, 81 op/s
2021-07-26 00:18:21.680310 mon.myhost1 (mon.0) 84046 : cluster [INF] mon.myhost1 is new leader, mons myhost1,myhost5 in quorum (ranks 0,2)
2021-07-26 00:18:21.815016 mon.myhost1 (mon.0) 84048 : cluster [DBG] monmap e3: 3 mons at {myhost1=[v2: XXX.YYY.99.81:3300/0,v1: XXX.YYY.99.81:6789/0],myhost2=[v2: XXX.YYY.99.82:3300/0,v1: XXX.YYY.99.82:6789/0],myhost5=[v2: XXX.YYY.99.83:3300/0,v1: XXX.YYY.99.83:6789/0]}
2021-07-26 00:18:21.815044 mon.myhost1 (mon.0) 84049 : cluster [DBG] fsmap
2021-07-26 00:18:21.815057 mon.myhost1 (mon.0) 84050 : cluster [DBG] osdmap e1036: 3 total, 3 up, 3 in
2021-07-26 00:18:21.815225 mon.myhost1 (mon.0) 84051 : cluster [DBG] mgrmap e23: myhost5(active, since 3d), standbys: myhost1, myhost2
2021-07-26 00:18:21.815336 mon.myhost1 (mon.0) 84052 : cluster [WRN] Health check failed: 1/3 mons down, quorum myhost1,myhost5 (MON_DOWN)
2021-07-26 00:18:21.962711 mon.myhost1 (mon.0) 84054 : cluster [WRN] Health detail: HEALTH_WARN 1/3 mons down, quorum myhost1,myhost5
2021-07-26 00:18:21.962731 mon.myhost1 (mon.0) 84055 : cluster [WRN] MON_DOWN 1/3 mons down, quorum myhost1,myhost5
2021-07-26 00:18:21.962737 mon.myhost1 (mon.0) 84056 : cluster [WRN] mon.myhost2 (rank 1) addr [v2: XXX.YYY.99.82:3300/0,v1: XXX.YYY.99.82:6789/0] is down (out of quorum)
2021-07-26 00:18:24.396094 mon.myhost1 (mon.0) 84057 : cluster [INF] mon.myhost1 calling monitor election
2021-07-26 00:18:24.403415 mon.myhost5 (mon.2) 28673 : cluster [INF] mon.myhost5 calling monitor election
2021-07-26 00:18:24.992674 mon.myhost1 (mon.0) 84058 : cluster [INF] mon.myhost1 calling monitor election
2021-07-26 00:18:25.293054 mon.myhost1 (mon.0) 84059 : cluster [INF] mon.myhost1 is new leader, mons myhost1,myhost2,myhost5 in quorum (ranks 0,1,2)
2021-07-26 00:18:25.603402 mon.myhost1 (mon.0) 84060 : cluster [DBG] monmap e3: 3 mons at {myhost1=[v2: XXX.YYY.99.81:3300/0,v1: XXX.YYY.99.81:6789/0],myhost2=[v2: XXX.YYY.99.82:3300/0,v1: XXX.YYY.99.82:6789/0],myhost5=[v2: XXX.YYY.99.83:3300/0,v1: XXX.YYY.99.83:6789/0]}
2021-07-26 00:18:25.603432 mon.myhost1 (mon.0) 84061 : cluster [DBG] fsmap
2021-07-26 00:18:25.603444 mon.myhost1 (mon.0) 84062 : cluster [DBG] osdmap e1036: 3 total, 3 up, 3 in
2021-07-26 00:18:25.603604 mon.myhost1 (mon.0) 84063 : cluster [DBG] mgrmap e23: myhost5(active, since 3d), standbys: myhost1, myhost2
2021-07-26 00:18:25.603704 mon.myhost1 (mon.0) 84064 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum myhost1,myhost5)
2021-07-26 00:18:25.603714 mon.myhost1 (mon.0) 84065 : cluster [INF] Cluster is now healthy
2021-07-26 00:18:17.095568 mon.myhost2 (mon.1) 37488 : cluster [INF] mon.myhost2 calling monitor election
2021-07-26 00:18:24.603900 mon.myhost2 (mon.1) 37489 : cluster [INF] mon.myhost2 calling monitor election
2021-07-26 00:18:26.250008 mon.myhost1 (mon.0) 84066 : cluster [INF] overall HEALTH_OK
2021-07-26 00:18:21.019508 mgr.myhost5 (mgr.2539233) 141766 : cluster [DBG] pgmap v141826: 128 pgs: 128 active+clean; 219 GiB data, 625 GiB used, 2.0 TiB / 2.6 TiB avail; 1.7 KiB/s rd, 1.0 MiB/s wr, 81 op/s
...
...
2021-07-26 00:20:34.764190 mon.myhost1 (mon.0) 84085 : cluster [DBG] osd.1 reported failed by osd.2
2021-07-26 00:20:34.782595 mon.myhost1 (mon.0) 84086 : cluster [DBG] osd.1 reported failed by osd.0
2021-07-26 00:20:34.795535 mon.myhost1 (mon.0) 84087 : cluster [INF] osd.1 failed (root=default,host=myhost2) (2 reporters from different host after 25.000193 >= grace 23.072176)
2021-07-26 00:20:35.182918 mon.myhost1 (mon.0) 84088 : cluster [WRN] Health check failed: 0 slow ops, oldest one blocked for 30 sec, osd.2 has slow ops (SLOW_OPS)
2021-07-26 00:20:35.216635 mon.myhost1 (mon.0) 84089 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2021-07-26 00:20:35.216655 mon.myhost1 (mon.0) 84090 : cluster [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
2021-07-26 00:20:37.229068 mon.myhost1 (mon.0) 84091 : cluster [DBG] osdmap e1037: 3 total, 2 up, 3 in
2021-07-26 00:20:31.044071 mgr.myhost5 (mgr.2539233) 141831 : cluster [DBG] pgmap v141891: 128 pgs: 128 active+clean; 219 GiB data, 625 GiB used, 2.0 TiB / 2.6 TiB avail; 1.7 MiB/s rd, 11 KiB/s wr, 15 op/s
2021-07-26 00:20:33.044504 mgr.myhost5 (mgr.2539233) 141832 : cluster [DBG] pgmap v141892: 128 pgs: 128 active+clean; 219 GiB data, 625 GiB used, 2.0 TiB / 2.6 TiB avail; 1.7 MiB/s rd, 18 KiB/s wr, 17 op/s
2021-07-26 00:20:35.098315 mgr.myhost5 (mgr.2539233) 141833 : cluster [DBG] pgmap v141893: 128 pgs: 128 active+clean; 219 GiB data, 625 GiB used, 2.0 TiB / 2.6 TiB avail; 1.7 MiB/s rd, 16 KiB/s wr, 16 op/s
2021-07-26 00:20:35.755584 osd.2 (osd.2) 504 : cluster [WRN] slow request osd_op(client.2557189.0:2426181 4.5 4:a1297b28:::rbd_data.2d3688ac1cf02.000000000000042f:head [write 2994176~8192] snapc d6=[] ondisk+write+known_if_redirected e1036) initiated 2021-07-26 00:20:05.549734 currently waiting for sub ops
2021-07-26 00:20:36.758081 osd.2 (osd.2) 505 : cluster [WRN] slow request osd_op(client.2557189.0:2426181 4.5 4:a1297b28:::rbd_data.2d3688ac1cf02.000000000000042f:head [write 2994176~8192] snapc d6=[] ondisk+write+known_if_redirected e1036) initiated 2021-07-26 00:20:05.549734 currently waiting for sub ops
2021-07-26 00:20:37.098840 mgr.myhost5 (mgr.2539233) 141834 : cluster [DBG] pgmap v141894: 128 pgs: 128 active+clean; 219 GiB data, 625 GiB used, 2.0 TiB / 2.6 TiB avail; 1.7 MiB/s rd, 20 KiB/s wr, 17 op/s
2021-07-26 00:20:39.099231 mgr.myhost5 (mgr.2539233) 141835 : cluster [DBG] pgmap v141896: 128 pgs: 48 stale+active+clean, 80 active+clean; 219 GiB data, 625 GiB used, 2.0 TiB / 2.6 TiB avail; 814 B/s rd, 13 KiB/s wr, 2 op/s
2021-07-26 00:20:40.429039 mon.myhost1 (mon.0) 84093 : cluster [DBG] osdmap e1038: 3 total, 2 up, 3 in
2021-07-26 00:20:41.751742 mon.myhost1 (mon.0) 84094 : cluster [WRN] Health check update: 1 slow ops, oldest one blocked for 34 sec, osd.2 has slow ops (SLOW_OPS)
2021-07-26 00:20:44.245682 mon.myhost1 (mon.0) 84095 : cluster [WRN] Health check failed: Degraded data redundancy: 56321/168963 objects degraded (33.333%), 128 pgs degraded (PG_DEGRADED)
2021-07-26 00:20:44.245709 mon.myhost1 (mon.0) 84096 : cluster [INF] Health check cleared: SLOW_OPS (was: 1 slow ops, oldest one blocked for 34 sec, osd.2 has slow ops)