Ceph problem

teicee

Dec 14, 2018
Hello,
I'm getting the following errors on my Ceph cluster:
Code:
2018-12-14 17:34:07.109670 osd.3 osd.3 192.168.99.5:6804/663740 39 : cluster [WRN] Monitor daemon marked osd.3 down, but it is still running
2018-12-14 17:34:07.113561 osd.7 osd.7 192.168.99.5:6800/663643 19 : cluster [WRN] Monitor daemon marked osd.7 down, but it is still running
2018-12-14 17:34:07.113811 osd.5 osd.5 192.168.99.5:6802/663691 27 : cluster [WRN] Monitor daemon marked osd.5 down, but it is still running
2018-12-14 17:34:16.225168 osd.6 osd.6 192.168.99.6:6800/10967 9 : cluster [WRN] Monitor daemon marked osd.6 down, but it is still running
2018-12-14 17:34:57.774856 osd.3 osd.3 192.168.99.5:6804/663740 45 : cluster [WRN] Monitor daemon marked osd.3 down, but it is still running

About two minutes later, all the OSDs go down. Is this a disk performance problem?

Thanks
 
"And about two minutes later all the OSDs go down. Is it a problem of disk performance?"
There is not much to go on in the snippet above, but it is unlikely that all the disks are overloaded at the same time. Ceph is smart enough to throttle performance.

Please give us more details about your Ceph cluster, and check the Ceph logs under /var/log.
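
As a starting point for reading those logs, here is a minimal Python sketch that tallies how often each OSD reported being marked down while still running. The regular expression is an assumption derived only from the five sample lines in the first post, so other Ceph versions may need a different pattern. The function name `count_marked_down` is made up for this example.

```python
import re
from collections import Counter

# Pattern assumed from the sample lines in this thread, e.g.
# "... osd.3 osd.3 192.168.99.5:6804/663740 39 : cluster [WRN] Monitor daemon
#  marked osd.3 down, but it is still running"
MARKED_DOWN = re.compile(
    r"(?P<osd>osd\.\d+) (?P<addr>\d+\.\d+\.\d+\.\d+):(?P<port>\d+)/\d+ .*"
    r"Monitor daemon marked (?P=osd) down, but it is still running"
)


def count_marked_down(lines):
    """Return a Counter mapping (osd, host address) to the number of warnings."""
    counts = Counter()
    for line in lines:
        match = MARKED_DOWN.search(line)
        if match:
            counts[(match.group("osd"), match.group("addr"))] += 1
    return counts
```

It accepts any iterable of lines, so an open file handle on the cluster log works as input. If every OSD on one host shows up while the other host's OSDs do not, that may point at something the host's OSDs share rather than at individual disks.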
 
Hello,
Thanks for your reply. How can I check whether the OSDs are overloaded?
What information do you need from us to help?
Thanks
 
Hello,
New findings. On node 1, the three OSDs are listening on the following ports:
  • OSD 3 -> 192.168.99.5:6804
  • OSD 5 -> 192.168.99.5:6802
  • OSD 7 -> 192.168.99.5:6800
But on my second node I get the following errors:
Code:
2018-12-17 21:48:49.837091 7f9c4619a700 -1 osd.4 2839 heartbeat_check: no reply from 192.168.99.5:6805 osd.3 ever on either front or back, first ping sent 2018-12-17 21:48:07.325112 (cutoff 2018-12-17 21:48:29.837087)
2018-12-17 21:48:49.837118 7f9c4619a700 -1 osd.4 2839 heartbeat_check: no reply from 192.168.99.5:6803 osd.5 ever on either front or back, first ping sent 2018-12-17 21:48:07.325112 (cutoff 2018-12-17 21:48:29.837087)
2018-12-17 21:48:49.837124 7f9c4619a700 -1 osd.4 2839 heartbeat_check: no reply from 192.168.99.5:6801 osd.7 ever on either front or back, first ping sent 2018-12-17 21:48:07.325112 (cutoff 2018-12-17 21:48:29.837087)

Why does the second node try to contact the OSDs of the first node on the wrong ports?
 
Problem solved.
The direct link between node 1 and node 2 was down, and I hadn't checked it.
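
For anyone who hits the same symptom, here is a rough Python sketch of the check that would have caught this sooner. It pulls the peer addresses out of the `heartbeat_check` lines and tries a plain TCP connection to each one from the reporting node. The regular expression is an assumption based only on the three log lines posted above. The helper names `silent_peers` and `can_connect` are made up for this example. A TCP connect is only a reachability hint and does not reproduce Ceph's own heartbeat protocol.

```python
import re
import socket

# Pattern assumed from the heartbeat_check lines in this thread, e.g.
# "... osd.4 2839 heartbeat_check: no reply from 192.168.99.5:6805 osd.3 ever ..."
HEARTBEAT = re.compile(
    r"(?P<reporter>osd\.\d+) \d+ heartbeat_check: no reply from "
    r"(?P<addr>\d+\.\d+\.\d+\.\d+):(?P<port>\d+) (?P<peer>osd\.\d+)"
)


def silent_peers(lines):
    """Return the set of (peer osd, address, port) that did not answer heartbeats."""
    peers = set()
    for line in lines:
        match = HEARTBEAT.search(line)
        if match:
            peers.add((match.group("peer"), match.group("addr"), int(match.group("port"))))
    return peers


def can_connect(addr, port, timeout=2.0):
    """Try a plain TCP connect; False on refusal, timeout, or missing route."""
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run on the node that logs the errors, looping over `silent_peers(...)` and calling `can_connect(addr, port)` probes the exact address and port pairs named in the log. If every peer on the other node is unreachable at once, the link between the nodes is worth checking before the disks.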