Ceph problem

teicee

New Member
Dec 14, 2018
Hello,
I've got the following errors on my Ceph cluster:
Code:
2018-12-14 17:34:07.109670 osd.3 osd.3 192.168.99.5:6804/663740 39 : cluster [WRN] Monitor daemon marked osd.3 down, but it is still running
2018-12-14 17:34:07.113561 osd.7 osd.7 192.168.99.5:6800/663643 19 : cluster [WRN] Monitor daemon marked osd.7 down, but it is still running
2018-12-14 17:34:07.113811 osd.5 osd.5 192.168.99.5:6802/663691 27 : cluster [WRN] Monitor daemon marked osd.5 down, but it is still running
2018-12-14 17:34:16.225168 osd.6 osd.6 192.168.99.6:6800/10967 9 : cluster [WRN] Monitor daemon marked osd.6 down, but it is still running
2018-12-14 17:34:57.774856 osd.3 osd.3 192.168.99.5:6804/663740 45 : cluster [WRN] Monitor daemon marked osd.3 down, but it is still running

And about two minutes later all the OSDs go down. Is this a disk performance problem?

Thanks
 
And about two minutes later all the OSDs go down. Is this a disk performance problem?
Not much to go on with the above snippet, but it is unlikely that all disks suffer from overloading; Ceph is smart enough to throttle itself.

Please give us more insight into your Ceph cluster, and check the Ceph logs under /var/log/ceph for more details.
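For example, a rough sketch of the usual first checks (adjust to your setup):
Code:
ceph -s               # overall cluster health and recent events
ceph osd tree         # which OSDs are up/down and on which host
ceph osd perf         # per-OSD commit/apply latency; high values hint at slow disks
ceph health detail    # details behind the current warnings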
 
Hello,
Thanks for your reply. How can I verify overloading on the OSDs?
What information do you need to help us?
Thanks
 
Hello,
New diagnostic: on my node1, the three OSDs are listening on the following ports:
  • OSD 3 -> 192.168.99.5:6804
  • OSD 5 -> 192.168.99.5:6802
  • OSD 7 -> 192.168.99.5:6800
But on my second node I have the following errors:
Code:
2018-12-17 21:48:49.837091 7f9c4619a700 -1 osd.4 2839 heartbeat_check: no reply from 192.168.99.5:6805 osd.3 ever on either front or back, first ping sent 2018-12-17 21:48:07.325112 (cutoff 2018-12-17 21:48:29.837087)
2018-12-17 21:48:49.837118 7f9c4619a700 -1 osd.4 2839 heartbeat_check: no reply from 192.168.99.5:6803 osd.5 ever on either front or back, first ping sent 2018-12-17 21:48:07.325112 (cutoff 2018-12-17 21:48:29.837087)
2018-12-17 21:48:49.837124 7f9c4619a700 -1 osd.4 2839 heartbeat_check: no reply from 192.168.99.5:6801 osd.7 ever on either front or back, first ping sent 2018-12-17 21:48:07.325112 (cutoff 2018-12-17 21:48:29.837087)

Why does the second node try to contact the OSDs of the first node on the wrong ports?
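For reference, a quick way to double-check which ports each ceph-osd process is actually bound to on node1 (a sketch, assuming ss from iproute2 is installed):
Code:
# list the TCP ports the OSD daemons are listening on, with the owning process
ss -ltnp | grep ceph-osd

Note that each OSD normally binds several ports (public and cluster messengers plus separate heartbeat front/back ports), so the addresses in the heartbeat_check messages are not necessarily wrong by themselves.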
 
Problem solved.
The direct link between node 1 and node 2 was down and I hadn't checked it.
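A quick sanity check for this kind of issue (a sketch, assuming 192.168.99.0/24 is the Ceph cluster network, as in the logs above) is to verify the link and reachability from each node:
Code:
ip link show                # confirm the interface carrying the cluster network is UP
ping -c 3 192.168.99.5      # from node 2: is node 1 reachable on the cluster network?
ping -c 3 192.168.99.6      # from node 1: is node 2 reachable on the cluster network?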
 
