Ceph problem

teicee

New Member
Dec 14, 2018
Hello,
I've got the following errors on my Ceph cluster:
Code:
2018-12-14 17:34:07.109670 osd.3 osd.3 192.168.99.5:6804/663740 39 : cluster [WRN] Monitor daemon marked osd.3 down, but it is still running
2018-12-14 17:34:07.113561 osd.7 osd.7 192.168.99.5:6800/663643 19 : cluster [WRN] Monitor daemon marked osd.7 down, but it is still running
2018-12-14 17:34:07.113811 osd.5 osd.5 192.168.99.5:6802/663691 27 : cluster [WRN] Monitor daemon marked osd.5 down, but it is still running
2018-12-14 17:34:16.225168 osd.6 osd.6 192.168.99.6:6800/10967 9 : cluster [WRN] Monitor daemon marked osd.6 down, but it is still running
2018-12-14 17:34:57.774856 osd.3 osd.3 192.168.99.5:6804/663740 45 : cluster [WRN] Monitor daemon marked osd.3 down, but it is still running

And about two minutes later all the OSDs go down. Is this a disk performance problem?

Thanks
 
And about two minutes later all the OSDs go down. Is this a disk performance problem?
Not much to go on with the above snippet, but it is unlikely that all disks suffer from overloading; Ceph is smart enough to throttle itself.

Please give us more insight into your Ceph cluster, and check the Ceph logs under /var/log/ceph for more details.
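For example, a rough sketch of the usual first checks (adjust to your setup):
Code:
ceph -s               # overall cluster health and recent events
ceph osd tree         # which OSDs are up/down and on which host
ceph osd perf         # per-OSD commit/apply latency; high values hint at slow disks
ceph health detail    # details behind the current warnings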
 
Hello,
Thanks for your reply. How can I verify overloading on the OSDs?
What information do you need to help us?
Thanks
 
Hello,
New diagnostic: on my node1, the three OSDs are listening on the following ports:
  • OSD 3 -> 192.168.99.5:6804
  • OSD 5 -> 192.168.99.5:6802
  • OSD 7 -> 192.168.99.5:6800
But on my second node I have the following errors:
Code:
2018-12-17 21:48:49.837091 7f9c4619a700 -1 osd.4 2839 heartbeat_check: no reply from 192.168.99.5:6805 osd.3 ever on either front or back, first ping sent 2018-12-17 21:48:07.325112 (cutoff 2018-12-17 21:48:29.837087)
2018-12-17 21:48:49.837118 7f9c4619a700 -1 osd.4 2839 heartbeat_check: no reply from 192.168.99.5:6803 osd.5 ever on either front or back, first ping sent 2018-12-17 21:48:07.325112 (cutoff 2018-12-17 21:48:29.837087)
2018-12-17 21:48:49.837124 7f9c4619a700 -1 osd.4 2839 heartbeat_check: no reply from 192.168.99.5:6801 osd.7 ever on either front or back, first ping sent 2018-12-17 21:48:07.325112 (cutoff 2018-12-17 21:48:29.837087)

Why does the second node try to contact the OSDs of the first node on the wrong ports?
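For reference, a quick way to double-check which ports each ceph-osd process is actually bound to on node1 (a sketch, assuming ss from iproute2 is installed):
Code:
# list the TCP ports the OSD daemons are listening on, with the owning process
ss -ltnp | grep ceph-osd

Note that each OSD normally binds several ports (public and cluster messengers plus separate heartbeat front/back ports), so the addresses in the heartbeat_check messages are not necessarily wrong by themselves.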
 
Problem solved.
The direct link between node 1 and node 2 was down and I hadn't checked it.
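A quick sanity check for this kind of issue (a sketch, assuming 192.168.99.0/24 is the Ceph cluster network, as in the logs above) is to verify the link and reachability from each node:
Code:
ip link show                # confirm the interface carrying the cluster network is UP
ping -c 3 192.168.99.5      # from node 2: is node 1 reachable on the cluster network?
ping -c 3 192.168.99.6      # from node 1: is node 2 reachable on the cluster network?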
 
