Help: OSD "no reply" issue

Andrew Holybee

Mar 27, 2017
We were running all of our storage and LAN interfaces over a Dell PowerConnect, but wanted to separate Ceph onto its own switch, so we got a Netgear Smart Plus JGS524E and installed it on the 6th. Starting at 10:33 a.m., seemingly at random, I began getting "no reply" messages for osd.5, which is on my pm04 node. When I look at that node, it shows eth3 going up and down. For now, should I take eth3 out of the bond? This is messing with Ceph and seems to be starting to affect the other OSDs with the constant healing. I am using round robin on this bond and on all the other nodes. Should I just use active-backup on this node, and will that mess with the other nodes, which are all using round robin?

May 12 10:18:29 PM01 systemd-timesyncd[1655]: interval/delta/delay/jitter/drift 2048s/+0.002s/0.034s/0.005s/-26ppm
May 12 10:33:55 PM01 bash[12869]: 2017-05-12 10:33:55.850942 7f19ba912700 -1 osd.3 2210 heartbeat_check: no reply from osd.5 since back 2017-05-12 10:33:35.424086 front 2017-05-12 10:33:55.226460 (cutoff 2017-05-12 10:33:35.850940)
May 12 10:33:56 PM01 bash[12869]: 2017-05-12 10:33:56.851119 7f19ba912700 -1 osd.3 2210 heartbeat_check: no reply from osd.5 since back 2017-05-12 10:33:35.424086 front 2017-05-12 10:33:55.226460 (cutoff 2017-05-12 10:33:36.851116)
May 12 10:33:57 PM01 bash[12869]: 2017-05-12 10:33:57.526970 7f19a00bd700 -1 osd.3 2210 heartbeat_check: no reply from osd.5 since back 2017-05-12 10:33:35.424086 front 2017-05-12 10:33:55.226460 (cutoff 2017-05-12 10:33:37.526969)
May 12 10:33:57 PM01 bash[12869]: 2017-05-12 10:33:57.851371 7f19ba912700 -1 osd.3 2210 heartbeat_check: no reply from osd.5 since back 2017-05-12 10:33:35.424086 front 2017-05-12 10:33:55.226460 (cutoff 2017-05-12 10:33:37.851368)
May 12 10:33:58 PM01 bash[12869]: 2017-05-12 10:33:58.851575 7f19ba912700 -1 osd.3 2210 heartbeat_check: no reply from osd.5 since back 2017-05-12 10:33:35.424086 front 2017-05-12 10:33:57.526822 (cutoff 2017-05-12 10:33:38.851573)
May 12 10:33:59 PM01 bash[12869]: 2017-05-12 10:33:59.827420 7f19a00bd700 -1 osd.3 2210 heartbeat_check: no reply from osd.5 since back 2017-05-12 10:33:35.424086 front 2017-05-12 10:33:57.526822 (cutoff 2017-05-12 10:33:39.827418)
May 12 10:33:59 PM01 bash[12869]: 2017-05-12 10:33:59.851730 7f19ba912700 -1 osd.3 2210 heartbeat_check: no reply from osd.5 since back 2017-05-12 10:33:35.424086 front 2017-05-12 10:33:59.827222 (cutoff 2017-05-12 10:33:39.851728)
May 12 10:34:00 PM01 bash[12869]: 2017-05-12 10:34:00.851904 7f19ba912700 -1 osd.3 2210 heartbeat_check: no reply from osd.5 since back 2017-05-12 10:33:35.424086 front 2017-05-12 10:33:59.827222 (cutoff 2017-05-12 10:33:40.851901)
May 12 10:34:01 PM01 bash[12869]: 2017-05-12 10:34:01.852117 7f19ba912700 -1 osd.3 2210 heartbeat_check: no reply from osd.5 since back 2017-05-12 10:33:35.424086 front 2017-05-12 10:33:59.827222 (cutoff 2017-05-12 10:33:41.852114)
May 12 10:34:02 PM01 bash[12869]: 2017-05-12 10:34:02.127829 7f19a00bd700 -1 osd.3 2210 heartbeat_check: no reply from osd.5 since back 2017-05-12 10:33:35.424086 front 2017-05-12 10:33:59.827222 (cutoff 2017-05-12 10:33:42.127828)
May 12 10:34:02 PM01 bash[12869]: 2017-0
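
For anyone reading logs like the ones above, here is a minimal Python sketch that pulls the fields out of a heartbeat_check line and reports how long the "back" and "front" heartbeats have been silent. The regular expression is only fitted to the lines pasted in this thread, and the function name parse_heartbeat is made up for illustration; other Ceph versions may format the line differently.

```python
import re
from datetime import datetime

# Pattern fitted to the heartbeat_check lines pasted in this thread (an assumption,
# not an official Ceph log format definition).
LINE_RE = re.compile(
    r"(?P<now>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \S+ -1 "
    r"osd\.(?P<reporter>\d+) \d+ heartbeat_check: no reply from "
    r"osd\.(?P<target>\d+) since back (?P<back>\S+ \S+) "
    r"front (?P<front>\S+ \S+) \(cutoff (?P<cutoff>[^)]+)\)"
)
FMT = "%Y-%m-%d %H:%M:%S.%f"


def parse_heartbeat(line):
    """Return reporter, target and seconds of silence, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    now = datetime.strptime(m["now"], FMT)
    back = datetime.strptime(m["back"], FMT)
    front = datetime.strptime(m["front"], FMT)
    return {
        "reporter": int(m["reporter"]),
        "target": int(m["target"]),
        "back_silence_s": (now - back).total_seconds(),
        "front_silence_s": (now - front).total_seconds(),
    }


# First heartbeat_check line from the log above.
SAMPLE = (
    "May 12 10:33:55 PM01 bash[12869]: 2017-05-12 10:33:55.850942 7f19ba912700 -1 "
    "osd.3 2210 heartbeat_check: no reply from osd.5 since back "
    "2017-05-12 10:33:35.424086 front 2017-05-12 10:33:55.226460 "
    "(cutoff 2017-05-12 10:33:35.850940)"
)

if __name__ == "__main__":
    print(parse_heartbeat(SAMPLE))
```

On the first line of the log this gives roughly 20.4 seconds of silence on "back" and 0.6 seconds on "front", so osd.3 is still hearing from osd.5 on one heartbeat address but not on the other. That would be consistent with a single flaky link rather than a dead OSD, though it does not prove it.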
 
First of all, using 1 GbE switches for a Ceph network is a very bad idea, especially this Netgear with its 256k buffer; you will get tail drops and a lot of problems.
In your case, just try changing the port on the switch and keep using round robin. There is also the option of 802.3ad; if your switch supports it, try it.
 
I should have mentioned that this is v2 of this switch. I hope we don't have issues with it, as our servers don't generate much traffic.
In case anyone else has this issue: it turned out that eth3 was bad on this machine, so I changed the bond to active-passive (active-backup), because round robin keeps trying eth3, which was slowing down one of my critical VMs (I will be decommissioning this server as well). For the devs: it would be nice for round robin to time out a link after a certain number of failures within a given time period. The other odd thing was that when I rebooted this server, it took down all the other nodes.
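
One way to spot a flapping bond member like the eth3 described above is to look at the per-slave "Link Failure Count" that the Linux bonding driver reports under /proc/net/bonding/. Below is a rough Python sketch that extracts those counters from the status text. The function name link_failures, the bond name bond0, the second interface eth2, and all the numbers in the sample are assumptions for illustration, not values from this cluster.

```python
def link_failures(status_text):
    """Map each slave interface to its reported link failure count."""
    failures = {}
    current = None
    for raw in status_text.splitlines():
        key, sep, value = raw.partition(":")
        if not sep:
            continue  # blank line or a line without "key: value"
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value
        elif key == "Link Failure Count" and current is not None:
            failures[current] = int(value)
    return failures


# Hypothetical excerpt in the style of /proc/net/bonding/bond0 (values invented).
SAMPLE_STATUS = """\
Bonding Mode: load balancing (round-robin)
MII Status: up

Slave Interface: eth2
MII Status: up
Link Failure Count: 0

Slave Interface: eth3
MII Status: down
Link Failure Count: 17
"""

if __name__ == "__main__":
    for iface, count in link_failures(SAMPLE_STATUS).items():
        print(iface, count)
```

On a live node the text would come from something like open("/proc/net/bonding/bond0").read(), assuming the bond is named bond0. A counter that keeps climbing on one slave while the others stay at zero points at that NIC, its cable, or its switch port.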
 
