Proxmox Ceph Crash Troubleshooting

cyrus104

Active Member
Feb 10, 2020
I set up a Proxmox cluster with 3 identical Xeon-D systems with 64GB RAM; each one has a 1TB Samsung 970 Pro consumer drive and a 6.4TB Intel DC P4600 drive. Based on feedback here, I had no issues setting up the cluster and installing Ceph. Once I set up Ceph and started to add the OSDs, I got one that went down but stayed in.

I've seen a couple of messages in the logs about heartbeats not being received, but I've run ping on all of the nodes for 20 minutes and only lost one packet on one of the nodes; the rest were at 100% with latency at 0.1 to 0.15 ms.

One of the messages is "Health check failed: 2 slow ops", and I'm not sure what those are. I haven't had any issues with these systems, and I ran a 12-hour burn-in on the nodes and drives to see if the "limited" test discovered any issues.
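For anyone hitting the same message: slow ops are client or peering requests that an OSD or monitor has held longer than the complaint threshold (30 seconds by default). A rough sketch of how to inspect them, assuming a cluster in this state and substituting your own OSD id for the hypothetical osd.0:

```shell
# Show which daemons are reporting the slow ops
ceph health detail

# Ask the affected OSD (via its admin socket) what it is stuck on;
# both commands are standard admin-socket queries
ceph daemon osd.0 dump_ops_in_flight
ceph daemon osd.0 dump_historic_ops
```

The `dump_ops_in_flight` output includes an `age` and a per-op event list, which usually shows whether the op is waiting on the network (consistent with the heartbeat messages above) or on the disk.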

While typing this, there was a state change and the OSD came back up for about 5 minutes. I included an updated picture and a snip of the syslog.

Please let me know if other logs are more helpful and/or which commands to run to troubleshoot it.
 

Attachments

  • ceph_health.JPG (70.1 KB)
  • ceph_mon_man.jpg (87.9 KB)
  • ceph_osd.JPG (30.4 KB)
  • proxmox-3_ceph_log.txt (48.9 KB)
  • proxmox-3_syslog.txt (19.7 KB)
  • ceph_health_while_typing.JPG (69.5 KB)
  • ceph_osd_while_typing.JPG (36.2 KB)
So I've made some progress: while pings were going through on the cluster network, traffic from one of the hosts was being routed out to my pfSense box and back. This was because I had improperly created a Linux VLAN attached to the public IP interface instead of the interface plugged into the newly created storage VLAN.
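For reference, the fix was to attach the VLAN to the right parent interface. A sketch of what the corrected /etc/network/interfaces stanza looks like; the interface name (enp2s0), VLAN tag (40), and address are made-up examples, not my actual values:

```
# VLAN 40 on the NIC that is physically wired to the storage
# network, NOT on the public-facing interface
auto enp2s0.40
iface enp2s0.40 inet static
    address 10.10.40.3/24
```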

This resolved a bunch of my issues, and the pool is now green. The only issues still in the logs preventing Ceph from being healthy are these:
2020-04-07 11:14:45.171188 mon.proxmox-1 (mon.0) 16312 : cluster [WRN] Health check update: Long heartbeat ping times on back interface seen, longest is 3308.712 msec (OSD_SLOW_PING_TIME_BACK)
2020-04-07 11:14:45.171206 mon.proxmox-1 (mon.0) 16313 : cluster [WRN] Health check update: Long heartbeat ping times on front interface seen, longest is 3160.332 msec (OSD_SLOW_PING_TIME_FRONT)

They are happening every minute, and I found the following thread, which I am now using to troubleshoot:
https://forum.proxmox.com/threads/ceph-14-2-5-get_health_metrics-reporting-1-slow-ops.61869/
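In case it helps others with the OSD_SLOW_PING_TIME warnings above: recent Ceph releases can dump the recorded heartbeat ping times, which shows exactly which OSD pairs are slow. A sketch, assuming the mgr admin socket is reachable on the node where it runs (replace the hypothetical mgr name with yours from `ceph mgr dump`):

```shell
# List all recorded OSD heartbeat times above a threshold of 0 ms,
# i.e. every pair, so the slow front/back paths stand out
ceph daemon mgr.proxmox-1 dump_osd_network 0
```

In my case this would have pointed straight at the one host whose storage traffic was detouring through pfSense.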

It looks like it's pretty close to being healthy now that all hosts can communicate without the weird latency issue.
 
So I've made some progress: while pings were going through on the cluster network, traffic from one of the hosts was being routed out to my pfSense box and back. This was because I had improperly created a Linux VLAN attached to the public IP interface instead of the interface plugged into the newly created storage VLAN.
Please post your network config and also make sure that all nodes have the same network settings.

As a reference:
https://forum.proxmox.com/threads/small-homelab-cluster-ceph-help.67952/#post-304869
 
After rebooting each host node, I have not gotten this error in 24 hours, and it looks like Ceph is stable. I have another issue and will start a new forum post, which will include the network settings.

Thanks
 
