Proxmox Ceph Crash Troubleshooting

cyrus104 · Apr 6, 2020

I set up a Proxmox cluster with 3 identical Xeon-D systems with 64GB RAM; each one has a 1TB Samsung 970 Pro commercial drive and a 6.4TB Intel DC P4600 drive. Based on feedback here, I had no issues setting up the cluster and installing Ceph. Once I set up Ceph and started adding the OSDs, one of them went down but stayed in.

I've seen a couple of messages in the logs about heartbeats not being received, but I've run ping on all of the nodes for 20 minutes and only lost 1 packet on one of the nodes; the rest were at 100% with latency at 0.1 - 0.15 ms.

One of the messages is "Health check failed: 2 slow ops"; I'm not sure what those are. I haven't had any issues with these systems, and I ran a 12-hour burn-in on the nodes and drives to see if that "limited" test would uncover any problems.

While I was typing this, there was a state change and the OSD came back up for about 5 minutes. I included an updated picture and a snippet of the syslog.

Please let me know if other logs are more helpful and/or which commands to run to troubleshoot it.

cyrus104 · Apr 7, 2020

So I've made some progress. While pings were going through on the cluster network, one of the hosts was being routed to my pfSense and back. This was because I had mistakenly created the Linux VLAN on the public IP interface instead of on the interface plugged into the newly created storage VLAN.

This resolved a bunch of my issues and the pool is now green. The only remaining messages in the logs that keep Ceph from being healthy are these:
2020-04-07 11:14:45.171188 mon.proxmox-1 (mon.0) 16312 : cluster [WRN] Health check update: Long heartbeat ping times on back interface seen, longest is 3308.712 msec (OSD_SLOW_PING_TIME_BACK)
2020-04-07 11:14:45.171206 mon.proxmox-1 (mon.0) 16313 : cluster [WRN] Health check update: Long heartbeat ping times on front interface seen, longest is 3160.332 msec (OSD_SLOW_PING_TIME_FRONT)

They are happening every minute and I found the following thread that I am now using to troubleshoot.
https://forum.proxmox.com/threads/ceph-14-2-5-get_health_metrics-reporting-1-slow-ops.61869/

It looks like it's pretty close to being healthy now that all hosts can communicate without a weird latency issue.
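
For anyone tracking the same warnings, here is a minimal Python sketch that pulls the interface, the longest ping time, and the health code out of lines like the two quoted above. The regular expression is written against only the wording of those two lines; it is an assumption that other Ceph releases phrase the message the same way, and `parse_slow_pings` is just an illustrative helper name.

```python
import re

# Pattern built from the two warning lines quoted in this post (assumed wording).
PATTERN = re.compile(
    r"Long heartbeat ping times on (?P<iface>back|front) interface seen, "
    r"longest is (?P<msec>[\d.]+) msec \((?P<code>OSD_SLOW_PING_TIME_\w+)\)"
)


def parse_slow_pings(lines):
    """Return (interface, longest_msec, health_code) for each matching line."""
    results = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            results.append(
                (match.group("iface"), float(match.group("msec")), match.group("code"))
            )
    return results


# The two log lines from this post, pasted verbatim.
log_lines = [
    "2020-04-07 11:14:45.171188 mon.proxmox-1 (mon.0) 16312 : cluster [WRN] "
    "Health check update: Long heartbeat ping times on back interface seen, "
    "longest is 3308.712 msec (OSD_SLOW_PING_TIME_BACK)",
    "2020-04-07 11:14:45.171206 mon.proxmox-1 (mon.0) 16313 : cluster [WRN] "
    "Health check update: Long heartbeat ping times on front interface seen, "
    "longest is 3160.332 msec (OSD_SLOW_PING_TIME_FRONT)",
]

for iface, msec, code in parse_slow_pings(log_lines):
    print(f"{iface}: longest heartbeat ping {msec:.3f} msec ({code})")
```

Feeding it a longer stretch of the cluster log makes it easier to see whether the longest ping time is shrinking after a network change.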

Alwin · Apr 7, 2020

cyrus104 said:
So I've made some progress. While pings were going through on the cluster network, one of the hosts was being routed to my pfSense and back. This was because I had mistakenly created the Linux VLAN on the public IP interface instead of on the interface plugged into the newly created storage VLAN.

Please post your network config and also make sure that all nodes have the same network settings.

For reference:
https://forum.proxmox.com/threads/small-homelab-cluster-ceph-help.67952/#post-304869
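
A rough way to check each node for the kind of misrouting described in the quoted post is to look at what `ip route get <peer address>` reports: a `via` hop means the traffic is going through a gateway rather than straight out the storage interface. The Python sketch below parses that output. The interface names (`vmbr0`, `vmbr1`) and all addresses are hypothetical placeholders, not values from this thread, and the helper names are illustrative only.

```python
import subprocess


def parse_route_get(output):
    """Extract the 'dev' and optional 'via' fields from `ip route get` output."""
    tokens = output.split("\n", 1)[0].split()
    info = {"dev": None, "via": None}
    for key in info:
        if key in tokens:
            idx = tokens.index(key)
            if idx + 1 < len(tokens):
                info[key] = tokens[idx + 1]
    return info


def is_direct(output, expected_dev):
    """True if the route has no gateway hop and leaves via the expected device."""
    info = parse_route_get(output)
    return info["via"] is None and info["dev"] == expected_dev


def route_get(addr):
    """Run `ip route get` for a peer address (Linux with iproute2 only)."""
    result = subprocess.run(
        ["ip", "route", "get", addr], capture_output=True, text=True, check=True
    )
    return result.stdout


# Hypothetical sample outputs: one direct on the storage bridge, one via a gateway.
sample_direct = "10.10.10.12 dev vmbr1 src 10.10.10.11 uid 0\n    cache\n"
sample_routed = "10.10.10.12 via 192.168.1.1 dev vmbr0 src 192.168.1.11 uid 0\n    cache\n"

print(is_direct(sample_direct, "vmbr1"))  # direct route on the storage bridge
print(is_direct(sample_routed, "vmbr1"))  # goes through a gateway instead
```

On a real node, `is_direct(route_get(peer_ip), storage_dev)` would be run once per Ceph peer address, with `peer_ip` and `storage_dev` filled in from your own network config.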

cyrus104 · Apr 8, 2020

After rebooting each host node, I have not seen this error in 24 hours and it looks like Ceph is stable. I have another issue and will start a new forum post, which will include the network settings.

Thanks
