[SOLVED] VM crashing on CEPH OSD, fine on ZFS disk

solv

Hi,
I'm trying to figure out which logs to use to try and diagnose the issue I'm having.
I have a lab/learning cluster with 3 nodes and a Ceph cluster using a Bluestore storage pool.
The Ceph cluster uses a 1GbE interface for the public network and a dedicated 10GbE for the cluster network.
I previously just had 1GbE all around, but I wanted to test more thoroughly, so I added some 10GbE NICs.
I cloned an Ubuntu server I had running on the local ZFS disks of the cluster and put it on Ceph.
Almost immediately I noticed that while rsyncing stuff over to it, the VM would crash (and HA would restart it). This happened continuously.
Eventually I finished configuration and it stopped crashing for a while, but during its backup window it only got part way in before crashing again.
I'm not running out of RAM, disk space is fine, etc., so I decided to test by moving it back onto the local disk with ZFS, and it has been rock solid.
So the issue must be Ceph; I've probably misconfigured it or done something stupid, as I'm only learning, but I don't know which logs to start looking at that can tell me why the VM is suddenly crashing.
I have had a look at ceph.log and ceph-osd.log and there are no errors being spat out that I can see obviously. The Proxmox summary pages for Ceph say all is healthy and happy.
Are there any other proxmox or qemu logs or something that might give me a clue as to where to start looking for the issue?
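(For reference, a few standard places to look on whichever node was hosting the VM when it died; journalctl, the pve-ha-* units and the ceph CLI are all stock tooling, adjust the time window as needed:)

# kernel log: OOM killer, hardware or storage errors around the crash time
journalctl -k | grep -iE "oom|out of memory|killed process"

# HA manager logs: why HA decided to restart the VM
journalctl -u pve-ha-lrm -u pve-ha-crm --since today

# Ceph health from any node
ceph -s
ceph health detail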

thanks
 
For a basic setup, never run corosync alongside any other service, since corosync needs low and stable latency. Secondly, use the 10GbE for both the Ceph public and cluster network, since all clients run their traffic on the public network.
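In /etc/pve/ceph.conf that means pointing both networks at the 10GbE subnet, roughly like this (10.10.10.0/24 is only an assumed example for the 10GbE subnet):

[global]
     # both Ceph networks on the 10GbE subnet
     public_network = 10.10.10.0/24
     cluster_network = 10.10.10.0/24

Keep in mind that existing monitors keep the addresses they were created with, so moving the public network usually means re-creating the MONs on the new subnet as well.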
 
Hi,
Thanks for your help. I was a bit worried about the latency issue for the Ceph cluster network, which is why I didn't use the same 10GbE connections for both, but I'll try that.
As for corosync, do you mean that I should have that running on a separate VLAN, or do you mean I shouldn't be using corosync and Ceph at the same time? If the latter, how do I go about having a cluster and Ceph? If the former, I can't see in the documentation how to specify a separate network for corosync, but I have set up the primary with the 10GbE and the secondary with 1GbE as below:

[Screenshot: corosync link configuration with Link 0 on the 10GbE network and Link 1 on the 1GbE network]

Does corosync default to using Link 0 or Link 1?

Also just in case it helps, this is my CEPH OSD conf:
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 10.10.10.1/24
     fsid = 093810e8-7e16-40a6-873d-f23c2018aa3e
     mon_allow_pool_delete = true
     mon_host = 10.245.173.245 10.245.173.246 10.245.173.247
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 10.245.173.245/24

Thanks again for your help - I'm evaluating whether I can roll out a cluster solution for my client SMBs rather than a standard single server with hardware redundancy. I can see that I'll definitely be needing a support subscription if I take the plunge...so much to learn =)
 
Just an update:
I switched over to using the 10GbE for the Ceph public and cluster networks and for the Proxmox GUI, which was super easy as I just changed the bridged interface, and with ifupdown2 installed no rebooting was necessary. I then dedicated the 1GbE with static IPs as Link 0 for corosync on a spare switch.
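For anyone wanting to replicate this, the change boils down to swapping the bridge port in /etc/network/interfaces and reloading; interface name, gateway and exact addresses below are illustrative:

auto vmbr0
iface vmbr0 inet static
     address 10.245.173.245/24
     gateway 10.245.173.1
     bridge-ports enp5s0f0   # 10GbE NIC in place of the old 1GbE NIC
     bridge-stp off
     bridge-fd 0

Then "ifreload -a" (from ifupdown2) applies it without a reboot.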

I also think the main issue was caused by having jumbo frames enabled on the 10GbE interfaces; I've removed them and everything is running beautifully now.
 
Also just in case it helps, this is my CEPH OSD conf:
Please post configs in CODE tags; the symbol (</>) can be found in the editor bar.


Does corosync default to using Link 0 or Link 1?
As for corosync, do you mean that I should have that running on a separate VLAN, or do you mean I shouldn't be using corosync and Ceph at the same time? If the latter, how do I go about having a cluster and Ceph?
See here, especially the section with the requirements.
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network
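For reference, the per-node link addresses live in /etc/pve/corosync.conf: ring0_addr corresponds to Link 0 in the GUI and ring1_addr to Link 1. A sketch with placeholder names and addresses:

nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.11    # Link 0 (placeholder address)
    ring1_addr: 192.168.0.11   # Link 1 (placeholder address)
  }
  # ...one node {} block per cluster node
}

Which link actually carries the traffic can be steered with knet_link_priority in the totem interface sections; the guide linked above covers the details.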

I also think the main issue was caused by having jumbo frames enabled on the 10GbE interfaces; I've removed them and everything is running beautifully now.
If all ends use jumbo frames, the performance of Ceph will increase. See our Ceph benchmark paper and the forum thread.
https://proxmox.com/en/downloads/item/proxmox-ve-ceph-benchmark
https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/
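If jumbo frames go back on, every NIC, bridge and switch port in the path needs the same MTU. A quick end-to-end check between two nodes (interface name and address are illustrative):

# set MTU 9000 on the 10GbE interface; persist it with an 'mtu 9000' line in /etc/network/interfaces
ip link set dev enp5s0f0 mtu 9000

# 8972 = 9000 - 20 (IP header) - 8 (ICMP header); -M do forbids fragmentation, so it fails loudly if any hop is smaller
ping -M do -s 8972 10.10.10.2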
 
Just wanted to share the final update to this thread in case it helps someone.
The issue all along had nothing to do with the cluster or network; it was the fact that I had used ZFS as the base filesystem on the nodes, which only have 8GB RAM each (this is just a test cluster).
Basically, the hosts were running out of RAM due to the high memory requirements of ZFS and were thus shutting down the VMs! (Also, no swap space was provisioned, so that didn't help.)
I redid the entire cluster using standard ext4 for the hosts, added swap space for safety, and deployed my Ceph OSDs and pool again; it's been rock solid ever since!
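For anyone who would rather keep ZFS on RAM-constrained hosts than reinstall, the usual mitigation is to cap the ARC; the 2 GiB value below is only an example, size it to your own RAM:

# /etc/modprobe.d/zfs.conf: limit the ZFS ARC to 2 GiB (2 * 1024^3 bytes)
options zfs zfs_arc_max=2147483648

Run "update-initramfs -u" and reboot for it to apply at boot, or write the same value to /sys/module/zfs/parameters/zfs_arc_max to apply it immediately.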
 
