Ceph / Cluster Networking Question

XN-Matt
Aug 21, 2017
We've been using a traditional SAN with iSCSI for over 10 years; it has been ultra reliable.

Now we're looking at Ceph and have built a three-server Ceph cluster using Dell R740xd servers.

Each server has six interfaces: three to one switch, three to another.

One port is public internet
One port is public Ceph
One port is internal Ceph

This all works fine.
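For reference, that split typically ends up in ceph.conf roughly like this (a sketch only; the subnets are taken from the addresses mentioned later in the thread):

```
[global]
# Network that clients, monitors, and OSD front-ends use ("public Ceph")
public_network = 10.0.0.0/24
# OSD-to-OSD replication and heartbeat traffic ("internal Ceph")
cluster_network = 192.168.0.0/24
```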

Our plan was to add these servers to the existing cluster, build Ceph, and then migrate the storage.

Now, the existing servers only have four interfaces: one public internet and one iSCSI to each switch. They don't have another NIC for the internal Ceph network, which I don't believe they need. They share the same subnet, i.e. 10.0.0.0/24, but not the same IPs: the old hosts are 10.0.0.10-20 and the new ones are 10.0.0.30-40, all within the same /24.

Adding them was fine, but as soon as we even just installed Ceph, things went crazy: all host nodes started shutting down and rebooting.

Before I waste a lot of time on this, can someone confirm that nodes which are not actively doing the storage part (i.e. Ceph clients only) don't need access to the internal Ceph range, only the public Ceph range? These client nodes only have minimal boot storage and currently connect via iSCSI.
 
Adding them was fine, but as soon as we even just installed Ceph, things went crazy: all host nodes started shutting down and rebooting.
Was this after Ceph was installed or during?

Do you use HA?
If so, which network(s) are configured for the Proxmox VE cluster (Corosync)?
 
Ceph installed OK, but it happened during the configuration; nodes started rebooting almost immediately afterwards.

Yes we use HA.

Public interface is 95.x.x.x
Ceph Public/Cluster is 10.0.0.0/24 (we also set a backup ring on the public)
Ceph Internal is 192.168.0.0/24
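For context, with a primary ring plus a backup ring on the public interface, the nodelist section of /etc/pve/corosync.conf looks roughly like this (a sketch with hypothetical node names and addresses; ring0 on the Ceph public subnet, ring1 as the backup ring):

```
nodelist {
  node {
    name: node1          # hypothetical node name
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.0.30   # Ceph public / cluster subnet
    ring1_addr: 95.0.0.30   # hypothetical public address, backup ring
  }
  # ...one node block per cluster member
}
```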

There was no saturation of any link during the process, which would rule out timeouts.
 
If Corosync had all three networks configured as links, chances are low that all of them became unusable. But you can check the journal for the Corosync logs to see what happened.
Code:
journalctl -u corosync
It will log when it loses the connection to another host on a link.
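If you can't check a node right now, these are the kind of lines to look for. The exact knet log format varies by Corosync version, and the sample lines below are assumptions written to a file for illustration, not real output:

```shell
# Write a hypothetical sample of corosync knet log lines to a file
cat <<'EOF' > /tmp/corosync-sample.log
Jan 01 12:00:00 node1 corosync[1234]:   [KNET  ] link: host: 2 link: 0 is down
Jan 01 12:00:05 node1 corosync[1234]:   [KNET  ] rx: host: 2 link: 0 is up
EOF
# Count how often a link went down; on a real node,
# pipe "journalctl -u corosync" into grep instead of the sample file
grep -c 'is down' /tmp/corosync-sample.log
```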
 
I will check when we try again, but the main question remains.

Do nodes that exclusively run virtual machines need to see/access the internal Ceph network? Or should they only require, at a minimum, an interface for internet/VM traffic to the world and one for cluster traffic (which is shared with Ceph public)?
 