[SOLVED] Setting up ceph, getting an error in the gui

rsr911

Member
Nov 16, 2023
Three node cluster. Setting up Ceph on the first node I get this error:

"Multiple IPs for ceph public network '192.168.1.204/24' detected on host1: 192.168.1.141 192.168.1.204; use 'mon-address' to specify one of them. (500)"

I've added "mon-address = 192.168.1.141:6789" to my ceph.conf; see below:

[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 192.168.0.1/24
fsid = 538384c8-1350-4653-90da-ceeab8fda7d5
mon allow pool delete = true
osd pool default min size = 2
osd pool default size = 3
public network = 192.168.1.204/24
mon-address = 192.168.1.141:6789

#[mon]
# mon host = 192.168.1.141:6789

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring
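As an aside, the 'mon-address' in the error text refers to an option of the pveceph CLI rather than a [global] ceph.conf key, so setting it in the config file may have no effect on monitor creation. A hedged sketch of the CLI form (assuming 192.168.1.141, without the port, is the intended monitor address):

```
pveceph mon create --mon-address 192.168.1.141
```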

My network setup is three Linux bridges: management (1G), VMs (10G), isolated cluster (10G).
All of these have bonded ports: 2 for management and 4 each for VMs and cluster.
I've tested all the network connections with ping, nmap, and iperf3; everything communicates between hosts.

Seems like no matter what I do, the GUI install fails with that error. I then do the following to completely remove Ceph and start over:

systemctl stop ceph-mon.target
systemctl stop ceph-mgr.target
systemctl stop ceph-mds.target
systemctl stop ceph-osd.target
rm -rf /etc/systemd/system/ceph*
killall -9 ceph-mon ceph-mgr ceph-mds
rm -rf /var/lib/ceph/mon/ /var/lib/ceph/mgr/ /var/lib/ceph/mds/
pveceph purge
apt purge ceph-mon ceph-osd ceph-mgr ceph-mds
apt purge ceph-base ceph-mgr-modules-core
rm -rf /etc/ceph/*
rm -rf /etc/pve/ceph.conf
rm -rf /etc/pve/priv/ceph.*

I'm stuck. After it fails, on the configuration tab the crush map shows: "Error rados_connect failed - no such file or directory"
I can't create a monitor either; same error. It does this on the other nodes as well.
 
This is an example ceph.conf that was created by the UI wizard:


Code:
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 192.168.99.34/24
     fsid = 40ec99ac-1fcb-496e-a1b9-ed7d8742fcff
     mon_allow_pool_delete = true
     mon_host = 192.168.99.34 192.168.99.35 192.168.99.36
     ms_bind_ipv4 = true
     ms_bind_ipv6 = false
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 192.168.99.34/24

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
     keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.PMX4]
     host = PMX4
     mds_standby_for_name = pve

[mds.PMX5]
     host = PMX5
     mds_standby_for_name = pve

[mds.PMX6]
     host = PMX6
     mds_standby_for_name = pve

[mon.PMX4]
     public_addr = 192.168.99.34

[mon.PMX5]
     public_addr = 192.168.99.35

[mon.PMX6]
     public_addr = 192.168.99.36


You messed around a lot with your installation; I would recommend reinstalling the Proxmox OS and using the Ceph UI wizard.
 
You messed around a lot with your installation; I would recommend reinstalling the Proxmox OS and using the Ceph UI wizard.
I'm not sure what you mean by messed around but I'll reinstall all three and start over.

Basically I installed each server, then one by one set up identical networks on all three, with bonds where I needed them. Once that was done I created the Proxmox cluster, which went fine. Then I tried to install Ceph. I'm not saying you're wrong, I am only a newbie; I'm just wondering what you think I might have done wrong. Which brings up this question: what if I install the servers again and, instead of all the bonds, just do three bridges with one NIC port each (no bonds), get it working hopefully, and then create and add bonds after it's working? I do believe you are right that it's something I didn't do right; I just can't figure out where, and I don't want to repeat the same mistake.

That said, we are so impressed with Proxmox over VMware that the decision has been made to dump our unused VMware license and subscribe to Proxmox. It's easier to use, makes getting a shell easier, allows tools like iperf3 to be installed, and doesn't seem to care what hardware you run it on. vSphere had me upgrading hardware at every turn, even adding Optanes, which are required cache for vSAN, etc. It was a nightmare experience. Proxmox is light years ahead with its interface and usability, in my opinion. It doesn't hurt that I've been using Debian flavors for over 13 years.

Thank you for this advice. I will give it a go and see what happens.
 
I did just notice something in your config: you only have one network. When I did a test install I used one network and things worked fine. Now I have an isolated 10G cluster network, a 1G management network, and a 10G VM network. Where do you recommend the public Ceph network be located? I am assuming on the VM network?
 
Hello rsr911. OK, I'll break it down:

Networking:
  • Separation of the Ceph public and Ceph cluster networks is not state of the art anymore
    • but in your case, only having 10 Gbit for Ceph, I would go for 2x 10 Gbit public and 2x 10 Gbit cluster for Ceph
    • but this means you need 4x 10 Gbit ports
  • The best option would be a full-mesh network for Ceph with a 4x 25 Gbit card per node
  • Next, you don't need vmbridges for network ports; just make a bond over two 10 Gbit ports and use that for Ceph
    • You only need vmbridges for networks that should be available inside your VMs
    • which means you typically need only one vmbridge per cluster (usually)
  • Ceph can use a bond0 with active/passive or LACP (if your switches support it)
  • We also recommend running Ceph with jumbo frames enabled (the switch must have its max MTU set for that)
  • Don't mix the VM network with Ceph in any way; if you don't have enough ports, go for one Ceph network at 10 Gbit or buy a network card
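The jumbo-frames point could look like the following /etc/network/interfaces fragment, plus a path-MTU check; the interface names and addresses are placeholders, not from this thread:

```
# fragment of /etc/network/interfaces -- bond used for Ceph
iface bond1 inet static
        address 192.168.99.34/24
        bond-slaves enp5s0f0 enp5s0f1
        bond-mode 802.3ad
        mtu 9000

# verify end to end (9000 - 28 bytes IP/ICMP overhead = 8972 payload):
#   ping -M do -s 8972 192.168.99.35
```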
Regarding messing around, I meant:

  • don't manually delete configs or packages; you won't have fun with it
  • Use the web UI for everything; you don't need the CLI for installing Ceph
  • Only very few things actually require the CLI (one example: configuring the time server in /etc/chrony/chrony.conf)

If you want further help, please post your hardware stack with all details, so I can give some recommendations based on my experience (we sell a lot of Proxmox VE servers at Thomas-Krenn.AG).

Edit: Just in case you don't know: NEVER use apt upgrade for upgrading packages, as this can break your system. Use apt dist-upgrade.

Greetings Jonas
 
Jonas thank you so much!

Here is my configuration:

Three Lenovo RD440 servers.
Samsung SATA SSD for boot in the DVD bay. 8 SanDisk 1TB SSDs in the hot-swap bays, connected to Avago 9400-series RAID cards in JBOD mode. Two dual-port Intel RJ45 10G NICs (used for the VM network). Two dual-port Intel 10G SFP+ cards used for Ceph. Dual Xeon 8-core CPUs, 128GB RAM.

For networking I have two 8 port Netgear pro L2/L3 SFP+ switches for ceph redundancy and two 28 port L2+ XS728T 10g switches for main network. I also have a 50 port smart Netgear 1g switch tied to one of the 10g switches.

Layout:

Office building server room:

One Lenovo as host1
Second Lenovo to become PBS
8 port ceph switch
28 port 10g switch
50 port 1g switch
Fortigate router/gateway.

Plant building server room:

Two Lenovos as hosts 1 and 2.
The buildings are connected via buried OM4 cable (12 pairs) and 2 Cat6a runs. I have the large 10G switches connected with the 2 Cat6a and two bonded OM4 pairs. Also here: the other 28-port 10G Netgear switch, a small 10/100/1000 8-port switch for some legacy hardware, and the second 8-port SFP+ 10G switch for the redundant Ceph bond.

Unused (due to VMware) I have 4 118GB Optane drives on PCIe risers that I'm thinking of using in the PBS as cache, since it has an HDD RAID 5 array for backup storage. I'll cross that bridge when I come to it.

Backup server is also an RD440, but with dual 6-core CPUs and 96GB RAM.

The three servers for the cluster can go up to 196GB RAM. VMware limited me by license to only 128GB. I'd be happy to add more RAM if needed.

Planned VMs:
Windows server 2019 primary DC and main windows share on second VMDK (currently running on ESXI alone)
Windows server 2019 secondary DC
Windows 10 workstation, because Sage 50 accounting software doesn't like to run on servers.
Very soon will be adding a Windows SQL server.
Ubuntu utility workstation (normally off)
Windows 10 utility machine (normally off) mainly to test updates before network roll out.
Eventually a Windows 11 VM for the same purpose.
Down the road, a Linux mail server is being considered to host our own email.

Total of 17 users.

Major reason is the need for HA to run a database to track batch records for OTC medicated patches. Once we move from paper to digital any downtime means lost production time. FDA doesn't like mixed records so once we are digital we are tied to it. Expected use of the database is up to 8-9 users at a time. We run two shifts. Secondary reason is my personal time. I'm tired of fixing computer issues in the middle of the night.

We are a small but growing family owned business.

As you can see, I should have all the hardware needed for what I think is a well-planned professional layout. There's no real reason to go to a mesh since I already have switches (except maybe latency?).

You guys are the pros here. How would you layout this hardware?

Lastly both racks have almost new UPSs big enough to handle up to four servers each. This way I can add servers later if needed.

I know everything is kind of overkill. But it's what we have. I'm hoping to get at least 5+ more years out of the servers before they need upgrading. I'm betting I can.

Thanks again,

Christian
 
8 SanDisk 1TB SSDs
For CEPH you should definitely use enterprise SSDs, they have a significantly lower latency than consumer ones. When it comes to storage, latency is always the most critical thing.

For networking I have two 8 port Netgear pro L2/L3 SFP+ switches for ceph redundancy and two 28 port L2+ XS728T 10g switches for main network.
It doesn't seem to me that these switches are stackable, so you can't use 802.3ad (LACP) across multiple devices. In addition, L2 hashing is rather unsuitable for Ceph. Ideally your switches should be capable of MLAG (not something like VC from Juniper) and definitely layer 3+4 hashing. This is the only way you can optimally distribute the load from Ceph across all links and nodes and thus also benefit from the bandwidth. With layer 2, on the other hand, there is a very high probability that individual links will be significantly more heavily loaded or even overloaded.
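The layer 2 vs. layer 3+4 point can be illustrated with a toy hash (not the kernel's actual transmit-hash function): layer 2 sees only the two MAC addresses, so every connection between the same pair of hosts lands on one link, while layer 3+4 also mixes in TCP ports and spreads the many connections Ceph opens:

```python
import zlib

LINKS = 2  # ports in the LACP bond

def pick_link(*fields) -> int:
    """Deterministic stand-in for a bond transmit-hash policy."""
    key = "|".join(map(str, fields)).encode()
    return zlib.crc32(key) % LINKS

# layer2: only MACs -> one host pair always uses the same link
l2_links = {pick_link("aa:bb:cc", "dd:ee:ff") for _ in range(100)}

# layer3+4: IPs plus TCP ports -> different connections hash differently
l34_links = {pick_link("10.0.0.1", "10.0.0.2", sport, 6800)
             for sport in range(40000, 40100)}
print(sorted(l2_links), sorted(l34_links))
```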

The latency of enterprise switches with MLAG support is usually significantly lower than that of your models. As with SSDs, the latency of the storage is crucial for your overall performance: the higher the latency, the worse the performance.

Two Lenovos as hosts 1 and 2.
Where is the third server that is mandatory for quorum for both PVE and CEPH?

Major reason is the need for HA to run a database to track batch records for OTC medicated patches. Once we move from paper to digital any downtime means lost production time. FDA doesn't like mixed records so once we are digital we are tied to it. Expected use of the database is up to 8-9 users at a time. We run two shifts. Secondary reason is my personal time. I'm tired of fixing computer issues in the middle of the night.
But do you know the difference between High Availability (HA) and Fault Tolerance (FT)?

With HA, if the node fails, the VM will also fail and be restarted on another node. This can also lead to an inconsistent state, especially with a database.
The MSSQL servers should also be set up with Always On and replicate themselves; then the failure of a node does not have a dramatic impact on the consistency of the data.

The databases in particular will thank you if you use enterprise SSDs. Databases on Ceph are often a problem anyway; if you use consumer SSDs on top of that, you will get regular complaints.
 
Thank you. Noted on the SSDs. I will look into that, especially since my backplanes are SAS capable.

The switches have LACP options in them, although I'm still uncertain how to set it up properly. One video I watched said to set the switches to passive LACP, but I don't see that option. I have admin and STP, RSTP, CST, and MST available, but I don't know enough about this higher-level networking to know how best to set it up.

Ideally, at least in my mind, I would have two RJ45 10g ports bonded as active/active with another pair bonded as failover and the same with the SFP+ for ceph.

Chatgpt says the following about the XS728T switches:

"The Netgear XS728T switches support stacking through two dedicated 10G SFP+ ports, but they do not natively support Link Aggregation Control Protocol (LACP) for stacking. Stacking usually involves connecting multiple switches to form a single logical unit for management and increased capacity. LACP, on the other hand, is often used for link aggregation within a single switch to increase bandwidth and provide redundancy."

The actual reason for these two switches is having two buildings and not wanting to run lots of cable underground. They are currently connected with a pair of SFP+ over fiber and "talk" fine but are separate aka not combined like one big switch. For ceph cluster traffic it would be a SFP+ switch in each building. One as active and the other as failover. Again this is my thinking, I'm no expert. This was recommended for VMware.

"Hosts 1 and 2" was a mistake, I meant hosts 2 and 3.

I take it you're recommending not enabling HA at least for the MSSQL but instead have 2-3 instances replicating in FT? Maybe just have HA set for the DCs?

Back to networking. On the VM-to-public side of things I really don't need 10G. We ran fine for years with a DC each on standalone ESXi hosts on bonded 1G. We did 10G fiber because we had a fire two years ago and everything needed replacing; I felt going to Cat6a and underground fiber was just sort of future-proofing things for not much more money. Originally this was going to be a vSphere vSAN with HA with the SSDs in RAID 5, which is why I went with cheaper consumer SSDs; they speed-tested close to a gen 3 NVMe.
 
If you want further help, please post your hardware stack with all details, so I can give some recommendations based on my experience (we sell a lot of Proxmox VE servers at Thomas-Krenn.AG).

Greetings Jonas
Jonas,

Did you get a chance to look at my hardware stack? I'm doing clean installs on the cluster servers today. I figured I would wait for your advice before going further.

-Christian
 
Quick question regarding NVMe storage for my database. I have a 118GB Optane M.2 that could go in each server. I've done the math, and that much space (118GB) will hold at least several years' worth of data, which could be, and most likely will be, archived each year anyway. Does it make sense to run my database on Optane drives as a pool?

Also all my switches are "enterprise grade L2+ 10gbe" so they have limited L3 and L4 ability.

Main switches are Netgear XS728T. SFP+ switches planned for the cluster back end are TP-Link 8-port TL-SX3008F.
 
The switches have LACP options in them, although I'm still uncertain how to set it up properly. One video I watched said to set the switches to passive LACP, but I don't see that option.
These switches are not stackable, but they can do LACP (though only within one switch); they also can't do MLAG. The only thing you can do is use one switch with LACP, or use two switches (in one room) with LACP.
Ideally, at least in my mind, I would have two RJ45 10g ports bonded as active/active with another pair bonded as failover and the same with the SFP+ for ceph.
Just summarized, you usually need:

* 1 port UI (1 Gbit/s or more)
* 1 port Corosync (1 Gbit/s or more)
* 2 ports for VM traffic (vmbr0) (bond0, active/backup or LACP)
* 2 ports for Ceph (bond1) for storage (active/backup or LACP)
** LACP is always preferred because of active/active mode

* If you want to use Proxmox Backup Server, use vmbr0 and put an IP on it
** or put another 10 Gbit/s card in your server
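On the node side, that port plan could look roughly like the following /etc/network/interfaces sketch; interface names and addresses are placeholders:

```
auto bond0
iface bond0 inet manual
        bond-slaves enp3s0f0 enp3s0f1
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4

auto vmbr0
iface vmbr0 inet static
        address 192.168.1.10/24
        gateway 192.168.1.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
```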

The actual reason for these two switches is having two buildings and not wanting to run lots of cable underground. They are currently connected with a pair of SFP+ over fiber and "talk" fine but are separate aka not combined like one big switch. For ceph cluster traffic it would be a SFP+ switch in each building. One as active and the other as failover. Again this is my thinking, I'm no expert. This was recommended for VMware.
How many nodes do you have? With 3 nodes you can't do Ceph replicas per room, just per host. If the room with the two nodes (out of three nodes total) fails, you have complete storage downtime. Ceph room-replica setups start with 3 rooms (and size=3) or with 2 rooms (size=4) plus one quorum-node room (which does not need to have OSDs).
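The 2-room limit can be sketched numerically (a toy pigeonhole model, not CRUSH itself), using the size/min_size values from the example config:

```python
import math

size, min_size = 3, 2   # replica settings from the example ceph.conf
rooms = 2

# Pigeonhole: 3 replicas in 2 rooms means some room holds >= 2 copies.
max_replicas_in_one_room = math.ceil(size / rooms)
survivors = size - max_replicas_in_one_room

# Fewer surviving copies than min_size -> the pool blocks I/O.
io_blocked = survivors < min_size
print(survivors, io_blocked)
```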
"Hosts 1 and 2" was a mistake, I meant hosts 2 and 3.

I take it you're recommending not enabling HA at least for the MSSQL but instead have 2-3 instances replicating in FT? Maybe just have HA set for the DCs?
Best would be to build application-based clustering, meaning an active/active MSSQL database that does not lose any data when the primary node fails and can switch instantly; otherwise your VM would be down for 3-5 minutes and the data in the database could have problems.
Back to networking. On the VM-to-public side of things I really don't need 10G. We ran fine for years with a DC each on standalone ESXi hosts on bonded 1G. We did 10G fiber because we had a fire two years ago and everything needed replacing; I felt going to Cat6a and underground fiber was just sort of future-proofing things for not much more money. Originally this was going to be a vSphere vSAN with HA with the SSDs in RAID 5, which is why I went with cheaper consumer SSDs; they speed-tested close to a gen 3 NVMe.
Yeah, don't compare consumer SSDs with enterprise SSDs in any way; you simply can't. They work totally differently, and as another user already said, it's absolutely not good to use consumer SSDs in any way: they don't have power-loss protection. I mean, you can stick with them, but just be warned. You should also never have used them for vSAN or anything else that is enterprise related ...

Using 1G for VM traffic is OK.
Quick question regarding NVMe storage for my database. I have a 118GB Optane M.2 that could go in each server. I've done the math, and that much space (118GB) will hold at least several years' worth of data, which could be, and most likely will be, archived each year anyway. Does it make sense to run my database on Optane drives as a pool?
Only if you make sure you have a backup or a ZFS replication set up, because otherwise an M.2 failure would lead to data loss.
 
These switches are not stackable, but they can do LACP (though only within one switch); they also can't do MLAG. The only thing you can do is use one switch with LACP, or use two switches (in one room) with LACP.
So I use LACP on just one switch, with the other switch having LACP turned off on those ports? Right now I have LACP turned on on both sides and the connection shows up. STP is turned off on that LAG.
Just summarized, you usually need:

* 1 port UI (1 Gbit/s or more)
* 1 port Corosync (1 Gbit/s or more)
* 2 ports for VM traffic (vmbr0) (bond0, active/backup or LACP)
* 2 ports for Ceph (bond1) for storage (active/backup or LACP)
** LACP is always preferred because of active/active mode

* If you want to use Proxmox Backup Server, use vmbr0 and put an IP on it
** or put another 10 Gbit/s card in your server
I would assume the UI port needs an IP. How about Corosync? Just assign IPs to these adapters? Same for bond1 for Ceph? On the Ceph side, is there anything wrong with two separate switches and networks (since I have the equipment), like this:

1) ceph bond1 bonding 2 ports on each card. Using LACP connect with one TPlink switch
2) ceph bond2 bonding the other cards ports using LACP connect with other TPlink switch
3) ceph bond3 bonding both bond1 and bond2 either as active/backup or LACP if that would work across two switches.

Is there anything wrong with putting the management UI on vmbr0 with 4 bonded 10G ports? This would free up a second 1G that could be bonded for Corosync. How do I point Corosync to this port or bond?

Not sure what you mean about Proxmox backup server. I have a standalone machine for that. It has 4 10g ports and 2 1g ports. I was going to use 1g for management and the 10gs bonded for backup traffic as vmbr0 on that machine OR are you saying "put another card in the nodes for backup"?

How many nodes do you have? With 3 nodes you can't do Ceph replicas per room, just per host. If the room with the two nodes (out of three nodes total) fails, you have complete storage downtime. Ceph room-replica setups start with 3 rooms (and size=3) or with 2 rooms (size=4) plus one quorum-node room (which does not need to have OSDs).
3 nodes for now.
Best would be to build application-based clustering, meaning an active/active MSSQL database that does not lose any data when the primary node fails and can switch instantly; otherwise your VM would be down for 3-5 minutes and the data in the database could have problems.
Duly noted. I am looking into the software side of this.
Yeah, don't compare consumer SSDs with enterprise SSDs in any way; you simply can't. They work totally differently, and as another user already said, it's absolutely not good to use consumer SSDs in any way: they don't have power-loss protection. I mean, you can stick with them, but just be warned. You should also never have used them for vSAN or anything else that is enterprise related ...
Noted as well. My RAID cards are running in HBA mode. They can each handle 2 U.2 drives or 8 SAS drives. Would that be preferable over 8 SATA SSDs? In either case, are refurbished enterprise drives acceptable? I need to check what my backplanes can handle. They are handling the consumer SSDs fine and did well when set up as a 7-drive RAID 5 with hot spare. I'm thinking refurb should be OK; I'm not generating that much data.

Using 1G for VM-Traffic is ok.
I currently have two dual-port 10G NICs I had planned for this. Would it be better to bond two ports for VM traffic on vmbr0 and bond the other card for backup traffic? Backups will be nightly, in off hours. In that case, is it OK to use all four on vmbr0? (I've installed but not configured Proxmox Backup yet.)
Only if you make sure you have a backup or a ZFS replication set up, because otherwise an M.2 failure would lead to data loss.
OK, very confused here. These would be a Ceph pool and backed up nightly, one drive per server. As I understand it these Optanes are enterprise grade. They're plugged into riser cards. Can ZFS run on top of Ceph? Or will my backup suffice?

Sorry for the newbie questions. I'm an old SCSI RAID guy used to Cat5e into a simple switch. Rather new to all this network bonding and LACP, etc. But I want to learn.

I'm doing clean installs on each node today. I'll then see what you say before I proceed. Mainly I am unsure how to separate Corosync from management. In the initial setup everything just runs on one port over vmbr0, so I really need guidance on utilizing my network hardware in the most correct manner.

Idk if you saw but I did list all my hardware a few comments up for you.

I do think my servers would have to run SAS enterprise SSDs. I don't think I have the PCIe lanes to run very many U.2 drives. What I'm finding online are refurbished drives. I'm pretty sure the backplanes are only 6 Gbit; the RAID/HBA cards are 12 Gbit or two U.2, but I do not know if those will run in my backplanes.
 
So I use LACP on just one switch, with the other switch having LACP turned off on those ports? Right now I have LACP turned on on both sides and the connection shows up. STP is turned off on that LAG.
I'm not sure what you are asking for. You have two separate switches that are connected to each other. But you can't do LACP with one port on switch1 and one port on switch2. So you can put 2 Ceph ports per node on one of those switches and use LACP (2x 10 Gbit), but this would also mean that if that switch fails, your Ceph node is down (no network). That's why an active/backup bond is recommended, because you don't need a switch stack for that: 1 Ceph port on switch1 and one Ceph port on switch2. But this is 10 Gbit only (less performance) because of active/backup (passive) --> but you can lose a switch without losing the Ceph network.

For the interconnect between the two switches you can use lacp (crosslink).

I would assume the UI port needs an IP. How about Corosync? Just assign IPs to these adapters? Same for bond1 for Ceph? On the Ceph side, is there anything wrong with two separate switches and networks (since I have the equipment), like this:
Usually in best-practice deployments Corosync gets its own physical switch port for the primary corosync link0, which is used for Proxmox cluster communication. The corosync link1 fallback can be set to the Ceph storage IP (it will only be used if link0 is down). Putting Corosync on the same interface as the UI could possibly lead to an error that causes the node to reboot if the timestamps that the Proxmox VE cluster writes are not in time. If you don't have enough ports, use Linux VLANs inside the hypervisor to separate UI traffic and Corosync traffic.
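As a hedged sketch, both links can be given at cluster-creation time via pvecm (all names and addresses below are placeholders; link0 on the dedicated Corosync port, link1 on the Ceph/storage subnet as fallback):

```
pvecm create mycluster --link0 10.10.10.11 --link1 192.168.99.34
# a joining node passes its own addresses:
pvecm add 10.10.10.11 --link0 10.10.10.12 --link1 192.168.99.35
```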

1) ceph bond1 bonding 2 ports on each card. Using LACP connect with one TPlink switch
2) ceph bond2 bonding the other cards ports using LACP connect with other TPlink switch
3) ceph bond3 bonding both bond1 and bond2 either as active/backup or LACP if that would work across two switches.
1 and 2 are OK if you are using the Ceph public network for one bond and the Ceph cluster network for the other bond. BUT as I said, when using LACP without a switch stack or an MLAG feature on your switches (which you don't have), a switch failure leads to complete node downtime (storage-wise). In your case, if you put all your Ceph networks (doesn't matter if public or cluster) on one switch, you will have complete downtime (everything is offline!). With active/backup you could lose a switch without Ceph going down.

Edit: option 3 makes totally NO SENSE at all; I don't think that would work or be a valid configuration.

Is there anything wrong with putting the management UI on vmbr0 with 4 bonded 10G ports? This would free up a second 1G that could be bonded for Corosync. How do I point Corosync to this port or bond?
You can put the UI IP on vmbr0. LACP with 4 ports is unusual; typically you have 2 ports in an LACP bond (working fine). As I already said, Corosync should not be put on an interface that carries a lot of traffic, as this can increase latency, which causes node fencing (-> reboot of the node -> which means downtime). So you are technically able to do the things you are asking for, but it is not best practice.

If it's not possible to add a 1 Gbit card, you should use a Linux VLAN adapter to separate vmbr0 and UI traffic via VLAN.
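Such a Linux VLAN adapter might look like this in /etc/network/interfaces (VLAN ID and address are made-up placeholders):

```
# UI/management traffic tagged as VLAN 40 on top of the bridge
auto vmbr0.40
iface vmbr0.40 inet static
        address 10.0.40.11/24
```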

Not sure what you mean about Proxmox backup server. I have a standalone machine for that. It has 4 10g ports and 2 1g ports. I was going to use 1g for management and the 10gs bonded for backup traffic as vmbr0 on that machine OR are you saying "put another card in the nodes for backup"?
Yes, 1G mgmt is fine; 10G for backup as well. Just make sure you are able to ping the PVE nodes from your PBS. Depending on the network you use for "backup", you need to make sure you can reach the PVE nodes.

Noted as well. My RAID cards are running in HBA mode. They can each handle 2 U.2 drives or 8 SAS drives. Would that be preferable over 8 SATA SSDs? In either case, are refurbished enterprise drives acceptable? I need to check what my backplanes can handle. They are handling the consumer SSDs fine and did well when set up as a 7-drive RAID 5 with hot spare. I'm thinking refurb should be OK; I'm not generating that much data.
Depending on how much data you have, I would go for 8 drives, as you'll have more allowed disk failures. With 2 disks you can only lose one disk per node; usually you start Ceph with 4 disks per node.

I currently have two dual-port 10G NICs I had planned for this. Would it be better to bond two ports for VM traffic on vmbr0 and bond the other card for backup traffic? Backups will be nightly, in off hours. In that case, is it OK to use all four on vmbr0? (I've installed but not configured Proxmox Backup yet.)
If backups don't interrupt your VM performance (off-hours), you can use the same adapter, but you might need separation via a Linux VLAN, because the backup will use a different IP subnet than management (you said you also want to put mgmt on that).
OK, very confused here. These would be a Ceph pool and backed up nightly, one drive per server. As I understand it these Optanes are enterprise grade. They're plugged into riser cards. Can ZFS run on top of Ceph? Or will my backup suffice?
You can run ZFS beside Ceph; ZFS usually has better performance than Ceph, so ZFS with a single disk will be better than Ceph with a single disk per node. You can set up ZFS async (!) replication via the UI and sync the VM between two nodes, and also use HA for that VM. BUT it's async, meaning if the node where the VM is running fails, the VM will be started on the node you replicated the data to (but with the data from the last sync!). Depending on how much you write, you can set the ZFS replication window down to 1-5 minutes, so every 5 minutes the data gets synchronized. If a node fails, you will have X minutes of data loss, depending on when the last sync was successfully completed.
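The replication window mentioned above corresponds to a schedule on the replication job; a sketch via the pvesr CLI (VM ID, target node, and schedule are placeholders; the same can be configured in the UI under Replication):

```
# replicate VM 100 to node pve2 every 5 minutes
pvesr create-local-job 100-0 pve2 --schedule "*/5"
```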

I'm doing clean installs on each node today and will see what you say before I proceed. Mainly I am unsure how to separate corosync from management. In the initial setup everything just runs on one port over vmbr0, so I really need guidance on utilizing my network hardware in the most correct manner.
As far as I understand, you have 2 ports that you want to use for:

  • vmbr0 (VM traffic) (you can put the IP for the UI on that device)
  • Corosync as well (advice: use a Linux VLAN to separate it from the vmbr0 traffic)
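A minimal sketch of that VLAN separation, assuming ifupdown2 (the Proxmox VE default) and with the VLAN tag 50 and all addresses as example values:

```text
# /etc/network/interfaces (fragment)
auto vmbr0
iface vmbr0 inet static
    address 192.168.1.11/24
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0

# Corosync on its own VLAN on top of the same bridge
auto vmbr0.50
iface vmbr0.50 inet static
    address 10.0.50.11/24
```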

So your total number of nics per NODE is:

  • 2x 10Gbit (VM/UI/COROSYNC/BACKUP) Ports
  • 2x 10Gbit (CEPH) Ports

Is this correct? Or do you have 6 10Gbit ports? Please make a list of available ports per node.

Idk if you saw but I did list all my hardware a few comments up for you.

I do think my servers would have to run SAS enterprise SSDs. I don't think I have the PCIe lanes to run very many U.2 drives. What I'm finding online are refurbished drives. I'm pretty sure the backplanes are only 6Gbit; the RAID/HBA cards are 12Gbit or two U.2, but I do not know if those will run in my backplanes.

Yeah, I would go for SAS SSDs (8x per node) instead of 2x U.2 per node.
 
3 nodes for now.

You won't be able to set up a 3-node Ceph cluster across 2 rooms; this is the most important thing in this whole thread. It just does not work, you need 3 rooms. If you're not able to provide 3 locations for your servers, you need to put them in one room.
 
Ports per server:

1G x 2
10GbE RJ45 x 4 (two dual-port cards)
10G SFP+ x 4 (two dual-port cards)

The plan was to use the SFP+ over fiber for the Ceph cluster network, 10G RJ45 for VM/backup/corosync, and 1G for management.

I was going to bond the 1G ports as active/backup for management. For Ceph, bond the SFP+ 10G ports (4 in total) as pairs: a bonded pair on one switch and a bonded pair on another switch in an active/backup arrangement (one switch with two ports per server as primary, and a second with two ports per server as backup). Not sure how to handle the RJ45 10G ports. I thought I could bond them all somehow.

First, I've learned I cannot use two rooms. Could you explain why it doesn't work? I don't mind adding a third room, but I would need to cable everything, add another rack and UPS, etc. Currently my two rooms are about 50-60 meters apart. Is this a latency or power issue? My plan had been node 1 and the backup server in one room, and nodes 2 and 3 across the parking lot in another building, mainly for physical separation in case of fire or other catastrophic events. Originally I was just going to do a VMware two-node vSAN cluster: one server in each room and a witness.

I feel like I'm in over my head at this point. I have all this hardware, including the third server I bought to do three nodes. Would it be better to just put the servers in a rack in one building and the backup server in the other?

#3 above was to be used with #1 and #2: split the Ceph back-end traffic over two networks, one active and one backup. Two ports per server in LACP for the active network and two ports per server in LACP for the backup network, then bond the bonds as active/backup. These would be isolated networks.
 
I should add that if I lose power in either building we'd be shut down. I don't have backup generators for either building because in 25 years there we've experienced outages only a handful of times, and that was both buildings. We just closed up shop for the day on those days.
 
Let me see if I can simplify:

I have the following:

Each server has two 1G ports, 2 dual-port SFP+ NICs, and 2 dual-port 10G RJ45 NICs. I have two TP-Link SFP+ 8-port switches and two Netgear XS728T 28-port switches with 4 SFP+ ports each. Total of 10 ports per server: 2x 1G, 4x 10G Ethernet, and 4x 10G fiber.

Currently the Netgear switches are uplinked via LACP on two SFP+ ports. I intended to use these for public traffic and the TP-Link switches for back-end traffic (isolated networks).

Maybe that clears up my hardware. Now I just need to understand how best to use it.
 
OK. This helps. Best would be:

  • 1 port mgmt ui (1gbit)
  • 1 port corosync (1gbit) (corosync link0)
  • 2 ports for VMs (vmbr0 bond0) (10gbit)
  • 2 ports for Backup (bond1)
  • 2 ports for Ceph Public (10gbit fiber) (bond2) (corosync link1)
  • 2 ports for Ceph Cluster (10gbit fiber) (bond3)
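As a hedged sketch of the two Ceph bonds from the list above (the NIC names, addresses, and LACP mode are assumptions; active-backup bonding also works if the switches don't support LACP):

```text
# /etc/network/interfaces (fragment)
auto bond2
iface bond2 inet static
    address 10.10.10.11/24          # Ceph Public network
    bond-slaves enp65s0f0 enp66s0f0
    bond-mode 802.3ad               # LACP; switch ports must be configured to match
    bond-xmit-hash-policy layer3+4
    bond-miimon 100

auto bond3
iface bond3 inet static
    address 10.10.20.11/24          # Ceph Cluster network (isolated)
    bond-slaves enp65s0f1 enp66s0f1
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4
    bond-miimon 100
```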

First, I've learned I cannot use two rooms. Could you explain why it doesn't work? I don't mind adding a third room, but I would need to cable everything, add another rack and UPS, etc. Currently my two rooms are about 50-60 meters apart. Is this a latency or power issue? My plan had been node 1 and the backup server in one room, and nodes 2 and 3 across the parking lot in another building, mainly for physical separation in case of fire or other catastrophic events. Originally I was just going to do a VMware two-node vSAN cluster: one server in each room and a witness.

In Ceph you use a replication count of 3; that means each server gets one copy. You have three servers but only two rooms, which means one room would hold 2 servers. When that whole room fails, you'll lose two nodes out of three. You always need to make sure that a room failure, or a network failure between rooms, won't affect the storage and Proxmox VE cluster quorum.

In Ceph there are two parameters:

  • SIZE = 3 (this means all your data gets written three times)
  • MINSIZE=2 (this is default and should not be reduced!) (this means you always need to have 2 copies online and working)

SIZE minus MINSIZE = the number of servers you can lose without a complete storage downtime.
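The arithmetic above can be sketched in a few lines (toy arithmetic, not Ceph code):

```sh
# With SIZE replicas and MINSIZE required copies, the pool keeps
# serving I/O as long as online replicas >= MINSIZE, so you can
# lose SIZE - MINSIZE hosts.
SIZE=3
MINSIZE=2
FAILED=1                      # hosts currently down
ONLINE=$((SIZE - FAILED))
if [ "$ONLINE" -ge "$MINSIZE" ]; then
    echo "degraded but writable ($ONLINE of $SIZE copies online)"
else
    echo "I/O blocked ($ONLINE of $SIZE copies online)"
fi
```

With FAILED=1 this prints "degraded but writable (2 of 3 copies online)"; with FAILED=2 it would report blocked I/O.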

I feel like I'm in over my head at this point. I have all this hardware including the third server I bought to do three nodes. Would it be better to just put the servers in a rack in one building and backup server in the other?

You need to put the three nodes in one room, or one node per room, but then you need three rooms. If you put them in one room, you don't have any infrastructure you can recover to in case of a disaster (fire, long power outage)... The backup server itself won't help you out a lot when you have nothing to recover to. (That's why some people install PBS on top of PVE and use fast NVMe drives for backup, so they can disaster-recover from PBS to PVE on the same server!)

#3 above was to be used with #1 and #2. Split the ceph back end traffic over two networks, one active and one backup. Two ports per server in LACP each for active network and two ports per server in LACP backup network, then bond the bonds as active/backup. These would be isolated networks.

I don't think this will work, and I would not recommend trying it.
 
OK. This helps. Best would be:

  • 1 port mgmt ui (1gbit)
  • 1 port corosync (1gbit) (corosync link0)
  • 2 ports for VMs (vmbr0 bond0) (10gbit)
  • 2 ports for Backup (bond1)
  • 2 ports for Ceph Public (10gbit fiber) (bond2) (corosync link1)
  • 2 ports for Ceph Cluster (10gbit fiber) (bond3)
Ok this makes more sense to a newbie like me. Now just to be sure how I'm setting it up:

1) The 1G management port gets my main IP and DNS hostname
2) The 1G port for corosync link0 (is this set up when I build the Proxmox cluster, or beforehand?)
3) bond0 gets an IP
4) bond1 gets an IP
5) bond2 gets an IP and is added as corosync link1
6) bond3 gets an IP
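The corosync links from steps 2) and 5) are given when creating and joining the cluster. A sketch with example values (the cluster name and the 10.0.50.x link0 / 10.10.10.x link1 addresses are assumptions):

```sh
# On the first node: create the cluster with two corosync links
pvecm create mycluster --link0 10.0.50.11 --link1 10.10.10.11

# On each further node: join via an existing member, giving
# that node's own link addresses
pvecm add 10.0.50.11 --link0 10.0.50.12 --link1 10.10.10.12
```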

Ok, so questions: I have two 8-port SFP+ switches. I assume #6 above can be separated from the main network. Should the switch for #5 above be uplinked to the public switch? I would assume yes.

If so, my switch SFP+ connections on the public side would be: 8-port public switch -> main switch via an LACP pair, then main switch to the secondary 28-port switch in the other building with a switch-to-switch LACP pair.

I can get around my two-room issue in one of two ways:

1) Cheapest: move a server to a third room. This is possible and could be well separated (50-100 meters), but three nodes is not ideal for production.

2) More expensive: set up three nodes in building one for now and add two nodes in building two later on.

#2 would give me two copies in each room if all five servers had a copy. It's also a good growth path, I think.

I don't have physical room in the backup server to make it a fourth node. Plus, it's one step down in CPUs. It has a mirrored SSD boot array and a 5-HDD RAID 5 array for backup storage with 1 hot spare.

I do have a chassis with just a motherboard, backplane, and drive trays. I could get another and populate both with drives, CPUs, RAM, and NICs.

However, option #1 doesn't make sense in my case for power outages, as the room would be in the same building. It could be well isolated from fire and other catastrophic events, though.

I think for now I'll just put three servers in a rack and get the cluster going. Once it's working I can justify setting up two more servers to ownership. Or heck, go all the way: set up three more servers and do two mirrored clusters, one in each building.
In Ceph you use a replication count of 3; that means each server gets one copy. You have three servers but only two rooms, which means one room would hold 2 servers. When that whole room fails, you'll lose two nodes out of three. You always need to make sure that a room failure, or a network failure between rooms, won't affect the storage and Proxmox VE cluster quorum.

In Ceph there are two parameters:

  • SIZE = 3 (this means all your data gets written three times)
  • MINSIZE=2 (this is default and should not be reduced!) (this means you always need to have 2 copies online and working)

SIZE minus MINSIZE = the number of servers you can lose without a complete storage downtime.



You need to put the three nodes in one room, or one node per room, but then you need three rooms. If you put them in one room, you don't have any infrastructure you can recover to in case of a disaster (fire, long power outage)... The backup server itself won't help you out a lot when you have nothing to recover to. (That's why some people install PBS on top of PVE and use fast NVMe drives for backup, so they can disaster-recover from PBS to PVE on the same server!)
Ok now I understand.
I don't think this will work, and I would not recommend trying it.
Your recommendations above render this idea obsolete. I'll go with two ports for the Ceph cluster network and two for Ceph public, as you recommend.

Thank you! I think I now understand, finally!
 
Ok this makes more sense to a newbie like me. Now just to be sure how I'm setting it up:

1) The 1G management port gets my main IP and DNS hostname
2) The 1G port for corosync link0 (is this set up when I build the Proxmox cluster, or beforehand?)

Yes, do all the networking before starting the clustering. Do it via the web UI only.

3) bond0 gets an IP
4) bond1 gets an IP
5) bond2 gets an IP and is added as corosync link1
6) bond3 gets an IP

Correct.

Ok, so questions: I have two 8-port SFP+ switches. I assume #6 above can be separated from the main network. Should the switch for #5 above be uplinked to the public switch? I would assume yes.

No, 5) does not need a connection to anything else.
If so, my switch SFP+ connections on the public side would be: 8-port public switch -> main switch via an LACP pair, then main switch to the secondary 28-port switch in the other building with a switch-to-switch LACP pair.

As I said, no 3-node cluster across two rooms :)

I can get around my two-room issue in one of two ways:
1) Cheapest: move a server to a third room. This is possible and could be well separated (50-100 meters), but three nodes is not ideal for production.
What do you mean by that? There's no issue using 3 nodes in production.

2) More expensive: set up three nodes in building one for now and add two nodes in building two later on.
This also does not work later on. You need to think about every possible downtime scenario: if you use three nodes in one room and that room fails, you still have the same problem I already explained :) Two-room setups start with Room 1: 2 nodes | Room 2: 2 nodes | Room 3: quorum node, using a SIZE=4 MINSIZE=2 setup (with a custom CRUSH rule), or with 3 rooms, each holding a full PVE/Ceph node (including storage, not only a quorum-purpose host) and SIZE=3 MINSIZE=2.

#2 would give me two copies in each room if all five servers had a copy. It's also a good growth path, I think.
By default (without custom CRUSH rules) you would still write 4 copies, and they can land anywhere if you don't define room entities in the CRUSH map. Doing a 5-way replication (you say: each server has a copy) is absolute OVERKILL. Go for a 4-way mirror (2 copies per room) while keeping the quorum node in room 3. The quorum node can be something really cheap, as long as it is on the corosync and Ceph networks.
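For reference, the "two copies per room" placement for such a SIZE=4 pool could look roughly like this as a custom rule in the decompiled CRUSH map (the rule name and id are placeholders, and the hosts must first be placed under room buckets):

```text
rule replicated_two_rooms {
    id 2
    type replicated
    step take default
    step choose firstn 2 type room
    step chooseleaf firstn 2 type host
    step emit
}
```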

I think for now I'll just put three servers in a rack and get the cluster going. Once it's working I can justify setting up two more servers to ownership.
Yeah, go for a single-room setup, get familiar with Ceph, and do lots of testing before going live with anything important. As I said above, you would then need 2 more servers: one with OSDs, and a second just for quorum (in a third room).
Or heck, go all the way: set up three more servers and do two mirrored clusters, one in each building.
Be careful with the wording. Proxmox VE does not offer mirroring between two completely separate clusters. The example we talked about here is ONE cluster, but with 6 nodes that are fully usable and available, using intelligent replica placement so you can lose up to a whole room without much downtime.

There's a way to set up mirroring between two separate clusters (rbd-mirror), but this is a lot more complex than Ceph already is for beginners.

Thank you! I think I now understand, finally!
You're welcome. Ceph is complex in the beginning, but you'll love it once you have a perfect setup and see how easy and secure it is (if set up correctly) :)
 
