Proxmox Ceph cluster - MLAG switches choice

TwiX

Hi,

I plan to build a new Proxmox Ceph cluster (3 nodes).

LACP 2x25 Gbps for Ceph
LACP 2x10 Gbps for VMs

My question is about the switches: I definitely want MLAG-capable switches.

Mikrotik was my first choice, both for the feature set and the low budget.
I have 2 CRS310 and have set up a very simple test architecture (1 Proxmox node with one LACP bond and the 2 CRS310 in MLAG mode).
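For reference, the bond itself is the usual 802.3ad setup on the Proxmox side, roughly this shape in /etc/network/interfaces (interface names, address and MTU are only examples; the two members go to different switches, which share the same MLAG ID on the CRS310s):

auto enp65s0f0
iface enp65s0f0 inet manual

auto enp65s0f1
iface enp65s0f1 inet manual

auto bond1
iface bond1 inet static
        address 10.10.10.11/24
        bond-slaves enp65s0f0 enp65s0f1
        bond-mode 802.3ad
        bond-miimon 100
        bond-lacp-rate 1
        bond-xmit-hash-policy layer3+4
        mtu 9000
#Ceph network, 2x25G

bond-lacp-rate 1 (fast) makes the LACP partner time out a dead member after roughly 3 s instead of 90 s, which helps a little during switch reboots.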

I noticed that every time I reboot one of these Mikrotik CRS switches (to upgrade it or to simulate a crash, for example), I lose 1 or 2 pings at shutdown plus another 1-2 pings during boot.
It seems Mikrotik's MLAG is not as resilient as other manufacturers' implementations.

Losing one or two pings is probably OK for the VMs IMO, but I don't know what impact it would have on Ceph...
Upgrading the 2 Mikrotik CRS switches would therefore mean 4 complete Ceph outages during the upgrade process.

What kind of switches are you using for Ceph? And how do you handle switch upgrades?

Antoine
 
Not specific to Ceph, but for one cluster I went the used-Mellanox route with SN2410s. Relatively dirt cheap on eBay, and very low latency. I needed 2 sets of these; instead I bought 6. Two hot spares for 2 sets is enough redundancy :)
48x 25 Gbit and 8x 100 Gbit ports each.
 
Thanks
I have also worked with FS and with Dell OS10 S5248F-ON switches, which seem similar to the SN2410 you're using...
I don't recall that kind of behavior on each reboot with them...
 
I also use Mikrotiks for my network, but MLAG was unreliable when I tested it a few years ago, so I used the l3hw capabilities of the switches and created a redundant routed ECMP configuration instead. Works great for me...
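The exact setup wasn't shared, but for anyone curious what the host side of such a routed layout can look like, a minimal iproute2 sketch, assuming each node gets two small routed uplinks (one per switch) and reaches the Ceph subnet via both gateways:

ip addr add 10.0.1.2/30 dev enp1s0f0    # routed link to switch A (gateway 10.0.1.1)
ip addr add 10.0.2.2/30 dev enp1s0f1    # routed link to switch B (gateway 10.0.2.1)
# ECMP route towards the Ceph network via both switches
ip route add 10.10.10.0/24 \
    nexthop via 10.0.1.1 dev enp1s0f0 weight 1 \
    nexthop via 10.0.2.1 dev enp1s0f1 weight 1

With directly attached links, a switch failure takes the port down and the kernel stops using that next hop; for anything beyond that, people usually let a routing daemon such as FRR (OSPF/BGP) manage the routes instead of static entries.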
 
Hello.

I use Mikrotik hardware extensively, but I have never used MLAG.
What you describe seems normal. From the documentation (https://help.mikrotik.com/docs/spaces/ROS/pages/67633179/Multi-chassis+Link+Aggregation+Group):
heartbeat (time: 1s..10s | none; Default: 00:00:05): controls how often heartbeat messages are sent to check the connection between peers. If no heartbeat message is received for three intervals in a row, the peer logs a warning about potential communication problems. If set to none, heartbeat messages are not sent at all.
==> Set the heartbeat to 1 second, but you will still have 3 s of network loss.

Did you verify the loss time with only LACP on the path?

For other users: can you share the network loss time in case of one switch failure (if any)?

Note for tests:
ping -i 0.1 <dest_ip>
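If you want the actual outage duration rather than an eyeball count, a slightly extended variant (plain Linux iputils ping; intervals below 0.2 s may require root, and the log file name is just an example):

ping -D -O -i 0.1 <dest_ip> | tee mlag-failover.log
# -D prefixes every line with a Unix timestamp
# -O prints "no answer yet for icmp_seq=N" as soon as a reply is overdue
grep -c "no answer" mlag-failover.log    # lost probes x 0.1 s ~ outage duration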
 

Thanks!

I don't think the network downtime is more than 1 or 2 seconds with the default parameters.

The main question is how Ceph handles such network outages. I guess it is OK for Corosync and for the VMs as well, but when this happens the entire Ceph cluster is down twice for about 2 seconds.
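A 1-2 s blip should stay well under Ceph's OSD heartbeat grace (20 s by default), so OSDs shouldn't even be marked down. For planned switch upgrades the usual maintenance flags can still be set first; a sketch with standard ceph CLI commands (when to reboot which switch is of course up to you):

ceph osd set noout          # don't mark OSDs out if one is briefly flagged down
ceph osd set norebalance    # avoid data movement during the maintenance window

# upgrade/reboot switch 1, wait for the cluster to settle, then switch 2
ceph -s
ceph osd tree

ceph osd unset norebalance
ceph osd unset noout

# the grace period that governs when an OSD is marked down:
ceph config get osd osd_heartbeat_grace    # 20 s by default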

Moreover, Mikrotik stable updates are released very often, every 1-2 months.