NFS session trunking / Multipathing / MPIO

Xandrios

New Member
Mar 11, 2024
Hello,

I've finally been able to get my hands on a few servers to test Proxmox for potential future deployments. Currently working with two HPE DL360 Gen10's, and a Gen11 will be available in a week. So far, not seeing any hardware issues on the Gen10's (which is great). Using the P408i RAID controller with LVM-thin locally, and looking at NFS for shared storage.

I'd like to use NFS as it can handle a less stable storage network, meaning we can deploy on standard L2/L3 switches. I do, however, want redundant network paths (two switches), since a single switch could take the cluster down when there is an issue or during maintenance. Doing this 'properly' would require LACP towards two switches, with MLAG between the switches. That way a switch can fail completely and things keep working.

However, the MLAG requirement means that 'deploy using standard L2/L3 switches' is no longer feasible. So I would need NFS session trunking: two separate IP networks between the PVE hosts and the NAS/SAN, which allows one path (= one switch) to fail while the storage stays online. I'm aware that this requires a manual mount and the use of PVE's "directory" storage type.
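As a side note, a minimal sketch of how such a manual mount could then be registered in PVE as shared "directory" storage (the storage ID and path here are just examples):
Code:
# Assuming the NFS export is already mounted (e.g. via /etc/fstab) at /mnt/mpio_test,
# register that path as shared directory storage so PVE can place disks/backups on it:
pvesm add dir mpio-store --path /mnt/mpio_test --content images,rootdir,backup --shared 1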

My issue is with getting session trunking to work. I was initially testing with NFS 4.1 on a Synology, but replaced that with a Debian 12 server that supports 4.2 exports. Proxmox being Debian-based, this should be the perfect combination, right?

Tried various things. The NAS has IPs 10.200.0.200/24 and 10.202.0.200/24, while the PVE host has 10.200.0.101 and 10.202.0.101.
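For reference, the host side of that is nothing special; roughly this in /etc/network/interfaces (interface names are just examples):
Code:
# Two independent storage paths, one NIC per switch (interface names are examples)
auto ens1f0
iface ens1f0 inet static
        address 10.200.0.101/24

auto ens1f1
iface ens1f1 inet static
        address 10.202.0.101/24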

Double mount, which was recommended when using TCP:
Bash:
# mount -v -t nfs4 10.200.0.200:/var/nfs /mnt/mpio_test -o "nfsvers=4,minorversion=2,hard,proto=tcp,timeo=50,retrans=1,sec=sys,clientaddr=0.0.0.0,max_connect=8"
== OK
# mount -v -t nfs4 10.202.0.200:/var/nfs /mnt/mpio_test -o "nfsvers=4,minorversion=2,hard,proto=tcp,timeo=50,retrans=1,sec=sys,clientaddr=0.0.0.0,max_connect=8"
mount.nfs4: mount(2): Device or resource busy

Using trunkdiscovery:
Bash:
$ mount -v -v -v -t nfs4 10.202.0.200:/volume1/Proxmox_mpio /mnt/mpio_test -o "nfsvers=4,minorversion=1,hard,proto=tcp,timeo=50,retrans=1,sec=sys,clientaddr=0.0.0.0,nconnect=2,max_connect=8,trunkdiscovery"
==OK
# mount -v -v -v -t nfs4 10.200.0.200:/volume1/Proxmox_mpio /mnt/mpio_test -o "nfsvers=4,minorversion=1,hard,proto=tcp,timeo=50,retrans=1,sec=sys,clientaddr=0.0.0.0,nconnect=2,max_connect=8"
==OK

# nfsstat -m
/mnt/mpio_test from 10.202.0.200:/volume1/Proxmox_mpio
 Flags: rw,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,nconnect=2,timeo=50,retrans=1,sec=sys,clientaddr=0.0.0.0,local_lock=none,addr=10.202.0.200
/mnt/mpio_test from 10.200.0.200:/volume1/Proxmox_mpio
 Flags: rw,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,nconnect=2,timeo=50,retrans=1,sec=sys,clientaddr=0.0.0.0,local_lock=none,addr=10.202.0.200

# netstat -an | grep 2049
tcp        0    180 10.202.0.101:823        10.202.0.200:2049       ESTABLISHED
tcp        0      0 10.200.0.101:671        10.200.0.200:2049       ESTABLISHED
tcp        0    124 10.202.0.101:946        10.202.0.200:2049       ESTABLISHED
Having the trunkdiscovery option included on the second mount gives the resource-busy error. However, omitting that option on the second mount gives ...some result. We now have the share mounted via two networks, and netstat shows the corresponding established TCP connections. Note, however, that the effective addr flag is overridden to the .202 address for both mounts. Taking down the .202 network does not cause a switchover to the .200 network, so the redundancy concept does not work: the mount just hangs until connectivity is restored. So this doesn't seem to be the way either.

Searching this topic returns very little useful information. There are various articles by NAS vendors (e.g. NetApp) that explain how their client does this. In a desperate attempt I also tried VMware, and that does seem to work (at least the mount is created, and one path can be down without impacting the service). I'm at the point where I'm tempted to start making network captures to try to understand what's happening at the lower level... but I'd rather not.

Does anyone have any recent experience with session trunking for increased redundancy (and potentially higher transfer speeds)?

Thanks!
 
However the MLAG requirement makes that the 'deploy using standard L2/L3 switches' is no longer feasible.
Almost all business-grade switches support MLAG. That includes FS, Mellanox, Ubiquiti, etc. They are all standard L2/L3 switches. Unless you mean something else.

Regarding the actual multipath implementation, I don't have daily experience with it as we only do block storage. However, it seems that the mount paths need to be different and there are some presentation issues, at least as of the time this article was written:
https://www.suse.com/support/kb/doc/?id=000020404

It does look like this part of the technology is not as baked-in as v3, but that's to be expected given the age of each.

Good luck


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Thanks. On the switches: I am aware of FS being an affordable vendor, but Chinese 'tainted' hardware is unfortunately not an option with most of our customers. Mellanox seems to be quite overkill for a 3-node cluster, and Ubiquiti does not appear to support MLAG. We usually work with relatively basic 10Gbit Cisco switches, but in order to get MLAG it looks like we'd have to move to their Nexus lineup, which costs more than the hypervisors themselves. If there is any other vendor that does 24-port 10GBase-T switches supporting MLAG for under $5K a switch, I'd be very interested to hear about it.

However I personally always like software ;) If a problem can be resolved at a higher layer that would generally have my preference.

Your remark about the mountpoint is an important one. Most examples online indicate using the same mount point. But you are right, this is not required. Look at this:

Bash:
mount -v -t nfs4 10.202.0.200:/var/nfs /mnt/mpio_test -o "nfsvers=4,minorversion=2,hard,proto=tcp,timeo=50,retrans=1,sec=sys,clientaddr=0.0.0.0,max_connect=8"
mount -v -t nfs4 10.200.0.200:/var/nfs /mnt/mpio_test2 -o "nfsvers=4,minorversion=2,hard,proto=tcp,timeo=50,retrans=1,sec=sys,clientaddr=0.0.0.0,max_connect=8"

root@pve-gen10:~# netstat -an | grep 2049
tcp        0      0 10.202.0.101:806        10.202.0.200:2049       ESTABLISHED
tcp        0      0 10.200.0.101:776        10.200.0.200:2049       ESTABLISHED

root@pve-gen10:~# nfsstat -m
/mnt/mpio_test from 10.200.0.200:/var/nfs
 Flags: rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,max_connect=8,timeo=50,retrans=1,sec=sys,clientaddr=0.0.0.0,local_lock=none,addr=10.200.0.200
/mnt/mpio_test2 from 10.200.0.200:/var/nfs
 Flags: rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,max_connect=8,timeo=50,retrans=1,sec=sys,clientaddr=0.0.0.0,local_lock=none,addr=10.200.0.200

tcpdump -i any host 10.202.0.200 or host 10.200.0.200 and not arp &

ls /mnt/mpio_test
22:55:34.708813 IP 10.202.0.101.806 > 10.202.0.200.nfs: Flags [P.], seq 181:361, ack 253, win 501, options [nop,nop,TS val 142537818 ecr 1143967490], length 180: NFS request xid 1316892374 176 getattr fh 0,2/53
ls /mnt/mpio_test
22:55:35.603926 IP 10.200.0.101.776 > 10.200.0.200.nfs: Flags [P.], seq 361:541, ack 505, win 501, options [nop,nop,TS val 3561708625 ecr 3185362560], length 180: NFS request xid 667366993 176 getattr fh 0,2/53

So the same share is available via two mountpoints. While those mountpoints are configured to use different server IPs, NFS seems to round-robin over the two paths to distribute the 'ls' requests for a single mountpoint. That's very unintuitive but actually a good result!

However, what does not yet seem to work... is redundancy. Taking down one of the two paths makes everything hang. Even though the other path is still available, the failed path is not taken out of service, even though the TCP connection towards the (now unavailable) IP has long been closed. This definitely requires some more trial/testing... but it's a step in the right direction.
 
Mellanox seems to be quite overkill for a 3-node cluster
Sure, but then you are asking for a 24-port 10G switch, which also seems overkill :) I get it.
Take a look at the Mellanox SN2010: half-rack, 18 x 25Gb SFP28 + 4 x 100Gb QSFP28, which you can break out for more ports. A bit higher than $5K but infinitely better.
However what does not yet seem to work... is redundancy. Taking down one of the two paths makes everything hang. Even though the other path is still available, the failed path is not taken out of service. Even though the TCP connection towards the (now unavailable) IP has long been closed. This definitely does require some more trial/testing... but its a step in the right direction.
Do let us know what you find.

I find it curious that in the SNIA NFSv4 overview the stated goal of trunking is that it "brings a high degree of parallelization of storage and the potential for optimal resource consumption of network bandwidth." Nothing about redundancy.
Nothing about HA here either: "if you have the ability to employ additional NICs in your environment, trunking provides increased parallelism and performance beyond the capability of nconnect."

I also share your frustration with the lack of clear documentation. Another sign that the technology is not at enterprise maturity level yet.

Good luck.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Thanks for your help. That Mellanox SN2010 looks pretty good, I also like being able to combine two in 1U. But price-wise that may be an obstacle. I'll add it to the list of possible options, thanks.

I've been testing the NFS session trunking and must admit that it does look like the redundancy aspect has not been implemented, or at least I'm not able to get it to work. So close! That's a real pity. Things that I did learn:

  1. Session trunking or multipathing can be done straight from the Proxmox UI by just mounting the same share twice (at different mount points, using a different server IP for each mount). The kernel/NFS will use all of those IPs even though you're only using a single mountpoint for your VMs.
  2. If one of the IPs goes down, your IO operation will just hang until it's back. Other IO operations may still succeed using the other connection(s). However, any ops that were scheduled for the connection that has since gone down are blocked until the connection is re-established.
  3. Using a soft mount this works... well, differently. The first request sent to an unreachable IP eventually fails, and further requests do indeed only use the remaining working paths.

If only the IO action were rescheduled on one of the other working connections, this would work beautifully. Currently only a soft mount comes close, but in that case IO failures may be reported back to the guest, which is not acceptable as VM data may be lost.
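For reference, the soft-mount variant looks roughly like this (a sketch only; the exact timeo/retrans values are illustrative):
Bash:
# Soft mount: a request to an unreachable server eventually returns an error
# to the application instead of blocking forever (timeo/retrans are illustrative)
mount -t nfs4 10.200.0.200:/var/nfs /mnt/mpio_soft -o "nfsvers=4,minorversion=2,soft,proto=tcp,timeo=50,retrans=2,sec=sys"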

Has anyone tried something similar with iSCSI MPIO? When a path goes down, does the Multipathing daemon make sure that every single IO operation completes - for instance by re-scheduling it through another connection that does work?
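From what I've read of dm-multipath so far (untested on my side), that behaviour seems to be governed by the queueing policy; a sketch of the relevant multipath.conf fragment:
Code:
# Sketch only: with no_path_retry set to "queue", I/O in flight on a failed path
# is retried on the remaining paths, and only queues (blocks) once *all* paths
# are gone - conceptually similar to an NFS hard mount.
defaults {
    path_grouping_policy  multibus
    path_checker          tur
    no_path_retry         queue
}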

My preference is not iSCSI. I really like that with NFS things may block, but it is guaranteed that every bit written from within the VM is also committed to disk on the NAS. I'd rather have a short delay than corrupt data. At least, that's how I understand iSCSI to work: there is a chance of data corruption when connectivity is lost midway through a set of write actions, especially when those writes involve filesystem metadata and such.

I may have to look at network options and give up on software-based multipathing, or at least protocol-based multipathing. Perhaps there is a way to do this at the OS level instead. Or alternatively I still have to look at a hardware solution. That would probably best fit in a separate thread though.
 
Alright, so I had a bit of a think about this. With NFS, even if it's just a single connection, small network hiccups are no issue when using a hard mount. Sure, the storage IO will be delayed for a second, but that's about it. When network connectivity is available again, the existing connection will either resume or a new connection is established. I've seen both scenarios happen with the above tests. In any case, data integrity is not at risk.

So what we are looking for is path selection, and switching to another path when the active path goes offline. I was looking at an implementation at the NFS level, but this can actually be done from the OS too. I believe this is typically called teaming with an ARP watcher.

A poor man's implementation could be something like this:
  1. Shared storage is connected to two switches, one uplink per switch. Every PVE host is connected to the same two switches, one uplink per switch.
  2. We'll use a dedicated storage subnet. On each host both interfaces are configured with the same single IP address within the storage subnet. So every server has one IP address assigned, but it is active on both interfaces (and therefore both links).
  3. This means that each of these two interfaces may receive data for the IP address assigned to that host, and packets will be picked up normally irrespective of the link/interface they came in on. This gives us redundancy on the receiving side.
  4. On each host we use a static route to steer the outgoing traffic through one interface or the other. Using a periodic check (e.g. every 5 seconds) we determine which of the two links has end-to-end connectivity. If that changes, we update the route. This gives us redundancy on the sending side as well.

So, let's give this a try. The NAS is 10.200.0.200 and the PVE host I'm testing on is 10.200.0.101.

Code:
7: vmbr1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether d8:9d:67:25:4a:5e brd ff:ff:ff:ff:ff:ff
    inet 10.200.0.101/24 scope global vmbr1

8: vmbr2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether d8:9d:67:25:4a:5f brd ff:ff:ff:ff:ff:ff
    inet 10.200.0.101/24 scope global vmbr2

default via 10.200.0.1 dev vmbr0 proto kernel onlink
10.200.0.200 dev vmbr2 scope link
10.200.0.0/24 dev vmbr1 proto kernel scope link src 10.200.0.101
10.200.0.0/24 dev vmbr2 proto kernel scope link src 10.200.0.101

Deciding which interface to use for routing the outgoing traffic is fairly simple. You could use a ping or something similar to check connectivity towards a remote host. Alternatively, a simple ARP lookup may work even better, as it is lightweight and never ignored.

Code:
root@pve-gen10:~# arping -c 1 -i vmbr1 10.200.0.200
root@pve-gen10:~# echo $?
0
root@pve-gen10:~# arping -c 1 -i vmbr2 10.200.0.200
root@pve-gen10:~# echo $?
0

Normally both would succeed; if one path is not working, the corresponding arping returns an exit status of 1.

Changing a route is easily done as well:

Bash:
ip route change 10.200.0.200/32 dev vmbr1;
  arping -c 1 -i vmbr1 10.200.0.101;
  arping -c 1 -U -P -i vmbr1 10.200.0.101

16:22:59.120920 d8:9d:67:25:4a:5e > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 58: Request who-has 10.200.0.101 tell 10.200.0.101, length 44
16:23:00.212640 d8:9d:67:25:4a:5e > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 58: Reply 10.200.0.101 is-at d8:9d:67:25:4a:5e, length 44

ip route change 10.200.0.200/32 dev vmbr2;
  arping -c 1 -i vmbr2 10.200.0.101;
  arping -c 1 -U -P -i vmbr2 10.200.0.101

16:24:08.848713 d8:9d:67:25:4a:5f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 10.200.0.101 tell 10.200.0.101, length 46
16:24:09.937134 d8:9d:67:25:4a:5f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Reply 10.200.0.101 is-at d8:9d:67:25:4a:5f, length 46
We see the gratuitous ARP being forced out and updating all switches and hosts on the network instantaneously.

Having an NFS mount active on .101 (NAS at .200) works very well when a switchover happens. When both paths are available there is basically no downtime at all; the NFS TCP connections are not interrupted. If there is actual downtime on the primary link and a switch is made to the secondary link, the mount will block until the switch is made. If that happens within 30 seconds or so, the same TCP connection resumes; otherwise a reconnect may happen. After reconnecting, the blocked NFS IO ops return/continue without error.

Putting this in some kind of continuously running script on both the PVE nodes and NAS side may possibly actually work...
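A rough sketch of what I have in mind, using the same interface names, addresses and arping/ip commands as above (consider it a sketch, not production-ready):
Bash:
#!/bin/bash
# Poor man's path watcher: every 5 seconds, check end-to-end connectivity per
# interface via arping and (re)point the host route for the NAS at a working link.
NAS=10.200.0.200
SELF=10.200.0.101
PRIMARY=vmbr1
SECONDARY=vmbr2

while true; do
    if arping -c 1 -i "$PRIMARY" "$NAS" >/dev/null 2>&1; then
        WANT="$PRIMARY"
    elif arping -c 1 -i "$SECONDARY" "$NAS" >/dev/null 2>&1; then
        WANT="$SECONDARY"
    else
        WANT=""   # both paths down: leave the current route alone
    fi

    CUR=$(ip route get "$NAS" | awk '{for (i=1; i<NF; i++) if ($i == "dev") print $(i+1)}' | head -n1)
    if [ -n "$WANT" ] && [ "$CUR" != "$WANT" ]; then
        # 'replace' works whether or not the /32 host route already exists
        ip route replace "$NAS/32" dev "$WANT"
        # Gratuitous ARP so switches and the NAS learn the new port immediately
        arping -c 1 -U -P -i "$WANT" "$SELF" >/dev/null 2>&1
    fi
    sleep 5
done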

Few decisions to make:
  • The basic idea is very similar to Linux NIC teaming, but a script may be more flexible, though it is a bit more hacky as well.
  • How to make sure the script remains running/working. Initiate it from cron every minute and exit if it is already running? That would make sure the script restarts within a minute when killed. A poor man's watchdog.
  • How to handle a NAS that is set up in HA. It would surely work, but may require some extra checks when an HA switchover happens.
  • Which remote IPs to test against for end-to-end connectivity?
    • On the PVE side we could possibly look at the mountpoints, so the configuration stays limited (and adding extra storage mounts does not require config changes). We may have to think about startup, when routes are not yet set and mounts are being created (but likely failing due to not having a route...).
    • On the NAS side we could look at established NFS connections, but this would fail once the TCP connections have fully timed out and are gone.
  • Decide whether to set the route for the whole storage network at once, or for each remote system individually. The latter allows - when both links are working - having some remote systems reached over the first link while others use the second, which increases performance (see the example below).
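The per-remote-system variant of that last point would look something like this (the second target address is purely hypothetical):
Bash:
# Pin each remote storage system to its own link while both links are healthy
ip route replace 10.200.0.200/32 dev vmbr1   # NAS
ip route replace 10.200.0.210/32 dev vmbr2   # hypothetical second storage target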
Well... an interesting project. Not sure if it is feasible and suitable for a production environment, but an interesting topic nonetheless ;)
 
I'd like to use NFS as it can handle a less stable storage network, meaning that we can deploy using standard L2/L3 switches. I do however like to use redundant network paths (Two switches) as a single switch could take the cluster down when there is an issue, or during maintenance. In order to do this 'properly' it would require LACP to two switches, and having MLAG in between the switches. That way a switch may fully fail and things will keep working.

However the MLAG requirement makes that the 'deploy using standard L2/L3 switches' is no longer feasible. So I would need to do NFS session trunking: Have two separate IP networks between PVE's and NAS/SAN, which allows one path (=switch) to fail and still keep the storage online. I'm aware that this requires a manual mount and use PVE's "directory" storage method.

Why is MLAG/LACP a requirement? If the primary objective is that things keep working when a switch fails, then you could use bond-mode balance-tlb instead of 802.3ad and do LACP between the switches. The balancing might not be as equal, but unless you are saturating your links during normal traffic it's generally good enough; if not, more ports or bumping from 10 to 25GbE or faster might be a better option.
 
:D

All of them are Chinese manufactured.
So true... often the silicon is mostly the same, just a different OEM label on the box, maybe with different buffers and table sizes...

If you're not using FS directly, you can at least reference their web pages and pricing to get the vendors you do trust to come down to more reasonable prices...
 
Thanks. On the switches; I am aware of FS being an affordable vendor but Chinese 'tainted' hardware is unfortunately not an option with most of our customers. Mellanox seems to be quite overkill for a 3-node cluster, and ubiquity does not appear to support MLAG. We usually work with relatively basic 10Gbit Cisco switches but in order to get MLAG it looks like we'd have to move to their Nexus lineup, which cost more than the hypervisors themselves. If there is any other vendor that does 24 port 10Gbase-T switches supporting MLAG for under $ 5K a switch I'd be very very interested to hear about that.

However I personally always like software ;) If a problem can be resolved at a higher layer that would generally have my preference.

Your remark about the mountpoint is an important one. Most examples online indicate using the same mount point. But you are right, this is not required. Look at this:

Bash:
mount -v -t nfs4 10.202.0.200:/var/nfs /mnt/mpio_test -o "nfsvers=4,minorversion=2,hard,proto=tcp,timeo=50,retrans=1,sec=sys,clientaddr=0.0.0.0,max_connect=8"
mount -v -t nfs4 10.200.0.200:/var/nfs /mnt/mpio_test2 -o "nfsvers=4,minorversion=2,hard,proto=tcp,timeo=50,retrans=1,sec=sys,clientaddr=0.0.0.0,max_connect=8"

root@pve-gen10:~# netstat -an | grep 2049
tcp        0      0 10.202.0.101:806        10.202.0.200:2049       ESTABLISHED
tcp        0      0 10.200.0.101:776        10.200.0.200:2049       ESTABLISHED

root@pve-gen10:~# nfsstat -m
/mnt/mpio_test from 10.200.0.200:/var/nfs
 Flags: rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,max_connect=8,timeo=50,retrans=1,sec=sys,clientaddr=0.0.0.0,local_lock=none,addr=10.200.0.200
/mnt/mpio_test2 from 10.200.0.200:/var/nfs
 Flags: rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,max_connect=8,timeo=50,retrans=1,sec=sys,clientaddr=0.0.0.0,local_lock=none,addr=10.200.0.200

tcpdump -i any host 10.202.0.200 or host 10.200.0.200 and not arp &

ls /mnt/mpio_test
22:55:34.708813 IP 10.202.0.101.806 > 10.202.0.200.nfs: Flags [P.], seq 181:361, ack 253, win 501, options [nop,nop,TS val 142537818 ecr 1143967490], length 180: NFS request xid 1316892374 176 getattr fh 0,2/53
ls /mnt/mpio_test
22:55:35.603926 IP 10.200.0.101.776 > 10.200.0.200.nfs: Flags [P.], seq 361:541, ack 505, win 501, options [nop,nop,TS val 3561708625 ecr 3185362560], length 180: NFS request xid 667366993 176 getattr fh 0,2/53

So the same share is available via two mountpoints. While those mountpoints are configured to use different server IPs, NFS seems to utilize round-robin over the two paths to distribute the 'ls' requests for a single mountpoint. That's very un-intuitive but actually a good result!

However what does not yet seem to work... is redundancy. Taking down one of the two paths makes everything hang. Even though the other path is still available, the failed path is not taken out of service. Even though the TCP connection towards the (now unavailable) IP has long been closed. This definitely does require some more trial/testing... but its a step in the right direction.
MikroTik CRS518 - I'm using two of those via MLAG.
They are really good switches for the money, but be aware: no L3 routing, no RoCEv2/RDMA, no stacking, basically nothing beyond L2 capabilities.
They work great.

Another option is the Dell S5212F-ON with SONiC firmware.
https://forum.level1techs.com/t/del...c-setup-guide-25gbe-100gbe-on-a-budget/198643
Those are pretty cheap and support MLAG along with a ton of other features; in my opinion that's a better solution than the MikroTik switches because of the L3 capabilities and many other features! Still no stacking, but RDMA/RoCEv2 is supported.
Nowadays I would go for RoCEv2 even if I don't need it or have anything to use it with, since there are no downsides.
With Intel E810 cards, for example, you can pass through (SR-IOV) a virtual function to a Windows VM and use SMB Direct (RDMA) there for a file server. Sadly this is not supported by Samba.
That's the only thing I've experimented with, but I'm sure RDMA (RoCEv2) will become more and more important over time.

Never seen a Ubiquiti device that supports MLAG, lol.

As for NFS, I never knew it supported multipath; to my "limited" knowledge, multipath only existed for iSCSI. That's what we used an eternity ago until better solutions came up. iSCSI is dead anyway, I think; especially with fast flash storage it's useless.
NFS is still useful, because there is literally no other way to share a filesystem with multiple servers at the same time at a reasonable speed.

Dunno if this post helps, but maybe you hadn't looked into those two switches.
Cheers
 
Why is mlag/lacp a requirement? If the primary objective is things keep working if a switch fails, then you could use bond-mode balance-tlb instead of 802.3ad and do lacp between the switches. The balancing might not be as equal, but unless you are saturating your links during normal traffic it's generally good enough, and if not then more ports or bumping from 10 to 25gbe or faster might be a better option.
There is even an easier solution, called spanning tree, which is exactly meant for having multiple switches.
But that is an extremely resource-inefficient solution.

And how does balance-tlb help? You cannot simply connect 2 cables to 2 different switches and use balance-tlb on Proxmox. How would that even work? You can only bond to a single switch, unless your switches are stackable or support MLAG.
 
There is even an easier solution, called spanning tree. Which is exactly meant to have multiple switches.
But that is an extreme ressource inefficient solution.

and how does balance-tlb helps? you cannot connect simply 2 cables to 2 different switches and use balance-tlb on proxmox. How should that even work? You can Bond only to a single switch, except your switches are stackable or support Mlag.

You can connect as many cables as you want. IE:

auto bond10
iface bond10 inet manual
bond-slaves eno1 eno2 eno3 eno4
bond-miimon 100
bond-mode balance-tlb
mtu 9000
#Bond 4x 10GbE uplinks

Yes, you can simply connect 2 cables to 2 different switches (or 2 ports to one switch and 2 to another). The switches do need to be connected together via spanning tree / LACP / etc. It works by MAC address: the switches learn which port was used to send the traffic and reply back on the same port. No configuration is needed on the switches, not even LACP. Incoming broadcasts are duplicated on all ports, but the bond is fairly good at merging them. That's the point of balance-tlb: it does not require any special network-switch support. I'm not sure how it handles a switch that is not functioning but still providing link, but in my testing it works well if the switch is dead / the cable is pulled.

As it balances by MAC address, it isn't very good for a single machine. However, as Proxmox is hosting several VMs that each have a unique MAC address, it works well enough at distributing the load if you have a lot of MACs tied to it.
 
You can connect as many cables as you want. IE:

auto bond10
iface bond10 inet manual
bond-slaves eno1 eno2 eno3 eno4
bond-miimon 100
bond-mode balance-tlb
mtu 9000
#Bonnd 4x10gbe uplinks

Yes, you can simply connect 2 cables to 2 different switches (or 2 ports to one switch and 2 to another). The switches do need to be connected together via spanning tree / LACP / etc. It works by MAC address. The switches will learn what was used to send the traffic to and reply back same port. No configuration needed on the switches, not even LACP. Incoming broadcasts are duplicated on all ports, but it's fairly good at merging them. That's the point of balance-tlb, it does not require any special network-switch support. I'm not sure if it handles if a switch is not functioning but still providing a link, but in my testing it works well if the switch is dead / cable pulled.

As it uses MAC addresses it isn't very good for a single machine. However, as proxmox is hosting several VM, they each have a unique MAC address and so it works well enough on distributing the load if you have a lot of MACs tied to it.
But spanning tree will block your connection between the switches.
If it didn't, you'd have a loop and a broadcast flood.

What I mean is, your switches are connected somehow to a 3rd switch or a gateway etc...
Or maybe this is simply too simple for me, because at my company, where I need redundancy, everything is connected to a core router (which is a stack of 2 switches over fiber in 2 different buildings).

So I would need to connect the dumb switches to the core router anyway, and if I connect the dumb switches together, RSTP should disable one of the 3 connections.

That's the usual scenario in my head; in companies where you need redundancy, this will be a shitty solution.
It will work, because one connection is disabled and there is still a path, but that path could go through the core router in the worst case.

However, you're right, it will work: the packets will find a way, and take the same way back, like you said, so there isn't even an asymmetrical packet flow.

But hell, that's a crap solution, sorry
 
But spanning tree will block your connection between the switches.
If it wouldn't then you'll have a loop and a broadcast flood.

What i mean is, you switches are connected somehow to a 3rd switch or a gateway etc...
Or maybe this is simply too simple for me, because on my company where i need redundancy everything is connected to a core-router. (Which is a stack of 2 switches over fiber in 2 different buildings)

So i would need to connect the dumb switches to the core-router anyway and if i connect the dumb switches together, rstp should disable any of the 3 connections.

That's in my head a usual scenario, if you have company's where you need redundancy, this will be a shitty solution.
It will work, because one connection is disabled and there is still a way, but the way could go through the core-router in the worst case.

However you're right, it will work, because the packets will find a way, and take even the same way back, like you said, so there isn't even an asymmetrical packet flow.

But hell, that's a crap solution, sorry

A balance-tlb bond will not forward packets between the links, so there will be no loop and spanning tree on the switches will not block anything. The ports on the Proxmox server will only send one broadcast, so no flood. All ports will receive broadcast traffic, but the bond removes the duplicates so the VMs don't see a storm. The Proxmox host doesn't take part in RSTP with this type of link.

If the switches are not connected directly, then they can be connected to a 3rd switch (or 3rd and 4th switches) and use LAGs and/or RSTP or MLAG or another method to prevent loops if you do 4 switches. If everything is redundant, I would assume each switch has at least two uplinks to two core switches.

It's not ideal in that, as you said, traffic could go from one switch through a core switch and then out the other switch. You would expect that to happen about 50% of the time, but every time I measure it I get lucky and more than 50% takes the single-hop path. Either way, even if only 50% takes the better path, it's less shitty than spanning tree or active/passive failover, where 50% of the edge links would be completely disabled. I have higher-speed links between switches, or at least LAGs if not faster links, and monitor them so they don't saturate. It depends on your workload, but for me more traffic goes to/from the core than between peer hosts. I tend to architect the VM distribution so most peer traffic stays within the host and doesn't even hit the switches.

It might not be as ideal as MLAG, but it's far better than putting spanning tree on Proxmox, and if you don't have switches that can do MLAG it works rather well at removing that requirement.
 
Why is mlag/lacp a requirement? If the primary objective is things keep working if a switch fails, then you could use bond-mode balance-tlb instead of 802.3ad and do lacp between the switches. The balancing might not be as equal, but unless you are saturating your links during normal traffic it's generally good enough, and if not then more ports or bumping from 10 to 25gbe or faster might be a better option.
The objective is mainly redundancy, where the tricky part is detecting that one of the two switches has an issue.

When the server takes links out of service purely based on their link status (which is the default for balance-tlb), any fault scenario that leaves the server-facing links up is not handled. For instance, if the connectivity between the two switches fails, all regular ports may remain up. In that scenario, if two servers have elected different switches as their primary link, those two servers won't be able to communicate.

We've recently seen some of our Cisco 10Gbit switches behave very unpredictably, with some ports randomly failing. That turned out to be due to an overheating event. But it has shown that switches, besides being "OK" or "NOK", can also be "partly OK"... which is a terrible situation to be in.

Therefore the main thing of importance is validating the path between the two servers that are communicating, not only the path between server and switch. I've since realised that this can be achieved using the ARP monitoring mechanism of a Linux bond (instead of using link status). It is not natively supported by Proxmox, but I did get it to work for basic connectivity. I'm still working on getting the ARP requests to be VLAN tagged, which turns out to be a bit of a challenge, but it should theoretically be possible.
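For reference, the bond configuration I've been experimenting with looks roughly like this (a sketch; the ifupdown bond-arp-* options map to the kernel's arp_interval / arp_ip_target / arp_validate bonding parameters, and interface names are examples):
Code:
# Active-backup bond that validates each path with periodic ARP probes towards
# the NAS, instead of trusting carrier/link state (names/addresses are examples)
auto bond1
iface bond1 inet static
        address 10.200.0.101/24
        bond-slaves ens1f0 ens1f1
        bond-mode active-backup
        bond-arp-interval 1000
        bond-arp-ip-target 10.200.0.200
        bond-arp-validate all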

Microtik CRS518, im using 2 of those via MLAG.
That are really good switches for the money, but be aware, no L3 routing or ROCEv2 or RDMA or Stacking or basically anything beyond L2 Capabilities.
They work great.

Another Option are HP S5212F-ON with Sonic Firmware.
https://forum.level1techs.com/t/del...c-setup-guide-25gbe-100gbe-on-a-budget/198643
Those are pretty cheap and support MLAG with a ton of other features, in my opinion thats a better solution as those Microtik switches, because of L3 Capabilities and a lot of other Features! But still no Stacking. RDMA/Rocev2 is supported tho.

Dunno if that post helps, but maybe you didn't loked into those 2 switches.
Cheers
Thanks for the suggestions! That Dell one looks pretty good. We're using 10Gbit interfaces normally, but I'll keep this one in mind as a possibility. Would have to look into what SFP28 interface boards are going for these days.
 
The objective is mainly redundancy, where the tricky part is detecting that one of the two switches has an issue.

When the server takes links out of service purely based on their link-status (Which is the default for balance-tlb), this means that any fault scenario which leaves the server links active, is not being handled. For instance, if the connectivity between the two switches fails - then all regular ports may remain up. In that scenario, if two servers have elected different switches to be their primary link, that means these two servers won't be able to communicate.

We've recently seen some of our Cisco 10Gbit switches behave very unpredictable where some of the ports would randomly fail. Turned out to be due to some overheating event that occurred. However it has shown that switches, besides being "OK" or "NOK" can also be "Partly OK" .. which is a terrible situation to be in.

Therefore the main thing of importance is validating the path between the two servers that are communicating, not only the path between server and switch. I've later realised that this can be achieved using the arp checking mechanism on a linux bond (Instead of using link status). It is not natively supported by Proxmox, however I did get it to work for basic connectivity. I'm still working on getting the arp requests to be VLAN tagged which turns out to be a bit of a challenge, but it should theoretically be possible.


Thanks for the suggestions! That Dell one looks pretty good. We're using 10Gbit interfaces normally, but I'll keep this one in mind as a possibility. Would have to look into what SFP28 interface boards are going for these days.
Mellanox ConnectX-4 Lx cards are cheap as hell, but if you use those you will need workarounds, because they have bugs with tagged traffic.
If you don't have VLANs / tagged traffic, they are amazing - a lot better (faster/cooler) than any comparable Intel X5xx card.

None of the other cards have these bugs - not the Intel ones, not Broadcom, not ConnectX-5.

VLANs on ConnectX-4 Lx will work, but you have to enable promiscuous mode on the physical interface itself. Dunno if that's a really big downside, given how cheap and otherwise good those cards are.
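For example, something along these lines in /etc/network/interfaces takes care of it (the interface name is just a placeholder):
Code:
# Workaround for tagged traffic on ConnectX-4 Lx: force the physical port
# into promiscuous mode at ifup (interface name is a placeholder)
auto enp65s0f0
iface enp65s0f0 inet manual
        post-up ip link set dev enp65s0f0 promisc on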

I personally switched to the E810 for higher-end servers; on the lower-end ones I'm using those cheap Mellanox cards.

Cheers
 
I do however like to use redundant network paths (Two switches) as a single switch could take the cluster down when there is an issue, or during maintenance. In order to do this 'properly' it would require LACP to two switches, and having MLAG in between the switches. That way a switch may fully fail and things will keep working.
Active/passive bonds work over any switch(es), and will get you where you want to go. It's a totally 'proper' configuration.
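e.g. a plain active-backup bond in /etc/network/interfaces, with no switch-side config needed (interface names are placeholders):
Code:
# Active/passive: one link active at a time, failover on link loss,
# works with two completely independent switches
auto bond0
iface bond0 inet manual
        bond-slaves ens1f0 ens1f1
        bond-mode active-backup
        bond-miimon 100
        bond-primary ens1f0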
 
If there is any other vendor that does 24 port 10Gbase-T switches supporting MLAG for under $ 5K a switch I'd be very very interested to hear about that.
We're using Netgear's M4300-24X (XSM4324CS) switches; they stack perfectly and do what you're looking for. Probably the only weak point on the hardware side of these switches is the single PSU per unit. These switches cost around $5k.
 
