Limiting iSCSI access to only one node in a cluster

kchouinard

New Member
Feb 15, 2023
I am planning a cluster with enough nodes and iSCSI disks that, if I allow every node to access every disk, it will exceed the allowed number of connections to my SAN.
The cluster will have roughly 10 nodes and 200 iSCSI disks. With 8 connections per disk on each node (8x10x200), that comes to 16K connections, but the SAN only allows 15K. I am using Nimble hybrid SANs.
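Here is the back-of-the-envelope math as a quick script, using my own numbers (adjust as needed):

```python
# Quick sanity check of the iSCSI connection budget described above.
# Numbers are from my planned setup; the limit is the Nimble's cap.

nodes = 10            # cluster nodes
disks = 200           # iSCSI LUNs (one per VM disk)
paths_per_disk = 8    # multipath sessions per LUN per node
san_limit = 15_000    # max connections the array will accept

total = nodes * disks * paths_per_disk
print(f"connections if every node logs into every LUN: {total}")  # 16000
print(f"over budget by: {total - san_limit}")                     # 1000

# If only the node currently running the VM holds the sessions:
single_node = disks * paths_per_disk
print(f"connections with single-node access: {single_node}")      # 1600
```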

One possible solution is to allow access (at the SAN level) only to the node currently running the VM, and to also grant access to the target node just prior to migrating a VM. This would reduce the number of connections by a factor of 10.

The result will be that nodes will have many disks with "Status: unknown".

Will this cause instability? My assumption is that the nodes will keep checking the status of disks they cannot reach, and I am concerned that they may eventually become unstable.

I have also discovered that you have to give access to at least the first node in a cluster to make the LUN target discoverable. I could then remove access to the first node after adding the disk. This seems to work, but as stated before I am not sure it's a good idea. :)
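For what it's worth, the host-side part of that dance looks roughly like the sketch below. The IQN and portal are placeholders, and the Nimble-side ACL grant/revoke happens outside of this (in the Nimble UI/CLI):

```python
#!/usr/bin/env python3
"""Sketch of the host-side steps when moving a LUN's access between nodes.

Assumes open-iscsi is installed. The SAN-side ACL change (granting or
revoking initiator access on the Nimble) is done separately, before or
after these calls. Target IQN and portal IP are placeholders.
"""
import subprocess

PORTAL = "192.0.2.10"                                  # SAN discovery IP (example)
TARGET = "iqn.2007-11.com.nimblestorage:example-vol"   # placeholder IQN


def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


def attach(target=TARGET, portal=PORTAL):
    # Discover targets on the portal, then log in to the one we want.
    run(["iscsiadm", "-m", "discovery", "-t", "sendtargets", "-p", portal])
    run(["iscsiadm", "-m", "node", "-T", target, "-p", portal, "--login"])


def detach(target=TARGET, portal=PORTAL):
    # Log out and delete the node record so the host stops retrying
    # once the SAN-side ACL is revoked (avoids the failure-message storm).
    run(["iscsiadm", "-m", "node", "-T", target, "-p", portal, "--logout"])
    run(["iscsiadm", "-m", "node", "-T", target, "-p", portal, "-o", "delete"])


if __name__ == "__main__":
    attach()   # on the node that will receive the VM
    # ... migrate the VM, then run detach() on the old node
```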
 
- We are running KVM (on CentOS) directly with Nimble storage. Each LUN is a drive in a VM.
Nimble does block-level multi-datacenter replication, so we have long-distance asynchronous DR within 15 minutes. Nimble provides only SAN, not NAS.

- The iSCSI/multipath setup uses 4 connections per iface to get 100K IOPS out of the SAN. With two NICs per server, that makes 8 connections.

- I recognize we will lose HA, snapshots, and backups at the Proxmox level with this setup. We already have the Nimble infrastructure and we require the long-distance replication. We would do backups and snapshots at the SAN level; failover would be manual. I haven't looked deeply into Blockbridge to see if it is a sufficient replacement for the Nimble, but even so that would be down the road.
 
Thank you for clarifying your setup @kchouinard. So your Nimble is connected to the network via 4 interfaces, each with its own IP, producing an 8-connection mesh. Are those Nimble interfaces 1 Gbit? I am surprised you need 4 of them to reach 100K IOPS; a single 10G connection should handle 250K 4k IOPS without a problem, and a basic dual-path multipath setup gives 500K IOPS.
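For reference, the 250K figure is just line-rate arithmetic with some headroom left for protocol overhead:

```python
# Back-of-the-envelope: 4k IOPS that fit through a single 10G link
# (line rate only; iSCSI/TCP/Ethernet header overhead pushes the
# practical ceiling somewhat lower -- hence the ~250K figure).
link_bps = 10e9          # 10 Gbit/s
io_size_bits = 4096 * 8  # 4 KiB reads/writes

print(int(link_bps / io_size_bits))   # ~305,000 IOPS at line rate
```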

That said, I am not trying to sell you on Blockbridge. You have invested in Nimble, and I completely understand wanting to get the ROI from them.

I ran a quick test with PVE and raw non-orchestrated Blockbridge iSCSI. I manually created LUNs and attached them to the VMs (the direct LUN use-case you refer to). As soon as I attached a disk to a VM, iSCSI sessions were instantiated on all nodes in my cluster. So you are correct that the configuration will overload your Nimble.

Can you make it work with careful micromanagement of the connections? It seems so. But you might be the only person in the world doing this.

Keep in mind that your other nine nodes will continuously attempt to connect to storage and fail. Assuming 20 VMs per node, this creates a continuous storm of ~8 (connections) x 9 (nodes) x 180 (disks) login attempts that fail at the Nimble and generate 12,960 recurring failure messages on your PVE hosts. You will also need to be very careful with ACLs to prevent the other nodes in the cluster from connecting when you open the ACLs to move a disk between hosts; that is, only open access for a single IP at a time.
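Spelled out, with the same assumptions:

```python
# Recurring-failure math under the single-node-access scheme
# (assumptions from above: 20 VMs per node, 1 disk per VM).
paths_per_disk = 8      # iSCSI sessions each node tries to open per LUN
locked_out_nodes = 9    # nodes denied at the Nimble
disks = 180             # LUNs those nodes keep retrying

print(paths_per_disk * locked_out_nodes * disks)   # 12960 failing logins, repeating
```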

To answer your question about whether you should expect instability: one would need to run such a setup for a prolonged time to shake out any possible leaks, race conditions, etc. Frankly, I'd be concerned about the side effects of ~13K continuous connection failures. I think it's safe to say that it hasn't been tested like that.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Thanks bbgeek17,

We have two 10G NICs per node. I am following Nimble's suggested configuration. At one point (many years ago) I tested with only one connection per NIC, which is how Nimble's automated configuration tool set up the node, and the result was poor performance (~20K IOPS on a 100K-IOPS SAN). I asked Nimble support why their best practice differed from their configuration tool, and they gave me some answer about the config tool being just a starting point. I may need to revisit this.

We already have to do quite a bit of micromanagement now, attaching volumes to hosts manually with no live migration, so there is some benefit to the Proxmox interface. With Proxmox, the only manual step when moving a VM to another node would be granting it access on the Nimble. I was hoping there would be a pre/post-migration hook we could script the access with, but I've read that it's still on the drawing board.

In my current environment, if you don't clean up after yourself when moving volumes, the node continually tries to connect. This causes a lot of connection-failure messages and eventually locks up the node.

I have a 3-node test cluster set up with 100 temporary iSCSI disks. I have limited access to only the first two nodes to see how the third node copes. syslog has more than 500K failures so far today. Not good.

My other idea is to have smaller clusters with only 2-3 nodes, leaving the disks attached all the time. I could bring in a node if I need to migrate VMs.
So the math would be:
3 nodes per cluster x 66 iSCSI disks per cluster x 8 connections per disk x 3 clusters (3 * 66 * 8 * 3) = 4,752 connections. That would be doable.
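As a sanity check, comparing the two layouts under the same 8-path assumption:

```python
# Connection count for the small-cluster layout vs. the single big cluster.
nodes_per_cluster = 3
disks_per_cluster = 66     # ~200 LUNs split across 3 clusters
paths_per_disk = 8
clusters = 3
san_limit = 15_000

small = clusters * nodes_per_cluster * disks_per_cluster * paths_per_disk
big = 10 * 200 * paths_per_disk        # the original 10-node / 200-disk plan

print(f"three 3-node clusters: {small} connections")   # 4752
print(f"one 10-node cluster:   {big} connections")     # 16000
print(f"SAN headroom left:     {san_limit - small}")   # 10248
```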

Thanks for your thoughts on this.
 
The good news is that you can feel confident that Proxmox supports your scale of VMs and disks without a problem (once you've figured out the storage bits). We've got a few customers north of 2K VMs/Disks per cluster using iSCSI. That said, with how we manage connections, the connection count is always (disks * paths). NVMe/TCP is slightly different because there are many more paths. Still, the connection count is independent of the number of nodes.

Good luck. Feel free to reach out if you need help.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox