MS-01 HCI HA Ceph Sanity Check

J-dub

Member
Dec 20, 2023
Okay, I am down to networking for my planned cluster, and before I buy the "stuff" I wanted to check in and make sure I'm not forgetting something or under/over-purchasing for my needs.

I have 3x Minisforum MS-01 w/ the Intel 13th-gen i9 CPU / 96GB RAM / Samsung enterprise 7.6TB U.2 SSD (for Ceph) / 2TB M.2 (for the OS). They have two 2.5Gbps and two 10Gbps ports and an empty PCIe 4.0 x8 slot.

For the Ceph private network I was planning on a MikroTik CRS504-4XQ-IN (4 ports at 100Gbps).
For the Ceph public network I was thinking of another MikroTik CRS504-4XQ-IN, but I'd need a dual 100Gbps PCIe x8 NIC and I'm not sure that's realistic (heat may be an issue too).
For Proxmox Corosync I thought one of the 2.5Gbps ports to a cheap 4-port switch would be fine.
For the VMs/containers and the backup server, a CRS309-1G-8S+IN (8 ports at 10Gbps).
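(For reference, here's roughly how I expect the two Ceph networks to land in the config once it's built; the subnets are just placeholders, nothing is decided yet.)

Code:
# /etc/pve/ceph.conf (excerpt) - placeholder subnets
[global]
    cluster_network = 10.10.10.0/24   # Ceph private/replication, on the first CRS504
    public_network  = 10.10.20.0/24   # Ceph public, on the second switch

# Corosync would get its own ring on the 2.5Gbps subnet, e.g. 10.10.30.0/24,
# chosen at cluster creation time with "pvecm create <name> --link0 <address>".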

For the dual 100Gbps NICs, the Mellanox MCX516A-CCAT ConnectX-5 EN is PCIe 3.0 x16, and I "think" that's "basically" the same bandwidth as PCIe 4.0 x8.

I can save money if 25Gbps is enough for the 3 Ceph OSDs I'm planning on; I can easily find a dual 25Gbps NIC for the nodes. The same seems to be true w/ 40Gbps used enterprise gear. I'm not interested in full-depth rack gear.
I prefer to keep the Ceph/Corosync networks physically separated rather than on VLANs.

I "think" I have a good plan. I still haven't figured out the PCIe Nics however. I have no experience with MikroTik.

Have I overdone or underdone anything? Am I not thinking of something? Do you have any advice or opinions to share?

Thank You!
 
Heat will be an issue with QSFP28. That's nothing designed for home environments. Besides the fact that those connectors are extremely error-prone and touch-sensitive, they get really hot, even in perfect data-center conditions.
If you want to go production with this setup, then you've chosen the wrong barebone. If this is going to be a luxury home lab, you don't need to max out the U.2 NVMe.

SFP28 is comparable to SFP+, so this might be a better choice. The CRS518 might be a great player for this: a CRS518 pair with MLAG and a dual 40G interconnect would serve all the ports perfectly, with full redundancy.

I've also ordered 3 MS-01 for a home lab and decided to go with a MikroTik CRS309, LACP both SFP+ ports, and do everything with VLAN separation over the port channel. That's easy, resilient, and should be fast enough.
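(A rough sketch of what I mean on the Proxmox side; interface names, VLAN IDs, and addresses are just examples, not my final config.)

Code:
# /etc/network/interfaces (sketch)
auto bond0
iface bond0 inet manual
    bond-slaves enp2s0f0np0 enp2s0f1np1   # the two onboard SFP+ ports
    bond-miimon 100
    bond-mode 802.3ad                     # LACP towards the CRS309
    bond-xmit-hash-policy layer3+4

auto vmbr0
iface vmbr0 inet manual
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094

auto vmbr0.10
iface vmbr0.10 inet static
    address 10.10.10.11/24                # Ceph cluster VLAN

auto vmbr0.20
iface vmbr0.20 inet static
    address 10.10.20.11/24                # Ceph public VLAN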
 
I have not heard of the error and touch-sensitivity issues with QSFP28. Copper DACs should help with the heat issues. Maybe not enough, though?

I want to keep the Ceph networks on separate switches for lower latency. I'd imagine having the switch process VLANs and port isolation while also dealing with link/chassis aggregation would be detrimental to that goal?

I don't think the MikroTik CRS309-1G-8S+IN (10Gbps) is going to be able to keep up, latency-wise, but that price is nice. I think a Thunderbolt-Net mesh would probably give similar performance, and the MS-01 has two USB4 ports per node to do it, for the cost of the cables.
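(Once the gear is here I'll just measure it; something like this is what I have in mind for comparing the options, with placeholder addresses.)

Code:
# rough latency/throughput comparison between two nodes (IPs are placeholders)
ping -c 1000 -i 0.01 10.10.10.12 | tail -2   # round-trip latency on the Ceph link
iperf3 -s                                    # on the far node
iperf3 -c 10.10.10.12 -t 30 -P 4             # on this node: 30s, 4 parallel streams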
 
I don't have any experience with passive QSFP28 DACs, so I can't tell. I'm responsible for some installations with thousands of 100G links, and all I can say is that these MTP/MPO multiplex connections are far more unreliable than single-fiber SC/LC connections.

As far as I know, all the L2 functionality is handled in the switch chip, so the CPU doesn't step in. That means the latency should not be affected much. Maybe there's a measurable delay, but I doubt it will result in major performance issues.

I'm going with the CRS309 just because it's lying around on the shelf. If that doesn't work well, I'll go with the 2x CRS518, or maybe the CRS326 setup, and put a quad-port 10GbE XL710 NIC into the MS-01 nodes for additional interfaces. Mainly because I have some of those lying around, too :p

But, to be honest, the MS-01 cluster will be a lab only for Proxmox to test alternative storage like VitaStor and LinStor and to dive into OpenStack. My 24/7 home server is also Proxmox-powered but built with server components. I wouldn't trust consumer hardware with my data at all.
 
I'm doing the home lab with a trio of MS-01s. Each host has 1x 7.68TB U.2 and 2x 3.84TB M.2, booting off a USB-to-NVMe adapter. The Ceph network is a Thunderbolt mesh with OpenFabric routing, the built-in SFP+ ports are LACP'd to my switch for the Proxmox VMs, one of the 2.5Gbps ports is dedicated to AMT, and the other to Proxmox management.

If you have any questions about the MS-01, I'm happy to answer.
 
First question: how is everything working out?
Second question: what would you change, if anything?
 
I don't have any experience with passive QSFP28 DACs, so I can't tell. [...]
That's some solid real-world experience!

I don't have another option but to buy "it" and try it. I'm hoping to make a decent selection that works, without having to return/sell things. To that end, I'm overspending and selecting things that I "think" will easily handle the cluster's needs. The twin CRS518-16XS-2XQ option is probably the coolest, but that $2800 tag is more than I can explain away to my wife ;-) She'd be mad - then she'd buy a new couch or bed or car... I can't afford them all lol

I gave away all my real servers to the local college. The power draw and noise were fine when I had a business use for them - but at home I just don't want it. Too big/hot/noisy. The mini-PC "craze" sparked my interest again, and the low power and small size with real performance is exciting to my nerd brain.

The CRS510-8XS-2XQ-IN might be a good option? 8 ports at 25Gbps and 2 at 100Gbps... at $850-ish I could do two of those... hmmmm
 
First question: how is everything working out?
Second question: what would you change, if anything?
My goal for this project was to decrease power utilization and gain redundancy compared to my single R730xd, which has dual E5-2690 v4 CPUs and 256GB of RAM. The R730xd's power floor, even with power tuning, is around 280W (and that's with storage).

Each MS-01 node has 96GB of RAM and 3 NVMe drives. After power tuning, each node uses about 40W, with occasional spikes under high load to around 90-100W. On average the entire cluster uses 120W total (no spinning disks yet, so not an apples-to-apples comparison). I'd probably get better power usage if I weren't using so much NVMe storage in each node; the enterprise U.2 and M.2 drives are power hungry. When I was first testing with a single consumer M.2 drive, my idle power floor on Proxmox was about 18W per node. I added the extra storage to try to increase IOPS, but it didn't help.
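(For anyone wanting to replicate the power tuning: it's nothing MS-01-specific, roughly the usual generic starting points below, not necessarily my exact steps.)

Code:
# generic per-node power tuning
apt install powertop linux-cpupower
powertop --auto-tune                     # apply PowerTOP's runtime power-management suggestions
cpupower frequency-set -g powersave      # prefer the powersave governor so the CPU idles down
# plus ASPM / C-state settings in the BIOS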

I am sure my storage performance is limited by both my Ceph pool network configuration and the M.2 slots' PCIe lanes. Slot 1 is Gen4 x4, but I have a Gen3 x4 U.2 drive in it. Slot 2 is Gen3 x4, and I have a Gen3 x4 enterprise M.2 drive in it. Slot 3 is Gen3 x2 with the same Gen3 x4 enterprise M.2 drive as slot 2, so slot 3 is bottlenecked by only having 2 PCIe lanes. I'm sure that Gen3 x2 is limiting me in some ways, but I have another bottleneck somewhere that is limiting my IOPS. I did benchmarks with a Ceph pool on slot 1 only, then slots 1 and 2, and finally all slots; they were all within margin of error of each other. I assume it's the Thunderbolt network bottlenecking me.
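(If anyone wants to repeat the comparison, a simple pattern would be something like this; not my exact commands, and the pool name is just an example.)

Code:
# compare pool layouts with the same quick tests (pool name is an example)
rados bench -p testpool 60 write --no-cleanup   # 60s of sequential writes
rados bench -p testpool 60 rand                 # 60s of random reads against the data above
rados -p testpool cleanup

# 4k random-write IOPS from a client with an RBD image mounted at /mnt/test
fio --name=randwrite --filename=/mnt/test/fio.bin --rw=randwrite --bs=4k \
    --iodepth=32 --numjobs=4 --size=4G --direct=1 --runtime=60 --time_based --group_reporting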

The Thunderbolt network took a bit to get stable, especially with OpenFabric routing. IPv6 works much more reliably than IPv4 with OpenFabric and Thunderbolt, for some reason. Each node has a direct connection to the other two; with OpenFabric, if one link goes down, say a cable or a port fails, traffic can route through an intermediary node. Performance tanks, but the node doesn't go offline. My real-world performance is limited to about 23Gbps per port, and only half duplex, so it's not ideal, but for my needs the performance is fine as most of my VMs are not IOPS-intensive. Raw throughput is fine, limited by the Thunderbolt network. Based on the data I've found on the limitations of current Intel DMA controllers, 25Gbps is around the limit for Thunderbolt networking, so my 23Gbps is in range.
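(For anyone setting this up: the OpenFabric side lives in FRR. A minimal sketch, with example NET and interface names rather than my exact config.)

Code:
# /etc/frr/daemons: set fabricd=yes, then systemctl restart frr
# /etc/frr/frr.conf (sketch - NET and interface names are examples)
interface en05
 ip router openfabric 1
 ipv6 router openfabric 1
!
interface en06
 ip router openfabric 1
 ipv6 router openfabric 1
!
interface lo
 ip router openfabric 1
 ipv6 router openfabric 1
 openfabric passive
!
router openfabric 1
 net 49.0000.0000.0001.00   # must be unique per node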

Booting off a USB-to-NVMe adapter was a little fiddly with adapters and cables at first: random lockups that I determined were due to the storage going offline. After some research on M.2-to-USB chipsets, I ordered a Realtek-based NVMe-to-USB adapter. I still had issues, so I tried a different USB Type-A cable, and that resolved it. I haven't had a single lockup or storage glitch on the boot drive since. I plan to add another one and run RAID 1 for boot; the drives arrived from Amazon today.
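(The mirroring step I have in mind, assuming a ZFS-on-root install; device names below are placeholders.)

Code:
# assuming ZFS-on-root; /dev/sdb = current boot disk, /dev/sdc = new one (placeholders)
sgdisk /dev/sdb -R /dev/sdc                 # copy the partition table to the new disk
sgdisk -G /dev/sdc                          # give the copy new partition GUIDs
zpool attach rpool /dev/sdb3 /dev/sdc3      # turn rpool into a mirror
proxmox-boot-tool format /dev/sdc2          # make the second disk bootable too
proxmox-boot-tool init /dev/sdc2
zpool status rpool                          # watch the resilver finish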

I had the cluster running some test loads for a few days, and it's been in home-lab "production" since Saturday (today is Tuesday), so not long. I've only had 1 lockup on 1 node in production, but I think that was due to me messing around with GPU drivers to set up vGPU VFs with the iGPU, which I did get working. For Plex, the vGPU VFs work fine for transcoding but not for hardware tone mapping on HDR-to-SDR transcodes, so I had to bind card0 to Plex rather than the VFs (card1-7). Since I'm running Plex in an LXC, this was no issue.
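(For anyone wanting to do the same, binding card0 into the container is the usual /dev/dri bind-mount pattern; a sketch with the typical device numbers, so check `ls -l /dev/dri` on your own node.)

Code:
# /etc/pve/lxc/<vmid>.conf (excerpt) - minor numbers may differ with VFs enabled
lxc.cgroup2.devices.allow: c 226:0 rwm
lxc.cgroup2.devices.allow: c 226:128 rwm
lxc.mount.entry: /dev/dri/card0 dev/dri/card0 none bind,optional,create=file
lxc.mount.entry: /dev/dri/renderD128 dev/dri/renderD128 none bind,optional,create=file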

My spinning-disk storage, 12x 16TB drives, is still on the R730xd. I'm debating putting some kind of Broadcom/LSI 16e HBA in one of the nodes and running the storage off that, or building a dedicated low-power NAS. That's my next project.

Overall I'm happy with the setup. It's not perfect, but for my needs and goals it's working. I may wind up going with 100Gbps or 25Gbps networking if I find the need, but so far it's fine. I'm happy with the performance of the MS-01; if it could handle more than 96GB of RAM, honestly I'd only *need* one system. It's amazing how power efficient this little box is. I wish it had 25Gbps onboard networking instead of 10Gbps. The biggest annoyance with the system is the RJ45 ports: I wish they had mounted them upside down, because with the lip on the case it's impossible to click the release latch on the 8P8C connectors, and I have to stick a small screwdriver in there.

Intel AMT is an adequate replacement for iDRAC. It's not perfect and has its idiosyncrasies; I'm using MeshCommander to manage it.
 
I have 3x Minisforum MS-01 w/ the Intel 13th-gen i9 CPU / 96GB RAM / Samsung enterprise 7.6TB U.2 SSD (for Ceph) / 2TB M.2 (for the OS). [...]
What RAM did you use to get to 96GB? And how's it working?
 
My goal for this project was to decrease power utilization and gain redundancy compared to my single R730xd. [...]

Your experience with Thunderbolt-Net mirrors what others have reported on other forums, with several other PCs.

I tried Thunderbolt-Net with mini PCs before I bought these new MS-01s. I tried for a month to get it working, but it turned out that the manufacturer never signed up for Intel's something-or-other, so Thunderbolt-Net was not enabled on the USB4 ports of my nodes. It was such a trial by fire that I decided I'd just go with a 100Gbps network.
The MS-01 having the PCIe slot is what sold me on it. I'm going to try direct node-to-node connections first (ConnectX-5 dual-port cards); if that fails, then I'll try the switch.
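(For the direct connections I'd probably follow the routed full-mesh layout from the Proxmox wiki; a sketch for one node, with placeholder interface names and addresses.)

Code:
# /etc/network/interfaces on node 1 (sketch - names/addresses are placeholders)
# each NIC port points at one peer; a host route sends that peer's /32 out the right port
auto enp2s0f0np0
iface enp2s0f0np0 inet static
    address 10.15.15.1/24
    up   ip route add 10.15.15.2/32 dev enp2s0f0np0
    down ip route del 10.15.15.2/32

auto enp2s0f1np1
iface enp2s0f1np1 inet static
    address 10.15.15.1/24
    up   ip route add 10.15.15.3/32 dev enp2s0f1np1
    down ip route del 10.15.15.3/32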

My plans are similar to yours, though I seem to want to do them the more expensive way. Since I'm not using USB4 for the Ceph private network, I'm thinking of attaching a USB4 4-drive enclosure to one of the nodes. I might just make a NAS with another ConnectX-5 card in it and slap it on the LAN, though... it's just for backups.

There is an M.2 A+E-key drive you might be interested in for your OS drive, if your USB solution doesn't work out. No RAID 1 though...

I didn't want to limit the drive speed with the PCIe 3.0 x2 slot, so mine is empty.

You'll probably end up mounting a fan on the outside of the box with the HBA card. I'm planning to add one for my ConnectX-5 cards, maybe two (push/pull) on the PCIe side.

The RJ45 port issue is real. Super annoying to remove a cable; luckily I don't plan to do that often once the nodes are in place.

I've never used Intel AMT before, but it should be a fun system to learn. I actually really liked Dell's iDRAC; it will be missed.
 
There is an M.2 A+E-key drive you might be interested in for your OS drive, if your USB solution doesn't work out. No RAID 1 though...

I considered that, but I couldn't find any A+E-key NVMe drives in the US. I found one in Europe, and I found a bunch of adapters, but the USB storage has been working fine. I'm probably going to put a Google Coral in the A+E slot eventually, but I'm going to try OpenVINO first with the iGPU VFs.

mounting a fan on the outside of the box with the HBA card

I haven't tried an HBA yet, but I plan on repasting the CPU with PTM7950. I have some in the fridge; I just need to set it up. But I agree the thermals are a little tight as it is, especially with a power-hungry HBA.

I didn't want to limit the drive speed with the PCIe 3.0 x2 slot, so mine is empty.

You could set up a separate Ceph pool by device class for the Gen3 x2 slot only, so you could run a "slow" pool if you need some extra storage.
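(Something along these lines, assuming the OSDs in that slot get tagged with their own device class; the OSD IDs and names are made up.)

Code:
# tag the slow-slot OSDs with their own class and give them a dedicated pool (IDs/names are examples)
ceph osd crush rm-device-class osd.2 osd.5 osd.8           # clear the auto-assigned class first
ceph osd crush set-device-class slow osd.2 osd.5 osd.8
ceph osd crush rule create-replicated slow_rule default host slow
ceph osd pool create slowpool 32 32 replicated slow_rule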
 
[...] My 24/7 home server is also Proxmox-powered but built with server components. I wouldn't trust consumer hardware with my data at all.
SCALE has an edge solution based on three Intel NUCs, and it seems that, with electricity costs so high worldwide, it sells.
 
