Scaling beyond a single server. Suggestions wanted.

dsh

Well-Known Member
Jul 6, 2016
Hello, I have been running a 1U server with a Xeon E5-2620 v4 CPU and 4 SATA SSDs (ADATA 1 TB, very slow for ZFS :( ) configured as a ZFS mirrored stripe. It runs well for me, but it doesn't have enough IO for my VM needs.
So we recently purchased an AMD EPYC 7351P server with 8 NVMe SSDs (Intel P4510 1 TB) to solve the IO issue. I could run it as a single server with ZFS, but I'd like to minimize downtime in case of failure.

So far I see the 3 options below. I am not experienced enough to make the decision, so I'm asking for your opinion.
1. HA cluster + Ceph: 2 AMD + 1 Intel (old)
It seems I need 3 servers for HA.
If I purchase another EPYC server with NVMe drives, the cluster would consist of 2 EPYC servers with 8 NVMe drives each and 1 Intel server with 4 SATA SSDs.

If I configure them as a cluster with Ceph storage, will the Intel server with 4 SATA SSDs bottleneck the overall cluster?

2. EPYC with 8 NVMe + Intel with 4 SATA SSDs + QDevice

What about this case? All VMs run on the EPYC server, and if the EPYC server fails, ideally live migrate to the Intel server.

3. This option is not truly HA, but I am fine with up to 10 minutes of downtime on failure.
EPYC with 8 NVMe + Intel with 4 SATA SSDs and ZFS replication, like this: https://opensourceforu.com/2019/05/setting-up-proxmox-ve-on-two-servers-for-storage-replication/

Can I set ZFS replication to every 5 minutes or less? Then, when the EPYC server fails, I can restart the VMs on the Intel server (see the sketch below). I am fine with a VM restarting and losing up to the last 5 minutes of data.
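
From what I've read, something like this should do it with Proxmox's built-in storage replication (pvesr); "intel1" and VM ID 100 are just placeholders for my node name and VMID:

[CODE]
# On the EPYC node: replicate VM 100 to the Intel node every 5 minutes
pvesr create-local-job 100-0 intel1 --schedule "*/5"

# Check job state and the time of the last successful sync
pvesr status
[/CODE]

On failure I would move the VM's config file to the surviving node and start it from the last replicated state, accepting up to one replication interval of data loss.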


Can someone experienced give me insight on these options? Or what would you recommend?
 
A Ceph node with SATA SSDs will bottleneck/kill the NVMe performance. For Ceph you need at least a 10 Gbps network, and NVMe can saturate it easily.
If you want realtime replication and realtime HA, you can try DRBD9: both NVMe nodes as DRBD storage and the Intel node as the PVE/DRBD controller (this requires the LINBIT plugin/modules).

If you can live with a ZFS replica (the timer can be as low as 1 minute), you still need 3 nodes for automatic HA (the 3rd node can be anything; see the QDevice sketch below).

With manual "HA" on 2 nodes you will almost never stay within 10 minutes of downtime after a failure.
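
If you go the 2-node route, the third vote doesn't have to be a real server; a corosync QDevice on any small box can provide it. A rough sketch, where 10.0.0.5 is a placeholder for the witness machine:

[CODE]
# On the external witness box (any Debian machine):
apt install corosync-qnetd

# On both cluster nodes:
apt install corosync-qdevice

# From one cluster node, register the external vote:
pvecm qdevice setup 10.0.0.5

# Verify quorum afterwards:
pvecm status
[/CODE]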
 
Thank you.

So I am not going to mix NVMe + SATA SSDs in Ceph. That rules out option 1.

I have not read about DRBD9. Thanks for suggesting it; it seems like a very good option for realtime HA. I will read further about it.

What do you mean by "never get to 10 min downtime"? Do you mean I will almost always get more than 10 minutes with ZFS replication?
I have set up monitoring on my servers and it sends me an e-mail. Assuming I am at a computer: I receive the e-mail, log in to the Proxmox node, and restart the VMs on the backup server. Timed from the moment I receive the e-mail, I can restart them in under 10 minutes. Excuse my ignorance, but am I missing something here?
 
You can mix SATA/NVMe on the nodes, but for performance they need to be defined as different device classes in the CRUSH map (I don't know if that happens automatically). And it would require both types of SSD on all nodes in such a small cluster (see the sketch below).

A 10-minute window only works if you are able to react within that timeframe every single time. And nobody can (sleep, shopping, sport, etc.).
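
If you do mix them, the separation looks roughly like this: give the OSDs explicit device classes, build a CRUSH rule per class, and pin the pool to it (the pool and rule names here are placeholders):

[CODE]
# Ceph usually auto-detects hdd/ssd/nvme classes; set them by hand if needed
ceph osd crush set-device-class nvme osd.0
ceph osd crush set-device-class ssd osd.8

# Replicated rule that only places data on NVMe OSDs, failure domain = host
ceph osd crush rule create-replicated nvme-only default host nvme

# Point the VM pool at that rule so the SATA OSDs never hold its data
ceph osd pool set vm-pool crush_rule nvme-only
[/CODE]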
 
Hi dsh

I agree with [B]czechsys[/B]: using LINBIT DRBD is a good option, and it can work with 2 or more hosts of various types, so they don't all need to be the same.

We are currently speaking with LINBIT ourselves and should be testing DRBD and the LINSTOR module by the end of this year.

What's your network capacity? 10, 25 Gb etc.?

Contact LINBIT as they have a module for Proxmox.

From my understanding there are very few hardware RAID options for NVMe drives, and most providers use software RAID for drive redundancy, but that has its own set of issues.

You would be best to put the OS on a smaller mirrored SSD pair and have the larger single NVMe drives replicating to another host; this works like network RAID. With LINSTOR you can replicate to as many hosts (node drives) as you need, and HA can be handled automatically by LINSTOR for automated failover to a new host in case of disaster, as the module uses Proxmox clustering.

The replication will only be as fast as your network, and whatever network speed you decide on, you'll be better off with dual ports/cards (2 x 10, 25 Gb etc.) in case a card or port fails (rough example below).

Dual switches in case a switch fails.
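
As a rough example of what the dual-port side could look like on the Proxmox host (the interface names and address are placeholders), an active-backup bond works across two independent switches without any special switch configuration:

[CODE]
# /etc/network/interfaces - dedicated replication/storage network
auto bond0
iface bond0 inet static
    address 10.10.10.11/24
    bond-slaves ens1f0 ens1f1
    bond-mode active-backup
    bond-miimon 100
[/CODE]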

Don't use SATA SSDs: at this point in time they are limited to 6 Gbps of bandwidth, while SAS is currently 12 Gbps. There is a new SAS (SCSI) standard coming out which will be 24 Gbps, but that's still a little while away as far as I understand, and even if it were released tomorrow it would still need Linux drivers developed, which also takes time before it is production ready.

Personally I'm not a Ceph fan, but that's just my opinion. The team here at Proxmox do an awesome job of working on and implementing Ceph, and if you are paying for Proxmox support they can help with any issue that may be encountered. So if you're looking for maximum uptime with Ceph and don't already have support from Proxmox, I would strongly consider it.

It's also good to support the hard work the team puts into making Proxmox; it's such a great platform to use.

Anyhow, feel free to ping me with any questions if I can help with additional info.

“”Cheers
G
 
Thank you.

After reading some documentation, I have finally configured DRBD9 on ZFS with Proxmox 6 on two nodes in a lab.

Now my only concern is that it's not officially supported by Proxmox, and it possibly has a very small user base compared to Ceph.
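
For anyone curious, the LINBIT plugin is configured through a "drbd" entry in /etc/pve/storage.cfg; it looks roughly like this (the controller address and resource group name are placeholders, and the exact option names depend on the plugin version):

[CODE]
# /etc/pve/storage.cfg (linstor-proxmox plugin)
drbd: drbdstorage
    content images,rootdir
    controller 10.10.10.11
    resourcegroup defaultpool
[/CODE]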
 
Hi dsh

My 2 cents.

Ceph = object storage
DRBD = 1:1 or 1:N replication

DRBD Simple vs CEPH complicated.

If Ceph breaks it could mean no data.

If DRBD breaks it just stops replicating; you can still use your data on one host while troubleshooting the issue.

Ceph: support from Proxmox directly.

DRBD: support from LINBIT.

Are you using the LINBIT Proxmox module?
It's available for download from LINBIT directly.

I think there is another component called LINSTOR; it manages DRBD and runs as either a service or a VM, I can't remember.

Before going into production do a lot of testing.
Think up as many scenarios as you can for how to break the setup, then test and document a fix for each one (the sketch below is a starting point for checking replication state).
If you are going to use this in production, consider getting a support subscription from LINBIT for quick support.
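
For the break-it testing, the state of every resource can be checked from the command line, something along these lines:

[CODE]
# DRBD's own view of each resource (role, disk state, peer connections)
drbdadm status

# If you are driving it with LINSTOR, its view of nodes and resources
linstor node list
linstor resource list
[/CODE]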

I'm interested in hearing about your experience, so please post back some info once you start testing, as we are planning the same with a slightly larger cluster.

“”Cheers
G
 
What about this case? All VMs run on the EPYC server, and if the EPYC server fails, ideally live migrate to the Intel server.

1. If a node fails, it means you lost connection to it for whatever reason. This makes LIVE migration of any kind impossible, because live migration needs both source and target servers running and communicating. HA is handled differently: on Proxmox the VMs respawn on the remaining live nodes after roughly a 1-minute timeout (Proxmox does not support hot-standby HA).
2. Live migrating from AMD to Intel or vice versa means you will have to make compromises on virtual CPU capabilities. It works, you just have to use a lowest-common-denominator CPU flag mask. Because HA means starting the machine again, live migration does not need to be supported for HA to work.
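
A minimal sketch of point 2, assuming the VM has ID 100 (a placeholder): pick a generic virtual CPU type so the same flag set exists on both vendors.

[CODE]
# kvm64 is the conservative lowest-common-denominator choice
qm set 100 --cpu kvm64

# Confirm what the VM is currently configured with
qm config 100 | grep ^cpu
[/CODE]

The trade-off is that the guest loses newer instruction-set extensions, so test whether your workloads care.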
 

Forgot to mention: ZFS will slow things down and chew up more memory for little benefit.

Personally, the fewer layers to deal with the better.

How's the testing going?

“”Cheers
G
 
Hi Velocity08,

The main VM in our setup is an ERP database. I have an irrational fear of bit rot, and running it on ZFS gives me some confidence. I've never trusted hardware/software RAID.

I have not made any progress yet on the lab system, as it has limited memory and I ran into some problems due to the lack of available memory /thanks to ZFS :)/. I should receive additional RAM by this Friday.
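
One workaround for the low-memory lab box would be to cap the ZFS ARC; a rough sketch (the 4 GiB value is just an example):

[CODE]
# /etc/modprobe.d/zfs.conf - cap the ARC at 4 GiB (value in bytes)
options zfs zfs_arc_max=4294967296

# Apply immediately without a reboot (takes effect as the ARC shrinks)
echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max

# Make the cap persist across reboots
update-initramfs -u -k all
[/CODE]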
 
Hi dsh

We've been running SSDs in an enterprise production environment under above-average loads and the drives have really been great so far.
We don't use SATA, only SAS eMLC and MLC drives, in a hardware RAID 6 equivalent.

With the DRBD replication between hosts, that's another mirror of the data.

May I ask what sort of drives you are using?
What ZFS policy are you looking at using?

""Cheers
G
 
