Production Clusters - any tips before hardware purchase ?

Q-wulf

Are there any hints, tips, nice-to-knows, or gotchas one should know before planning, sizing and purchasing the cluster hardware?

Things that I am curious about (Proxmox 6.2):

  1. Can you mix and match CPU vendors and models inside the same cluster? (e.g. 2x Intel Cascade Lake-SP + 2x AMD Epyc 7000 + 1x Threadripper 3000.) If I remember correctly, last time I sized a cluster (4+ years ago) this was not recommended.
  2. Are there any issues with specific 1G and 10G NIC models? I used to have a couple of Hetzner servers that had massive issues with a specific Intel network card model. Is there a list of models known to (not) work?
  3. Can Proxmox properly utilize NVMe SSDs (M.2/PCIe) with ZFS or Ceph? (Last time I tried to install Proxmox on ZFS RAID1 it wasn't supported.) I am considering the following option(s):
    1. SATA 6 Gb/s (or M.2 via 2.5" SATA adapter) on a controller in HBA mode
    2. SAS 12 Gb/s based SSDs using a SAS RAID controller in HBA mode
    3. M.2 NVMe via add-in card (e.g. Asus Hyper)
    4. 2.5" + NVMe backplane combo from an OEM
  4. Is there any noticeable speed difference between ext4 RAID1 and ZFS RAID1 for the host OS, or does this not matter anymore?
  5. Are there any specific server vendors/models (Dell, HP, Fujitsu, Terra, ...) that are known to be "iffy" when running Proxmox on them (driver support)?
  6. Is there a list of mainboards that work well (or is that not an issue anymore)? I found this list (not updated since 2016?): https://pve.proxmox.com/wiki/Mainboards. Not sure whether to do a self-build or go with a vendor like e.g. Dell.
  7. How many CPU cycles do you typically need for Ceph when running Ceph and Proxmox 6.2 on the same node in order not to bottleneck it? (A formula would also suffice (x CPU cycles per OSD).)
  8. Not sure yet whether I want to use Ceph or ZFS for performance (12 TB total storage capacity required - that would be at least 3x12 OSDs on Ceph). HA is not really required, but would be nice to have.

Background:
I am creating 2 production clusters at two different locations (3+ nodes per cluster), with separate networks (Open vSwitch) for clients (4x 1G or 2x 10G), corosync (2x 1G), and 2x 10G or 1x 40G for storage/backup traffic. I might eventually use the Proxmox Backup Server for remote syncs (if it turns out to be stable enough for my needs; upload bandwidth is an issue at both locations).





History:

- I have used Proxmox (clusters) extensively until 5.x, with Ceph (standalone) as a storage backend. Currently unsure whether to use ZFS or Ceph as a storage backend (my experience is with 40+ HDDs per node backed by SSDs for journals, not with a small number of SSD-based OSDs).
- Since then I have only used Proxmox 6 and 6.2 on standalone servers (Hetzner/OVH) for non-production projects.
- I just did a deep dive through the wiki as a refresher.
 
1) Based on our reference documentation, I would still not recommend mixing CPU types.
2, 5, 6) There is no official list from Proxmox for those topics. Note that not every part of our wiki is written by Proxmox VE developers; some pages were written by community members and might be outdated. The Proxmox VE Administration Guide (the reference documentation) is the most accurate and up-to-date source of information for Proxmox VE. Some parts of the wiki are autogenerated from this administration guide (you can recognize those parts if you really want to: View Source should contain "Do not edit - this is autogenerated content").
3) NVMe drives are explicitly listed in the administration guide for Ceph OSDs.
7) Administration Guide:
As a simple rule of thumb, you should assign a CPU core (or thread) to each Ceph service to provide enough resources for stable and durable Ceph performance.
I don't think a stricter guideline would be useful, as the Ceph load is different for each setup. Additionally, in a hyper-converged setup there are VMs on top of it, raising the CPU requirements further beyond a Ceph-only figure (see the rough core-budget sketch at the end of this post).
4, 8) ext4, ZFS and Ceph all have completely different targets. For example, Ceph is shared storage while the others are not. I suggest taking a look at the wiki page about storage in Proxmox VE.

What might be worth a look is the Proxmox VE Ceph Benchmark.
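
Purely as an illustration of that rule of thumb (the OSD, VM and host core counts below are assumptions, not a recommendation), a per-node core budget for a hyper-converged node could be sketched like this:

Code:
# Hypothetical core budget for one hyper-converged Proxmox VE + Ceph node.
# All counts are example values - adjust to your own layout.
OSDS=12        # one OSD per NVMe in this node
MON=1          # Ceph monitor running here
MGR=1          # Ceph manager running here
CEPH=$((OSDS + MON + MGR))   # rule of thumb: ~1 core/thread per Ceph service
VMS=16         # cores the guests on this node actually need
HOST=2         # host OS, corosync, pve services
echo "plan for at least $((CEPH + VMS + HOST)) cores/threads per node"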
 
Thanks for pointing to the administration guide. That helped quite a bit, it was a good read.

Regarding storage replication:
In theory I can configure a VM on node 1 to replicate to nodes 2 and 3, Monday to Saturday from 06:00 to 18:00, every minute, by using the following parameter(s):

Code:
mon..sat 6..18/1
This would keep the VMs consistent to within about a minute of delay in case of a failure? Meaning only one minute of data being lost (we currently do 60-minute incremental backups using ShadowProtect SPX, which gets quite pricey as your VM count goes up).

If that is the case, I don't really need Ceph.
I don't really need the HA aspect.
Manual intervention by an admin in case of a failure is good enough, as long as restoring a full backup from a NAS (data loss <= 60 minutes) does not need to pull all the data of the large VMs (20-ish TB per VM) [5 hours on 10 Gbit/s, 2 days on 1 Gbit/s].
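
Back-of-the-envelope check on those bracketed numbers (assuming a fully saturated link and ignoring protocol overhead):

Code:
# Time to move ~20 TB over a fully saturated link (rough, integer hours, no overhead).
TBIT=$((20 * 8))                                   # 20 TB ~= 160 Tbit
echo "10 Gbit/s: ~$((TBIT * 1000 / 10 / 3600)) h"  # roughly 4-5 hours
echo " 1 Gbit/s: ~$((TBIT * 1000 /  1 / 3600)) h"  # ~44 hours, i.e. close to 2 days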





Regarding ZFS and pre-writing the SSDs with different amounts of data:

I'd probably be using RAIDZ2 or RAIDZ3.
I read somewhere (AFAIR on the iXsystems forum) that it makes sense to pre-fill SSDs with different amounts of random dummy data, to avoid them all failing at the same time. Is that something one should still do, or is monitoring the drives' S.M.A.R.T. values for signs of failure enough?

For HDDs, what I have typically done is order the drives at different times and from different suppliers, to avoid getting all drives from the same batch (which may be faulty). It probably makes sense to use the same make/model of SSD (if available) within a server in order not to bottleneck the ZFS pool, right?
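
As a side note on the monitoring part (my addition; device names are examples): smartmontools and nvme-cli expose the NVMe wear counters, so staggered wear is easy to keep an eye on.

Code:
# NVMe health/wear overview - watch "Percentage Used" and "Data Units Written".
smartctl -a /dev/nvme0n1
# Alternatively, via nvme-cli:
nvme smart-log /dev/nvme0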

Regarding ZFS Pool Speeds

Is there a rule of thumb for RAIDZ2 and RAIDZ3 regarding the speed of a pool?
E.g. with RAID 10 I can estimate that 4 SSDs will give roughly 2x the speed and IO, i.e. a 100% increase in IOPS.
I am probably going with pools of 12 or 16 NVMe SSDs.

Is there a rule of thumb regarding the IOPS increase/decrease (assuming the CPU does not bottleneck)?
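
One commonly cited rule of thumb (an assumption I'd verify with benchmarks on the real hardware, not an official figure): random IOPS scale roughly with the number of vdevs, while sequential bandwidth scales with the number of data disks. For 12 NVMe SSDs that gives very different pictures depending on the layout:

Code:
# Rough IOPS estimate for 12 NVMe SSDs, assuming ~400k 4k random IOPS per disk
# (made-up per-disk figure; CPU assumed not to be the bottleneck).
DISK_IOPS=400000
echo "6x mirror vdevs (striped mirrors): ~$((6 * DISK_IOPS)) IOPS"
echo "2x 6-wide RAIDZ2 vdevs:            ~$((2 * DISK_IOPS)) IOPS"
echo "1x 12-wide RAIDZ2 vdev:            ~$((1 * DISK_IOPS)) IOPS"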



Regarding ZFS or EXT4 for the Proxmox host system.

Does it make sense to run 2 small SSDs in their own ZFS pool, as opposed to 2 small SSDs in RAID1 using ext4?
Reason I am asking: I used to have a Proxmox server at Hetzner with HDDs (spinners). If I used ZFS for the host and the VMs (on a second pool) generated tons of IO, my host system would become unresponsive; if I used ext4 for the host, it would not (this was during Proxmox 5.x, AFAIK 5.4).
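
Side note, not an answer to my own question: when a ZFS host gets sluggish under heavy guest IO, the ARC size is one of the usual suspects; the "ZFS on Linux" chapter of the admin guide describes capping it via a module option, roughly like this (the 8 GiB value is just an example):

Code:
# Cap the ZFS ARC at 8 GiB so the host keeps RAM for itself and the guests (example value).
# Note: this creates/overwrites /etc/modprobe.d/zfs.conf.
echo "options zfs zfs_arc_max=$((8 * 1024 * 1024 * 1024))" > /etc/modprobe.d/zfs.conf
update-initramfs -u     # required when the root filesystem is on ZFS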
 
About storage replication: Sounds about right. The minimum interval is 1 minute. However, I didn't understand what the NAS and 60 minutes have to do with it. You have two nodes A & B, and after an initial replication only deltas have to be sent from A to B each minute. You might have data loss of some seconds in case of failure. But vzdump backups to somewhere like a NAS are sort of independent of this.

About ZFS:
I read somewhere (AFAIR on the iXsystems forum) that it makes sense to pre-fill SSDs with different amounts of random dummy data, to avoid them all failing at the same time. Is that something one should still do?
I'd mix manufacturers.
It probably makes sense to use the same make/model of SSD (if available) within a server in order not to bottleneck the ZFS pool, right?
As long as the performance is similar, why should there be a bottleneck?

Is there a rule of thumb for RAIDZ2 and RAIDZ3 regarding the speed of a pool?
RAID-Z should be like one single disk (IOPS) with very high bandwidth [1, 2, 3].

About host filesystem: Normally ZFS as host filesystem should work.

Most important: If you're gonna build two clusters with dozens of NVMEs and such, I'd strongly suggest a test setup to get an idea of ZFS root, storage replication etc.
 
However, I didn't understand what the NAS and 60 minutes have to do with it.

We currently use backup software inside the VMs/servers to accomplish similar functionality. Because that software does not do delta syncs, it has to write the whole VM/server data each time. As such we had to set it to 60 minutes in order not to saturate the links to the storage (which is a NAS). [The software is about 800 USD per server/VM - budget that I can use for the cluster instead, since Proxmox now has this feature (storage replication) and can do it more efficiently (60 seconds vs 60 minutes, and delta syncs vs full data backups).]

Most important: If you're gonna build two clusters with dozens of NVMEs and such, I'd strongly suggest a test setup to get an idea of ZFS root, storage replication etc.

That was the plan (get the hardware, then do performance tests with real-life workloads).
Two clusters because of two locations. I am currently in the pre-budgeting phase, hence the "mostly theoretical" questions.
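
For those performance tests, one cheap baseline before building any pool is a single-disk fio run (the parameters below are just one common latency-oriented choice, not official benchmark settings; the device name is an example, and writing to the raw device is destructive):

Code:
# Destructive! 4k random write, queue depth 1, direct IO, against a raw example device.
fio --name=baseline --filename=/dev/nvme0n1 --direct=1 --rw=randwrite \
    --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based --group_reporting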
 
I think the schedule should contain ":0":
Code:
mon..sat 6..18:0/1
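
For completeness (my addition; the VMID and node name are made up): the same schedule can also be set when creating the replication job on the CLI; check man pvesr for the exact options.

Code:
# Replicate guest 100 to node "pve2" every minute, Mon-Sat, 06:00-18:00 (example IDs).
pvesr create-local-job 100-0 pve2 --schedule "mon..sat 6..18:0/1"
pvesr status      # shows last sync, duration and failure count per job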

We currently use backup software inside the VMs/servers to (...)
Then using Proxmox VE should avoid some costs. The biggest part of your VMs' data is already replicated to the other node(s). So if the host of your guest goes down, you only have to move a config file and can restart it on the other node.
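
A hedged sketch of that "move a config file" step (node names and VMID are examples; in a real cluster, make sure the failed node is really down and the last replication was recent before doing this):

Code:
# Manual recovery of a replicated guest after node "pve1" failed, run on surviving node "pve2".
# /etc/pve is the clustered config filesystem; moving the file reassigns the guest to pve2.
mv /etc/pve/nodes/pve1/qemu-server/100.conf /etc/pve/nodes/pve2/qemu-server/100.conf
qm start 100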

I'd strongly suggest a test setup to get an idea of ZFS root, storage replication etc.
What I meant in addition was that you can install Proxmox VE itself as a virtual machine and put more guests inside it. This way you can (on a small scale) see the storage replication capabilities or high availability with your own eyes, without paying a single cent.
 
What I meant in addition was that you can install Proxmox VE itself as a virtual machine and put more guests inside it. This way you can (on a small scale) see the storage replication capabilities or high availability with your own eyes, without paying a single cent.

I am aware. It just does not "really" help with testing the performance (dis)advantages of one specific ZFS setup vs. another.
 
