Proxmox Nested Cluster - HA - ZFS - Dell PowerEdge

LeFred42

New Member
Nov 24, 2024
Hi all,
First of all, this is my first message, so I hope I'm posting it in the right place as I'm not yet familiar with the good practices of this forum. Feel free to correct me if I'm wrong.

Here's my issue:
For the past couple of months I've been diving into Proxmox, as I'd like to build a personal home lab. Here's my setup:

Dell PowerEdge R620 with 336 GB of RAM (yes, this is awesome!!); 6x1 TB HDDs configured in RAID 6 (that is, about 3.3 TB usable).

On this machine I installed a master PVE (named PVEM), on which I installed 3 nodes (i.e. 3 VMs: PVE1, PVE2, PVE3) to learn and test things such as replication and clustering.

On PVEM I also have a couple of machines which I use for my own needs.

Although the VMs in the nested cluster are not very fast, performance is good enough for me to test and learn.

Now, as I progress in my understanding and various experiments with Proxmox, I'm running into issues regarding HA and replication: live migration in the nested cluster works like a charm, without losing access to the running VMs during the migration, BUT when I simulate a crash of one of the nodes in the cluster, although the migration succeeds in the end, I lose access to the VM as it is shut down during the process.

If I understood correctly, I should have replication activated between the nodes in order to have the disks available everywhere, and to do that it seems that I need a storage pool like ZFS or Ceph (to be honest, I don't really understand what those things are). What I did understand is that to create a ZFS pool I should "see" the physical disks of my Dell PowerEdge, but as those are in a RAID 6 virtual disk, this is not possible.

So, according to you guys, how should I proceed to be able to test one of those storage technologies without losing my main configuration? I mean, is there a way to break the RAID 6 to dedicate two HDDs to a ZFS pool while still keeping my Proxmox server configured?

Thank you in advance

Fred
 

If I understood correctly, I should have replication activated between the nodes in order to have the disks available everywhere, and to do that it seems that I need a storage pool like ZFS or Ceph (to be honest, I don't really understand what those things are). What I did understand is that to create a ZFS pool I should "see" the physical disks of my Dell PowerEdge, but as those are in a RAID 6 virtual disk, this is not possible.
The whole migration and HA feature needs shared storage (meaning a storage which can be used by all nodes). Imagine a case where you have a VM on node 1. Node 1 fails, but the VM is configured for HA, so it is relaunched on node 2. How would node 2 have the data if it doesn't have access to the same storage as node 1?

Ceph is one way to set up shared network storage. It needs at least three nodes and at least one dedicated storage disk per node; it's recommended to use at least four per node in a production setup:
https://pve.proxmox.com/wiki/Storage:_CephFS
https://pve.proxmox.com/wiki/Storage:_RBD
https://pve.proxmox.com/wiki/Deploy_Hyper-Converged_Ceph_Cluster
If you want to play with it in your environment: set up four virtual disks per node VM and set up Ceph with them, roughly like the sketch below.
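A minimal command sketch, assuming the extra virtual disks show up as /dev/sdb inside the nested nodes (disk name, network and pool name are assumptions, adjust to your setup):

pveceph install --repository no-subscription   # on every nested node (PVE 8 syntax)
pveceph init --network 192.168.1.0/24          # once, on one node: define the Ceph network
pveceph mon create                             # create a monitor (repeat on the other two nodes)
pveceph osd create /dev/sdb                    # turn each extra virtual disk into an OSD (per node and disk)
pveceph pool create vmpool --add_storages      # create a pool and register it as a PVE storage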

In smaller production environments it might be better to use one of the other shared storage options, e.g. mounting a network share from a NAS via NFS or CIFS: https://pve.proxmox.com/wiki/Storage
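If the NAS export already exists, adding it cluster-wide is a one-liner (server address and export path below are placeholders):

pvesm add nfs nas-nfs --server 192.168.1.50 --export /volume1/proxmox --content images,rootdir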

Now ZFS is NOT shared storage, but it has a feature which allows replicating the contents of a ZFS-formatted disk/partition on one node to a ZFS-formatted disk/partition on a remote node. Proxmox allows you to set up replication schedules (default 15 minutes, can be reduced to one minute or extended to several hours), so in case of a failed node you will have a (depending on the configuration, small or bigger) loss of data. Depending on your use case this doesn't need to be a problem. For example, I don't really care if my DNS cache proxy isn't up to date, its data is temporary anyway and nothing important. For a production database server I would feel differently.
One note: replication needs a ZFS pool on every node (in your case PVE1, PVE2 and PVE3) which will be used for it:
https://pve.proxmox.com/wiki/Storage_Replication

The name of the ZFS pool needs to be the same on every node, otherwise it won't work.
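A rough CLI sketch of such a setup (pool name, disk, VM ID and target node are placeholders I made up; the same can be done in the GUI):

zpool create -f tank /dev/sdb                                 # on every node, same pool name everywhere
pvesm add zfspool tank --pool tank --content images,rootdir   # register it once as a PVE storage
pvesr create-local-job 100-0 pve2 --schedule "*/15"           # replicate VM 100 to node pve2 every 15 minutes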

Personally, in my homelab I have two mini-PCs with storage replication (since Ceph would be complete overkill and my network isn't fast enough for something like NFS) and a third mini-PC as a combined NAS/Proxmox Backup Server (which also runs as a QDevice).
What's this QDevice business? An HA cluster needs three nodes to ensure that, in case of a failed node, the remaining ones still have a majority to decide which node takes over. The nice part is that the third vote doesn't need to be a Proxmox VE node; any Linux install with the QDevice software will do (even a Docker container on a NAS with Docker support or a Raspberry Pi):
https://pve.proxmox.com/wiki/Cluster_Manager#_corosync_external_vote_support
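Setting one up is only a few commands (the address 192.168.1.60 is a placeholder for the external machine):

apt install corosync-qnetd        # on the external QDevice machine (Debian-based)
apt install corosync-qdevice      # on every cluster node
pvecm qdevice setup 192.168.1.60  # run once from any cluster node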
So, according to you guys, how should I proceed to be able to test one of those storage technologies without losing my main configuration? I mean, is there a way to break the RAID 6 to dedicate two HDDs to a ZFS pool while still keeping my Proxmox server configured?

HDDs aren't a good storage platform for VMs in any case, due to their worse performance compared to SSDs.
However, for your question this is not relevant: the great thing about virtualization is that you can configure virtual disks and add them to your VMs. So you actually don't need to change anything with the HDDs at all, just configure a couple of virtual disks and play with them, be it for ZFS or something else like Ceph. You don't need to change anything just for fooling around with your nested cluster, as long as you can live with the subpar performance of a nested setup.
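For example, from PVEM you could give each nested node VM a few extra virtual disks to play with (VM ID 101 and the storage name local-lvm are placeholders):

qm set 101 --scsi1 local-lvm:32 --scsi2 local-lvm:32 --scsi3 local-lvm:32 --scsi4 local-lvm:32   # four new 32 GiB disks

Inside the nested node they will show up as empty disks (e.g. /dev/sdb, /dev/sdc, ...) that you can use for ZFS or Ceph.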
If you haven't already done so: set up a local Proxmox Backup Server (ideal would be something like a dedicated mini-PC, but with a limited budget a VM on your PVEM would do) and an offsite one (e.g. on a cheap vserver or a Tuxis cloud PBS https://www.tuxis.nl/proxmox-backup-server/ ). Back up your VMs on a regular schedule to your local PBS and use the remote PBS for an offsite copy in case your PVEM breaks. If you remember to make a backup before you change anything on your nested PVE node VMs, or set the schedule frequently enough (e.g. a backup every hour), you can always go back if something goes wrong. Thanks to PBS deduplication the needed space won't be a big problem.
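Once a PBS is running, hooking it up and doing a first backup looks roughly like this (server, datastore, user, fingerprint and VM ID are placeholders):

pvesm add pbs pbs-local --server 192.168.1.40 --datastore backup --username backup@pbs --fingerprint <FINGERPRINT> --password <PASSWORD>
vzdump 100 --storage pbs-local --mode snapshot   # one-off backup of VM 100; scheduled jobs are configured under Datacenter -> Backup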

A general piece of advice: the Proxmox manual and wiki are great resources for learning how everything works and can be configured, as is this forum with its search function. I'm not a big fan of YouTube videos (I prefer the written form) but they might be a big help too. Maybe somebody else has something to recommend? I wouldn't recommend any helper scripts if you plan to just run them without understanding what they do, since people tend to run into problems at some point if they don't bother to understand the consequences. But reading them and adapting them to your needs is a great learning opportunity too.
 
  • Like
Reactions: LnxBil and waltar
Wow!!

Thank you very much Johannes for that very detailed answer!!!

I'll take the time to read and dive into all that, but I guess I'll need weeks to master it all....

I'll let you know how things go when I have the time to dig into it all.

Thanks a lot
 
Hi,
OK, today I tested the simplest thing I could try right now, before going any further into what Johannes explained:

- I mounted an NFS shared storage (Synology NAS) into the cluster (every node has the right privileges)
- I moved the disk of the test VM to that shared storage (the rough CLI equivalents of these steps are sketched below the list)
- I manually migrated this VM from one node to another without losing access (no ping lost)
- I put that very VM in the HA configuration of the cluster
- I tried to simulate a crash of the node on which it was running (I disconnected the NICs)
- The migration indeed succeeded in the end, but with about 5 minutes of no access to that VM (it's only a 6 GB disk)
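For anyone following along, these are roughly the CLI equivalents of what I clicked in the GUI (VM ID, storage and node names are placeholders):

qm move-disk 100 scsi0 nas-nfs            # move the VM disk onto the shared NFS storage
qm migrate 100 pve2 --online              # manual live migration to another node
ha-manager add vm:100 --state started     # put the VM under HA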

Am I missing something, or is it possible that my LAN speed (1 Gbit) is the point of failure?

Thank you
 
- I tried to simulate a crash of the node on which it was running (I disconnected the NICs)
- The migration indeed succeeded in the end, but with about 5 minutes of no access to that VM (it's only a 6 GB disk)
If you pull the plug, there is no way to live migrate the current session to another host, because there is no connection to the other nodes anymore.
So it needs to boot a new session on another node. At the same time, you don't want two separate nodes running the same VM.

So after pulling the plug these steps happen (see [1]):
- up to 1 minute: Corosync detects that the node is offline. Fencing begins, the isolated node restarts itself via the watchdog.
- 1 minute: the other nodes simply wait for one minute for the fencing to be complete.
- The remaining nodes with quorum try to migrate and start the HA VMs.
- Now you have to wait for the VMs to boot.

[1] https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#ha_manager_fencing
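To put rough numbers on your test (based on the defaults above, actual timings vary):
- up to ~60 s until the dead node is detected
- plus ~60 s waiting for the fencing to be considered complete
- plus the recovery itself (the VM is moved to a surviving node and started)
- plus the boot time of the guest OS

So a downtime of a few minutes is in the expected range. Your 1 Gbit LAN is not really the limiting factor here, since the VM disk already lives on the shared NFS storage and doesn't need to be copied.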
 
  • Like
Reactions: Johannes S
This forum is definitely incredible!!! Thanks to you guys for your explanations and your help.


I understand, so this is why we need a process of frequent replication between the nodes, right? (which can be achieved through technologies such as Ceph or ZFS if I'm correct, which I'll need to study soon)
 
I understand, so this is why we need a process of frequent replication between the nodes, right? (which can be achieved through technologies such as Ceph or ZFS if I'm correct, which I'll need to study soon)
Yes, or a storage where the VM data is independent from the node (e.g. a NAS or SAS system). Ceph is a little bit of both: it's like a software RAID, but as a distributed system over a network (please take this as a gross simplification, just to get an idea). The basic idea is that your data is saved multiple times on several nodes and all nodes access the data over a common access point (thus, from the view of Proxmox, it's like using a local storage, but the data is always in sync like on a NAS). If I understand everything correctly (I don't use Ceph myself), the default means that two replicas (copies) of each object are stored inside the system. The consequence is, first, that you can't use the combined capacity of the node disks, but also that in case one node fails the remaining ones still have copies of the data to work with and can take over. With more nodes and more disks per node you get more available capacity and can survive more failing nodes; you can play with one of the Ceph online calculators to get an idea:
https://florian.ca/ceph-calculator/
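As a quick back-of-the-envelope example (my own numbers, not taken from the calculator): 3 nodes with 4 x 1 TB disks each give 12 TB of raw capacity; usable space is roughly the raw capacity divided by the number of replicas, so about 6 TB with 2 replicas or 4 TB with 3, and in practice you want to stay well below that so the cluster has room to re-replicate after a failure.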
Another consequence of this scale-out architecture is that with more nodes everything gets faster (as long as the network connection is fast enough to actually leverage it). In the German forum here @Falk R. mentioned the reference implementation of a Ceph cluster at CERN as an example: they have around 1000 nodes with 5 TByte/s at times of peak usage.

However, to leverage this power you need the right hardware, meaning a fast network (at least 10 Gbit/s, more is better; 25 Gbit/s seems to be the recommendation of most professionals here for new setups, and I've even read of people setting up new clusters with 100 Gbit/s for the Ceph network), at least three nodes (better more, for better failure safety) and four disks (recommended: enterprise SSDs with power-loss protection) per node. Obviously this comes at a price.

Some coworkers from the department which runs the VMware cluster at my workplace did a case study on replacing their VMware cluster with Proxmox VE some years ago. In the end they decided against it due to missing features and the needed investment costs for Ceph (with Ceph we would have had the missing features, since they were a limitation of the storage backends we would have needed to use with our SAS). However, this was before Broadcom bought VMware, and Proxmox has progressed a lot since then. So it might be that in a few years, with the next VMware renewal, the decision will be different. If the costs for the VMware renewal are higher than buying new hardware and switching to Proxmox VE/Ceph, it's a new situation. It's not my department though, so I don't have any stakes in it ;)
 
  • Like
Reactions: Azunai333
Yes, or a storage where the VM data is independent from the node (e.g. a NAS or SAS system). Ceph is a little bit of both: it's like a software RAID, but as a distributed system over a network (please take this as a gross simplification, just to get an idea). The basic idea is that your data is saved multiple times on several nodes and all nodes access the data over a common access point (thus, from the view of Proxmox, it's like using a local storage, but the data is always in sync like on a NAS)......
OK, I get the basics but I should really dig deeper into it... all of your advice will keep me busy for the winter to come (I'm in Europe). :)
My goal is to learn the "how-tos" as much as I can on my "virtual" cluster, and afterwards use my physical "only" node to build my testing machines.
Thanks
 
  • Like
Reactions: Johannes S
If I understand everything correctly (I don't use Ceph myself), the default means that two replicas (copies) of each object are stored inside the system.
Little correction: the default is 3/2 (3 replicas / 2 minimum). This means 3 copies of the data, of which 2 have to be online to read/write it.
And to add: the default failure domain is host, meaning there won't be 2 replicas on the same host.
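In case you want to check this on a running cluster, these are standard Ceph commands (the pool name vmpool is just a placeholder):

ceph osd pool get vmpool size       # number of replicas (default 3)
ceph osd pool get vmpool min_size   # minimum replicas needed for I/O (default 2)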
 
  • Like
Reactions: Johannes S