Hello forum,
[Edits done: Clarifications added to counter misunderstandings in original replies, also changed Proxmox to PVE]
I am considering building a new production (not lab) Proxmox VE (PVE) cluster with HA for both running VMs and the underlying storage. Given the current DRAM shortage, I have to consider how to optimize the design in terms of total DRAM purchase for the entire cluster.
This is going to be a migration of VMs from another system, so the number of VMs and their sizes are not in question or open for redesign. Only the design of the new cluster hardware and how PVE is installed on it.
I am currently considering two options:
A. Two PVE nodes and a third, non-PVE machine acting as a tie-breaker, helping the cluster decide which node stays responsible for production execution if the network links between the two nodes fail. In DRAM terms, each PVE node needs enough physical memory to run all the VMs while the other node is down or isolated, so the DRAM per node is (sum of all VM memory) + (clustering overheads such as ZFS overhead); the total cost is thus 2 x (sum of all VM memory) + 2 x (cluster overheads).
B. Three PVE nodes with no tie-breaker outside the cluster. In DRAM terms, if one node fails, the two remaining nodes split its VMs between them, so the DRAM per node is (half the sum of all VM memory) + (clustering overheads such as ZFS overhead); the total cost is thus 1.5 x (sum of all VM memory) + 3 x (cluster overheads).
Option B thus theoretically saves 25% of the VM HA memory cost, but adds 50% to the cluster-overhead memory cost, and also adds the cost of a third physical machine.
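To make the comparison concrete, here is a small sketch of the two cost formulas above. The numbers plugged in (256 GB of total VM memory, 16 GB of per-node overhead) are purely assumed placeholders, not real figures from my cluster:

```python
# Hypothetical sizing sketch comparing options A and B.
# vm_total_gb and overhead_gb are assumed example values, not real cluster data.
vm_total_gb = 256   # sum of memory assigned to all VMs (assumption)
overhead_gb = 16    # per-node clustering overhead, e.g. ZFS (assumption)

# Option A: 2 nodes, each sized to run ALL VMs alone after a failover.
option_a = 2 * vm_total_gb + 2 * overhead_gb

# Option B: 3 nodes, each sized for half the VM memory (its own third
# plus a sixth absorbed from a failed peer).
option_b_per_node = vm_total_gb / 2 + overhead_gb
option_b = 3 * option_b_per_node  # = 1.5 * vm_total_gb + 3 * overhead_gb

print(option_a, option_b)  # 544 432.0
```

With these placeholder numbers the VM portion drops from 512 GB to 384 GB (the 25% saving), while the overhead portion grows from 32 GB to 48 GB (the 50% increase).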
The official PVE "system requirements" were clearly written when DRAM was cheap; they suggest, without stating reasons, adding 1GB of RAM per TB of disk to the clustering overheads. The question is how far this can safely be squeezed for cost, perhaps to 0.5GB/TB or 0.25GB/TB, corresponding to 2 bytes or 1 byte per 4KB disk block. The fundamental issue is how much of the stated overhead is somehow forced to be kept in physical node RAM at all times, versus how much is just cached data that can be reloaded from disk or regenerated on the fly.
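The bytes-per-block figures above follow from simple arithmetic, assuming the GB/TB recommendation is read as GiB per TiB and a 4 KiB block size:

```python
# Convert a GiB-per-TiB RAM overhead into bytes per 4 KiB disk block.
# Assumes binary units (GiB/TiB) and 4 KiB blocks.
BLOCK = 4 * 1024                       # 4 KiB block size (assumption)
blocks_per_tib = (1024 ** 4) // BLOCK  # 268,435,456 blocks per TiB

for gib_per_tib in (1.0, 0.5, 0.25):
    bytes_per_block = gib_per_tib * 1024 ** 3 / blocks_per_tib
    print(f"{gib_per_tib} GiB/TiB -> {bytes_per_block:g} bytes/block")
# 1.0 GiB/TiB -> 4 bytes/block
# 0.5 GiB/TiB -> 2 bytes/block
# 0.25 GiB/TiB -> 1 bytes/block
```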
Another question affecting the purchase calculation is whether PVE HA requires complete copies of all running VM memory on the node that would take over if the running node crashes, or whether PVE just keeps memory snapshots on redundant disks until the moment of failover. Obviously, if PVE restarts VMs after their physical node fails, then no DRAM needs to be reserved in advance on the node that will potentially run the VM after failover. My calculations for scenario B above assume near-zero physical DRAM reservation for potential failover of VMs running on other nodes: if the 3 nodes each use y/3 GB for VM memory, each node needs y/2 GB for VMs, of which y/6 GB just idles waiting for the arrival of HA-restarted VMs from other nodes. Keeping live VM memory clones instead would need 2/3 x y GB per node, of which y/3 GB is idle clone memory (y/6 GB mirrored from each of the other two nodes).
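The fractions in that paragraph can be checked mechanically. This sketch normalizes the total VM memory y to 1 and recomputes both variants (restart-on-failover versus live memory clones); it is just a restatement of my assumptions, not a claim about how PVE actually behaves:

```python
# Worked check of the scenario-B failover arithmetic.
# y = total VM memory across the cluster, normalized to 1 (assumption).
from fractions import Fraction

y = Fraction(1)
running_per_node = y / 3  # each of 3 nodes actively runs a third of the VMs

# Restart-on-failover: a surviving node absorbs half of a failed peer's VMs.
needed_per_node = running_per_node + running_per_node / 2  # y/3 + y/6 = y/2
idle_per_node = needed_per_node - running_per_node         # y/6 waits idle

# Live clones: each node also holds warm copies of peers' VM memory,
# y/6 mirrored from each of the other two nodes.
clone_per_node = running_per_node + 2 * (running_per_node / 2)  # 2y/3
idle_clone_per_node = clone_per_node - running_per_node         # y/3

print(needed_per_node, idle_per_node, clone_per_node, idle_clone_per_node)
# 1/2 1/6 2/3 1/3
```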
Clarification, also stated in a reply below: "clustering overheads" means all the physical memory on PVE nodes stemming from running the full HA suite of features, including disk-related in-memory metadata and VM-related metadata. For example, if the HA storage mechanism for virtual disks requires an in-memory data structure of 4 bytes per PVE node per 4KB virtual-disk block, then this adds an overhead of 1GB/TB. As another example, if the HA mechanism for VMs needs an in-memory data block per VM the size of the VRAM of a large gaming GPU, such as 16GB, then that adds 16GB/VM. I obviously hope the numbers are smaller, such as 1MB/TB disk overhead and 2MB/VM machine overhead (including 1MB virtual VRAM).
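The 4-bytes-per-block example works out like this (again assuming binary units and 4 KiB blocks; the per-block figure is purely illustrative):

```python
# Illustrative check: 4 bytes of in-memory metadata per 4 KiB virtual-disk
# block comes to 1 GiB of RAM per TiB of virtual disk.
PER_BLOCK_BYTES = 4            # hypothetical metadata per block (assumption)
BLOCK = 4 * 1024               # 4 KiB block size (assumption)

blocks_per_tib = (1024 ** 4) // BLOCK
overhead_bytes_per_tib = PER_BLOCK_BYTES * blocks_per_tib

print(overhead_bytes_per_tib / 1024 ** 3)  # 1.0 -> 1 GiB per TiB of disk
```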