ESXi migration to Proxmox

Sharin

New Member
Jun 1, 2024
Hello everyone,

I'm looking for some advice before starting a major migration project at work.
We are planning to migrate all our ESXi hosts to Proxmox VE.

Hardware

We will deploy Proxmox on three Dell R740 servers, each equipped with a PERC H740P RAID controller.
The remaining disks (excluding the system disks) are six 3.84 TB SAS SSDs per server.
All controllers can be switched to HBA mode.


The network between the nodes is already 25 GbE, so Ceph traffic should not be an issue from a bandwidth perspective.


Storage plans

I am considering three approaches:


  1. Ceph on all three nodes
  2. ZFS on each server (local storage)
  3. A mix (RAID-backed LVM for the boot disks + Ceph or ZFS for the VM pool)

Workload

Our VM workloads include:


  • Several Remote Desktop / RDS servers
  • Some servers running SQL databases
  • [Add other workloads if needed]

My main concern is which storage backend would be the most appropriate for this type of mixed workload.
I know that Ceph is often recommended for HA clusters, but SQL workloads can sometimes suffer from Ceph latency compared to local ZFS.


Questions

  1. Is Ceph a good fit for a three-node cluster with this workload?
  2. Would local ZFS provide significantly better performance for SQL or RDS servers?
  3. Does a hybrid approach make sense (Ceph for most VMs, ZFS for high-I/O workloads)?

Any guidance or feedback based on similar deployments would be greatly appreciated.


Thanks in advance.
 
Regarding those Dell boxes - you're going to want to run a search in this forum for every piece of hardware. Issues abound, and there is no official hardware compatibility list like VMware has.
Off the top of my head - BOSS-S1 with Intel drives don't work on the latest kernel; Mellanox NICs don't support 'bridge-vlan-aware' or >512 VLANs; Intel NICs have driver hanging issues.

Not related to Dell, but if you happen to be running vSAN, then PVE ESXi migration does not support it - you'll have to export VMs with ovftool and that is very slow going. Nothing to be done about it, ovftool just sucks.
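For reference, a rough sketch of what that export and re-import can look like (vCenter address, datacenter/VM names, paths and the VMID are placeholders):

    # Export the VM from vCenter to a local OVA with ovftool
    ovftool --noSSLVerify \
        "vi://administrator%40vsphere.local@vcenter.example.com/MyDatacenter/vm/MyVM" \
        /mnt/export/MyVM.ova

    # On the Proxmox side: unpack the OVA and import the contained OVF
    mkdir -p /mnt/export/MyVM
    tar -xf /mnt/export/MyVM.ova -C /mnt/export/MyVM
    qm importovf 120 /mnt/export/MyVM/MyVM.ovf local-zfs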
 
While it's true that three nodes is the minimum for a Ceph cluster, you can only lose one node before losing quorum. You'll really want five nodes. Ceph is a scale-out solution: more nodes/OSDs = more IOPS, and with five nodes you can lose two and still have quorum.

While converting the PERC to HBA mode does work, I've had issues with the megaraid_sas driver. I swapped out the PERCs for a Dell HBA330, which uses the much simpler mpt3sas driver. No issues.
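A quick way to see which kernel driver a controller actually ended up on:

    # Show the storage controller and the driver bound to it
    lspci -nnk | grep -A3 -i 'raid\|sas'
    # "Kernel driver in use: megaraid_sas" -> PERC (even in HBA mode)
    # "Kernel driver in use: mpt3sas"      -> HBA330 / IT-mode HBA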

I use two small drives to mirror the Proxmox installation with ZFS RAID-1. The rest of the drives go to either Ceph or ZFS.

If you're going to do a 3-node Ceph cluster (test it against ZFS RAID-10) and you know you won't ever expand it, I suggest a full-mesh broadcast network per https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server#Broadcast_Setup

This eliminates a switch. I use a 3-node setup for staging environments and all Ceph & Corosync traffic runs on this network. To make sure this traffic never gets routed, I use the IPv4 link-local range 169.254.1.0/24, set the datacenter migration network to it, and use the insecure option.
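As a rough sketch of that setup (NIC names and addresses are examples; nodes 2 and 3 get 169.254.1.2 and 169.254.1.3):

    # /etc/network/interfaces (excerpt) on node 1 -- broadcast bond over the two mesh NICs
    auto bond0
    iface bond0 inet static
            address 169.254.1.1/24
            bond-slaves ens2f0 ens2f1
            bond-mode broadcast
            bond-miimon 100
    #Full-mesh Ceph/Corosync network

    # /etc/pve/datacenter.cfg -- send live migrations over the same network, unencrypted
    migration: network=169.254.1.0/24,type=insecure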
 
  1. Is Ceph a good fit for a three-node cluster with this workload?
I've tried to use Ceph in a small homelab, with hardware three classes below yours (mini PCs and a slow network). I've learned that there are some pitfalls, and I would work hard to start with five nodes instead of the absolute minimum of three: https://forum.proxmox.com/threads/fabu-can-i-use-ceph-in-a-_very_-small-cluster.159671/

  2. Would local ZFS provide significantly better performance for SQL or RDS servers?
Removing the complete network stack should result in higher performance, right? For Ceph each and every byte to be written needs to be transferred to at least one other node before "done" can be signaled to the application.
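If you want numbers instead of theory, a minimal fio run inside a test VM makes the difference visible; run it once with the VM disk on local ZFS and once on a Ceph RBD pool (file path and sizes are arbitrary examples):

    # Sync-heavy 4k random writes, roughly what a busy database log volume does;
    # compare the completion latencies reported for the two backends
    fio --name=syncwrite --filename=/var/tmp/fio.test --size=2G \
        --rw=randwrite --bs=4k --iodepth=1 --fsync=1 \
        --runtime=60 --time_based --group_reporting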

Disclaimer: not using Ceph currently and I have no experience with (larger) SQL servers.
 
Well, as is so often the case, the answer is: it depends :)

Or to be more precise: it depends on what workloads you are planning to run, which downtimes/risks you can live with, and what tradeoffs you are willing to make.

First: ZFS and Ceph both want to manage the disks directly, so if you happen to have a HW-RAID controller you need to reconfigure it to "HBA-mode/IT-mode" or something like that.
Second: Local ZFS will probably be faster than Ceph, but even with storage replication (https://pve.proxmox.com/wiki/Storage_Replication ) you will have some minimal data loss due to its asynchronous nature. Basically, with storage replication the VM's disk data is transferred to any other configured node. By default the schedule is every 15 minutes; this can be reduced to one minute (see the pvesr sketch after this list). So you need to ask yourself whether this is a loss you can live with or not.
Third: Ceph is synchronous, but as explained by others, with three nodes you won't have auto-healing and you can only lose one node. So the cluster will continue to run if one node fails or is down for maintenance (e.g. system updates + reboot), but not both at the same time. Depending on your needs this might still be enough, even if you don't want to invest in a fourth or fifth node. Udo's write-up is great but more aimed at homelabbers with very basic hardware. If you implement Ceph according to the recommendations ( https://pve.proxmox.com/wiki/Deploy...r#_recommendations_for_a_healthy_ceph_cluster ) you should have good performance and no issues within the given constraints of a three-node cluster (i.e. not tolerating failure/downtime of two nodes at the same time).
Fourth: ZFS will probably perform worse than ext4 or XFS, but it also has way more features (see: https://forum.proxmox.com/threads/f...y-a-few-disks-should-i-use-zfs-at-all.160037/ ). Nonetheless, if you really need to, you could benchmark your workloads and compare against ext4 or XFS. This might be especially interesting for mariadb/mysql/postgres databases, since XFS has a reputation for better performance with large databases. However, since you then won't have the replication of ZFS or Ceph, you would need to implement it at the application level, e.g. via the internal clustering/replication mechanisms of the SQL databases. Please note that Veeam at the moment doesn't support replication/clusters for mysql/mariadb and postgres (not sure about MS SQL).
Fifth: For both cases (Ceph + ZFS) the Proxmox website has success stories which might be worth a read: https://www.proxmox.com/en/about/about-us/stories and of course this forum. Several of the professionals here have reported on SMB customers who are happy with very small setups (two nodes + QDevice for ZFS replication, three nodes for Ceph) since they could live with the limitations but couldn't afford more hardware. Here, for example, is a recent thread on MS SQL Server and Ceph (but with a five-node cluster with four storage nodes): https://forum.proxmox.com/threads/anyone-using-ceph-mssql-server.104073/#post-823988
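Picking up the storage replication point above, a minimal pvesr sketch for tightening the schedule (VM ID and node name are examples):

    # Replicate VM 100 to node pve2 every minute instead of the default 15 minutes
    pvesr create-local-job 100-0 pve2 --schedule "*/1"

    # Check when each job last ran and whether it succeeded
    pvesr status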
And here is a quote from the German forum from @Falk R. (a freelance consultant who supports businesses in their migration to Proxmox VE), translated with DeepL:

"If you're buying new servers these days, I would always plan for 100G for Ceph. I haven't built any new clusters with less than 100G for Ceph in years.
Even though many people think you need more than 3 nodes, you immediately end up with 5, which often exceeds the budget of smaller companies. 3 nodes are easily sufficient if you can live with the fact that no other node is allowed to die during the 5-minute reboot of one node."


Falk has explained his reasoning for 100G in new clusters several times: in his experience, the network and slow (non-NVMe) storage are usually the bottlenecks. So if you need to buy new hardware anyhow, it makes sense to plan with some headroom for further growth and more performance. This means: U.2 NVMe SSDs with power-loss protection, a dedicated 100G network for the Ceph storage, and two further (slower) network links for Corosync cluster communication and the regular guest traffic. This fits the official recommendations for a healthy cluster, which call for at least three network links:
  • one very high bandwidth (25+ Gbps) network for Ceph (internal) cluster traffic.
  • one high bandwidth (10+ Gbps) network for Ceph (public) traffic between the Ceph servers and Ceph clients. Depending on your needs this can also carry the virtual guest traffic and the VM live-migration traffic.
  • one medium bandwidth (1 Gbps) network exclusively for the latency-sensitive Corosync cluster communication.
    https://pve.proxmox.com/wiki/Deploy...r#_recommendations_for_a_healthy_ceph_cluster

Please note that Falk replaces the 25+ Gbps network with 100G. Depending on your workload you might also want to swap out the 10 Gbps link for a faster network. For Corosync, 1 Gbps is usually enough though.
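As a rough sketch, the dedicated Corosync link can be declared when the cluster is formed (cluster name and IPs are examples; --link1 serves as a fallback ring):

    # On the first node: dedicated Corosync network on link0, fallback on link1
    pvecm create prod-cluster --link0 10.0.30.1 --link1 10.0.20.1

    # On each joining node: point at an existing member and give the local addresses
    pvecm add 10.0.30.1 --link0 10.0.30.2 --link1 10.0.20.2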

I would recommend getting a Proxmox VE partner on board to plan the architecture of your migration according to your business needs: https://proxmox.com/en/partners/find-partner/explore
It might also be worth booking a training so you don't need to figure everything out by reading the documentation alone.
 