New to PVE, trying to figure out redundant HA storage with Ceph... Or is there some better way?

Jun 2, 2024
I have decades of experience with HPE ProLiant and Dell PowerEdge servers and their hardware RAID controllers, but very little knowledge of Ceph, ZFS, etc.

With VMware's recent BS, I am looking to switch all my clients that run VMware over to Proxmox. I built a small 3-node test system and followed some online videos to create the Ceph OSD disks, monitors, etc., and it all went OK. I was able to create VMs enabled for HA and ran some simulations: live-migrating running VMs worked without issue and without the VM needing a reboot. I then failed a PVE node, and also an OSD disk, and the VM did come back up on another PVE node after a few minutes.

I'm curious how to handle a single hard disk failure without the VM having to move to another PVE node. When creating the OSD you have to pick a single disk, and you get a warning that hardware RAID disks are not supported. So how can we create redundant storage on each PVE node so that VMs survive a simple disk failure in the pool without failing over to another PVE node or rebooting?

I also created a ZFS pool using 2 drives as a simple mirror, created VMs on it, and simulated a failed drive, and the VMs seemed to keep running. But I couldn't find a way to add the "HA" aspect so that a VM could be booted on another PVE node if the whole node failed.

Most clients are single VMware hosts with direct-attached storage on a hardware RAID controller, but a few have shared SAN storage with multiple hosts, and the VMs are set up in HA so that if a host goes down they keep running on the other host without even needing a reboot. (HPE MSA2000 array with redundant 12 Gbps SAS connections to each ProLiant G10 host.) Can Proxmox leverage shared SAN storage for HA?

I did a bunch of searching but am not really finding what I am looking for. What is the best practice for a multi-node setup with locally attached storage that is redundant enough to survive disk failures without having to move everything to another node?
 
Okay, I think there is some misunderstanding, which is understandable since Ceph works quite a bit differently.

Ceph achieves the redundancy over the cluster. If you have a replicated pool with a (default) size of 3, each data block/object is stored 3 times in the cluster on different hosts.
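As a rough illustration of what a replicated pool of size 3 means for capacity, here is a minimal Python sketch. The function name and the example numbers (3 nodes, 4 OSDs each, 2 TB per OSD) are hypothetical, and the arithmetic ignores overhead and the free space Ceph needs for recovery.

```python
def usable_capacity_tb(nodes, osds_per_node, osd_size_tb, size=3):
    """Approximate usable capacity of a replicated pool.

    Every object is stored `size` times, so usable space is raw / size.
    Assumes equally sized OSDs and a host-level failure domain.
    """
    if size > nodes:
        # With replicas placed on different hosts, a pool cannot reach
        # full redundancy if there are fewer hosts than replicas.
        raise ValueError("size cannot exceed the number of nodes")
    raw_tb = nodes * osds_per_node * osd_size_tb
    return raw_tb / size


# Hypothetical example: 3 nodes, 4 OSDs per node, 2 TB per OSD, size=3.
print(usable_capacity_tb(3, 4, 2.0))
```

In this example the 24 TB of raw disk space yields about 8 TB of usable space, because every object exists three times across the three hosts.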

There is no need for additional local redundancy on the disk level. The Ceph clients (VMs in our case) communicate directly with the OSDs in the cluster to read and write their data.

So if a single disk fails, Ceph will show warnings. Check the Ceph MGR documentation for monitoring options. Ceph will then recover the lost data from that disk to other disks/hosts in the cluster and you will have full redundancy again once it is done.
To replace the failed disk, destroy its OSD, swap in a new disk, and create a new OSD on it. Ceph will then rebalance the data in the cluster.

When you plan your Ceph cluster, you ideally give Ceph enough resources to recover to. Therefore, if possible, it is a good idea to use more but smaller resources (Nodes, Disks). But you need to balance that with the resource requirements (CPU, memory) for each additional Ceph service.

If you have exactly as many nodes as replicas (size), you need to be careful. This is usually the case in a 3-node cluster. In that situation, please use at least 4 OSDs per node. Replicas must be on different hosts, and every other host already holds a copy, so if a disk fails, Ceph can only recover its data onto the remaining disks of the same host. If you use only 2 disks for OSDs and one fails, well, the remaining one will be full very quickly...
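The effect of the OSD count per node can be estimated with some simple arithmetic. The sketch below is only an approximation: it assumes equally sized OSDs, evenly distributed data, and recovery that stays on the same host as described above. The function name and the 60 % starting fill level are made up for illustration; 0.85 and 0.95 are Ceph's default nearfull and full ratios, assumed to be unchanged.

```python
NEARFULL_RATIO = 0.85  # Ceph default: health warning (assumed unchanged)
FULL_RATIO = 0.95      # Ceph default: writes are blocked (assumed unchanged)


def fill_after_disk_failure(osds_per_node, fill_before):
    """Estimate the fill level of a node's surviving OSDs after one fails.

    The failed OSD's data is spread over the remaining OSDs of the
    same host, so the same amount of data sits on one disk fewer.
    """
    if osds_per_node < 2:
        raise ValueError("need at least 2 OSDs per node to recover locally")
    survivors = osds_per_node - 1
    return fill_before * osds_per_node / survivors


# Hypothetical scenario: every OSD is 60 % full before the failure.
for n in (2, 4, 8):
    after = fill_after_disk_failure(n, 0.60)
    if after >= FULL_RATIO:
        status = "does not fit, recovery cannot complete"
    elif after >= NEARFULL_RATIO:
        status = "nearfull warning"
    else:
        status = "ok"
    print(f"{n} OSDs per node: {after:.0%} after recovery ({status})")
```

With 2 OSDs per node at 60 %, the surviving disk would need to hold 120 % of its capacity, so recovery cannot finish. With 4 OSDs per node the survivors end up at about 80 %, which still stays below the nearfull threshold.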
 
Hi @padair, welcome to the forum.

To add to @aaron's expert Ceph explanation:

ZFS is not a cluster-aware filesystem. With local ZFS you can only get "HA" by implementing asynchronous replication between nodes: https://forum.proxmox.com/threads/high-availability-with-local-zfs-storage.122922/

A single ESXi host with local storage translates to a single PVE host with either LVM or ZFS (no RAID). Obviously, no HA.

Proxmox does work with SAN storage, and you can configure HA on it. That requires the use of thick LVM.



 
