Is a GlusterFS Hyperconverged Proxmox with Hardware Raid doable?

Zubin Singh Parihar

Hey Folks,

I'm trying to get clear on something here with a possible setup...

Can we create a hyper-converged Proxmox GlusterFS system on servers with hardware RAID?

I'm thinking of the following scenario:

3 x Dell R720, each with:
  • PERC hardware RAID controller
  • 2 x 256 GB SSD -- Proxmox OS (RAID 1)
  • 6 x 1 TB SSD -- GlusterFS bricks (each formatted XFS)
  • 2 x 10 GbE NIC -- Gluster and VM traffic
  • 4 x 1 GbE NIC -- management network

3 replica policy

Each 1 TB drive would be formatted with XFS, since GlusterFS works on bricks on a filesystem rather than raw disks, correct?
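
Roughly what I have in mind per brick/volume (device name, mount point and hostnames below are just placeholders, and I haven't tested this yet):

mkfs.xfs -i size=512 /dev/sdb                       # one of the 1TB SSDs
mkdir -p /data/glusterfs/brick1
mount /dev/sdb /data/glusterfs/brick1
gluster peer probe pve2 && gluster peer probe pve3  # run once from the first node
gluster volume create gv0 replica 3 pve1:/data/glusterfs/brick1/gv0 pve2:/data/glusterfs/brick1/gv0 pve3:/data/glusterfs/brick1/gv0
gluster volume start gv0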


Would this work?
It seems like a simpler setup than hyper-converged Ceph...

What are the pros and cons?

Let me know your thoughts...
 
GlusterFS is a perfect option for, e.g., setting up HAProxy redundantly and sharing the configs and SSL certificates. That's it; in my opinion it's not good for much else.

In my opinion, your hardware is very well suited for Ceph. You already meet the basic requirements; you would just have to flash the controller to IT mode so it acts as a proper HBA.
But you definitely can't compare GlusterFS and CEPH with each other. Both work fundamentally differently.

But as I said, GlusterFS is great for sharing small files like configs between two LBs; for VM images it's a definite no-go from me. I would always aim for Ceph. Even with three nodes the performance can be sufficient. Of course you can't expect 100k IOPS of throughput, but you get a solid baseline performance that is usually enough.
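
If you want to get a feel for that baseline on your own hardware, Ceph's built-in benchmark gives a quick first impression. Just a sketch; 'bench' is a throwaway pool name I made up here, don't point this at a pool with production data:

ceph osd pool create bench 32                # small test pool, 32 PGs as an example
rados bench -p bench 60 write --no-cleanup   # 60 seconds of 4 MB object writes
rados bench -p bench 60 rand                 # random reads against the objects written above
rados -p bench cleanup                       # remove the benchmark objects afterwards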
 
I run a 3-node full-mesh broadcast (https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server#Broadcast_Setup) Ceph cluster without issues. I flashed the PERCs to IT-mode. Corosync, Ceph public & private, Migration network traffic all go through the full-mesh broadcast network.

IOPS for writes are in the hundreds. IOPS for reads are 2x-3x write IOPS.

I use the following optimizations learned via trial-and-error (a qm-based sketch follows the list):

  • Set write cache enable (WCE) to 1 on SAS drives (sdparm -s WCE=1 -S /dev/sd[x])
  • Set VM cache to none
  • Set VM to use VirtIO-single SCSI controller and enable IO thread and discard option
  • Set VM CPU type to 'host'
  • Set VM CPU NUMA if server has 2 or more physical CPU sockets
  • Set VM VirtIO Multiqueue to number of cores/vCPUs
  • Set VM to have qemu-guest-agent software installed
  • Set Linux VMs IO scheduler to none/noop
  • Set RBD pool to use the 'krbd' option if using Ceph
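
Most of the VM-side items above map to single qm set options. Purely as a sketch for one VM: VM ID 100, storage 'cephpool', volume name, bridge and queue count are placeholders, not from a real config:

qm set 100 --scsihw virtio-scsi-single --cpu host --numa 1 --agent enabled=1
qm set 100 --scsi0 cephpool:vm-100-disk-0,cache=none,discard=on,iothread=1
qm set 100 --net0 virtio,bridge=vmbr0,queues=4   # note: re-specifying net0 without its current macaddr= generates a new MAC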
 
Hey @jdancer

Thanks for this info!

A company I work for already has a 3-node Ceph cluster set up.

The configuration for each of those Servers:
  • 2 x Intel CPU Gold
  • 512GB RAM
  • IT Mode RAID Controller
  • 2 x 256GB SSD Samsung EVO 870 (Proxmox OS)
  • 6 x 1TB SSD Samsung EVO 870
  • 2 x 10 GbE NICs (Ceph public and sync)
  • 2 x 1 GbE NICs (management network)
Networking:
  • 24-port Dell 10 GbE switch
  • 48-port Dell 1 GbE switch

I have some questions about your optimization experience:
  • Set write cache enable (WCE) to 1 on SAS drives (sdparm -s WCE=1 -S /dev/sd[x])
    • Each of the OSDs is a 1TB Samsung EVO 870... Do we run this command on each of these devices?
  • Set VM cache to none
    • We do this already
  • Set VM to use VirtIO-single SCSI controller
    • We do this already
  • enable IO thread
    • We do this already
  • discard option
    • We do this already
  • Set VM CPU type to 'host'
    • Why do you do this? All my CPUs are the same (Intel Golds), so it would work... but why not keep the default 'x86-64-v2-AES'? Is it for performance?
  • Set VM CPU NUMA if server has 2 or more physical CPU sockets
    • We have dual-socket servers, but why is it better? Do you get more performance?
  • Set VM VirtIO Multiqueue to number of cores/vCPUs
    • We do this already
  • Set VM to have qemu-guest-agent software installed
    • We do this already
  • Set Linux VMs IO scheduler to none/noop
    • Where do you do this? Inside the VM? And if so, how?
  • Set RBD pool to use the 'krbd' option if using Ceph
    • Where do you do this and Why?

Thank you so much in advance!
 
Set write cache enable (WCE) to 1 on SAS drives (sdparm -s WCE=1 -S /dev/sd[x])
  • Each of the OSD's are a 1TB Samsung EVO 870... Do we run this command on each of these devices?
It should usually be activated anyway. But you can check with hdparm -W /dev/sda.

By the way, SSDs are generally good for CEPH, but the Evo will certainly cost you a lot of performance. Here it is best to switch to enterprise SSDs such as the Samsung PM883.

Set VM cache to none
  • We do this already
The question here is why are you using this? If you want to ensure that no data is lost in the event of a power outage, then both none and writeback would not be for you. More about it here: https://pve.proxmox.com/wiki/Performance_Tweaks#Disk_Cache

I've actually always used writeback and have never had performance problems or data loss.
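
For reference, switching an existing disk over is a one-liner; the VM ID, storage and volume name here are placeholders, and you should re-specify whatever other options you already use on that disk:

qm set 100 --scsi0 cephpool:vm-100-disk-0,cache=writeback,discard=on,iothread=1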

Set VM CPU type to 'host'
  • Why do you do this? All my CPU's are the same (Intel CPU Golds), so it would work... but why not keep it to the default 'x86-64-v2-AES'? Is it for performance?
You don't have to set host, but the x86-64-v2-AES will cost you performance.
If you only have the same CPU model across the cluster anyway, you can set host. If you have a mixed cluster, you should find and set the lowest common denominator; that way, live migration between old and new nodes is still possible. With the host CPU type you usually cannot migrate between different CPU models.
You only get the full performance of the CPU with the right CPU model; otherwise, features the CPU could do natively and faster may end up being emulated. AES is one example.
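
A quick way to check it, with a hypothetical VM ID 100:

qm set 100 --cpu host                                                # on the node; stop/start the VM afterwards
grep -qw aes /proc/cpuinfo && echo "aes flag exposed to the guest"   # inside the guest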

Set VM CPU NUMA if server has 2 or more physical CPU sockets
  • We have Dual Socket, but why is it better? do you get more performance?
Here you have to differentiate. The node does NUMA anyway, but now the VM could also do NUMA.
You only get a real benefit if your application is also NUMA-aware.

In principle, NUMA in the VM becomes advantageous when the resources assigned to your VM can no longer be served by a single CPU socket and its local RAM. In multi-CPU systems, each CPU manages its own RAM. If a process runs on CPU 2 but is currently using CPU 1's RAM, CPU 2 must first go through CPU 1 to reach that RAM. The detour over the QPI link costs performance and latency. With NUMA, an application can pin its RAM to one CPU and only fall back to the other CPU's RAM when necessary.
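
If you want to try it, it's just a per-VM flag, and inside the guest you can look at the topology (VM ID and the socket/core split are placeholders; numactl must be installed in the guest):

qm set 100 --numa 1 --sockets 2 --cores 4   # on the node: e.g. 2 virtual sockets with 4 cores each
numactl --hardware                          # inside the guest: should now list two NUMA nodes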

Set Linux VMs IO scheduler to none/noop
  • Where do you do this? Inside the VM? And if so, how?
As far as I know, it's now set to none anyway.
In my opinion, this can safely be ignored today.
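
You can quickly check what the guest is actually using; the entry in [brackets] is the active scheduler (sda is just an example device):

cat /sys/block/sda/queue/scheduler   # e.g. "[none] mq-deadline"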

Set RBD pool to use the 'krbd' option if using Ceph
  • Where do you do this and Why?
Often it doesn't make much of a difference (librbd (user space) vs. krbd (kernel page cache)). There is a video about this on the Ceph YouTube channel: https://www.youtube.com/watch?v=cJegSAGWnco

You have to decide for yourself and test what makes sense or not.
However, new features often land in librbd before krbd. At least it used to be that way; I don't know whether that's still the case today.
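
For completeness: the krbd switch sits on the Proxmox storage definition, not on the Ceph pool itself (Datacenter -> Storage -> your RBD storage in the GUI, or on the CLI; 'cephpool' is a placeholder storage ID):

pvesm set cephpool --krbd 1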

But it has to be said that you get the greatest benefit from enterprise switches and SSDs, since those account for most of the latency and therefore most of the performance. Things like the I/O scheduler, VM cache or krbd are subtleties - if you have the time and desire to deal with them, go ahead. But you can also simply spend that time on your customers instead of chasing the last 100 IOPS ;-)

//EDIT:
If you want to use an EFI/TPM device in the VM, then KRBD is always used: https://forum.proxmox.com/threads/dual-stack-ceph-krbd-and-efi-tpm-problem.137234/
 
1. Re: Samsung PM883 --> Thank you. I will look into this.
What about Samsung Pros?



2. "Set VM cache to none
We do this already
The question here is why are you using this? If you want to ensure that no data is lost in the event of a power outage, then both none and writeback would not be for you. More about it here: https://pve.proxmox.com/wiki/Performance_Tweaks#Disk_Cache"

Thanks for the link, I'll study it.

Didn't you mention in your earlier post that you do the same?
What I meant by VM cache was 'Cache: Default (No cache)', as in the screenshot below.
Also shown in the screenshot: I set 'Async IO: native'.
What are your thoughts on that?

[Screenshot: VM hard disk options showing 'Cache: Default (No cache)' and 'Async IO: native']


3. You don't have to set host, but the x86-64-v2-AES will cost you performance.
I understand. Thank You!


4. In principle, NUMA in the VM becomes advantageous when the resources assigned to your VM can no longer be served by a single CPU socket and its local RAM. In multi-CPU systems, each CPU manages its own RAM. If a process runs on CPU 2 but is currently using CPU 1's RAM, CPU 2 must first go through CPU 1 to reach that RAM. The detour over the QPI link costs performance and latency. With NUMA, an application can pin its RAM to one CPU and only fall back to the other CPU's RAM when necessary.
Great explanation!

5. Set Linux VMs IO scheduler to none/noop
Thank You!

6. Set RBD pool to use the 'krbd' option if using Ceph.....
Thank You!

7. The rest...
Thank You!


All great info! I'll research what you suggested.
Thanks so much for the contribution!
 
My optimizations are for a SAS HDD environment. I don't use any SATA/SAS SSDs in the Proxmox Ceph clusters I manage.

I also don't run any Windows VMs.

All the VMs are Linux: a mix of CentOS 7 (EOL this year), Rocky Linux (RL) 8 & 9 and Alma Linux (AL) 8 & 9.

For the RL/AL VMs, I set the Linux IO Scheduler as follows:

/etc/udev/rules.d/60-ioschedulers.rules
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="none"

For Centos 7, it's:

# grubby --update-kernel=ALL --args='elevator=noop'

Reboot the VMs when done.
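
If you would rather not reboot for the udev-rule variant, reloading and re-triggering udev should apply it on the fly (my assumption, not something I normally do; the grubby/elevator change on CentOS 7 still needs a reboot):

udevadm control --reload-rules
udevadm trigger --subsystem-match=block --action=change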

The reason for the above optimization is to let the hypervisor do the IO scheduling, since it, not the VM, controls the physical hardware. This is true for both KVM and ESXi (VMware) hypervisors. More info at https://access.redhat.com/solutions/5427

As for setting the VM disk cache to None: the VM then sends data directly to the disk, which knows how to schedule IO requests optimally. More info at https://forum.proxmox.com/threads/disk-cache-wiki-documentation.125775

The reason for setting write cache enable (WCE) on SAS HDDs is that the majority are used behind hardware RAID controllers in production, which have their own battery-backed RAM cache, so the drive's own cache is typically left disabled. I don't use any HW RAID controllers, so I enable WCE for maximum IOPS.

I use the default VM Disk Async IO of 'io_uring' without issues.

Again, you have to test the optimizations and see what works for your environment.
 
I don't have exact experiments or evidence for this, but some of these points simply won't apply to Ceph. Keep in mind that Ceph is a storage system in its own right; the hypervisor has no physical access to the underlying data storage media, only Ceph does.

The cache question also can't simply be settled with 'none' and considered good; see, for example, this page straight from the Ceph documentation: https://docs.ceph.com/en/latest/rbd/qemu-rbd/#running-qemu-with-rbd
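
For librbd, the relevant cache knobs live on the client side in ceph.conf. A minimal example (option names as in the Ceph docs; the values shown are the defaults as far as I know):

[client]
rbd cache = true                            # librbd write-back cache; not relevant when krbd is used
rbd cache writethrough until flush = true   # stay in writethrough until the guest issues its first flush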
 
That's good info to know.

I've changed the VM's disk cache to writeback and will monitor performance.

I also have standalone ZFS servers and also changed the cache to writeback.
No, they're also desktop/prosumer drives, not enterprise grade. You will not have fun with them on Ceph.
Agree.

You want enterprise solid-state storage with power-loss protection (PLP).
 
