Hyperconverged setup advice

zBrain

I have a 2U rack server with 4 nodes. Each node has:

2 CPUs, 6 cores each, hyperthreaded (24 threads total)
128GB RAM
2x 1GbE + 1x 10GbE NICs
3x 4TB WD Gold
2 NVMe slots

I'm migrating from an old cluster involving multiple FreeNAS boxes and a big mess.

Currently I have set up a couple of nodes with a 3-way mirror using ZFS. This works OK once the VMs are on it, but migrating is an issue. It goes fine until RAM is 50% utilized, then IO wait spikes in the VM and the transfer slows to a crawl. I tried using an NVMe device as a ZIL, but that didn't fix it.

I mostly went with ZFS out of familiarity. But I think I'm bottlenecked because I have no striping, so my write performance isn't great. I'm not sure why it runs smoothly until RAM fills up.

So, I think I may want to use Ceph. I'm just not sure how best to utilize my hardware.

First: Is this a bad idea? Am I better off going back to dedicated storage? I know the first suggestion is going to be buy SSDs, but currently that's not an option.

Assuming this isn't a bad idea on its face:
Is it best to install Proxmox itself on the NVMe devices? Should I mirror them using ZFS then use the HDDs as OSDs?

Is there anything I need to know moving from ZFS to Ceph that is going to bite me? Does Ceph have anything analogous to ARC?

Any general advice/experience is appreciated. I've read up enough to feel like I know just enough to make a huge mess out of a small one.
 
How are you migrating exactly? Which network connections do your migrations use? For some operations you can define bandwidth limits.
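
If it's a disk move / storage migration that is saturating things, you can throttle it so the VM stays responsive. A minimal sketch, assuming cluster-wide limits in /etc/pve/datacenter.cfg (values are KiB/s and the numbers are only placeholders):

Code:
    # /etc/pve/datacenter.cfg -- cluster-wide I/O bandwidth limits in KiB/s
    # 'migration' covers live migrations, 'move' covers disk moves
    bwlimit: migration=102400,move=51200

A per-job override should also be possible, e.g. qm migrate <vmid> <target> --bwlimit <KiB/s>.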

In contrast to ZFS, Ceph is a true distributed object store and file system. That means you can get real high availability, for example. The idea of hyper-converged setups (the same physical nodes for compute & storage) with Ceph is tightly integrated into Proxmox VE. We even deploy our own Ceph packages (covered by subscription) and have a whole chapter about this in our reference documentation. Nevertheless, ZFS is generally a very good choice together with Proxmox VE for local storage and is also well integrated.
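
Just to sketch roughly what that integration looks like on the CLI (the subnet is only an example, see the docs chapter for the details):

Code:
    # on every node: install the Ceph packages shipped by Proxmox
    pveceph install
    # once, on the first node: initialize Ceph with a dedicated network
    pveceph init --network 10.10.10.0/24
    # on each node that should run a monitor
    pveceph mon create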

I know the first suggestion is going to be buy SSDs
This is correct :)

Is it best to install Proxmox itself on the NVMe devices? Should I mirror them using ZFS then use the HDDs as OSDs?
Wouldn't hurt, but I'd say having Proxmox VE on a fast disk is not more important than having virtual machines on fast disks.

Is there anything I need to know moving from ZFS to Ceph that is going to bite me?
What you have to pay more attention to is your network. You could try something like the 10GbE NIC for Ceph, 1x 1GbE for Corosync, and 1x 1GbE for everything else.
You can take a look at our Ceph benchmark to get an idea of the impact of network and disk speed.
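
As a rough sketch of that split (subnets made up), the Ceph networks would live on the 10GbE link, e.g. in /etc/pve/ceph.conf:

Code:
    [global]
        # client <-> OSD traffic, on the 10GbE NIC
        public_network = 10.10.10.0/24
        # OSD replication traffic; same 10GbE link here, since there is only one fast NIC per node
        cluster_network = 10.10.10.0/24

Corosync then gets one of the 1GbE NICs to itself (set as its link when creating the cluster), and the remaining 1GbE NIC carries management and VM traffic.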

Does Ceph have anything analogous to ARC?
There are the journal and the DB/WAL: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_precondition
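
If the NVMe slots are still free after the OS install, one option (device names below are placeholders) is to put each OSD's DB/WAL on NVMe when creating it:

Code:
    # HDD as the data device, RocksDB/WAL on the faster NVMe
    pveceph osd create /dev/sdb --db_dev /dev/nvme0n1

That speeds up metadata and small sync writes, but it is not a read cache in the ARC sense.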
 
OK, assuming I'm going to try to work out my ZFS setup for now and look into Ceph later:

I'm migrating from a FreeNAS server that only has a 1GbE NIC. Storage is on a dedicated subnet.

As I said, the disk migration goes fine at first. RAM usage steadily climbs until it hits 50% on the Proxmox node, then iowait on the VM jumps, the migration slows to a crawl, and I have to cancel it because things on the VM stop responding.

Once the migration completes, things are fine. But for some of my larger VMs, I can't just grit my teeth and wait it out, as it would result in services being broken for 15+ hours.

Am I doing anything fundamentally wrong?
 
SSDs would, unfortunately, most likely help you greatly here.

AFAIK, ZFS buffers writes in RAM up until it fills its ARC (which is set to 50% of total system RAM by default), after which it has to start flushing/writing to the drives. Depending on how you're doing the writes (e.g. if other processes are hitting the drives with lots of random reads/writes, or long sustained reads/writes), it makes sense for the transfers to slow down so significantly, because the drives simply cannot write the data quickly enough.

The drives being 3-way mirrored might impact performance further, too, since (I imagine, at least) the writes have to be synced to each drive, which adds more overhead.
 
AFAIK, ZFS buffers writes in RAM up until it fills its ARC (which is set to 50% of total system RAM by default), after which it has to start flushing/writing to the drives.
The Adaptive Replacement Cache (ARC) is a read cache. A write cache (for async writes only!) exists in RAM and is written to the real disks every 5 seconds by default.
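
For reference (assuming OpenZFS-on-Linux defaults), you can check the ARC limit and the flush interval, and cap the ARC if needed; the 8 GiB figure is only an example:

Code:
    # current ARC size and maximum (bytes)
    awk '/^(size|c_max) / {print $1, $3}' /proc/spl/kstat/zfs/arcstats
    # async writes are grouped into transaction groups, flushed every zfs_txg_timeout seconds (default 5)
    cat /sys/module/zfs/parameters/zfs_txg_timeout
    # cap the ARC, e.g. to 8 GiB, persistently
    echo "options zfs zfs_arc_max=8589934592" >> /etc/modprobe.d/zfs.conf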
 
The Adaptive Replacement Cache (ARC) is a read cache. A write cache (for async writes only!) exists in RAM and is written to the real disks every 5 seconds by default.
Oh, I hadn't realized that! Thank you for correcting me there.

I guess it may be the kernel buffering the writes then? I was trying to theorize about the 50% memory usage, and subsequent tanking of performance, but I guess I was a bit overzealous lol
 
It seems odd to me, as I've moved virtual disks between FreeNAS boxes without issues before.

I appreciate any ideas, because I feel like I must be missing some knowledge.
 
