Advice for cluster/ceph

wav3front

Member
Aug 31, 2024
Hi all,

I'm attempting to set up a cluster with OVH dedicated servers.
Each server has 4x1TB NVMe.

My question is: should I install Proxmox VE on the first NVMe (no RAID1) and use the remaining 3 for Ceph storage, or should I use 2x NVMe for the Proxmox installation (redundancy) and 2x NVMe for Ceph?

It's obviously overkill to use 2x 1TB for the Proxmox installation, but those are the options I have.

If I install Proxmox on just one NVMe and that drive fails, will I be able to restore the VMs quickly?

thanks
Alex
 
They can run on the other cluster members; the Ceph storage layer still exists on the remaining 3/4. If the server is still operating you may be able to migrate the VMs (?), otherwise HA will start them on another node.

you may want to read through https://forum.proxmox.com/threads/fabu-can-i-use-ceph-in-a-_very_-small-cluster.159671/ if you haven’t seen it.
Thank you sir.

One more question, if you will: how will the I/O performance compare to local NVMe?
The LAN connectivity on OVH is 25 Gbit.
 
Yeah, I guess I'll do some tests and decide.

I may end up using local ZFS and using the cluster for replication.

Kind of a best-of-both-worlds approach.

Thank you sir.
 
I did some initial tests and I'm getting 1600 MB/sec read and 1300 MB/sec write via Ceph (OVH 25G private network), while on local NVMe I'm getting 13000/5500.

That's a gigantic difference. I don't think the LAN is the bottleneck; I should be able to get around 3000 MB/sec over the 25G LAN.

Any ideas how to optimise it further?
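
For what it's worth, the 3000 figure above is just the raw line rate. Here's a rough back-of-envelope in Python (the hop count per replicated write is my own assumption, so treat the write ceiling as a very loose upper bound):

```python
# Back-of-envelope ceiling for the 25G link (rough sketch, not a benchmark).
# Assumption: replication size 3, and in the worst case each acknowledged
# client write crosses the same 25G link about 3 times (client -> primary,
# primary -> 2 replicas). Real traffic depends on PG placement and where the
# primary OSD lives, so these are upper-bound estimates only.
LINK_GBIT = 25                # OVH private network, as above
HOPS_PER_WRITE = 3            # worst-case assumption, see comment

line_rate_mb_s = LINK_GBIT * 1000 / 8          # 25 Gbit/s ~= 3125 MB/s raw
write_ceiling_mb_s = line_rate_mb_s / HOPS_PER_WRITE

print(f"Raw link ceiling:               ~{line_rate_mb_s:.0f} MB/s")
print(f"Rough replicated-write ceiling: ~{write_ceiling_mb_s:.0f} MB/s")
```

So once replication traffic is counted, writes may be sitting closer to the network limit than the raw 3000 number suggests.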
 
Are these enterprise drives?

How much do your VMs actually read/write?
Ah, you mean the real-world usage.
It depends; sometimes I do have bursts.
This is intended for web hosting using the Enhance panel. I use a lot of caching, but I still need the I/O speed, especially for MariaDB.
 
For performance-critical stuff (like databases), neither Ceph nor ZFS is really a good fit.

Especially Ceph: every write has to go over the network to the other nodes, and only when all involved nodes (depending on the Ceph config) report a successful write does Ceph report a successful write back to the application. That's a huge I/O delay compared to local storage.

Even if you don't need the higher bandwidth, the latency / I/O delay would be way better with 100G networking. But I guess that's not an option with OVH.

So if you want your DB to be fast, use local storage with an ext4 or XFS filesystem and do the HA at the application level.
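
If you want to see that delay directly, here is a minimal fsync-latency probe in Python (just a rough sketch, not a replacement for fio; the test file path is a placeholder). It times single-threaded 4 KiB write+fsync calls, which is roughly what a database pays per committed transaction:

```python
# Minimal single-threaded fsync-latency probe (rough sketch, not a real
# benchmark tool). Run it once on a Ceph-backed disk and once on local NVMe
# and compare the medians.
import os, time, statistics

TEST_FILE = "/tmp/fsync_probe.dat"   # placeholder -- point at the storage under test
BLOCK = os.urandom(4096)             # 4 KiB, a typical DB page-sized write
ITERATIONS = 500

latencies_ms = []
fd = os.open(TEST_FILE, os.O_CREAT | os.O_WRONLY, 0o600)
try:
    for _ in range(ITERATIONS):
        t0 = time.perf_counter()
        os.write(fd, BLOCK)
        os.fsync(fd)                 # block until the storage stack acknowledges the write
        latencies_ms.append((time.perf_counter() - t0) * 1000)
finally:
    os.close(fd)
    os.remove(TEST_FILE)

latencies_ms.sort()
print(f"median fsync latency: {statistics.median(latencies_ms):.2f} ms")
print(f"p99    fsync latency: {latencies_ms[int(len(latencies_ms) * 0.99)]:.2f} ms")
```

The median on Ceph vs. local ext4/XFS usually shows the gap much more clearly than MB/sec throughput numbers do.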
 
ZFS, being a COW filesystem with checksums, compression, deduplication, snapshots and many other features, has write amplification (depending on the config / features used). This means that for every single write your DB does, you actually get 4 to 8+ real writes to disk (again depending on config, VM filesystem and other factors), which increases I/O delay and wears out your SSDs faster.
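
To put that 4 to 8+ range into numbers, here is a tiny sketch with made-up figures (every value below is an assumption for illustration, not a measurement):

```python
# Rough SSD wear estimate for an assumed write-amplification factor.
# All numbers are illustrative -- the real factor depends on recordsize /
# volblocksize, sync settings and the guest filesystem.
DB_WRITES_GB_PER_DAY = 50      # hypothetical application-level write volume
WRITE_AMPLIFICATION = 6        # assumed factor, middle of the 4-8+ range above
SSD_ENDURANCE_TBW = 1400       # hypothetical endurance rating of a 1 TB NVMe

physical_gb_per_day = DB_WRITES_GB_PER_DAY * WRITE_AMPLIFICATION
years_to_endurance = SSD_ENDURANCE_TBW * 1000 / physical_gb_per_day / 365

print(f"Physical writes:   ~{physical_gb_per_day} GB/day")
print(f"Endurance reached: ~{years_to_endurance:.1f} years per disk (very rough)")
```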
 
That tuning guide most likely assumes the DB is installed directly on an OS with ZFS storage, not a DB in a VM stored on ZFS with another filesystem inside it ...
 
But there really isn't any other option if someone wants soft-RAID, is there? Only ZFS.
It seems like Btrfs is still not mature enough.
 
Hi friend, I can tell you that I have been studying this topic for months, comparing ZFS and Ceph. The problem in my case is that I have many VMs doing different jobs, about 50 VMs on 3 nodes. I started with HDD disks in ZFS RAIDZ but had a lot of problems with I/O delay. Then I tried low-end SSDs with Ceph, still problems. I then put the SSDs back in support of ZFS as cache and log devices, but nothing changed, so I was stuck like that for a while. Then I bought server SSDs, specifically Samsung PM1643a 7.68TB, put 2 in each node with Ceph, and now all my problems are gone; I/O delay is almost zero at the moment. I still have to increase the total capacity by adding more disks slowly, and I hope I don't run into problems when growing the datastore.
 