New install. Proxmox + Ceph cluster config questions.

Deeh514

New Member
Oct 15, 2024
Greetings Proxmox community!

I'm planning to migrate my VMs, mostly web servers, from VMware to Proxmox sometime next year.
I'm using this opportunity to try Ceph, which I've heard a lot about.

For now, I've built a 3 node cluster setup to test HA and Ceph.
Proxmox version: 8.2.7
Ceph version: 18.2.4 reef (stable)

Each server has:
- 2x 500 GB SSDs in RAID1 for the PVE install
- 2x 1 TB SSDs for Ceph, each drive exported as a single-disk RAID0
- 4x 1G ports: 1 for management, 2 (LACP) for VMs
- 4x 10G ports: 2 (LACP) for the Ceph cluster network, 2 (LACP) for the Ceph public network
- All 10G traffic is configured with jumbo frames
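For reference, a bond with LACP and jumbo frames on Proxmox (ifupdown2) might look like the sketch below; the NIC names, address, and subnet are placeholders, not taken from this setup.

```shell
# /etc/network/interfaces (excerpt) -- illustrative only; interface
# names and addresses are assumptions, adjust to your hardware.
auto bond1
iface bond1 inet static
    address 10.10.10.11/24            # Ceph cluster network (example)
    bond-slaves enp65s0f0 enp65s0f1   # 2x 10G ports
    bond-mode 802.3ad                 # LACP
    bond-xmit-hash-policy layer3+4
    mtu 9000                          # jumbo frames, must match switch config
```

Jumbo frames only help if every hop (NICs, bond, bridge, switch ports) carries the same MTU end-to-end.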

Created a manager, monitor, OSDs, and an MDS on each node.
Created the Ceph pool "my-ceph-pool" and the CephFS "my-cephfs".
Added "mds_standby_replay = true" to ceph.conf under each MDS section.
Set the HA shutdown_policy to migrate.
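A sketch of CLI equivalents for those steps, assuming the pool and filesystem names above; verify the exact syntax against your Proxmox 8.2 / Ceph Reef documentation.

```shell
# Illustrative only -- names from the post, everything else assumed.
pveceph pool create my-ceph-pool     # RBD pool for VM disks
ceph fs volume create my-cephfs      # creates the CephFS plus its data/metadata pools

# The HA shutdown policy lives in /etc/pve/datacenter.cfg, e.g.:
#   ha: shutdown_policy=migrate
```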

VM setup:
-Two Rocky 9 VMs stored on "my-ceph-pool".
-Each VM has access to the 10G LACP bond for CephFS.
-Inside each VM, "my-cephfs" is mounted at /mnt/cephfs.
-HA is enabled on each VM.
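For context, a kernel-client mount of "my-cephfs" inside a client VM might look like this; the monitor addresses, the CephX user "cephfs-client", and its secret file path are placeholders, not details from this setup.

```shell
# Illustrative CephFS mount -- monitor IPs, user name and secret file
# are assumptions; substitute your own cluster's values.
sudo mount -t ceph 10.10.20.1:6789,10.10.20.2:6789,10.10.20.3:6789:/ /mnt/cephfs \
    -o name=cephfs-client,secretfile=/etc/ceph/cephfs-client.secret,fs=my-cephfs

# Persistent variant in /etc/fstab (one line), with _netdev so the
# mount waits for the network:
# 10.10.20.1:6789,10.10.20.2:6789,10.10.20.3:6789:/ /mnt/cephfs ceph name=cephfs-client,secretfile=/etc/ceph/cephfs-client.secret,fs=my-cephfs,_netdev 0 0
```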

Simulating a graceful reboot/shutdown of a node/Active MDS:
-VMs migrate without any interruptions prior to reboot/poweroff
-CephFS fail-over takes approx 75 secs

Simulating a node/Active MDS power failure:
-VMs on the node take approx 180 seconds to fail-over
-CephFS fail-over takes approx 75 secs

Stopping Active MDS manually
-CephFS fail-over takes approx 3-4 secs

Proxmox HA on the VMs is a bonus; eventually the VMs will have their own master/slave failover mechanism (keepalived) from within.
My main priority is the performance and uptime of the CephFS share.
I'll probably end up moving the VMs to local LVM, getting rid of "my-ceph-pool", and using Ceph strictly for CephFS.
Eventually the cluster will have 13 identical nodes, with 15 or so VMs connecting to the CephFS share.

With all that in mind, I was hoping to get answers to some questions:
1. Is there an option similar to HA's "shutdown_policy: migrate" to put the MDS into "stopped" status prior to reboot/poweroff?
2. Reading Ceph docs I stumbled upon: "mds_reconnect_timeout" (default 45s) and "mds_beacon_grace" (default 15s). Has anyone had experience tweaking those settings? Didn't find much info on these forums.
3. I also read about having multiple MDS per single CephFS, are there any (dis)advantages to it?
4. A similar question for multiple CephFS, is it preferable to have one CephFS share with 8 subfolders or 8 CephFS shares?

I'm still new to Proxmox and Ceph, so any other recommendations and improvements are more than welcome.

Thanks!
 
You could add a standby-replay MDS instance that will be active faster than a cold standby.
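On Reef, standby-replay is typically enabled per filesystem rather than via the old ceph.conf flag; a sketch, using the filesystem name from the thread:

```shell
# Enable standby-replay for the filesystem (Luminous and later):
ceph fs set my-cephfs allow_standby_replay true

# A standby daemon then tails the active MDS journal and appears as
# "standby-replay" in the status output:
ceph fs status my-cephfs
```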

Each CephFS needs at least one active MDS. With 8 CephFS you would need 8 MDS instances. Do you have the capacity to run these?

Multiple MDS per CephFS are needed when the load is too high for just one. This is usually the case with a multitude of filesystem clients that are all very active with a large number of open files.
 
Thank you for my first reply ever on this forum :)

Standby-replay MDS was already enabled during my initial tests, where the active MDS took approx. 75 seconds to fail over.
Yesterday I lowered "mds_reconnect_timeout" and "mds_beacon_grace" to 10 seconds each; the fail-over then took approx. 45 seconds.
I'm still testing this, but so far I haven't had any negative side effects.

I'm expecting CephFS to be busy reading and writing. It will be mounted on 13 client VMs, and the cluster will have 13 MDS daemons.
From what I'm reading, having 1 CephFS with subfolders and 2-3 active MDS is the preferred method: it's flexible and easier to manage.
I'm curious how it compares performance-wise to having multiple CephFS.
In terms of capacity, it will be 13 HP DL360 servers (248GB RAM, 32 cores, 8x SSDs).
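The timeout changes described above could be applied at runtime roughly like this (a sketch; values from the post). Note that an overly low beacon grace can cause spurious failovers if an MDS is merely slow under load, so test carefully.

```shell
# Illustrative: lower the MDS failover-related timeouts cluster-wide.
# Depending on your release, mds_beacon_grace may also need to be set
# for the mon daemons, since the monitors enforce it.
ceph config set mds mds_beacon_grace 10
ceph config set mds mds_reconnect_timeout 10

# Verify the active values:
ceph config get mds mds_beacon_grace
ceph config get mds mds_reconnect_timeout
```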
 
I'm still new to Proxmox and Ceph, so any other recommendations and improvements are more than welcome.
A 3-node config is the absolute minimum for Ceph. I don't consider 3 nodes to be good for production, or a good predictor of real performance. On top of that, 2 OSDs per node is far too few: there is no granularity for I/O, which means all your I/Os vie for the attention of the same mechanisms. Lastly, avoid passing your disks through your RAID controller; use passthrough (HBA/IT mode) if available.

CephFS with three nodes is extremely wasteful. You'd be better off just putting your filestore on an external NAS; it will perform better and you can get better utilization from your disks (CephFS on a 3:2 replicated pool yields 33% storage efficiency). CephFS starts to make sense when you have 6+ nodes with a lot of drives.
 
Thanks, I'll keep it in mind.
The 3-node setup is purely for testing; the production cluster will have at least 13 nodes, each with 6 to 8 OSDs (haven't decided yet) on a dedicated 10G LACP VLAN just for the Ceph cluster network.

I realize that with CephFS, the more nodes, the better the performance. Is it safe to assume that the fail-over time will decrease as well?
 
Is it safe to assume that the fail-over time will decrease as well?
what.... failover?

I guess I missed this part of your question. Ceph is multi-headed; as long as you don't have a service with only one daemon, there should be no "failover" to speak of. If you only have one active MDS daemon, there will naturally be a "failover" period to a standby; the amount of time that takes depends on how many operations you have in flight and on the available bandwidth between the server hosting the active MDS and the standby MDS.

The solution to this is simple- have multiple active mds daemons.

https://ceph.com/community/new-luminous-multiple-active-metadata-servers-cephfs/
 
Thanks, that's really helpful.
I guess I'll have to expand my test cluster and restart the tests with 2 active MDS.
Just to be sure, is this the command to increase the number of active MDS?
ceph fs set $my_cephfs max_mds 2
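That is the right command. A hedged sketch of how a full round-trip might look, using the filesystem name from the thread in place of the variable:

```shell
# Raise the number of active MDS ranks for the filesystem:
ceph fs set my-cephfs max_mds 2

# Verify that rank 1 became active and at least one standby remains:
ceph fs status my-cephfs

# To return to a single active MDS later, lower max_mds again and
# Ceph will stop the extra rank:
ceph fs set my-cephfs max_mds 1
```

Keep in mind that with multiple active ranks you generally want one standby per active MDS for the failover behavior discussed earlier.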
 