How do I limit the allocated disk space for CephFS?

emptness

Hello everybody!
I need to provide shared disk space for multiple VMs on a Proxmox VE + Ceph cluster. VM disk storage is already deployed on Ceph pools (HDD and SSD pools).
I want to use CephFS on the hdd pool for these purposes.
Please tell me, how good is this solution?
And most importantly, how to limit the CephFS disk space provided to the VM?
As far as I understand, a separate pool will be created for CephFS, which I can configure to distribute data only on HDD drives. Is there any way to explicitly specify that this pool should take only a specific number of GBs?
 
To add to @Philipp Hufnagl

For VM and LXC container disk images, use RBD. It is the layer that provides block devices on top of Ceph's object store. Create a new pool and make sure the "Add Storage" checkbox is enabled so that the matching storage configuration for Proxmox VE is added automatically. Assign the HDD and SSD rules if you want the RBD pools to be stored on specific device classes.

The CephFS is intended to store files, like ISOs, maybe backups. Though, it is not a good idea to have backups located only on the same machines as the guests. CephFS depends on the MDS (metadata server) to provide the clustered file system functionality. The MDSs work in active-standby mode. If an MDS dies, a standby needs to take over. But first it needs to catch up, and that can take a few moments. I have seen reports where it might even take a few minutes in large clusters. Definitely not something you want to store your disk images on!
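
If it helps, the device-class specific rules mentioned above can be created and assigned roughly like this (a sketch; the rule and pool names are just examples):

# create one replicated CRUSH rule per device class
ceph osd crush rule create-replicated replicated_hdd default host hdd
ceph osd crush rule create-replicated replicated_ssd default host ssd
# assign a rule to an existing pool ("vm-ssd" is a placeholder name)
ceph osd pool set vm-ssd crush_rule replicated_ssd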
 
Thanks for the reply.
But you misunderstood me.
VM disk images are stored on RBD.
I also need to connect shared storage to several VMs, so that several guest OSes can work with files on the same storage at the same time.
I figured out how to limit the size of pools: you can set a quota for a pool.
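
For reference, this is roughly how it can be done (a sketch; the pool name cephfs_data, the directory path and the sizes are just examples):

# limit the CephFS data pool to 500 GiB
ceph osd pool set-quota cephfs_data max_bytes 536870912000
# check the current quota
ceph osd pool get-quota cephfs_data
# alternatively, limit a single directory inside an already mounted CephFS to 100 GiB
setfattr -n ceph.quota.max_bytes -v 107374182400 /mnt/cephfs/shared
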
But now I have another problem.
I can't connect the shared space to a VM. I tried to do this using the kernel client and using ceph-fuse, but nothing works. CephFS does not mount in the guest OS and returns an error:
kernel: [10713946.415107] ceph: No mds server is up or the cluster is laggy
 
I have solved this problem too!
To connect an external client to Ceph, the monitors and MDS must be deployed on a public network reachable by the client; that is, the VM network and the Ceph public network must be the same.
It does not connect in any other way.
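
For anyone following along, mounting from inside a guest then looks roughly like this (the monitor addresses, the client name and the secret file path are assumptions):

# kernel client
mount -t ceph 10.10.10.1,10.10.10.2,10.10.10.3:/ /mnt/cephfs -o name=vmclient,secretfile=/etc/ceph/vmclient.secret
# or with ceph-fuse (reads /etc/ceph/ceph.conf and the matching keyring)
ceph-fuse -n client.vmclient /mnt/cephfs
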
It's kind of weird. Then why do I need CephFS in Proxmox? Only for ISO images?
 
It should also work if the Ceph Public network is routed and the VMs can reach it that way. Performance will obviously be best if they can access the Ceph Public network directly.
It's kind of weird. Then why do I need CephFS in Proxmox? Only for ISO images?
Or container templates, to store them within the cluster and have them available on each node without the need for some external network share. Backups do work too, but it should not be the only place where they are stored (3-2-1 backup strategy).

Alternative use cases do happen, like what you wanted to achieve (and we misunderstood): using it directly as a shared file system between other machines, either VMs in the cluster or completely separate ones. I have seen a customer who stores the VMs on local storage on the nodes and uses Ceph only for CephFS, connecting Windows clients to it.
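
For reference, the matching storage definition that Proxmox VE creates for a CephFS ends up in /etc/pve/storage.cfg and looks roughly like this (a sketch):

cephfs: cephfs
        path /mnt/pve/cephfs
        content iso,vztmpl,backup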
 
Glad to hear it works)
I would like to hear your opinion.
My scheme: 10.10.10.0/24 is the Ceph cluster network. The same network is the cluster network for PVE.
192.168.60.0/24 is the public network of the PVE cluster, for client access to the VMs.
On the 10.10.10.0/24 network, I also placed the Ceph monitors (the Ceph cluster and public networks are the same).

The fact is that the Ceph cluster network is connected to a more reliable stack of switches, and I thought that the monitors on it would be more accessible, which would increase the availability of the cluster.
But now I realize that I was wrong.
Tell me, how should the Ceph networks be properly configured, in your opinion?
 
First off, the Ceph Public network is mandatory. The Ceph Cluster network is optional and, if configured, will be used for the replication traffic between the OSDs, taking away quite a bit of load from the Ceph Public network.
Both networks need to be fast (clients, e.g. VMs, accessing the cluster via the Ceph Public network) and reliable.
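
As a sketch, the two networks are defined in the [global] section of /etc/pve/ceph.conf; with the subnet from this thread (and a hypothetical separate replication subnet) that would look like:

[global]
        # used by clients, MONs, MDS and OSDs (mandatory)
        public_network = 10.10.10.0/24
        # optional, OSD replication traffic only (hypothetical second subnet)
        cluster_network = 10.10.20.0/24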

The network used for the Proxmox VE cluster (corosync) doesn't need to have a lot of bandwidth, but low latency is very important. Corosync can handle up to 8 networks by itself and switch if one network is deemed unusable.
It is best practice to give Corosync at least one dedicated physical network (1 Gbit is usually enough) for itself. This way, there won't be interference if another service is taking up all the available bandwidth. Configuring additional networks is a good idea to give Corosync fallback options, should there be issues with the dedicated one.
See the docs (https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy) on how to add additional networks to the Corosync config.
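
A minimal sketch of the relevant parts of /etc/pve/corosync.conf with a redundant link (node name and addresses are assumptions):

nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.30.1     # dedicated Corosync network
    ring1_addr: 192.168.60.11  # fallback link on another physical network
  }
  # ... one node block per cluster node
}
totem {
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  # ... remaining totem options stay unchanged
}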

Using only one Corosync network, and having that shared with Ceph can quickly lead to issues. How bad they are, depends. Do you use HA? Then the results can be quite catastrophic.
If Ceph starts using up all the available bandwidth, the latency for Corosync goes up. Maybe to the point where it considers the network unusable. If it cannot switch to another network, the Proxmox VE cluster connection is lost -> the node is not part of the quorum anymore -> /etc/pve/ is read-only.
That means, any action that wants to write there will fail, for example changing some configs, starting a guest and so forth.

If the node has HA guests, the LRM will be in active mode. That increases the severity. The HA stack uses Corosync to determine if a node is still part of the cluster. If the node lost the connection for ~1 minute and the LRM is active (due to HA guests running on the node), it will fence itself (hard reset). It does that, to make sure that the HA guests are definitely powered down, before the (hopefully) remaining cluster will start these guests on other nodes.

If Ceph is the reason for the lost Corosync connection, it is likely that all nodes are affected. The result would be, that the whole cluster (if all nodes had HA guests running) will do a hard reset.

So, give Ceph good reliable networks for both, Cluster and Public, and give Corosync options to switch physical networks and ideally its physical network. :)
 
Thanks for the detailed answer!
That is, the network scheme should be like this:
1 physical isolated network for corosync (preferably 2 network cards in aggregation).
1 physical isolated ceph cluster network (2 network cards in aggregation).
1 physical isolated ceph public network (2 cards in aggregation)
and a separate network card for client access to the VM?
With this configuration, how do I connect CephFS to the VM guest OS? After all, the VMs do not see the Ceph public network!

And tell me, what is this parameter?
[attached screenshot of the Ceph config setting in question]

Don't be angry with me :) I honestly try to search the documentation first and then Google, but I could not find any information about what this parameter is for. Even if you just give me a link where its purpose is described in detail, I will be very grateful!
 
That is, the network scheme should be like this:
1 physical isolated network for corosync (preferably 2 network cards in aggregation).
1 physical isolated ceph cluster network (2 network cards in aggregation).
1 physical isolated ceph public network (2 cards in aggregation)
and a separate network card for client access to the VM?
Well, the more physical networks the better. But sometimes that is not possible, and then certain compromises need to be made. Maybe the Ceph Cluster network isn't needed because the Public network has enough bandwidth for the cluster to perform well.

Corosync doesn't need LAG/bonds as it can switch itself between networks. So one single NIC dedicated for Corosync alone, with additional links on the other networks as fallback, can work too if you cannot spare that many NICs.

Depending on the network and bandwidth needs of the guests, you might run the MGMT and VM traffic on the same network. If you require more security, you can separate them with VLANs, for example.

If you really have a lot of NICs to spare, you could have a dedicated migration network between the nodes ;)

With this configuration, how do I connect CephFS to the VM guest OS? After all, the VMs do not see the Ceph public network!
There are a few options I can imagine. If you trust the VMs, you could configure the IP address for the Ceph Public network not directly on the bond, but create a vmbr interface on top of the bond and set the IP address there. Then you could give the VMs a second NIC that is using that new vmbr interface. Give the VMs IPs in the same network.
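
A sketch of what that could look like in /etc/network/interfaces on a node (interface names and addresses are assumptions):

auto bond1
iface bond1 inet manual
        bond-slaves ens1f0 ens1f1
        bond-miimon 100
        bond-mode 802.3ad

auto vmbr1
iface vmbr1 inet static
        address 10.10.10.11/24
        bridge-ports bond1
        bridge-stp off
        bridge-fd 0
# the Ceph Public IP now lives on vmbr1 instead of directly on the bond,
# and the VMs get a second virtual NIC attached to vmbr1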

The other would be, to enable your router/firewall to connect to the Ceph Public network. Then traffic from the VMs can be routed to the Ceph public network.

"claster" looks wrong ;) And it should not be needed for the MONs. Did you add it yourself to the config file?
 
The fact is that we have 100 Gbit switches assembled into a cluster just for Ceph replication. These switches are isolated from all other networks and are used only for the Ceph cluster. Right now both the Ceph cluster and public networks are connected to these switches.
There are also the switches of the shared LAN, but they are only 1 Gbit. Clients connect to the VMs through these switches.

I assumed that the cluster network was the most important one for Ceph; I did not understand the documentation correctly. That's why I'm struggling :)

The idea of connecting the vmbr of the Ceph public network to the VMs also came to my mind. But for some reason, a VM does not see any monitors besides the one located on the same PVE host. I can't figure out why :(

I found this parameter on Google :) It must be removed :) But it is definitely applicable to OSDs. Can you tell me why?
 
The idea of connecting the vmbr of the Ceph public network to the VMs also came to my mind. But for some reason, a VM does not see any monitors besides the one located on the same PVE host. I can't figure out why :(
Can you post the network config of the host, /etc/network/interfaces and the config of such a VM? qm config {vmid}
 
I found this parameter on Google :) It must be removed :) But it is definitely applicable to OSDs. Can you tell me why?
OSDs are the only service making use of the cluster network. Everything else is using the public one. Therefore, for OSDs it is a valid config option.
 
Can you post the network config of the host, /etc/network/interfaces and the config of such a VM? qm config {vmid}
Figured it out. The problem was with LXC. When I brought up a VM and attached a network card from the Ceph cluster network, the network started working normally.

From the documentation, I understood that the Ceph public network (the network of the monitors) should be available to clients so that we can mount CephFS over the network. But at the same time, I often find information that the monitors must be on a separate network in order to maintain the state of the Ceph cluster. And then I start to get confused :)
 
From the documentation, I understood that the Ceph public network (the network of the monitors) should be available to clients so that we can mount CephFS over the network. But at the same time, I often find information that the monitors must be on a separate network in order to maintain the state of the Ceph cluster. And then I start to get confused :)
Do you have links to where you found that?

Overall, the Ceph Public network is used for everything in the Ceph cluster, including the clients accessing it. Guests that have their disk images stored in Ceph are clients too. The optional Ceph Cluster network is, if configured, only used by the OSDs for the replication traffic.
 
I mean, the public network should be stable and have low latency.
Won't the traffic of clients connecting to the VMs get in the way of that? For example, when something is uploaded to or downloaded from a guest's file system.
 
Well, it needs to be able to handle all of that. That is why the optional Cluster network exists: to have the option to move quite a lot of load to a different physical network.

The MONs are the first thing a client or Ceph service contacts when it connects to the cluster, to get a full overview of the current state and of which services are up and running where. So the MONs need to be reachable by every Ceph service and client. Once that is done, the clients and services talk directly to each other for the most part.
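
For an external client such as a VM, the minimal pieces it needs are therefore the MON addresses and a key, roughly like this (addresses and client name are assumptions):

# /etc/ceph/ceph.conf inside the client
[global]
        mon_host = 10.10.10.1,10.10.10.2,10.10.10.3
# plus a keyring file, e.g. /etc/ceph/ceph.client.vmclient.keyring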
 
I have a cluster of 100 Gbit switches with bonds connected to it (2 connections from each server); this is the Ceph cluster network for replication, and I'm not worried about this network (at the moment the monitors are connected to the same network, so public and cluster are the same).
The main LAN also consists of a cluster of switches, and the PVE servers are connected to it with bonds as well, but only at 1 Gbit. We connect to the VMs through this network, and it also carries other servers and services.
I want to make this 1 Gbit network the Ceph public network. Do you think this is a bad idea?
 
I want to make this 1 Gbit network the Ceph public network. Do you think this is a bad idea?
If I understand correctly, the VMs need to connect to the Ceph cluster's public network to use CephFS? Then the VMs need to be able to establish a connection to the Ceph cluster, so you need the cabling and probably the routing set up accordingly. If you configure the router/firewall between these networks, you can at least limit who can open connections in which direction. For example, only the VMs that really need it are allowed to access the Ceph Public network. There should be no need to allow connections in the other direction.
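
If that router/firewall happens to be Linux based, a sketch of such a restriction could look like this (iptables syntax; the VM address is a placeholder, the Ceph ports are MON 3300/6789 and OSD/MDS 6800-7300):

# allow one specific VM to reach the Ceph Public network
iptables -A FORWARD -s 192.168.60.50 -d 10.10.10.0/24 -p tcp -m multiport --dports 3300,6789,6800:7300 -j ACCEPT
# block everything else going to the Ceph Public network
iptables -A FORWARD -d 10.10.10.0/24 -j DROP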
 
You misunderstood me. I want to move the network of the monitors to the 1 Gbit network through which users connect to the VMs.
 
