HA NFS service for KVM VMs on a Proxmox Cluster with Ceph

Rainerle

Renowned Member
Jan 29, 2019
Hi,
we are migrating from a VMware ESXi setup with a NetApp NFS based shared storage.

We also used NFS filesystems for mounts like /home or /root, for application filesystems like a shared /var/www within our virtual machines, and for host-specific filesystems like /var/log.

Most of our VM-specific filesystems have been migrated to disks attached to the virtual machines by now, but we are still wondering how to migrate our shared filesystems.

Ideas:
  1. Mount CephFS from the Proxmox VE Ceph storage (is this supported?) and fiddle around with the Cephx keys and such (see the mount sketch after this list)
  2. Create three VMs, install Ceph and CephFS on top of the Ceph RBD disks of Proxmox VE and mount CephFS within the VMs
  3. Create a single Debian Container on one of the nodes, install NFS kernel server there and export Cephfs directories. No HA and a downtime if we have to service the Proxmox VE host.
  4. Create a single Debian KVM VM, install NFS kernel server and export a disk. No HA - if we reboot the VM the NFS clients stall on the shared filesystem.
  5. Create two Debian KVM VMs, one disk for OS each and one shared disk between them. On that shared disk use OCFS2 and create a heartbeat active-passive NFS server.
  6. Create three Debian KVM VMs or CTs, install NFS Ganesha there and export CephFS directories over these.
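For idea 1, a minimal sketch of what a direct CephFS mount with a dedicated cephx client could look like (the client name, path, monitor addresses and secret file are placeholders, not part of our setup):
Code:
# create a restricted cephx client for the shared subtree (hypothetical name and path)
ceph auth get-or-create client.vmshare mon 'allow r' mds 'allow rw path=/vol' osd 'allow rw pool=cephfs_data' -o /etc/ceph/ceph.client.vmshare.keyring

# mount the subtree inside a VM using the kernel CephFS client
mount -t ceph 10.0.0.1,10.0.0.2,10.0.0.3:/vol /mnt/vol -o name=vmshare,secretfile=/etc/ceph/vmshare.secret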
Has anybody done something like this before? Any experiences to share?

Best regards
Rainer
 
I'm running an NFS server in HA with Proxmox. My setup is:
- The VM is CentOS7, the network is 10Gb Ethernet.
- The shared storage is /home.
- The VM disk is on the Ceph storage shared by my 3 nodes, with 3 replicas for the data.
- The data disk for the NFS share is on a different storage (added as a logical disk) and backup is disabled for it.
- The VM is backed up 3 times a week.
- The shared storage is backed up with BackupPC every day on a per-user basis (so they can restore any lost file themselves).

With this setup, migrating the VM between hosts is immediate, and with HA the downtime is really short. It has been running for several years without any problems.
 
Hi @samontetro ,
so what happens if you want to reboot the CentOS7 VM?
- Do your NFS clients stall during that time?
- Do your NFS clients just reconnect?

From my point of view you have a Single Point of Failure with that single VM.

Thanks for your message though.

Rainer
 
How much downtime is acceptable? And do you need the NFS server at all (e.g. if all clients are Linux)?
 
Hi @Alwin,

we are currently still running all of our Debian Linux VMs as PXE-booted diskless NFS-root machines. We have all applications (in a disabled state) installed into one image, create a snapshot and assign that readonly snapshot to the VMs using DHCP. Based on the hostname a config file is assigned, and from that boot scripts mount NFS/ZFS/ext4 filesystems, bind-mount config files and enable applications using systemctl.

So currently we are very concerned about HA, with takeover times below a 5-second window. We previously replaced a ZFS-based Sun storage with NetApp since we experienced around 5 minutes of takeover time there...

Our future could be booting from a prepared ISO sitting on CephFS, or just an RBD OS disk that is snapshotted, protected, cloned and shared between many VMs.

So NFS would in future hold /home, /root, /var/www and other files shared between members of a farmed or clustered application. So a failover time of 10 seconds could be acceptable.

I have been looking into using the CephFS coming with Proxmox, but fear losing enterprise support if we fiddle around with the setup.

Best regards
Rainer
 
we are currently still running all of our Debian Linux VMs as PXE-booted diskless NFS-root machines. We have all applications (in a disabled state) installed into one image, create a snapshot and assign that readonly snapshot to the VMs using DHCP. Based on the hostname a config file is assigned, and from that boot scripts mount NFS/ZFS/ext4 filesystems, bind-mount config files and enable applications using systemctl.
This could probably be done with a VM template together with the snapshot disk option (contrary to its naming), if it should not have a state.
snapshot=<boolean>
Controls qemu’s snapshot mode feature. If activated, changes made to the disk are temporary and will be discarded when the VM is shutdown.
OFC, this will only be a good option if the VM isn't running for a long time, since the temporary image is stored in /var/tmp/ (hardcoded) :/.

If state is acceptable then linked clones could help.
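A rough sketch of both variants, with placeholder VMIDs and storage names:
Code:
# linked clone from a VM template (state is kept on a thin clone)
qm clone 9000 101 --name web01

# or mark the OS disk as throw-away via the snapshot drive option quoted above;
# changes are then discarded when the VM shuts down
qm set 101 --scsi0 cephpool:vm-101-disk-0,snapshot=1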

So currently we are very concerned about HA, with takeover times below a 5-second window. We previously replaced a ZFS-based Sun storage with NetApp since we experienced around 5 minutes of takeover time there...
Well, Proxmox VE needs ~2min to start recovery. So the HA needs to be done a level higher.
https://pve.proxmox.com/pve-docs/chapter-ha-manager.html
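For reference, a guest is put under the built-in HA stack roughly like this (a sketch, vm:100 being a placeholder), though as said its ~2 min recovery is far from a 5 second takeover target:
Code:
# manage a guest with the Proxmox VE HA stack
ha-manager add vm:100 --state started

# show the state of all HA-managed resources
ha-manager status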

So NFS would in future hold /home, /root, /var/www and other files shared between members of a farmed or clustered application. So a failover time of 10 seconds could be acceptable.
Ok, that still requires some form of network storage.

But all in all, this would make a small OS image feasible: mostly just the kernel to boot, loading everything else directly from CephFS without any intermediate proxy.

And if you go further, you could think about skipping the kernel altogether and using a container instead, if live migration isn't needed.

I have been looking into using the CephFS coming with Proxmox, but fear losing enterprise support if we fiddle around with the setup.
Well, that depends on what you are doing with it. Using one (or more) VMs to host NFS Ganesha (or another NFS server) that accesses the CephFS shouldn't be any problem, since no extra services are needed on the Proxmox VE host.
https://github.com/nfs-ganesha/nfs-ganesha/wiki/NFS-Ganesha-and-High-Availability
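For orientation, a minimal CephFS export block in ganesha.conf might look roughly like this (a sketch; Export_Id, paths and the cephx user are placeholders, and the option names should be checked against the NFS-Ganesha documentation):
Code:
EXPORT
{
        Export_Id = 100;
        Path = "/vol";
        Pseudo = "/vol";
        Access_Type = RW;
        Protocols = 3,4;
        Squash = No_Root_Squash;
        FSAL {
                Name = CEPH;
                User_Id = "ganesha";
        }
}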
 
@alexskysilk, by using an NFS-based readonly image I just create a DHCP entry and boot directly over the network from the NFS server.
Maybe I did not make myself properly clear on how our current setup works.
https://ltsp.org/ is a project where they use that concept for clients. We use such a setup for our servers.

Installing a system is the easy part. Maintaining and having a trusted reliable environment is the difficult part.
 
I am currently looking into the NFS Ganesha keepalived active/passive two-VM path. Adding additional cephx client authorizations on the Proxmox VE Ceph storage does not void the enterprise support, right?
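For the active/passive part, a minimal keepalived instance per node could look roughly like this (a sketch; interface, virtual_router_id, priority and the virtual IP are placeholders):
Code:
vrrp_instance NFS_VIP {
    state MASTER                 # BACKUP on the passive node
    interface eth1               # storage network interface (placeholder)
    virtual_router_id 51
    priority 150                 # use a lower priority on the passive node
    advert_int 1
    virtual_ipaddress {
        10.0.0.10/24             # floating NFS service IP (placeholder)
    }
}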
 
Adding additional cephx client authorizations on the Proxmox VE Ceph storage does not void the enterprise support, right?
Why should it? :)
 
Hi @samontetro ,
so what happens if you want to reboot the CentOS7 VM?
- Do your NFS clients stall during that time?
- Do your NFS clients just reconnect?

From my point of view you have a Single Point of Failure with that single VM.

Thanks for your message though.

Rainer
Yes, rebooting the CentOS7 VM freezes the clients (I use hard mounts for this). But rebooting an NFS-server-only VM is fast. And I get long uptimes, as I can migrate the VM to another Proxmox server without downtime for maintenance purposes.
Patrick
 
So I am following down this path now:
- On the 5 production nodes install 5 minimal CTs with NFS-Ganesha on Debian
Code:
root@nfsshares-a:~# grep '^[[:blank:]]*[^[:blank:]#;]' /etc/ganesha/ganesha.conf
NFS_CORE_PARAM
{
        Enable_NLM = false;
        Enable_RQUOTA = false;
        Protocols = 3,4;
        mount_path_pseudo = true;
}
NFS_KRB5
{
        Active_krb5 = false;
}
NFSv4
{
        RecoveryBackend = rados_ng;
        Minor_Versions =  1,2;
}
MDCACHE {
        Dir_Chunk = 0;
}
EXPORT_DEFAULTS {
        SecType = "sys";
        Squash = No_Root_Squash;
        Attr_Expiration_Time = 0;
}
CEPH
{
}
RADOS_KV
{
        UserId = "ganesharecov";
        nodeid = "nfsshares-a";
}
RADOS_URLS
{
        UserId = "ganeshaurls";
        watch_url = "rados://nfs-ganesha/ganesha-namespace/conf-nfsshares";
}
%url    rados://nfs-ganesha/ganesha-namespace/conf-nfsshares
- Add three Ceph clients (see the sketch after this list)
Code:
root@proxmox07:~# cat /etc/pve/priv/ceph.client.ganesha*
[client.ganesha]
        key = ???
        caps mds = "allow r path=/, allow rw path=/vol"
        caps mon = "allow r"
        caps osd = "allow class-read object_prefix rbd_children, allow rw pool=cephfs_data"
[client.ganesharecov]
        key = ???
        caps mon = "allow r"
        caps osd = "allow class-read object_prefix rbd_children, allow rw pool=nfs-ganesha"
[client.ganeshaurls]
        key = ???
        caps mon = "allow r"
        caps osd = "allow class-read object_prefix rbd_children, allow rw pool=nfs-ganesha"
- Configure the NFS exports in the Ceph dashboard
- Run keepalived on the 5 CTs and operate it using 1 active and 4 passive nodes
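The cephx clients and the RADOS export object referenced by the %url line could be created roughly like this (a sketch; the caps follow the listing above, the exports.conf file name is a placeholder, and the last step is only needed if the exports are not maintained via the dashboard):
Code:
# cephx clients for CephFS access, the recovery backend and the RADOS URLs
ceph auth get-or-create client.ganesha mds 'allow r path=/, allow rw path=/vol' mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rw pool=cephfs_data'
ceph auth get-or-create client.ganesharecov mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rw pool=nfs-ganesha'
ceph auth get-or-create client.ganeshaurls mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rw pool=nfs-ganesha'

# upload the export configuration into the object referenced by the %url line
rados -p nfs-ganesha -N ganesha-namespace put conf-nfsshares exports.conf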

Problem arising:
https://lists.nfs-ganesha.org/archi...org/message/FN5QET65A6C3EOWGZTAQ6CKH5UKHGESP/

So NFS Ganesha needs to be built against the libcephfs used on the CTs, which I think should be the same as the one used on the PVE nodes.

@Alwin, could you guys be so kind as to add nfs-ganesha to your Ceph repository and build it whenever you build a new Ceph version? That would save some CPU cycles world-wide, as I believe I am not the only one going down that rabbit hole...

Best regards and a healthy new year
Rainer
 
So NFS Ganesha needs to be built against the libcephfs used on the CTs, which I think should be the same as the one used on the PVE nodes.
The libcephfs in the CT can be very different from the PVE one. All depends on the Ceph version used.

@Alwin, could you guys be so kind as to add nfs-ganesha to your Ceph repository and build it whenever you build a new Ceph version? That would save some CPU cycles world-wide, as I believe I am not the only one going down that rabbit hole...
NFS-Ganesha is another project and not part of Ceph.
 
I am building NFS-Ganesha now using a Docker container and the Debian build tools.
Code:
rstumbaum@controlnode01.dc1:~/docker-nfs-ganesha-build$ cat Dockerfile
ARG DEBIAN_RELEASE="buster"
ARG CEPH_RELEASE_PVE="nautilus"

FROM debian:${DEBIAN_RELEASE} AS build-env

ARG DEBIAN_RELEASE
ARG CEPH_RELEASE_PVE

ADD http://download.proxmox.com/debian/proxmox-ve-release-6.x.gpg /etc/apt/trusted.gpg.d/proxmox-ve-release-6.x.gpg
RUN chmod 644 /etc/apt/trusted.gpg.d/proxmox-ve-release-6.x.gpg &&\
    echo "deb http://download.proxmox.com/debian/ceph-${CEPH_RELEASE_PVE} ${DEBIAN_RELEASE} main" >/etc/apt/sources.list.d/ceph.list &&\
    echo "deb http://deb.debian.org/debian ${DEBIAN_RELEASE}-backports main" >/etc/apt/sources.list.d/backports.list &&\
    echo "deb-src http://ftp.de.debian.org/debian/ sid main contrib non-free" >//etc/apt/sources.list.d/sid-src.list

RUN apt-get update
RUN apt-get install -y build-essential git-buildpackage cmake bison flex doxygen lsb-release pkgconf \
                       nfs-common ceph-common
RUN apt-get install -y -t ${DEBIAN_RELEASE}-backports libglusterfs-dev

WORKDIR /build

RUN apt-get build-dep -y libntirpc-dev &&\
    apt-get source -y --compile libntirpc-dev &&\
    dpkg -i *.deb

RUN apt-get build-dep -y nfs-ganesha &&\
    apt-get source -y --compile nfs-ganesha

VOLUME /export
rstumbaum@controlnode01.dc1:~/docker-nfs-ganesha-build$ docker build . -t nfs-ganesha-debs
.......

After building the image I just run
Code:
docker run -v $HOME/docker-nfs-ganesha-build/export:/export -it nfs-ganesha-debs bash
and copy the build results from inside the container to /export
Code:
root@c800bf9a0614:/build# cp -a *.deb /export/
root@c800bf9a0614:/build# exit
exit

Then I install these onto my LXC container, which is unrestricted and has lxc.seccomp.profile empty.
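A sketch of what the relevant part of such a CT config could look like (the VMID and most values are hypothetical; only the last line reflects the empty seccomp profile mentioned above):
Code:
# /etc/pve/lxc/101.conf (excerpt, hypothetical)
arch: amd64
hostname: nfsshares-a
unprivileged: 0
# raw LXC key: run without a seccomp filter, as described above
lxc.seccomp.profile: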

This is my ganesha.conf, which works for NFSv3 and NFSv4 exports - except (hopefully only for now) for .snap Ceph snapshot directories:
Code:
root@nfsshares-a:~# egrep -v '^[[:blank:]]*#|^[[:blank:]]*$' /etc/ganesha/ganesha.conf
NFS_CORE_PARAM
{
        Enable_NLM = false;
        Enable_RQUOTA = false;
        Protocols = 3,4;
        mount_path_pseudo = true;
}
NFS_KRB5
{
        Active_krb5 = false;
}
NFSv4
{
        RecoveryBackend = rados_cluster;
        Minor_Versions =  1,2;
}
MDCACHE {
        Dir_Chunk = 0;
}
EXPORT_DEFAULTS {
        SecType = "sys";
        Squash = No_Root_Squash;
        Attr_Expiration_Time = 0;
}
CEPH
{
}
RADOS_KV
{
        UserId = "ganesharecov";
}
RADOS_URLS
{
        UserId = "ganeshaurls";
        watch_url = "rados://nfs-ganesha/ganesha-namespace/conf-nfsshares";
}
%url    rados://nfs-ganesha/ganesha-namespace/conf-nfsshares
root@nfsshares-a:~#

The output of the Ganesha NFSv4 recovery backend looks OK:
Code:
root@nfsshares-a:~# ganesha-rados-grace --userid ganesharecov dump
cur=13 rec=0
======================================================
nfsshares-a       
nfsshares-b       
nfsshares-c       
nfsshares-d       
nfsshares-e       
root@nfsshares-a:~#

And if keepalived has to move the IPs around I still have functioning NFS exports.
 
And if keepalived has to move the IPs around I still have functioning NFS exports.
How long does the failover take?

- Run keepalived on the 5 CTs and operate it using 1 active and 4 passive nodes
This seems like an opportunity to put a load balancer in front. And distribute the clients onto those NFS servers.
https://www.haproxy.com/support/technical-notes/an-0052-en-nfs-high-availability/
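A TCP-mode sketch of that idea, assuming NFSv4.1+ clients (so only port 2049 is involved) and placeholder addresses:
Code:
# /etc/haproxy/haproxy.cfg (excerpt, hypothetical)
frontend nfs_in
    bind 10.0.0.10:2049
    mode tcp
    default_backend nfs_servers

backend nfs_servers
    mode tcp
    balance source               # keep each client pinned to one NFS server
    server nfsshares-a 10.0.0.11:2049 check
    server nfsshares-b 10.0.0.12:2049 check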

Then I install these onto my LXC container, which is unrestricted and has lxc.seccomp.profile empty.
What's the reason it doesn't work in unprivileged mode?
 
How long does the failover take?
From the NFS client it is barely noticeable. I currently run a cron.d reboot script like this
Code:
1-59/5 * * * * root hostname | grep -qE 'nfsshares-a' && /bin/systemctl reboot
2-59/5 * * * * root hostname | grep -qE 'nfsshares-b' && /bin/systemctl reboot
3-59/5 * * * * root hostname | grep -qE 'nfsshares-c' && /bin/systemctl reboot
4-59/5 * * * * root hostname | grep -qE 'nfsshares-d' && /bin/systemctl reboot
5-59/5 * * * * root hostname | grep -qE 'nfsshares-e' && /bin/systemctl reboot
and work on an NFS-root VM booted from an export from these CTs.
This seems like an opportunity to put a load balancer in front. And distribute the clients onto those NFS servers.
https://www.haproxy.com/support/technical-notes/an-0052-en-nfs-high-availability/
I have 5 different subnets for these NFS-root VMs, so I plan to have one CT serving the exports per network. Should make troubleshooting easier.
What's the reason it doesn't work in unprivileged mode?
I think I read somewhere that an NFS server needs to be privileged. Could be that was for the NFS kernel server?
I will try converting it to unprivileged later.
 
From the NFS client it is barely noticeable. I currently run a cron.d reboot script like this
Hm... I'd argue that a reboot will close the connection gracefully and you might want to test killed nfs servers as well. I suppose then the failover might take longer.

I have 5 different subnets for these NFS-root VMs, so I plan to have one CT serving the exports per network. Should make troubleshooting easier.
Just curious. Does that mean that all NFS servers present exports for all the networks?

I think I read somewhere that an NFS server needs to be privileged. Could be that was for the NFS kernel server?
I will try converting it to unprivileged later.
Ganesha is a user space NFS server, it shouldn't need ioctl.
 
Hm... I'd argue that a reboot will close the connection gracefully and you might want to test killed nfs servers as well. I suppose then the failover might take longer.
Good idea! Trying that now!
Just curious. Does that mean that all NFS servers present exports for all the networks?
Yes. The NFS servers each have 7 Ethernet devices: admin access, the Ceph public network, and 5 storage networks dedicated to NFS traffic to the VMs. Each VM has two network interfaces: storage access and application network. Storage access is an MTU 9000 non-routed network.
Ganesha is a user space NFS server, it shouldn't need ioctl.
Already had a look at https://forum.proxmox.com/threads/unprivileged-containers.26148/ - is this still valid?
 
Excellent test! The nfs-ganesha systemd unit file is crap! After a pkill -9 it does not start automatically again, so I am going to lose the NFS exports as soon as I am through with the cycle!
Doesn't keepalived move the IP?

EDIT: unprivileged is the default on CT creation. Best just backup & restore the current CT to get it to unprivileged.
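Regarding the unit not coming back after a pkill -9: a systemd drop-in override might help (a sketch, assuming the Debian package ships the unit as nfs-ganesha.service):
Code:
# /etc/systemd/system/nfs-ganesha.service.d/restart.conf
[Service]
Restart=on-failure
RestartSec=2
After adding the drop-in, a systemctl daemon-reload is needed before it takes effect.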
 
