HA NFS service for KVM VMs on a Proxmox Cluster with Ceph

Rainerle

Jan 29, 2019
Hi,
we are migrating from a VMware ESXi setup with NetApp NFS-based shared storage.

Within our virtual machines we also used NFS for mounts like /home or /root, for application filesystems like a shared /var/www, and for host-specific filesystems like /var/log.

Most of our VM-specific filesystems have now been migrated to disks attached to the virtual machines, but we are still wondering how to migrate our shared filesystems.

Ideas:
  1. Mount CephFS from the Proxmox VE Ceph storage (is this supported?) and fiddle around with the Cephx keys and such (rough sketch below the list)
  2. Create three VMs, install Ceph and CephFS on top of the Ceph RBD disks of Proxmox VE and mount CephFS within the VMs
  3. Create a single Debian Container on one of the nodes, install NFS kernel server there and export Cephfs directories. No HA and a downtime if we have to service the Proxmox VE host.
  4. Create a single Debian KVM VM, install NFS kernel server and export a disk. No HA - if we reboot the VM the NFS clients stall on the shared filesystem.
  5. Create two Debian KVM VMs, one disk for OS each and one shared disk between them. On that shared disk use OCFS2 and create a heartbeat active-passive NFS server.
  6. Create three Debian KVM VMs or CTs, install NFS Ganesha there and export CephFS directories over these.
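For idea 1, what I have in mind is roughly the following (client name, monitor address, filesystem name and path are placeholders, untested):
Code:
# create a dedicated cephx client on the Proxmox VE side
root@proxmox07:~# ceph fs authorize cephfs client.vmshares /vol rw
# inside a VM, with the generated key stored in a secret file
root@somevm:~# mount -t ceph 192.168.1.11:6789:/vol /mnt/vol -o name=vmshares,secretfile=/etc/ceph/vmshares.secret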
Has anybody done something like this before? Any experiences to share?

Best regards
Rainer
 
I'm running an NFS server in HA with Proxmox. My setup is:
- VM is CentOS7, network is 10Gb ethernet.
- Shared storage is /home
- VM disk is on the Ceph storage shared by my 3 nodes, with 3 replicas for the data.
- The data disk for the NFS share is on a different storage (added as a logical disk) and backup is disabled for it.
- The VM is backed up 3 times a week.
- The shared storage is backed up with BackupPC every day on a per-user basis (so they can restore any lost file themselves).

With this setup, migrating the VM between hosts is immediate, and with HA the downtime is really short. It has been running for several years without any problem.
 
Hi @samontetro ,
so what happens if you want to reboot the CentOS7 VM?
- Do your NFS clients stall during that time?
- Do your NFS clients just reconnect?

From my point of view you have a Single Point of Failure with that single VM.

Thanks for your message though.

Rainer
 
How much downtime is acceptable? And do you need the NFS server at all (e.g. all Linux clients)?
 
Hi @Alwin,

we are currently still running all of our Debian Linux VMs as PXE-booted diskless NFS-root machines. We have all applications installed (in a disabled state) into one image, create a snapshot and assign that read-only snapshot to the VMs via DHCP. Based on the hostname a config file is assigned, and from that the boot scripts mount NFS/ZFS/ext4 filesystems, bind-mount config files and enable applications using systemctl.
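The per-VM assignment boils down to a DHCP host entry roughly like this (MAC address, IPs and image path are made up for illustration):
Code:
host web01 {
        hardware ethernet 52:54:00:12:34:56;
        fixed-address 192.168.10.21;
        next-server 192.168.10.5;
        filename "pxelinux.0";
        # the read-only NFS root snapshot assigned to this VM
        option root-path "192.168.10.5:/images/debian-ro-snap-20190101";
}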

So currently we are very concerned about HA, with takeover times below a 5-second window. We previously replaced a ZFS-based Sun storage with NetApp because we experienced takeover times of around 5 minutes there...

Our future could be booting from a prepared ISO sitting on CephFS, or just an RBD OS disk that is snapshotted, protected, cloned and shared between many VMs.

So in future NFS would hold /home, /root, /var/www and other files shared between members of a farmed or clustered application. A failover time of 10 seconds could be acceptable.

I have been looking into using the CephFS that comes with Proxmox but fear losing enterprise support if we fiddle around with the setup.

Best regards
Rainer
 
we are currently still running all of our Debian Linux VMs as PXE-booted diskless NFS-root machines. We have all applications installed (in a disabled state) into one image, create a snapshot and assign that read-only snapshot to the VMs via DHCP. Based on the hostname a config file is assigned, and from that the boot scripts mount NFS/ZFS/ext4 filesystems, bind-mount config files and enable applications using systemctl.
This could probably be done with a VM template and the snapshot disk option (contrary to its naming), if the VMs should not keep any state.
snapshot=<boolean>
Controls qemu’s snapshot mode feature. If activated, changes made to the disk are temporary and will be discarded when the VM is shutdown.
Of course, this will only be a good option if the VM isn't running for a long time, since the temporary image is stored in /var/tmp/ (hardcoded) :/.

If state is acceptable then linked clones could help.
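Roughly like this (VM ids and volume name are just examples, 9000 being a template):
Code:
# discard-on-shutdown disk for a stateless VM
qm set 100 -scsi0 local-lvm:vm-100-disk-0,snapshot=1
# or, if some state is acceptable, a linked clone from a template
qm clone 9000 101 --name web01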

So currently we are very concerned about HA, with takeover times below a 5-second window. We previously replaced a ZFS-based Sun storage with NetApp because we experienced takeover times of around 5 minutes there...
Well, Proxmox VE needs ~2min to start recovery. So the HA needs to be done a level higher.
https://pve.proxmox.com/pve-docs/chapter-ha-manager.html

So in future NFS would hold /home, /root, /var/www and other files shared between members of a farmed or clustered application. A failover time of 10 seconds could be acceptable.
Ok, that still requires some form of network storage.

But all in all this would make a small OS image feasible: mostly just the kernel to boot, loading anything else directly from CephFS without any intermediate proxy.

And if you go further, you could think about skipping the kernel altogether and using a container instead, if live migration isn't needed.

I have been looking into using the CephFS that comes with Proxmox but fear losing enterprise support if we fiddle around with the setup.
Well, that depends on what you are doing with it. Using one (or more) VMs to host NFS Ganesha (or another NFS server) that accesses the CephFS shouldn't be any problem, since no extra services are needed on the Proxmox VE host.
https://github.com/nfs-ganesha/nfs-ganesha/wiki/NFS-Ganesha-and-High-Availability
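Such an export would look roughly like this in ganesha.conf (sketch only; export path, pseudo path and cephx user are placeholders):
Code:
EXPORT
{
        Export_Id = 1;
        Path = /vol/www;
        Pseudo = /www;
        Access_Type = RW;
        Squash = No_Root_Squash;
        FSAL {
                Name = CEPH;
                User_Id = "ganesha";
        }
}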
 
@alexskysilk , by using an NFS-based read-only image I just create a DHCP entry and boot directly over the network from the NFS server.
Maybe I didn't make myself properly clear on how our current setup works.
https://ltsp.org/ is a project where they use that concept for clients. We use such a setup for our servers.

Installing a system is the easy part. Maintaining and having a trusted reliable environment is the difficult part.
 
I am currently looking into the NFS Ganesha + keepalived active/passive two-VM path. Adding additional cephx client authorizations on the Proxmox VE Ceph storage does not void the enterprise support, right?
 
Adding additional cephx client authorizations on the Proxmox VE Ceph storage does not void the enterprise support, right?
Why should it? :)
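Creating an extra client is just something like this (id and caps are only an example, adjust to your pools and paths):
Code:
root@proxmox07:~# ceph auth get-or-create client.ganesha \
        mon 'allow r' \
        mds 'allow rw path=/vol' \
        osd 'allow rw pool=cephfs_data' \
        -o /etc/pve/priv/ceph.client.ganesha.keyring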
 
Hi @samontetro ,
so what happens if you want to reboot the CentOS7 VM?
- Do your NFS clients stall during that time?
- Do your NFS clients just reconnect?

From my point of view you have a Single Point of Failure with that single VM.

Thanks for your message though.

Rainer
Yes, rebooting the CentOS7 VM freezes the clients (I use hard mounts for this). But rebooting an NFS-server-only VM is fast. And I have long uptimes, as I can migrate the VM to another Proxmox server without downtime for maintenance purposes.
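The clients use a hard mount in /etc/fstab, roughly like this (server name and options are just an example):
Code:
nfsvm.example.com:/home  /home  nfs4  hard,proto=tcp,timeo=600,retrans=2  0  0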
Patrick
 
So I am going down this path now:
- On the 5 production nodes install 5 minimal CTs with NFS-Ganesha on Debian
Code:
root@nfsshares-a:~# grep '^[[:blank:]]*[^[:blank:]#;]' /etc/ganesha/ganesha.conf
NFS_CORE_PARAM
{
        Enable_NLM = false;
        Enable_RQUOTA = false;
        Protocols = 3,4;
        mount_path_pseudo = true;
}
NFS_KRB5
{
        Active_krb5 = false;
}
NFSv4
{
        RecoveryBackend = rados_ng;
        Minor_Versions =  1,2;
}
MDCACHE {
        Dir_Chunk = 0;
}
EXPORT_DEFAULTS {
        SecType = "sys";
        Squash = No_Root_Squash;
        Attr_Expiration_Time = 0;
}
CEPH
{
}
RADOS_KV
{
        UserId = "ganesharecov";
        nodeid = "nfsshares-a";
}
RADOS_URLS
{
        UserId = "ganeshaurls";
        watch_url = "rados://nfs-ganesha/ganesha-namespace/conf-nfsshares";
}
%url    rados://nfs-ganesha/ganesha-namespace/conf-nfsshares
- Add three Ceph clients
Code:
root@proxmox07:~# cat /etc/pve/priv/ceph.client.ganesha*
[client.ganesha]
        key = ???
        caps mds = "allow r path=/, allow rw path=/vol"
        caps mon = "allow r"
        caps osd = "allow class-read object_prefix rbd_children, allow rw pool=cephfs_data"
[client.ganesharecov]
        key = ???
        caps mon = "allow r"
        caps osd = "allow class-read object_prefix rbd_children, allow rw pool=nfs-ganesha"
[client.ganeshaurls]
        key = ???
        caps mon = "allow r"
        caps osd = "allow class-read object_prefix rbd_children, allow rw pool=nfs-ganesha"
- Configure the NFS Exports in the ceph dashboard
- Run keepalived on the 5 CTs and operate it using 1 active and 4 passive nodes
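For the keepalived part I am thinking of something like this on each CT (interface, router id, priority and VIP are placeholders; the passive nodes get lower priorities):
Code:
root@nfsshares-a:~# cat /etc/keepalived/keepalived.conf
vrrp_instance NFS_VIP {
        state BACKUP
        interface eth1
        virtual_router_id 51
        priority 100
        advert_int 1
        virtual_ipaddress {
                10.10.10.100/24
        }
}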

Problem arising:
https://lists.nfs-ganesha.org/archi...org/message/FN5QET65A6C3EOWGZTAQ6CKH5UKHGESP/

So NFS Ganesha needs to be built against the libcephfs used on the CTs, which I think should be the same as the one used on the PVE nodes.

@Alwin, could you guys be so kind as to add nfs-ganesha to your Ceph repository and build it whenever you build a new Ceph version? That would save some CPU cycles world-wide, as I believe I am not the only one going down that rabbit hole...

Best regards and a healthy new year
Rainer
 
So NFS Ganesha needs to be built against the libcephfs used on the CTs, which I think should be the same as the one used on the PVE nodes.
The libcephfs in the CT can be very different from the PVE one. All depends on the Ceph version used.

@Alwin, could you guys be so kind as to add nfs-ganesha to your Ceph repository and build it whenever you build a new Ceph version? That would save some CPU cycles world-wide, as I believe I am not the only one going down that rabbit hole...
NFS-Ganesha is another project and not part of Ceph.
 
I am building NFS-Ganesha now using a Docker container and the Debian build tools.
Code:
rstumbaum@controlnode01.dc1:~/docker-nfs-ganesha-build$ cat Dockerfile
ARG DEBIAN_RELEASE="buster"
ARG CEPH_RELEASE_PVE="nautilus"

FROM debian:${DEBIAN_RELEASE} AS build-env

ARG DEBIAN_RELEASE
ARG CEPH_RELEASE_PVE

ADD http://download.proxmox.com/debian/proxmox-ve-release-6.x.gpg /etc/apt/trusted.gpg.d/proxmox-ve-release-6.x.gpg
RUN chmod 644 /etc/apt/trusted.gpg.d/proxmox-ve-release-6.x.gpg &&\
    echo "deb http://download.proxmox.com/debian/ceph-${CEPH_RELEASE_PVE} ${DEBIAN_RELEASE} main" >/etc/apt/sources.list.d/ceph.list &&\
    echo "deb http://deb.debian.org/debian ${DEBIAN_RELEASE}-backports main" >/etc/apt/sources.list.d/backports.list &&\
    echo "deb-src http://ftp.de.debian.org/debian/ sid main contrib non-free" >//etc/apt/sources.list.d/sid-src.list

RUN apt-get update
RUN apt-get install -y build-essential git-buildpackage cmake bison flex doxygen lsb-release pkgconf \
                       nfs-common ceph-common
RUN apt-get install -y -t ${DEBIAN_RELEASE}-backports libglusterfs-dev

WORKDIR /build

RUN apt-get build-dep -y libntirpc-dev &&\
    apt-get source -y --compile libntirpc-dev &&\
    dpkg -i *.deb

RUN apt-get build-dep -y nfs-ganesha &&\
    apt-get source -y --compile nfs-ganesha

VOLUME /export
rstumbaum@controlnode01.dc1:~/docker-nfs-ganesha-build$ docker build . -t nfs-ganesha-debs
.......

After building the container I just
Code:
docker run -v $HOME/docker-nfs-ganesha-build/export:/export -it nfs-ganesha-debs bash
and copy the build results from within to /export
Code:
root@c800bf9a0614:/build# cp -a *.deb /export/
root@c800bf9a0614:/build# exit
exit

Then I install these onto my LXC container, which is unrestricted and has lxc.seccomp.profile empty.
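That is, roughly (the CT id and the exact package file names depend on your setup and build):
Code:
# on the PVE host: raw lxc key appended to the CT config (123 is a made-up id)
root@proxmox07:~# echo 'lxc.seccomp.profile:' >> /etc/pve/lxc/123.conf
# inside the CT: install the freshly built packages
root@nfsshares-a:~# apt install ./libntirpc*.deb ./nfs-ganesha*.deb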

This is my ganesha.conf, which works for NFSv3 and NFSv4 exports - except (hopefully only for now) for .snap Ceph snapshot directories:
Code:
root@nfsshares-a:~# egrep -v '^[[:blank:]]*#|^[[:blank:]]*$' /etc/ganesha/ganesha.conf
NFS_CORE_PARAM
{
        Enable_NLM = false;
        Enable_RQUOTA = false;
        Protocols = 3,4;
        mount_path_pseudo = true;
}
NFS_KRB5
{
        Active_krb5 = false;
}
NFSv4
{
        RecoveryBackend = rados_cluster;
        Minor_Versions =  1,2;
}
MDCACHE {
        Dir_Chunk = 0;
}
EXPORT_DEFAULTS {
        SecType = "sys";
        Squash = No_Root_Squash;
        Attr_Expiration_Time = 0;
}
CEPH
{
}
RADOS_KV
{
        UserId = "ganesharecov";
}
RADOS_URLS
{
        UserId = "ganeshaurls";
        watch_url = "rados://nfs-ganesha/ganesha-namespace/conf-nfsshares";
}
%url    rados://nfs-ganesha/ganesha-namespace/conf-nfsshares
root@nfsshares-a:~#

The output of the ganesha NFSv4 recovery backend is ok:
Code:
root@nfsshares-a:~# ganesha-rados-grace --userid ganesharecov dump
cur=13 rec=0
======================================================
nfsshares-a       
nfsshares-b       
nfsshares-c       
nfsshares-d       
nfsshares-e       
root@nfsshares-a:~#

And if keepalived has to move the IPs around I still have functioning NFS exports.
 
And if keepalived has to move the IPs around I still have functioning NFS exports.
How long does the failover take?

- Run keepalived on the 5 CTs and operate it using 1 active and 4 passive nodes
This seems like an opportunity to put a load balancer in front. And distribute the clients onto those NFS servers.
https://www.haproxy.com/support/technical-notes/an-0052-en-nfs-high-availability/

Then I install these onto my LXC container, which is unrestricted and has lxc.seccomp.profile empty.
What's the reason it doesn't work in unprivileged mode?
 
How long does the failover take?
From the NFS client it is barely noticeable. I currently run a cron.d reboot script like this
Code:
1-59/5 * * * * root hostname | grep -qE 'nfsshares-a' && /bin/systemctl reboot
2-59/5 * * * * root hostname | grep -qE 'nfsshares-b' && /bin/systemctl reboot
3-59/5 * * * * root hostname | grep -qE 'nfsshares-c' && /bin/systemctl reboot
4-59/5 * * * * root hostname | grep -qE 'nfsshares-d' && /bin/systemctl reboot
5-59/5 * * * * root hostname | grep -qE 'nfsshares-e' && /bin/systemctl reboot
and work on an NFS-root VM booted from an export from these CTs.
This seems like an opportunity to put a load balancer in front. And distribute the clients onto those NFS servers.
https://www.haproxy.com/support/technical-notes/an-0052-en-nfs-high-availability/
I have 5 different subnets for these NFS-root VMs, so I plan to have one CT serving the exports per network. Should make troubleshooting easier.
What's the reason it doesn't work in unprivileged mode?
I think that I read somewhere that an NFS server needs to be privileged. Could be that was with the NFS kernel server?
I will try converting it to unprivileged later.
 
From the NFS client it is barely noticeable. I currently run a cron.d reboot script like this
Hm... I'd argue that a reboot will close the connection gracefully and you might want to test killed nfs servers as well. I suppose then the failover might take longer.

I have 5 different subnets for these NFS-root VMs, so I plan to have one CT serving the exports per network. Should make troubleshooting easier.
Just curious. Does that mean that all NFS servers present exports for all the networks?

I think that I read somewhere that an NFS server needs to be privileged. Could be that was with the NFS kernel server?
I will try converting it to unprivileged later.
Ganesha is a user space NFS server, it shouldn't need ioctl.
 
Hm... I'd argue that a reboot will close the connection gracefully and you might want to test killed nfs servers as well. I suppose then the failover might take longer.
Good idea! Trying that now!
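So on the node currently holding the IP I simply do something like this (assuming the daemon name from the Debian package):
Code:
root@nfsshares-a:~# pkill -9 -x ganesha.nfsd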
Just curious. Does that mean that all NFS servers present exports for all the networks?
Yes. The NFS servers each have 7 ethernet devices: admin access, the Ceph public network, and 5 storage networks dedicated to NFS traffic to the VMs. Each VM has two network interfaces: storage access and the application network. Storage access is an MTU 9000, non-routed network.
Ganesha is a user space NFS server, it shouldn't need ioctl.
Already had a look at https://forum.proxmox.com/threads/unprivileged-containers.26148/ - is this still valid?
 
Excellent test! The nfs-ganesha systemd unit file is crap! After a pkill -9 it does not start automatically again, so I am going to lose the NFS exports as soon as I am through with the cycle!
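A systemd drop-in should work around the missing auto-restart (untested sketch, assuming the packaged unit is called nfs-ganesha.service):
Code:
root@nfsshares-a:~# mkdir -p /etc/systemd/system/nfs-ganesha.service.d
root@nfsshares-a:~# cat >/etc/systemd/system/nfs-ganesha.service.d/restart.conf <<'EOF'
[Service]
Restart=on-failure
RestartSec=2
EOF
root@nfsshares-a:~# systemctl daemon-reload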
Doesn't keepalived move the IP?

EDIT: unprivileged is the default on CT creation. Best just backup & restore the current CT to convert it to unprivileged.
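Roughly (CT id, storage and archive name are placeholders):
Code:
vzdump 123 --mode stop --storage local --compress zstd
pct destroy 123
# archive name below is a placeholder, use the one vzdump just created
pct restore 123 /var/lib/vz/dump/vzdump-lxc-123-2021_01_01-00_00_00.tar.zst --unprivileged 1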
 