HA NFS service for KVM VMs on a Proxmox Cluster with Ceph

Rainerle · Jan 23, 2021

Alwin Antreich said:
Doesn't keepalived move the IP?

Yes it does. But as soon as the cron.d killed all the ganesha.nfsd process on all of the five CTs there is nowhere to move the IP to.

This is a part of my keepalived config:

Code:

rstumbaum@controlnode01.dc1:~$ cat keepalived/conf.d/check_proc_ganesha.conf
vrrp_script check_proc_ganesha {
       script "/usr/bin/pkill -0 ganesha.nfsd" # cheaper than pidof
       interval 1                       # check every second
}
rstumbaum@controlnode01.dc1:~$ cat keepalived/conf.d/vlan3000.conf
vrrp_instance vlan3000 {
  state BACKUP
  nopreempt
  #smtp_alert
  interface eth4
  virtual_router_id 30 # unique ID!
  priority 100
  advert_int 1
  authentication {
    # dont use pass unless on 100% secure net, its send in cleartext https://louwrentius.com/configuring-attacking-and-securing-vrrp-on-linux.html
    # auth_type PASS
    # much secure:
    auth_type AH
    auth_pass 123-3000
  }
  track_script {
      check_proc_ganesha
  }
  virtual_ipaddress {
      10.30.0.2/32
  }
}
rstumbaum@controlnode01.dc1:~$

So if there is no ganesha.nfsd running the CT is no target for a VIP.

Alwin Antreich · Jan 23, 2021

Rainerle said:
Yes it does. But as soon as the cron.d killed all the ganesha.nfsd process on all of the five CTs there is nowhere to move the IP to.

obviously.

I thought more about how the client will behave on a killed connection.

Rainerle · Jan 25, 2021

...and that was it with NFS-Ganesa:
https://github.com/nfs-ganesha/nfs-...31c730257662/src/FSAL/FSAL_CEPH/export.c#L307

Code:

         * Currently, there is no interface for looking up a snapped
         * inode, so we just bail here in that case.
         */
        if (hhdl->chk_snap != CEPH_NOSNAP)
            return ceph2fsal_error(-ESTALE);

Back to nfs-kernel-server and looking into exporting a mounted CephFS filesystem within a priviledged CT.

Rainerle · Jan 25, 2021

No idea why people are using NFS-Ganesha???

Created a fresh CT,
copied, adjusted and reloaded an apparmor profile for it:

Code:

root@proxmox07:~# cat /etc/apparmor.d/lxc/lxc-default-with-nfs2ceph
# Do not load this file.  Rather, load /etc/apparmor.d/lxc-containers, which
# will source all profiles under /etc/apparmor.d/lxc

profile lxc-container-default-nfs2ceph flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/lxc/container-base>

  # the container may never be allowed to mount devpts.  If it does, it
  # will remount the host's devpts.  We could allow it to do it with
  # the newinstance option (but, right now, we don't).
  deny mount fstype=devpts,
  mount fstype=cgroup -> /sys/fs/cgroup/**,
  mount fstype=cgroup2 -> /sys/fs/cgroup/**,
  mount fstype=ceph,
  mount fstype=nfsd,
  mount fstype=rpc_pipefs,
}
root@proxmox07:~# service apparmor reload

Assigned the apparmor profile to the CT:

Code:

root@proxmox07:~# cat /etc/pve/lxc/601.conf
arch: amd64
cores: 4
hostname: nfsshares-a
memory: 4096
nameserver: 10.20.52.1
net0: name=eth0,bridge=vmbr0,gw=10.20.52.1,hwaddr=02:10:20:52:02:40,ip=10.20.52.240/24,tag=52,type=veth
net1: name=eth1,bridge=vmbr0,hwaddr=02:10:34:08:02:40,ip=10.34.8.240/24,tag=3408,type=veth,mtu=9000
net2: name=eth2,bridge=vmbr0,hwaddr=02:10:20:56:02:40,ip=10.20.56.240/24,tag=56,type=veth,mtu=9000
net3: name=eth3,bridge=vmbr0,hwaddr=02:10:20:57:02:40,ip=10.20.57.240/24,tag=57,type=veth,mtu=9000
net4: name=eth4,bridge=vmbr0,hwaddr=02:10:30:00:02:40,ip=10.30.0.240/23,tag=3000,type=veth,mtu=9000
net5: name=eth5,bridge=vmbr0,hwaddr=02:10:31:00:02:40,ip=10.31.0.240/23,tag=3100,type=veth,mtu=9000
net6: name=eth6,bridge=vmbr0,hwaddr=02:10:32:00:02:40,ip=10.32.0.240/23,tag=3200,type=veth,mtu=9000
ostype: debian
rootfs: ceph-proxmox-VMs:vm-601-disk-0,mountoptions=noatime,size=8G
swap: 1024
unrestricted: 1
lxc.seccomp.profile:
lxc.apparmor.profile: lxc-container-default-nfs2ceph
root@proxmox07:~#

and started the CT.

Installed the Ceph from Proxmox, keepalived and nfs-kernel-server,
generated and copied a minimal /etc/ceph/ceph.conf using ceph config generate-minimal-conf on the PVE host and then configured /etc/fstab to mount the CephFS using a dedicated user.

Configured /etc/exports and enabled and started nfs-kernel-server.

Used the configuration as above for keepalived.

Cloned the config and copied the disk and the failover even with a mounted .snap Ceph Snapshots seems to work (everything being NFSv3).

Alwin Antreich · Jan 26, 2021

Rainerle said:
No idea why people are using NFS-Ganesha???

Because it is in user-space and doesn't tear down the kernel if it gets stuck/fails. Best option for containers. Also it can directly connect to various backends, without the (rather disconnected) extra layers (eg. fstab).

Rainerle · Jan 26, 2021

Alwin Antreich said:
Because it is in user-space and doesn't tear down the kernel if it gets stuck/fails. Best option for containers. Also it can directly connect to various backends, without the (rather disconnected) extra layers (eg. fstab).

But does not support properly exporting .snap directories...
At least a bug report was created for this missing feature on NFS-Ganesha : https://tracker.ceph.com/issues/48991

Rainerle · Apr 18, 2021

I can not recommend using nfs-kernel-server with a cephfs kernel client when using CephFS snaphots.

As soon as the MDS service fails over from active to a standby MDS the NFS clients die with kernel panic or services running on them just die. It seems to be a caching problem. This problem seems to be solved since our upgrade from Nautilus to Octopus (15.2.10).
Another bug causing a kernel panic with the Ceph kernel module when using snapshots is this one: https://lore.kernel.org/ceph-devel/96f023e868516b441cfd8f37ebbb56441b71b386.camel@kernel.org/T/#t

Changed the setup now to use:

nfs-kernel-server
ZFS for filesystems where we use snapshots to boot our readonly server images
CephFS kernel client mounts for shared filesystems - but here we turned off CephFS snapshots completely because of https://tracker.ceph.com/issues/50511

CephFS snapshots usage currently still is too experimental...

alexskysilk · Feb 19, 2024

Rainerle said:
I can not recommend using nfs-kernel-server with a cephfs kernel client when using CephFS snaphots.

@reinerle is this any better today?

Rainerle · Feb 20, 2024

Still running on the setup described in post #27.
The bugs I found back then should be fixed by now - but I never tried again...

If you want to try:
- Use a LXC container so you do not need to run keepalive for HA - the restart of those containers is fast enough
- Use NixOS - as it is easy to keep it up2date

Search

Search

HA NFS service for KVM VMs on a Proxmox Cluster with Ceph

Rainerle

Renowned Member

Alwin Antreich

Well-Known Member

Rainerle

Renowned Member

Rainerle

Renowned Member

Alwin Antreich

Well-Known Member

Rainerle

Renowned Member

Rainerle

Renowned Member

alexskysilk

Distinguished Member

Rainerle

Renowned Member

We value your privacy