[TUTORIAL] Proxmox Host NFS Server over RDMA (RoCE)

v95klima

Hi,
has anyone had success running an NFS server over RDMA (RoCE) directly on the Proxmox host?

I love the low energy consumption of avoiding TrueNAS and Windows Server VMs and running mostly LXCs and host services.
My network card is a ConnectX-4 and works well with SR-IOV VFs, but I'm hoping to enable an NFS server on the host with RoCE.
Thanks in advance.

I found this guide for CentOS, which gave me hope that this is possible on the Proxmox host:
https://enterprise-support.nvidia.com/s/article/howto-configure-nfs-over-rdma--roce-x
 
The Proxmox host has a confirmed active RDMA link on enp2s0f0v4:

root@epyc5:~# rdma link
link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev enp2s0f0np0
link mlx5_1/1 state DOWN physical_state DISABLED netdev enp2s0f1np1
link mlx5_3/1 state DOWN physical_state DISABLED netdev enp2s0f0v1
link mlx5_4/1 state DOWN physical_state DISABLED netdev enp2s0f0v2
link mlx5_5/1 state ACTIVE physical_state LINK_UP
link mlx5_6/1 state ACTIVE physical_state LINK_UP netdev enp2s0f0v4
link mlx5_7/1 state DOWN physical_state DISABLED netdev enp2s0f0v5
link mlx5_8/1 state DOWN physical_state DISABLED netdev enp2s0f0v6
link mlx5_9/1 state DOWN physical_state DISABLED netdev enp2s0f0v7
 
Got NFS with RDMA to work with ZFS.

Still trying to get it working for regular non-ZFS folders under /etc/exports... but that seems hard for me; any help on that is appreciated.

FOR ZFS

On Proxmox with NFS Server:
stop nfs server:
systemctl stop nfs-kernel-server.service

enable module and add port:
/sbin/modprobe rpcrdma
echo 'rdma 20049' | tee /proc/fs/nfsd/portlist
echo 'tcp 2049' | tee /proc/fs/nfsd/portlist
confirm with:
cat /proc/fs/nfsd/portlist

restart (not start; start seems to reset the ports):
systemctl restart nfs-kernel-server.service

zfs set sharenfs="rw=@192.168.3.3/24,no_root_squash,async" poolname/datafolder
confirm with:
exportfs -v
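
For reference, here is the whole server-side sequence as one minimal sketch (my consolidation, not from the original post); the pool/dataset names and 192.168.3.x addressing are the placeholders used in this thread, and the modules-load.d line for loading rpcrdma at boot is an assumption on my part:

# stop nfsd while changing transports (same order as above)
systemctl stop nfs-kernel-server.service

# load the NFS-over-RDMA transport now and (assumed here) at every boot
/sbin/modprobe rpcrdma
echo rpcrdma > /etc/modules-load.d/rpcrdma.conf

# register the RDMA listener next to the normal TCP one
echo 'rdma 20049' | tee /proc/fs/nfsd/portlist
echo 'tcp 2049' | tee /proc/fs/nfsd/portlist
cat /proc/fs/nfsd/portlist

# restart (not start) so the ports stick, then share the dataset via ZFS
systemctl restart nfs-kernel-server.service
zfs set sharenfs="rw=@192.168.3.3/24,no_root_squash,async" poolname/datafolder
exportfs -v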

On Proxmox with NFS Client:
enable module:
/sbin/modprobe rpcrdma
mount 192.168.3.3:/poolname/datafolder /clientfoldername -o rdma,port=20049,async,noatime,nodiratime -vvvv
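
If the client mount should survive reboots, a hedged /etc/fstab sketch (my addition, using the folder names from the example above) could look like this:

# /etc/fstab -- NFS over RDMA; proto=rdma is what the 'rdma' shorthand above maps to (see nfs(5))
192.168.3.3:/poolname/datafolder  /clientfoldername  nfs  proto=rdma,port=20049,noatime,nodiratime,_netdev  0  0

# then mount and verify the negotiated options
mount /clientfoldername
nfsstat -m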


Credit to:
https://blog.sparktour.me/en/posts/2023/08/24/mount-nfs-via-rdma-on-mlnx-card/
BUT:
To get this working on Proxmox I first followed the full description, downloading and installing the latest MLNX_OFED package for Debian from NVIDIA, but the install script inside the downloaded TGZ did not accept that Proxmox = Debian 12.1.
So I manually installed three packages from inside the downloaded TGZ:
dpkg -i mlnx-tools_24.04.0.2404066-1_amd64.deb
dpkg -i mlnx-ofed-kernel-utils_24.04.OFED.24.04.0.7.0.1-1_amd64.deb
dpkg -i mlnx-ofed-kernel-dkms_24.04.OFED.24.04.0.7.0.1-1_all.deb
rebooted, and got several new tools and functions related to MLNX_OFED,

but it broke
/sbin/modprobe rpcrdma
and NFS over RDMA would not work.

So I decided to roll back with:
dpkg -r mlnx-ofed-kernel-dkms
and rebooted without the MLNX DKMS module;
this made
/sbin/modprobe rpcrdma
work again.
The rest of the instructions I followed as described above!
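
To check which rpcrdma the system actually ends up with after such a rollback, a small hedged sketch (my addition; the expected path reflects the in-tree module location):

# should resolve to the kernel's own module under net/sunrpc/xprtrdma, not a DKMS build
modinfo rpcrdma | grep -E '^(filename|version)'
lsmod | grep rpcrdma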
 
Speed test with the above NFS-over-RDMA setup = full speed!

fio --name=testfile --directory=/clientfoldername --size=2G --numjobs=10 --rw=write --bs=1000M --ioengine=libaio --fdatasync=1 --runtime=60 --time_based --group_reporting --eta-newline=1s
testfile: (g=0): rw=write, bs=(R) 1000MiB-1000MiB, (W) 1000MiB-1000MiB, (T) 1000MiB-1000MiB, ioengine=libaio, iodepth=1
...
fio-3.33
Starting 10 processes
testfile: Laying out IO file (1 file / 2048MiB)
testfile: Laying out IO file (1 file / 2048MiB)
testfile: Laying out IO file (1 file / 2048MiB)
testfile: Laying out IO file (1 file / 2048MiB)
testfile: Laying out IO file (1 file / 2048MiB)
testfile: Laying out IO file (1 file / 2048MiB)
testfile: Laying out IO file (1 file / 2048MiB)
testfile: Laying out IO file (1 file / 2048MiB)
testfile: Laying out IO file (1 file / 2048MiB)
testfile: Laying out IO file (1 file / 2048MiB)
Jobs: 10 (f=10): [W(10)][4.9%][eta 00m:58s]
Jobs: 10 (f=10): [W(10)][6.6%][eta 00m:57s]
Jobs: 10 (f=10): [W(10)][8.3%][w=8008MiB/s][w=8 IOPS][eta 00m:55s]
Jobs: 10 (f=10): [W(10)][11.7%][eta 00m:53s]
Jobs: 10 (f=10): [W(10)][15.0%][w=1001MiB/s][w=1 IOPS][eta 00m:51s]
Jobs: 10 (f=10): [W(10)][18.3%][eta 00m:49s]
Jobs: 10 (f=10): [W(10)][21.7%][eta 00m:47s]
Jobs: 10 (f=10): [W(10)][25.0%][w=3000MiB/s][w=3 IOPS][eta 00m:45s]
Jobs: 10 (f=10): [W(10)][28.3%][eta 00m:43s]
Jobs: 10 (f=10): [W(10)][31.7%][w=1000MiB/s][w=1 IOPS][eta 00m:41s]
Jobs: 10 (f=10): [W(10)][35.0%][w=1000MiB/s][w=1 IOPS][eta 00m:39s]
Jobs: 10 (f=10): [W(10)][39.0%][w=1001MiB/s][w=1 IOPS][eta 00m:36s]
Jobs: 10 (f=10): [W(10)][41.7%][w=9009MiB/s][w=9 IOPS][eta 00m:35s]
Jobs: 10 (f=10): [W(10)][45.0%][w=1000MiB/s][w=1 IOPS][eta 00m:33s]
Jobs: 10 (f=10): [W(10)][48.3%][w=7000MiB/s][w=7 IOPS][eta 00m:31s]
Jobs: 10 (f=10): [W(10)][52.5%][eta 00m:28s]
Jobs: 10 (f=10): [W(10)][55.9%][w=4000MiB/s][w=4 IOPS][eta 00m:26s]
Jobs: 10 (f=10): [W(10)][59.3%][eta 00m:24s]
Jobs: 10 (f=10): [W(10)][61.7%][w=2002MiB/s][w=2 IOPS][eta 00m:23s]
Jobs: 10 (f=10): [W(10)][65.0%][w=6000MiB/s][w=6 IOPS][eta 00m:21s]
Jobs: 10 (f=10): [W(10)][68.3%][eta 00m:19s]
Jobs: 10 (f=10): [W(10)][71.7%][w=3003MiB/s][w=3 IOPS][eta 00m:17s]
Jobs: 10 (f=10): [W(10)][75.0%][w=5000MiB/s][w=5 IOPS][eta 00m:15s]
Jobs: 10 (f=10): [W(10)][78.3%][w=2000MiB/s][w=2 IOPS][eta 00m:13s]
Jobs: 10 (f=10): [W(10)][81.7%][w=1000MiB/s][w=1 IOPS][eta 00m:11s]
Jobs: 10 (f=10): [W(10)][85.0%][w=3003MiB/s][w=3 IOPS][eta 00m:09s]
Jobs: 10 (f=10): [W(10)][88.3%][eta 00m:07s]
Jobs: 10 (f=10): [W(10)][91.7%][w=7000MiB/s][w=7 IOPS][eta 00m:05s]
Jobs: 10 (f=10): [W(10)][95.0%][w=1001MiB/s][w=1 IOPS][eta 00m:03s]
Jobs: 10 (f=10): [W(10)][98.3%][w=5000MiB/s][w=5 IOPS][eta 00m:01s]
Jobs: 10 (f=10): [W(10)][100.0%][w=2000MiB/s][w=2 IOPS][eta 00m:00s]
Jobs: 2 (f=2): [f(2),_(8)][100.0%][w=9.77GiB/s][w=10 IOPS][eta 00m:00s]
testfile: (groupid=0, jobs=10): err= 0: pid=69013: Thu Jul 4 18:45:46 2024
write: IOPS=2, BW=2700MiB/s (2831MB/s)(161GiB/61119msec); 0 zone resets
slat (msec): min=954, max=7524, avg=3553.49, stdev=875.68
clat (usec): min=2, max=51049, avg=453.58, stdev=4013.11
lat (msec): min=954, max=7525, avg=3553.95, stdev=875.72
clat percentiles (usec):
| 1.00th=[ 3], 5.00th=[ 4], 10.00th=[ 5], 20.00th=[ 5],
| 30.00th=[ 6], 40.00th=[ 7], 50.00th=[ 7], 60.00th=[ 8],
| 70.00th=[ 12], 80.00th=[ 61], 90.00th=[ 297], 95.00th=[ 693],
| 99.00th=[ 6390], 99.50th=[51119], 99.90th=[51119], 99.95th=[51119],
| 99.99th=[51119]
bw ( MiB/s): min=20000, max=20004, per=100.00%, avg=20000.86, stdev= 0.50, samples=155
iops : min= 20, max= 20, avg=20.00, stdev= 0.00, samples=155
lat (usec) : 4=9.09%, 10=58.79%, 20=9.09%, 50=0.61%, 100=6.67%
lat (usec) : 250=4.24%, 500=5.45%, 750=1.21%, 1000=1.21%
lat (msec) : 2=1.21%, 4=0.61%, 10=1.21%, 100=0.61%
fsync/fdatasync/sync_file_range:
sync (msec): min=7, max=301, avg=116.38, stdev=63.11
sync percentiles (msec):
| 1.00th=[ 12], 5.00th=[ 31], 10.00th=[ 42], 20.00th=[ 59],
| 30.00th=[ 71], 40.00th=[ 87], 50.00th=[ 113], 60.00th=[ 133],
| 70.00th=[ 148], 80.00th=[ 169], 90.00th=[ 203], 95.00th=[ 234],
| 99.00th=[ 279], 99.50th=[ 292], 99.90th=[ 300], 99.95th=[ 300],
| 99.99th=[ 300]
cpu : usr=1.22%, sys=28.97%, ctx=1180399, majf=755122, minf=5863483
IO depths : 1=240.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,165,0,0 short=231,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=2700MiB/s (2831MB/s), 2700MiB/s-2700MiB/s (2831MB/s-2831MB/s), io=161GiB (173GB), run=61119-61119msec

A second test of only 10 seconds duration gave even better results:

Run status group 0 (all jobs):
WRITE: bw=2889MiB/s (3029MB/s), 2889MiB/s-2889MiB/s (3029MB/s-3029MB/s), io=30.3GiB (32.5GB), run=10732-10732msec
 
Same real-world file transfer speeds as inside an SMB Direct Windows Server session with RDMA.
The ZFS pool is on a PCIe Gen 3 NVMe with a 2.5 GHz CPU (boost max 3.0 GHz).
The client is on a PCIe Gen 5 NVMe with a 6 GHz CPU.
 

OK, got it to work with regular NFS using /etc/exports.
With the above activated (ports and rpcrdma), on the Proxmox host do:

in /etc/default/nfs-common:
NEED_STATD="no"
NEED_IDMAPD="yes"

in /etc/default/nfs-kernel-server:
RPCNFSDOPTS="-N 2 -N 3"
RPCMOUNTDOPTS="--manage-gids -N 2 -N 3"

in /etc/nfs.conf:
vers3=n
vers4=y
vers4.0=n
vers4.1=n
vers4.2=y
rdma=y
rdma-port=20049

systemctl restart nfs-kernel-server.service

in /etc/exports
/hostfoldername 192.168.3.3/24(rw,async,insecure,no_subtree_check,no_root_squash)
exportfs -ra
exportfs -v
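
A hedged verification sketch (my addition) after the restart and re-export, to confirm the server side really offers only v4.2 plus the RDMA listener:

cat /proc/fs/nfsd/portlist    # should list 'rdma 20049' alongside 'tcp 2049'
cat /proc/fs/nfsd/versions    # should show v2/v3 disabled and +4.2 enabled, per the nfs.conf above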

On the Proxmox client do:
mount -o rdma,port=20049,vers=4.2 192.168.3.3:/hostfoldername /clientfoldername -vvvv

check with:
nfsstat -m
/clientfoldername from 192.168.3.3:/hostfoldername Flags: rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,clientaddr=192.168.3.1,local_lock=none,addr=192.168.3.3

With this direct method there is no ZFS cache helping out, so in my case file transfers run at 1.6-1.8 GByte/s to/from the two Proxmox servers.
 
Same fio test as above:

60 seconds:
Run status group 0 (all jobs):
WRITE: bw=2863MiB/s (3002MB/s), 2863MiB/s-2863MiB/s (3002MB/s-3002MB/s), io=175GiB (188GB), run=62525-62525msec
 
Hello v95klima, there's a typo in your exports: you export either to a host (...3.3) or to a network (...3.0/24), so change it to
exports: /hostfoldername 192.168.3.0/24(rw,sync,no_subtree_check,no_root_squash) ; systemctl restart nfs-kernel-server .
On the client I even prefer nconnect=8 instead of rdma for your NFSv4 mount (RDMA could still stay enabled for your SMB connection), like
mount -o nconnect=8,vers=4.2 192.168.3.3:/hostfoldername /clientfoldername -vvvv (which prevents system stalls from emerging on host or client while transferring) and could be even faster; please try the different mounts yourself (nconnect and rdma can also be combined, as in the sketch below).
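
A hedged sketch of the two mount variants side by side (my combination; whether nconnect is actually honored together with proto=rdma depends on the kernel's NFS client):

# plain TCP with 8 connections
mount -o nconnect=8,vers=4.2 192.168.3.3:/hostfoldername /clientfoldername
# RDMA transport combined with nconnect
mount -o rdma,port=20049,nconnect=8,vers=4.2 192.168.3.3:/hostfoldername /clientfoldername
# check which proto/options the kernel actually negotiated
nfsstat -m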
Fileserver with EDR 100Gb InfiniBand, 26x 18TB HDD RAID6 xfs; first run of fio --name=testfile --directory=/clientfoldername --size=2G --numjobs=10 --rw=write --bs=1000M --ioengine=libaio --fdatasync=1 --runtime=60 --time_based --group_reporting --eta-newline=1s
...
Run status group 0 (all jobs):
WRITE: bw=3790MiB/s (3974MB/s), 3790MiB/s-3790MiB/s (3974MB/s-3974MB/s), io=228GiB (244GB), run=61477-61477msec
... second run
Run status group 0 (all jobs):
WRITE: bw=4073MiB/s (4271MB/s), 4073MiB/s-4073MiB/s (4271MB/s-4271MB/s), io=245GiB (263GB), run=61622-61622msec
...
And a change of "write" to "read" in the fio option "--rw=...":
Run status group 0 (all jobs):
READ: bw=10.6GiB/s (11.4GB/s), 10.6GiB/s-10.6GiB/s (11.4GB/s-11.4GB/s), io=641GiB (688GB), run=60407-60407msec
Good work and good luck.
 
Well done! 100G InfiniBand with RAID6 xfs! Thanks!
 
Another tip: if you use an NFS (v3/v4*) RDMA mount you CANNOT see the interface throughput directly (e.g. with sar), only via e.g. a nagios plugin:
/usr/lib64/nagios/plugins/check_ib_bandwidth.py -d mlx5_0
OK:|rx_mlx5_0=3503B tx_mlx5_0=9243B
An NFS mount WITHOUT rdma (with or without nconnect) shows the ongoing network throughput with "sar -n DEV 1" (sysstat package).
09:13:12 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s
09:13:13 AM eno1 0.00 0.00 0.00 0.00 0.00 0.00 0.00
09:13:13 AM eno2 0.00 0.00 0.00 0.00 0.00 0.00 0.00
09:13:13 AM eno3 1.00 2.00 0.06 0.97 0.00 0.00 0.00
09:13:13 AM eno4 0.00 0.00 0.00 0.00 0.00 0.00 0.00
09:13:13 AM lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00
09:13:13 AM ib0 69.00 4450.00 11.97 8837.51 0.00 0.00 0.00
...
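
As a hedged alternative to the nagios plugin (my addition): the per-port RDMA traffic counters are exposed under sysfs in units of 4-byte words, so they can also be sampled with a small shell loop:

# watch RDMA throughput on mlx5_0 port 1 once per second
DEV=mlx5_0; PORT=1
CNT=/sys/class/infiniband/$DEV/ports/$PORT/counters
prev_rx=$(cat $CNT/port_rcv_data); prev_tx=$(cat $CNT/port_xmit_data)
while sleep 1; do
  rx=$(cat $CNT/port_rcv_data); tx=$(cat $CNT/port_xmit_data)
  echo "rx $(( (rx - prev_rx) * 4 )) B/s   tx $(( (tx - prev_tx) * 4 )) B/s"
  prev_rx=$rx; prev_tx=$tx
done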
 
