Ceph in PVE 7.3 not working with RDMA/RoCE?

realguob · Aug 24, 2023

Hi, PVE geeks:

I built a PVE cluster on three servers (with Ceph), using the following PVE and Ceph package versions:

Code:

root@node01:~# pveversion
pve-manager/7.3-3/c3928077 (running kernel: 5.15.74-1-pve)

root@node01:~# ceph --version
ceph version 16.2.13 (b81a1d7f978c8d41cf452da7af14e190542d2ee2) pacific (stable)

root@node01:~# dpkg -l |grep ceph
ii  ceph                                 16.2.13-pve1                            amd64        distributed storage and file system
ii  ceph-base                            16.2.13-pve1                            amd64        common ceph daemon libraries and management tools
ii  ceph-common                          16.2.13-pve1                            amd64        common utilities to mount and interact with a ceph storage cluster
ii  ceph-fuse                            16.2.13-pve1                            amd64        FUSE-based client for the Ceph distributed file system
ii  ceph-mds                             16.2.13-pve1                            amd64        metadata server for the ceph distributed file system
ii  ceph-mgr                             16.2.13-pve1                            amd64        manager for the ceph distributed storage system
ii  ceph-mgr-modules-core                16.2.13-pve1                            all          ceph manager modules which are always enabled
ii  ceph-mon                             16.2.13-pve1                            amd64        monitor server for the ceph storage system
ii  ceph-osd                             16.2.13-pve1                            amd64        OSD server for the ceph storage system
...

Ceph works fine when both the public and cluster networks use TCP/IP, but the performance is not as good as expected (3 nodes x 10 Samsung 960 GB SATA SSDs).

Code:

root@node01:~# ceph -s
  cluster:
    id:     87d52299-0504-4c6f-8882-3bc27b85cc53
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum node01,node02,node03 (age 22s)
    mgr: node02(active, since 20s), standbys: node01, node03
    osd: 30 osds: 30 up (since 14s), 30 in (since 37m)

  data:
    pools:   2 pools, 129 pgs
    objects: 42 objects, 19 B
    usage:   2.6 TiB used, 26 TiB / 29 TiB avail
    pgs:     129 active+clean

My ceph.conf with tcp/ip network:

Code:

root@node01:~# cat /etc/ceph/ceph.conf
[global]
    auth_client_required = cephx
    auth_cluster_required = cephx
    auth_service_required = cephx
    cluster_network = 192.168.2.1/24
    fsid = 87d52299-0504-4c6f-8882-3bc27b85cc53
    mon_allow_pool_delete = true
    mon_host = 192.168.1.1 192.168.1.2 192.168.1.3
    ms_bind_ipv4 = true
    ms_bind_ipv6 = false
    osd_pool_default_min_size = 2
    osd_pool_default_size = 3
    public_network = 192.168.1.1/24

There are 4 Mellanox adapters (Mellanox Technologies MT27710 Family [ConnectX-4 Lx]) installed in each server, and I installed the MLNX_OFED driver from NVIDIA (MLNX_OFED_LINUX-5.8-3.0.7.0-debian11.3-x86_64).

The installed packages are as follows:

Code:

root@node01:~# dpkg -l |grep " mlnx"
ii  knem-dkms                            1.1.4.90mlnx2-OFED.23.07.0.2.2.1        all          DKMS support for mlnx-ofed kernel modules
ii  mlnx-ethtool                         5.18-1.58307                            amd64        This utility allows querying and changing settings such as speed,
ii  mlnx-iproute2                        5.19.0-1.58307                          amd64        This utility allows querying and changing settings such as speed,
ii  mlnx-ofed-all                        5.8-3.0.7.0                             all          MLNX_OFED all installer package  (with DKMS support)
ii  mlnx-ofed-kernel-dkms                5.8-OFED.5.8.3.0.7.1                    all          DKMS support for mlnx-ofed kernel modules
ii  mlnx-ofed-kernel-utils               5.8-OFED.5.8.3.0.7.1                    amd64        Userspace tools to restart and tune mlnx-ofed kernel modules
ii  mlnx-tools                           5.8.0-1.lts.58307                       amd64        Userspace tools to restart and tune MLNX_OFED kernel modules

I've tested the RDMA network with the following tools, and all of them seem to work fine:

- ibping
- ucmatose
- rping
- ibv_rc_pingpong
- ibv_srq_pingpong
- ibv_xsrq_pingpong
- rdma_xserver/rdma_xclient
- rdma_server/rdma_client

Also, the `hca_self_test.ofed`, `ibdev2netdev`, `ibstat`, `ibv_devices`, etc. commands show that everything works fine:

Code:

root@node01:/opt# hca_self_test.ofed

---- Performing Adapter Device Self Test ----
Number of CAs Detected ................. 8
PCI Device Check ....................... PASS
Kernel Arch ............................ x86_64
Host Driver Version .................... MLNX_OFED_LINUX-5.8-3.0.7.0 (OFED-5.8-3.0.7): 5.15.74-1-pve
Host Driver RPM Check .................. PASS
Firmware on CA #0 NIC .................. v14.32.1010
Firmware on CA #1 NIC .................. v14.32.1010
Firmware on CA #2 NIC .................. v14.32.1010
Firmware on CA #3 NIC .................. v14.32.1010
Firmware on CA #4 NIC .................. v14.32.1010
Firmware on CA #5 NIC .................. v14.32.1010
Firmware on CA #6 NIC .................. v14.32.1010
Firmware on CA #7 NIC .................. v14.32.1010
Host Driver Initialization ............. PASS
Number of CA Ports Active .............. 3
Port State of Port #1 on CA #0 (NIC)..... DOWN (Ethernet)
Port State of Port #1 on CA #1 (NIC)..... UP 1X EDR (Ethernet)
Port State of Port #1 on CA #2 (NIC)..... DOWN (Ethernet)
Port State of Port #1 on CA #3 (NIC)..... DOWN (Ethernet)
Port State of Port #1 on CA #4 (NIC)..... UP 1X EDR (Ethernet)
Port State of Port #1 on CA #5 (NIC)..... DOWN (Ethernet)
Port State of Port #1 on CA #6 (NIC)..... UP 1X EDR (Ethernet)
Port State of Port #1 on CA #7 (NIC)..... DOWN (Ethernet)
Error Counter Check on CA #0 (NIC)...... PASS
Error Counter Check on CA #1 (NIC)...... PASS
Error Counter Check on CA #2 (NIC)...... PASS
Error Counter Check on CA #3 (NIC)...... PASS
Error Counter Check on CA #4 (NIC)...... PASS
Error Counter Check on CA #5 (NIC)...... PASS
Error Counter Check on CA #6 (NIC)...... PASS
Error Counter Check on CA #7 (NIC)...... PASS
Kernel Syslog Check .................... PASS
Node GUID on CA #0 (NIC) ............... 30:c6:d7:00:00:8f:fa:01
Node GUID on CA #1 (NIC) ............... 30:c6:d7:00:00:8f:fa:02
Node GUID on CA #2 (NIC) ............... 30:c6:d7:00:00:8f:f9:9d
Node GUID on CA #3 (NIC) ............... 30:c6:d7:00:00:8f:f9:9e
Node GUID on CA #4 (NIC) ............... 30:c6:d7:00:00:8f:fa:ff
Node GUID on CA #5 (NIC) ............... 30:c6:d7:00:00:8f:fb:00
Node GUID on CA #6 (NIC) ............... 30:c6:d7:00:00:8f:f9:b3
Node GUID on CA #7 (NIC) ............... 30:c6:d7:00:00:8f:f9:b4
------------------ DONE ---------------------

root@node01:/opt# ibdev2netdev
mlx5_0 port 1 ==> ens1f0np0 (Down)
mlx5_1 port 1 ==> ens1f1np1 (Up)
mlx5_2 port 1 ==> ens2f0np0 (Down)
mlx5_3 port 1 ==> ens2f1np1 (Down)
mlx5_4 port 1 ==> ens4f0np0 (Up)
mlx5_5 port 1 ==> ens4f1np1 (Down)
mlx5_6 port 1 ==> ens5f0np0 (Up)
mlx5_7 port 1 ==> ens5f1np1 (Down)
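
The `ibdev2netdev` listing above is what ties an RDMA device name (for `ms_async_rdma_device_name`) to a network interface. As a rough sketch only, assuming the output format is exactly as pasted above, the mapping can be extracted like this. The function names are made up for illustration and are not part of any Mellanox tool.

```python
import re

# Matches lines such as "mlx5_1 port 1 ==> ens1f1np1 (Up)".
LINE_RE = re.compile(r"^(\S+) port (\d+) ==> (\S+) \((\w+)\)$")

# Output copied from the post above.
SAMPLE = """\
mlx5_0 port 1 ==> ens1f0np0 (Down)
mlx5_1 port 1 ==> ens1f1np1 (Up)
mlx5_2 port 1 ==> ens2f0np0 (Down)
mlx5_3 port 1 ==> ens2f1np1 (Down)
mlx5_4 port 1 ==> ens4f0np0 (Up)
mlx5_5 port 1 ==> ens4f1np1 (Down)
mlx5_6 port 1 ==> ens5f0np0 (Up)
mlx5_7 port 1 ==> ens5f1np1 (Down)
"""


def parse_ibdev2netdev(text):
    """Return a list of (rdma_dev, port, netdev, state) tuples."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            dev, port, netdev, state = match.groups()
            rows.append((dev, int(port), netdev, state))
    return rows


def rdma_device_for(netdev, rows):
    """Return the RDMA device behind `netdev` if its port is Up, else None."""
    for dev, _port, name, state in rows:
        if name == netdev and state == "Up":
            return dev
    return None


if __name__ == "__main__":
    rows = parse_ibdev2netdev(SAMPLE)
    print(rdma_device_for("ens5f0np0", rows))
```

On a live node the text would come from running `ibdev2netdev` itself instead of the pasted sample.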

So I tried to enable RDMA for Ceph, following the tips in [BRING UP CEPH RDMA - DEVELOPER'S GUIDE](https://enterprise-support.nvidia.com/s/article/bring-up-ceph-rdma---developer-s-guide).

These are the options I added to the [global] section of ceph.conf:

Code:

    #Enable ceph with RDMA:
    ms_async_op_threads = 8    #default 3
    # ms_type = async
    ms_public_type = async+posix    #keep frontend with posix
    ms_cluster_type = async+rdma    #for setting backend only to RDMA
    ms_async_rdma_type = rdma    #default ib
    ms_async_rdma_device_name = mlx5_6
    ms_async_rdma_cluster_device_name = ens5f0np0
    ms_async_rdma_roce_ver = 2
    ms_async_rdma_gid_idx = 3
    ms_async_rdma_local_gid = 0000:0000:0000:0000:0000:ffff:c0a8:0201

Following step 11 of the guide above, I also modified the Ceph service files (ceph-mgr@.service, ceph-mon@.service, ceph-osd@.service, ceph-mds@.service).

I updated `/etc/security/limits.conf` as well.
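
The exact lines from step 11 are not reproduced in this thread. RDMA needs to pin (lock) memory, so the locked-memory limit is presumably what those service file and `limits.conf` changes are about. A small sketch, under that assumption, for checking the limit: `resource.getrlimit` reports it for the current process, and the `/proc/<pid>/limits` parser can be pointed at a running daemon. The helper names are hypothetical.

```python
import resource


def describe(limit):
    """Render an rlimit value as text."""
    return "unlimited" if limit == resource.RLIM_INFINITY else f"{limit} bytes"


def memlock_is_unlimited(limits):
    """True if both the soft and hard locked-memory limits are unlimited."""
    soft, hard = limits
    return soft == resource.RLIM_INFINITY and hard == resource.RLIM_INFINITY


def memlock_from_proc_limits(text):
    """Extract (soft, hard) as strings from the text of /proc/<pid>/limits."""
    label = "Max locked memory"
    for line in text.splitlines():
        if line.startswith(label):
            fields = line[len(label):].split()
            return fields[0], fields[1]
    return None


if __name__ == "__main__":
    # Limit of the current process (e.g. the login shell), not of ceph-osd.
    limits = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    print("soft=%s hard=%s" % tuple(describe(v) for v in limits))
    if not memlock_is_unlimited(limits):
        print("locked-memory limit is finite; RDMA memory registration may fail")
```

For a daemon, read `/proc/<pid>/limits` of the ceph-osd process and pass its contents to `memlock_from_proc_limits`, since systemd units do not inherit the limits of a login shell.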

I understand that by default /etc/ceph/ceph.conf is a link to /etc/pve/ceph.conf (from pvefs) for cluster configuration file synchronization.

To use RoCE v2, I deleted the link and placed a separate /etc/ceph/ceph.conf on each server. The only difference between the files is `ms_async_rdma_local_gid`, which is taken from the output of the `show_gids` command on each server.

Here is a full ceph.conf file on server 1 (node01):

Code:

root@node01:~# cat /etc/ceph/ceph.conf
[global]
  auth_client_required = cephx
  auth_cluster_required = cephx
  auth_service_required = cephx
  cluster_network = 192.168.2.1/24
  fsid = 87d52299-0504-4c6f-8882-3bc27b85cc53
  mon_allow_pool_delete = true
  mon_host = 192.168.1.1 192.168.1.2 192.168.1.3
  ms_bind_ipv4 = true
  ms_bind_ipv6 = false
  osd_pool_default_min_size = 2
  osd_pool_default_size = 3
  public_network = 192.168.1.1/24

  #Enable ceph with RDMA:
  ms_async_op_threads = 8    #default 3
  # ms_type = async
  ms_public_type = async+posix    #keep frontend with posix
  ms_cluster_type = async+rdma    #for setting backend only to RDMA
  ms_async_rdma_type = rdma    #default ib
  ms_async_rdma_device_name = mlx5_6
  ms_async_rdma_cluster_device_name = ens5f0np0
  ms_async_rdma_roce_ver = 2
  ms_async_rdma_gid_idx = 3
  ms_async_rdma_local_gid = 0000:0000:0000:0000:0000:ffff:c0a8:0201    #This is the only difference.

[client]
  keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.node01]
  public_addr = 192.168.1.1

[mon.node02]
  public_addr = 192.168.1.2

[mon.node03]
  public_addr = 192.168.1.3
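
The `ms_async_rdma_local_gid` value in this file is the IPv4-mapped IPv6 form of the cluster address 192.168.2.1 (`c0a8:0201` is `192.168.2.1` in hex), which is how RoCE GIDs for IPv4 addresses are laid out. A minimal sketch of the conversion in both directions, using only Python's standard `ipaddress` module. The function names are illustrative, and `show_gids` remains the authoritative source for the value on a given node.

```python
import ipaddress


def ipv4_to_roce_gid(ip):
    """Return the IPv4-mapped GID (::ffff:a.b.c.d) in fully expanded form."""
    addr = ipaddress.IPv4Address(ip)
    mapped = ipaddress.IPv6Address((0xFFFF << 32) | int(addr))
    return mapped.exploded


def roce_gid_to_ipv4(gid):
    """Return the IPv4 address embedded in an IPv4-mapped GID, or None."""
    mapped = ipaddress.IPv6Address(gid).ipv4_mapped
    return str(mapped) if mapped is not None else None


if __name__ == "__main__":
    print(ipv4_to_roce_gid("192.168.2.1"))
    print(roce_gid_to_ipv4("0000:0000:0000:0000:0000:ffff:c0a8:0201"))
```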

When I start `ceph.target` on all servers, `ceph -s` shows everything as fine for the first few seconds. Then all PGs become unknown (100.000% pgs unknown), all ceph-osd processes crash, and the cluster health status changes to HEALTH_WARN, as shown in the following output.

Right after starting ceph.target:

Code:

root@node01:~# ceph -s
  cluster:
    id:     87d52299-0504-4c6f-8882-3bc27b85cc53
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum node01,node02,node03 (age 19s)
    mgr: node02(active, since 16s), standbys: node01, node03
    osd: 30 osds: 30 up (since 58m), 30 in (since 95m)

  data:
    pools:   2 pools, 129 pgs
    objects: 42 objects, 19 B
    usage:   2.6 TiB used, 26 TiB / 29 TiB avail
    pgs:     129 active+clean

root@node01:~# ps -ef |grep ceph
ceph     414046      1  3 23:25 ?        00:00:00 /usr/bin/python3.9 /usr/bin/ceph-crash
ceph     414047      1 37 23:25 ?        00:00:00 /usr/bin/ceph-mgr -f --cluster ceph --id node03 --setuser ceph --setgroup ceph
ceph     414048      1 12 23:25 ?        00:00:00 /usr/bin/ceph-mon -f --cluster ceph --id node03 --setuser ceph --setgroup ceph
ceph     414083      1  2 23:25 ?        00:00:00 /usr/bin/ceph-osd -f --cluster ceph --id 20 --setuser ceph --setgroup ceph
ceph     414094      1  3 23:25 ?        00:00:00 /usr/bin/ceph-osd -f --cluster ceph --id 21 --setuser ceph --setgroup ceph
ceph     414103      1  2 23:25 ?        00:00:00 /usr/bin/ceph-osd -f --cluster ceph --id 22 --setuser ceph --setgroup ceph
ceph     414106      1  2 23:25 ?        00:00:00 /usr/bin/ceph-osd -f --cluster ceph --id 23 --setuser ceph --setgroup ceph
ceph     414108      1  3 23:25 ?        00:00:00 /usr/bin/ceph-osd -f --cluster ceph --id 24 --setuser ceph --setgroup ceph
ceph     414109      1  2 23:25 ?        00:00:00 /usr/bin/ceph-osd -f --cluster ceph --id 25 --setuser ceph --setgroup ceph
ceph     414110      1  3 23:25 ?        00:00:00 /usr/bin/ceph-osd -f --cluster ceph --id 26 --setuser ceph --setgroup ceph
ceph     414113      1  3 23:25 ?        00:00:00 /usr/bin/ceph-osd -f --cluster ceph --id 27 --setuser ceph --setgroup ceph
ceph     414114      1  2 23:25 ?        00:00:00 /usr/bin/ceph-osd -f --cluster ceph --id 28 --setuser ceph --setgroup ceph
ceph     414115      1  2 23:25 ?        00:00:00 /usr/bin/ceph-osd -f --cluster ceph --id 29 --setuser ceph --setgroup ceph
root     414580 414456  0 23:25 ?        00:00:00 bash -c ps -ef |grep ceph
root     414584 414580  0 23:25 ?        00:00:00 grep ceph

After a few seconds...

Code:

root@node01:~# ceph -s
  cluster:
    id:     87d52299-0504-4c6f-8882-3bc27b85cc53
    health: HEALTH_WARN
            Reduced data availability: 129 pgs inactive

  services:
    mon: 3 daemons, quorum node01,node02,node03 (age 2m)
    mgr: node02(active, since 2m), standbys: node01, node03
    osd: 30 osds: 30 up (since 60m), 30 in (since 97m)

  data:
    pools:   2 pools, 129 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             129 unknown

root@node01:~# ps -ef |grep ceph
root     460031      1  0 23:24 ?        00:00:01 /usr/bin/rbd -p ceph_pool01 -c /etc/pve/ceph.conf --auth_supported cephx -n client.admin --keyring /etc/pve/priv/ceph/ceph_pool01.keyring ls
ceph     460967      1  0 23:25 ?        00:00:02 /usr/bin/python3.9 /usr/bin/ceph-crash
ceph     460968      1  0 23:25 ?        00:00:02 /usr/bin/ceph-mgr -f --cluster ceph --id node01 --setuser ceph --setgroup ceph
ceph     460969      1  0 23:25 ?        00:00:02 /usr/bin/ceph-mon -f --cluster ceph --id node01 --setuser ceph --setgroup ceph
root     474811 314176  0 23:31 pts/0    00:00:00 grep ceph

Can anyone help to solve it?

I was wondering, has anyone successfully enabled RDMA with ceph in PVE?

PS:
I truncated all Ceph logs before starting Ceph with RDMA, so the attached file is a complete Ceph startup log with the RDMA options configured in ceph.conf.

gurubert · Aug 25, 2023

Have you tested if the Ceph binaries from Proxmox have RDMA support?

# strings /usr/bin/ceph-osd |grep -i rdma

realguob · Aug 28, 2023

gurubert said:
Have you tested if the Ceph binaries from Proxmox have RDMA support?

# strings /usr/bin/ceph-osd |grep -i rdma

It gives lots of output, so it looks like RDMA is supported, but I'm not so sure about this.

The output of the command is in the file I uploaded. Please take a look, and thanks!

realguob · Aug 30, 2023

Hi all,
I tested it in PVE 8.0, and it works fine.

itNGO · Sep 2, 2023

realguob said:
hi all,
i tested it in pve 8.0, it works fine.

Do you have a guide to what you have configured?

realguob · Oct 10, 2023

itNGO said:
Do you have a guide to what you have configured?

here is the config list:

ms_async_op_threads = 8
ms_async_max_op_threads = 16
ms_type = async+rdma
ms_public_type = async+posix
ms_cluster_type = async+rdma
ms_async_rdma_type = ib
ms_async_transport_type = rdma
ms_async_rdma_device_name = mlx5_2
ms_async_rdma_cluster_device_name = eth3
ms_async_rdma_roce_ver = 2
ms_async_rdma_gid_idx = 3
ms_async_rdma_local_gid = 0000:0000:0000:0000:0000:ffff:c0a8:0285 #Get by show_gids on each node.
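
For anyone comparing this list with the PVE 7.3 options from the first post, here is a rough sketch that parses both snippets and prints the keys whose values differ. It says nothing about which difference (or the PVE/Ceph version change itself) made it work, since the thread does not establish that, and the device names and GID differ simply because the hardware and address are different. The helper names are made up.

```python
# Options from the first post (PVE 7.3) and from this post (PVE 8.0).
PVE73 = """
ms_async_op_threads = 8    #default 3
# ms_type = async
ms_public_type = async+posix    #keep frontend with posix
ms_cluster_type = async+rdma    #for setting backend only to RDMA
ms_async_rdma_type = rdma    #default ib
ms_async_rdma_device_name = mlx5_6
ms_async_rdma_cluster_device_name = ens5f0np0
ms_async_rdma_roce_ver = 2
ms_async_rdma_gid_idx = 3
ms_async_rdma_local_gid = 0000:0000:0000:0000:0000:ffff:c0a8:0201
"""

PVE80 = """
ms_async_op_threads = 8
ms_async_max_op_threads = 16
ms_type = async+rdma
ms_public_type = async+posix
ms_cluster_type = async+rdma
ms_async_rdma_type = ib
ms_async_transport_type = rdma
ms_async_rdma_device_name = mlx5_2
ms_async_rdma_cluster_device_name = eth3
ms_async_rdma_roce_ver = 2
ms_async_rdma_gid_idx = 3
ms_async_rdma_local_gid = 0000:0000:0000:0000:0000:ffff:c0a8:0285 #Get by show_gids on each node.
"""


def parse_options(text):
    """Parse 'key = value' lines, ignoring '#' comments and blank lines."""
    opts = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        opts[key] = value
    return opts


def diff_options(old, new):
    """Return {key: (old_value, new_value)} for keys that differ or are missing."""
    changes = {}
    for key in sorted(set(old) | set(new)):
        if old.get(key) != new.get(key):
            changes[key] = (old.get(key), new.get(key))
    return changes


if __name__ == "__main__":
    changes = diff_options(parse_options(PVE73), parse_options(PVE80))
    for key, (old, new) in changes.items():
        print(f"{key}: {old} -> {new}")
```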

itNGO · Nov 29, 2023

realguob said:
here is the config list:

ms_async_op_threads = 8
ms_async_max_op_threads = 16
ms_type = async+rdma
ms_public_type = async+posix
ms_cluster_type = async+rdma
ms_async_rdma_type = ib
ms_async_transport_type = rdma
ms_async_rdma_device_name = mlx5_2
ms_async_rdma_cluster_device_name = eth3
ms_async_rdma_roce_ver = 2
ms_async_rdma_gid_idx = 3
ms_async_rdma_local_gid = 0000:0000:0000:0000:0000:ffff:c0a8:0285 #Get by show_gids on each node.

Hi, can you go into a bit more detail about "ms_async_rdma_local_gid"?

Is this something that has to be set in each node? Or do I add multiple of them in the one shared /etc/ceph/ceph.conf?

Thanks...

Superfish1000 · Dec 10, 2023

itNGO said:
Is this something that has to be set in each node? Or do I add multiple of them in the one shared /etc/ceph/ceph.conf?

I wasn't sure on this at first either. From @realguob's original note though it looks like they need to be un-linked so they can be edited individually.

realguob said:
In order to use roce v2, I deleted the link and placed /etc/ceph/ceph.conf separately for each server, the only difference is that the `ms_async_rdma_local_gid` in `ceph.conf` is obtained from the `show_gids` command of each server itself.

Right now I'm looking to see if there is a way I can specify the GIDs for each node without needing to unlink them. I'll update if I find anything.

alexskysilk · Dec 10, 2023

@realguob I played around with Ceph over RoCE some years ago, and at the time I found it of little use (the performance improvement was negligible). Can you expound on your experience with it? (What hardware, any perf numbers for TCP vs. RoCE, etc.)

realguob · Dec 11, 2023

itNGO said:
Hi, can you go into a bit more detail about "ms_async_rdma_local_gid"?

Is this something that has to be set in each node? Or do I add multiple of them in the one shared /etc/ceph/ceph.conf?

Thanks...

Hi, `ms_async_rdma_local_gid` has to be set on each node, using the GID you find on that node for the right port.

You can get the port GID on each node with the `show_gids` command, like this:

Code:

root@node01:~# show_gids
DEV     PORT  INDEX  GID                                      IPv4          VER  DEV
---     ----  -----  ---                                      ------------  ---  ---
......
mlx5_6  1     2      0000:0000:0000:0000:0000:ffff:c0a8:0201  192.168.2.1   v1   ens5f0np0
mlx5_6  1     3      0000:0000:0000:0000:0000:ffff:c0a8:0201  192.168.2.1   v2   ens5f0np0    <--- this is where the GID comes from.
......

The v1 and v2 in the VER column are the RoCE version, I think.

On a PVE Ceph cluster, you first have to unlink the shared ceph.conf on each node, then create a separate one for each node in /etc/ceph/ and add the RDMA options to it.
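
To make the lookup above concrete, here is a sketch that picks the GID index and GID for a given interface and RoCE version out of `show_gids` output. It assumes the column layout shown in the output above (the IPv4 column may be empty for some rows), and the function names are invented for illustration.

```python
import re

# One data row: DEV PORT INDEX GID [IPv4] VER DEV. Anything after the
# trailing netdev (such as a hand-written annotation) is ignored.
ROW_RE = re.compile(
    r"^(?P<dev>\S+)\s+(?P<port>\d+)\s+(?P<index>\d+)\s+"
    r"(?P<gid>[0-9a-fA-F:]{39})\s+"
    r"(?:(?P<ipv4>\d+\.\d+\.\d+\.\d+)\s+)?"
    r"(?P<ver>v[12])\s+(?P<netdev>\S+)"
)

# Rows copied from the show_gids output above.
SAMPLE = """\
mlx5_6    1    2    0000:0000:0000:0000:0000:ffff:c0a8:0201    192.168.2.1      v1    ens5f0np0
mlx5_6    1    3    0000:0000:0000:0000:0000:ffff:c0a8:0201    192.168.2.1      v2    ens5f0np0
"""


def parse_show_gids(text):
    """Return one dict per data row; header and separator lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if match:
            row = match.groupdict()
            row["port"] = int(row["port"])
            row["index"] = int(row["index"])
            rows.append(row)
    return rows


def find_gid(rows, netdev, roce_ver="v2"):
    """Return (gid_index, gid) for the first IPv4 row matching netdev and version."""
    for row in rows:
        if row["netdev"] == netdev and row["ver"] == roce_ver and row["ipv4"]:
            return row["index"], row["gid"]
    return None


if __name__ == "__main__":
    print(find_gid(parse_show_gids(SAMPLE), "ens5f0np0"))
```

The two returned values correspond to `ms_async_rdma_gid_idx` and `ms_async_rdma_local_gid` in the configs posted earlier.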

realguob · Dec 11, 2023

Superfish1000 said:
I wasn't sure on this at first either. From @realguob's original note though it looks like they need to be un-linked so they can be edited individually.

Right now I'm looking to see if there is a way I can specify the GIDs for each node without needing to unlink them. I'll update if I find anything.

I've tried, but found nothing. If you find a way, please share it...

realguob · Dec 11, 2023

alexskysilk said:
@realguob I played around with Ceph over RoCE some years ago, and at the time I found it of little use (the performance improvement was negligible). Can you expound on your experience with it? (What hardware, any perf numbers for TCP vs. RoCE, etc.)

So far I feel the same way as you, and I have no plans to enable RDMA in production.

alexskysilk · Dec 11, 2023

RoCE might come into play once Crimson is deployed, as it will allow multiple threads per OSD, and bottlenecks could shift to the interconnect. But that's probably a couple of years away.

Superfish1000 · Dec 11, 2023

Okay, so... If I'm understanding this correctly, it is possible to do this, though you can't do it in the config file.

It looks like Ceph is shifting away from a config file to a configuration database. This appears to be the only way you can do it.
https://documentation.suse.com/ses/...-configuration.html#cha-ceph-configuration-db

You should be able to do this with masks.
Ref: https://docs.ceph.com/en/latest/rados/configuration/ceph-conf/#monitor-configuration-database
What I tested successfully was using the following command to set a relatively benign setting that I could then verify on the host.

If I understand correctly, `ceph config set` is the one to use here, whereas `ceph tell` is for a temporary change.

ceph config set global/host:Node-4 osd_scrub_begin_hour 11

You can then verify the setting by querying an OSD on the host.

ceph daemon osd.1 config show | grep osd_scrub_begin_hour

This will independently set the values across the hosts.

I'm testing this now to see if it works.
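
Applied to the per-node GID question, the same host-mask approach could be sketched as follows. This only builds the command strings and does not run anything. The node01 GID is the one from earlier in the thread, while the node02 value is a placeholder I made up. Whether the RDMA messenger options are actually picked up from the configuration database at daemon start has not been confirmed here, since the test above used `osd_scrub_begin_hour`.

```python
import shlex


def per_host_commands(values_by_host, option="ms_async_rdma_local_gid"):
    """Build one 'ceph config set global/host:<name> <option> <value>' per host."""
    commands = []
    for host, value in sorted(values_by_host.items()):
        argv = ["ceph", "config", "set", f"global/host:{host}", option, value]
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    gids = {
        # From the show_gids output posted earlier in the thread.
        "node01": "0000:0000:0000:0000:0000:ffff:c0a8:0201",
        # Hypothetical placeholder; take the real value from show_gids on node02.
        "node02": "0000:0000:0000:0000:0000:ffff:c0a8:0202",
    }
    for command in per_host_commands(gids):
        print(command)
```

Each printed line has the same shape as the `osd_scrub_begin_hour` command above, just with a different option and a value per host.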

Superfish1000 · Dec 15, 2023

So from my testing it does appear that the commands I mentioned above work.
This should allow you to set these values on a per node basis.

As for performance changes, @realguob, @alexskysilk, what kind of performance changes did you see if any?
I am pretty sure I identified other serious issues with cluster layout, but I noticed zero change so far from switching to RDMA.
I'm pretty sure now that the issue is more my WAL and DB setup coupled with the low number of HDDs I have rather than the network throughput.

alexskysilk · Dec 15, 2023

Superfish1000 said:
what kind of performance changes did you see if any?

Effectively, none. RDMA can come into play performance-wise if/when you end up bottlenecked at the network layer. In order to get to that point you need to:
1. have a sufficiently large deployment (hundreds if not thousands of OSDs)
2. have fast enough OSDs (NVMe)
3. have sufficient load to stress the subsystem (if you have 1 and 2 handled, you'd need to generate millions of IOPS)

Even then, you'd gain maybe a 10-20% improvement under specific conditions. I didn't have a testbed that met those criteria.

This MIGHT change when Ceph OSDs switch to Crimson (https://docs.ceph.com/en/latest/dev/crimson/crimson/), as it is truly multithreaded, which would allow saturation of individual OSDs.
