Ceph BlueStore over RDMA performance gain

Gerhard W. Recher

Active Member
Mar 10, 2017
158
8
38
Munich
I just got udaddy and rping working by following elurex's suggestion:
Code:
# unpack and build the OFED driver
cd DEBS

dpkg --force-overwrite -i *.deb
reboot
After that, all the RDMA pingers are happy.

But after changing ceph.conf to RDMA, mon and mgr still complain about memlock, even though I set these limits to infinity!

I'm totally lost at the moment and have no clue why mon and mgr fail to invoke RDMA!
 

alexskysilk

Renowned Member
Oct 16, 2015
804
105
63
Chatsworth, CA
www.skysilk.com
Changing to RDMA means restarting all monitors; the process is the same as the one described at https://pve.proxmox.com/wiki/Ceph_Server#Disabling_Cephx
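
A minimal sketch of what such a switch could look like, assuming the procedure boils down to "stop everything, edit ceph.conf, start everything". The option names `ms_type = async+rdma` and `ms_async_rdma_device_name` are the usual Luminous messenger settings and are not quoted anywhere in this thread; the device name `mlx4_0` is only a placeholder. The `run` helper is hypothetical and prints commands instead of executing them unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch only: commands are printed, not executed, unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Assumed RDMA messenger options for the [global] section of ceph.conf.
# The device name is a placeholder; `ibv_devices` lists the real ones.
rdma_conf_fragment() {
    printf 'ms_type = async+rdma\n'
    printf 'ms_async_rdma_device_name = %s\n' "${1:-mlx4_0}"
}

# 1. On every node: stop all Ceph daemons.
run systemctl stop ceph.target

# 2. Add these lines to [global] by hand (printed here, not written to disk).
rdma_conf_fragment mlx4_0

# 3. On every node: start the daemons again.
run systemctl start ceph.target
```

Because every daemon has to speak the same messenger type, the change is made with the whole cluster stopped rather than node by node.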
 

Gerhard W. Recher

I rebooted the whole cluster, all 4 nodes!
Do you have a running cluster with RDMA? Which steps did you take, in detail, to get it running?
I'm totally stuck ...

alexskysilk

I did have it working last year and determined the performance gain (loss) to be negligible and not worth the effort. Without actually re-attempting it, I can't see anything specific that you're doing wrong.
 

Gerhard W. Recher

@elurex ... please finally post your exact steps, along with the configuration file modifications you made ...
 

Gerhard W. Recher

@elurex claims 25% more speed, not a loss, but has not responded to my further questions in this thread, as you can see ... :(
 

alexskysilk

Your overall experience is going to depend on a number of factors, including (but not limited to) your drives, the CPU/RAM loadout of the nodes, the interface speed, etc. I also benchmarked Jewel with FileStore. In other words: YMMV.
 

elurex

Active Member
Oct 28, 2015
204
11
38
Taiwan
Sometimes my ceph-mon@[server id].service and ceph-mgr@[server id].service do not start automatically, and I have to do the following:

Code:
systemctl enable ceph-mon@[server id].service
systemctl enable ceph-mgr@[server id].service

systemctl start ceph-mon@[server id].service
systemctl start ceph-mgr@[server id].service

Code:
root@epyc2:/lib/systemd/system# cat ceph-disk@.service
[Unit]
Description=Ceph disk activation: %f
After=local-fs.target
Wants=local-fs.target

[Service]
Type=oneshot
KillMode=none
Environment=CEPH_DISK_TIMEOUT=10000
ExecStart=/bin/sh -c 'timeout $CEPH_DISK_TIMEOUT flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'
TimeoutSec=0
LimitMEMLOCK=infinity

root@epyc2:/lib/systemd/system# cat ceph-mgr@.service
[Unit]
Description=Ceph cluster manager daemon
After=network-online.target local-fs.target time-sync.target
Wants=network-online.target local-fs.target time-sync.target
PartOf=ceph-mgr.target

[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
EnvironmentFile=-/etc/default/ceph
Environment=CLUSTER=ceph

ExecStart=/usr/bin/ceph-mgr -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=10
StartLimitInterval=30min
StartLimitBurst=3
LimitMEMLOCK=infinity

[Install]
WantedBy=ceph-mgr.target

root@epyc2:/lib/systemd/system# cat ceph-mon@.service
[Unit]
Description=Ceph cluster monitor daemon

# According to:
#   http://www.freedesktop.org/wiki/Software/systemd/NetworkTarget
# these can be removed once ceph-mon will dynamically change network
# configuration.
After=network-online.target local-fs.target time-sync.target
Wants=network-online.target local-fs.target time-sync.target

PartOf=ceph-mon.target

[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
EnvironmentFile=-/etc/default/ceph
Environment=CLUSTER=ceph
ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph
ExecReload=/bin/kill -HUP $MAINPID
PrivateDevices=yes
ProtectHome=true
ProtectSystem=full
PrivateTmp=true
TasksMax=infinity
Restart=on-failure
StartLimitInterval=30min
StartLimitBurst=5
RestartSec=10
LimitMEMLOCK=infinity
PrivateDevices=no

[Install]
WantedBy=ceph-mon.target

root@epyc2:/lib/systemd/system# cat ceph-osd@.service
[Unit]
Description=Ceph object storage daemon osd.%i
After=network-online.target local-fs.target time-sync.target ceph-mon.target
Wants=network-online.target local-fs.target time-sync.target
PartOf=ceph-osd.target

[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
EnvironmentFile=-/etc/default/ceph
Environment=CLUSTER=ceph
ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph
ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i
ExecReload=/bin/kill -HUP $MAINPID
ProtectHome=true
ProtectSystem=full
PrivateTmp=true
TasksMax=infinity
Restart=on-failure
StartLimitInterval=30min
StartLimitBurst=30
RestartSec=20s
LimitMEMLOCK=infinity
PrivateDevices=no

[Install]
WantedBy=ceph-osd.target
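
A quick way to confirm that a changed limit actually reached a running daemon is sketched below. It is not from the thread: the helper name `locked_memory_limit` is hypothetical, and it only reads the standard `/proc/<pid>/limits` file. Keep in mind that systemd picks up edited unit files only after `systemctl daemon-reload`, and that a unit file of the same name under /etc/systemd/system takes precedence over the one in /lib/systemd/system.

```shell
#!/bin/sh
# Print the soft "Max locked memory" limit from a /proc/<pid>/limits file.
locked_memory_limit() {
    awk '/^Max locked memory/ { print $4 }' "$1"
}

# Demo on the current process; for a daemon, point it at
# /proc/<pid of ceph-mon>/limits instead.
if [ -r /proc/self/limits ]; then
    locked_memory_limit /proc/self/limits
fi

# After editing a unit file, reload systemd and ask what it will apply
# (replace [server id] with the monitor id; not run here):
#   systemctl daemon-reload
#   systemctl show -p LimitMEMLOCK -p PrivateDevices ceph-mon@[server id].service
```

If the daemon still shows a small number where `unlimited` was expected, the unit file being edited is probably not the one systemd is using.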

It's Computex week... busy as hell.
 

Gerhard W. Recher

Thanks! That was the match winner. I had stored the unit files in /etc/systemd/system ... my fault, apparently.

YMMD :)

Normal TCP/IP:
Code:
Total time run:         60.019108
Total writes made:      41911
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     2793.18
Stddev Bandwidth:       42.2206
Max bandwidth (MB/sec): 2876
Min bandwidth (MB/sec): 2680
Average IOPS:           698
Stddev IOPS:            10
Max IOPS:               719
Min IOPS:               670
Average Latency(s):     0.0229107
Stddev Latency(s):      0.00582898
Max latency(s):         0.236009
Min latency(s):         0.0101753

RDMA enabled:
Code:
Total time run:         60.020801
Total writes made:      46247
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     3082.06
Stddev Bandwidth:       116.199
Max bandwidth (MB/sec): 3280
Min bandwidth (MB/sec): 2652
Average IOPS:           770
Stddev IOPS:            29
Max IOPS:               820
Min IOPS:               663
Average Latency(s):     0.0207637
Stddev Latency(s):      0.00805267
Max latency(s):         0.214858
Min latency(s):         0.00835243
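
To put the two runs side by side, the relative change can be worked out from the figures above. This is only a small sketch (the helper name `pct_change` is made up for illustration), and it compares single 60-second write runs with 4 MiB objects, so it says nothing about variance between runs.

```shell
#!/bin/sh
# Percentage change from a baseline value to a new value, one decimal place.
pct_change() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (new - base) / base * 100 }'
}

# Figures copied from the TCP/IP and RDMA results above.
echo "bandwidth:   $(pct_change 2793.18 3082.06) %"
echo "avg IOPS:    $(pct_change 698 770) %"
echo "avg latency: $(pct_change 0.0229107 0.0207637) %"
```

With these numbers, RDMA comes out at roughly 10% more bandwidth and IOPS and about 9% lower average latency, while the bandwidth standard deviation rises from about 42 to about 116 MB/sec.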
 

Gerhard W. Recher

Seems to be OK, but one Windows 2016 VM does not start; all other KVM guests and containers start and behave as expected ...
Perhaps because a snapshot was taken before?


Code:
/home/builder/source/ceph-12.2.5/src/msg/async/rdma/Infiniband.cc: In function 'void Infiniband::set_dispatcher(RDMADispatcher*)' thread 7faefe99a280 time 2018-06-09 12:43:37.183257
/home/builder/source/ceph-12.2.5/src/msg/async/rdma/Infiniband.cc: 779: FAILED assert(!d ^ !dispatcher)
ceph version 12.2.5 (dfcb7b53b2e4fcd2a5af0240d4975adc711ab96e) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7faeedc036d2]
2: (()+0x45003a) [0x7faeeddb903a]
3: (RDMAStack::RDMAStack(CephContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x97e) [0x7faeeddc7a3e]
4: (NetworkStack::create(CephContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x212) [0x7faeeddad022]
5: (AsyncMessenger::AsyncMessenger(CephContext*, entity_name_t, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long)+0xeb5) [0x7faeedda16d5]
6: (Messenger::create(CephContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, entity_name_t, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long, unsigned long)+0x10f) [0x7faeedd4ecdf]
7: (Messenger::create_client_messenger(CephContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0x14a) [0x7faeedd4f3aa]
8: (librados::RadosClient::connect()+0x85) [0x7faefc85ccf5]
9: (rados_connect()+0x1f) [0x7faefc8093af]
10: (()+0x58b872) [0x55af0585b872]
11: (()+0x51ddb2) [0x55af057eddb2]
12: (()+0x51f306) [0x55af057ef306]
13: (()+0x51ff1b) [0x55af057eff1b]
14: (()+0x51ee44) [0x55af057eee44]
15: (()+0x520061) [0x55af057f0061]
16: (()+0x55f7d1) [0x55af0582f7d1]
17: (()+0x34103f) [0x55af0561103f]
18: (()+0x34ba31) [0x55af0561ba31]
19: (()+0x5fbfea) [0x55af058cbfea]
20: (main()+0x1476) [0x55af054ed696]
21: (__libc_start_main()+0xf1) [0x7faef8c692e1]
22: (()+0x22425a) [0x55af054f425a]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
TASK ERROR: start failed: command '/usr/bin/kvm -id 120 -name wsus -chardev 'socket,id=qmp,path=/var/run/qemu-server/120.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -pidfile /var/run/qemu-server/120.pid -daemonize -smbios 'type=1,uuid=ac188274-8f76-4369-9ec1-4e98e18cfed5' -smp '16,sockets=2,cores=8,maxcpus=16' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga std -vnc unix:/var/run/qemu-server/120.vnc,x509,password -no-hpet -cpu 'host,+kvm_pv_unhalt,+kvm_pv_eoi,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_reset,hv_vpindex,hv_runtime,hv_relaxed' -m 16096 -object 'iothread,id=iothread-virtio1' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -chardev 'socket,path=/var/run/qemu-server/120.qga,server,nowait,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:cda4de8d061' -drive 'if=none,id=drive-ide0,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.0,unit=0,drive=drive-ide0,id=ide0,bootindex=200' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=rbd:vmpool/vm-120-disk-3:mon_host=192.168.100.141;192.168.100.142;192.168.100.143;192.168.100.144:auth_supported=none,if=none,id=drive-scsi2,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=2,drive=drive-scsi2,id=scsi2' -drive 'file=rbd:vmpool/vm-120-disk-1:mon_host=192.168.100.141;192.168.100.142;192.168.100.143;192.168.100.144:auth_supported=none,if=none,id=drive-virtio0,cache=none,format=raw,aio=native,detect-zeroes=on' -device 'virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa,bootindex=101' -drive 
'file=rbd:vmpool/vm-120-disk-2:mon_host=192.168.100.141;192.168.100.142;192.168.100.143;192.168.100.144:auth_supported=none,if=none,id=drive-virtio1,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'virtio-blk-pci,drive=drive-virtio1,id=virtio1,bus=pci.0,addr=0xb,iothread=iothread-virtio1' -netdev 'type=tap,id=net0,ifname=tap120i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=C6:17:91:A8:66:9C,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -rtc 'driftfix=slew,base=localtime' -global 'kvm-pit.lost_tick_policy=discard'' 
failed: exit code 1
 

elurex

The rbd map command (which PVE uses by default to load a Ceph block device into the kernel) does not work with Ceph over RDMA; you need to use the rbd nbd map command in order to mount Ceph block storage.

This is really going into Ceph territory and is not something PVE can provide.

I would suggest reading up on the rbd nbd map and rbd nbd unmap commands first.
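
As a rough sketch of what those commands look like in practice, using the standalone `rbd-nbd` tool: the pool and image names are taken from the VM start command earlier in the thread, `/dev/nbd0` is an assumption (the map command prints the device it really attached), and the hypothetical `run` helper only prints commands unless `DRY_RUN=0` is set. Whether this resolves the VM start failure above is not established in the thread.

```shell
#!/bin/sh
# Sketch only: commands are printed, not executed, unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

POOL=vmpool            # pool name as seen in the kvm command line above
IMAGE=vm-120-disk-1    # image name as seen in the kvm command line above
DEV=/dev/nbd0          # assumption: map reports the device actually used

run modprobe nbd                  # rbd-nbd needs the nbd kernel module
run rbd-nbd map "$POOL/$IMAGE"    # attach the image as a block device
run rbd-nbd list-mapped           # show current mappings
run rbd-nbd unmap "$DEV"          # detach again when done
```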
 

elurex

You can actually use the mlnxofedinstall script, BUT you have to remove the pve packages first. pve conflicts with the mlnx packages but NOT THE OTHER WAY AROUND, so you can reinstall proxmox-ve after OFED, like so:
  1. apt-get remove proxmox-ve pve*
  2. apt-get install pve-kernel-x.xx.x-x-pve pve-headers (the kernel will be uninstalled in step 1)
  3. Navigate to the OFED directory
  4. ./mlnxofedinstall --skip-distro-check --force (choose additional switches as relevant)
  5. /etc/init.d/openibd restart
  6. apt-get install proxmox-ve

Your method works great and is much more elegant. No error messages at all (though uninstalling the whole of PVE is a bit nerve-racking).
Code:
Device #1:
----------

  Device Type:      ConnectX3
  Part Number:      MCX354A-FCB_A2-A5
  Description:      ConnectX-3 VPI adapter card; dual-port QSFP; FDR IB (56Gb/s) and 40GigE; PCIe3.0 x8 8GT/s; RoHS R6
  PSID:             MT_1090120019
  PCI Device Name:  41:00.0
  Port1 GUID:       0002c90300e75721
  Port2 GUID:       0002c90300e75722
  Versions:         Current        Available
     FW             2.42.5000      2.42.5000
     PXE            3.4.0752       3.4.0752

  Status:           Up to date


Log File: /tmp/MLNX_OFED_LINUX.2278026.logs/fw_update.log
Device (41:00.0):
        41:00.0 Infiniband controller: Mellanox Technologies MT27500 Family [ConnectX-3]
        Link Width: x8
        PCI Link Speed: 8GT/s

Installation passed successfully
To load the new driver, run:
/etc/init.d/openibd restart

Only a few notes for others to take care of:
  1. Even if the node is in a cluster, it will still be there once you restart the corosync and pve-cluster services. If it has Ceph OSDs, it is better to set noout first.
  2. In order to uninstall the pve packages, you must run touch '/please-remove-proxmox-ve'
  3. In order to install the pve kernel and headers, you must scp /usr/share/proxmox-ve/pve-apt-hook to that node
  4. My personal preference for installing mlnxofed is ./mlnxofedinstall --skip-distro-check --force --without-fw-update --dkms
  5. The first install of proxmox-ve will error, but this can be fixed by running systemctl stop pvestatd.service and systemctl start pvestatd.service, then manually running dpkg --configure pve-manager proxmox-ve
    Code:
    Job for pvestatd.service failed because the control process exited with error code.
    See "systemctl status pvestatd.service" and "journalctl -xe" for details.
    dpkg: error processing package pve-manager (--configure):
     subprocess installed post-installation script returned error exit status 1
    dpkg: dependency problems prevent configuration of proxmox-ve:
     proxmox-ve depends on pve-manager; however:
      Package pve-manager is not configured yet.
    
    dpkg: error processing package proxmox-ve (--configure):
     dependency problems - leaving unconfigured
    Processing triggers for libc-bin (2.24-11+deb9u3) ...
    Processing triggers for pve-ha-manager (2.0-5) ...
    Processing triggers for systemd (232-25+deb9u2) ...
    Errors were encountered while processing:
     pve-manager
     proxmox-ve
    E: Sub-process /usr/bin/dpkg returned an error code (1)
  6. apt install rbd-nbd
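
Put together, the steps and notes above could be sketched as one sequence. This is an illustration, not a tested installer: the kernel package name is the placeholder from the post and has to be filled in, `ceph osd set noout` / `ceph osd unset noout` are the standard commands for the noout note, the destination path for pve-apt-hook is assumed to be the same as the source path, and the hypothetical `run` helper only prints commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch only: commands are printed, not executed, unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

KERNEL_PKG="pve-kernel-x.xx.x-x-pve"   # placeholder from the post; fill in your version

run ceph osd set noout                  # keep OSDs from being marked out meanwhile
run touch /please-remove-proxmox-ve     # needed before proxmox-ve can be removed
run apt-get remove proxmox-ve 'pve*'
# Copy pve-apt-hook back from another node before reinstalling (path assumed):
#   scp <other-node>:/usr/share/proxmox-ve/pve-apt-hook /usr/share/proxmox-ve/
run apt-get install "$KERNEL_PKG" pve-headers
# Run the installer from inside the unpacked OFED directory.
run ./mlnxofedinstall --skip-distro-check --force --without-fw-update --dkms
run /etc/init.d/openibd restart
run apt-get install proxmox-ve          # reported to fail once on pvestatd
run systemctl stop pvestatd.service
run systemctl start pvestatd.service
run dpkg --configure pve-manager proxmox-ve
run apt-get install rbd-nbd
run ceph osd unset noout
```

Running it as-is only prints the plan, which makes it easy to review the order of operations before touching a cluster node.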
 

Gerhard W. Recher

Why are the other KVM guests running, then, without rbd-nbd installed? Only this Windows 2016 VM with snapshots is troublesome ...
I will consolidate the snapshot and give it another try this afternoon.
 

Gerhard W. Recher

I removed the snapshot from the VM and tried again with RDMA ... same result.

How do I manage this now?
The start commands for VMs are managed by the Proxmox GUI ...
I thought enabling RDMA for Ceph was a transparent change; how did you manage this within Proxmox?

I have no clue, I'm lost in a maze ...
 

elurex

You're a brave man... I would never do this on a node in a cluster ;) I evict the node first and re-add when done.
well... there is always zfs rollback if anything goes wrong...

I have currently turned RDMA for Ceph off again. When the memlock limit is set to infinity, the monitor runs out of memory and hangs periodically, and the only way to avoid that is a cron job that restarts the service, which in my mind is not ready for prime time. The performance gain and the latency reduction definitely work as wonderfully as Mellanox has advertised, but they need to fix the memory control issue ASAP; setting it to infinity is not a feasible solution.
 

jsterr

Active Member
Jul 24, 2020
218
38
33
31
Has anyone tried RDMA with Ceph since 2018? Is it easier to accomplish these days or is it still a hassle with lots of errors?
 
