OpenSM is not detecting ib ports

tincboy

I'm connecting two Proxmox servers with ConnectX-3 cards. They can see each other in Ethernet mode, but in IB mode the link remains in the Initializing state even though the physical link is up.
The OpenSM logs also show that it can't detect the IB ports, even though the ports are up. Both cards are the same model and run the same firmware version.
Any suggestions on how to fix it?

Code:
ibstat
CA 'mlx4_0'
        CA type: MT4103
        Number of ports: 2
        Firmware version: 2.32.5290
        Hardware version: 0
        Node GUID: 0x480fcffffff428b0
        System image GUID: 0x480fcffffff428b3
        Port 1:
                State: Initializing
                Physical state: LinkUp
                Rate: 40
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x02514868
                Port GUID: 0x480fcffffff428b1
                Link layer: InfiniBand
        Port 2:
                State: Initializing
                Physical state: LinkUp
                Rate: 40
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x02514868
                Port GUID: 0x480fcffffff428b2
                Link layer: InfiniBand
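
For anyone comparing both nodes, the full ibstat output is a lot to scan. A minimal sketch that prints only the logical state of each port, assuming the output layout shown above (the function name `port_states` is hypothetical, and with several CAs the port numbers will repeat):

```shell
# Read `ibstat` output on stdin and print one "port N: <State>" line per port.
# Only lines starting with "State:" are matched, so "Physical state:" is skipped.
port_states() {
    awk '
        /^[ \t]*Port [0-9]+:/ { port = $2; sub(/:$/, "", port) }
        /^[ \t]*State:/       { print "port " port ": " $2 }
    '
}
```

Usage would be `ibstat | port_states`. A port that reports Initializing with a LinkUp physical state has not yet been configured by a subnet manager.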

Code:
systemctl status opensm
● opensm.service - LSB: Start opensm subnet manager.
   Loaded: loaded (/etc/init.d/opensm; generated)
   Active: active (exited) since Fri 2019-08-09 08:03:47 EDT; 8min ago
     Docs: man:systemd-sysv-generator(8)
    Tasks: 0 (limit: 9830)
   Memory: 0B
   CGroup: /system.slice/opensm.service

Aug 09 08:03:47 systemd[1]: Starting LSB: Start opensm subnet manager....
Aug 09 08:03:47 opensm[4674]: No infiniband adapters found.
Aug 09 08:03:47 systemd[1]: Started LSB: Start opensm subnet manager..


Code:
ibstat -p
0x480fcffffff428b1
0x480fcffffff428b2
 
After running "modprobe ib_umad" on both Proxmox nodes, OpenSM detects the ports, but the link is still in the Initializing state:
Code:
service opensm status
● opensm.service - LSB: Start opensm subnet manager.
   Loaded: loaded (/etc/init.d/opensm; generated)
   Active: active (exited) since Fri 2019-08-09 08:18:48 EDT; 10min ago
     Docs: man:systemd-sysv-generator(8)
  Process: 7598 ExecStart=/etc/init.d/opensm start (code=exited, status=0/SUCCESS)

Aug 09 08:18:48 com1 opensm[7598]: Starting opensm on 0x4a0fcffffef445c2:
Aug 09 08:18:48 com1 systemd[1]: Started LSB: Start opensm subnet manager..
Aug 09 08:18:48 com1 OpenSM[7603]: /var/log/opensm.0x4a0fcffffef445c1.log log file opened
Aug 09 08:18:48 com1 OpenSM[7603]: OpenSM 3.3.21
Aug 09 08:18:48 com1 OpenSM[7606]: /var/log/opensm.0x4a0fcffffef445c2.log log file opened
Aug 09 08:18:48 com1 OpenSM[7606]: OpenSM 3.3.21
Aug 09 08:18:48 com1 OpenSM[7603]: Entering DISCOVERING state
Aug 09 08:18:48 com1 OpenSM[7606]: Entering DISCOVERING state
Aug 09 08:18:49 com1 OpenSM[7603]: Exiting SM
Aug 09 08:18:49 com1 OpenSM[7606]: Exiting SM
Code:
tail -100 /var/log/opensm.0x4a0fcffffef445c1.log
Aug 09 08:17:50 483441 [70FB4B80] 0x03 -> OpenSM 3.3.21
Aug 09 08:17:50 483610 [70FB4B80] 0x80 -> OpenSM 3.3.21
Aug 09 08:17:50 487231 [70FB4B80] 0x02 -> osm_vendor_init: 1000 pending umads specified
Aug 09 08:17:50 487569 [70FB4B80] 0x80 -> Entering DISCOVERING state
Aug 09 08:17:50 487777 [70FB4B80] 0x02 -> osm_vendor_bind: Mgmt class 0x81 binding to port GUID 0x4a0fcffffef445c1
Aug 09 08:17:50 547918 [70FB4B80] 0x01 -> osm_vendor_open_port: ERR 5422: Unable to find requested CA guid 0x4a0fcffffef445c1
Aug 09 08:17:50 547951 [70FB4B80] 0x01 -> osm_vendor_bind: ERR 5424: Unable to open port 0x4a0fcffffef445c1
Aug 09 08:17:50 547961 [70FB4B80] 0x01 -> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind failed
Aug 09 08:17:50 547971 [70FB4B80] 0x01 -> osm_sm_bind: ERR 2E10: SM MAD Controller bind failed (IB_ERROR)
Aug 09 08:17:50 547991 [70FB4B80] 0x01 -> perfmgr_mad_unbind: ERR 5405: No previous bind
Aug 09 08:17:50 548000 [70FB4B80] 0x01 -> osm_congestion_control_shutdown: ERR C108: No previous bind
Aug 09 08:17:50 548159 [70FB4B80] 0x01 -> osm_sa_mad_ctrl_unbind: ERR 1A11: No previous bind
Aug 09 08:17:50 550423 [70FB4B80] 0x80 -> Exiting SM
Aug 09 08:18:48 941155 [65DF7B80] 0x03 -> OpenSM 3.3.21
Aug 09 08:18:48 941228 [65DF7B80] 0x80 -> OpenSM 3.3.21
Aug 09 08:18:48 943710 [65DF7B80] 0x02 -> osm_vendor_init: 1000 pending umads specified
Aug 09 08:18:48 943962 [65DF7B80] 0x80 -> Entering DISCOVERING state
Aug 09 08:18:48 944159 [65DF7B80] 0x02 -> osm_vendor_bind: Mgmt class 0x81 binding to port GUID 0x4a0fcffffef445c1
Aug 09 08:18:49 002560 [65DF7B80] 0x01 -> osm_vendor_open_port: ERR 5422: Unable to find requested CA guid 0x4a0fcffffef445c1
Aug 09 08:18:49 002613 [65DF7B80] 0x01 -> osm_vendor_bind: ERR 5424: Unable to open port 0x4a0fcffffef445c1
Aug 09 08:18:49 002633 [65DF7B80] 0x01 -> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind failed
Aug 09 08:18:49 002645 [65DF7B80] 0x01 -> osm_sm_bind: ERR 2E10: SM MAD Controller bind failed (IB_ERROR)
Aug 09 08:18:49 002671 [65DF7B80] 0x01 -> perfmgr_mad_unbind: ERR 5405: No previous bind
Aug 09 08:18:49 002682 [65DF7B80] 0x01 -> osm_congestion_control_shutdown: ERR C108: No previous bind
Aug 09 08:18:49 002834 [65DF7B80] 0x01 -> osm_sa_mad_ctrl_unbind: ERR 1A11: No previous bind
Aug 09 08:18:49 005219 [65DF7B80] 0x80 -> Exiting SM
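
OpenSM reaches the adapter through the userspace MAD devices that the ib_umad kernel module provides, so it is worth confirming the module is actually loaded before restarting the service. A small sketch, assuming the usual sysfs layout where each loaded module has a directory under /sys/module (the function name `module_loaded` and its optional second argument are hypothetical, added only so the check can be tried against another directory):

```shell
# Succeed if a directory for the given module exists under <sysfs_root>/module.
# $1 = module name, $2 = sysfs root (defaults to /sys).
module_loaded() {
    [ -d "${2:-/sys}/module/$1" ]
}
```

For example, `module_loaded ib_umad || modprobe ib_umad` loads the module only when it is missing.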
 
The link staying in the Initializing state was then fixed by running "modprobe xprtrdma".
I've added both modules to /etc/modules so they load automatically after a reboot.
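
A rough sketch of how the two module names could be added to /etc/modules without creating duplicate lines if it is run more than once. The function name `ensure_module` and its optional file argument are hypothetical, and it assumes the file already ends with a newline:

```shell
# Append a module name to a modules file only if it is not already listed.
# $1 = module name, $2 = file to edit (defaults to /etc/modules).
ensure_module() {
    module="$1"
    file="${2:-/etc/modules}"
    # -q: no output, -x: match the whole line, -F: treat the name as a fixed string
    if ! grep -qxF "$module" "$file" 2>/dev/null; then
        printf '%s\n' "$module" >> "$file"
    fi
}
```

Run as root, `ensure_module ib_umad` followed by `ensure_module xprtrdma` would leave one line per module in /etc/modules.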
 
