Proxmox cluster over Infiniband issues

netrik

Guest
Hi,

We are running a 3-node Proxmox cluster over an InfiniBand network. We use iSCSI-based shared storage over IPoIB on that same IB network. When our backups run at night we sometimes get cluster issues, always on one specific node: the node drops out of the cluster, is fenced, and is rebooted.
The load on the Proxmox nodes is quite low, and traffic over the IB link is nowhere near capacity. Could this be a multicast issue?
Can we get more logging from the cluster stack so we have a better view of what is going on?

We see no errors in syslog or messages, and no errors from the InfiniBand subnet manager (opensm).
The Corosync log shows a configuration change, with the node leaving and rejoining:

Corosync log:
Apr 29 00:19:13 corosync [TOTEM ] A processor failed, forming new configuration.
Apr 29 00:19:25 corosync [CLM ] CLM CONFIGURATION CHANGE
Apr 29 00:19:25 corosync [CLM ] New Configuration:
Apr 29 00:19:25 corosync [CLM ] r(0) ip(10.100.2.102)
Apr 29 00:19:25 corosync [CLM ] r(0) ip(10.100.2.151)
Apr 29 00:19:25 corosync [CLM ] Members Left:
Apr 29 00:19:25 corosync [CLM ] r(0) ip(10.100.2.101)
Apr 29 00:19:25 corosync [CLM ] Members Joined:
Apr 29 00:19:25 corosync [QUORUM] Members[2]: 2 3
Apr 29 00:19:25 corosync [CLM ] CLM CONFIGURATION CHANGE
Apr 29 00:19:25 corosync [CLM ] New Configuration:
Apr 29 00:19:25 corosync [CLM ] r(0) ip(10.100.2.102)
Apr 29 00:19:25 corosync [CLM ] r(0) ip(10.100.2.151)
Apr 29 00:19:25 corosync [CLM ] Members Left:
Apr 29 00:19:25 corosync [CLM ] Members Joined:
Apr 29 00:19:25 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 29 00:19:25 corosync [CPG ] chosen downlist: sender r(0) ip(10.100.2.102) ; members(old:3 left:1)
Apr 29 00:19:25 corosync [MAIN ] Completed service synchronization, ready to provide service.
Apr 29 00:20:51 corosync [CLM ] CLM CONFIGURATION CHANGE
Apr 29 00:20:51 corosync [CLM ] New Configuration:
Apr 29 00:20:51 corosync [CLM ] r(0) ip(10.100.2.102)
Apr 29 00:20:51 corosync [CLM ] r(0) ip(10.100.2.151)
Apr 29 00:20:51 corosync [CLM ] Members Left:
Apr 29 00:20:51 corosync [CLM ] Members Joined:
Apr 29 00:20:51 corosync [CLM ] CLM CONFIGURATION CHANGE
Apr 29 00:20:51 corosync [CLM ] New Configuration:
Apr 29 00:20:51 corosync [CLM ] r(0) ip(10.100.2.101)
Apr 29 00:20:51 corosync [CLM ] r(0) ip(10.100.2.102)
Apr 29 00:20:51 corosync [CLM ] r(0) ip(10.100.2.151)
Apr 29 00:20:51 corosync [CLM ] Members Left:
Apr 29 00:20:51 corosync [CLM ] Members Joined:
Apr 29 00:20:51 corosync [CLM ] r(0) ip(10.100.2.101)
Apr 29 00:20:51 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 29 00:20:51 corosync [QUORUM] Members[3]: 1 2 3
Apr 29 00:20:51 corosync [QUORUM] Members[3]: 1 2 3
Apr 29 00:20:51 corosync [CPG ] chosen downlist: sender r(0) ip(10.100.2.102) ; members(old:2 left:0)
Apr 29 00:20:51 corosync [MAIN ] Completed service synchronization, ready to provide service.
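
To make the timing in the log above easier to read, here is a rough Python sketch that pulls out the TOTEM failure notice and the QUORUM membership lines and prints how long each step took. The function names and the `year` argument are my own assumptions (syslog timestamps carry no year), and the regular expressions are only fitted to the line format shown in this post.

```python
import re
from datetime import datetime

# A few lines copied from the Corosync log in this post.
LOG = """\
Apr 29 00:19:13 corosync [TOTEM ] A processor failed, forming new configuration.
Apr 29 00:19:25 corosync [QUORUM] Members[2]: 2 3
Apr 29 00:20:51 corosync [QUORUM] Members[3]: 1 2 3
Apr 29 00:20:51 corosync [QUORUM] Members[3]: 1 2 3
"""

# "<timestamp> corosync [<TAG>] <message>"; the tag may be space padded.
LINE_RE = re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) corosync \[(\w+) *\] (.*)$")
MEMBERS_RE = re.compile(r"^Members\[\d+\]: (.*)$")


def parse_stamp(stamp, year=2000):
    # The year is a placeholder assumption; only time differences matter here.
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def failure_times(text, year=2000):
    """Timestamps of 'A processor failed' notices from TOTEM."""
    times = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m and m.group(2) == "TOTEM" and m.group(3).startswith("A processor failed"):
            times.append(parse_stamp(m.group(1), year))
    return times


def quorum_changes(text, year=2000):
    """(timestamp, set of node ids) per QUORUM line, skipping exact repeats."""
    changes = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m or m.group(2) != "QUORUM":
            continue
        mm = MEMBERS_RE.match(m.group(3))
        if not mm:
            continue
        entry = (parse_stamp(m.group(1), year), {int(n) for n in mm.group(1).split()})
        if not changes or changes[-1] != entry:
            changes.append(entry)
    return changes


if __name__ == "__main__":
    failed = failure_times(LOG)
    changes = quorum_changes(LOG)
    detect_to_drop = (changes[0][0] - failed[0]).total_seconds()
    drop_to_rejoin = (changes[1][0] - changes[0][0]).total_seconds()
    print(f"failure noticed -> new membership: {detect_to_drop:.0f} s")
    print(f"node out of membership for: {drop_to_rejoin:.0f} s")
```

On the excerpt above this reports 12 seconds between the failure notice and the two-node membership, and 86 seconds until node 1 is back in the member list.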

Fenced log:
Apr 29 00:19:25 fenced fencing node servm1
Apr 29 00:19:41 fenced fence servm1 success
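
The same kind of sketch works for the fenced log, pairing each "fencing node" line with the matching result line to get the fence duration. Again, `fence_durations` and the `year` placeholder are hypothetical names of mine, and the patterns only cover the two line shapes shown above.

```python
import re
from datetime import datetime

# The two fenced lines from this post.
FENCED_LOG = """\
Apr 29 00:19:25 fenced fencing node servm1
Apr 29 00:19:41 fenced fence servm1 success
"""

START_RE = re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) fenced fencing node (\S+)$")
END_RE = re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) fenced fence (\S+) (\w+)$")


def _stamp(text, year):
    # Syslog lines carry no year, so one is assumed for parsing.
    return datetime.strptime(f"{year} {text}", "%Y %b %d %H:%M:%S")


def fence_durations(text, year=2000):
    """Return (node, result, seconds) for each completed fence operation."""
    started = {}
    results = []
    for line in text.splitlines():
        line = line.strip()
        m = START_RE.match(line)
        if m:
            started[m.group(2)] = _stamp(m.group(1), year)
            continue
        m = END_RE.match(line)
        if m and m.group(2) in started:
            node = m.group(2)
            seconds = (_stamp(m.group(1), year) - started.pop(node)).total_seconds()
            results.append((node, m.group(3), seconds))
    return results


if __name__ == "__main__":
    for node, result, seconds in fence_durations(FENCED_LOG):
        print(f"{node}: fence {result} after {seconds:.0f} s")
```

For the lines above this gives a 16 second fence of servm1, starting in the same second (00:19:25) that Corosync reported the two-node membership.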
 
Thanks Dietmar,

Will do and will let you know what I find.

Thanks!
 
