Proxmox cluster goes crazy

itvietnam

Hi,

My cluster suddenly rebooted in the morning and all hardware nodes went dead: VMs cannot start, the master node is dead, and the LRM service is stuck in wait_for_agent_lock.

VM status: fenced and cannot start

Firstly, may I know what could have led to this problem? Can the VMs not start because the master is dead (see attachment)?

I tried shutting down all nodes except one (node19) and changing the totem bindnetaddr back to a specific IP address instead of the network address:

Before (config_version 34):

totem {
  cluster_name: clustername
  config_version: 34
  interface {
    bindnetaddr: 10.10.30.0
    ringnumber: 0
  }
  interface {
    bindnetaddr: 10.20.30.0
    ringnumber: 1
  }
  ip_version: ipv4
  rrp_mode: passive
  secauth: on
  version: 2
}

After (config_version 35):

totem {
  cluster_name: clustername
  config_version: 35
  interface {
    bindnetaddr: 10.10.30.169
    ringnumber: 0
  }
  interface {
    bindnetaddr: 10.20.30.169
    ringnumber: 1
  }
  ip_version: ipv4
  rrp_mode: passive
  secauth: on
  version: 2
}

After editing and rebooting this node (node19), pvecm status only sees this node. I decided to start the remaining nodes back up, but they cannot see node19. It looks like a separate cluster.

Secondly, is there any way to get node19 back into this cluster? I tried copying a backup of corosync.conf over /etc/pve/corosync.conf (after pvecm e 1), but it did not succeed.
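Roughly what I tried, assuming the usual procedure when /etc/pve is read-only because of lost quorum (the backup path is just an example):
Code:
# on node19, make /etc/pve writable again despite the lost quorum
pvecm expected 1
# copy the backed-up config over the cluster config and restart corosync
cp /root/corosync.conf.bak /etc/pve/corosync.conf
systemctl restart corosync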

After nearly 10 hours of downtime, I had to disable all the ha-manager config and start the VPSes manually. I think this cluster has a cluster-wide problem (we have nearly 10 clusters, but this one has an incident every month).
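For each guest that looked roughly like this (VMID 100 is just an example):
Code:
# remove the HA resource so the stuck LRM no longer manages the guest
ha-manager remove vm:100
# then start it directly
qm start 100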

The ring1 network always has this error (multicast is enabled):

systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2018-10-17 17:42:38 +07; 4 days ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 3514 (corosync)
Tasks: 2 (limit: 23347)
Memory: 66.6M
CPU: 1h 44min 33.832s
CGroup: /system.slice/corosync.service
└─3514 /usr/sbin/corosync -f

Oct 21 22:23:55 node109 corosync[3514]: notice [TOTEM ] Retransmit List: 1aeccce
Oct 21 22:23:55 node109 corosync[3514]: [TOTEM ] Retransmit List: 1aeccce
Oct 22 01:11:10 node109 corosync[3514]: error [TOTEM ] Marking ringid 1 interface 10.20.30.159 FAULTY
Oct 22 01:11:10 node109 corosync[3514]: [TOTEM ] Marking ringid 1 interface 10.20.30.159 FAULTY
Oct 22 01:11:11 node109 corosync[3514]: notice [TOTEM ] Automatically recovered ring 1
Oct 22 01:11:11 node109 corosync[3514]: [TOTEM ] Automatically recovered ring 1
Oct 22 01:12:51 node109 corosync[3514]: error [TOTEM ] Marking ringid 1 interface 10.20.30.159 FAULTY
Oct 22 01:12:51 node109 corosync[3514]: [TOTEM ] Marking ringid 1 interface 10.20.30.159 FAULTY
Oct 22 01:12:52 node109 corosync[3514]: notice [TOTEM ] Automatically recovered ring 1
Oct 22 01:12:52 node109 corosync[3514]: [TOTEM ] Automatically recovered ring 1
 

Attachments

  • 2018-10-22_01-43-46.png
Hi,

Is there any way to force node19 to join back into the cluster?
 
This is the log I got from before the reboot (extracted from one node in the cluster).
 

Attachments

  • 2018-10-24_01-13-43.png
Is the time on all nodes correct?
Yes, they are correct and have the same date and time. They all use the same NTP server.

perhaps trouble with the switch (multicast)
I'm still working on it. But we have run the omping test many times (for more than 10 minutes) and the results are fine.
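The tests were along these lines (node names are just examples from this cluster):
Code:
# short multicast test, ~10000 packets at 1ms interval
omping -c 10000 -i 0.001 -F -q node17 node19 node109
# longer test (~10 minutes) to catch multicast querier timeouts
omping -c 600 -i 1 -q node17 node19 node109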

We have 2 ring networks defined in the corosync config (attachment). They are set up on separate switches, so how can both get corrupted at the same time?

  • ring0: 10Gbps SFP+
  • ring1: 1Gbps Ethernet
 

Attachments

  • 2018-10-24_02-10-13.png
Hi,
are both ring-networks on the same switch, or on different switches?

If I see the posting right, only two nodes are affected?

Is the content of /etc/pve/.members correct on all nodes?

Udo
 
Hi Udo,

are both ring-networks on the same switch, or on different switches?

They are running on different switches. You can see the attachment.
  • ring0 network running on a bond of 2 x 10Gbps SFP+
  • ring1 network running on a bond of 2 x 1Gbps RJ45.
If I see the posting right, only two nodes are affected?
11 nodes in the cluster rebooted at the same time.

Is the content of /etc/pve/.members correct on all nodes?

I see some differences in lines 2 and 3 on each node:
  • some nodes have the same version, 19,
  • and some nodes have a different number.
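A quick way to compare them across nodes (node names here are just examples from this cluster):
Code:
# dump /etc/pve/.members from a few nodes to compare the version field
for n in node09 node17 node19; do echo "== $n =="; ssh "$n" cat /etc/pve/.members; done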
 

Attachments

  • 2018-10-24_03-00-27.png
  • pve-members.png
I found some leads in the logs and via Google.
 

Attachments

  • 2018-10-24_01-13-43.png
  • 2018-10-24_03-44-10.png
Hi,
ok - are all nodes on the same 10Gb switch and the same 1Gb switch, or do you have different switches running for 10 + 1 Gb?
11 nodes in the cluster rebooted at the same time.
This happens because they all lost quorum and were self-fenced.
Had all nodes been rebooted with the ring0+ring1 config before?
This sounds to me like a network problem on the switches...
I see some differences in lines 2 and 3 on each node:
  • some nodes have the same version, 19,
  • and some nodes have a different number.
Not all nodes have quorum - so the nodes without quorum can have the lower number?! (Or, with expected votes set to 1, a single node can also have the higher number...)

But why don't all nodes have an IP address?
Can you post the content of /etc/pve/corosync.conf (from a node which is in the cluster, and from the nodes which are not in the cluster now)?

Udo
 
Hi, thanks for your time.

Hi,
ok - are all nodes on the same 10Gb switch and the same 1Gb switch, or do you have different switches running for 10 + 1 Gb?
No, we have separate dual switches: 2 x 10Gbps in active/passive mode for storage_public_network_access_to_CEPH_external/migration/backup/ring0. We do a full backup on the weekend; the bandwidth gets fully saturated and then the cluster reboots.

And we use a single SG300 for the ring1 network and the LAN network between all VMs.

do you have different switches running for 10 + 1 Gb?
We didn't mix 10Gbps and 1Gbps.

Had all nodes been rebooted with the ring0+ring1 config before?

Yes, the whole cluster usually rebooted at backup time when the network was saturated, so we decided to add the ring1 network. This was done many months ago.

We usually see this error on the ring0 network (sometimes ring1), but it recovers quickly, after 1 second. Is this normal, since the rings recover by themselves?

2018-10-25_00-14-27.png

This sounds to me like a network problem on the switches...

We tested with omping for 10 minutes and things were OK. Currently we plan to replace the Cisco SG300 with a Cisco 3560G, but we are not sure this is the root cause of this incident.

But why don't all nodes have an IP address?

I'm not sure I understand your question. Could you explain more?

The attachment files include info from node17 (not in the cluster):
  • HA manager status
  • corosync before the incident
  • corosync after changing the totem address and increasing config_version to 36
And from node09 (in the cluster now): corosync.conf
 

Attachments

  • corosync-running-cluster-now.txt
  • node17_corosync_after_incident.txt
  • node17_corosync_before_incident.txt
  • node17-ha-manager-status.txt
From qm list on node17 we can see all VMs are stopped now; maybe there is no quorum, so they cannot start. Currently we have shut down all switch ports of this node except the management port, for debug access only.

2018-10-25_00-36-32.png

Can we avoid split-brain VMs when we join this node back into the cluster? Or what do you recommend as the safest way for this node:
  • add this node back to the cluster with the pvecm add -force option
  • or format and reinstall with a new name
  • or format and reinstall with the same name
 
Hi,
IMHO you don't need a rejoin or a new installation, because the node is still in the cluster - it just has issues joining the cluster communication.

I would first stop the automatic fencing on all nodes, to avoid the nodes rebooting due to a quorum loss:
Code:
systemctl stop pve-ha-lrm.service pve-ha-crm.service
What happens if you restart corosync on the node which left the cluster?
Code:
systemctl restart corosync
Check with pvecm (for the GUI it's perhaps necessary to restart further daemons).
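For example, something along these lines (daemon names from a standard PVE install; adjust as needed):
Code:
# check cluster membership and quorum after the corosync restart
pvecm status
# if the GUI still shows the node wrong, restart the PVE daemons
systemctl restart pve-cluster pvedaemon pveproxy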

Udo
 
systemctl restart corosync
If we restart the corosync service, will this node pick up the latest corosync.conf from the cluster and join back? It has an out-of-date corosync config version now:
  • this node: config_version is 36
  • running cluster: config_version is 37 now.
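This is roughly how I am comparing the versions (standard paths assumed):
Code:
# on node17 (out of the cluster): the local corosync config
grep config_version /etc/corosync/corosync.conf
# on a quorate node: the cluster-wide config
grep config_version /etc/pve/corosync.conf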
Thanks,
 
