Proxmox cluster goes crazy

itvietnam

Hi,

My cluster suddenly rebooted in the morning and all hardware nodes went dead: VMs cannot start, the master node is dead, and the LRM service is stuck in wait_for_agent_lock.

VM status: fenced and cannot start

Firstly, may I know what could have led to this problem? Can the VMs not start because the master is dead (see attachment)?

I tried shutting down all nodes except one (node19) and changing the totem bindnetaddr back to a specific IP address instead of the network address:

Before (config_version 34):

totem {
  cluster_name: clustername
  config_version: 34
  interface {
    bindnetaddr: 10.10.30.0
    ringnumber: 0
  }
  interface {
    bindnetaddr: 10.20.30.0
    ringnumber: 1
  }
  ip_version: ipv4
  rrp_mode: passive
  secauth: on
  version: 2
}

After (config_version 35):

totem {
  cluster_name: clustername
  config_version: 35
  interface {
    bindnetaddr: 10.10.30.169
    ringnumber: 0
  }
  interface {
    bindnetaddr: 10.20.30.169
    ringnumber: 1
  }
  ip_version: ipv4
  rrp_mode: passive
  secauth: on
  version: 2
}

After editing and rebooting this node (node19), pvecm status only sees this node. I decided to start the remaining nodes back up, but they cannot see node19. It looks like a separate cluster.

Secondly, is there any way to get node19 back into this cluster? I tried copying a backup of corosync.conf over /etc/pve/corosync.conf (after pvecm e 1), but it did not succeed.
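Roughly what I tried, assuming the usual procedure when /etc/pve is read-only because of lost quorum (the backup path is just an example):
Code:
# on node19, make /etc/pve writable again despite the lost quorum
pvecm expected 1
# copy the backed-up config over the cluster config and restart corosync
cp /root/corosync.conf.bak /etc/pve/corosync.conf
systemctl restart corosync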

After nearly 10 hours of downtime, I had to disable all the ha-manager config and start the VPSes manually. I think this cluster has a cluster-wide problem (we have nearly 10 clusters, but this one has an incident every month).
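For each guest that looked roughly like this (VMID 100 is just an example):
Code:
# remove the HA resource so the stuck LRM no longer manages the guest
ha-manager remove vm:100
# then start it directly
qm start 100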

The ring1 network always has this error (multicast is enabled):

systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2018-10-17 17:42:38 +07; 4 days ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 3514 (corosync)
Tasks: 2 (limit: 23347)
Memory: 66.6M
CPU: 1h 44min 33.832s
CGroup: /system.slice/corosync.service
└─3514 /usr/sbin/corosync -f

Oct 21 22:23:55 node109 corosync[3514]: notice [TOTEM ] Retransmit List: 1aeccce
Oct 21 22:23:55 node109 corosync[3514]: [TOTEM ] Retransmit List: 1aeccce
Oct 22 01:11:10 node109 corosync[3514]: error [TOTEM ] Marking ringid 1 interface 10.20.30.159 FAULTY
Oct 22 01:11:10 node109 corosync[3514]: [TOTEM ] Marking ringid 1 interface 10.20.30.159 FAULTY
Oct 22 01:11:11 node109 corosync[3514]: notice [TOTEM ] Automatically recovered ring 1
Oct 22 01:11:11 node109 corosync[3514]: [TOTEM ] Automatically recovered ring 1
Oct 22 01:12:51 node109 corosync[3514]: error [TOTEM ] Marking ringid 1 interface 10.20.30.159 FAULTY
Oct 22 01:12:51 node109 corosync[3514]: [TOTEM ] Marking ringid 1 interface 10.20.30.159 FAULTY
Oct 22 01:12:52 node109 corosync[3514]: notice [TOTEM ] Automatically recovered ring 1
Oct 22 01:12:52 node109 corosync[3514]: [TOTEM ] Automatically recovered ring 1
 

Attachments

  • 2018-10-22_01-43-46.png
Hi,

Is there any way to force node19 to join back into the cluster?
 
This is the log I got from before the reboot (extracted from one node in the cluster).
 

Attachments

  • 2018-10-24_01-13-43.png
Is the time on all nodes correct?
Yes, they are correct and have the same date and time. They all use the same NTP server.

perhaps trouble with the switch (multicast)
I'm still working on it. But we have run the omping test many times (for more than 10 minutes) and the results are fine.
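The tests were along these lines (node names are just examples from this cluster):
Code:
# short multicast test, ~10000 packets at 1ms interval
omping -c 10000 -i 0.001 -F -q node17 node19 node109
# longer test (~10 minutes) to catch multicast querier timeouts
omping -c 600 -i 1 -q node17 node19 node109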

We have 2 ring networks defined in the corosync config (attachment). They are set up on separate switches, so how can both get corrupted at the same time?

  • ring0: 10Gbps SFP+
  • ring1: 1Gbps Ethernet
 

Attachments

  • 2018-10-24_02-10-13.png
Hi,
are both ring-networks on the same switch, or on different switches?

If I see the posting right, only two nodes are affected?

Is the content of /etc/pve/.members correct on all nodes?

Udo
 
Hi Udo,

are both ring-networks on the same switch, or on different switches?

They are running on different switches. You can see the attachment.
  • ring0 network running on a bond of 2 x 10Gbps SFP+
  • ring1 network running on a bond of 2 x 1Gbps RJ45.
If I see the posting right, only two nodes are affected?
11 nodes in the cluster rebooted at the same time.

Is the content of /etc/pve/.members correct on all nodes?

I see some differences in lines 2 and 3 on each node:
  • some nodes have the same version, 19,
  • and some nodes have a different number.
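A quick way to compare them across nodes (node names here are just examples from this cluster):
Code:
# dump /etc/pve/.members from a few nodes to compare the version field
for n in node09 node17 node19; do echo "== $n =="; ssh "$n" cat /etc/pve/.members; done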
 

Attachments

  • 2018-10-24_03-00-27.png
  • pve-members.png
I found some leads in the logs and via Google.
 

Attachments

  • 2018-10-24_01-13-43.png
  • 2018-10-24_03-44-10.png
Hi,
ok - are all nodes on the same 10Gb switch and the same 1Gb switch, or do you have different switches running for 10 + 1 Gb?
11 nodes in the cluster rebooted at the same time.
This happens because they all lost quorum and were self-fenced.
Had all nodes been rebooted with the ring0+ring1 config before?
This sounds to me like a network problem on the switches...
I see some differences in lines 2 and 3 on each node:
  • some nodes have the same version, 19,
  • and some nodes have a different number.
Not all nodes have quorum - so the nodes without quorum can have the lower number?! (Or, with expected votes set to 1, a single node can also have the higher number...)

But why don't all nodes have an IP address?
Can you post the content of /etc/pve/corosync.conf (from a node which is in the cluster, and from the nodes which are not in the cluster now)?

Udo
 
Hi, thanks for your time.

Hi,
ok - are all nodes on the same 10Gb switch and the same 1Gb switch, or do you have different switches running for 10 + 1 Gb?
No, we have separate dual switches: 2 x 10Gbps in active/passive mode for storage_public_network_access_to_CEPH_external/migration/backup/ring0. We do a full backup on the weekend; the bandwidth gets fully saturated and then the cluster reboots.

And we use a single SG300 for the ring1 network and the LAN network between all VMs.

do you have different switches running for 10 + 1 Gb?
We didn't mix 10Gbps and 1Gbps.

Had all nodes been rebooted with the ring0+ring1 config before?

Yes, the whole cluster usually rebooted at backup time when the network was saturated, so we decided to add the ring1 network. This was done many months ago.

We usually see this error on the ring0 network (sometimes ring1), but it recovers quickly, after 1 second. Is this normal, since the rings recover by themselves?

2018-10-25_00-14-27.png

This sounds to me like a network problem on the switches...

We tested with omping for 10 minutes and things were OK. Currently we plan to replace the Cisco SG300 with a Cisco 3560G, but we are not sure this is the root cause of this incident.

But why don't all nodes have an IP address?

I'm not sure I understand your question. Could you explain more?

The attachment files include info from node17 (not in the cluster):
  • HA manager status
  • corosync before the incident
  • corosync after changing the totem address and increasing config_version to 36
And from node09 (in the cluster now): corosync.conf
 

Attachments

  • corosync-running-cluster-now.txt
  • node17_corosync_after_incident.txt
  • node17_corosync_before_incident.txt
  • node17-ha-manager-status.txt
From qm list on node17 we can see all VMs are stopped now; maybe there is no quorum, so they cannot start. Currently we have shut down all switch ports of this node except the management port, for debug access only.

2018-10-25_00-36-32.png

Can we avoid split-brain VMs when we join this node back into the cluster? Or what do you recommend as the safest way for this node:
  • add this node back to the cluster with the pvecm add -force option
  • or format and reinstall with a new name
  • or format and reinstall with the same name
 
Hi,
IMHO you don't need a rejoin or a new installation, because the node is still in the cluster - it just has issues joining the cluster communication.

I would first stop the automatic fencing on all nodes, to avoid the nodes rebooting due to a quorum loss:
Code:
systemctl stop pve-ha-lrm.service pve-ha-crm.service
What happens if you restart corosync on the node which left the cluster?
Code:
systemctl restart corosync
Check with pvecm (for the GUI it's perhaps necessary to restart further daemons).
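For example, something along these lines (daemon names from a standard PVE install; adjust as needed):
Code:
# check cluster membership and quorum after the corosync restart
pvecm status
# if the GUI still shows the node wrong, restart the PVE daemons
systemctl restart pve-cluster pvedaemon pveproxy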

Udo
 
systemctl restart corosync
If we restart the corosync service, will this node pick up the latest corosync.conf from the cluster and join back? It has an out-of-date corosync config version now:
  • this node: config_version is 36
  • running cluster: config_version is 37 now.
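This is roughly how I am comparing the versions (standard paths assumed):
Code:
# on node17 (out of the cluster): the local corosync config
grep config_version /etc/corosync/corosync.conf
# on a quorate node: the cluster-wide config
grep config_version /etc/pve/corosync.conf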
Thanks,
 
