Proxmox 4.4 all nodes rebooted

Volker Lieder

Well-Known Member
Nov 6, 2017
Hi Folks,
we have a Proxmox cluster of 4 nodes, installed with Proxmox 4.4.
While migrating VMs from node01 to another node, all hardware nodes rebooted, and I can't find a hint in any logfile about what happened. As storage we have configured Ceph with 25 OSDs and 4 monitors (now reduced to 3).
Any idea where I can look to find out what happened?
We think node01 had a problem and sent some information to node02-04 that made them reboot, too.
But we have a quorum of 3, so it shouldn't matter if node01 goes down. Or am I thinking wrong?
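For reference, this is roughly how we check the quorum state on the nodes (field names from memory, the exact output may differ):

  # run on any cluster node
  pvecm status
  # look for "Expected votes", "Total votes" and "Quorate: Yes"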

For further information: we are in the process of upgrading the cluster to 5.1; node02-04 are already on Ceph Jewel, node01 should follow this evening. Could the problem be that it is the last node on "hammer"?

Regards,
Volker
 
Could the problem be that it is the last node on "hammer"?
That shouldn't be the problem. Is your corosync running on a separate dedicated network or does it share its resources?
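If it is not separated, it would also be worth checking the ring and quorum state on each node; a quick sketch using the standard corosync tools (nothing specific to your setup assumed):

  # show the status of the corosync ring(s) on this node
  corosync-cfgtool -s
  # show membership and quorum details
  corosync-quorumtool -s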
 
It runs on our management network, which carries no traffic other than our access to the web GUI of the Proxmox cluster. As there are normally never more than two such sessions at the same time, I think this is not really noticeable in terms of traffic. Ceph runs on a dedicated InfiniBand network.
 
What does this "totem" section in the corosync config mean?
It always contains the IP of node01:

totem {
  cluster_name: uCloud
  config_version: 8
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 172.16.70.11
    ringnumber: 0
  }
}

Shouldn't it be 172.16.70.0 on a /24 network and not the IP of node01?

Regards,
Volker
 
@Mark B., are you working with Volker on the same issue? If not, please open up a new thread.

man corosync.conf
bindnetaddr
This specifies the network address the corosync executive should bind to.

bindnetaddr should be an IP address configured on the system, or a network address.

For example, if the local interface is 192.168.5.92 with netmask 255.255.255.0, you should set bindnetaddr to 192.168.5.92 or 192.168.5.0. If the local interface is 192.168.5.92 with netmask 255.255.255.192, set bindnetaddr to 192.168.5.92 or 192.168.5.64, and so forth.
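So with a /24 either form should be fine. As a minimal sketch, the interface section with the network address instead of the node IP (assuming your subnet really is 172.16.70.0/24) would look like:

  interface {
    bindnetaddr: 172.16.70.0
    ringnumber: 0
  }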
 
What is your migration network setting? Maybe it sits on the management network; this could lead to the cluster traffic being interrupted.

man pvecm
Migration Network
By default, Proxmox VE uses the network in which cluster communication takes place to send the migration traffic. This is not optimal because sensitive cluster traffic can be disrupted and this network may not have the best bandwidth available on the node.

Setting the migration network parameter allows the use of a dedicated network for the entire migration traffic. In addition to the memory, this also affects the storage traffic for offline migrations.

The migration network is set as a network in CIDR notation. This has the advantage that you do not have to set individual IP addresses for each node. Proxmox VE can determine the real address on the destination node from the network specified in the CIDR form. To enable this, the network must be specified so that each node has one, but only one IP in the respective network.
 
Hi,
we also saw that, before all nodes rebooted, the network device was fully saturated by the VM migration. We now use InfiniBand for migration traffic and a dedicated LAN for corosync. I think this issue is resolved now.
Regards
Volker
 
