Proxmox 4.4 all nodes rebooted

Volker Lieder

Well-Known Member
Nov 6, 2017
Hi Folks,
we have a Proxmox cluster of 4 nodes, installed with Proxmox 4.4.
While migrating VMs from node01 to another node, all hardware nodes rebooted, and I can't find a hint in any logfile about what happened. As storage we have configured Ceph with 25 OSDs and 4 monitors (now reduced to 3).
Any idea where I can look to find out what happened?
We think node01 had a problem and sent some information to node02-04 that made them reboot, too.
But we have a quorum of 3, so it shouldn't matter if node01 goes down. Or am I thinking wrong?
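For reference, this is roughly how we check the quorum state on the nodes (field names from memory, the exact output may differ):

  # run on any cluster node
  pvecm status
  # look for "Expected votes", "Total votes" and "Quorate: Yes"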

For further information: we are in the process of upgrading the cluster to 5.1; node02-04 are already on Ceph Jewel, node01 should follow this evening. Could the problem be that it is the last node on "hammer"?

Regards,
Volker
 
Could the problem be that it is the last node on "hammer"?
That shouldn't be the problem. Is your corosync running on a separate dedicated network or does it share its resources?
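If it is not separated, it would also be worth checking the ring and quorum state on each node; a quick sketch using the standard corosync tools (nothing specific to your setup assumed):

  # show the status of the corosync ring(s) on this node
  corosync-cfgtool -s
  # show membership and quorum details
  corosync-quorumtool -s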
 
It runs on our management network, which carries no traffic other than our access to the web GUI of the Proxmox cluster. As there are normally never more than two such sessions at the same time, I think this is not really noticeable in terms of traffic. Ceph runs on a dedicated InfiniBand network.
 
What does this "totem" section in the corosync config mean?
It always contains the IP of node01:

totem {
  cluster_name: uCloud
  config_version: 8
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 172.16.70.11
    ringnumber: 0
  }
}

Shouldn't it be 172.16.70.0 on a /24 network and not the IP of node01?

Regards,
Volker
 
@Mark B., are you working with Volker on the same issue? If not, please open up a new thread.

man corosync.conf
bindnetaddr
This specifies the network address the corosync executive should bind to.

bindnetaddr should be an IP address configured on the system, or a network address.

For example, if the local interface is 192.168.5.92 with netmask 255.255.255.0, you should set bindnetaddr to 192.168.5.92 or 192.168.5.0. If the local interface is 192.168.5.92 with netmask 255.255.255.192, set bindnetaddr to 192.168.5.92 or 192.168.5.64, and so forth.
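So with a /24 either form should be fine. As a minimal sketch, the interface section with the network address instead of the node IP (assuming your subnet really is 172.16.70.0/24) would look like:

  interface {
    bindnetaddr: 172.16.70.0
    ringnumber: 0
  }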
 
What is your migration network setting? Maybe it sits on the management network; this could lead to the cluster traffic being interrupted.

man pvecm
Migration Network
By default, Proxmox VE uses the network in which cluster communication takes place to send the migration traffic. This is not optimal because sensitive cluster traffic can be disrupted and this network may not have the best bandwidth available on the node.

Setting the migration network parameter allows the use of a dedicated network for the entire migration traffic. In addition to the memory, this also affects the storage traffic for offline migrations.

The migration network is set as a network in CIDR notation. This has the advantage that you do not have to set individual IP addresses for each node. Proxmox VE can determine the real address on the destination node from the network specified in the CIDR form. To enable this, the network must be specified so that each node has one, but only one IP in the respective network.
 
Hi,
we also saw that, before all nodes rebooted, the network device was fully saturated by the VM migration. We now use InfiniBand for migration traffic and a dedicated LAN for corosync. I think this issue is resolved now.
Regards
Volker
 
