Reboot Issue

Fumareddu · Mar 16, 2024

Hello everyone,

Thank you for this beautiful project.

I am having a really peculiar issue with a Proxmox installation.

The setup is a two-server cluster plus a Raspberry Pi as QDevice for quorum (Proxmox version 8.03). The two servers are Dell R730xd, so it's enterprise hardware. We set up replication and HA, and everything worked fine for a couple of months. After a while, though, we got messages from Zabbix that the only VM running on node 2 had gone offline and come back online. We checked the node 2 logs and, sure enough, out of the blue we saw --Reboot-- and learned that the node had apparently rebooted itself. The VM that was running on it migrated correctly to node 1, and back to node 2 as soon as it came back online.

This kept happening at random, sometimes at short intervals (around 2 days), sometimes weeks apart.

The cluster is in production, but the issue has only shown up outside of working hours, with minimal load on the node.

We thought that replication could somehow be causing the issue during backup hours (backups start outside of working hours), so we tried turning it off during those hours. No change.

The cluster has its own dedicated NIC, with a direct cable between the nodes, for replication.

It doesn't seem to be a hardware issue, because the node passes all its self-tests during reboot and restarts Proxmox correctly. I reckon something like this could be related to a CPU or RAM issue, but then we would expect it to hang during POST.
We also thought of a power supply issue, but again: the node has a redundant PSU and is connected to the same UPS as node 1, which has never rebooted. The NUT server didn't report any power loss either.

We tried to investigate the logs of node 2, but apart from the --Reboot-- message nothing stands out.

If we inspect the logs on node 1, we can confirm that quorum is preserved and that the decision is made to migrate the VM from node 2 to node 1 and back, but that's it.
Here is a pastebin: https://pastebin.com/Y3VizsPd

We were thinking of getting a subscription, given these are production servers anyway, and trying an update to 8.1, but then again it only happens on one of the nodes and the hardware is identical.

We tried researching a bit here and there, but with no luck.

Does anyone have suggestions?

jsterr · Mar 17, 2024

Please post your /etc/network/interfaces file and your /etc/pve/corosync.conf. This is most likely a corosync issue.
Do you have corosync running on its own NIC?

Fumareddu · Mar 18, 2024

jsterr said:
Please post your /etc/network/interfaces file and your /etc/pve/corosync.conf. This is most likely a corosync issue.
Do you have corosync running on its own NIC?

Hey, thank you for your reply.
Corosync is running on the cluster network for the most part. Only the QDevice is on the management network; see below for why.

Here is /etc/network/interfaces:

Code:

auto lo
iface lo inet loopback

iface eno1 inet manual

iface eno2 inet manual

iface eno3 inet manual

iface eno4 inet manual

auto vmbr0
iface vmbr0 inet static
        address 192.168.29.30/24
        gateway 192.168.29.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
#Management network

auto vmbr1
iface vmbr1 inet static
        address 10.99.99.3/24
        bridge-ports eno2
        bridge-stp off
        bridge-fd 0
#Cluster Network

auto vmbr2
iface vmbr2 inet static
        address 10.99.10.3/24
        bridge-ports eno3
        bridge-stp off
        bridge-fd 0
#Backup Network

And corosync.conf:

Code:

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pvea1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.99.99.2
  }
  node {
    name: pvea2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.99.99.3
  }
}

quorum {
  device {
    model: net
    net {
      algorithm: ffsplit
      host: 192.168.29.4
      tls: on
    }
    votes: 1
  }
  provider: corosync_votequorum
}

totem {
  cluster_name: *************
  config_version: 4
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

192.168.29.x is our management network; the users network is a VLAN child of the management network on the same NIC.
10.99.99.x is the cluster network; it is a Linux bridge between the two nodes via a dedicated NIC on each one.
10.99.10.x is a bridge to a NAS for backup purposes, again on a dedicated NIC.
Quorum is:
- Node 1: 1 vote
- Node 2: 1 vote
- QDevice: 1 vote
  - The QDevice lives in the management network but is able to talk to the cluster network. The reason is that it is also our NUT server and handles the UPS.

Do you think the QDevice being on the management NIC is causing the issue? If so, why only for node 2? And why never under high network load?
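
The vote arithmetic above can be sketched as follows. This is a minimal illustration that assumes the usual strict-majority rule (a partition is quorate when it holds more than half of the expected votes). The function names are made up for this example, and it does not model corosync's ffsplit algorithm or any timing behaviour.

```python
# Votes as listed in the post: both nodes and the QDevice carry one vote each.
VOTES = {"pvea1": 1, "pvea2": 1, "qdevice": 1}


def quorum_threshold(expected_votes):
    """Smallest number of votes that is a strict majority."""
    return expected_votes // 2 + 1


def is_quorate(members, votes=VOTES):
    """True if the given group of members holds a majority of all votes."""
    expected = sum(votes.values())
    visible = sum(votes[m] for m in members)
    return visible >= quorum_threshold(expected)


if __name__ == "__main__":
    for partition in ({"pvea2"}, {"pvea1", "qdevice"}, {"pvea1", "pvea2"}):
        print(sorted(partition), "quorate:", is_quorate(partition))
```

With three expected votes the threshold is two, so a node that can reach neither its peer nor the QDevice is left with one vote and is not quorate, while the other node together with the QDevice still is.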

jsterr · Mar 18, 2024

Hello! Not sure, but you do not need a vmbr to use corosync. Just put the IP directly on eno2 and reduce unneeded complexity and network layers. Also, please configure a second ring; you don't have any redundancy if your eno2 fails.

I would also put the QDevice in the same network. Just put two IPs on your QDevice port.
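
To check how many corosync links each node currently has, something like the following rough sketch could be used. It assumes the configuration text has the same shape as the corosync.conf posted above (flat node { ... } blocks with ringN_addr keys). The helper names are hypothetical, and this is a plain text scan, not a full corosync.conf parser.

```python
import re

# Nodelist copied from the corosync.conf posted earlier in the thread.
SAMPLE = """
nodelist {
  node {
    name: pvea1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.99.99.2
  }
  node {
    name: pvea2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.99.99.3
  }
}
"""


def ring_addresses(conf_text):
    """Map each node name to the list of ringN_addr values in its block."""
    nodes = {}
    # A node block contains no nested braces, so [^{}]* captures its body.
    for block in re.findall(r"\bnode\s*\{([^{}]*)\}", conf_text):
        name = re.search(r"\bname:\s*(\S+)", block)
        if name:
            nodes[name.group(1)] = re.findall(r"ring\d+_addr:\s*(\S+)", block)
    return nodes


def single_link_nodes(conf_text):
    """Names of nodes that have fewer than two ring addresses configured."""
    return sorted(n for n, addrs in ring_addresses(conf_text).items() if len(addrs) < 2)


if __name__ == "__main__":
    print(ring_addresses(SAMPLE))
    print("nodes without a redundant link:", single_link_nodes(SAMPLE))
```

For the configuration posted above, both nodes show up with a single ring0_addr, which is the missing redundancy mentioned in this reply.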

Fumareddu · Mar 18, 2024

jsterr said:
Hello! Not sure, but you do not need a vmbr to use corosync. Just put the IP directly on eno2 and reduce unneeded complexity and network layers. Also, please configure a second ring; you don't have any redundancy if your eno2 fails.

I would also put the QDevice in the same network. Just put two IPs on your QDevice port.

Hey, thank you for your reply.
I created Linux bridges because I read somewhere that this was the way to go for connecting two nodes directly with a network cable. Maybe I dreamt it? So you're saying I should just assign the address to the NIC itself, in the same subnet, and be done with it? I can try that, but it might take a while: since this is in production, it's not something I would do remotely or during working hours. I might have to wait for the right chance to make this change on site during a weekend.

What I can do more easily is configure a second direct link between the two nodes as a backup for corosync, using the spare NICs. If a NIC were slightly defective, that might also explain the issue. It should also work around the previous issue. Is there documentation on how to do it? I assume I add a ring entry, set up IPs on the ports and bump the config version by one?

About the QDevice: it is on the main network and connected to the main switch, not directly to the nodes, and not handled by a NIC on the cluster network. It was the first thing we thought about, but one question remains:
If it were the weak link, shouldn't it be the member that shows up as disconnected, instead of node 2, which is directly connected to node 1?

Not to mention that the reboot issue only happens outside working hours, sometimes even outside backup hours, when load on both network and system is minimal.

This has been going on for 3 months now, and that has been the only consistent fact about it.

BobhWasatch · Mar 18, 2024

Fumareddu said:
I created Linux bridges because I read somewhere that this was the way to go for connecting two nodes directly with a network cable. Maybe I dreamt it?

For 100 Mbit and slower you needed a switch or hub or crossover cable to connect two ports together so the transmit and receive pairs were connected correctly. Gigabit and higher will figure that out for themselves. You never needed a Linux bridge for that though.

Don't know about the rest of your problem though, sorry.

emunt6 · Mar 18, 2024

Do not bridge two interfaces together; it will create an L2 loop.
A better solution is to use a bond interface (master/slave mode) instead of a bridge.

Code:

Example:

Logical topo:

node1          node2
bond0 <------> bond0

Physical topo:

node1               node2
nic1_port0 <------> nic1_port0
nic2_port0 <------> nic2_port0

Fumareddu · Mar 19, 2024

emunt6 said:
Do not bridge two interfaces together; it will create an L2 loop.
A better solution is to use a bond interface (master/slave mode) instead of a bridge.

Thank you,
I will do that as soon as I am on site.

Fumareddu · Jun 6, 2024

Hey,
I know it's been a long time, but I figured I would leave our findings here in case they help anyone who stumbles upon this.

In the end the solution wasn't anything suggested in this thread (although I'm still thankful for those suggestions, as they pushed us towards a more optimized way of handling corosync).

After several tries, what really gave us an idea was the fact that the reboot, although happening on random days, was always around the same hour.

That's the time we had scheduled our backups. What's even stranger is that on most days the backup would complete without issue, which threw us off. But after really digging into it, by cross-checking the exact reboot times against the backup logs, we realized that the event always happened between 95% and 97% of the backup of one specific VM, always the same one.
We never figured out why this would occur only on random days, nor the actual fix for the issue, because luckily for us the VM in question was due to be decommissioned in a few days anyway.
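
The cross-check described above can be approximated with a small script. This is only a sketch under stated assumptions: the reboot times and backup intervals have already been extracted from the journal and the backup logs by hand, each backup is recorded as (VM id, start, time of its last log line), and all VM ids and timestamps below are invented sample values, not data from this cluster.

```python
from collections import Counter
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"
# Tolerance between the last backup log line and the reboot (an assumption).
SLACK = timedelta(minutes=2)


def ts(text):
    """Parse a 'YYYY-MM-DD HH:MM:SS' timestamp."""
    return datetime.strptime(text, FMT)


def in_progress(reboot, backups, slack=SLACK):
    """VM ids whose backup was still producing log lines around the reboot."""
    return [
        vmid
        for vmid, start, last_seen in backups
        if start <= reboot <= last_seen + slack
    ]


def suspects(reboots, backups):
    """Count how often each VM's backup was running when a reboot happened."""
    counts = Counter()
    for reboot in reboots:
        counts.update(in_progress(reboot, backups))
    return counts


# Hypothetical sample data: (vm id, backup start, last log line seen).
BACKUPS = [
    ("vm-100", ts("2024-01-10 22:00:00"), ts("2024-01-10 22:40:00")),
    ("vm-101", ts("2024-01-10 22:40:05"), ts("2024-01-10 23:54:30")),
    ("vm-100", ts("2024-01-17 22:00:00"), ts("2024-01-17 22:41:00")),
    ("vm-101", ts("2024-01-17 22:41:05"), ts("2024-01-17 23:57:10")),
]
REBOOTS = [ts("2024-01-10 23:55:00"), ts("2024-01-17 23:58:00")]

if __name__ == "__main__":
    print(suspects(REBOOTS, BACKUPS).most_common())
```

A VM that turns up for every reboot, as the single VM did in our case, is the one worth looking at more closely.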

Since we shut it down and removed it from the backup jobs we haven't had a single reboot event, and it's now been more than 2 months. Considering we had at least one a week, on a random day, I would say the matter seems to be resolved.

The VM in question was nothing special: a Thinstuff terminal server with Windows 10 Pro installed and 64 GB of RAM. It never gave an issue during working hours; it only triggered the reboot at 95-97% of its backup. Go figure.
