Reboot Issue

Fumareddu

New Member
Oct 6, 2023
Hello everyone,

Thank you for this beautiful project.

I am having a really peculiar issue with a Proxmox installation.

The setup is a two-server cluster plus a Raspberry Pi as a QDevice for quorum (Proxmox VE 8.0.3). The two servers are Dell R730xd machines, so it's enterprise hardware. We have replication and HA set up, and everything worked fine for about a couple of months. After a while, though, we got messages from Zabbix that the only VM running on node 2 had gone offline and come back online. We checked the logs on node 2 and, sure enough, out of the blue there was a --Reboot-- entry: apparently the node had rebooted itself. The VM that was running on it migrated correctly to node 1 and back to node 2 as soon as the node came back online.

This kept happening at random, sometimes at short intervals (like 2 days), sometimes with weeks between events.

The cluster is in production, but the issue has only manifested outside of working hours, with minimal load on the node.

We thought that replication could somehow be causing the issue during backup hours (backups start outside of working hours) and tried turning it off during those hours. No change.

The cluster has its own dedicated NIC with a direct cable between the nodes for replication.

It doesn't seem to be a hardware issue: the node passes all its self-tests during reboot and starts Proxmox correctly. I reckon something like this could be related to a CPU or RAM problem, but then it should hang during POST.
We also thought of a power supply issue, but again: the node has a redundant PSU and is connected to the same UPS as node 1, which has never rebooted. Also, the NUT server didn't report any power loss.
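(For anyone double-checking the hardware angle on an R730xd: the iDRAC keeps its own system event log, which can be read over IPMI. ipmitool isn't mentioned anywhere in this thread, it's just one generic way to do it.)
Code:
# query the BMC/iDRAC system event log for power, thermal or ECC events
ipmitool sel elist

# kernel-side hardware error reports from the previous boot (MCE/EDAC)
journalctl -b -1 -k | grep -iE 'mce|hardware error|edac'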

We tried to investigate the logs on node 2, but apart from the --Reboot-- message nothing stands out.

If we inspect the logs on node 1, we can confirm that quorum is preserved and that the decision is made to migrate the VM from node 2 to node 1 and back, but that's it.
Here is a pastebin: https://pastebin.com/Y3VizsPd
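(Worth keeping in mind when reading those logs: with HA enabled, a node that loses corosync quorum for roughly a minute while it still owns HA resources gets fenced by the watchdog, i.e. it hard-resets without writing any shutdown messages, which would look exactly like a bare --Reboot-- line. A few generic places to look on node 2, assuming a persistent systemd journal; nothing here is specific to this setup.)
Code:
# tail end of the previous boot's journal: the last messages before the reset
journalctl -b -1 -e

# corosync membership/link events and HA manager activity around the incident
journalctl -b -1 -u corosync -u pve-ha-lrm -u pve-ha-crm

# confirm a watchdog is actually armed (Proxmox HA loads softdog by default)
lsmod | grep softdog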

We were thinking of getting a subscription anyway, given these are production servers, and trying the update to 8.1, but then again the issue only happens on one of the nodes and the hardware is identical on both.

We tried researching a bit here and there, but with no luck.

Does anyone have suggestions?
 
Please post your /etc/network/interfaces file and your /etc/pve/corosync.conf; this is most likely a corosync issue.
Do you have corosync running on its own NIC?
 
Hey, thank you for your reply.
Corosync is running on the cluster network for the most part; only the QDevice is on the management network. Please read below for the reason why.

Here are interfaces:
Code:
auto lo
iface lo inet loopback

iface eno1 inet manual

iface eno2 inet manual

iface eno3 inet manual

iface eno4 inet manual

auto vmbr0
iface vmbr0 inet static
        address 192.168.29.30/24
        gateway 192.168.29.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
#Management network

auto vmbr1
iface vmbr1 inet static
        address 10.99.99.3/24
        bridge-ports eno2
        bridge-stp off
        bridge-fd 0
#Cluster Network

auto vmbr2
iface vmbr2 inet static
        address 10.99.10.3/24
        bridge-ports eno3
        bridge-stp off
        bridge-fd 0
#Backup Network
and Corosync:

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pvea1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.99.99.2
  }
  node {
    name: pvea2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.99.99.3
  }
}

quorum {
  device {
    model: net
    net {
      algorithm: ffsplit
      host: 192.168.29.4
      tls: on
    }
    votes: 1
  }
  provider: corosync_votequorum
}

totem {
  cluster_name: *************
  config_version: 4
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

  • 192.168.29.x is our management network; the users' network is a VLAN child of the management network on the same NIC.
  • 10.99.99.x is the cluster network: a Linux bridge between the two nodes over a dedicated NIC on each one.
  • 10.99.10.x is a bridge to a NAS for backup purposes, again on a dedicated NIC.
  • Quorum votes:
    • Node 1: 1 vote
    • Node 2: 1 vote
    • QDevice: 1 vote.
      • The QDevice lives on the management network but is able to reach the cluster network. The reason for this is that it is also our NUT server and handles the UPS.
Do you think the QDevice being on the management NIC is causing the issue? But then why only for node 2? And why never under high network load?
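(For context, how each node currently sees the QDevice link can be checked with the standard tools; nothing below is specific to this setup.)
Code:
# quorum summary as seen by this node (membership, expected votes, Qdevice flags)
pvecm status

# detailed qdevice state: connection to the qnetd host and whether it currently casts a vote
corosync-qdevice-tool -sv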
 
Hello! Not sure, but you do not need a vmbr to use corosync. Just put the IP directly on eno2 and drop the unneeded complexity and extra network layer. Also, please configure a second ring; you don't have any redundancy if eno2 fails.

I would also put the QDevice on the same network; just put two IPs on your QDevice port.
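A minimal sketch of what that could look like in /etc/network/interfaces on node 2, keeping the existing 10.99.99.3/24 address so the corosync ring0_addr stays valid (the vmbr1 stanza would be dropped; node 1 would use .2 — purely an illustration, adjust before applying):
Code:
auto eno2
iface eno2 inet static
        address 10.99.99.3/24
#Cluster Network (corosync), addressed directly on the NIC, no bridge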
 
Hey, thank you for your reply.
I created Linux bridges because I read somewhere that it was the way to go for connecting two nodes directly with a network cable; maybe I dreamt it? You're saying I should just assign the NIC an address in the same subnet and be done with it? I can try that, but it might take a while: with the cluster in production and all, it's not something I would do remotely or during working hours. I might have to wait for the right chance to make this change on site during a weekend.

What I can do more easily is configure a second direct link between the two nodes, using the spare NICs, as a backup for corosync. It might also explain the issue if a NIC were slightly defective, and it would work around the previous point as well. Is there documentation on how to do it? I assume I add a ring entry, set up IPs on the ports, and bump the config version by one?
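For reference, roughly that procedure sketched against the corosync.conf posted earlier: add a second address per node, declare a second link, and bump config_version. The 10.99.98.x subnet for the spare NICs is purely hypothetical; it does not appear anywhere in this thread.
Code:
nodelist {
  node {
    name: pvea1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.99.99.2
    ring1_addr: 10.99.98.2
  }
  node {
    name: pvea2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.99.99.3
    ring1_addr: 10.99.98.3
  }
}

totem {
  cluster_name: *************
  config_version: 5
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
Editing /etc/pve/corosync.conf on one node is enough; the cluster filesystem distributes the change once config_version is increased.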

Regarding the QDevice: it is on the main network and connected to the main switch, not directly to the nodes, and not handled by a NIC on the cluster link. It was the first thing we thought about, but a question remains:
if it were the weak link, shouldn't it be the member that shows up as disconnected, rather than node 2, which is directly connected to node 1?

Not to mention that the reboot issue only happens outside working hours, sometimes even outside backup hours, when load on both the network and the system is minimal.

This has been going on for 3 months now, and that has been the only consistent fact about it.
 
I created Linux bridges because I read somewhere that it was the way to go for connecting two nodes directly with a network cable; maybe I dreamt it?
For 100 Mbit and slower you needed a switch, a hub, or a crossover cable to connect two ports together so the transmit and receive pairs were wired correctly. Gigabit and higher figure that out for themselves (auto MDI-X). You never needed a Linux bridge for that, though.

Don't know about the rest of your problem though, sorry.
 
Do not bridge two interfaces together; it can create an L2 loop.
The better solution is a bond interface (active-backup/master-slave mode); do not use a bridge.

Code:
Example:


Logical topology:

node1          node2
bond0 <------> bond0



Physical topology:

node1                node2
nic1_port0 <------> nic1_port0
nic2_port0 <------> nic2_port0
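A minimal /etc/network/interfaces sketch of that on node 2, assuming eno2 and eno4 are the two direct-cabled ports (eno4 as the second port is an assumption, it isn't stated in the thread) and reusing the existing 10.99.99.3/24 cluster address:
Code:
auto bond0
iface bond0 inet static
        address 10.99.99.3/24
        bond-slaves eno2 eno4
        bond-mode active-backup
        bond-miimon 100
        bond-primary eno2
#Cluster Network (corosync) over two direct cables, no bridge involved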
 
Thank you,
I will do that as soon as I am on site.
 
Hey,
I know it's been a long time, but I figured I would leave our findings here in case anyone stumbles upon this thread and it helps them as well.

In the end the solution wasn't anything suggested in this topic (although I am still thankful for those suggestions, as they pushed us toward a more optimized way of handling corosync).

After several attempts, what really gave us a lead was the fact that the reboot, although random, always happened around the same hour.

That's the time our backups are scheduled. What's even stranger is that on most days the backup would complete without issue, which threw us off; but after really digging in and cross-referencing the exact times with the backup logs, we realized that the event always happened between 95% and 97% of the backup of one specific VM, always the same one.
Why it only happened on some days we never figured out, nor did we ever find the actual root cause, because luckily for us the VM in question was due to be decommissioned a few days later anyway.

Since we shut it down and removed it from the backup jobs we haven't had a single reboot event, and it has now been more than 2 months. Considering we used to get at least one a week on a random day, I would say the matter seems to be resolved.

The VM in question was nothing special: a Thinstuff terminal server with Windows 10 Pro and 64 GB of RAM. It never gave any trouble during working hours; it only triggered the reboot at 95-97% of its backup. Go figure.
 
