Proxmox 6.4 reboot issue

Jul 14, 2021
I admin a simple two-server Proxmox setup (I know, three would be better).

I just got both servers upgraded to 6.4, and I noticed an issue. For explanation's sake, let's
call them proxmox1 and proxmox2.

When I went to reboot proxmox2, the usual would happen: any running VMs would power down
and the system would reboot. This process seemed to take a while, by my estimate ~20 min. Anyway,
about 17 min into the process, proxmox1 would also reboot.

This happened three times in a row, so I would say it's not a fluke. Any help would be greatly
appreciated.
 
A bit unclear, but it seems you are running a 2-node cluster?

Question: for a 2-node cluster, some additional config needs to be made ...
=> Could it be that when you upgraded, the configuration marking this as a 2-node cluster got erased, so one node is now killing the other while you work on it?
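For reference, the special 2-node handling lives in corosync's votequorum settings. A sketch of what that section can look like (on Proxmox the live file is /etc/pve/corosync.conf and is managed by Proxmox itself, so treat this as illustration, not something to paste in blindly):

```
quorum {
  provider: corosync_votequorum
  # two_node relaxes the majority rule for exactly two nodes,
  # so the surviving node can stay quorate while its peer reboots.
  two_node: 1
  # wait_for_all avoids a fence race at cold start when two_node is set.
  wait_for_all: 1
}
```

Both options are documented in votequorum(5); whether they are appropriate here depends on how the cluster was created.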
 
Unsure, I've never had that happen in the past. Although VMs are migrating without issue, AFAIK. However, I could certainly try destroying and re-establishing the cluster if that seems like a good way forward?
 
The 'additional settings' all depend on the way you implemented your cluster,
which makes it quite hard for me to assess your current situation.
As you have stated:
- you run a 2-node cluster
- corosync contains the specific reference as such

But the great unknowns (not disclosed) are:
- do you have shared storage?
- are you using HA?
- are you using DLM?
- are you using lvmlockd?

The only situation I can think of is that a different process still detects the cluster as malfunctioning and sends a poison pill / STONITHs the node you are working on, just because it lost comms.

If node reachability over a (single) network is the only method implemented, any disruption will make it fall apart: the node you are working on will assume the other is malfunctioning and send a kill command.

A 2-node cluster isn't called a 'poor man's cluster' for nothing, because when a node goes down you are, in essence, in a single-server situation.
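The quorum arithmetic behind that remark can be sketched quickly (plain majority-vote rule, one vote per node, no two_node special-casing; illustrative only):

```python
def quorum(expected_votes: int) -> int:
    """Votes a partition needs to stay quorate: floor(n/2) + 1."""
    return expected_votes // 2 + 1

for n in (2, 3, 5):
    print(f"{n}-node cluster: quorum = {quorum(n)}, "
          f"tolerates {n - quorum(n)} node failure(s)")
```

With two nodes, quorum is 2, so a single node rebooting already drops the cluster below quorum; that is exactly the gap the two_node/wait_for_all knobs (or a third vote via a QDevice) are meant to cover.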

Most likely the logs will indicate exactly what happened on the node that went down while you were working on the other.
These need to be examined before even attempting to advise a course of action.
 
So I stood up corosync on a stand-alone server; it didn't help (I didn't think it would).
I am not running HA yet, but now that I have corosync running I might.

Last error message in syslog is "trying acquire cfs lock 'file-replication_cfg' ..."
I hate to think replication is causing this issue.
 
