Proxmox 6.4 reboot issue

Jul 14, 2021
I admin a simple two-server Proxmox setup (I know, three would be better).

I just got both servers upgraded to 6.4, and I noticed an issue. For explanation's sake, let's
call them proxmox1 and proxmox2.

When I went to reboot proxmox2, the usual would happen: any running VMs would power down
and the system would reboot. This process seemed to take a while, by my estimate ~20 min. Anyway,
about 17 min into the process, proxmox1 would also reboot.

This happened three times in a row, so I would say it's not a fluke. Any help would be greatly
appreciated.
 
A bit unclear, but it seems you are running a 2-node cluster?

Question: for a 2-node cluster, some additional config needs to be made ...
=> Could it be that when you upgraded, the configuration marking this as a 2-node cluster got erased, so one node is now killing the other while you work on it?
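For reference, the special 2-node handling lives in corosync's votequorum settings. A sketch of what that section can look like (on Proxmox the live file is /etc/pve/corosync.conf and is managed by Proxmox itself, so treat this as illustration, not something to paste in blindly):

```
quorum {
  provider: corosync_votequorum
  # two_node relaxes the majority rule for exactly two nodes,
  # so the surviving node can stay quorate while its peer reboots.
  two_node: 1
  # wait_for_all avoids a fence race at cold start when two_node is set.
  wait_for_all: 1
}
```

Both options are documented in votequorum(5); whether they are appropriate here depends on how the cluster was created.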
 
Unsure, I've never had that happen in the past. Although VMs are migrating without issue, AFAIK. However, I could certainly try destroying and re-establishing the cluster if that seems like a good way forward?
 
The 'additional settings' all depend on the way you implemented your cluster,
which makes it quite hard for me to assess your current situation.
As you have stated:
- you run a 2-node cluster
- corosync contains the specific reference as such

But the great unknowns (not disclosed) are:
- do you have shared storage?
- are you using HA?
- are you using DLM?
- are you using lvmlockd?

The only situation I can think of is that a different process still detects the cluster as malfunctioning and sends a poison pill / STONITHs the node you are working on, just because it lost comms.

If node reachability over a (single) network is the only method implemented, any disruption will make it fall apart: the node you are working on will assume the other is malfunctioning and send a kill command.

A 2-node cluster isn't called a 'poor man's cluster' for nothing, because when a node goes down you are, in essence, in a single-server situation.
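The quorum arithmetic behind that remark can be sketched quickly (plain majority-vote rule, one vote per node, no two_node special-casing; illustrative only):

```python
def quorum(expected_votes: int) -> int:
    """Votes a partition needs to stay quorate: floor(n/2) + 1."""
    return expected_votes // 2 + 1

for n in (2, 3, 5):
    print(f"{n}-node cluster: quorum = {quorum(n)}, "
          f"tolerates {n - quorum(n)} node failure(s)")
```

With two nodes, quorum is 2, so a single node rebooting already drops the cluster below quorum; that is exactly the gap the two_node/wait_for_all knobs (or a third vote via a QDevice) are meant to cover.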

Most likely the logs will indicate exactly what happened on the node that went down while you were working on the other.
These need to be examined before even attempting to advise a course of action.
 
So I stood up corosync on a stand-alone server; it didn't help (I didn't think it would).
I am not running HA yet, but now that I have corosync running I might.

Last error message in syslog is "trying acquire cfs lock 'file-replication_cfg' ..."
I hate to think replication is causing this issue.
 
