Add node - reboot all vm's

dignus

Hi,

I have a Proxmox cluster running with 13 nodes. I wanted to add a node to it. Somehow it then decided to restart all running VMs. All 200-something VMs were restarted. Eventually all VMs came back online and I can see the 14th node in the cluster, but it appears offline, as does one other node.

The last thing I saw was "restarting services", before the VM I was doing my work from was restarted as well.

Any idea how to find the root cause of this and how to prevent it from happening again? I have to say this is the 2nd time we've seen this. The first time I blamed Murphy, but happening twice...
 
Hi,
I guess it's HA-related... are all your VMs under HA control?

Udo
 
Are you sure you have only 1 IGMP querier active in your network (the second one/others need to be silent)?
Can you see anything in the logs of your switch(es) that might clarify things (STP/loop errors)?

What fencing device do you use, and what do the timers look like when this occurs?
 
The only thing active on this (physical) network is this Proxmox cluster. Fencing is set to software. Not sure about the timers; everything is at default settings anyway. I didn't expect this result, so I wasn't logged into any of the physical boxes.
 
I didn't notice anything special really, but I don't have access to the logs anymore; they have already rotated.
What if I try it again? Would it be beneficial to temporarily turn off HA for the VMs?
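Since the logs had already rotated here, it may help to save a plain-text copy of each node's system journal right after the next attempt and pull out the cluster-related lines. A minimal sketch, assuming the export is an ordinary text file with one log entry per line; the keyword list and the function name `cluster_lines` are illustrative choices, not anything Proxmox provides:

```python
import sys
from typing import Iterable, List, Sequence

# Assumed keywords: services usually involved in cluster membership and HA.
# Adjust the list to whatever actually appears in your own logs.
KEYWORDS = ("corosync", "pve-ha-lrm", "pve-ha-crm", "pmxcfs", "watchdog")


def cluster_lines(lines: Iterable[str], keywords: Sequence[str] = KEYWORDS) -> List[str]:
    """Return the lines that mention any keyword (case-insensitive), in order."""
    lowered = tuple(k.lower() for k in keywords)
    return [ln.rstrip("\n") for ln in lines if any(k in ln.lower() for k in lowered)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python filter_logs.py node1-journal.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for line in cluster_lines(fh):
            print(line)
```

Running this over the export from every node and comparing timestamps around the join should at least show which service reacted first.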
 
I think that if you only disable HA for the VMs, the node will still crash. The only change you will notice is that the VMs running on the crashed node will not be moved automatically to another node (because HA is disabled). If you want to test this, you need to disable HA for the VMs and also disable watchdog fencing (do not disable only watchdog fencing, because if cluster traffic is lost during the test, all HA VMs will crash). I doubt such a test will give you much information, though. A real test would probably leave much more in the logs, but I understand you don't want to run that on a setup with production load. So this test might be worth a shot.
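To put numbers on "cluster traffic is lost": a node needs to see a majority of the cluster's votes to stay quorate, and as far as I understand it a node that loses quorum while running HA services is what the watchdog ends up resetting. A small sketch of the majority arithmetic, assuming the default of one vote per node (the function names are made up for illustration):

```python
def quorum_threshold(total_votes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return total_votes // 2 + 1


def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """True if a partition that sees `reachable_votes` votes is quorate."""
    return reachable_votes >= quorum_threshold(total_votes)


# Cluster sizes mentioned in this thread, before and after adding a node.
for nodes in (3, 4, 13, 14):
    need = quorum_threshold(nodes)
    print(f"{nodes} nodes: quorum needs {need} votes, tolerates {nodes - need} unreachable")
```

If cluster communication breaks down completely during the join, every node sees only its own vote, so none of them is quorate. That would be consistent with all nodes restarting at once, though it does not explain why the traffic was lost in the first place.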
 
The same thing just happened to me.
I added one new node (there were 3) and all 3 rebooted, leaving all VMs offline.
pve-manager/6.1-11/f2f18736 (running kernel: 5.3.18-3-pve)

I will look at other options after today. The evaluation has ended.
 
