[SOLVED] HA VMs and Containers Freeze on Node Thermal Shutdown

dperron-ss

New Member
Jan 8, 2019
Hi folks,

I'm having an annoying issue with a Proxmox HA + Ceph cluster on HP ProLiant hardware. Whenever a component fails (a fan, in this case), the server powers itself down to prevent data loss. The problem is that this triggers Proxmox HA to shut down all VMs and CTs living on that node as part of a graceful shutdown procedure. A number of important services get stuck in a "frozen" state when this happens and do not fail over to another node.

I know there's a great deal of wisdom out there in the community, and I was hoping somebody could shed some light on this for me.

Thanks in advance for any guidance!
 
Which pveversion do you have?
Code:
pveversion -v

On a powerdown all HA services get stopped. By default, if you trigger a reboot they will get marked "frozen", but on a plain poweroff they should not be marked as such and should instead get recovered to another node.
At least that's the case with the current version of Proxmox VE / HA manager.
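
If you want to check which state the HA stack has recorded for a service (started, stopped, frozen, error, ...), you can ask the manager directly. As a minimal sketch, something like:

Code:
# overall HA status: quorum, current master, LRM state per node,
# and the state of every HA-managed service
ha-manager status

# list the configured HA resources and their requested state
ha-manager config

A service shown as "frozen" there will not be recovered by the CRM; it is expected to come back on the same node once that node is up again.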

With the newest version of the HA manager (at the moment of writing only available in the pvetest repository) you can override this behaviour to always recover or always freeze, see:
https://bugzilla.proxmox.com/show_bug.cgi?id=1378
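
The new behaviour is controlled through a datacenter-wide HA option. As a rough sketch of how that could look in /etc/pve/datacenter.cfg once the package is available (option values as described in the bug report, so double-check the docs shipped with the update):

Code:
# /etc/pve/datacenter.cfg
# shutdown_policy: conditional (current default), freeze, or failover
ha: shutdown_policy=failover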
 
I think we've found the root of the problem: the HP ProLiant in question got itself stuck in a reboot loop. I'll look into its BIOS settings to see whether we can keep it from powering itself back on after a failure.

Thanks for pointing me to the config option to force failover; that looks like it will solve my problem once the package updates are pushed to the public repositories.

For posterity, here's the output of pveversion -v:

Code:
root@rogers-pve1:~# pveversion -v
proxmox-ve: 5.3-1 (running kernel: 4.15.18-9-pve)
pve-manager: 5.3-5 (running version: 5.3-5/97ae681d)
pve-kernel-4.15: 5.2-12
pve-kernel-4.15.18-9-pve: 4.15.18-30
pve-kernel-4.15.17-1-pve: 4.15.17-9
ceph: 12.2.8-pve1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-43
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-33
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-5
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-31
pve-container: 2.0-31
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-16
pve-firmware: 2.0-6
pve-ha-manager: 2.0-5
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-43
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.12-pve1~bpo1
 
I installed the testing repository and set the config option, and I can confirm that shutting down PVE nodes now results in the desired behaviour (see the sketch below for what that looked like).
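
For reference, roughly what that involved on these PVE 5.3 / Debian Stretch nodes (treat it as a sketch from memory rather than a copy-paste recipe; the repository line and option name should match the pvetest packages described above):

Code:
# enable the pvetest repository on PVE 5.x (Debian Stretch)
echo "deb http://download.proxmox.com/debian/pve stretch pvetest" \
    > /etc/apt/sources.list.d/pvetest.list
apt update && apt full-upgrade

# then add (or adjust) the HA line in /etc/pve/datacenter.cfg:
#   ha: shutdown_policy=failover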

Thanks for the help!
 
