[SOLVED] HA VMs and Containers Freeze on Node Thermal Shutdown

dperron-ss

New Member
Jan 8, 2019
3
0
1
28
Hi folks,

I'm having an annoying issue with Proxmox HA + Ceph clusters on HP ProLiant hardware. Whenever a component fails (a fan in this case), the server powers itself down to prevent data loss. The problem is that this tells Proxmox HA to shut down all VMs and CTs living on that node as part of a graceful shutdown procedure. A number of important services get stuck in a "Frozen" state when this happens and do not fail over to another node.

I know there's a great deal of wisdom out there in the community, and I was hoping somebody could shed some light on this for me.

Thanks in advance for any guidance!
 

t.lamprecht

Proxmox Staff Member
Staff member
Jul 28, 2015
1,695
258
103
South Tyrol/Italy
Which pveversion do you have?
Code:
pveversion -v
On a powerdown, all HA services get stopped. By default, if you trigger a reboot they get marked "frozen", but on a plain poweroff they should not be marked as such and should then get recovered to another node.
At least that's the case with the current version of Proxmox VE/HA manager.

With the newest version of HA manager (at the time of writing only available in the pvetest repository), you can override this behaviour to always recover or always freeze; see:
https://bugzilla.proxmox.com/show_bug.cgi?id=1378
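
For anyone landing here later, a rough sketch of what the setting could look like. This assumes the option from the linked bug report shipped as `shutdown_policy` under the `ha` key in /etc/pve/datacenter.cfg, with `failover` as the always-recover value. Neither name is spelled out in this thread, so check `man datacenter.cfg` on your version before relying on it.

Code:
```
# /etc/pve/datacenter.cfg
# Assumed option name and value (not quoted in this thread).
ha: shutdown_policy=failover
```

Under the same assumption, `freeze` in place of `failover` would be the always-freeze variant mentioned above.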
 

dperron-ss

I think we've settled on the root of the problem: the HP ProLiant in question got itself stuck in a reboot loop. I'll look into its BIOS to see if we can have it not power itself back on after a failure.

Thanks for pointing me to the config option to force failover; that looks like it will solve my problem once the package updates are pushed to the public repository.

For posterity, here's the output of pveversion -v:

Code:
root@rogers-pve1:~# pveversion -v
proxmox-ve: 5.3-1 (running kernel: 4.15.18-9-pve)
pve-manager: 5.3-5 (running version: 5.3-5/97ae681d)
pve-kernel-4.15: 5.2-12
pve-kernel-4.15.18-9-pve: 4.15.18-30
pve-kernel-4.15.17-1-pve: 4.15.17-9
ceph: 12.2.8-pve1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-43
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-33
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-5
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-31
pve-container: 2.0-31
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-16
pve-firmware: 2.0-6
pve-ha-manager: 2.0-5
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-43
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.12-pve1~bpo1
 

dperron-ss

Installed the testing repository and set the config option; I can confirm that shutting down PVE nodes results in the desired behaviour.

Thanks for the help!
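
To double-check that the option is actually in place on a node, something like the helper below might be handy. It is only a sketch: the function name `ha_shutdown_policy` is made up for illustration, and both the `ha: shutdown_policy=...` line format and the `conditional` fallback are assumptions about how the HA manager stores and defaults the setting, not something confirmed in this thread.

Code:
```shell
#!/bin/sh
# Hypothetical helper: print the HA shutdown policy found in a
# datacenter.cfg-style file. The "ha: shutdown_policy=<value>" format
# is an assumption; adjust the pattern if your version differs.
ha_shutdown_policy() {
    cfg="${1:-/etc/pve/datacenter.cfg}"
    policy=""
    if [ -r "$cfg" ]; then
        # Take the value of the first matching "ha:" line, if any.
        policy=$(sed -n 's/^ha:.*shutdown_policy=\([a-z]*\).*/\1/p' "$cfg" | head -n 1)
    fi
    # Assumed default when nothing is configured.
    echo "${policy:-conditional}"
}
```

Calling `ha_shutdown_policy` with no argument reads /etc/pve/datacenter.cfg; passing a path lets you try it against a copy of the file first.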
 
