Why do some of my Proxmox hosts occasionally fail to reboot?

I first noticed this issue with Proxmox 5.4 and suspect it has everything to do with systemd integration and very little to do with Proxmox itself, but it bites me in the backside every so often, so I'd like to bring it up and see whether there's something I should be doing differently, or a fix for it, or what.

When I have kernel updates to apply on my Proxmox 5.4 hosts, I set an off-hours at job to run a script which calls 'qm shutdown' on all running VMs, unmounts all iSCSI volumes (my VMs run from remote SSD storage), and then issues a reboot command for the Proxmox hypervisor itself. When Proxmox comes back from the reboot, the volumes mount automatically, the VMs start up, and life goes on. Years ago we used to simply issue a reboot from an at job (much simpler), but VMs would often boot with corrupted filesystems, so we surmised that was the wrong way to go about it.
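Roughly, the job boils down to something like this (a simplified sketch, give or take error handling; the mount point below is just a stand-in for the real iSCSI-backed path):

    #!/bin/bash
    # off-hours maintenance: clean shutdown of guests, detach storage, reboot host

    # ask every running VM to shut down cleanly
    running=$(qm list | awk '$3 == "running" {print $1}')
    for vmid in $running; do
        qm shutdown "$vmid"
    done

    # wait until each of those VMs has actually stopped
    for vmid in $running; do
        qm wait "$vmid"
    done

    # unmount the remote SSD storage and log out of all iSCSI sessions
    umount /mnt/iscsi-ssd      # stand-in path for the real volumes
    iscsiadm -m node -u

    # and finally reboot the hypervisor
    reboot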

The problem I'm having is that there are times when the reboot command fails completely. I've never seen anything like it. Maybe twice in 2019. Not a huge percentage of the time, but it's baffling and a real pain to deal with. I'll remote in around 6am to take a quick look around and one, or several, of the Proxmox hosts will just be sitting there. The automated reboot command didn't work. I'll try to reboot by hand with reboot, init 6, shutdown -tr now, systemctl reboot ... nothing works and no reboot happens. systemctl will sometimes return a message that it's scheduling a reboot for a minute or so into the future but yeah, it never happens. I never find any related errors in syslog or dmesg. It happened again this past weekend and I'm ready to look into IPMI rebooting, something lower level, or some kind of remote power relay. Am I just doing this wrong? Any suggestions?

Thanks
 
hi,

maybe you can try stopping all the guests first, like: pvesh create /nodes/localhost/stopall, and then once that's finished try the reboot?
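so in your maintenance script it would look roughly like:

    # stop all guests on this node in one go, then reboot once that's done
    pvesh create /nodes/localhost/stopall
    reboot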
 
Thanks! Is this functionally different than qm shutdown VMID?
Regarding the failing reboot command, however: all VMs are already in the stopped state when the reboot fails.
 
Regarding the failing reboot command, however: all VMs are already in the stopped state when the reboot fails.

what is blocking the reboot then? isn't there anything in the syslog or the journal? probably some process is stopping you from rebooting, like maybe an NFS mount or something similar.
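for example, the next time it happens, something like this might show whether a systemd job is stuck or a mount is still hanging around (just a rough sketch):

    journalctl -b -e         # end of the journal for the current boot
    systemctl list-jobs      # queued/stuck systemd jobs (a pending reboot.target would show up here)
    iscsiadm -m session      # any iSCSI sessions still logged in
    grep nfs /proc/mounts    # any NFS mounts still present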
 
All the VMs were down and the iSCSI storage unmounted/logged-out (iSCSI entries missing from /proc/mounts). I looked in syslog and dmesg and there wasn't any indication of a problem. I have had NFS mounts block reboots before (a long time ago; we don't run NFS anymore), but usually the terminal hangs while the process is waiting. You don't even get a cursor back after hitting enter. Not even with CTRL+C. This isn't like that, though. The reboot command returns immediately and nothing happens. CTRL+ALT+DEL on the remote KVM console fails as well (almost like OOM issues, but again, that should leave something in syslog). The only way to drop the box is a physical reset or power cycle.

Like I said, this only happens a few times a year, but it is maddening when it does. If I can find a networkable relay gizmo that hooks up to the reset switch, I'm buying a dozen; otherwise I guess we'll live with it. It doesn't sound like a known issue with an easy fix.
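If I do end up going the IPMI route, I assume it would be something along these lines from another box (BMC address and credentials below are just placeholders):

    # out-of-band reset via the BMC -- address/user/password are placeholders
    ipmitool -I lanplus -H 192.0.2.50 -U admin -P secret chassis power reset
    # or a full power cycle:
    ipmitool -I lanplus -H 192.0.2.50 -U admin -P secret chassis power cycle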
 
