I first noticed this issue with Proxmox 5.4 and suspect it has everything to do with the systemd integration and very little to do with Proxmox itself, but it bites me in the backside every so often, so I'd like to bring it up and see whether there's something I should be doing differently, a fix for it, or what.
When I have kernel updates to apply on my Proxmox 5.4 hosts, I schedule an off-hours at job to run a script which calls 'qm shutdown' on all running VMs, unmounts all iSCSI volumes (my VMs run from remote SSD storage), and then issues a reboot command for the Proxmox hypervisor itself. When Proxmox comes back from the reboot, the volumes mount automatically, the VMs start up, and life goes on. Years ago we used to simply issue a reboot from an at job (much simpler), but VMs would often boot with corrupted filesystems, so we surmised that was the wrong way to go about it.
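For context, the script is roughly the following. This is a simplified sketch, not the exact script: the 'qm wait' call and the /mnt/iscsi-* mount points are placeholders for my actual setup.

```shell
#!/bin/sh
# Sketch of the off-hours maintenance reboot script described above.
# Assumes the Proxmox 5.x 'qm' CLI and iSCSI-backed mounts under
# /mnt/iscsi-* (placeholder paths, adjust for your storage layout).

# VMIDs of running guests ('qm list' columns: VMID NAME STATUS ...)
running_vmids() {
    qm list | awk '$3 == "running" {print $1}'
}

maintenance_reboot() {
    # Cleanly shut down every running VM and wait for it to power off
    for vmid in $(running_vmids); do
        qm shutdown "$vmid" && qm wait "$vmid"
    done

    # Unmount the iSCSI-backed storage before rebooting the host
    for mnt in /mnt/iscsi-*; do
        [ -e "$mnt" ] || continue
        umount "$mnt"
    done

    # Reboot the hypervisor itself
    systemctl reboot
}
```

The point of the 'qm wait' step is to make sure no guest is mid-write when the volumes get unmounted, which is what the old plain-reboot approach got wrong.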
The problem I'm having is that there are times when the reboot command fails completely. I've never seen anything like it. It happened maybe twice in 2019, so not a huge percentage of the time, but it's baffling and a real pain to deal with. I'll remote in around 6am to take a quick look around, and one or several Proxmox hosts will just be sitting there: the automated reboot command didn't work. I'll try to reboot by hand with 'reboot', 'init 6', 'shutdown -r now', 'systemctl reboot' .. nothing works and no reboot happens. systemctl will sometimes return a message that it's scheduling a reboot for a minute or so into the future, but it never happens, and I never find any related errors in syslog or dmesg. It happened again this past weekend, and I'm ready to look into IPMI rebooting, some kind of remote power relay, or something else lower level. Am I just doing this wrong? Any suggestions?
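In case it helps frame suggestions: the kind of "lower level" fallback I have in mind, short of full IPMI, is the kernel's magic SysRq trigger, which bypasses systemd entirely. A minimal sketch, wrapped in a guard so it can't fire by accident (the FORCE variable is just my own safety latch, not a kernel knob):

```shell
#!/bin/sh
# Last-resort reboot via magic SysRq, bypassing systemd entirely.
# WARNING: 'echo b' reboots immediately with NO clean shutdown, so
# it should only ever run after guests and storage are already down.
# Guarded behind FORCE=1 so sourcing or running it by mistake is safe.

force_reboot() {
    if [ "${FORCE:-0}" != "1" ]; then
        echo "refusing: set FORCE=1 to force an immediate SysRq reboot" >&2
        return 1
    fi
    sync                                  # flush pending writes first
    echo 1 > /proc/sys/kernel/sysrq      # enable all SysRq functions
    echo b > /proc/sysrq-trigger         # immediate reboot, no shutdown
}
```

If 'systemctl reboot' is wedged but the kernel itself is fine, this should still work; if even SysRq doesn't fire, then it really is IPMI or power-relay territory.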
Thanks