Proxmox 4.0 suddenly can't SSH to any VMs; lxc-stop won't work

MikeC

Renowned Member
Jan 11, 2016
Bay Area, California
Hello all.

We have a Proxmox server with 12 VMs on it. An hour ago I started getting reports that nobody was able to SSH into any of the VMs. I tried it myself and confirmed that the login hangs at the same spot on all of them. I noticed pveproxy was not responding either, but that seems to have been the case since I tried enabling clustering last month (it didn't work). After a pveproxy stop/start I could access the web interface, but I cannot get a console to any of the VMs! The Shutdown and Stop commands also time out. The services running on the VMs are still reachable, but I can't get into the VMs at all via SSH or console. The logs don't show anything suspicious.

Any ideas on what I can try? If I have to reboot the proxmox server I'd rather be able to do a safe shutdown of all the VMs first.
 
We are using zfs-local. The RAID is OK and online. Usage is 12%, so it's not a disk space issue.

Trying to stop a VM yields this message:

can't lock file '/run/lock/lxc/pve-config-106.lock' - got timeout (500)
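
For anyone trying to picture what a "got timeout" on a lock file means: the sketch below shows the general pattern of waiting for an exclusive file lock and giving up after a deadline. It assumes a conventional advisory flock-style lock, which this thread does not confirm for Proxmox, and `try_lock` is a hypothetical helper name, not a Proxmox function. If some other process grabbed the lock and then hung, every later caller would time out in this way.

```python
import errno
import fcntl
import os
import time


def try_lock(path, timeout=10.0, poll=0.1):
    """Try to take an exclusive advisory lock on path within timeout seconds.

    Returns the open file descriptor on success (close it to release the
    lock), or None if the lock stayed busy for the whole timeout.
    Hypothetical helper for illustration only.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Non-blocking attempt, so we can enforce our own deadline.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except OSError as exc:
            # EAGAIN/EACCES mean "held by someone else"; anything else is real.
            if exc.errno not in (errno.EAGAIN, errno.EACCES):
                os.close(fd)
                raise
        if time.monotonic() >= deadline:
            os.close(fd)
            return None
        time.sleep(poll)
```

The point is that the timeout is only a symptom. The process that is still holding the lock is the thing worth finding.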
 
You cannot log in to the VMs with SSH, which indicates a problem inside the VMs, with storage, or maybe with the network. So I guess this is unrelated to pvedaemon (or any other service on the host).
 
Thanks, Dietmar. Containers in this case. Well, so far there is no indication from the logs what it is, but I could neither SSH nor console (using the Proxmox GUI) into ANY of them. Like, *ALL* of them, except for the Proxmox node itself. No problem there. That's an indication to me that the problem exists at the common system level: Proxmox.

After struggling with this for an hour I decided to reboot the box. After issuing 'reboot' from the Proxmox node, I noticed in the GUI that Proxmox had sent shutdowns to all the containers, which is expected. However, they never stopped. The shutdown process kept waiting for another 45 minutes before I issued a 'reboot -f' to bypass the init system and shut down the box (there was no indication that Proxmox was all of a sudden going to shut any of them down successfully, and time is critical).

After the reboot I was unable to start any of the containers, receiving a 'no quorum (500)' error. I removed the cluster-specific configs, restarted the Proxmox-related processes, and was then able to start the containers.

I'm now conducting an after-action review, and the first glaring question is what happened and why. Thanks for the responses, Dietmar, I appreciate the insight, but until I can see otherwise, I'm blaming Proxmox for this incident. Having to reboot a VM node when you can't ensure all the containers have been properly stopped is extremely bad. So far I'm lucky this node did not have priority 1 servers on any of the VMs.
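
For a review like this, one generic thing worth capturing while the hang is still happening is which processes are stuck in uninterruptible sleep (state `D`), since that usually points at whatever I/O or filesystem call they are blocked on. The following is a minimal sketch using only the standard Linux `/proc/<pid>/stat` layout; `parse_state` and `blocked_processes` are hypothetical helper names, and nothing here is specific to Proxmox or confirmed as the cause in this thread.

```python
import os


def parse_state(stat_line):
    """Return (pid, comm, state) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split around the first '(' and the last ')'.
    left = stat_line.index("(")
    right = stat_line.rindex(")")
    pid = int(stat_line[:left].strip())
    comm = stat_line[left + 1:right]
    state = stat_line[right + 1:].split()[0]
    return pid, comm, state


def blocked_processes(proc_root="/proc"):
    """List (pid, comm) for processes in uninterruptible sleep (state 'D')."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_state(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or entry unreadable while scanning
        if state == "D":
            blocked.append((pid, comm))
    return blocked


if __name__ == "__main__":
    for pid, comm in blocked_processes():
        print(pid, comm)
```

If the same commands show up as blocked on every run, that narrows down where to look next.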
 
