Proxmox host locked up

testicleo

New Member
Nov 6, 2014
I have a fairly new installation that had been running flawlessly until this morning, when I noticed that the interface was sluggish and the I/O delay was up around 65%. I began by shutting down suspect VMs to see if that corrected it, and waited some time before rebooting the host. It rebooted fine, but I am unable to access the web front end or SSH into the box. It is, by all accounts, completely locked up. I was able to reach each of the VMs that had spun up and shut them down, one by one, until they were all shut down.

It's a single host with the VMs on a Thecus 8900 via NFS. I do not have access to the console at the moment, but I am guessing that is my only chance to see what is happening. I'm just curious what I should look at, once I do get visibility back, to find out what exactly caused the extreme I/O delay. It could also be a bad drive on the host itself or a degraded RAID. Just looking for any suggestions on what to check for root cause. Thanks!
 
I think your best bet is to check syslog and see whether it registered anything. That's my first go-to when something goes sideways. Are those all KVM VMs?
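As a rough sketch of that syslog check, something like the following could pull out the lines that tend to accompany storage stalls or KVM warnings. The function name `scan_log` and the pattern list are illustrative assumptions, not an official or exhaustive set, and the log path may differ on your install.

```shell
#!/bin/sh
# Hypothetical helper: print log lines that often accompany storage stalls
# or KVM warnings. The patterns below are assumptions chosen for illustration.
scan_log() {
    log_file="${1:-/var/log/syslog}"   # default path is an assumption
    grep -i -E 'not responding|blocked for more than|I/O error|unhandled rdmsr' "$log_file"
}
```

For example, `scan_log /var/log/syslog.1 | tail -n 50` would show the most recent matches from the rotated log.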
 
Yes, they are all KVM VMs. I will check the syslog when I get access to the console later and update with anything of interest. It just seems odd that the host itself would lock up so hard while the VMs run pretty smoothly. I'm just starting out with Proxmox, so perhaps that is not so odd. Hopefully it is something minor. Actually, I would be okay with a hardware failure somewhere on the host, as it just needs a good reason to be booted out. Thanks for the heads up.
 
Hi,
The GUI can have problems if a defined storage isn't accessible (and with I/O trouble you will also see high I/O wait and a high load).
But SSH should still work fine! Is something wrong with the IP config or routing?
Can you ping the host?

Udo
 
Yep. The host responds to ping, and the switch is not throwing any errors on the port. I checked the storage as well: all seems fine there, I can access it, and its logs look clean. SSH is unresponsive (it just sits there) and the GUI just spins and spins. That said, it is almost certainly an issue with the host, as I have tried to eliminate everything else in the picture. Thanks for the suggestion.
 
What do you get if you restart the following services:
#service pveproxy restart
#service pvedaemon restart
#service pvestatd restart

Restarting these fixes GUI issues most of the time. But as Udo said, not being able to SSH into the host is indeed very odd. Not being able to access the GUI while the VMs run just fine could be a sign that something is up with the storage connection. It used to be a common issue when an NFS share went sideways during backups or something else. I am not sure if it is still an issue, as I have not seen it in a while, but usually a hard reboot clears it. The only thing you can do is access the console and see where it is sitting. If it is at the prompt, then there might be hope. If it is stuck somewhere, you will know about it.
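To test the storage theory once you have a shell, a minimal sketch like this checks whether a mount point answers within a few seconds instead of hanging. The function name `check_mount` and the 5-second limit are my own assumptions; it relies on `timeout` from GNU coreutils, and a process stuck on a hung NFS mount may not always be killable.

```shell
#!/bin/sh
# Hypothetical helper: report whether a mount point responds within 5 seconds.
# Assumes GNU coreutils `timeout` is available on the host.
check_mount() {
    mount_path="$1"
    if timeout -s KILL 5 stat "$mount_path" >/dev/null 2>&1; then
        echo "ok: $mount_path"
    else
        echo "unresponsive or missing: $mount_path"
        return 1
    fi
}
```

On a Proxmox host, NFS storages are normally mounted under /mnt/pve/, so a call would look like `check_mount /mnt/pve/<storage-name>` with your own storage name filled in.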
 
As an update, I finally made it to the console and it was stuck: no prompt, but a screen full of messages like:
proxmox kernel: kvm: 100771: cpu3 unhandled rdmsr: 0xc00110000
A hard reboot seems to have brought it back, but a quick search on errors like this led me to another thread on the forum where both the problem and the resolution seemed sporadic at best. Some thought it was NIC drivers, and some tried different VM configs. Either way, it makes me a bit nervous. Syslog is not very helpful, as it doesn't say much about what exactly happened, and the rotated log says even less. Whatever it was, it had the host pretty hosed. I am using NFS-based storage, but I can't imagine that is the issue? The iSCSI vs. NFS religion is an interesting one, but I usually put it down to preference rather than true performance. Does anyone have any thoughts on what specifically to look for? It had run solidly for a week, then seemed to panic and crumble, but so far I am not seeing anything specific. Thanks for any insight.
 
The formatting of the error actually obscured the message. What I had was a screen full of:
proxmox kernel: kvm: 100771: cpu3 unhandled rdmsr: 0xc001100d
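To see how often these messages occur and which MSR addresses are involved, a small sketch like the one below tallies them from a log file. The function name `count_rdmsr` and the default log path are assumptions for illustration; it only counts lines and says nothing about whether the messages are harmful.

```shell
#!/bin/sh
# Hypothetical helper: count "unhandled rdmsr" messages per MSR address,
# most frequent first.
count_rdmsr() {
    log_file="${1:-/var/log/syslog}"   # default path is an assumption
    grep -o 'unhandled rdmsr: 0x[0-9a-fA-F]*' "$log_file" \
        | awk '{ print $NF }' \
        | sort | uniq -c | sort -rn
}
```

Running `count_rdmsr /var/log/syslog` prints one line per address with its count, which makes it easier to compare a quiet week against the morning of the lockup.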
 
See if you can boot into rescue mode. Then disable the NFS storage in storage.cfg and try to boot up. If it is still the same, disable autostart on all VMs and boot again. See if the host boots without issue.
 
Thanks. I actually had to hard reboot it, and when I did, everything was fine. I/O delay has stayed at about 0.02% and all of the VMs are working as expected. I am more interested in why the logs show the above error. Has anyone else seen anything like that? Some of what I have read says it is cosmetic and nothing to be concerned about, while other comments indicate it is a real issue that shouldn't be occurring. Thanks for any input.