After upgrade to 3.4.6/102d4547, Proxmox host unresponsive, but its VMs running

rgproxmox1

Member
Feb 4, 2013
I updated a cluster to the latest release (official; we have a license for PMX) and things ran smoothly until one of the 3 hosts became unresponsive: SSH, web and even the console never come back. SSH and the console prompt for the login and then hang. I have checked that the VMs are still running. For some of the ones that I shut down, I manually "moved" them to another host by moving their config files into the directory where the recipient host keeps its own config files, and it seemed to work. I am concerned about the following:

1. First of all, has anyone seen this scenario? When does it typically happen? It had never happened to us in 2-3 years of using Proxmox. It looks like a bug in this release. What information do you need, once the host is back up and responsive, to investigate?
2. What will happen when I reset the host (which seems to be the only way out of this condition, after shutting down all the VMs it controls)? I have no idea whether the config directory will be replicated from the other nodes or whether the host will still "think" it owns those VMs.

Thoughts? Thanks
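
For reference, the manual "move" described above amounts to renaming a VM's config file from one node's directory to another's. Below is a minimal sketch, assuming the usual layout /etc/pve/nodes/<node>/qemu-server/<vmid>.conf, a VM that is already shut down, and disks on storage both nodes can reach. The function name and the node names in the usage line are hypothetical.

```shell
#!/bin/sh
# Sketch only: move a stopped VM's config from one node's directory to another's.
# Assumed layout: <root>/nodes/<node>/qemu-server/<vmid>.conf, with <root>
# defaulting to /etc/pve. The function name is made up for this example.
move_vm_config() {
    vmid=$1; from=$2; to=$3; root=${4:-/etc/pve}
    src="$root/nodes/$from/qemu-server/$vmid.conf"
    dst_dir="$root/nodes/$to/qemu-server"

    if [ ! -f "$src" ]; then
        echo "no config at $src" >&2
        return 1
    fi
    if [ ! -d "$dst_dir" ]; then
        echo "no target directory $dst_dir" >&2
        return 1
    fi
    # Refuse to overwrite a config that already exists on the target node.
    if [ -e "$dst_dir/$vmid.conf" ]; then
        echo "target already has $vmid.conf" >&2
        return 1
    fi
    mv "$src" "$dst_dir/"
}
```

Run from a node that is still responsive and has quorum, for example `move_vm_config 100 node3 node1` (VMID and node names are placeholders).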
 
1. It really shouldn't happen at all. It's hard to say what a possible cause is. Look in the logs/dmesg to see if anything seems weird...
Did the VMs continue to run on the hung node?
2. Normally, if the node works again as expected, it should get the new status (that the VMs got moved, and so on) from the other nodes and everything should work fine.
It may be that you have to restart the pve-cluster service on the failed node.
Code:
/etc/init.d/pve-cluster restart
But as the cause of the failure is unknown, this cannot be 100% guaranteed. Maybe isolate the node first from important shared resources, so it cannot cause harm if it boots up and tries to access resources it no longer has exclusive access to (e.g. shared storage). But that's the worst case; usually everything should be fine.
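
Once the node is back, one way to confirm it picked up the new state is to check that no VMID has a config under more than one node. This is a minimal sketch, assuming configs live under /etc/pve/nodes/<node>/qemu-server/; the function name is made up for this example. On a healthy cluster it should print nothing.

```shell
#!/bin/sh
# Sketch only: print every VMID that has a config file under more than one node.
# Assumed layout: <root>/nodes/<node>/qemu-server/<vmid>.conf, with <root>
# defaulting to /etc/pve. Running "pvecm status" first shows whether the
# node has rejoined the cluster and has quorum.
find_duplicate_vmids() {
    root=${1:-/etc/pve}
    for f in "$root"/nodes/*/qemu-server/*.conf; do
        # Skip the literal pattern when the glob matches nothing.
        [ -f "$f" ] || continue
        basename "$f" .conf
    done | sort | uniq -d
}
```

Any VMID it prints is claimed by two nodes and should be sorted out before starting that VM anywhere.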
 
After the host was reset, I went to the /var/log directory and there's nothing there except the boot messages: it stopped logging at 3am on the day of the failure, which is when I'm going to assume the failure took place. The only messages after that are from the host booting after the reset.
The VMs kept on working all along.
After the boot, thankfully it picked up the changes to the config files that I had made by hand on the other hosts, so it didn't try to bring up the same VMs on 2 hosts.
I did forget to mention that the local disk of the host is on the NAS, attached via iSCSI. I have seen MANY failures where there are iSCSI "hiccups" and the host's local drive becomes read-only. (Is there a way to prevent this, so that it can come back as read/write? I understand the dangers, but after seeing it so many times, I'd rather take that risk.) In every one of those cases, though, I could SSH into the host and, to a minor degree, use the screens. Usually I'd just reboot the host with the "forced" option. This time I could not do anything with that host. tcpdump captures of that host's packets to the other hosts showed that it was sending packets, but all of them zero length (from port 8006).

Without any clues at the moment, I don't know what the issue was.
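
As a side note on the read-only symptom: while SSH still works, it can be confirmed quickly from /proc/mounts. This is a minimal sketch that only detects the condition and does not attempt to fix it; the function name is made up for this example.

```shell
#!/bin/sh
# Sketch only: print the mount point of every filesystem currently mounted
# read-only. Reads /proc/mounts by default; a different file can be passed
# in for testing.
list_readonly_mounts() {
    mounts=${1:-/proc/mounts}
    # Fields per line: device mountpoint fstype options dump pass.
    # Match the exact option "ro", not substrings such as errors=remount-ro.
    awk '{
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "ro") { print $2; break }
    }' "$mounts"
}
```

For example, `list_readonly_mounts | grep -x /` succeeds only if the root filesystem is read-only. Some pseudo-filesystems are normally mounted read-only, so extra lines in the full output are not necessarily a problem.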
 
Good to hear that VMs and everything kept working.

Ah okay, that explains a lot. AFAIK there is no way to prevent it and bring it back as read/write; it's an iSCSI protocol limitation.
Yeah, a bit strange, but I can imagine that the iSCSI hiccup and a little coincidence may have caused that. For now I don't have any clue either what happened, to be honest.

8006 is the port of the PVE web interface, and these were probably "ACK" packets, normally nothing to worry about. No other traffic was recorded, though? Then the cluster communication from this machine was dead.
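
To tell bare ACKs from real traffic in a capture, tcpdump can be given a filter that matches only TCP packets carrying a payload. This is a minimal sketch built on the data-only filter expression from the tcpdump man page (IPv4 only). The function name is made up for this example, and the interface name eth0 in the usage comments is an assumption.

```shell
#!/bin/sh
# Sketch only: build a tcpdump filter matching IPv4 TCP packets on a given
# port that carry payload (total IP length minus IP header length minus TCP
# header length is non-zero), so ACK-only packets are skipped.
payload_filter() {
    port=${1:-8006}
    printf 'tcp port %s and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)\n' "$port"
}

# Usage (eth0 is an assumption, adjust to the cluster-facing interface):
#   tcpdump -nn -i eth0 "$(payload_filter 8006)"
# Cluster traffic can be watched separately; corosync uses UDP 5404/5405 by default:
#   tcpdump -nn -i eth0 'udp port 5404 or udp port 5405'
```

If the first capture stays empty while an unfiltered capture on port 8006 shows packets, the host is only sending ACKs. If the second stays empty too, cluster communication from that host is down.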
 
