Proxmox 3.1 kernel crash that takes other servers offline too...

We have a couple dozen Proxmox servers, and about once a month, one of them has a kernel panic and locks up. The worst part about these lockups is that when the node is on a separate switch, all other Proxmox servers on that switch stop responding until we can find the server that actually crashed and reboot it. When we reported this issue here, we were advised to upgrade to Proxmox 3.1, and we've been in the process of doing that for the past several months. Unfortunately, one of the servers on 3.1 locked up with a kernel panic on Friday, and again all Proxmox servers on that same switch were unreachable until we could locate the crashed server and reboot it. Well, almost all Proxmox servers on the switch... I found it interesting that the Proxmox servers on that same switch that were still on version 1.9 were unaffected.

Here is a screen shot of the console of the crashed server:

2014-01-27_1431.png

Here is a screen shot of what the rest of the unreachable servers had spewing to their console (all on the same switch, same version of proxmox, master on different switch):

2014-01-27_1441.png

Here is the pveversion information from the locked server (the other affected nodes should have the same output as they were all installed from the same Proxmox iso):

pveversion -v
proxmox-ve-2.6.32: 3.1-109 (running kernel: 2.6.32-23-pve)
pve-manager: 3.1-3 (running version: 3.1-3/dc0e9b0e)
pve-kernel-2.6.32-23-pve: 2.6.32-109
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.0-1
pve-cluster: 3.0-7
qemu-server: 3.1-1
pve-firmware: 1.0-23
libpve-common-perl: 3.0-6
libpve-access-control: 3.0-6
libpve-storage-perl: 3.0-10
pve-libspice-server1: 0.12.4-1
vncterm: 1.1-4
vzctl: 4.0-1pve3
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.4-17
ksm-control-daemon: 1.1-1
glusterfs-client: 3.4.0-2

Two questions:

1. Any clues what would be causing the kernel panic (see first image)?

2. Why would other servers on the same switch and version of Proxmox be knocked off the network until the locked server is rebooted? (Note: There were other servers on the same switch that were running the older version of proxmox that were unaffected. Also, no other Proxmox servers in the same 3.1 cluster were affected that were not on that same switch.)

Thanks,

Curtis
 
Thanks for your response.

What nics do you use in those servers?
And what kind of switch do you use (brand, model)?

The NICs are onboard... Supermicro X9SCL-F. The motherboard specs say they are Intel 82579LM and 82574L (2x Gigabit Ethernet LAN ports).

The switch that the affected servers are connected to is a Cisco SG102-24. However, we are on our third switch... this same issue happened under Proxmox 1.9 and we tried two other switches with that version (one Cisco and one D-Link... I'd have to dig to get the model numbers on those). Since it has happened with 3 different switches (3 different models, 2 brands), I am less suspicious of the switches. The fact that it seems to affect only servers running the same version of Proxmox also makes me less suspicious of the switch.
 
What are the settings on the switch for igmp snooping?

It's an unmanaged switch, and so there are no settings to view. The feature list does not mention igmp snooping, however:

http://www.cisco.com/en/US/prod/collateral/switches/ps5718/ps10863/datasheet_C78-582017.html

Thanks for your thoughts. I'm also interested to know if anyone has any ideas about the server that had the kernel panic (probably even more interested in that, since even if we solve why other servers were affected, having lockups is not acceptable).
 
You could try latest kernel: pve-kernel-2.6.32-26-pve

I thought Proxmox 3.1 would have included the most recent stable kernel. Where do I get the latest kernel? Also, is there a kernel change log somewhere?

Thanks again for your response.
 
Login to the host as root and do this on the command line:
apt-get update && apt-get full-upgrade

Since the above includes a kernel upgrade you need to reboot the host.
 
apt-get update produces the following error:

Err https://enterprise.proxmox.com wheezy/pve-enterprise amd64 Packages
The requested URL returned error: 401

apt-get full-upgrade then gives this error:

E: Invalid operation full-upgrade

I suppose this is because we have not yet purchased a subscription?
 
For the 401 error from apt-get update, see:
https://pve.proxmox.com/wiki/Package_repositories#Proxmox_VE_No-Subscription_Repository

Sorry, the second command should have been: apt-get dist-upgrade
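
For readers following the wiki link, here is a minimal sketch of switching a host from the enterprise repository to the no-subscription one. It assumes Proxmox VE 3.x on Debian wheezy, that the enterprise entry lives in pve-enterprise.list, and that the repository line matches the linked wiki page; the function name and the pve-no-subscription.list file name are hypothetical. Check the wiki before applying it to a real host.

```shell
#!/bin/sh
# Hypothetical helper, not an official Proxmox tool.
# use_no_subscription DIR: DIR is normally /etc/apt/sources.list.d
use_no_subscription() {
    dir="$1"
    # Comment out the enterprise repository, which answers 401 without a subscription.
    if [ -f "$dir/pve-enterprise.list" ]; then
        sed -i 's/^deb /# deb /' "$dir/pve-enterprise.list"
    fi
    # Assumed repository line for Proxmox VE 3.x (Debian wheezy); verify against the wiki.
    echo "deb http://download.proxmox.com/debian wheezy pve-no-subscription" \
        > "$dir/pve-no-subscription.list"
}

# Only act when a directory is passed, e.g.: sh repo.sh /etc/apt/sources.list.d
if [ "$#" -ge 1 ]; then
    use_no_subscription "$1"
fi
```

After that, apt-get update && apt-get dist-upgrade should no longer stop on the 401 error.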
 
So, if I understand the link you provided, if we want to upgrade our kernel, we have to use the non-production "test" repo or their subscription based one. We don't have time for testing. We need a stable platform and I'm not interested in purchasing a subscription until after we've determined that we can find a stable version of Proxmox. In fact, since the one crashed server affected network connectivity of all other servers on the same Proxmox cluster and switch, it makes me suspicious of the stability of Proxmox clusters in general. At this point, it seems like it would make more sense to go with OpenVZ's recommended platform (CentOS 6) and give up on Proxmox. For KVM users, I can see a benefit in Proxmox, but for OpenVZ, I'm not really seeing much benefit, and we can't afford this instability. I'm going to do some research on what it would take for us to move to a "pure" OpenVZ environment. I may be back, depending on what I find.

Really appreciate your help, however. :-)
 
You could try simply to install only the kernel:
wget http://download.proxmox.com/debian/...pve-kernel-2.6.32-26-pve_2.6.32-114_amd64.deb
wget http://download.proxmox.com/debian/...ve-headers-2.6.32-26-pve_2.6.32-114_amd64.deb
wget http://download.proxmox.com/debian/...tion/binary-amd64/pve-firmware_1.0-23_all.deb

Then simply run:
dpkg -i pve-kernel-2.6.32-26-pve_2.6.32-114_amd64.deb pve-headers-2.6.32-26-pve_2.6.32-114_amd64.deb pve-firmware_1.0-23_all.deb

The above versions correspond to the ones in stable.
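
After installing the packages and rebooting, it may be worth confirming that the host actually booted the new kernel. The following is only a sketch; kernel_is is a hypothetical helper name, and the expected release string is whatever uname -r should report for the installed package (2.6.32-26-pve in this case).

```shell
#!/bin/sh
# kernel_is EXPECTED: succeed only if the running kernel release matches EXPECTED.
kernel_is() {
    [ "$(uname -r)" = "$1" ]
}

# Run the check only when a release string is given,
# e.g.: sh check_kernel.sh 2.6.32-26-pve
if [ "$#" -ge 1 ]; then
    if kernel_is "$1"; then
        echo "running kernel is $1"
    else
        echo "running kernel is $(uname -r), expected $1" >&2
        exit 1
    fi
fi
```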
 
You don't have even one managed switch?
If you do, I would set up a port mirror on the interfaces to see what happens; otherwise you can only guess.

A possible reason could be a broadcast storm on your network, causing this kernel panic and the adapter resets on the other nodes because of the large number of packets they receive.

Managed switches with STP turned on could help, and you would see it in the logs.
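
Where no port mirror is available, a rough substitute is to watch the receive counters on one of the affected hosts while the problem is occurring. This is a sketch under assumptions: it reads the standard Linux counter /sys/class/net/IFACE/statistics/rx_packets, the helper names are hypothetical, and a high rate only hints at a storm rather than proving one.

```shell
#!/bin/sh
# rate_per_sec FIRST SECOND SECONDS: integer packets per second between two counter readings.
rate_per_sec() {
    echo $(( ($2 - $1) / $3 ))
}

# sample_rx IFACE SECONDS: read the kernel's rx_packets counter twice and print the rate.
sample_rx() {
    counter="/sys/class/net/$1/statistics/rx_packets"
    first=$(cat "$counter") || return 1
    sleep "$2"
    second=$(cat "$counter") || return 1
    rate_per_sec "$first" "$second" "$2"
}

# Only sample when an interface name is given, e.g.: sh rxrate.sh eth0 5
if [ "$#" -ge 1 ]; then
    sample_rx "$1" "${2:-5}"
fi
```

Comparing the figure from an affected node against one from a healthy node on another switch gives a baseline for what counts as abnormal.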
 
Thanks for the tip on how to upgrade the kernel. Since I'm not finding any evidence that another kernel upgrade will help, my first priority is to get these machines out of the cluster so it doesn't happen again. I've started a separate thread for that (http://forum.proxmox.com/threads/17682-Safely-remove-node-from-cluster-without-deleting-containers). Perhaps I'll be back later, but for now, we don't have time to experiment and need to ensure that the issue does not come back. If the problem resurfaces, at least we'll know then that it's not Proxmox clustering that caused the issue. On the other hand, since this issue has been happening about once per month, if the problem goes away, we probably won't be able to risk having it come back.
 
Yes, our main switch is managed, and for some reason it doesn't happen there. But we have sub-groups of servers that we wanted on private switches and we didn't need managed switches for them. If we had time to replace the switches with managed switches, we'd do that, but we don't have time for that. It seems the broadcast storm is somehow being caused by the locked up server (rebooting the one server instantly brought the other servers in the cluster back online) and servers that were not part of the cluster on that same switch were unaffected.

I appreciate the advice, but for now, we're going to just remove these servers from the cluster as we can't risk the problem coming back (http://forum.proxmox.com/threads/17682-Safely-remove-node-from-cluster-without-deleting-containers).