We have a couple dozen Proxmox servers, and about once a month one of them will have a kernel panic and lock up. The worst part about these lockups is that when the affected node is on a separate switch, all the other Proxmox servers on that switch stop responding until we can find the server that has actually crashed and reboot it. When we reported this issue here, we were advised to upgrade to Proxmox 3.1, and we've been in the process of doing that for the past several months. Unfortunately, one of the servers on 3.1 locked up with a kernel panic on Friday, and again all Proxmox servers on that same switch were unreachable until we could locate the crashed server and reboot it. Well, almost all Proxmox servers on the switch... I found it interesting that the Proxmox servers on that same switch that were still on version 1.9 were unaffected.
Here is a screenshot of the console of the crashed server:
Here is a screenshot of what the rest of the unreachable servers were spewing to their consoles (all on the same switch, same version of Proxmox; the master is on a different switch):
Here is the pveversion information from the locked server (the other affected nodes should have the same output, as they were all installed from the same Proxmox ISO):
pveversion -v
proxmox-ve-2.6.32: 3.1-109 (running kernel: 2.6.32-23-pve)
pve-manager: 3.1-3 (running version: 3.1-3/dc0e9b0e)
pve-kernel-2.6.32-23-pve: 2.6.32-109
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.0-1
pve-cluster: 3.0-7
qemu-server: 3.1-1
pve-firmware: 1.0-23
libpve-common-perl: 3.0-6
libpve-access-control: 3.0-6
libpve-storage-perl: 3.0-10
pve-libspice-server1: 0.12.4-1
vncterm: 1.1-4
vzctl: 4.0-1pve3
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.4-17
ksm-control-daemon: 1.1-1
glusterfs-client: 3.4.0-2
Two questions:
1. Any clues as to what would be causing the kernel panic (see the first screenshot)?
2. Why would other servers on the same switch, running the same version of Proxmox, be knocked off the network until the locked server is rebooted? (Note: other servers on the same switch that were running the older version of Proxmox were unaffected. Also, no Proxmox servers in the same 3.1 cluster were affected if they were not on that same switch.)
Thanks,
Curtis