Kernel panic

gosha

Well-Known Member
Oct 20, 2014
Russia
3-node Proxmox 4.0 cluster (HP DL380 Gen8)

pic1.png

I needed to stop one of the servers.
I migrated all VMs from this server to another server and pressed the Shutdown button in the GUI.
After servicing the server, I turned it on again. It booted normally, but after a short time it stopped responding (in the GUI).
In the server's console (via iLO), I found a "Kernel panic" (see pic):

pic2.png

I rebooted the server and then it worked fine.
I repeated the same steps on another server and also got a kernel panic after boot, with normal operation after the next reboot. :(
I repeated them on the third server... the same kernel panic...

Such situations never happened with Proxmox 3.x on the same servers...

Why is this happening? Is this the new fencing? :confused:
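
For the record, the iLO screenshot is the only trace I have of the panic. If journald were persistent (it is not by default on Debian Jessie), the kernel log of the crashed boot could be read back after the reboot with something like:

# journalctl -k -b -1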
 
# pveversion -v
proxmox-ve: 4.0-22 (running kernel: 4.2.3-2-pve)
pve-manager: 4.0-57 (running version: 4.0-57/cc7c2b53)
pve-kernel-4.2.3-2-pve: 4.2.3-22
lvm2: 2.02.116-pve1
corosync-pve: 2.3.5-1
libqb0: 0.17.2-1
pve-cluster: 4.0-24
qemu-server: 4.0-35
pve-firmware: 1.1-7
libpve-common-perl: 4.0-36
libpve-access-control: 4.0-9
libpve-storage-perl: 4.0-29
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-12
pve-container: 1.0-21
pve-firewall: 2.0-13
pve-ha-manager: 1.0-13
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.4-3
lxcfs: 0.10-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
fence-agents-pve: not correctly installed
^^^^^

Wow!!! :confused:
 
We uploaded a new package to the pve-no-subscription repository. Can you please test with the new kernel?
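
To pull it, the usual update on each node should be enough (assuming the pve-no-subscription repository is already configured in /etc/apt/sources.list):

# apt-get update
# apt-get dist-upgrade
# reboot    # needed so the new kernel is actually running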
 
I just installed the latest updates and rebooted all servers:

pveversion -v
proxmox-ve: 4.1-25 (running kernel: 4.2.6-1-pve)
pve-manager: 4.0-64 (running version: 4.0-64/fc76ac6c)
pve-kernel-4.2.6-1-pve: 4.2.6-25
pve-kernel-4.2.3-2-pve: 4.2.3-22
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 0.17.2-1
pve-cluster: 4.0-29
qemu-server: 4.0-39
pve-firmware: 1.1-7
libpve-common-perl: 4.0-41
libpve-access-control: 4.0-10
libpve-storage-perl: 4.0-37
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-16
pve-container: 1.0-32
pve-firewall: 2.0-14
pve-ha-manager: 1.0-14
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-5
lxcfs: 0.13-pve1
cgmanager: 0.39-pve1
criu: 1.6.0-1
fence-agents-pve: not correctly installed


I tried to shut down and boot one of the servers again and the kernel panic did not happen.
But I still see "fence-agents-pve: not correctly installed" in the last line. :(
Is this normal?
 
I tried to shut down and boot one of the servers again and the kernel panic did not happen.

great

But I still see "fence-agents-pve: not correctly installed" in the last line. :(
Is this normal?

That is a useless warning, because the package is currently not used. Either ignore the warning, or install the package:

# apt-get install fence-agents-pve

I will fix that with the next release.
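
A quick way to check whether the package really got installed (plain dpkg, nothing Proxmox-specific):

# dpkg -s fence-agents-pve | grep Status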
 
Dietmar, thanks!

I installed this package and...

# pveversion -v
....
fence-agents-pve: 4.0.20-1

---
Best Regards!
Gosha
 
Gosha, we had the same problem. Could you please confirm that the latest version resolves it?

Have you tried migrating VMs from one node to another using this latest version? No kernel panics or reboots?

Thanks!
H
 
Gosha, we had the same problem. Could you please confirm that the latest version resolves it?

Have you tried migrating VMs from one node to another using this latest version? No kernel panics or reboots?

Thanks!
H

Hi!

I tried to shut down and start one server again, but only once so far. And I did not get the kernel panic.
I will try to repeat these steps several times in the near future.

I just tried to online migrate a VM (two times). It's OK. See picture.

pic3.png

No kernel panics or reboots.
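
For the record, I did the migration from the GUI; the CLI equivalent would be something like this (100 and cn2 are just an example VMID and target node):

# qm migrate 100 cn2 --online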

--
Best regards!
 
Hi!

Today I shut down one server (cn1) again.
A few minutes later, all running VMs on the remaining servers (cn2, cn3) were stopped:

pic1.png

All the VMs were stopped with a hard stop (not a shutdown).
This is very bad...
:(


P.S.
I booted the first server and, a few minutes later (after the ceph storage had recovered), I shut it down again; then the two remaining servers rebooted...

pic2.png

Horror! :(
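
Next time this happens I will try to capture the HA and quorum state right away; as far as I understand, something like this should show what fencing/HA was doing (I hope I have the PVE 4.x unit names right):

# pvecm status
# ha-manager status
# journalctl -b -u pve-ha-crm -u pve-ha-lrm -u watchdog-mux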
 

Third attempt.

I repeated the procedure: booted the first server and, a few minutes later (after the ceph storage had recovered), shut it down again.
The remaining servers stayed normal. No stopped VMs and no reboots.

Hm...
:confused:
As a result, the latest update solved the kernel panic problem, but it does not solve the problem of the servers restarting (maybe the watchdog is not working correctly?).
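
If it really is the watchdog, I guess checking what drives it on my nodes would be a start (softdog is the PVE default unless a hardware watchdog is configured; the module names below are just what I expect to see):

# systemctl status watchdog-mux
# lsmod | grep -e softdog -e ipmi_watchdog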
 
Maybe you were just still connected to the dead node? A shutdown of one node does not stop VMs on other nodes.


Maybe this is what happened: after shutting down cn1, I stepped away for a while. Suppose that while I was away the remaining two servers rebooted,
and HA had not yet had time to start the VMs when I came back...
However, after my return I started writing a message on the forum and took a screenshot, which took a lot of time... HA would have had time to start all the VMs by then...

So far I cannot come up with an explanation for this situation... :confused:
 
Maybe this is what happened: after shutting down cn1, I stepped away for a while. Suppose that while I was away the remaining two servers rebooted,
and HA had not yet had time to start the VMs when I came back...
However, after my return I started writing a message on the forum and took a screenshot, which took a lot of time... HA would have had time to start all the VMs by then...

So far I cannot come up with an explanation for this situation... :confused:

In addition to the above...
Maybe after the two servers booted, HA did not have time to start the VMs because of the ceph storage recovery, which also takes some time (my OSDs have no SSD journal...).
:confused:
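
To check this next time, I plan to watch the recovery progress right after booting the node, for example with the standard ceph CLI:

# ceph -s
# ceph health detail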
 
You did not even mention that you use HA in the initial post, so I have no real idea what you are doing.

Sorry. I did not know that it mattered...
I assumed that a cluster without HA does not make sense... at least for me...
So if I stop a server, will I really be left without automatically migrated VMs?
:(
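
Just to be explicit about what I mean by HA here: the VMs are registered as HA resources, roughly like this (100 is just an example VMID):

# ha-manager add vm:100
# ha-manager status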
 
A cluster without HA still gives you:
- single management UI
- live/offline VM migration

To be honest, I see platform-level HA as ancient and obsolete. Automatically starting the VM somewhere else is pretty useless in most cases I've encountered (of course not all). Examples:

- Databases: you do app-level HA (master/slave, master/master or replica sets)
- Stateless apps: why wouldn't you start multiple instances from the beginning?
- Stateful realtime apps (e.g. PBX): you will lose state (current calls) anyway, but you can start multiple instances from the beginning

Even more, with shared storage, why would you risk a crashed hypervisor corrupting the single data instance shared between the HA instances?

I may be wrong, of course, but that's my view on HA regarding the apps I've touched over the years.
 
A cluster without HA still gives you:
- single management UI
- live/offline VM migration

HA is one of the key features. In my last messages I described a problem that is most likely related to the fencing of the nodes, and fencing is an essential part of Proxmox VE HA.
I indeed did not mention that I was using HA, but I also did not say that I was using the cluster without HA. And I did not know that this would be a problem.
And a response like "so I have no real idea what you are doing" seems strange to me. :(

 
And a response like "so I have no real idea what you are doing" seems strange to me. :(

The topic is "Kernel panic" - I guess that problem is solved?

I suggest you open a new topic for the HA-related problem, describing exactly what you do, what behaviour you expect, and what you think the bug is.