Kernel panic

gosha

A 3-node Proxmox 4.0 cluster (HP DL380 Gen8).

pic1.png

I needed to stop one of the servers.
I migrated all VMs from this server to another one and pressed the Shutdown button in the GUI.
After servicing the server, I turned it back on. The server booted normally, but after a short time it stopped responding (in the GUI).
On the server's console (via iLO), I found a kernel panic (see picture):

pic2.png

I rebooted the server, and then it worked fine.
I repeated the same steps on another server and also got a kernel panic after booting, with normal operation after another reboot. :(
I repeated them on the third server... the same kernel panic...

On Proxmox 3.x, such situations never happened on these same servers...

Why is this happening? Is this the new fencing? :confused:
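
(For reference, the migration was done in the GUI; the CLI equivalent should be something like the following, with a hypothetical VMID and target node:

# qm migrate 100 cn2 --online

i.e. live-migrate the running VM 100 to node cn2 before shutting this node down.)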
 
# pveversion -v
proxmox-ve: 4.0-22 (running kernel: 4.2.3-2-pve)
pve-manager: 4.0-57 (running version: 4.0-57/cc7c2b53)
pve-kernel-4.2.3-2-pve: 4.2.3-22
lvm2: 2.02.116-pve1
corosync-pve: 2.3.5-1
libqb0: 0.17.2-1
pve-cluster: 4.0-24
qemu-server: 4.0-35
pve-firmware: 1.1-7
libpve-common-perl: 4.0-36
libpve-access-control: 4.0-9
libpve-storage-perl: 4.0-29
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-12
pve-container: 1.0-21
pve-firewall: 2.0-13
pve-ha-manager: 1.0-13
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.4-3
lxcfs: 0.10-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
fence-agents-pve: not correctly installed
^^^^^

Wow!!! :confused:
 
We uploaded a new package to the pve-no-subscription repository. Can you please test with the new kernel?
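
(A minimal sketch of pulling that update, assuming the pve-no-subscription repository is already configured for Debian Jessie; the file name below is hypothetical, the entry may also sit directly in /etc/apt/sources.list:

# cat /etc/apt/sources.list.d/pve-no-subscription.list
deb http://download.proxmox.com/debian jessie pve-no-subscription
# apt-get update && apt-get dist-upgrade
# reboot

The reboot is needed so that the new pve-kernel actually becomes the running kernel.)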
 
I just installed the latest updates and rebooted all servers:

# pveversion -v
proxmox-ve: 4.1-25 (running kernel: 4.2.6-1-pve)
pve-manager: 4.0-64 (running version: 4.0-64/fc76ac6c)
pve-kernel-4.2.6-1-pve: 4.2.6-25
pve-kernel-4.2.3-2-pve: 4.2.3-22
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 0.17.2-1
pve-cluster: 4.0-29
qemu-server: 4.0-39
pve-firmware: 1.1-7
libpve-common-perl: 4.0-41
libpve-access-control: 4.0-10
libpve-storage-perl: 4.0-37
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-16
pve-container: 1.0-32
pve-firewall: 2.0-14
pve-ha-manager: 1.0-14
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-5
lxcfs: 0.13-pve1
cgmanager: 0.39-pve1
criu: 1.6.0-1
fence-agents-pve: not correctly installed


I tried to shut down and boot one of the servers again, and the kernel panic did not happen.
But I again see "fence-agents-pve: not correctly installed" on the last line. :(
Is this normal?
 
I tried to shut down and boot one of the servers again, and the kernel panic did not happen.

great

But I again see "fence-agents-pve: not correctly installed" on the last line. :(
Is this normal?

That is a useless warning, because the package is currently not used. Either ignore the warning, or install the package:

# apt-get install fence-agents-pve

I will fix that with the next release.
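
(To double-check the package state afterwards, dpkg can be queried directly; pveversion presumably derives the "not correctly installed" message from this status:

# dpkg -s fence-agents-pve | grep Status
Status: install ok installed

The Status line above is what a correctly installed package is expected to report.)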
 
Dietmar, thanks!

I installed this package and...

# pveversion -v
....
fence-agents-pve: 4.0.20-1

---
Best Regards!
Gosha
 
Gosha, we had the same problem. Could you please confirm that the latest version resolves it?

Have you tried migrating VMs from one node to another using this latest version? No kernel panics or reboots?

Thanks!
H
 

Hi!

I tried to shut down and start one server again, but only once, and I did not get a kernel panic.
I will try to repeat these steps several times in the near future.

I also just tried to online-migrate a VM (twice). It's OK. See picture.

pic3.png

No kernel panic and reboots.

--
Best regards!
 
Hi!

Today I shut down one server (cn1) again.
A few minutes later, all running VMs on the remaining servers (cn2, cn3) were stopped:

pic1.png

All the VMs were stopped with "stop" (not "shutdown").
This is very bad...
:(


P.S.
I booted the first server and, a few minutes later (after the Ceph storage had recovered), shut it down again; then the two remaining servers rebooted...

pic2.png

Horror! :(
 

Third attempt.

I repeated the procedure: booted the first server and, a few minutes later (after the Ceph storage had recovered), shut it down again.
This time the remaining servers were fine. No stopped VMs, no reboots.

Hm...
:confused:
In summary: the latest update solved the kernel panic problem, but it does not solve the problem of the servers restarting (maybe the watchdog is misbehaving?).
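
(One way to look at the watchdog side, assuming a default PVE 4.x setup where the HA stack feeds the software watchdog via the watchdog-mux service:

# systemctl status watchdog-mux
# journalctl -b -u watchdog-mux
# lsmod | grep softdog

If the HA services stop updating the watchdog, the node is expected to be hard-reset by it; the logs should show whether that is what happened.)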
 
Maybe you were just connected to the dead node? A shutdown of one node does not stop VMs on the other nodes.
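
(One way to test that hypothesis, assuming the reboots are quorum-related: after shutting down one node, check quorum on a surviving node with

# pvecm status

A 3-node cluster should stay quorate with 2 of 3 votes while one node is down; if it does not, self-fencing of the remaining HA nodes would explain the reboots.)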


Maybe this is what happened: after shutting down cn1, I stepped away for a while. Suppose that while I was away the remaining two servers rebooted,
and HA had not yet had time to restart the VMs when I came back...
However, after my return I spent a lot of time writing a forum message and taking screenshots... so HA would have had time to start all the VMs by then...

For now I cannot come up with an explanation for this situation... :confused:
 

In addition to the above...
Maybe after the two servers booted, HA did not have time to start the VMs because of the Ceph storage recovery, which also takes some time (my OSDs have no SSD journal...).
:confused:
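
(A minimal way to watch that recovery with standard Ceph commands:

# ceph -s
# ceph -w

ceph -s shows whether there are still degraded or recovering PGs; ceph -w follows the recovery progress live.)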
 
You did not even mention that you use HA in the initial post, so I have no real idea what you are doing.

Sorry. I did not know that it mattered...
I assumed that a cluster without HA makes no sense... at least for me...
Do I really have to go without automatically migrated VMs whenever a server is stopped?
:(
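
(For context, a minimal sketch of putting a VM under HA on PVE 4.x, with a hypothetical VMID:

# ha-manager add vm:100
# ha-manager status

Only resources added this way are restarted on another node after a failure; VMs that are not HA resources stay down until started manually.)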
 
A cluster without HA still gives you:
- a single management UI
- live/offline VM migration

To be honest, I see platform-level HA as ancient and obsolete. Automatically starting the VM somewhere else is pretty useless in most cases I've encountered (of course not all). Examples:

- Databases: you do app-level HA (master/slave, master/master, or replica sets)
- Stateless apps: why wouldn't you start multiple instances from the beginning?
- Stateful realtime apps (e.g. PBX): you will lose state (current calls) anyway, but you can start multiple instances from the beginning

What's more, with shared storage, why would you risk a crashed hypervisor corrupting the single data instance shared between your HA instances?

I may be wrong, of course, but that's my view on HA with regard to the apps I've worked with over the years.
 
A cluster without HA still gives you:
- a single management UI
- live/offline VM migration

HA is one of the key features. In my last messages I described a problem that is most likely related to node fencing, and fencing is an essential part of Proxmox VE HA.
It's true that I did not mention using HA, but I also did not mention using the cluster without HA. And I did not know that this would matter.
And such a response seems strange to me: "so I have no real idea what you are doing." :(

 
And such a response seems strange to me: "so I have no real idea what you are doing." :(

The topic is "Kernel panic" - I guess that problem is solved?

I suggest you open a new topic for the HA-related problem, describing exactly what you do, what behaviour you expect, and what you think the bug is.
 
