hypervisors falling down

alexskysilk

Occasionally, with no observable cause, random hypervisors in the cluster crash. The fallen member does get fenced properly and reboots normally, but the fact that it keeps happening (and not on the same nodes) is troubling.

The system is not memory starved, so I was assuming the mempool_alloc_slab messages were resource specific (container cgroup?), but that shouldn't cause the kernel to crap out. Now I'm thinking it's related to a specific kernel module. (To reiterate, it happens on more than one node, so a hardware fault is unlikely.)
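
A rough way to see how often those messages show up on each node is to count them per log file. This is only a sketch: the helper name `count_mempool_msgs` and the per-node file names in the usage comment are hypothetical, and it assumes each node's /var/log/messages has been copied somewhere local.

```shell
# Hypothetical helper: count the lines in a log file that mention
# mempool_alloc_slab. Prints 0 when there are no matches.
count_mempool_msgs() {
    grep -c 'mempool_alloc_slab' "$1"
}

# Usage (file names are assumptions, one copy of /var/log/messages per node):
#   for f in node1.messages node2.messages; do
#       printf '%s: %s\n' "$f" "$(count_mempool_msgs "$f")"
#   done
```

If the counts are similar on nodes that crashed and nodes that did not, the messages are probably not the trigger.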

help!

details:
/var/log/messages (attached)

Code:
# pveversion -v
proxmox-ve: 4.4-84 (running kernel: 4.4.44-1-pve)
pve-manager: 4.4-13 (running version: 4.4-13/7ea56165)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.35-2-pve: 4.4.35-79
pve-kernel-4.4.44-1-pve: 4.4.44-84
pve-kernel-4.4.40-1-pve: 4.4.40-82
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-2~pve4+1
libqb0: 1.0.1-1
pve-cluster: 4.0-49
qemu-server: 4.0-110
pve-firmware: 1.1-11
libpve-common-perl: 4.0-94
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-2
pve-docs: 4.4-4
pve-qemu-kvm: 2.7.1-4
pve-container: 1.0-99
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-4
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-9
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80
ceph: 10.2.7-1~bpo80+1

# free
             total       used       free     shared    buffers     cached
Mem:     131956756   39325664   92631092      76404       8184   25803132
-/+ buffers/cache:   13514348  118442408
Swap:            0          0          0
 


Hi,

Please update to the current version and kernel.
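
To check whether a node is already running the newest kernel it has installed, something like the following may help. It is a minimal sketch: the helper name `newest_pve_kernel` is hypothetical, and it assumes the `pve-kernel-<version>: ...` line format shown in the `pveversion -v` output above.

```shell
# Hypothetical helper: read `pveversion -v` output on stdin and print the
# highest installed pve-kernel version (version-sorted with sort -V).
newest_pve_kernel() {
    grep -o '^pve-kernel-[^:]*' | sed 's/^pve-kernel-//' | sort -V | tail -n 1
}

# Usage on a node: compare the newest installed kernel with the running one.
#   pveversion -v | newest_pve_kernel
#   uname -r
```

If the two values differ, the node has a newer kernel installed but has not been rebooted into it yet.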