Occasionally, with no observable cause, a random hypervisor crashes in the cluster. The fallen member does get fenced properly and reboots normally, but the fact that it's happening continuously (and not on the same nodes) is troubling.
The system is not memory starved, so I was assuming the mempool_alloc_slab messages were resource specific (container cgroup?), but that shouldn't cause the kernel to crap out. Now I'm thinking it's specific to a kernel module (to reiterate, it happens on more than one node, so a hardware fault is unlikely).
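In case it helps anyone reproduce the check: here is a quick sketch of how I'm ruling out per-container memory pressure, assuming the cgroup v1 layout PVE 4.x uses (LXC memory cgroups under /sys/fs/cgroup/memory/lxc/&lt;ctid&gt;). The container ID 101 is just a placeholder.

```shell
# Hypothetical container ID -- substitute one of your own CT IDs.
CTID=101
CG=/sys/fs/cgroup/memory/lxc/$CTID

# Per-container memory limit and current usage (cgroup v1, PVE 4.x layout)
[ -r "$CG/memory.limit_in_bytes" ] && cat "$CG/memory.limit_in_bytes"
[ -r "$CG/memory.usage_in_bytes" ] && cat "$CG/memory.usage_in_bytes"

# Scan the kernel log for allocation failures around the crash window
grep -nE 'mempool_alloc_slab|page allocation failure|Out of memory' /var/log/messages
```

If usage sits well below the limit and grep finds no allocation failures near the crash timestamps, that points away from cgroup memory pressure.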
help!
details:
/var/log/messages (attached)
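Note that /var/log/messages may be missing the final panic lines, since the node reboots before they hit disk. A sketch of streaming kernel output to another host via netconsole (IPs, MAC, and interface names below are placeholders):

```shell
# Placeholders: replace local/remote IPs, interface, and remote MAC.
# Format: netconsole=local-port@local-ip/iface,remote-port@remote-ip/remote-mac
modprobe netconsole \
    netconsole=6665@10.0.0.5/eth0,6666@10.0.0.2/00:11:22:33:44:55
# On the receiving host, a syslog daemon (or netcat) listening on UDP 6666
# will capture the panic trace even after the sender fences and reboots.
```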
Code:
# pveversion -v
proxmox-ve: 4.4-84 (running kernel: 4.4.44-1-pve)
pve-manager: 4.4-13 (running version: 4.4-13/7ea56165)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.35-2-pve: 4.4.35-79
pve-kernel-4.4.44-1-pve: 4.4.44-84
pve-kernel-4.4.40-1-pve: 4.4.40-82
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-2~pve4+1
libqb0: 1.0.1-1
pve-cluster: 4.0-49
qemu-server: 4.0-110
pve-firmware: 1.1-11
libpve-common-perl: 4.0-94
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-2
pve-docs: 4.4-4
pve-qemu-kvm: 2.7.1-4
pve-container: 1.0-99
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-4
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-9
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80
ceph: 10.2.7-1~bpo80+1
# free
             total       used       free     shared    buffers     cached
Mem:     131956756   39325664   92631092      76404       8184   25803132
-/+ buffers/cache:   13514348  118442408
Swap:            0          0          0