Proxmox issues with HP Blade C7000 - BL460c G8

itvietnam

Renowned Member
Aug 11, 2015
Hi,

I have a few HP C7000 blade enclosures and I use them to host a Proxmox cluster. My cluster has 8 nodes (is this a reliable quorum?) and sometimes all nodes hosted on the blade enclosure reboot at once. Other servers in the same enclosure are not affected.


Two weeks ago I decided to add one Dell PowerEdge R620 and migrate some important customers to this node to avoid this failure.

For the first few incidents I checked and thought this was an HP firmware problem, but after a few more incidents I connected all the related information and figured out this only happens to Proxmox nodes.

We have 16 nodes in this blade enclosure:
  • 6 servers run Virtuozzo (formerly Parallels Cloud Server): this cluster is still up to this day.
  • 3 servers run CentOS 7: no impact.
  • 7 servers run Proxmox: all of these nodes went down in the same incident.
Yesterday it happened again and I checked the syslog: log_node01.txt

From another cluster member, node02, the log is shown in the attachment: log_node02.txt

From the log of node hv102 I could see this error:

Code:
Jan 29 17:31:32 mycluster-hv102 systemd[1]: Started udev Coldplug all Devices.

My Proxmox version:

Code:
root@hv101:~# pveversion -v
proxmox-ve: 5.0-19 (running kernel: 4.10.17-2-pve)
pve-manager: 5.0-30 (running version: 5.0-30/5ab26bc)
pve-kernel-4.10.17-2-pve: 4.10.17-19
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve3
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-12
qemu-server: 5.0-15
pve-firmware: 2.0-2
libpve-common-perl: 5.0-16
libpve-guest-common-perl: 2.0-11
libpve-access-control: 5.0-6
libpve-storage-perl: 5.0-14
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.0-9
pve-qemu-kvm: 2.9.0-3
pve-container: 2.0-15
pve-firewall: 3.0-2
pve-ha-manager: 2.0-2
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.0.8-3
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.6.5.9-pve16~bpo90
openvswitch-switch: 2.6.2~pre+git20161223-3
root@hv101:~#

From my experience, this happens only to servers running Proxmox VE.

Does anyone have experience with this and know how to get past these issues? Is this a conflict caused by Proxmox software?

Thanks,
 


First, please upgrade; you are running an outdated version.
Jan 29 17:27:54 mycluster-hv101 corosync[2734]: warning [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
Second, it seems your network is causing problems. Did you check whether multicast works properly? Or maybe the network was overloaded at those times?
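A minimal sketch of such a multicast check using omping (the tool the Proxmox cluster documentation suggests for this), assuming the hostnames hv101 and hv102 from this thread and that omping is available on all nodes; run the same command on every node at roughly the same time:

Code:
# install omping if it is not present (available in the Debian/Proxmox repositories)
apt install omping

# short burst test between the listed nodes (run on each node simultaneously)
omping -c 10000 -i 0.001 -F -q hv101 hv102

# longer ~10 minute test to catch IGMP querier / multicast snooping timeouts
omping -c 600 -i 1 -q hv101 hv102

If the longer test starts losing multicast packets after a few minutes, IGMP snooping on the blade enclosure's interconnect switches is a common suspect.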
 
First, please upgrade; you are running an outdated version.
Can I upgrade straight to 5.1.3x? Do I have to migrate VMs/CTs to other nodes and perform a rolling upgrade?
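As a rough sketch (assuming the PVE 5.x no-subscription or enterprise repository is already configured), an in-place upgrade within the 5.x series is done per node with apt; migrating guests off a node first and upgrading nodes one at a time is the cautious rolling approach:

Code:
# on each node, one at a time, after migrating guests away:
apt update
apt dist-upgrade

# reboot if a new kernel was installed, then verify
pveversion -v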

maybe the network was overloaded at those times?

Probably. I'm checking the logs now. We had a similar experience when we tested MTU 9000 with the following command and the server rebooted too:

From hv102:
Code:
ping hv101 -c 10 -M do -s 8972
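# -s 8972 bytes of payload + 28 bytes of ICMP/IP headers = a 9000-byte packet
# -M do sets the don't-fragment flag, so the ping fails if any hop's MTU is below 9000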
 
This is the network traffic, but the peak time differs from the incident time: I think this is not the root cause.

And how can the network cause all nodes to reboot, except the Dell server?

[screenshot: network traffic graph]
 
And how can the network cause all nodes to reboot
If you enabled HA (and the logs indicate you did), then an unreliable network can lead to self-fencing of the nodes, namely of all nodes that are not in a quorate partition.
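For reference, a quick sketch of how to see whether HA is active and which resources it manages (standard PVE CLI, assuming a 5.x node); if no HA resources are configured, the watchdog should not fence the node on quorum loss:

Code:
# current HA manager / LRM state and all HA-managed resources
ha-manager status

# configured HA resources
ha-manager config

# corosync membership and quorum as seen from this node
pvecm status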
 
