Proxmox issues with HP Blade C7000 - BL460c G8

itvietnam

Renowned Member
Aug 11, 2015
Hi,

I have a few HP C7000 blade enclosures and I use them to host a Proxmox cluster. My cluster has 8 nodes (is this a reliable quorum?) and sometimes all nodes hosted on the blade enclosure reboot at once. Other servers in the same enclosure are not affected.


Two weeks ago I decided to add one Dell PowerEdge R620 and migrate some important customers to this node to avoid this failure.

For the first few incidents I checked and thought this was an HP firmware problem, but after a few more incidents I connected all the related information and figured out this only happens to Proxmox nodes.

We have 16 nodes in this blade enclosure:
  • 6 servers run Virtuozzo (formerly Parallels Cloud Server): this cluster is still up to this day.
  • 3 servers run CentOS 7: no impact.
  • 7 servers run Proxmox: all of these nodes went down in the same incident.
Yesterday it happened again and I checked the syslog: log_node01.txt

From another cluster member, node02, the log is shown in the attachment: log_node02.txt

From the log of node hv102 I could see this error:

Code:
Jan 29 17:31:32 mycluster-hv102 systemd[1]: Started udev Coldplug all Devices.

My Proxmox version:

Code:
root@hv101:~# pveversion -v
proxmox-ve: 5.0-19 (running kernel: 4.10.17-2-pve)
pve-manager: 5.0-30 (running version: 5.0-30/5ab26bc)
pve-kernel-4.10.17-2-pve: 4.10.17-19
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve3
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-12
qemu-server: 5.0-15
pve-firmware: 2.0-2
libpve-common-perl: 5.0-16
libpve-guest-common-perl: 2.0-11
libpve-access-control: 5.0-6
libpve-storage-perl: 5.0-14
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.0-9
pve-qemu-kvm: 2.9.0-3
pve-container: 2.0-15
pve-firewall: 3.0-2
pve-ha-manager: 2.0-2
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.0.8-3
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.6.5.9-pve16~bpo90
openvswitch-switch: 2.6.2~pre+git20161223-3
root@hv101:~#

From my experience, this happens only to servers running Proxmox VE.

Does anyone have experience with this and know how to get past these issues? Is this a conflict caused by Proxmox software?

Thanks,
 


First, please upgrade; you are running an outdated version.
Jan 29 17:27:54 mycluster-hv101 corosync[2734]: warning [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
Second, it seems your network is causing problems. Did you check whether multicast works properly? Or maybe the network was overloaded at those times?
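A minimal sketch of such a multicast check using omping (the tool the Proxmox cluster documentation suggests for this), assuming the hostnames hv101 and hv102 from this thread and that omping is available on all nodes; run the same command on every node at roughly the same time:

Code:
# install omping if it is not present (available in the Debian/Proxmox repositories)
apt install omping

# short burst test between the listed nodes (run on each node simultaneously)
omping -c 10000 -i 0.001 -F -q hv101 hv102

# longer ~10 minute test to catch IGMP querier / multicast snooping timeouts
omping -c 600 -i 1 -q hv101 hv102

If the longer test starts losing multicast packets after a few minutes, IGMP snooping on the blade enclosure's interconnect switches is a common suspect.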
 
First, please upgrade; you are running an outdated version.
Can I upgrade straight to 5.1.3x? Do I have to migrate VMs/CTs to other nodes and perform a rolling upgrade?
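As a rough sketch (assuming the PVE 5.x no-subscription or enterprise repository is already configured), an in-place upgrade within the 5.x series is done per node with apt; migrating guests off a node first and upgrading nodes one at a time is the cautious rolling approach:

Code:
# on each node, one at a time, after migrating guests away:
apt update
apt dist-upgrade

# reboot if a new kernel was installed, then verify
pveversion -v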

maybe the network was overloaded at those times?

Probably. I'm checking the logs now. We had a similar experience when we tested MTU 9000 with the following command and the server rebooted too:

From hv102:
Code:
ping hv101 -c 10 -M do -s 8972
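# -s 8972 bytes of payload + 28 bytes of ICMP/IP headers = a 9000-byte packet
# -M do sets the don't-fragment flag, so the ping fails if any hop's MTU is below 9000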
 
This is the network traffic, but the peak time differs from the incident time: I think this is not the root cause.

And how can the network cause all nodes to reboot, except the Dell server?

[screenshot: network traffic graph]
 
And how can the network cause all nodes to reboot
If you enabled HA (and the logs indicate you did), then an unreliable network can lead to self-fencing of the nodes, namely of all nodes that are not in a quorate partition.
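For reference, a quick sketch of how to see whether HA is active and which resources it manages (standard PVE CLI, assuming a 5.x node); if no HA resources are configured, the watchdog should not fence the node on quorum loss:

Code:
# current HA manager / LRM state and all HA-managed resources
ha-manager status

# configured HA resources
ha-manager config

# corosync membership and quorum as seen from this node
pvecm status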
 
