Sudden reboot after upgrade to 4.1

gosha

A 3-node Proxmox VE 4.0 cluster (HP DL380 Gen8) was upgraded to 4.1.
Before the upgrade, all VMs were stopped by setting them to disabled in HA, and all nodes were rebooted after the upgrade.
HA then started all VMs. All OK.

I then load-tested the Ceph storage with two VMs placed on different servers, each running an HDD test program.

After about 35-40 minutes the GUI suddenly stopped working... :(
It turned out that all three servers had restarted...
Nothing suspicious was found in syslog on any of the servers, only the sudden reboot:

pic2.png

The iLO log shows the same:

pic1.png

How can I identify the problem? o_O

P.S.
# pveversion -v
proxmox-ve: 4.1-26 (running kernel: 4.2.6-1-pve)
pve-manager: 4.1-1 (running version: 4.1-1/2f9650d4)
pve-kernel-4.2.6-1-pve: 4.2.6-26
pve-kernel-4.2.3-2-pve: 4.2.3-22
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 0.17.2-1
pve-cluster: 4.0-29
qemu-server: 4.0-41
pve-firmware: 1.1-7
libpve-common-perl: 4.0-41
libpve-access-control: 4.0-10
libpve-storage-perl: 4.0-38
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-17
pve-container: 1.0-33
pve-firewall: 2.0-14
pve-ha-manager: 1.0-14
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-5
lxcfs: 0.13-pve1
cgmanager: 0.39-pve1
criu: 1.6.0-1
fence-agents-pve: 4.0.20-1
 
Are there any corosync-related logs in syslog?
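
One way to pull just those lines out of a saved syslog is a small filter like the one below. It is only a sketch: the daemon names are the ones that show up in this thread, `cluster_lines` is a made-up helper name, and the regex assumes the classic `Mon DD HH:MM:SS host proc[pid]:` syslog prefix.

```python
import re

# Daemon names as they appear in the log excerpts in this thread (assumption:
# these are the only ones of interest here).
DAEMONS = ("corosync", "pmxcfs", "pve-ha-crm")

# Classic syslog prefix, e.g. "Dec 15 20:47:25 cn1 corosync[2364]: message"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\s"
    r"(?P<proc>[\w.-]+)(?:\[\d+\])?:\s(?P<msg>.*)$"
)


def cluster_lines(lines, daemons=DAEMONS):
    """Yield (timestamp, host, process, message) for cluster-related lines."""
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m and m.group("proc") in daemons:
            yield m.group("ts"), m.group("host"), m.group("proc"), m.group("msg")
```

Passing it an open file handle for the node's syslog (the exact path depends on the setup) yields only the corosync, pmxcfs and HA manager entries.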

A few minutes before the reboot:

Dec 15 20:47:25 cn1 corosync[2364]: [TOTEM ] A new membership (192.168.0.190:5556) was formed. Members
Dec 15 20:47:25 cn1 corosync[2364]: [QUORUM] Members[2]: 3 1
Dec 15 20:47:25 cn1 corosync[2364]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 15 20:47:25 cn1 pve-ha-crm[2641]: node 'cn2': state changed from 'online' => 'unknown'
Dec 15 20:47:28 cn1 corosync[2364]: [TOTEM ] A new membership (192.168.0.190:5560) was formed. Members joined: 2
Dec 15 20:47:28 cn1 pmxcfs[2205]: [status] notice: members: 1/2205, 2/2068, 3/2110
Dec 15 20:47:28 cn1 pmxcfs[2205]: [status] notice: starting data syncronisation
Dec 15 20:47:28 cn1 corosync[2364]: [QUORUM] Members[3]: 3 2 1
Dec 15 20:47:28 cn1 corosync[2364]: [MAIN ] Completed service synchronization, ready to provide service.
....
Dec 15 20:52:29 cn1 corosync[2364]: [TOTEM ] A new membership (192.168.0.190:5564) was formed. Members left: 2
Dec 15 20:52:29 cn1 corosync[2364]: [TOTEM ] Failed to receive the leave message. failed: 2
Dec 15 20:52:29 cn1 pmxcfs[2205]: [dcdb] notice: members: 1/2205, 3/2110
Dec 15 20:52:29 cn1 pmxcfs[2205]: [dcdb] notice: starting data syncronisation
Dec 15 20:52:29 cn1 corosync[2364]: [QUORUM] Members[2]: 3 1
Dec 15 20:52:29 cn1 corosync[2364]: [MAIN ] Completed service synchronization, ready to provide service.
...
Dec 15 20:52:33 cn1 corosync[2364]: [QUORUM] Members[3]: 3 2 1
Dec 15 20:52:33 cn1 corosync[2364]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 15 20:52:33 cn1 pmxcfs[2205]: [dcdb] notice: cpg_send_message retried 8 times
Dec 15 20:52:33 cn1 pmxcfs[2205]: [status] notice: members: 1/2205, 2/2068, 3/2110
Dec 15 20:52:33 cn1 pmxcfs[2205]: [status] notice: starting data syncronisation
Dec 15 20:52:34 cn1 corosync[2364]: [TOTEM ] A processor failed, forming new configuration.
Dec 15 20:52:36 cn1 corosync[2364]: [TOTEM ] A new membership (192.168.0.190:5572) was formed. Members
Dec 15 20:52:36 cn1 corosync[2364]: [QUORUM] Members[3]: 3 2 1
Dec 15 20:52:36 cn1 corosync[2364]: [MAIN ] Completed service synchronization, ready to provide service.
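
In the excerpt above, node 2 drops out of the membership at 20:47:25 and again at 20:52:29, and rejoins a few seconds later each time. A rough sketch for spotting that pattern in a longer log follows. The function names are hypothetical, and the expected node IDs (1, 2, 3) are taken from the `Members[3]: 3 2 1` lines.

```python
import re

# Matches e.g. "Dec 15 20:47:25 cn1 corosync[2364]: [QUORUM] Members[2]: 3 1"
QUORUM_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\scorosync\[\d+\]:\s"
    r"\[QUORUM\]\sMembers\[(?P<count>\d+)\]:\s(?P<ids>[\d\s]+)$"
)


def quorum_timeline(lines):
    """Return a list of (timestamp, sorted node ids) for each QUORUM line."""
    timeline = []
    for line in lines:
        m = QUORUM_RE.match(line.strip())
        if m:
            ids = sorted(int(i) for i in m.group("ids").split())
            timeline.append((m.group("ts"), ids))
    return timeline


def missing_nodes(timeline, expected=(1, 2, 3)):
    """Return (timestamp, missing ids) for entries that lack an expected node."""
    result = []
    for ts, ids in timeline:
        missing = sorted(set(expected) - set(ids))
        if missing:
            result.append((ts, missing))
    return result
```

Run over the cn1 lines quoted here, `missing_nodes` would report node 2 as absent at both timestamps. That only shows when the membership changed, not why.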

--
Best regards!
Gosha
 
About two hours of disk speed testing inside two VMs simultaneously (both Windows Server 2008 R2)...

pic4.png

Flight normal. No reboots! :)

Was I visited by a ghost? :)
 
The same on cn3...
...
Dec 15 20:52:32 cn3 corosync[2318]: [TOTEM ] A new membership (192.168.0.190:5568) was formed. Members joined: 2
Dec 15 20:52:32 cn3 pmxcfs[2110]: [dcdb] notice: members: 1/2205, 2/2068, 3/2110
Dec 15 20:52:32 cn3 pmxcfs[2110]: [dcdb] notice: starting data syncronisation
Dec 15 20:52:32 cn3 pmxcfs[2110]: [status] notice: members: 1/2205, 2/2068, 3/2110
Dec 15 20:52:32 cn3 pmxcfs[2110]: [status] notice: starting data syncronisation
Dec 15 20:52:33 cn3 corosync[2318]: [QUORUM] Members[3]: 3 2 1
Dec 15 20:52:33 cn3 corosync[2318]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 15 20:52:33 cn3 pmxcfs[2110]: [dcdb] notice: cpg_send_message retried 3 times
Dec 15 20:52:34 cn3 corosync[2318]: [TOTEM ] A processor failed, forming new configuration.
Dec 15 20:52:36 cn3 corosync[2318]: [TOTEM ] A new membership (192.168.0.190:5572) was formed. Members
Dec 15 20:52:36 cn3 corosync[2318]: [QUORUM] Members[3]: 3 2 1
Dec 15 20:52:36 cn3 corosync[2318]: [MAIN ] Completed service synchronization, ready to provide service.
...
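
Both cn1 and cn3 log "A processor failed" at 20:52:34 and a new membership at 20:52:36, so it can help to read the nodes' logs side by side. Below is a minimal sketch that interleaves lines from several nodes by timestamp. `merge_node_logs` and `parse_ts` are made-up names, and since classic syslog timestamps carry no year, the caller has to supply one (an assumption on the caller's part).

```python
from datetime import datetime


def parse_ts(line, year):
    # The first 15 characters of a classic syslog line hold "Mon DD HH:MM:SS".
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def merge_node_logs(logs, year):
    """Merge per-node syslog lines into one list sorted by time.

    logs: dict mapping node name -> iterable of syslog lines.
    Returns a list of (datetime, node, line) tuples.
    """
    entries = []
    for node, lines in logs.items():
        for line in lines:
            try:
                ts = parse_ts(line, year)
            except ValueError:
                continue  # skip elisions such as "..." and other unparsable lines
            entries.append((ts, node, line.rstrip()))
    # Sort by time first, then by node name so that ties are ordered stably.
    entries.sort(key=lambda e: (e[0], e[1]))
    return entries
```

The resolution is one second, so the order of events within the same second across nodes is not meaningful.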