high load after processor failed

sir.jan

New Member
Jan 2, 2012
Germany
Hello,

I have a strange problem after the upgrade from 2.1 to 2.3.

In my setup there are two servers in a cluster. One of them
works without problems, but several times a day the other
server has a really high load (3-5) and all clients on this host
freeze.

The only way to get the server working again is to restart it.

This server has a RAID5 on SAS HDDs; since I added "elevator=deadline" to
the GRUB kernel command line, the problem shows itself less often.
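For reference, this is roughly how the scheduler option is set on a GRUB 2 system; a hedged sketch only, since the exact file and existing options depend on the host (the "quiet" flag here is just a placeholder for whatever is already configured):

```shell
# /etc/default/grub -- append elevator=deadline to the kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet elevator=deadline"

# Afterwards, regenerate the GRUB config and reboot for it to take effect:
#   update-grub
#   reboot
```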

In syslog I found these entries, and I think the server's problems started after that.

Code:
May  1 04:34:14 desokvm1 pvestatd[1934]: WARNING: closeing with write buffer at /usr/share/perl5/IO/Multiplex.pm line 913.
May  1 05:13:21 desokvm1 corosync[1565]:   [TOTEM ] A processor failed, forming new configuration.
May  1 05:13:30 desokvm1 corosync[1565]:   [CLM   ] CLM CONFIGURATION CHANGE
May  1 05:13:30 desokvm1 corosync[1565]:   [CLM   ] New Configuration:
May  1 05:13:30 desokvm1 corosync[1565]:   [CLM   ] #011r(0) ip(10.0.3.1) 
May  1 05:13:30 desokvm1 corosync[1565]:   [CLM   ] #011r(0) ip(10.0.3.2) 
May  1 05:13:30 desokvm1 corosync[1565]:   [CLM   ] Members Left:
May  1 05:13:30 desokvm1 corosync[1565]:   [CLM   ] Members Joined:
May  1 05:13:30 desokvm1 corosync[1565]:   [CLM   ] CLM CONFIGURATION CHANGE
May  1 05:13:30 desokvm1 corosync[1565]:   [CLM   ] New Configuration:
May  1 05:13:30 desokvm1 corosync[1565]:   [CLM   ] #011r(0) ip(10.0.3.1) 
May  1 05:13:30 desokvm1 corosync[1565]:   [CLM   ] #011r(0) ip(10.0.3.2) 
May  1 05:13:30 desokvm1 corosync[1565]:   [CLM   ] Members Left:
May  1 05:13:30 desokvm1 corosync[1565]:   [CLM   ] Members Joined:
May  1 05:13:30 desokvm1 corosync[1565]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
May  1 05:13:32 desokvm1 pmxcfs[1434]: [status] notice: cpg_send_message retry 10
May  1 05:13:33 desokvm1 pmxcfs[1434]: [status] notice: cpg_send_message retry 20
May  1 05:13:34 desokvm1 pmxcfs[1434]: [status] notice: cpg_send_message retry 30
May  1 05:13:35 desokvm1 pmxcfs[1434]: [status] notice: cpg_send_message retry 40
May  1 05:13:36 desokvm1 pmxcfs[1434]: [status] notice: cpg_send_message retry 50
May  1 05:13:37 desokvm1 pmxcfs[1434]: [status] notice: cpg_send_message retry 60
May  1 05:13:38 desokvm1 pmxcfs[1434]: [status] notice: cpg_send_message retry 70
May  1 05:13:39 desokvm1 pmxcfs[1434]: [status] notice: cpg_send_message retry 80
May  1 05:13:40 desokvm1 pmxcfs[1434]: [status] notice: cpg_send_message retry 90
May  1 05:13:41 desokvm1 pmxcfs[1434]: [status] notice: cpg_send_message retry 100
May  1 05:13:41 desokvm1 pmxcfs[1434]: [dcdb] notice: cpg_send_message retried 100 times
May  1 05:13:41 desokvm1 pmxcfs[1434]: [status] crit: cpg_send_message failed: 6
May  1 05:13:42 desokvm1 pmxcfs[1434]: [status] notice: cpg_send_message retry 10
May  1 05:13:43 desokvm1 pmxcfs[1434]: [status] notice: cpg_send_message retry 20
May  1 05:13:44 desokvm1 pmxcfs[1434]: [status] notice: cpg_send_message retry 30
May  1 05:13:45 desokvm1 pmxcfs[1434]: [status] notice: cpg_send_message retry 40
May  1 05:13:46 desokvm1 pmxcfs[1434]: [status] notice: cpg_send_message retry 50
May  1 05:13:47 desokvm1 pmxcfs[1434]: [status] notice: cpg_send_message retry 60
May  1 05:13:48 desokvm1 pmxcfs[1434]: [status] notice: cpg_send_message retry 70
May  1 05:13:49 desokvm1 pmxcfs[1434]: [status] notice: cpg_send_message retry 80
May  1 05:13:50 desokvm1 pmxcfs[1434]: [status] notice: cpg_send_message retry 90
May  1 05:13:51 desokvm1 pmxcfs[1434]: [status] notice: cpg_send_message retry 100
May  1 05:13:51 desokvm1 pmxcfs[1434]: [dcdb] notice: cpg_send_message retried 100 times
May  1 05:13:51 desokvm1 pmxcfs[1434]: [status] crit: cpg_send_message failed: 6
May  1 05:13:52 desokvm1 pmxcfs[1434]: [status] notice: cpg_send_message retry 10
May  1 05:13:53 desokvm1 pmxcfs[1434]: [status] notice: cpg_send_message retry 20
May  1 05:13:54 desokvm1 corosync[1565]:   [CPG   ] chosen downlist: sender r(0) ip(10.0.3.1) ; members(old:2 left:0)
May  1 05:13:54 desokvm1 pmxcfs[1434]: [status] notice: cpg_send_message retry 30
May  1 05:13:55 desokvm1 pmxcfs[1434]: [status] notice: cpg_send_message retry 40
May  1 05:13:56 desokvm1 pmxcfs[1434]: [status] notice: cpg_send_message retry 50
May  1 05:13:57 desokvm1 pmxcfs[1434]: [status] notice: cpg_send_message retry 60
May  1 05:13:58 desokvm1 pmxcfs[1434]: [status] notice: cpg_send_message retry 70
May  1 05:13:59 desokvm1 corosync[1565]:   [MAIN  ] Completed service synchronization, ready to provide service.
May  1 05:13:59 desokvm1 pmxcfs[1434]: [dcdb] notice: cpg_send_message retried 75 times

Does anyone know what I can do to solve the problem?
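In case it helps with narrowing this down: one simple way to see how often pmxcfs actually gives up (rather than just retrying) is to count the "crit" lines in syslog. A minimal sketch, using a hypothetical sample file that mirrors the excerpt above; on a real host you would point grep at /var/log/syslog instead:

```shell
# Hypothetical sample mirroring the syslog excerpt above
cat > /tmp/pmxcfs-sample.log <<'EOF'
May  1 05:13:41 desokvm1 pmxcfs[1434]: [status] crit: cpg_send_message failed: 6
May  1 05:13:51 desokvm1 pmxcfs[1434]: [status] crit: cpg_send_message failed: 6
May  1 05:13:59 desokvm1 corosync[1565]:   [MAIN  ] Completed service synchronization, ready to provide service.
EOF

# Count how often pmxcfs gave up after exhausting its retries
grep -c 'cpg_send_message failed' /tmp/pmxcfs-sample.log
```

Correlating the timestamps of these failures with the load spikes would show whether the cluster filesystem stalls and the high load are the same event.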
 
May 1 05:13:30 desokvm1 corosync[1565]: [TOTEM ] A processor joined or left the membership and a new membership was formed.

This message is always displayed when a server in the cluster reboots.

Do you have a particular VM on that server that has a high load and drags down the rest?
 
This message is always displayed when a server in the cluster reboots.

Do you have a particular VM on that server that has a high load and drags down the rest?

No, the VMs running on that server all look good.

And do you run the latest version?
Yes, I've installed the latest versions.

Code:
# pveversion -v
pve-manager: 2.3-13 (pve-manager/2.3/7946f1f1)
running kernel: 2.6.32-19-pve
proxmox-ve-2.6.32: 2.3-95
pve-kernel-2.6.32-19-pve: 2.6.32-95
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.4-4
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.93-2
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.9-1
pve-cluster: 1.0-36
qemu-server: 2.3-20
pve-firmware: 1.0-21
libpve-common-perl: 1.0-49
libpve-access-control: 1.0-26
libpve-storage-perl: 2.3-7
vncterm: 1.0-4
vzctl: 4.0-1pve2
vzprocps: 2.0.11-2
vzquota: 3.1-1
pve-qemu-kvm: 1.4-10
ksm-control-daemon: 1.1-1