Frequent automatic server restarts

bromac

Good morning,
We have had a problem since moving from Proxmox 3.x to version 4.1.1. We did not upgrade in place: we did a fresh installation and created a new cluster of 4 servers. Now, from time to time (roughly once a week), one of the servers in the cluster restarts by itself. With Proxmox 3.x this never happened; the automatic restarts started after deploying the new version (4.1). I don't know whether it is related, but sometimes web access to a node stops working (connection refused); restarting pveproxy on that machine brings access back. Could someone help explain the problem and analyze the logs? The trouble seems to start at line 1872 of the attached syslog (Apr 13 05:41:24), though perhaps something earlier contributed to the crash.
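For completeness, this is roughly what I do when web access to a node refuses connections (just a sketch; pveproxy is the standard GUI proxy service on PVE 4.x):

Code:
root@RX300S6:~# systemctl status pveproxy     # check whether the proxy is still running
root@RX300S6:~# systemctl restart pveproxy    # web access comes back after this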

Code:
root@RX300S6:/var/log# pvecm status
Quorum information
------------------
Date:             Wed Apr 13 10:30:36 2016
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000005
Ring ID:          1392
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 10.0.2.200
0x00000005          1 10.0.2.201 (local)
0x00000004          1 10.0.2.202
0x00000001          1 10.0.2.203
Code:
root@RX300S6:/etc/pve# cat corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: IBM
    nodeid: 1
    quorum_votes: 1
    ring0_addr: IBM
  }

  node {
    name: RX300S6
    nodeid: 5
    quorum_votes: 1
    ring0_addr: RX300S6
  }

  node {
    name: TX300S6
    nodeid: 2
    quorum_votes: 1
    ring0_addr: TX300S6
  }

  node {
    name: RX300S4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: RX300S4
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: klaster2
  config_version: 8
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 10.0.2.203
    ringnumber: 0
  }

}
 

Attachments

  • syslog.rx300s6.txt (265.5 KB)
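To look at the crash context in the attached file, pulling a window of lines around 1872 is enough (the exact range is just a guess at useful context; adjust as needed):

Code:
sed -n '1850,1900p' syslog.rx300s6.txt    # ~20 lines before and after the Apr 13 05:41:24 entry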
Is it always the same server?

What's your hardware setup, and what storage are you using (ZFS)? How much load is on the server?

You quite often have corosync problems in the log. Is the cluster network stable, and does the switch work reliably with multicast?
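For reference, a rough way to see how frequent those corosync problems are is to grep the attached log for the usual TOTEM messages (the exact wording may differ between corosync versions):

Code:
grep -c 'Retransmit List' syslog.rx300s6.txt    # count retransmit bursts
grep 'TOTEM' syslog.rx300s6.txt | tail -n 20    # most recent ring/membership events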
 
At least once a week one of the four nodes resets, and it is usually a different node each time.
We have 4 servers: Fujitsu RX300 S4, RX300 S6, TX300 S6, and an IBM System x3400.
All are Xeon based (single or dual processor), each with at least 8 cores and at least 48 GB of RAM (with 10-15 GB always free).
All servers are connected to an EMC disk array over two links, each to a separate FC controller, using multipath; the space is used as LVM. The EMC shows no problems with the FC controllers.

Today it was the RX300S6 server that restarted (8 cores: 1x Xeon E5620); the restart happened between 05:00 and 06:00:
Daily average server load: 6.0
Daily maximum server load: 8.0 (which matches the 8 cores)
During the day the server load generally does not exceed 4.0, and only briefly.
At night the servers back up all the virtual machines (LZO) to NFS; I have tried to schedule the jobs so that no two machines run their backup at the same time.

There is no problem with the network (a dedicated Cisco/Linksys SRW2024 switch used only for the servers and some of the virtual machines), nor with multicast; the switch multicast configuration is attached. Tomorrow morning I will connect a different switch (Cisco SG300-28).
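If the switch side looks fine, the IGMP snooping settings of the Proxmox bridge itself are also worth a look; a minimal sketch, assuming the cluster NIC sits on a bridge named vmbr0 (adjust to your bridge name):

Code:
cat /sys/class/net/vmbr0/bridge/multicast_snooping   # 1 = the Linux bridge filters multicast itself
cat /sys/class/net/vmbr0/bridge/multicast_querier    # 0 = no querier on the bridge; the switch must then provide one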
 

Attachments

  • igmp_snooping-vlan1.png (10.4 KB)
  • bridge-multicast-forward-all-vlan1.png (12.7 KB)
  • bridge-multicast-filtering-vlan1.png (18.1 KB)
There was indeed a problem with pinging the host names; I added entries on the router, and pinging the nodes by name now works. I also checked multicast, results below.
Any more suggestions? Will this solve my problem? The random node restarts always happened on a random day.
Code:
root@RX300S6:/etc/pve# omping -c 10000 -i 0.001 -F -q IBM RX300S4 RX300S6 TX300S6

IBM     :   unicast, xmt/rcv/%loss = 9340/9340/0%, min/avg/max/std-dev = 0.043/0.125/11.940/0.296
IBM     : multicast, xmt/rcv/%loss = 9340/9340/0%, min/avg/max/std-dev = 0.056/0.143/11.979/0.296
RX300S4 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.054/0.133/2.973/0.055
RX300S4 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.059/0.140/2.983/0.057
TX300S6 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.032/0.097/0.775/0.064
TX300S6 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.047/0.118/0.825/0.070


and the results of a 10-minute multicast test:

Code:
root@RX300S6:/etc/pve# omping -c 600 -i 1 -q IBM RX300S4 RX300S6 TX300S6

IBM     :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.083/0.257/28.653/1.169
IBM     : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.116/0.293/28.674/1.168
RX300S4 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.091/0.155/0.360/0.034
RX300S4 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.107/0.192/0.377/0.038
TX300S6 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.056/0.170/0.311/0.048
TX300S6 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.095/0.213/0.346/0.047
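As a sanity check on the name fix, every node name should now resolve to its cluster-network address on each node (the names are the ring0_addr entries from corosync.conf); for example:

Code:
root@RX300S6:~# getent hosts IBM RX300S4 RX300S6 TX300S6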
 