Frequent automatic server restarts

bromac

Good morning,
We have had a problem since moving from Proxmox 3.x to version 4.1.1. We did not upgrade in place: we did a fresh installation and created a new cluster of 4 servers. Now, from time to time (roughly once a week), one of the servers in the cluster restarts by itself. With Proxmox 3.x this never happened; the automatic restarts started after deploying the new version (4.1). I don't know whether it is related, but sometimes web access to a node stops working (connection refused); restarting pveproxy on that machine brings access back. Could someone help explain the problem and analyze the logs? The trouble seems to start at line 1872 of the attached syslog (Apr 13 05:41:24), though perhaps something earlier contributed to the crash.
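For completeness, this is roughly what I do when web access to a node refuses connections (just a sketch; pveproxy is the standard GUI proxy service on PVE 4.x):

Code:
root@RX300S6:~# systemctl status pveproxy     # check whether the proxy is still running
root@RX300S6:~# systemctl restart pveproxy    # web access comes back after this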

Code:
root@RX300S6:/var/log# pvecm status
Quorum information
------------------
Date:             Wed Apr 13 10:30:36 2016
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000005
Ring ID:          1392
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 10.0.2.200
0x00000005          1 10.0.2.201 (local)
0x00000004          1 10.0.2.202
0x00000001          1 10.0.2.203
Code:
root@RX300S6:/etc/pve# cat corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: IBM
    nodeid: 1
    quorum_votes: 1
    ring0_addr: IBM
  }

  node {
    name: RX300S6
    nodeid: 5
    quorum_votes: 1
    ring0_addr: RX300S6
  }

  node {
    name: TX300S6
    nodeid: 2
    quorum_votes: 1
    ring0_addr: TX300S6
  }

  node {
    name: RX300S4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: RX300S4
  }

}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: klaster2
  config_version: 8
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 10.0.2.203
    ringnumber: 0
  }

}
 

Attachments

  • syslog.rx300s6.txt (265.5 KB)
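To look at the crash context in the attached file, pulling a window of lines around 1872 is enough (the exact range is just a guess at useful context; adjust as needed):

Code:
sed -n '1850,1900p' syslog.rx300s6.txt    # ~20 lines before and after the Apr 13 05:41:24 entry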
Is it always the same server?

What's your hardware setup, and what storage are you using (ZFS)? How much load is on the server?

You quite often have corosync problems in the log. Is the cluster network stable, and does the switch work reliably with multicast?
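For reference, a rough way to see how frequent those corosync problems are is to grep the attached log for the usual TOTEM messages (the exact wording may differ between corosync versions):

Code:
grep -c 'Retransmit List' syslog.rx300s6.txt    # count retransmit bursts
grep 'TOTEM' syslog.rx300s6.txt | tail -n 20    # most recent ring/membership events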
 
At least once a week one of the four nodes resets, and it is usually a different node each time.
We have 4 servers: Fujitsu RX300 S4, RX300 S6, TX300 S6, and an IBM System x3400.
All are Xeon based (single or dual processor), each with at least 8 cores and at least 48 GB of RAM (with 10-15 GB always free).
All servers are connected to an EMC disk array over two links, each to a separate FC controller, using multipath; the space is used as LVM. The EMC shows no problems with the FC controllers.

Today it was the RX300S6 server that restarted (8 cores: 1x Xeon E5620); the restart happened between 05:00 and 06:00:
Daily average server load: 6.0
Daily maximum server load: 8.0 (which matches the 8 cores)
During the day the server load generally does not exceed 4.0, and only briefly.
At night the servers back up all the virtual machines (LZO) to NFS; I have tried to schedule the jobs so that no two machines run their backup at the same time.

There is no problem with the network (a dedicated Cisco/Linksys SRW2024 switch used only for the servers and some of the virtual machines), nor with multicast; the switch multicast configuration is attached. Tomorrow morning I will connect a different switch (Cisco SG300-28).
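If the switch side looks fine, the IGMP snooping settings of the Proxmox bridge itself are also worth a look; a minimal sketch, assuming the cluster NIC sits on a bridge named vmbr0 (adjust to your bridge name):

Code:
cat /sys/class/net/vmbr0/bridge/multicast_snooping   # 1 = the Linux bridge filters multicast itself
cat /sys/class/net/vmbr0/bridge/multicast_querier    # 0 = no querier on the bridge; the switch must then provide one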
 

Attachments

  • igmp_snooping-vlan1.png (10.4 KB)
  • bridge-multicast-forward-all-vlan1.png (12.7 KB)
  • bridge-multicast-filtering-vlan1.png (18.1 KB)
There was indeed a problem with pinging the host names; I added entries on the router, and pinging the nodes by name now works. I also checked multicast, results below.
Any more suggestions? Will this solve my problem? The random node restarts always happened on a random day.
Code:
root@RX300S6:/etc/pve# omping -c 10000 -i 0.001 -F -q IBM RX300S4 RX300S6 TX300S6

IBM     :   unicast, xmt/rcv/%loss = 9340/9340/0%, min/avg/max/std-dev = 0.043/0.125/11.940/0.296
IBM     : multicast, xmt/rcv/%loss = 9340/9340/0%, min/avg/max/std-dev = 0.056/0.143/11.979/0.296
RX300S4 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.054/0.133/2.973/0.055
RX300S4 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.059/0.140/2.983/0.057
TX300S6 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.032/0.097/0.775/0.064
TX300S6 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.047/0.118/0.825/0.070


and the results of a 10-minute multicast test:

Code:
root@RX300S6:/etc/pve# omping -c 600 -i 1 -q IBM RX300S4 RX300S6 TX300S6

IBM     :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.083/0.257/28.653/1.169
IBM     : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.116/0.293/28.674/1.168
RX300S4 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.091/0.155/0.360/0.034
RX300S4 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.107/0.192/0.377/0.038
TX300S6 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.056/0.170/0.311/0.048
TX300S6 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.095/0.213/0.346/0.047
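As a sanity check on the name fix, every node name should now resolve to its cluster-network address on each node (the names are the ring0_addr entries from corosync.conf); for example:

Code:
root@RX300S6:~# getent hosts IBM RX300S4 RX300S6 TX300S6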
 