Random reboots

Discussion in 'Proxmox VE: Installation and configuration' started by Ben S, Aug 16, 2018.

  1. Ben S

    Ben S New Member

    Joined:
    Aug 16, 2018
    Messages:
    4
    Likes Received:
    0
    Hello dear Proxmox community :) ,

    We have a 5-node cluster with Ceph and HA enabled, and we are experiencing random, unwanted reboots.

    They happen mainly (but not only) when:
    - We migrate a VM/CT between nodes
    - We manually reboot a node (some other nodes then reboot at random as well)

    When we check the syslog, we don't see any relevant information (see attached screenshot).

    Kernel Version Linux 4.15.18-1-pve #1 SMP PVE 4.15.18-17 (Mon, 30 Jul 2018 12:53:35 +0200)
    PVE Manager Version pve-manager/5.2-6/bcd5f008

    Does anyone have an idea of the probable cause of this issue?

    Thanks

    Ben
     

    Attached Files:

  2. itvietnam

    itvietnam Member

    Joined:
    Aug 11, 2015
    Messages:
    116
    Likes Received:
    4
    Are you running the cluster network on the same NIC as the storage network and/or the migration network?

    These three networks need to be separated.
     
  3. Ben S

    Ben S New Member

    We use Ceph storage, and the disks are in each server. All servers are on the same network.

    Do you mean we need 3 NICs per server?
     
  4. itvietnam

    itvietnam Member

    Yes:
    1. One for the public network
    2. One for the private LAN (cross-VM connections)
    3. One for Ceph (10 Gbps would be better)
    4. One for the cluster network. This must be dedicated to the ring0 network, and you should add a ring1 network as well. You may want to take a look here: https://pve.proxmox.com/wiki/Cluster_Manager#_cluster_network
    Your problem may be caused by the storage network being saturated, so that the ring network stopped working. Hence, Proxmox rebooted the servers because they lost contact with the cluster.
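
    As a rough way to sanity-check the separation described above, here is a minimal Python sketch that tests whether the subnets assigned to each traffic role overlap. All role names and addresses below are hypothetical placeholders, not values from this thread. Note that distinct subnets do not by themselves prove the traffic uses separate physical NICs, so this only catches the obvious case of two roles sharing one network.

```python
import ipaddress
from itertools import combinations

# Hypothetical subnets, one per traffic role. Replace with your own values.
NETWORKS = {
    "public": "203.0.113.0/24",
    "vm_lan": "10.10.0.0/24",
    "ceph": "10.20.0.0/24",
    "corosync_ring0": "10.30.0.0/24",
    "corosync_ring1": "10.31.0.0/24",
}


def overlapping_roles(networks):
    """Return every pair of role names whose subnets overlap."""
    parsed = {role: ipaddress.ip_network(cidr) for role, cidr in networks.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(parsed), 2)
        if parsed[a].overlaps(parsed[b])
    ]


if __name__ == "__main__":
    clashes = overlapping_roles(NETWORKS)
    if clashes:
        for a, b in clashes:
            print(f"{a} and {b} share address space")
    else:
        print("no overlapping subnets found")
```

    A setup like the one described earlier in the thread, where everything sits on the same network, would show up as a clash between the Ceph role and the corosync role.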
     
  5. KnowVation

    KnowVation Member
    Proxmox Subscriber

    Joined:
    Mar 19, 2015
    Messages:
    61
    Likes Received:
    0
    We had the same problem yesterday. In our 5-node cluster, 3 nodes suddenly rebooted while we were trying to migrate a VM from one node to another. Our cluster uses one network for Ceph and one network for cross-VM connections and public traffic. We have been running stably like this for many years, with lots of migrations and no problems at all. Our current Proxmox version is Virtual Environment 4.4-22/2728f613. Now this has occurred twice within one day, and I really have no idea what to look for in the syslog. Here is the syslog from when the second reboot happened. I wonder whether the first entry contains the right clue, but I have no idea what it means.


    Aug 28 11:33:51 Bucky corosync[1816]: [TOTEM ] A processor failed, forming new configuration.
    Aug 28 11:34:01 Bucky corosync[1816]: [TOTEM ] A new membership (192.168.X.XXX:2708) was formed. Members left: 5
    Aug 28 11:34:01 Bucky corosync[1816]: [TOTEM ] Failed to receive the leave message. failed: 5
    Aug 28 11:34:01 Bucky corosync[1816]: [TOTEM ] Retransmit List: 1
    Aug 28 11:34:01 Bucky pmxcfs[1702]: [dcdb] notice: members: 1/22496, 2/28978, 3/1717, 4/1702
    Aug 28 11:34:01 Bucky pmxcfs[1702]: [dcdb] notice: starting data syncronisation
    Aug 28 11:34:01 Bucky pmxcfs[1702]: [status] notice: members: 1/22496, 2/28978, 3/1717, 4/1702
    Aug 28 11:34:01 Bucky pmxcfs[1702]: [status] notice: starting data syncronisation
    Aug 28 11:34:01 Bucky corosync[1816]: [QUORUM] Members[4]: 3 4 1 2
    Aug 28 11:34:01 Bucky corosync[1816]: [MAIN ] Completed service synchronization, ready to provide service.
    Aug 28 11:34:01 Bucky pmxcfs[1702]: [dcdb] notice: received sync request (epoch 1/22496/0000000C)
    Aug 28 11:34:01 Bucky pmxcfs[1702]: [status] notice: received sync request (epoch 1/22496/00000008)
    Aug 28 11:34:01 Bucky pmxcfs[1702]: [dcdb] notice: received all states
    Aug 28 11:34:01 Bucky pmxcfs[1702]: [dcdb] notice: leader is 1/22496
    Aug 28 11:34:01 Bucky pmxcfs[1702]: [dcdb] notice: synced members: 1/22496, 2/28978, 3/1717, 4/1702
    Aug 28 11:34:01 Bucky pmxcfs[1702]: [dcdb] notice: all data is up to date
    Aug 28 11:34:01 Bucky pmxcfs[1702]: [dcdb] notice: dfsm_deliver_queue: queue length 11
    Aug 28 11:34:01 Bucky pmxcfs[1702]: [status] notice: received all states
    Aug 28 11:34:01 Bucky pmxcfs[1702]: [status] notice: all data is up to date
    Aug 28 11:34:01 Bucky pmxcfs[1702]: [status] notice: dfsm_deliver_queue: queue length 111
    Aug 28 11:34:01 Bucky pmxcfs[1702]: [status] notice: received log
    Aug 28 11:34:01 Bucky pmxcfs[1702]: [main] notice: ignore duplicate
    Aug 28 11:34:01 Bucky pmxcfs[1702]: [status] notice: received log
    Aug 28 11:34:01 Bucky pmxcfs[1702]: [main] notice: ignore duplicate
    Aug 28 11:34:01 Bucky pmxcfs[1702]: [status] notice: received log
    Aug 28 11:34:01 Bucky pmxcfs[1702]: [main] notice: ignore duplicate
    Aug 28 11:34:01 Bucky pmxcfs[1702]: [status] notice: received log
    Aug 28 11:34:01 Bucky pmxcfs[1702]: [main] notice: ignore duplicate
    Aug 28 11:34:01 Bucky pmxcfs[1702]: [status] notice: received log
    Aug 28 11:34:01 Bucky pmxcfs[1702]: [main] notice: ignore duplicate
    Aug 28 11:34:01 Bucky pmxcfs[1702]: [status] notice: received log
    Aug 28 11:34:01 Bucky pmxcfs[1702]: [main] notice: ignore duplicate
    Aug 28 11:34:01 Bucky pmxcfs[1702]: [status] notice: received log
    Aug 28 11:34:01 Bucky pmxcfs[1702]: [main] notice: ignore duplicate
    Aug 28 11:34:07 Bucky corosync[1816]: [TOTEM ] A new membership (192.168.X.XXX:2712) was formed. Members joined: 5
    Aug 28 11:34:07 Bucky corosync[1816]: [TOTEM ] Retransmit List: 1
    Aug 28 11:34:07 Bucky pmxcfs[1702]: [dcdb] notice: members: 1/22496, 2/28978, 3/1717, 4/1702, 5/1688
    Aug 28 11:34:07 Bucky pmxcfs[1702]: [dcdb] notice: starting data syncronisation
    Aug 28 11:34:07 Bucky pmxcfs[1702]: [status] notice: members: 1/22496, 2/28978, 3/1717, 4/1702, 5/1688
    Aug 28 11:34:07 Bucky pmxcfs[1702]: [status] notice: starting data syncronisation
    Aug 28 11:34:07 Bucky corosync[1816]: [QUORUM] Members[5]: 3 4 5 1 2
    Aug 28 11:34:07 Bucky corosync[1816]: [MAIN ] Completed service synchronization, ready to provide service.
    Aug 28 11:34:07 Bucky pmxcfs[1702]: [dcdb] notice: received sync request (epoch 1/22496/0000000D)
    Aug 28 11:34:07 Bucky pmxcfs[1702]: [status] notice: received sync request (epoch 1/22496/00000009)
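
    For anyone unsure what to look for in a syslog like this, the following is a minimal Python sketch that pulls only the corosync [TOTEM ] membership changes out of the noise. The regular expressions are an assumption based purely on the line format in the excerpt above, and the function name is made up for illustration; other corosync versions may format these messages differently.

```python
import re

# Matches syslog lines written by corosync with a [TOTEM ] tag,
# in the format seen in the excerpt above (assumed, not a spec).
EVENT = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\scorosync\[\d+\]:\s+"
    r"\[TOTEM\s*\]\s(?P<msg>.*)$"
)
# Matches the "Members left: 5" / "Members joined: 5" part of a message.
MEMBERS = re.compile(r"Members (?P<kind>left|joined): (?P<ids>[\d ]+)")


def membership_events(lines):
    """Return (timestamp, kind, node_ids) tuples for membership changes.

    kind is "failed", "left" or "joined"; node_ids is empty for "failed".
    """
    events = []
    for line in lines:
        match = EVENT.match(line.strip())
        if not match:
            continue  # not a corosync TOTEM line (e.g. pmxcfs output)
        msg = match.group("msg")
        change = MEMBERS.search(msg)
        if change:
            ids = [int(x) for x in change.group("ids").split()]
            events.append((match.group("ts"), change.group("kind"), ids))
        elif "A processor failed" in msg:
            events.append((match.group("ts"), "failed", []))
    return events


if __name__ == "__main__":
    import sys

    # Usage: python3 this_script.py < /var/log/syslog
    for ts, kind, ids in membership_events(sys.stdin):
        print(ts, kind, ids)
```

    Run against the excerpt above, it would report a failure at 11:33:51, node 5 leaving at 11:34:01 and node 5 rejoining at 11:34:07. Comparing those timestamps across all nodes is one way to see which node dropped out of the ring first.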
     
    #5 KnowVation, Aug 29, 2018
    Last edited: Aug 29, 2018