Proxmox 2.0 Node1 CPU Failed Rebooted Automatically

  • Thread starter: Chris Rivera

It took me some time to track this down, and I still don't know exactly what happened.

It seems that Proxmox (on my production cloud) detected that one CPU had failed, shut down all VMs, and rebooted itself.


Sep 4 16:44:21 proxmox1 corosync[648411]: [TOTEM ] A processor failed, forming new configuration.
Sep 4 16:47:04 proxmox1 shutdown[649841]: shutting down for system reboot

When it came back online I checked the logs again and saw:

Sep 4 17:10:44 proxmox1 corosync[1482]: [TOTEM ] A processor failed, forming new configuration.
Sep 4 17:11:53 proxmox1 corosync[1482]: [TOTEM ] Retransmit List: 13 14 15 16 17 18 19 1a 1b 1c 1d 1e
Sep 4 17:11:53 proxmox1 corosync[1482]: [TOTEM ] Retransmit List: 31 32
Sep 4 17:11:53 proxmox1 corosync[1482]: [TOTEM ] Retransmit List: 45 46 47 48 49 4a 4b 4c 4d 4e 4f 50
Sep 4 17:11:53 proxmox1 corosync[1482]: [CLM ] CLM CONFIGURATION CHANGE
Sep 4 17:11:53 proxmox1 corosync[1482]: [CLM ] New Configuration:
Sep 4 17:11:53 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.154)
Sep 4 17:11:53 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.155)
Sep 4 17:11:53 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.156)
Sep 4 17:11:53 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.157)
Sep 4 17:11:53 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.158)
Sep 4 17:11:53 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.159)
Sep 4 17:11:53 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.160)
Sep 4 17:11:53 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.161)
Sep 4 17:11:53 proxmox1 corosync[1482]: [CLM ] Members Left:
Sep 4 17:11:53 proxmox1 corosync[1482]: [CLM ] Members Joined:
Sep 4 17:11:53 proxmox1 corosync[1482]: [CLM ] CLM CONFIGURATION CHANGE
Sep 4 17:11:53 proxmox1 corosync[1482]: [CLM ] New Configuration:
Sep 4 17:11:53 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.154)
Sep 4 17:11:53 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.155)
Sep 4 17:11:53 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.156)
Sep 4 17:11:53 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.157)
Sep 4 17:11:53 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.158)
Sep 4 17:11:53 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.159)
Sep 4 17:11:53 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.160)
Sep 4 17:11:53 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.161)
Sep 4 17:11:53 proxmox1 corosync[1482]: [CLM ] Members Left:
Sep 4 17:11:53 proxmox1 corosync[1482]: [CLM ] Members Joined:
Sep 4 17:11:53 proxmox1 corosync[1482]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 4 17:11:53 proxmox1 corosync[1482]: [CPG ] chosen downlist: sender r(0) ip(63.217.249.158) ; members(old:8 left:0)
Sep 4 17:11:53 proxmox1 corosync[1482]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 4 17:12:19 proxmox1 corosync[1482]: [TOTEM ] A processor failed, forming new configuration.
Sep 4 17:13:30 proxmox1 corosync[1482]: [TOTEM ] Retransmit List: 13 14 15 16 17 18 19 1a 1b 1c 1d 1e
Sep 4 17:13:44 proxmox1 corosync[1482]: [TOTEM ] Retransmit List: 31 32
Sep 4 17:13:44 proxmox1 corosync[1482]: [TOTEM ] Retransmit List: 45 46 47 48 49 4a 4b 4c 4d 4e 4f 50
Sep 4 17:13:44 proxmox1 corosync[1482]: [TOTEM ] Retransmit List: 70 71 72 73 75 76 77 78 79 7a 7b
Sep 4 17:13:44 proxmox1 corosync[1482]: [TOTEM ] Retransmit List: 9a 9b 9c 9d 9e 9f a0 a1 a2 a3 a4
Sep 4 17:13:44 proxmox1 corosync[1482]: [TOTEM ] Retransmit List: c7 c8 c9 ca cb cc cd ce cf d0
Sep 4 17:13:44 proxmox1 corosync[1482]: [CLM ] CLM CONFIGURATION CHANGE
Sep 4 17:13:44 proxmox1 corosync[1482]: [CLM ] New Configuration:
Sep 4 17:13:44 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.154)
Sep 4 17:13:44 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.155)
Sep 4 17:13:44 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.156)
Sep 4 17:13:44 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.157)
Sep 4 17:13:44 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.158)
Sep 4 17:13:44 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.159)
Sep 4 17:13:44 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.160)
Sep 4 17:13:44 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.161)
Sep 4 17:13:44 proxmox1 corosync[1482]: [CLM ] Members Left:
Sep 4 17:13:44 proxmox1 corosync[1482]: [CLM ] Members Joined:
Sep 4 17:13:44 proxmox1 corosync[1482]: [CLM ] CLM CONFIGURATION CHANGE
Sep 4 17:13:44 proxmox1 corosync[1482]: [CLM ] New Configuration:
Sep 4 17:13:44 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.154)
Sep 4 17:13:44 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.155)
Sep 4 17:13:44 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.156)
Sep 4 17:13:44 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.157)
Sep 4 17:13:44 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.158)
Sep 4 17:13:44 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.159)
Sep 4 17:13:44 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.160)
Sep 4 17:13:44 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.161)
Sep 4 17:13:44 proxmox1 corosync[1482]: [CLM ] Members Left:
Sep 4 17:13:44 proxmox1 corosync[1482]: [CLM ] Members Joined:
Sep 4 17:13:44 proxmox1 corosync[1482]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 4 17:13:44 proxmox1 corosync[1482]: [CPG ] chosen downlist: sender r(0) ip(63.217.249.158) ; members(old:8 left:0)
Sep 4 17:13:44 proxmox1 corosync[1482]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 4 17:26:01 proxmox1 corosync[1482]: [TOTEM ] A processor failed, forming new configuration.
Sep 4 17:29:02 proxmox1 corosync[1482]: [CLM ] CLM CONFIGURATION CHANGE
Sep 4 17:29:02 proxmox1 corosync[1482]: [CLM ] New Configuration:
Sep 4 17:29:02 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.154)
Sep 4 17:29:02 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.155)
Sep 4 17:29:02 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.156)
Sep 4 17:29:02 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.157)
Sep 4 17:29:02 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.158)
Sep 4 17:29:02 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.159)
Sep 4 17:29:02 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.160)
Sep 4 17:29:02 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.161)
Sep 4 17:29:02 proxmox1 corosync[1482]: [CLM ] Members Left:
Sep 4 17:29:02 proxmox1 corosync[1482]: [CLM ] Members Joined:
Sep 4 17:29:02 proxmox1 corosync[1482]: [CLM ] CLM CONFIGURATION CHANGE
Sep 4 17:29:02 proxmox1 corosync[1482]: [CLM ] New Configuration:
Sep 4 17:29:02 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.154)
Sep 4 17:29:02 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.155)
Sep 4 17:29:02 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.156)
Sep 4 17:29:02 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.157)
Sep 4 17:29:02 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.158)
Sep 4 17:29:02 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.159)
Sep 4 17:29:02 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.160)
Sep 4 17:29:02 proxmox1 corosync[1482]: [CLM ] #011r(0) ip(63.217.249.161)
Sep 4 17:29:02 proxmox1 corosync[1482]: [CLM ] Members Left:
Sep 4 17:29:02 proxmox1 corosync[1482]: [CLM ] Members Joined:
Sep 4 17:29:02 proxmox1 corosync[1482]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 4 17:29:12 proxmox1 corosync[1482]: [TOTEM ] A processor failed, forming new configuration.




WHAT IS GOING ON HERE????
 
It seems that proxmox (on my production cloud) detected 1 cpu failed

Why do you think a CPU has failed? The logs you sent just indicate a problem with cluster communication.

Sep 4 16:44:21 proxmox1 corosync[648411]: [TOTEM ] A processor failed, forming new configuration.


This indicates that this node can't communicate with another node in the cluster.


Sep 4 16:47:04 proxmox1 shutdown[649841]: shutting down for system reboot


Someone manually restarted the node?


Sep 4 17:10:44 proxmox1 corosync[1482]: [TOTEM ] A processor failed, forming new configuration.
Sep 4 17:11:53 proxmox1 corosync[1482]: [TOTEM ] Retransmit List: 13 14 15 16 17 18 19 1a 1b 1c 1d 1e
Sep 4 17:11:53 proxmox1 corosync[1482]: [TOTEM ] Retransmit List: 31 32
Sep 4 17:11:53 proxmox1 corosync[1482]: [TOTEM ] Retransmit List: 45 46 47 48 49 4a 4b 4c 4d 4e 4f 50
Sep 4 17:11:53 proxmox1 corosync[1482]: [CLM ] CLM CONFIGURATION CHANGE
Sep 4 17:11:53 proxmox1 corosync[1482]: [CLM ] New Configuration:

This is the log from the corosync cluster membership protocol (totem).
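To see how long each totem re-formation took, the log can be reduced to pairs of "A processor failed" and "a new membership was formed" lines. The following is only a rough sketch: the helper names `parse_time` and `reform_durations` are made up for illustration, and the year is an arbitrary placeholder because syslog lines don't carry one.

```python
from datetime import datetime

FAILED = "A processor failed, forming new configuration."
FORMED = "a new membership was formed."


def parse_time(line, year=2000):
    # Syslog lines have no year; `year` is an arbitrary placeholder (assumption).
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def reform_durations(lines):
    """Return (failed_at, formed_at, seconds) for each completed re-formation."""
    results = []
    failed_at = None
    for line in lines:
        if "[TOTEM" not in line:
            continue
        if FAILED in line:
            failed_at = parse_time(line)
        elif FORMED in line and failed_at is not None:
            formed_at = parse_time(line)
            seconds = (formed_at - failed_at).total_seconds()
            results.append((failed_at, formed_at, seconds))
            failed_at = None
    return results


# TOTEM lines copied from the log posted above.
head = "proxmox1 corosync[1482]: [TOTEM ] "
joined = "A processor joined or left the membership and a new membership was formed."
sample = [
    f"Sep 4 17:10:44 {head}{FAILED}",
    f"Sep 4 17:11:53 {head}{joined}",
    f"Sep 4 17:12:19 {head}{FAILED}",
    f"Sep 4 17:13:44 {head}{joined}",
    f"Sep 4 17:26:01 {head}{FAILED}",
    f"Sep 4 17:29:02 {head}{joined}",
]

for failed_at, formed_at, seconds in reform_durations(sample):
    print(failed_at.time(), "->", formed_at.time(), f"{seconds:.0f}s")
```

On the lines posted above, this gives re-formation times of 69, 85 and 181 seconds, which suggests the communication problem was getting worse rather than clearing up.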
 
Originally Posted by Chris Rivera
Sep 4 16:44:21 proxmox1 corosync[648411]: [TOTEM ] A processor failed, forming new configuration.


This indicates that this node can't communicate to another node in the cluster.

Good to know. I replaced the processors before this post just to clear up any issues.



Originally Posted by Chris Rivera
Sep 4 16:44:21 proxmox1 corosync[648411]: [TOTEM ] A processor failed, forming new configuration.

This indicates that this node can't communicate to another node in the cluster

Which node is in question? I do not see any hostname or IP associated with the message. If I had to guess, I'd bet it's node 8, which does seem to have issues with quorum. I had to run pvecm e 1 just to be able to SSH into the box, since something is wrong.
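One way to check whether the log names a node at all is to collect what each CLM CONFIGURATION CHANGE block reports under "New Configuration", "Members Left" and "Members Joined". This is a minimal sketch that assumes the line layout shown in the log above; `parse_clm_changes` is a hypothetical helper, not part of corosync or Proxmox.

```python
import re

IP_RE = re.compile(r"ip\(([\d.]+)\)")
SECTIONS = {
    "New Configuration:": "members",
    "Members Left:": "left",
    "Members Joined:": "joined",
}


def parse_clm_changes(lines):
    """Return one dict per CLM CONFIGURATION CHANGE block."""
    changes = []
    current = None
    section = None
    for line in lines:
        if "[CLM" not in line:
            section = None  # a non-CLM line ends the current section
            continue
        if "CLM CONFIGURATION CHANGE" in line:
            current = {"members": [], "left": [], "joined": []}
            changes.append(current)
            section = None
            continue
        if current is None:
            continue
        stripped = line.rstrip()
        header = next((k for t, k in SECTIONS.items() if stripped.endswith(t)), None)
        if header:
            section = header
            continue
        match = IP_RE.search(line)
        if match and section:
            current[section].append(match.group(1))
    return changes


# One block rebuilt from the 17:11:53 lines in the log above.
prefix = "Sep 4 17:11:53 proxmox1 corosync[1482]: [CLM ] "
sample = [prefix + "CLM CONFIGURATION CHANGE", prefix + "New Configuration:"]
sample += [prefix + f"#011r(0) ip(63.217.249.{n})" for n in range(154, 162)]
sample += [prefix + "Members Left:", prefix + "Members Joined:"]

for change in parse_clm_changes(sample):
    print(len(change["members"]), "members, left:", change["left"], "joined:", change["joined"])
```

For the blocks posted here, all eight addresses stay in the configuration and both "Members Left" and "Members Joined" are empty, so this node's log alone does not single out a peer. Running the same check against the logs of the other nodes might.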



Originally Posted by Chris Rivera
Sep 4 16:47:04 proxmox1 shutdown[649841]: shutting down for system reboot

Someone manually restarted the node?

This node was not manually rebooted. There was only one session logged in to the server, which was mine, and I did not issue a reboot command. The shell history is still on the server and was not cleared, and no history entries show a reboot or any related command. It rebooted by itself, or a process issued the command.

How can we track down what process, task, or application issued a reboot command? I need to find out what caused this node to reboot to ensure this will not happen again.
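A first step could be to pull everything syslog recorded in the minutes before each shutdown line and see which daemons were active. This is only a sketch: `events_before` is a hypothetical helper, the five-minute window and the placeholder year are arbitrary assumptions, and syslog may simply not record which process asked for the reboot.

```python
from datetime import datetime, timedelta


def parse_time(line, year=2000):
    # Syslog lines have no year; `year` is an arbitrary placeholder (assumption).
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def events_before(lines, marker="shutdown[", window=timedelta(minutes=5)):
    """Return (shutdown_line, earlier_lines) for each line containing `marker`."""
    stamped = []
    for line in lines:
        try:
            stamped.append((parse_time(line), line))
        except ValueError:
            continue  # skip blank lines and lines without a syslog timestamp
    report = []
    for i, (when, line) in enumerate(stamped):
        if marker not in line:
            continue
        earlier = [
            other
            for j, (t, other) in enumerate(stamped)
            if j != i and when - window <= t <= when
        ]
        report.append((line, earlier))
    return report


# The two lines posted at the top of the thread.
sample = [
    "Sep 4 16:44:21 proxmox1 corosync[648411]: [TOTEM ] A processor failed, forming new configuration.",
    "Sep 4 16:47:04 proxmox1 shutdown[649841]: shutting down for system reboot",
]

for shutdown_line, earlier in events_before(sample):
    print(shutdown_line)
    for line in earlier:
        print("    preceded by:", line)
```

With the full syslog instead of the two sample lines, the output lists every message logged in the window before the reboot, which at least narrows down which services were doing something at the time.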