Hi,
Last week the power in our office failed, and after it came back we had some client computers with broken hardware. Our server hardware looked fine at that moment. We started the nodes and everything was working as before.
Now, a week later, both servers shut down unexpectedly. My question is how this could happen.
Our setup is as follows:
2 SuperMicro servers with 64 GB ECC RAM and a fiber connection between them to sync the network RAID arrays.
The servers are also connected to the client network over a UTP cable. This connection is also used to reach IPMI to get the system state.
Server 1: 192.168.0.201 (IPMI 192.168.0.200) over eth0 and eth1 (both on switch 1)
Server 2: 192.168.0.203 (IPMI 192.168.0.202) over eth0 and eth1 (both on switch 1)
Data: 10.0.0.1 and 10.0.0.2 (no switch, direct fiber connection between the 2 servers)
Logs:
Proxmox 1:
Feb 25 09:09:08 prox1 corosync[3828]: [TOTEM ] A processor failed, forming new configuration.
Feb 25 09:09:10 prox1 corosync[3828]: [CLM ] CLM CONFIGURATION CHANGE
Feb 25 09:09:10 prox1 corosync[3828]: [CLM ] New Configuration:
Feb 25 09:09:10 prox1 corosync[3828]: [CLM ] #011r(0) ip(192.168.0.201)
Feb 25 09:09:10 prox1 corosync[3828]: [CLM ] Members Left:
Feb 25 09:09:10 prox1 corosync[3828]: [CLM ] #011r(0) ip(192.168.0.203)
Feb 25 09:09:10 prox1 corosync[3828]: [CLM ] Members Joined:
Feb 25 09:09:10 prox1 corosync[3828]: [QUORUM] Members[1]: 1
Feb 25 09:09:10 prox1 corosync[3828]: [CLM ] CLM CONFIGURATION CHANGE
Feb 25 09:09:10 prox1 corosync[3828]: [CLM ] New Configuration:
Feb 25 09:09:10 prox1 corosync[3828]: [CLM ] #011r(0) ip(192.168.0.201)
Feb 25 09:09:10 prox1 corosync[3828]: [CLM ] Members Left:
Feb 25 09:09:10 prox1 corosync[3828]: [CLM ] Members Joined:
Feb 25 09:09:10 prox1 corosync[3828]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Feb 25 09:09:10 prox1 kernel: dlm: closing connection to node 2
Feb 25 09:09:10 prox1 rgmanager[4199]: State change: prox2 DOWN
Feb 25 09:09:10 prox1 corosync[3828]: [CPG ] chosen downlist: sender r(0) ip(192.168.0.201) ; members(old:2 left:1)
Feb 25 09:09:10 prox1 pmxcfs[3596]: [dcdb] notice: members: 1/3596
Feb 25 09:09:10 prox1 pmxcfs[3596]: [dcdb] notice: members: 1/3596
Feb 25 09:09:10 prox1 corosync[3828]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 25 09:09:10 prox1 fenced[3906]: fencing node prox2
Feb 25 09:09:10 prox1 rgmanager[928142]: [pvevm] VM 120 is running
Feb 25 09:09:10 prox1 rgmanager[928162]: [pvevm] VM 110 is running
Feb 25 09:09:10 prox1 kernel: eth0: received packet with own address as source address
Feb 25 09:09:10 prox1 fence_ipmilan: Parse error: Ignoring unknown option 'nodename=prox2
Feb 25 09:09:14 prox1 shutdown[928210]: shutting down for system halt
Proxmox 2:
Feb 25 09:09:08 prox2 corosync[3804]: [TOTEM ] A processor failed, forming new configuration.
Feb 25 09:09:10 prox2 corosync[3804]: [CLM ] CLM CONFIGURATION CHANGE
Feb 25 09:09:10 prox2 corosync[3804]: [CLM ] New Configuration:
Feb 25 09:09:10 prox2 corosync[3804]: [CLM ] #011r(0) ip(192.168.0.203)
Feb 25 09:09:10 prox2 corosync[3804]: [CLM ] Members Left:
Feb 25 09:09:10 prox2 corosync[3804]: [CLM ] #011r(0) ip(192.168.0.201)
Feb 25 09:09:10 prox2 corosync[3804]: [CLM ] Members Joined:
Feb 25 09:09:10 prox2 corosync[3804]: [QUORUM] Members[1]: 2
Feb 25 09:09:10 prox2 corosync[3804]: [CLM ] CLM CONFIGURATION CHANGE
Feb 25 09:09:10 prox2 corosync[3804]: [CLM ] New Configuration:
Feb 25 09:09:10 prox2 corosync[3804]: [CLM ] #011r(0) ip(192.168.0.203)
Feb 25 09:09:10 prox2 corosync[3804]: [CLM ] Members Left:
Feb 25 09:09:10 prox2 corosync[3804]: [CLM ] Members Joined:
Feb 25 09:09:10 prox2 corosync[3804]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Feb 25 09:09:10 prox2 rgmanager[4156]: State change: prox1 DOWN
Feb 25 09:09:10 prox2 kernel: dlm: closing connection to node 1
Feb 25 09:09:10 prox2 corosync[3804]: [CPG ] chosen downlist: sender r(0) ip(192.168.0.203) ; members(old:2 left:1)
Feb 25 09:09:10 prox2 pmxcfs[3526]: [dcdb] notice: members: 2/3526
Feb 25 09:09:10 prox2 pmxcfs[3526]: [dcdb] notice: members: 2/3526
Feb 25 09:09:10 prox2 corosync[3804]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 25 09:09:10 prox2 fenced[3862]: fencing node prox1
Feb 25 09:09:10 prox2 fence_ipmilan: Parse error: Ignoring unknown option 'nodename=prox1
Feb 25 09:09:10 prox2 kernel: eth0: received packet with own address as source address
Feb 25 09:09:10 prox2 corosync[3804]: [CLM ] CLM CONFIGURATION CHANGE
Feb 25 09:09:10 prox2 corosync[3804]: [CLM ] New Configuration:
Feb 25 09:09:10 prox2 corosync[3804]: [CLM ] #011r(0) ip(192.168.0.203)
Feb 25 09:09:10 prox2 corosync[3804]: [CLM ] Members Left:
Feb 25 09:09:10 prox2 corosync[3804]: [CLM ] Members Joined:
Feb 25 09:09:10 prox2 corosync[3804]: [CLM ] CLM CONFIGURATION CHANGE
Feb 25 09:09:10 prox2 corosync[3804]: [CLM ] New Configuration:
Feb 25 09:09:10 prox2 corosync[3804]: [CLM ] #011r(0) ip(192.168.0.203)
Feb 25 09:09:10 prox2 corosync[3804]: [CLM ] Members Left:
Feb 25 09:09:10 prox2 corosync[3804]: [CLM ] Members Joined:
Feb 25 09:09:10 prox2 corosync[3804]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Feb 25 09:09:10 prox2 corosync[3804]: [CPG ] chosen downlist: sender r(0) ip(192.168.0.203) ; members(old:1 left:0)
Feb 25 09:09:10 prox2 corosync[3804]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 25 09:09:14 prox2 shutdown[1038915]: shutting down for system halt
Can anyone help us? Could this be caused by a server hardware failure?
Thanks for any help!