Cluster failed

baldy

Active Member
Feb 9, 2017
Hi all,

today I was in the datacenter and brought a previously broken server back. I re-added it to the cluster with pvecm 10.0.2.110 --force

After this command the cluster was complete and working. Then I added a tagged VLAN on my switches, and that is when the chaos began. After seeing a lot of trouble I deleted the VLANs again, hoping things would return to normal.

But that did not happen. On node 1 and node 2 I get a lot of errors:

Mar 10 15:51:59 host01 systemd[1]: Stopping The Proxmox VE cluster filesystem...
Mar 10 15:52:00 host01 corosync[14350]: [TOTEM ] A new membership (10.0.2.110:127136) was formed. Members
Mar 10 15:52:00 host01 corosync[14350]: [QUORUM] Members[12]: 1 2 3 4 5 6 7 8 9 10 11 12
Mar 10 15:52:00 host01 corosync[14350]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 10 15:52:01 host01 corosync[14350]: [TOTEM ] A new membership (10.0.2.110:127140) was formed. Members
Mar 10 15:52:01 host01 corosync[14350]: [QUORUM] Members[12]: 1 2 3 4 5 6 7 8 9 10 11 12
Mar 10 15:52:01 host01 corosync[14350]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 10 15:52:04 host01 corosync[14350]: [TOTEM ] A new membership (10.0.2.110:127144) was formed. Members
Mar 10 15:52:04 host01 corosync[14350]: [QUORUM] Members[12]: 1 2 3 4 5 6 7 8 9 10 11 12
Mar 10 15:52:04 host01 corosync[14350]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 10 15:52:10 host01 systemd[1]: pve-cluster.service stop-sigterm timed out. Killing.
Mar 10 15:52:10 host01 cron[2006]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)
Mar 10 15:52:10 host01 pve-ha-lrm[2128]: unable to write lrm status file - unable to open file '/etc/pve/nodes/host01/lrm_status.tmp.2128' - Transport endpoint is not connected
Mar 10 15:52:10 host01 systemd[1]: pve-cluster.service: main process exited, code=killed, status=9/KILL
Mar 10 15:52:10 host01 systemd[1]: Unit pve-cluster.service entered failed state.
Mar 10 15:52:10 host01 systemd[1]: Starting The Proxmox VE cluster filesystem...
Mar 10 15:52:10 host01 pmxcfs[23185]: [status] notice: update cluster info (cluster name fcse, version = 13)
Mar 10 15:52:10 host01 pmxcfs[23185]: [status] notice: node has quorum
Mar 10 15:52:10 host01 pmxcfs[23185]: [dcdb] notice: members: 1/23185, 3/1990, 4/1910, 5/22930, 6/1893, 7/2035, 8/1927, 9/1887, 10/1989, 11/1509, 12/2135
Mar 10 15:52:10 host01 pmxcfs[23185]: [dcdb] notice: starting data syncronisation
Mar 10 15:52:10 host01 pmxcfs[23185]: [dcdb] notice: received sync request (epoch 1/23185/00000001)
Mar 10 15:52:10 host01 pmxcfs[23185]: [status] notice: members: 1/23185, 3/1990, 4/1910, 5/22930, 6/1893, 7/2035, 8/1927, 9/1887, 10/1989, 11/1509, 12/2135
Mar 10 15:52:10 host01 pmxcfs[23185]: [status] notice: starting data syncronisation
Mar 10 15:52:10 host01 pmxcfs[23185]: [status] notice: received sync request (epoch 1/23185/00000001)
Mar 10 15:52:10 host01 pvestatd[31372]: ipcc_send_rec failed: Transport endpoint is not connected
Mar 10 15:52:10 host01 pvestatd[31372]: ipcc_send_rec failed: Connection refused
Mar 10 15:52:10 host01 pvestatd[31372]: ipcc_send_rec failed: Connection refused
Mar 10 15:52:10 host01 pvestatd[31372]: ipcc_send_rec failed: Connection refused
Mar 10 15:52:10 host01 pvestatd[31372]: status update time (35.230 seconds)
Mar 10 15:52:10 host01 pvestatd[31372]: ipcc_send_rec failed: Connection refused
Mar 10 15:52:10 host01 pvestatd[31372]: ipcc_send_rec failed: Connection refused
Mar 10 15:52:10 host01 pvestatd[31372]: ipcc_send_rec failed: Connection refused
Mar 10 15:52:10 host01 pvestatd[31372]: ipcc_send_rec failed: Connection refused
Mar 10 15:52:10 host01 pvestatd[31372]: ipcc_send_rec failed: Connection refused
Mar 10 15:52:10 host01 pvestatd[31372]: ipcc_send_rec failed: Connection refused
Mar 10 15:52:13 host01 corosync[14350]: [TOTEM ] A new membership (10.0.2.110:127148) was formed. Members
Mar 10 15:52:13 host01 corosync[14350]: [QUORUM] Members[12]: 1 2 3 4 5 6 7 8 9 10 11 12
Mar 10 15:52:13 host01 corosync[14350]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 10 15:52:15 host01 pve-ha-lrm[2128]: loop take too long (45 seconds)
Mar 10 15:52:15 host01 pve-ha-crm[2115]: ipcc_send_rec failed: Transport endpoint is not connected
Mar 10 15:52:15 host01 pve-ha-lrm[2128]: ipcc_send_rec failed: Transport endpoint is not connected
Mar 10 15:52:22 host01 corosync[14350]: [TOTEM ] A new membership (10.0.2.110:127152) was formed. Members
Mar 10 15:52:22 host01 corosync[14350]: [QUORUM] Members[12]: 1 2 3 4 5 6 7 8 9 10 11 12
Mar 10 15:52:22 host01 corosync[14350]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 10 15:52:25 host01 corosync[14350]: [TOTEM ] A new membership (10.0.2.110:127156) was formed. Members
Mar 10 15:52:25 host01 corosync[14350]: [QUORUM] Members[12]: 1 2 3 4 5 6 7 8 9 10 11 12
Mar 10 15:52:25 host01 corosync[14350]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 10 15:52:27 host01 corosync[14350]: [TOTEM ] A new membership (10.0.2.110:127160) was formed. Members
Mar 10 15:52:27 host01 corosync[14350]: [QUORUM] Members[12]: 1 2 3 4 5 6 7 8 9 10 11 12
Mar 10 15:52:27 host01 corosync[14350]: [MAIN ] Completed service synchronization, ready to provide service.

I am not able to start or restart pve-cluster and pvestatd.
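What I tried is roughly the following (the pkill step is my assumption for clearing a stuck pmxcfs, since the log above shows pve-cluster timing out on SIGTERM):

```shell
# Plain restart attempts (both hang / fail):
systemctl restart pve-cluster
systemctl restart pvestatd
systemctl status pve-cluster pvestatd

# Assumption: if pmxcfs is stuck ("stop-sigterm timed out" in the log),
# force-kill it before starting the unit again:
systemctl stop pve-cluster
pkill -9 pmxcfs
systemctl start pve-cluster
```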

Multicast is still working; I tested it with omping across the whole cluster of 12 nodes.
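The omping test was started in parallel on every node, roughly like this (host names shortened here; in reality all 12 node names were listed):

```shell
# Run the same command on all 12 nodes at the same time:
# -c 600 probes, -i 1 second interval, -q quiet summary output
omping -c 600 -i 1 -q host01 host02 host03 host04
# ...plus the remaining node names
```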

Does anyone have any idea? I am not able to reinstall the cluster :-(

Cheers

Daniel
