Cluster node(s) go "Offline"

neeko

At the beginning of troubleshooting this issue we had a 2 node cluster:

  • proxmox01 (pve 3.2-4)
  • proxmox02 (pve 3.3-1)
The second node had been part of the cluster for 1 week when we migrated 5 containers and 2 VMs to it. It remained functional for over a week, at which point I left the office for a week-long conference out of state.

While I was away, I got a call from one of my on-site techs: proxmox02 was showing as offline in the web UI and all the machines running on it were unavailable. He did not try to SSH into proxmox02 and just hard reset it at the server. It booted back up, all the containers and VMs started, and all was well in the web UI. It ran fine for 2 days and then it happened again. This time he hard reset it right away and then called me.

I had a new node that was ready to go out to join the cluster at our colo datacenter. I had him grab it off the bench, install it in the rack, join it to the cluster, and migrate all of the containers and VMs that were on proxmox02 to the new node, named proxmox03. At this point our cluster looks as follows:

  • proxmox01 (pve 3.2-4)
  • proxmox02 (pve 3.3-1)
  • proxmox03 (pve 3.3-1)
Everything was again running fine for the last 7 days. Then this morning both proxmox02 and proxmox03 experienced the same "offline" issue. This time I tried to SSH into proxmox03 and was successful, so I initiated a reboot. As soon as I did, proxmox03 appeared to be back "online" in the web UI, and there were 4 shutdown tasks for the VMs and 1 for the containers running. Once they completed, the server rebooted and everything came back online. I did the same for proxmox02, even though at this time it has no containers or VMs running on it.

Has anyone experienced anything like this before? Any suggestions on where to start? I had originally suspected a hardware issue when it happened to only one server, but now, with both nodes experiencing it at the same time, I am convinced that it is a configuration problem. I suspect something with quorum, but I am not sure where to start looking.

Any help would be greatly appreciated.
 
I just found a section of the UI where the Datacenter summary table has a column titled 'Estranged', which sounds interestingly close to how I might describe my issue. Does anyone know what that column represents? I don't see it documented anywhere and my Google-Fu is failing me. Below is a screenshot of the UI.

pve_cluster_estranged.PNG
 
I just found this section of the UI where I see a column in the Datacenter summary table that is title 'Estranged'

This flag is set after network partitioning at cluster level (corosync). But from what I see, corosync works perfectly in your case. Check with

# pvecm nodes
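
To compare what each node sees, it may help to capture the membership list together with the quorum summary on every node at the moment one shows as "offline". The following is only a minimal sketch: the function name `collect_cluster_state` and the `PVECM` override variable are hypothetical helpers made up for illustration, and the only Proxmox commands it assumes are `pvecm status` and `pvecm nodes`.

```shell
#!/bin/sh
# Sketch only: print this node's view of the cluster.
# collect_cluster_state and PVECM are illustrative names, not part of Proxmox VE.
# PVECM lets you point at a different binary; it defaults to pvecm.
PVECM="${PVECM:-pvecm}"

collect_cluster_state() {
    # Bail out early when the cluster manager CLI is not available.
    if ! command -v "$PVECM" >/dev/null 2>&1; then
        echo "$PVECM not found; run this on a Proxmox VE node" >&2
        return 1
    fi
    echo "== $(uname -n) =="
    "$PVECM" status   # quorum summary as seen from this node
    "$PVECM" nodes    # membership list as seen from this node
}
```

Run `collect_cluster_state` as root on proxmox01, proxmox02 and proxmox03 and compare the results. If all three agree on membership and quorum while the web UI still shows a node as offline, the problem is probably not at the corosync level.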
 
