Hey Guys,
For two months now we have had a two-node cluster running in our noise chamber.
Both nodes were running Proxmox 2.3 and worked as well as could be expected for the way they were configured.
The problem started today.
We noticed that version 2.3 is already outdated, and we were about to set up HA clustering.
We installed a third node and added it to the cluster, then decided to upgrade all
nodes one after another.
The third and second nodes went fine; neither was holding any virtual machines at the time.
At that point every node was still accessible via each of the others' web interfaces.
One of the virtual machines is our firewall, the other one our DNS, so I prepared the server
according to the wiki up to this step:
Code:
./pve-upgrade-2.3-to-3.0 --download-only
After shutting down all the virtual machines I went ahead with the upgrade.
So, this is the part, I guess, where a lot of stuff started to go wrong.
The node told me the upgrade failed. My guess was that it just couldn't download some final updates,
since it was the one hosting the shut-down firewall, so I rebooted it and tried to access it.
First I tried to access it from another node and got an error message
that Google hasn't turned up anything useful for so far:
Code:
Connection error 595: Connection refused
Damn it. The other node said the same.
I wanted to access node 1 directly, but the page was not reachable.
Double damn.
So I did some testing with pings, SSH sessions and so on, and found that node 1 is still reachable from everywhere: ping, SSH and the file transfers WinSCP does all work fine.
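A check like that can be sketched as follows. This is only an illustration: the `NODE` address is a placeholder and the `port_state` helper is made up for this example, not the exact commands used that day. It assumes bash (for `/dev/tcp`) and the coreutils `timeout` command; 8006 is the port the Proxmox web interface normally listens on.

```shell
#!/usr/bin/env bash
# Placeholder address: replace with the node's real IP (127.0.0.1 is only a stand-in).
NODE="${NODE:-127.0.0.1}"

# Hypothetical helper: print "open" if a TCP connection to host:port
# succeeds within 2 seconds, otherwise print "closed".
port_state() {
    local host="$1" port="$2"
    if timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
        echo "open"
    else
        echo "closed"
    fi
}

# 22 = SSH, 8006 = Proxmox web interface
for port in 22 8006; do
    echo "${NODE}:${port} is $(port_state "$NODE" "$port")"
done
```

If port 22 shows as open while 8006 shows as closed, the node itself is up and only the web interface service is not answering, which would match the symptoms described here.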
My next attempt was to shut down all the nodes and bring them back up one after another: node one as the cluster master, then node two, then node three.
But right after booting node one I noticed that nothing had improved.
So I started migrating all the virtual machines to node two, noticed that error 595 was still present, and wanted to restore the backups I had made on the NAS before I started. I only found one of the three backups.
Then, after seeing that the little icon for node one is red in the web interface, I found a suggestion to check several services (cluster, daemon and so on).
I also got the hint to start the Apache service manually and received a message saying that the DocumentRoot is missing:
Code:
[....] Starting web server: apache2Warning: DocumentRoot [/usr/share/pve-manager/root] does not exist
[Thu Nov 28 15:27:07 2013] [warn] NameVirtualHost *:80 has no VirtualHosts
Action 'start' failed.
The Apache error log may have more information.
failed!
That is all I got for today.
Since node one locked itself up, I couldn't try the upgrade again. Getting the firewall working on node two didn't help, as there are some unknown settings.
I would really appreciate your help; tomorrow is my last chance to get this cluster working.
Any hints on how to fix this problem?
Can I roll the node back to the state it was in before the upgrade?
How do I get the config files of the virtual machines so I can easily restart them on the other nodes?
Thanks for reading this many letters; I hope we can fix this together. My classmates already tried to hurt me because the internet was down.
Greetings