High Availability Cluster questions

bread-baker

Hello,
I'm trying to set up a 3-node cluster using the wiki information.

First question: does fencing need to be set up before creating the cluster?
 
OK, I tried to set up the 3-node cluster.

First I made sure multicast was working, using ssmping.

Then on the first node I ran pvecm create, and on the others I ran pvecm add.
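
Roughly, the commands were (the cluster name and the first node's IP are as they appear in the output below):
Code:
# on the first node (s009)
pvecm create fbcluster
# on each of the other nodes
pvecm add 10.100.100.240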

The cluster is not OK. Here is what we've got:

Code:
s009 fbc240 /etc # pvecm nodes
Node  Sts   Inc   Joined               Name
   1   M      4   2012-04-06 18:53:49  s009
   2   X      0                        s010
   3   X      0                        s002
s009 fbc240 /etc # pvecm status
Version: 6.2.0
Config Version: 7
Cluster Name: fbcluster
Cluster Id: 52020
Cluster Member: Yes
Cluster Generation: 4
Membership state: Cluster-Member
Nodes: 1
Expected votes: 1
Total votes: 1
Node votes: 1
Quorum: 1  
Active subsystems: 5
Flags: 
Ports Bound: 0  
Node name: s009
Node ID: 1
Multicast addresses: 239.192.203.0 
Node addresses: 10.100.100.240

and the other 2 nodes:
Code:
s010 fbc246 /etc # pvecm status
Version: 6.2.0
Config Version: 2
Cluster Name: fbcluster
Cluster Id: 52020
Cluster Member: Yes
Cluster Generation: 12
Membership state: Cluster-Member
Nodes: 1
Expected votes: 2
Total votes: 1
Node votes: 1
Quorum: 2 Activity blocked
Active subsystems: 1
Flags: 
Ports Bound: 0  
Node name: s010
Node ID: 2
Multicast addresses: 239.192.203.0 
Node addresses: 10.100.100.246 
s010 fbc246 /etc # pvecm nodes
Node  Sts   Inc   Joined               Name
   1   X      0                        s009
   2   M     12   2012-04-06 20:17:48  s010


s002 fbc247 /etc # pvecm status
Version: 6.2.0
Config Version: 3
Cluster Name: fbcluster
Cluster Id: 52020
Cluster Member: Yes
Cluster Generation: 8
Membership state: Cluster-Member
Nodes: 1
Expected votes: 3
Total votes: 1
Node votes: 1
Quorum: 2 Activity blocked
Active subsystems: 1
Flags: 
Ports Bound: 0  
Node name: s002
Node ID: 3
Multicast addresses: 239.192.203.0 
Node addresses: 10.100.100.247 
s002 fbc247 /etc # pvecm nodes
Node  Sts   Inc   Joined               Name
   1   X      0                        s009
   2   X      0                        s010
   3   M      8   2012-04-06 19:53:04  s002

Any suggestions to correct the issue?
 
Did you get an error message after adding the nodes? If not, what happened after you added the nodes?
 
Did you get an error message after adding the nodes? If not, what happened after you added the nodes?

This occurred:
Code:
s002 fbc247 ~ # pvecm add 10.100.100.240
copy corosync auth key
stopping pve-cluster service
Stopping pve cluster filesystem: pve-cluster.
backup old database
Starting pve cluster filesystem : pve-cluster.
Starting cluster: 
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Waiting for quorum... 
Timed-out waiting for cluster
[FAILED]
waiting for quorum...Write failed: Broken pipe

Before that, I tested multicast with:
Code:
002 fbc247 ~ # asmping 224.0.2.1 10.100.100.240
asmping joined (S,G) = (*,224.0.2.234)
pinging 10.100.100.240 from 10.100.100.247
  unicast from 10.100.100.240, seq=1 dist=0 time=0.497 ms
  unicast from 10.100.100.240, seq=2 dist=0 time=0.187 ms
  unicast from 10.100.100.240, seq=3 dist=0 time=0.187 ms
  unicast from 10.100.100.240, seq=4 dist=0 time=0.175 ms
  unicast from 10.100.100.240, seq=5 dist=0 time=0.209 ms
  unicast from 10.100.100.240, seq=6 dist=0 time=0.173 ms
  unicast from 10.100.100.240, seq=7 dist=0 time=0.188 ms
  unicast from 10.100.100.240, seq=8 dist=0 time=0.150 ms
  unicast from 10.100.100.240, seq=9 dist=0 time=0.183 ms
 
Are there any hints in /var/log/syslog (cman)?

Not that I can see:
Code:
zgrep cman syslog*
syslog.2.gz:Apr  6 20:17:48 s010 corosync[2489]:   [MAIN  ] Successfully parsed cman config       
syslog.2.gz:Apr  6 20:17:48 s010 corosync[2489]:   [QUORUM] Using quorum provider quorum_cman              
syslog.2.gz:Apr  6 20:17:48 s010 corosync[2489]:   [QUORUM] Using quorum provider quorum_cman              
syslog.4.gz:Apr  6 18:55:22 s010 corosync[169467]:   [MAIN  ] Successfully parsed cman config              
syslog.4.gz:Apr  6 18:55:23 s010 corosync[169467]:   [QUORUM] Using quorum provider quorum_cman
syslog.4.gz:Apr  6 18:55:23 s010 corosync[169467]:   [QUORUM] Using quorum provider quorum_cman
syslog.4.gz:Apr  6 19:54:25 s010 corosync[2946]:   [MAIN  ] Successfully parsed cman config
syslog.4.gz:Apr  6 19:54:26 s010 corosync[2946]:   [QUORUM] Using quorum provider quorum_cman
syslog.4.gz:Apr  6 19:54:26 s010 corosync[2946]:   [QUORUM] Using quorum provider quorum_cman


s002 fbc247 /var/log # zgrep cman syslog*
syslog.0:Apr  6 18:55:37 s002 corosync[15426]:   [MAIN  ] Successfully parsed cman config
syslog.0:Apr  6 18:55:37 s002 corosync[15426]:   [QUORUM] Using quorum provider quorum_cman
syslog.0:Apr  6 18:55:37 s002 corosync[15426]:   [QUORUM] Using quorum provider quorum_cman
syslog.2.gz:Apr  6 19:53:04 s002 corosync[1672]:   [MAIN  ] Successfully parsed cman config
syslog.2.gz:Apr  6 19:53:04 s002 corosync[1672]:   [QUORUM] Using quorum provider quorum_cman
syslog.2.gz:Apr  6 19:53:04 s002 corosync[1672]:   [QUORUM] Using quorum provider quorum_cman
 
I had not checked the first node; there is something there:
Code:
s009 fbc240 /var/log # zgrep cman syslog*
syslog.0:Apr  6 19:38:11 s009 pmxcfs[33658]: [dcdb] crit: cman_tool version failed with exit code 1#010
syslog.0:Apr  6 19:44:07 s009 pmxcfs[33658]: [dcdb] crit: cman_tool version failed with exit code 1#010
syslog.0:Apr  6 19:45:08 s009 pmxcfs[33658]: [dcdb] crit: cman_tool version failed with exit code 1#010
syslog.3.gz:Apr  6 18:53:49 s009 corosync[33742]:   [MAIN  ] Successfully parsed cman config
syslog.3.gz:Apr  6 18:53:49 s009 corosync[33742]:   [QUORUM] Using quorum provider quorum_cman
syslog.3.gz:Apr  6 18:53:49 s009 corosync[33742]:   [QUORUM] Using quorum provider quorum_cman

That could have been from when I was setting up fencing and getting syntax errors at first?
 
Are the keys in /root/.ssh taken over by the cluster setup? I ask because we've been syncing a common id_rsa and id_rsa.pub to our servers for some time. Maybe that is causing our issue. Here is /root/.ssh:
Code:
s009 fbc240 ~/.ssh # ll
total 132
-rw------- 1 root root  9040 Mar 29 20:40 authorized_keys
-rw------- 1 root root  9040 Mar 29 20:40 authorized_keys.org
-rw------- 1 root root  1675 Mar 12 13:51 id_rsa
-rw-r--r-- 1 root root   395 Mar 12 13:51 id_rsa.pub
-rw-r--r-- 1 root root 94701 Apr  5 15:42 known_hosts
 
That could have been from when I was setting up fencing and getting syntax errors at first?

That should not be a problem, but please check that you have the same /etc/cluster/cluster.conf file on all nodes (reboot any node with an outdated version).
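
One quick way to compare them (just a sketch; the hostnames are taken from your output above):
Code:
# run from any node; all three checksums should be identical
for h in s009 s010 s002; do ssh root@$h md5sum /etc/cluster/cluster.conf; done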
 
Are the keys in /root/.ssh taken over by the cluster setup? I ask because we've been syncing a common id_rsa and id_rsa.pub to our servers for some time.

Yes, this is (and always was) used by the PVE cluster. Just make sure that all nodes can connect to each other without a password.
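
For example, a quick check could look like this (a sketch; hostnames from this thread, and BatchMode makes ssh fail instead of prompting):
Code:
# run on each node in turn; every hostname should print without a password prompt
for h in s009 s010 s002; do ssh -o BatchMode=yes root@$h hostname; done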
 
The good /etc/cluster/cluster.conf file is on fbc240; fbc246 and fbc247 had an out-of-date /etc/cluster/cluster.conf.

I rebooted 246 and 247.

/etc/cluster/cluster.conf is still old on 246 and 247.
 
Yes, this is (and always was) used by the PVE cluster. Just make sure that all nodes can connect to each other without a password.

I thought that was my mistake: letting our script update the /root/.ssh/ keys. So I'll disable that update, and then:

1- remove the 2 bad nodes
2- reinstall Proxmox on them
3- add them back to the cluster (a rough sketch of the commands is below)
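
Roughly, something like this (a sketch; I assume 'pvecm delnode' is run on the node that stays in the cluster, s009):
Code:
# on s009, remove the two bad nodes
pvecm delnode s010
pvecm delnode s002
# after reinstalling Proxmox, on each reinstalled node:
pvecm add 10.100.100.240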

My question: for the 'main' node, is there a way to fix the missing symlinks in /root/.ssh/, or should I reinstall and start over?

Here are the /root/.ssh and /etc/pve listings:
Code:
s009 fbc240 ~ # ls -l /root/.ssh  /etc/pve
/etc/pve:
total 5
-rw-r----- 1 root www-data  451 Mar  2 12:49 authkey.pub
-rw-r----- 1 root www-data 1228 Apr  6 20:01 cluster.conf
-rw-r----- 1 root www-data  281 Apr  6 18:55 cluster.conf.old
-rw-r----- 1 root www-data  328 Apr  6 19:36 cluster.conf.ori
-rw-r----- 1 root www-data   16 Mar  2 12:40 datacenter.cfg
lrwxr-x--- 1 root www-data    0 Dec 31  1969 local -> nodes/s009
drwxr-x--- 2 root www-data    0 Mar  2 12:49 nodes
lrwxr-x--- 1 root www-data    0 Dec 31  1969 openvz -> nodes/s009/openvz
drwx------ 2 root www-data    0 Mar  2 12:49 priv
-rw-r----- 1 root www-data 1533 Mar  2 12:49 pve-root-ca.pem
-rw-r----- 1 root www-data 1679 Mar  2 12:49 pve-www.key
lrwxr-x--- 1 root www-data    0 Dec 31  1969 qemu-server -> nodes/s009/qemu-server
-rw-r----- 1 root www-data  235 Apr  2 21:53 storage.cfg
-rw-r----- 1 root www-data   49 Mar  2 12:40 user.cfg
-rw-r----- 1 root www-data  285 Mar 29 15:59 vzdump.cron


/root/.ssh:
total 132
-rw------- 1 root root  9040 Mar 29 20:40 authorized_keys
-rw------- 1 root root  9040 Mar 29 20:40 authorized_keys.org
-rw------- 1 root root  1675 Mar 12 13:51 id_rsa
-rw-r--r-- 1 root root   395 Mar 12 13:51 id_rsa.pub
-rw-r--r-- 1 root root 94701 Apr  5 15:42 known_hosts
 
..
002 fbc247 ~ # asmping 224.0.2.1 10.100.100.240
asmping joined (S,G) = (*,224.0.2.234)
pinging 10.100.100.240 from 10.100.100.247
unicast from 10.100.100.240, seq=1 dist=0 time=0.497 ms
unicast from 10.100.100.240, seq=2 dist=0 time=0.187 ms
unicast from 10.100.100.240, seq=3 dist=0 time=0.187 ms
unicast from 10.100.100.240, seq=4 dist=0 time=0.175 ms
unicast from 10.100.100.240, seq=5 dist=0 time=0.209 ms
unicast from 10.100.100.240, seq=6 dist=0 time=0.173 ms
unicast from 10.100.100.240, seq=7 dist=0 time=0.188 ms
unicast from 10.100.100.240, seq=8 dist=0 time=0.150 ms
unicast from 10.100.100.240, seq=9 dist=0 time=0.183 ms

The same command here shows (see also the example from the wiki):

Code:
asmping 224.0.2.1 192.168.7.201
asmping joined (S,G) = (*,224.0.2.234)
pinging 192.168.7.201 from 192.168.7.202
  unicast from 192.168.7.201, seq=1 dist=0 time=0.313 ms
  unicast from 192.168.7.201, seq=2 dist=0 time=0.336 ms
multicast from 192.168.7.201, seq=2 dist=0 time=0.378 ms
  unicast from 192.168.7.201, seq=3 dist=0 time=0.317 ms
multicast from 192.168.7.201, seq=3 dist=0 time=0.359 ms
  unicast from 192.168.7.201, seq=4 dist=0 time=0.317 ms
multicast from 192.168.7.201, seq=4 dist=0 time=0.356 ms
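
Note that asmping needs ssmpingd running on the target host; a minimal test between two of your nodes would look something like this (IPs from your output):
Code:
# on the target node (10.100.100.240)
ssmpingd
# on the node you are testing from
asmping 224.0.2.1 10.100.100.240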
 
Tom, thanks. I did not notice the missing 'multicast' lines. I'll get multicast working correctly before trying to make a cluster.
 
OK, I fixed multicast on our switches.

Rebooted, and cluster.conf updated.

Now I just need to fix the symlinks in /root/.ssh

Could someone please post the output of
Code:
ls -l /root/.ssh

so that I can try to fix root's .ssh information?

Thanks for the help.

PS: I put pictures of the Netgear managed switch multicast settings at http://pve.proxmox.com/wiki/Multicast_notes
 
Now I just need to fix the symlinks in /root/.ssh

Try to run:

# pvecm updatecerts

That tries to fix the symlinks.

Here is the output of 'ls -l /root/.ssh':

Code:
 ls -l /root/.ssh/
total 32
lrwxrwxrwx 1 root root    29 Aug 10  2011 authorized_keys -> /etc/pve/priv/authorized_keys
lrwxrwxrwx 1 root root    29 Aug 10  2011 authorized_keys.org -> /etc/pve/priv/authorized_keys
-rw------- 1 root root  1679 Dec 17  2010 id_rsa
-rw-r--r-- 1 root root   391 Dec 17  2010 id_rsa.pub
-rw-r--r-- 1 root root 20626 Apr  6 08:37 known_hosts
 
I had to do this, as 'pvecm updatecerts' did not work here:
Code:
cd /root/.ssh
ln -sf /etc/pve/priv/authorized_keys
ln -sf /etc/pve/priv/authorized_keys authorized_keys.org

Then I rebooted all nodes, as I was getting connection errors to 2 of the nodes in PVE.

All seems well now.

Thank you!
 
