High Availability Cluster questions

bread-baker

Hello,
I'm trying to set up a 3-node cluster using the wiki information.

First question: does fencing need to be set up before creating the cluster?
 
OK, I tried to set up the 3-node cluster.

First I made sure multicast was working, using ssmping.

Then on the first node I ran pvecm create, and on the others I ran pvecm add.
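
Roughly, the commands were (the cluster name and the first node's IP are as they appear in the output below):
Code:
# on the first node (s009)
pvecm create fbcluster
# on each of the other nodes
pvecm add 10.100.100.240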

The cluster is not OK. Here is what we've got:

Code:
s009 fbc240 /etc # pvecm nodes
Node  Sts   Inc   Joined               Name
   1   M      4   2012-04-06 18:53:49  s009
   2   X      0                        s010
   3   X      0                        s002
s009 fbc240 /etc # pvecm status
Version: 6.2.0
Config Version: 7
Cluster Name: fbcluster
Cluster Id: 52020
Cluster Member: Yes
Cluster Generation: 4
Membership state: Cluster-Member
Nodes: 1
Expected votes: 1
Total votes: 1
Node votes: 1
Quorum: 1  
Active subsystems: 5
Flags: 
Ports Bound: 0  
Node name: s009
Node ID: 1
Multicast addresses: 239.192.203.0 
Node addresses: 10.100.100.240

and the other 2 nodes:
Code:
s010 fbc246 /etc # pvecm status
Version: 6.2.0
Config Version: 2
Cluster Name: fbcluster
Cluster Id: 52020
Cluster Member: Yes
Cluster Generation: 12
Membership state: Cluster-Member
Nodes: 1
Expected votes: 2
Total votes: 1
Node votes: 1
Quorum: 2 Activity blocked
Active subsystems: 1
Flags: 
Ports Bound: 0  
Node name: s010
Node ID: 2
Multicast addresses: 239.192.203.0 
Node addresses: 10.100.100.246 
s010 fbc246 /etc # pvecm nodes
Node  Sts   Inc   Joined               Name
   1   X      0                        s009
   2   M     12   2012-04-06 20:17:48  s010


s002 fbc247 /etc # pvecm status
Version: 6.2.0
Config Version: 3
Cluster Name: fbcluster
Cluster Id: 52020
Cluster Member: Yes
Cluster Generation: 8
Membership state: Cluster-Member
Nodes: 1
Expected votes: 3
Total votes: 1
Node votes: 1
Quorum: 2 Activity blocked
Active subsystems: 1
Flags: 
Ports Bound: 0  
Node name: s002
Node ID: 3
Multicast addresses: 239.192.203.0 
Node addresses: 10.100.100.247 
s002 fbc247 /etc # pvecm nodes
Node  Sts   Inc   Joined               Name
   1   X      0                        s009
   2   X      0                        s010
   3   M      8   2012-04-06 19:53:04  s002

Any suggestions to correct the issue?
 
Did you get an error message after adding the nodes? If not, what happened after you added the nodes?
 
Did you get an error message after adding the nodes? If not, what happened after you added the nodes?

This occurred:
Code:
s002 fbc247 ~ # pvecm add 10.100.100.240
copy corosync auth key
stopping pve-cluster service
Stopping pve cluster filesystem: pve-cluster.
backup old database
Starting pve cluster filesystem : pve-cluster.
Starting cluster: 
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Waiting for quorum... 
Timed-out waiting for cluster
[FAILED]
waiting for quorum...Write failed: Broken pipe

Before that, I tested multicast with:
Code:
002 fbc247 ~ # asmping 224.0.2.1 10.100.100.240
asmping joined (S,G) = (*,224.0.2.234)
pinging 10.100.100.240 from 10.100.100.247
  unicast from 10.100.100.240, seq=1 dist=0 time=0.497 ms
  unicast from 10.100.100.240, seq=2 dist=0 time=0.187 ms
  unicast from 10.100.100.240, seq=3 dist=0 time=0.187 ms
  unicast from 10.100.100.240, seq=4 dist=0 time=0.175 ms
  unicast from 10.100.100.240, seq=5 dist=0 time=0.209 ms
  unicast from 10.100.100.240, seq=6 dist=0 time=0.173 ms
  unicast from 10.100.100.240, seq=7 dist=0 time=0.188 ms
  unicast from 10.100.100.240, seq=8 dist=0 time=0.150 ms
  unicast from 10.100.100.240, seq=9 dist=0 time=0.183 ms
 
Are there any hints in /var/log/syslog (cman)?

Not that I can see:
Code:
zgrep cman syslog*
syslog.2.gz:Apr  6 20:17:48 s010 corosync[2489]:   [MAIN  ] Successfully parsed cman config       
syslog.2.gz:Apr  6 20:17:48 s010 corosync[2489]:   [QUORUM] Using quorum provider quorum_cman              
syslog.2.gz:Apr  6 20:17:48 s010 corosync[2489]:   [QUORUM] Using quorum provider quorum_cman              
syslog.4.gz:Apr  6 18:55:22 s010 corosync[169467]:   [MAIN  ] Successfully parsed cman config              
syslog.4.gz:Apr  6 18:55:23 s010 corosync[169467]:   [QUORUM] Using quorum provider quorum_cman
syslog.4.gz:Apr  6 18:55:23 s010 corosync[169467]:   [QUORUM] Using quorum provider quorum_cman
syslog.4.gz:Apr  6 19:54:25 s010 corosync[2946]:   [MAIN  ] Successfully parsed cman config
syslog.4.gz:Apr  6 19:54:26 s010 corosync[2946]:   [QUORUM] Using quorum provider quorum_cman
syslog.4.gz:Apr  6 19:54:26 s010 corosync[2946]:   [QUORUM] Using quorum provider quorum_cman


s002 fbc247 /var/log # zgrep cman syslog*
syslog.0:Apr  6 18:55:37 s002 corosync[15426]:   [MAIN  ] Successfully parsed cman config
syslog.0:Apr  6 18:55:37 s002 corosync[15426]:   [QUORUM] Using quorum provider quorum_cman
syslog.0:Apr  6 18:55:37 s002 corosync[15426]:   [QUORUM] Using quorum provider quorum_cman
syslog.2.gz:Apr  6 19:53:04 s002 corosync[1672]:   [MAIN  ] Successfully parsed cman config
syslog.2.gz:Apr  6 19:53:04 s002 corosync[1672]:   [QUORUM] Using quorum provider quorum_cman
syslog.2.gz:Apr  6 19:53:04 s002 corosync[1672]:   [QUORUM] Using quorum provider quorum_cman
 
I had not checked the first node; there is something there:
Code:
s009 fbc240 /var/log # zgrep cman syslog*
syslog.0:Apr  6 19:38:11 s009 pmxcfs[33658]: [dcdb] crit: cman_tool version failed with exit code 1#010
syslog.0:Apr  6 19:44:07 s009 pmxcfs[33658]: [dcdb] crit: cman_tool version failed with exit code 1#010
syslog.0:Apr  6 19:45:08 s009 pmxcfs[33658]: [dcdb] crit: cman_tool version failed with exit code 1#010
syslog.3.gz:Apr  6 18:53:49 s009 corosync[33742]:   [MAIN  ] Successfully parsed cman config
syslog.3.gz:Apr  6 18:53:49 s009 corosync[33742]:   [QUORUM] Using quorum provider quorum_cman
syslog.3.gz:Apr  6 18:53:49 s009 corosync[33742]:   [QUORUM] Using quorum provider quorum_cman

That could have been from when I was setting up fencing and getting syntax errors at first?
 
Are the keys in /root/.ssh taken over by the cluster setup? I ask because we've been syncing a common id_rsa and id_rsa.pub to our servers for some time. Maybe that is causing our issue. Here is /root/.ssh:
Code:
s009 fbc240 ~/.ssh # ll
total 132
-rw------- 1 root root  9040 Mar 29 20:40 authorized_keys
-rw------- 1 root root  9040 Mar 29 20:40 authorized_keys.org
-rw------- 1 root root  1675 Mar 12 13:51 id_rsa
-rw-r--r-- 1 root root   395 Mar 12 13:51 id_rsa.pub
-rw-r--r-- 1 root root 94701 Apr  5 15:42 known_hosts
 
That could have been from when I was setting up fencing and getting syntax errors at first?

That should not be a problem, but please check that you have the same /etc/cluster/cluster.conf file on all nodes (reboot any node with an outdated version).
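
One quick way to compare them (just a sketch; the hostnames are taken from your output above):
Code:
# run from any node; all three checksums should be identical
for h in s009 s010 s002; do ssh root@$h md5sum /etc/cluster/cluster.conf; done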
 
Are the keys in /root/.ssh taken over by the cluster setup? I ask because we've been syncing a common id_rsa and id_rsa.pub to our servers for some time.

Yes, this is (and always was) used by the PVE cluster. Just make sure that all nodes can connect to each other without a password.
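
For example, a quick check could look like this (a sketch; hostnames from this thread, and BatchMode makes ssh fail instead of prompting):
Code:
# run on each node in turn; every hostname should print without a password prompt
for h in s009 s010 s002; do ssh -o BatchMode=yes root@$h hostname; done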
 
The good /etc/cluster/cluster.conf file is on fbc240; fbc246 and fbc247 had an out-of-date /etc/cluster/cluster.conf.

I rebooted 246 and 247.

/etc/cluster/cluster.conf is still old on 246 and 247.
 
Yes, this is (and always was) used by the PVE cluster. Just make sure that all nodes can connect to each other without a password.

I thought that was my mistake: letting our script update the /root/.ssh/ keys. So I'll disable that update, and then:

1- remove the 2 bad nodes
2- reinstall Proxmox on them
3- add them back to the cluster (a rough sketch of the commands is below)
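
Roughly, something like this (a sketch; I assume 'pvecm delnode' is run on the node that stays in the cluster, s009):
Code:
# on s009, remove the two bad nodes
pvecm delnode s010
pvecm delnode s002
# after reinstalling Proxmox, on each reinstalled node:
pvecm add 10.100.100.240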

My question: for the 'main' node, is there a way to fix the missing symlinks in /root/.ssh/, or should I reinstall and start over?

Here are the /root/.ssh and /etc/pve listings:
Code:
s009 fbc240 ~ # ls -l /root/.ssh  /etc/pve
/etc/pve:
total 5
-rw-r----- 1 root www-data  451 Mar  2 12:49 authkey.pub
-rw-r----- 1 root www-data 1228 Apr  6 20:01 cluster.conf
-rw-r----- 1 root www-data  281 Apr  6 18:55 cluster.conf.old
-rw-r----- 1 root www-data  328 Apr  6 19:36 cluster.conf.ori
-rw-r----- 1 root www-data   16 Mar  2 12:40 datacenter.cfg
lrwxr-x--- 1 root www-data    0 Dec 31  1969 local -> nodes/s009
drwxr-x--- 2 root www-data    0 Mar  2 12:49 nodes
lrwxr-x--- 1 root www-data    0 Dec 31  1969 openvz -> nodes/s009/openvz
drwx------ 2 root www-data    0 Mar  2 12:49 priv
-rw-r----- 1 root www-data 1533 Mar  2 12:49 pve-root-ca.pem
-rw-r----- 1 root www-data 1679 Mar  2 12:49 pve-www.key
lrwxr-x--- 1 root www-data    0 Dec 31  1969 qemu-server -> nodes/s009/qemu-server
-rw-r----- 1 root www-data  235 Apr  2 21:53 storage.cfg
-rw-r----- 1 root www-data   49 Mar  2 12:40 user.cfg
-rw-r----- 1 root www-data  285 Mar 29 15:59 vzdump.cron


/root/.ssh:
total 132
-rw------- 1 root root  9040 Mar 29 20:40 authorized_keys
-rw------- 1 root root  9040 Mar 29 20:40 authorized_keys.org
-rw------- 1 root root  1675 Mar 12 13:51 id_rsa
-rw-r--r-- 1 root root   395 Mar 12 13:51 id_rsa.pub
-rw-r--r-- 1 root root 94701 Apr  5 15:42 known_hosts
 
..
002 fbc247 ~ # asmping 224.0.2.1 10.100.100.240
asmping joined (S,G) = (*,224.0.2.234)
pinging 10.100.100.240 from 10.100.100.247
unicast from 10.100.100.240, seq=1 dist=0 time=0.497 ms
unicast from 10.100.100.240, seq=2 dist=0 time=0.187 ms
unicast from 10.100.100.240, seq=3 dist=0 time=0.187 ms
unicast from 10.100.100.240, seq=4 dist=0 time=0.175 ms
unicast from 10.100.100.240, seq=5 dist=0 time=0.209 ms
unicast from 10.100.100.240, seq=6 dist=0 time=0.173 ms
unicast from 10.100.100.240, seq=7 dist=0 time=0.188 ms
unicast from 10.100.100.240, seq=8 dist=0 time=0.150 ms
unicast from 10.100.100.240, seq=9 dist=0 time=0.183 ms

The same command here shows (see also the example from the wiki):

Code:
asmping 224.0.2.1 192.168.7.201
asmping joined (S,G) = (*,224.0.2.234)
pinging 192.168.7.201 from 192.168.7.202
  unicast from 192.168.7.201, seq=1 dist=0 time=0.313 ms
  unicast from 192.168.7.201, seq=2 dist=0 time=0.336 ms
multicast from 192.168.7.201, seq=2 dist=0 time=0.378 ms
  unicast from 192.168.7.201, seq=3 dist=0 time=0.317 ms
multicast from 192.168.7.201, seq=3 dist=0 time=0.359 ms
  unicast from 192.168.7.201, seq=4 dist=0 time=0.317 ms
multicast from 192.168.7.201, seq=4 dist=0 time=0.356 ms
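
Note that asmping needs ssmpingd running on the target host; a minimal test between two of your nodes would look something like this (IPs from your output):
Code:
# on the target node (10.100.100.240)
ssmpingd
# on the node you are testing from
asmping 224.0.2.1 10.100.100.240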
 
Tom, thanks. I did not notice the missing 'multicast' lines. I'll get multicast working correctly before trying to make a cluster.
 
OK, I fixed multicast on our switches.

Rebooted, and cluster.conf updated.

Now I just need to fix the symlinks in /root/.ssh

Could someone please post the output of
Code:
ls -l /root/.ssh

so that I can try to fix root's .ssh information?

Thanks for the help.

PS: I put pictures of the Netgear managed switch multicast settings at http://pve.proxmox.com/wiki/Multicast_notes
 
Now I just need to fix the symlinks in /root/.ssh

Try to run:

# pvecm updatecerts

That tries to fix the symlinks.

Here is the output of 'ls -l /root/.ssh':

Code:
 ls -l /root/.ssh/
total 32
lrwxrwxrwx 1 root root    29 Aug 10  2011 authorized_keys -> /etc/pve/priv/authorized_keys
lrwxrwxrwx 1 root root    29 Aug 10  2011 authorized_keys.org -> /etc/pve/priv/authorized_keys
-rw------- 1 root root  1679 Dec 17  2010 id_rsa
-rw-r--r-- 1 root root   391 Dec 17  2010 id_rsa.pub
-rw-r--r-- 1 root root 20626 Apr  6 08:37 known_hosts
 
I had to do this, as 'pvecm updatecerts' did not work here:
Code:
cd /root/.ssh
ln -sf /etc/pve/priv/authorized_keys
ln -sf /etc/pve/priv/authorized_keys authorized_keys.org

Then I rebooted all nodes, as I was getting connection errors to 2 of the nodes in PVE.

All seems well now.

Thank you!
 
