/etc/pve/local for node missing

bread-baker

I added the 2nd node to a cluster, and it is missing the /etc/pve/nodes directory.

Both nodes use:
Code:
pve-manager: 2.0-12 (pve-manager/2.0/784729f4)
running kernel: 2.6.32-6-pve
proxmox-ve-2.6.32: 2.0-53
pve-kernel-2.6.32-6-pve: 2.6.32-53
lvm2: 2.02.86-1pve2
clvm: 2.02.86-1pve2
corosync-pve: 1.4.1-1
openais-pve: 1.1.4-1
libqb: 0.6.0-1
redhat-cluster-pve: 3.1.7-1
pve-cluster: 1.0-12
qemu-server: 2.0-9
pve-firmware: 1.0-13
libpve-common-perl: 1.0-8
libpve-access-control: 1.0-2
libpve-storage-perl: 2.0-8
vncterm: 1.0-2
vzctl: 3.0.29-3pve3
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 0.15.0-1
ksm-control-daemon: 1.1-1


(qemu-server is one release behind on one of the nodes, but that probably has nothing to do with the issue.)


This is what I did to create the cluster:
Code:
root@fbc1 /usr/bin # pvecm create fbc
Restarting pve cluster filesystem: pve-cluster.
Starting cluster: 
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Waiting for quorum... [  OK  ]
   Starting fenced... [  OK  ]
   Starting dlm_controld... [  OK  ]
   Unfencing self... [  OK  ]
   Joining fence domain... [  OK  ]
root@fbc1 /usr/bin # pvecm status
Version: 6.2.0
Config Version: 1
Cluster Name: fbc
Cluster Id: 703
Cluster Member: Yes
Cluster Generation: 4
Membership state: Cluster-Member
Nodes: 1
Expected votes: 1
Total votes: 1
Node votes: 1
Quorum: 1  
Active subsystems: 5
Flags: 
Ports Bound: 0  
Node name: fbc1
Node ID: 1
Multicast addresses: 239.192.2.193 
Node addresses: 10.100.100.1

Then, to add a node (note this line: 'Waiting for quorum... Timed-out waiting for cluster'):
Code:
root@fbc186 ~ # pvecm add 10.100.100.1
root@10.100.100.1's password: 
copy corosync auth key
stopping pve-cluster service
Stopping pve cluster filesystem: pve-cluster.
backup old database
Starting pve cluster filesystem : pve-cluster.
Starting cluster: 
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Waiting for quorum... Timed-out waiting for cluster
[FAILED]
cluster not ready - no quorum?
root@fbc186 ~ # pvecm status
Version: 6.2.0
Config Version: 2
Cluster Name: fbc
Cluster Id: 703
Cluster Member: Yes
Cluster Generation: 4
Membership state: Cluster-Member
Nodes: 1
Expected votes: 2
Total votes: 1
Node votes: 1
Quorum: 2 Activity blocked
Active subsystems: 1
Flags: 
Ports Bound: 0  
Node name: fbc186
Node ID: 2
Multicast addresses: 239.192.2.193 
Node addresses: 10.100.100.186

This did not look correct on either node:
Code:
root@fbc186 ~ # pvecm nodes
Node  Sts   Inc   Joined               Name
   1   X      0                        fbc1
   2   M      4   2011-12-01 21:03:48  fbc186

root@fbc1 ~ # pvecm nodes
Node  Sts   Inc   Joined               Name
   1   M      4   2011-12-01 21:01:24  fbc1
   2   X      0                        fbc186

So I rebooted fbc186, and then this part looks OK:
Code:
root@fbc1 ~ # pvecm nodes
Node  Sts   Inc   Joined               Name
   1   M      4   2011-12-01 21:01:24  fbc1
   2   M     12   2011-12-01 21:14:27  fbc186

Here is 'ls -la /etc/pve' on both nodes:
Code:
root@fbc186 /etc/pve # ls -la /etc/pve
total 5
drwxr-x---  2 root www-data    0 Dec 31  1969 .
drwxr-xr-x 84 root root     4096 Dec  1 21:13 ..
-rw-r-----  1 root www-data  277 Dec  1 21:03 cluster.conf
-r--r-----  1 root www-data  153 Dec 31  1969 .clusterlog
-rw-r-----  1 root www-data    2 Dec 31  1969 .debug
lrwxr-x---  1 root www-data    0 Dec 31  1969 local -> nodes/fbc186
-r--r-----  1 root www-data  223 Dec 31  1969 .members
lrwxr-x---  1 root www-data    0 Dec 31  1969 openvz -> nodes/fbc186/openvz
lrwxr-x---  1 root www-data    0 Dec 31  1969 qemu-server -> nodes/fbc186/qemu-server
-r--r-----  1 root www-data  200 Dec 31  1969 .rrd
-r--r-----  1 root www-data  230 Dec 31  1969 .version
-r--r-----  1 root www-data   18 Dec 31  1969 .vmlist


root@fbc1 /etc/pve # ls -la /etc/pve
total 16
drwxr-x---   2 root www-data     0 Dec 31  1969 .
drwxr-xr-x 125 root root     12288 Dec  1 21:01 ..
-r--r-----   1 root www-data   451 Oct 31 12:53 authkey.pub
-r--r-----   1 root www-data   277 Dec  1 21:03 cluster.conf
-r--r-----   1 root www-data   228 Dec  1 21:03 cluster.conf.old
-r--r-----   1 root www-data   938 Dec 31  1969 .clusterlog
-rw-r-----   1 root www-data     2 Dec 31  1969 .debug
lr-xr-x---   1 root www-data     0 Dec 31  1969 local -> nodes/fbc1
-r--r-----   1 root www-data   219 Dec 31  1969 .members
dr-xr-x---   2 root www-data     0 Oct 31 12:53 nodes
lr-xr-x---   1 root www-data     0 Dec 31  1969 openvz -> nodes/fbc1/openvz
dr-x------   2 root www-data     0 Oct 31 12:53 priv
-r--r-----   1 root www-data  1533 Oct 31 12:53 pve-root-ca.pem
-r--r-----   1 root www-data  1675 Oct 31 12:53 pve-www.key
lr-xr-x---   1 root www-data     0 Dec 31  1969 qemu-server -> nodes/fbc1/qemu-server
-r--r-----   1 root www-data  1243 Dec 31  1969 .rrd
-r--r-----   1 root www-data   216 Dec  1 21:09 storage.cfg
-r--r-----   1 root www-data   228 Dec 31  1969 .version
-r--r-----   1 root www-data   393 Dec 31  1969 .vmlist
-r--r-----   1 root www-data   281 Nov 26 19:38 vzdump.cron

Note that on fbc186, nodes/fbc186/openvz is missing (there is no nodes directory at all).
 
More info:

fbc186 was created today. Originally it had one test KVM guest.

When I tried to add it to the cluster, there was a message that this could not be done with an existing VM, so I deleted the VM.

Originally I could ssh into fbc186 from fbc1, but I cannot now.

From fbc1 this does not work:
ssh-copy-id fbc186

It results in:
Code:
root@fbc1 ~ # ssh-copy-id fbc186
root@fbc186's password: 
bash: .ssh/authorized_keys: No such file or directory
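
(A guess: on a PVE 2.0 node, /root/.ssh/authorized_keys is normally a symlink into /etc/pve/priv/ (as the fbc1 listing further down shows), so if /etc/pve/priv does not exist on fbc186 the symlink target is missing and ssh-copy-id cannot append to it. Purely as a diagnostic, something like this should show whether that is the case:)
Code:
root@fbc1 ~ # ssh root@fbc186 'ls -la /root/.ssh/ /etc/pve/priv/'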


Let me know if you need more info or tests.
 
Please reinstall fbc186 and try to join again (when there is no existing VM on fbc186).
 
Please reinstall fbc186 and try to join again (when there is no existing VM on fbc186).
Will do that later.

fbc1, which was installed using the 1st beta and has GNOME and other software installed, may be the bigger problem. (I use it as a workstation and backup storage area.)

Before trying to set up a cluster, fbc1 had authorized keys set up to allow password-less login. Now those keys do not work, and I cannot update them:
Code:
root@fbcadmin ~ # ssh-copy-id -v fbc1
OpenSSH_5.5p1 Debian-6+squeeze1, OpenSSL 0.9.8o 01 Jun 2010
Pseudo-terminal will not be allocated because stdin is not a terminal.
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Applying options for *
ssh: Could not resolve hostname umask 077; test -d .ssh || mkdir .ssh ; cat >> .ssh/authorized_keys: Name or service not known

fbc1's .ssh listing:
Code:
total 112
drwx------  2 root root  4096 Dec  1 21:01 .
drwx------ 21 root pro   4096 Dec  1 22:54 ..
lrwxrwxrwx  1 root root    29 Dec  1 21:01 authorized_keys -> /etc/pve/priv/authorized_keys
-rw-------  1 root root  5997 Dec  1 20:58 authorized_keys.org
-rw-------  1 root root  1675 Sep 25 07:57 id_rsa
-rw-------  1 root root   391 Sep 25 07:57 id_rsa.pub
-rw-------  1 root root 83502 Dec  1 20:40 known_hosts

and
Code:
root@fbc1 ~ # ls -l /etc/pve/priv/authorized_keys
-r-------- 1 root www-data 1176 Dec  1 21:14 /etc/pve/priv/authorized_keys

My question: are those permissions normal when a cluster is not operational?


If not, then maybe I should reinstall Proxmox 2.0 onto fbc1.
 
Per http://pve.proxmox.com/wiki/Proxmox_VE_2.0_Cluster#Remove_a_cluster_node:
Code:
del node :
root@fbc1 /bkup/fbc1-pvebkup/dump # pvecm delnode  fbc186
cluster not ready - no quorum?

root@fbc1 /bkup/fbc1-pvebkup/dump # pvecm nodes
Node  Sts   Inc   Joined               Name
   1   M      4   2011-12-01 21:01:24  fbc1
   2   X     36                        fbc186

Node fbc186 can't be deleted?

If that cannot be solved, I assume it has something to do with fbc1, and that fbc1 needs a fresh reinstall before it can be put into the cluster?
 
You have no quorum, so the Proxmox cluster file system is read-only. Set the expected votes to 1 and try again.

> pvecm e 1
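
(Hedged side note: while a node is inquorate, /etc/pve is mounted read-only, so a quick way to confirm is that any write attempt there fails; the file name below is just an arbitrary example, and the exact error text may differ:)
Code:
root@fbc1 ~ # touch /etc/pve/write-test    # expected to fail without quorum
root@fbc1 ~ # pvecm status | grep -i quorum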
 
That worked:
Code:
root@fbc1 ~ # pvecm e 1
root@fbc1 ~ # pvecm delnode  fbc186

root@fbc1 ~ # pvecm nodes
Node  Sts   Inc   Joined               Name
   1   M      4   2011-12-01 21:01:24  fbc1

fbc186 is reinstalled; I'll try to add it to the cluster.
Code:
root@fbc186:~# pvecm add 10.100.100.1
Generating public/private rsa key pair.
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
1f:34:2e:ee:f5:fb:98:ba:81:59:fb:09:13:4e:57:76 root@fbc186
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|                 |
|          o   o E|
|         o . o . |
|        S * .    |
|       . O =     |
|        + O      |
|       . . * +   |
|        . ooBo.  |
+-----------------+
The authenticity of host '10.100.100.1 (10.100.100.1)' can't be established.
RSA key fingerprint is c3:60:2f:ca:b3:aa:eb:21:24:52:72:02:a7:b0:3c:a8.
Are you sure you want to continue connecting (yes/no)? yes
root@10.100.100.1's password: 
copy corosync auth key
stopping pve-cluster service
Stopping pve cluster filesystem: pve-cluster.
backup old database
Starting pve cluster filesystem : pve-cluster.
Starting cluster: 
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Waiting for quorum...

It is stuck at 'Waiting for quorum...'

Then after a minute:
Code:
... Timed-out waiting for cluster
[FAILED]
cluster not ready - no quorum?
 
No firewall.

Is there a netstat or telnet command I can use to test ports? SSH works between the systems.
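
(For reference, something like this, assuming corosync's default UDP ports 5404/5405, should at least show whether corosync has its sockets open on each node; plain telnet will not help here, since the cluster traffic is multicast UDP:)
Code:
root@fbc1 ~ # netstat -ulpn | grep corosync
root@fbc186 ~ # netstat -ulpn | grep corosync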
 
Maybe you enabled network-manager when you installed GNOME? What is the output of:

# dpkg -l network-manager

Please remove it if it is installed.
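
(If dpkg lists it with status "ii" (installed), it can be removed with something along the lines of:)
Code:
# apt-get remove --purge network-manager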
 
Code:
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                        Version                     Description
+++-===========================-===========================-======================================================================
un  network-manager             <none>                      (no description available)
 
I just saw this in one of your posts: "ssh: Could not resolve hostname".

Did you change the hostname somehow? Why is that not resolvable (it should be in /etc/hosts)?
 
fbc186 is a newly installed system. The only thing I did besides trying to add it to the cluster was an apt update.

Here is its hosts file:
Code:
root@fbc186:/etc# cat hosts
127.0.0.1 localhost.localdomain localhost
10.100.100.186 fbc186.fantinibakery.com fbc186 pvelocalhost

# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

The command to join a cluster uses an IP address.

I think the basis of my problems could be installing Squeeze, then GNOME, then Proxmox 2.0 on fbc1.

That makes for complications.

I have a two-system cluster at home that has worked well since the 1st beta; to improve it and do more tests, I just need to add a third system.

So I'll do a fresh install on fbc1 and fbc186, then set up a 3-node cluster (those three are at the factory).
 
fbc1 has a large hosts file; I put our whole local network in each machine's hosts file.

Its hosts file is the same as what works on our two Proxmox 1.9 production systems.
 
fbc1 has a large hosts file; I put our whole local network in each machine's hosts file.

Its hosts file is the same as what works on our two Proxmox 1.9 production systems.

PVE 2.0 has different requirements than 1.9: the hostname from /etc/hostname must have an entry in /etc/hosts.

So what is the output of:

# grep fbc1 /etc/hosts
 
PVE 2.0 has different requirements than 1.9: the hostname from /etc/hostname must have an entry in /etc/hosts.

So what is the output of:

# grep fbc1 /etc/hosts



10.100.100.1 fbc1.fantinibakery.com fbc1 # backup server.


The issue probably has something to do with the managed switches and multicast, as noted in the other thread.
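
(One way to verify, assuming the ssmping package is available: run the listener on one node and probe it from the other; the multicast group address below is just an example. If only the unicast replies come back, the switch is most likely dropping multicast:)
Code:
root@fbc1 ~ # apt-get install ssmping
root@fbc1 ~ # ssmpingd
root@fbc186 ~ # asmping 224.0.2.1 10.100.100.1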
 