Remove machine from cluster

Thread starter: alex88 (Guest)
Hello,

I've got 2 hosts, host1 and host2.

On host1 I created a new cluster. Trying to add host2 fails with "no quorum", probably because multicast isn't enabled at OVH, so I've set up a virtual switch between the machines.

Now, on host2, after the failed cluster config, how can I delete host1 and add it again with the local IP? Or just change the IP?
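
What fixed it in the end, further down in the thread, was /etc/hosts, but for reference the drop-and-rejoin route with pvecm would look roughly like this (host names are the ones from this thread; verify the exact steps for your PVE version):

```shell
# On the node that stays in the cluster (host1): remove the failed node.
pvecm delnode host2

# On host2, after pointing /etc/hosts at the private vSwitch addresses,
# join again using host1's private IP:
pvecm add 10.8.0.1
```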

Best Regards
 
I've tried with a virtual switch between the hosts and added the second node; now, using pvecm nodes, I can see that both hosts have 2 nodes set correctly, but running /etc/init.d/cman start gives:

Code:
Starting cluster:
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Waiting for quorum... Timed-out waiting for cluster
[FAILED]

This happens on both nodes. Is there a way I can get this working? And how can I test whether multicast works over the virtual switch?
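
For anyone retracing this, the quorum state itself can be inspected while cman is wedged (treat the expected-votes override as a temporary debugging aid only):

```shell
pvecm status     # overall cluster/quorum state as cman sees it
pvecm nodes      # current membership list
# Temporary single-node workaround while the second node can't join
# (assumption: your pvecm version supports this subcommand):
pvecm expected 1
```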

Also, are multicast queries done via some kind of DNS resolution, or just by IP? I'm asking because I'm trying to add the second node to the cluster using the IP of the first machine.

Host1: 10.8.0.1
Host2: 10.8.0.2
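
For later readers: there is no special DNS mechanism on the multicast side; cman simply resolves the node names from cluster.conf through normal hostname lookup, /etc/hosts first. A self-contained sketch (sample file and addresses are made up) of why a public entry listed first breaks this:

```shell
# Build a sample hosts file (hypothetical contents mirroring this setup).
cat > /tmp/hosts.sample <<'EOF'
203.0.113.10  host1   # public OVH address - wrong one for cluster traffic
10.8.0.1      host1   # private vSwitch address - the one cman should use
10.8.0.2      host2
EOF

# Resolution takes the first matching entry, so "host1" still maps to
# the public address even though the private entry exists:
awk '$2 == "host1" {print $1; exit}' /tmp/hosts.sample   # -> 203.0.113.10
```

That is exactly the situation fixed later in the thread by making the private address the one /etc/hosts hands out.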
 
Last edited by a moderator:
The node names in cluster.conf should match the hostnames / IP addresses used in /etc/hosts.

You can also check the multicast address used with "netstat -g" and there you should see it is bound to the vmbr0 interface:
Code:
# netstat -g
<snip>
vmbr0         1      239.192.3.55

To see if there is multicast communication, it may help to do a tcpdump on the vmbr0 interface:
Code:
tcpdump -i vmbr0 'host 239.192.3.55'
<snip>
19:19:23.216213 IP node1.5404 > 239.192.3.55.5405: UDP, length 119
19:19:25.120313 IP node1.5404 > 239.192.3.55.5405: UDP, length 119
19:19:27.023581 IP node1.5404 > 239.192.3.55.5405: UDP, length 119
19:19:27.280122 IP node2.5404 > 239.192.3.55.5405: UDP, length 75
19:19:27.280458 IP node2.5404 > 239.192.3.55.5405: UDP, length 1473
19:19:27.280485 IP node2.5404 > 239.192.3.55.5405: UDP, length 1473
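
As a side note for anyone debugging this today: if the omping package is available, it gives a more direct multicast test than tcpdump (run the same command on both nodes at once, using the addresses from this thread):

```shell
# Run simultaneously on host1 and host2; each instance sends to the
# other and reports unicast and multicast loss separately.
omping 10.8.0.1 10.8.0.2
# Multicast over the vSwitch is fine when the "multicast" lines
# report 0% loss.
```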
 
Ohh great, thank you very much! :)

The local system in /etc/hosts was listed with the public address; I've changed it to the private one and now it works perfectly. :)

Well, I can see the 2 nodes in the datacenter, but under HA there are no entries in clusternodes. :/ Any idea why?

Regards and thank you for helping
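
For later readers, the entries in question are plain cman <clusternode> elements in /etc/pve/cluster.conf; an illustrative (not copy-paste) fragment with placeholder names might look like:

```xml
<?xml version="1.0"?>
<cluster name="mycluster" config_version="2">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <clusternodes>
    <clusternode name="host1" votes="1" nodeid="1"/>
    <clusternode name="host2" votes="1" nodeid="2"/>
  </clusternodes>
</cluster>
```

Remember to bump config_version whenever you edit the file.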

EDIT: I've added the clusternodes in /etc/pve/cluster.conf and now it works. Next I tried live migration of a VM with storage on an NFS share, from host1 to host2, but this is the output:

Code:
Jan 12 23:16:13 starting migration of CT 116 to node 'ks27489' (10.8.0.2)
Jan 12 23:16:13 container is running - using online migration
Jan 12 23:16:13 container data is on shared storage 'HA_storage'
Jan 12 23:16:13 start live migration - suspending container
Jan 12 23:16:13 dump container state
Jan 12 23:16:13 dump 2nd level quota
Jan 12 23:16:13 initialize container on remote node 'ks27489'
Jan 12 23:16:13 initializing remote quota
Jan 12 23:16:14 # /usr/bin/ssh -c blowfish -o 'BatchMode=yes' root@10.8.0.2 vzctl quotainit 116
Jan 12 23:16:14 vzquota : (error) Quota check : open 'init.log': Permission denied
Jan 12 23:16:14 ERROR: online migrate failure - Failed to initialize quota: vzquota init failed [1]
Jan 12 23:16:14 start final cleanup
Jan 12 23:16:14 ERROR: migration finished with problems (duration 00:00:02)
TASK ERROR: migration problems


Is live migration supposed to work?

Also, when I try to add a CT to the host it asks me to log in again, and when I try to create one directly on host2 it says:

permission denied - invalid ticket (401)

Any idea?
 

Is live migration supposed to work?


Yes. What kind of container is that exactly (how can I reproduce that bug)?


Also, when i try to add a ct to host it asks me again to login, and when i try to create directly to host2 it says:

permission denied - invalid ticket (401)

Seems there is something wrong with the certificates - try (on both nodes):


# pvecm updatecerts --force

I guess you need to restart pvedaemon and apache2 (or simply reboot).
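
Spelled out as a sequence (init-script paths from the PVE 2.x era; adjust to your version), that would be, on each node:

```shell
pvecm updatecerts --force       # regenerate/redistribute the node certificates
/etc/init.d/pvedaemon restart   # make pvedaemon pick up the new certs
/etc/init.d/apache2 restart     # same for the web interface
```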
 

Thanks, that worked!



Yes. What kind of container is that exactly (how can I reproduce that bug)?



So, on host1 I create a machine with its storage set to an NFS share, using the ubuntu 11.04 template from the OpenVZ wiki. This is the output:

Code:
Creating container private area (ubuntu-11.04-x86_64.tar.gz)
Performing postcreate actions
/bin/cp: preserving permissions for `/var/lib/vz/root/116/etc/crontab.3000': Operation not supported
Saved parameters for CT 116
Container private area was created
TASK OK

The init.log files that are causing the error below are:

---s--S--T+ 1 root root 0 Jan 13 09:01 /var/lib/vz/root/116/var/log/init.log
---s--S--T+ 1 root root 0 Jan 13 09:01 /mnt/pve/HA_storage/private/116/var/log/init.log
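
That trailing "+" means the files carry a POSIX ACL on top of the plain mode bits, which matches the "Permission denied" vzquota reports here. A self-contained demo (local throwaway file, hypothetical path) of spotting and stripping an ACL:

```shell
touch /tmp/acl-demo
setfacl -m u:nobody:r /tmp/acl-demo   # add an ACL entry for user "nobody"
ls -l /tmp/acl-demo | cut -c11        # 11th character is "+" when an ACL is present
getfacl --omit-header /tmp/acl-demo   # lists the extra user: entry
setfacl -b /tmp/acl-demo              # strip all ACL entries again
ls -l /tmp/acl-demo | cut -c11        # the "+" is gone
```

In this thread the ACL presumably comes in via the NFS server; nfs(5) documents a noacl mount option for NFSv3 if you want to keep ACLs off the share entirely.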

Now I try to live migrate from host1 to host2 and the output is:

Code:
Jan 13 09:05:17 starting migration of CT 116 to node 'ks27489' (10.8.0.2)
Jan 13 09:05:17 container is running - using online migration
Jan 13 09:05:17 container data is on shared storage 'HA_storage'
Jan 13 09:05:17 start live migration - suspending container
Jan 13 09:05:17 dump container state
Jan 13 09:05:17 dump 2nd level quota
Jan 13 09:05:18 initialize container on remote node 'ks27489'
Jan 13 09:05:18 initializing remote quota
Jan 13 09:05:18 # /usr/bin/ssh -c blowfish -o 'BatchMode=yes' root@10.8.0.2 vzctl quotainit 116
Jan 13 09:05:18 vzquota : (error) Quota check : open 'init.log': Permission denied
Jan 13 09:05:18 ERROR: online migrate failure - Failed to initialize quota: vzquota init failed [1]
Jan 13 09:05:18 start final cleanup
Jan 13 09:05:18 ERROR: migration finished with problems (duration 00:00:01)
TASK ERROR: migration problems

The only difference I can see is that the init.log permissions of CT 116 are followed by a "+"; the init.log files of the other machines aren't.

The result is that VM 116 is moved to the other node, it's down, and trying to start it gives me:

Code:
Starting container ...
Initializing quota ...
vzquota : (error) Quota check : open 'init.log': Permission denied
vzquota init failed [1]
TASK ERROR: command 'vzctl start 116' failed: exit code 61

PS: This is only with live migration; offline migration works fine.

Another test: I've tried to migrate a machine from host1 to host2 using live migration, with its storage on the local disk of the node rather than the shared one:

Code:
Jan 13 09:09:50 starting migration of CT 117 to node 'ks27489' (10.8.0.2)
Jan 13 09:09:50 container is running - using online migration
Jan 13 09:09:50 starting rsync phase 1
Jan 13 09:09:50 # /usr/bin/rsync -aH --delete --numeric-ids --sparse /var/lib/vz/private/117 root@10.8.0.2:/var/lib/vz/private
Jan 13 09:09:58 start live migration - suspending container
Jan 13 09:09:58 dump container state
Jan 13 09:09:58 copy dump file to target node
Jan 13 09:09:59 starting rsync (2nd pass)
Jan 13 09:09:59 # /usr/bin/rsync -aH --delete --numeric-ids /var/lib/vz/private/117 root@10.8.0.2:/var/lib/vz/private
Jan 13 09:09:59 dump 2nd level quota
Jan 13 09:09:59 copy 2nd level quota to target node
Jan 13 09:09:59 initialize container on remote node 'ks27489'
Jan 13 09:10:00 initializing remote quota
Jan 13 09:10:00 turn on remote quota
Jan 13 09:10:00 load 2nd level quota
Jan 13 09:10:00 starting container on remote node 'ks27489'
Jan 13 09:10:00 restore container state


That seemed to work, but the migration task is still running. :/

Then I decided to click on "stop" on the live migration; output:

Code:
Jan 13 09:09:50 starting migration of CT 117 to node 'ks27489' (10.8.0.2)
Jan 13 09:09:50 container is running - using online migration
Jan 13 09:09:50 starting rsync phase 1
Jan 13 09:09:50 # /usr/bin/rsync -aH --delete --numeric-ids --sparse /var/lib/vz/private/117 root@10.8.0.2:/var/lib/vz/private
Jan 13 09:09:58 start live migration - suspending container
Jan 13 09:09:58 dump container state
Jan 13 09:09:58 copy dump file to target node
Jan 13 09:09:59 starting rsync (2nd pass)
Jan 13 09:09:59 # /usr/bin/rsync -aH --delete --numeric-ids /var/lib/vz/private/117 root@10.8.0.2:/var/lib/vz/private
Jan 13 09:09:59 dump 2nd level quota
Jan 13 09:09:59 copy 2nd level quota to target node
Jan 13 09:09:59 initialize container on remote node 'ks27489'
Jan 13 09:10:00 initializing remote quota
Jan 13 09:10:00 turn on remote quota
Jan 13 09:10:00 load 2nd level quota
Jan 13 09:10:00 starting container on remote node 'ks27489'
Jan 13 09:10:00 restore container state
Jan 13 09:32:26 # /usr/bin/ssh -c blowfish -o 'BatchMode=yes' root@10.8.0.2 vzctl restore 117 --undump --dumpfile /var/lib/vz/dump/dump.117 --skip_arpdetect
Jan 13 09:10:00 Restoring container ...
Jan 13 09:10:00 Starting container ...
Jan 13 09:10:00 Container is mounted
Jan 13 09:10:00     undump...
Jan 13 09:10:00 Setting CPU units: 1000
Jan 13 09:10:00 Setting CPUs: 1
Jan 13 09:10:00 Configure veth devices: veth117.0
Jan 13 09:10:00 Adding interface veth117.0 to bridge vmbr1 on CT0 for CT117
Jan 13 09:32:26 vzquota : (warning) Quota is running for id 117 already
Jan 13 09:32:26 ERROR: online migrate failure - Failed to restore container: interrupted by signal
Jan 13 09:32:26 removing container files on local node
Jan 13 09:32:27 start final cleanup
Jan 13 09:32:27 ERROR: migration finished with problems (duration 00:22:38)
TASK ERROR: migration problems


I don't get that mix of timestamps. Anyway, the VM is running on the other node now, but if I try to console into it I get a black screen and after 2 seconds "Network error: remote side closed connection".

Trying to stop it gives:
Code:
Container already locked
TASK ERROR: command 'vzctl stop 117' failed: exit code 9
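
If the container is left locked after such an aborted migration, the usual cleanup is to remove the stale lock file by hand. Everything here is an assumption to verify on your own install: the lock path below is the common OpenVZ/Proxmox location, and you must first make sure no vzctl or migration process is still touching the CT:

```shell
ps aux | grep '[v]zctl'            # confirm nothing is still working on CT 117
rm -f /var/lib/vz/lock/117.lck     # assumed lock path - check before deleting
vzctl stop 117                     # should no longer report "already locked"
```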

Regards
 
Sorry for double posting, but I ran out of characters in the other post. :)

Btw, offline migration works fine:

Code:
Jan 13 09:51:25 starting migration of CT 116 to node 'ks27489' (10.8.0.2)
Jan 13 09:51:25 starting rsync phase 1
Jan 13 09:51:25 # /usr/bin/rsync -aH --delete --numeric-ids --sparse /var/lib/vz/private/116 root@10.8.0.2:/var/lib/vz/private
Jan 13 09:51:33 dump 2nd level quota
Jan 13 09:51:33 copy 2nd level quota to target node
Jan 13 09:51:33 initialize container on remote node 'ks27489'
Jan 13 09:51:33 initializing remote quota
Jan 13 09:51:33 turn on remote quota
Jan 13 09:51:33 load 2nd level quota
Jan 13 09:51:34 turn off remote quota
Jan 13 09:51:34 removing container files on local node
Jan 13 09:51:34 start final cleanup
Jan 13 09:51:34 migration finished successfuly (duration 00:00:10)
TASK OK
 
What kind of nfs server is that (see /proc/mounts) - What nfs version (vers=4 or vers=3)?

vers=3

Could that be the problem?

Btw, as I've written before, live migration using the local HDD of the nodes also fails: it starts but then remains stuck.
 
Are you sure you mount it as vers=3?

Code:
cat /proc/mounts

and post your /etc/pve/storage.cfg

Code:
cat /etc/pve/storage.cfg
 
cat /proc/mounts | grep /mnt/pve/

Code:
10.16.101.8:/nas-000024/mininas-000953/ /mnt/pve/HA_storage nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.16.101.8,mountvers=3,mountport=32792,mountproto=udp,local_lock=none,addr=10.16.101.8 0 0

cat /etc/pve/storage.cfg
Code:
dir: local
        path /var/lib/vz
        content images,iso,vztmpl,backup,rootdir


nfs: HA_storage
        path /mnt/pve/HA_storage
        server 10.16.101.8
        export /nas-000024/mininas-000953/
        options vers=3
        content images,iso,vztmpl,backup,rootdir

I've added the NFS storage via the Proxmox interface.
 
Ok, this looks fine as far as I can see.