Cannot re-add upgraded node to cluster

komdat

New Member
Jun 6, 2011
Hello,

I've upgraded one node of the cluster (deleted the node from the cluster, then did a fresh install of 3.0), and now I'm unable to add it to the cluster again. The web GUI also broke! In detail:

- I did a pvecm delnode
- I rebooted the node and made a fresh installation of pve 3.0
- When trying to do pvecm add, I get the error message "unable to copy ssh ID" and it aborts
- When trying to edit the authorized_keys file on a cluster node with vi, it says: ".ssh/authorized_keys" E166: Can't open linked file for writing
- An lsof .ssh/authorized_keys doesn't give any output
- The machine isn't visible in the web GUI; pvecm nodes shows it as "4 X 536 proxmox-20"
- The /etc/pve/nodes/proxmox-20/ directory is still present, and I'm unable to delete it. lsof doesn't show anything there either.

So I'm not able to add the node. But worse, the remaining nodes are very slow to respond (pvecm nodes takes ~20 sec to print its output) and are all marked red in the web GUI. So I can't see any details on the nodes, and backups also fail with errors: "ERROR: Backup of VM 120 failed - command 'qm set 120 --lock backup' failed: exit code 2"
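In hindsight, these symptoms (the E166 error from vi, the undeletable /etc/pve/nodes/proxmox-20/ directory, the failing 'qm set ... --lock backup') are all consistent with /etc/pve, i.e. pmxcfs, having gone read-only, which happens when a node loses quorum. A quick probe, sketched here as a generic helper (the function name and parameter are illustrative, not from the thread):

```shell
# Sketch: check whether a directory is writable by creating and removing
# a scratch file. On a cluster node you would call: check_writable /etc/pve
# pmxcfs makes /etc/pve read-only while the node has no quorum, which
# would explain both the E166 error and the failing backup lock.
check_writable() {
  local dir="$1"
  if touch "${dir}/.writetest" 2>/dev/null; then
    rm -f "${dir}/.writetest"
    echo writable
  else
    echo read-only
  fi
}
```

If this prints "read-only" for /etc/pve, the join failure and the backup errors share the same root cause.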

Please, could you help me get things running again?

Best regards
Norbert
 
join the re-installed node using the -force option. see 'man pvecm'
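For reference, on PVE 3.x the join is run on the freshly installed node against any existing cluster member. The commands below are a sketch (the IP is a placeholder, and the helper function is purely illustrative):

```shell
# On the freshly installed node, you would run something like:
#   pvecm add <cluster-member-ip> -force
# -force is needed because the old node name (proxmox-20) still exists
# in the cluster configuration; see 'man pvecm'.
# Afterwards, verify membership from any node with:
#   pvecm status
#   pvecm nodes

# Hypothetical helper that just assembles the join command line:
build_join_cmd() {
  local cluster_ip="$1"
  echo "pvecm add ${cluster_ip} -force"
}
```

For example, `build_join_cmd 192.168.200.66` prints the command to run against a member at that (placeholder) address.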
 
Hello Tom,

thanks for the fast reply, but this doesn't solve the problem. It still says "unable to copy ssh ID".

Here is the output of pvecm status:

Version: 6.2.0
Config Version: 39
Cluster Name: Balanstrasse
Cluster Id: 15487
Cluster Member: Yes
Cluster Generation: 556
Membership state: Cluster-Member
Nodes: 15
Expected votes: 15
Total votes: 15
Node votes: 1
Quorum: 9
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: proxmox-21
Node ID: 2
Multicast addresses: x.x.x.x
Node addresses: x.x.x.x

Any idea what else I could do?

Best regards
Norbert
 
is the cluster operational, all nodes on the same version?
 
Well, all nodes except one are on 2.3, and one node has already been upgraded to 3.0. That worked perfectly until I tried to upgrade the second one. The nodes in the web GUI are marked red, so I guess the cluster has a problem, but I can't figure out what.

A "ps aux | grep pve" on the node I'm trying to add the machine to shows:

4192 ? Ss 0:53 pvedaemon worker
4222 ? S 238:11 pvestatd
5853 ? S 0:08 pvedaemon worker
5857 ? S 0:02 pvedaemon worker
5860 ? S 0:50 pvedaemon worker

The VMs are up and running, but the backups fail.

Best regards
Norbert
 
before joining nodes, make sure the cluster is fully operational. dig deeper.
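A basic quorum sanity check can be scripted from the `pvecm status` output quoted above; the helper below is a sketch that parses the "Total votes" and "Quorum" fields as they appear in the 2.x/3.x output (the function name is made up for illustration, and the Quorum line may carry extra text such as "Activity blocked" on a degraded cluster):

```shell
# Sketch: decide whether the cluster is quorate by comparing the
# "Total votes" and "Quorum" fields of `pvecm status` output.
# On a node you would feed it: is_quorate "$(pvecm status)"
is_quorate() {
  local status="$1" quorum votes
  quorum=$(printf '%s\n' "$status" | awk -F': *' '/^Quorum:/ {print $2}')
  votes=$(printf '%s\n' "$status" | awk -F': *' '/^Total votes:/ {print $2}')
  [ -n "$quorum" ] && [ -n "$votes" ] && [ "$votes" -ge "$quorum" ]
}
```

With the values from the status output above (Total votes: 15, Quorum: 9) this reports quorate, which points the investigation toward cluster communication rather than a lost quorum.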
 
Well, I have loads of those lines in syslog:

May 29 13:05:54 proxmox-21 pmxcfs[366472]: [status] crit: cpg_send_message failed: 9
May 29 13:05:54 proxmox-21 pmxcfs[366472]: [status] crit: cpg_send_message failed: 9
May 29 13:05:54 proxmox-21 pmxcfs[366472]: [status] crit: cpg_send_message failed: 9
May 29 13:05:54 proxmox-21 pmxcfs[366472]: [status] crit: cpg_send_message failed: 9
May 29 13:05:54 proxmox-21 pmxcfs[366472]: [status] crit: cpg_send_message failed: 9
May 29 13:05:54 proxmox-21 pmxcfs[366472]: [status] crit: cpg_send_message failed: 9
May 29 13:05:55 proxmox-21 dlm_controld[3975]: daemon cpg_leave error retrying
May 29 13:05:55 proxmox-21 pmxcfs[366472]: [dcdb] notice: cpg_join retry 6030
May 29 13:05:56 proxmox-21 pmxcfs[366472]: [dcdb] notice: cpg_join retry 6040
May 29 13:05:57 proxmox-21 pmxcfs[366472]: [dcdb] notice: cpg_join retry 6050
May 29 13:05:58 proxmox-21 pmxcfs[366472]: [dcdb] notice: cpg_join retry 6060
May 29 13:05:59 proxmox-21 pmxcfs[366472]: [dcdb] notice: cpg_join retry 6070

There's nothing about quorum in the logs. I'm not really into corosync and even Google can't help me :-(

I also can't stop cman; if I try, it fails after ~30 sec:

Stopping cluster:
Stopping dlm_controld...
[FAILED]

Maybe you could push me in the right direction, I don't know what to do any more.


Best regards
Norbert
 
looks like a problem with the cluster communication (IP multicast) - check that your switches don't block anything here. do you have a separate network for the cluster communication?
 
I've tested multicasting from several different nodes, everything looks good to me. It's always like that:

# asmping 224.0.2.1 192.168.200.71
asmping joined (S,G) = (*,224.0.2.234)
pinging 192.168.200.71 from 192.168.200.66
unicast from 192.168.200.71, seq=1 dist=0 time=0.220 ms
multicast from 192.168.200.71, seq=1 dist=0 time=0.253 ms
unicast from 192.168.200.71, seq=2 dist=0 time=0.236 ms
multicast from 192.168.200.71, seq=2 dist=0 time=0.250 ms
unicast from 192.168.200.71, seq=3 dist=0 time=0.262 ms
multicast from 192.168.200.71, seq=3 dist=0 time=0.278 ms

Also, we didn't change anything on the network, so this shouldn't be the problem.


Best regards
Norbert
 
