Cluster recreation fails after upgrading from Proxmox 3.x to 4.0

HBO · Active Member · Germany
We upgraded from the latest 3.x to 4.x without any problems.

But when recreating the cluster we hit a quorum problem (there were no problems on 3.x before):
Code:
 pvecm add xx.xx.xx.xx -force
cluster not ready - no quorum?

pvecm status on the first node:
Code:
Quorum information
------------------
Date:             Fri Oct 30 10:16:50 2015
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          71312
Quorate:          No


Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      1
Quorum:           2 Activity blocked
Flags:


Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.0.10.10 (local)
Output of /var/log/syslog on the first node:
Code:
Oct 30 10:12:29 proxmox pmxcfs[8281]: [status] crit: cpg_dispatch failed: 2
Oct 30 10:12:29 proxmox pmxcfs[8281]: [status] crit: cpg_leave failed: 2
Oct 30 10:12:29 proxmox pmxcfs[8281]: [quorum] crit: quorum_dispatch failed: 2
Oct 30 10:12:29 proxmox pmxcfs[8281]: [quorum] crit: quorum_initialize failed: 2
Oct 30 10:12:29 proxmox pmxcfs[8281]: [quorum] crit: can't initialize service
Oct 30 10:12:29 proxmox pmxcfs[8281]: [confdb] crit: cmap_initialize failed: 2
Oct 30 10:12:29 proxmox pmxcfs[8281]: [confdb] crit: can't initialize service
Oct 30 10:12:29 proxmox pmxcfs[8281]: [dcdb] notice: start cluster connection
Oct 30 10:12:29 proxmox pmxcfs[8281]: [dcdb] crit: cpg_initialize failed: 2
Oct 30 10:12:29 proxmox pmxcfs[8281]: [dcdb] crit: can't initialize service
Oct 30 10:12:29 proxmox pmxcfs[8281]: [status] notice: start cluster connection
Oct 30 10:12:29 proxmox pmxcfs[8281]: [status] crit: cpg_initialize failed: 2
Oct 30 10:12:29 proxmox pmxcfs[8281]: [status] crit: can't initialize service
Oct 30 10:12:30 proxmox corosync[8314]: Waiting for corosync services to unload:.[  OK  ]
Oct 30 10:12:30 proxmox corosync[8331]: [MAIN  ] Corosync Cluster Engine ('2.3.5'): started and ready to provide service.
Oct 30 10:12:30 proxmox corosync[8331]: [MAIN  ] Corosync built-in features: augeas systemd pie relro bindnow
Oct 30 10:12:30 proxmox corosync[8332]: [TOTEM ] Initializing transport (UDP/IP Multicast).
Oct 30 10:12:30 proxmox corosync[8332]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Oct 30 10:12:30 proxmox corosync[8332]: [TOTEM ] The network interface [10.0.10.10] is now up.
Oct 30 10:12:30 proxmox corosync[8332]: [SERV  ] Service engine loaded: corosync configuration map access [0]
Oct 30 10:12:30 proxmox corosync[8332]: [QB    ] server name: cmap
Oct 30 10:12:30 proxmox corosync[8332]: [SERV  ] Service engine loaded: corosync configuration service [1]
Oct 30 10:12:30 proxmox corosync[8332]: [QB    ] server name: cfg
Oct 30 10:12:30 proxmox corosync[8332]: [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Oct 30 10:12:30 proxmox corosync[8332]: [QB    ] server name: cpg
Oct 30 10:12:30 proxmox corosync[8332]: [SERV  ] Service engine loaded: corosync profile loading service [4]
Oct 30 10:12:30 proxmox corosync[8332]: [QUORUM] Using quorum provider corosync_votequorum
Oct 30 10:12:30 proxmox corosync[8332]: [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Oct 30 10:12:30 proxmox corosync[8332]: [QB    ] server name: votequorum
Oct 30 10:12:30 proxmox corosync[8332]: [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Oct 30 10:12:30 proxmox corosync[8332]: [QB    ] server name: quorum
Oct 30 10:12:30 proxmox corosync[8332]: [TOTEM ] A new membership (10.0.10.10:71312) was formed. Members joined: 1
Oct 30 10:12:30 proxmox corosync[8332]: [QUORUM] Members[1]: 1
Oct 30 10:12:30 proxmox corosync[8332]: [MAIN  ] Completed service synchronization, ready to provide service.
Oct 30 10:12:30 proxmox corosync[8325]: Starting Corosync Cluster Engine (corosync): [  OK  ]
Oct 30 10:12:30 proxmox pmxcfs[8281]: [status] crit: cpg_send_message failed: 9
Oct 30 10:12:30 proxmox pmxcfs[8281]: [status] crit: cpg_send_message failed: 9
Oct 30 10:12:30 proxmox pmxcfs[8281]: [status] crit: cpg_send_message failed: 9
Oct 30 10:12:30 proxmox pmxcfs[8281]: [status] crit: cpg_send_message failed: 9
Oct 30 10:12:35 proxmox pmxcfs[8281]: [status] notice: update cluster info (cluster name  xyz, version = 2)
Oct 30 10:12:35 proxmox pmxcfs[8281]: [dcdb] notice: members: 1/8281
Oct 30 10:12:35 proxmox pmxcfs[8281]: [dcdb] notice: all data is up to date
Oct 30 10:12:35 proxmox pmxcfs[8281]: [status] notice: members: 1/8281
Oct 30 10:12:35 proxmox pmxcfs[8281]: [status] notice: all data is up to date
Oct 30 10:13:17 proxmox sshd[8464]: Connection closed by 10.0.10.12 [preauth]
Oct 30 10:13:17 proxmox sshd[8466]: Accepted publickey for root from 10.0.10.12 port 59194 ssh2: .....
Oct 30 10:13:17 proxmox sshd[8466]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct 30 10:13:17 proxmox systemd-logind[769]: New session 7 of user root.
Oct 30 10:13:17 proxmox sshd[8466]: Received disconnect from 10.0.10.12: 11: disconnected by user
Oct 30 10:13:17 proxmox sshd[8466]: pam_unix(sshd:session): session closed for user root
Oct 30 10:13:17 proxmox systemd-logind[769]: Removed session 7.
Oct 30 10:13:17 proxmox sshd[8470]: Accepted publickey for root from 10.0.10.12 port 59196 ssh2: .....
Oct 30 10:13:17 proxmox sshd[8470]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct 30 10:13:17 proxmox systemd-logind[769]: New session 8 of user root.
Oct 30 10:13:18 proxmox sshd[8470]: Received disconnect from 10.0.10.12: 11: disconnected by user
Oct 30 10:13:18 proxmox sshd[8470]: pam_unix(sshd:session): session closed for user root
Oct 30 10:13:18 proxmox systemd-logind[769]: Removed session 8.
Oct 30 10:13:22 proxmox pveproxy[7723]: ipcc_send_rec failed: Transport endpoint is not connected
Oct 30 10:13:22 proxmox pvedaemon[1524]: ipcc_send_rec failed: Transport endpoint is not connected
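(The ipcc_send_rec errors at the end mean pveproxy and pvedaemon cannot reach pmxcfs over its local socket; whether pve-cluster and corosync are actually running can be checked like this, as a quick sanity check:)
Code:
systemctl status pve-cluster corosync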

omping works:
Code:
omping 10.0.10.10 10.0.10.12
10.0.10.10 : waiting for response msg
10.0.10.10 : joined (S,G) = (*, 232.43.211.234), pinging
10.0.10.10 :   unicast, seq=1, size=69 bytes, dist=0, time=0.451ms
10.0.10.10 : multicast, seq=1, size=69 bytes, dist=0, time=0.472ms
10.0.10.10 :   unicast, seq=2, size=69 bytes, dist=0, time=0.991ms
10.0.10.10 : multicast, seq=2, size=69 bytes, dist=0, time=1.014ms
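(Note: omping has to run on both nodes at the same time. A longer run — a sketch using standard omping options — also shows whether multicast keeps working past the IGMP querier timeout, which is a common failure mode:)
Code:
# run simultaneously on both nodes, ~10 minutes at 1 packet/s
omping -c 600 -i 1 -q 10.0.10.10 10.0.10.12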

More info: while restarting corosync.service we captured IGMP traffic with 'tcpdump -i eth1 igmp -n'; the output is identical on the first and second node:
Code:
11:30:28.119245 IP 10.0.10.10 > 224.0.0.22: igmp v3 report, 1 group record(s)
11:30:28.763246 IP 10.0.10.10 > 224.0.0.22: igmp v3 report, 1 group record(s)
11:30:29.235214 IP 10.0.10.10 > 224.0.0.22: igmp v3 report, 1 group record(s)
11:30:29.475224 IP 10.0.10.10 > 224.0.0.22: igmp v3 report, 1 group record(s)
11:30:31.449922 IP 0.0.0.0 > 224.0.0.1: igmp query v2
11:32:36.449939 IP 0.0.0.0 > 224.0.0.1: igmp query v2
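(The periodic "igmp query v2" packets with source 0.0.0.0 are typical of the Linux bridge's built-in IGMP querier. Whether IGMP snooping is active on the bridge can be checked like this — a sketch, assuming the cluster network runs over vmbr0:)
Code:
# 1 = snooping enabled (may filter multicast), 0 = disabled
cat /sys/class/net/vmbr0/bridge/multicast_snooping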

Can someone give us a hint on how to recreate the cluster?
Is it possible to reset the cluster? We have a single server without any VMs on it, used only for login.
 
Is it possible to reset the cluster? We have a single server without any VMs on it, used only for login.

You can temporarily set expected votes to 1, like:

# pvecm expected 1

then you can modify/update /etc/pve/corosync.conf
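A sketch of the usual sequence (adapt to your setup): the 'config_version' in the totem { } section must be incremented so the change is picked up. As an example edit, corosync 2.x also supports 'two_node: 1' in the quorum { } section, which lets a two-node cluster stay quorate with a single vote:

Code:
# pvecm expected 1             # temporarily lower the expected votes
# nano /etc/pve/corosync.conf  # e.g. add 'two_node: 1' to quorum { };
                               # increment 'config_version' in totem { }
# systemctl restart corosync   # pick up the new configuration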
 
Works fine, thanks.

But on our last node we get the following error:
Code:
can't create shared ssh key database '/etc/pve/priv/authorized_keys'
node mc2-node6 already defined
copy corosync auth key
stopping pve-cluster service
backup old database
Job for corosync.service failed. See 'systemctl status corosync.service' and 'journalctl -xn' for details.
waiting for quorum...
It has been sitting at "waiting for quorum..." with no timeout for the last 5 minutes.

systemctl status corosync.service
Code:
 systemd[1]: corosync.service: control process exited, code=exited status=1
 systemd[1]: Failed to start Corosync Cluster Engine.
 systemd[1]: Unit corosync.service entered failed state.

journalctl -xn
Code:
pveproxy[2593]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/HTTPServer.pm line 1631.

It's a new node with fresh debian 8 / proxmox 4.0 install.
 
Code:
can't create shared ssh key database '/etc/pve/priv/authorized_keys'
node mc2-node6 already defined

It seems node mc2-node6 already has an entry in '/etc/pve/priv/authorized_keys'. Does it help if you remove
that entry first?
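Something like this might do it — a sketch following the suggestion above; back the file up first (mc2-node6 is the node name from your log):
Code:
# cp /etc/pve/priv/authorized_keys /root/authorized_keys.bak
# sed -i '/mc2-node6/d' /etc/pve/priv/authorized_keys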
 
There is no "priv" folder in /etc/pve; /root/.ssh/authorized_keys points to "/etc/pve/priv/authorized_keys", but it is a dead link.
 
After another fresh install of Debian 8 and proxmox-ve, I have now added the last node to our cluster.

But there is no local access to the web interface; the cluster shows a "broken pipe".

Output of /var/log/syslog:
Code:
Nov  3 10:01:02 mc2-node6 pveproxy[1376]: problem with client 10.0.10.1; rsa_padding_check_pkcs1_type_1: block type is not 01
Nov  3 10:01:02 mc2-node6 pveproxy[1376]: Can't call method "timeout_reset" on an undefined value at /usr/share/perl5/PVE/HTTPServer.pm line 225.
Nov  3 10:01:05 mc2-node6 pveproxy[1376]: problem with client 10.0.10.10; rsa_padding_check_pkcs1_type_1: block type is not 01
Nov  3 10:01:05 mc2-node6 pveproxy[1376]: problem with client 10.0.10.10; rsa_padding_check_pkcs1_type_1: block type is not 01
Nov  3 10:01:05 mc2-node6 pveproxy[1376]: problem with client 10.0.10.10; rsa_padding_check_pkcs1_type_1: block type is not 01
Nov  3 10:01:05 mc2-node6 pveproxy[1378]: problem with client 10.0.10.10; rsa_padding_check_pkcs1_type_1: block type is not 01
Nov  3 10:01:06 mc2-node6 pveproxy[1378]: problem with client 10.0.10.10; rsa_padding_check_pkcs1_type_1: block type is not 01
Nov  3 10:01:06 mc2-node6 pveproxy[1376]: problem with client 10.0.10.10; rsa_padding_check_pkcs1_type_1: block type is not 01
Nov  3 10:01:07 mc2-node6 pveproxy[1376]: problem with client 10.0.10.10; rsa_padding_check_pkcs1_type_1: block type is not 01
Nov  3 10:01:07 mc2-node6 corosync[1388]:  [TOTEM ] A new membership (10.0.10.10:120240) was formed. Members
Nov  3 10:01:07 mc2-node6 corosync[1388]:  [QUORUM] Members[11]: 1 2 3 4 5 6 7 8 9 10 11
Nov  3 10:01:07 mc2-node6 corosync[1388]:  [MAIN  ] Completed service synchronization, ready to provide service.
Nov  3 10:01:08 mc2-node6 pveproxy[1376]: EV: error in callback (ignoring): Can't call method "push_write" on an undefined value at /usr/share/perl5/PVE/HTTPServer.pm line 295.

The last entry comes from a local access attempt on https://10.0.10.26:8006.
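(For these pve-ssl.key / rsa_padding_check errors, regenerating the node's certificates from the cluster CA is worth a try — a sketch, not verified for this exact case:)
Code:
# pvecm updatecerts --force    # regenerate this node's SSL certificates
# systemctl restart pveproxy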
 
Worked fine, thanks.

Is there a known bug that requires setting /sys/class/net/vmbr$/bridge/multicast_snooping to 0?
Without setting this value, corosync fails.
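(For reference, the setting can be made persistent across reboots in /etc/network/interfaces — a sketch assuming the cluster runs over vmbr0; the address and bridge lines are stand-ins for the real ones:)
Code:
auto vmbr0
iface vmbr0 inet static
        address 10.0.10.10
        netmask 255.255.255.0
        bridge_ports eth1
        bridge_stp off
        bridge_fd 0
        post-up echo 0 > /sys/class/net/vmbr0/bridge/multicast_snooping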
 
