Proxmox nodes not showing status, and votequorum reports "Quorum: 2 Activity blocked"

sai.dasari

New Member
Oct 23, 2020
We are running Proxmox with a 3-node cluster setup.

After a power outage, node1 can no longer see nodes 2 and 3, and nodes 2 and 3 can no longer see node1.
At first I thought it was a time-sync issue, but the time is fully in sync.

After logging into 1.1.1.1:

[screenshot attachment]

After logging into 1.1.1.2:

[screenshot attachment]

After logging into 1.1.1.3:

[screenshot attachment]



This is the output of systemctl status pve-cluster corosync -l on node1:

# systemctl status pve-cluster corosync -l
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2021-01-11 18:05:00 IST; 26min ago
Process: 2522 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
Main PID: 2533 (pmxcfs)
Tasks: 12 (limit: 12287)
Memory: 59.8M
CGroup: /system.slice/pve-cluster.service
└─2533 /usr/bin/pmxcfs

Jan 11 18:04:59 dell-r730-xd-1 pmxcfs[2533]: [dcdb] crit: cpg_initialize failed: 2
Jan 11 18:04:59 dell-r730-xd-1 pmxcfs[2533]: [dcdb] crit: can't initialize service
Jan 11 18:04:59 dell-r730-xd-1 pmxcfs[2533]: [status] crit: cpg_initialize failed: 2
Jan 11 18:04:59 dell-r730-xd-1 pmxcfs[2533]: [status] crit: can't initialize service
Jan 11 18:05:00 dell-r730-xd-1 systemd[1]: Started The Proxmox VE cluster filesystem.
Jan 11 18:05:05 dell-r730-xd-1 pmxcfs[2533]: [status] notice: update cluster info (cluster name ####, version = 3)
Jan 11 18:05:05 dell-r730-xd-1 pmxcfs[2533]: [dcdb] notice: members: 1/2533
Jan 11 18:05:05 dell-r730-xd-1 pmxcfs[2533]: [dcdb] notice: all data is up to date
Jan 11 18:05:05 dell-r730-xd-1 pmxcfs[2533]: [status] notice: members: 1/2533
Jan 11 18:05:05 dell-r730-xd-1 pmxcfs[2533]: [status] notice: all data is up to date

● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2021-01-11 18:05:00 IST; 26min ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 2645 (corosync)
Tasks: 9 (limit: 12287)
Memory: 144.7M
CGroup: /system.slice/corosync.service
└─2645 /usr/sbin/corosync -f

Jan 11 18:05:00 dell-r730-xd-1 corosync[2645]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Jan 11 18:05:00 dell-r730-xd-1 corosync[2645]: [KNET ] host: host: 3 has no active links
Jan 11 18:05:00 dell-r730-xd-1 corosync[2645]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Jan 11 18:05:00 dell-r730-xd-1 corosync[2645]: [KNET ] host: host: 3 has no active links
Jan 11 18:05:00 dell-r730-xd-1 corosync[2645]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 0)
Jan 11 18:05:00 dell-r730-xd-1 corosync[2645]: [KNET ] host: host: 1 has no active links
Jan 11 18:05:00 dell-r730-xd-1 corosync[2645]: [TOTEM ] A new membership (1.128) was formed. Members joined: 1
Jan 11 18:05:00 dell-r730-xd-1 corosync[2645]: [QUORUM] Members[1]: 1
Jan 11 18:05:00 dell-r730-xd-1 corosync[2645]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 11 18:05:00 dell-r730-xd-1 systemd[1]: Started Corosync Cluster Engine.

This is the output of systemctl status pve-cluster corosync -l on node2:

root@dell-r730-xd-2:~# systemctl status pve-cluster corosync -l
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2020-12-18 20:00:26 IST; 3 weeks 2 days ago
Process: 2513 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
Main PID: 2523 (pmxcfs)
Tasks: 10 (limit: 12287)
Memory: 85.6M
CGroup: /system.slice/pve-cluster.service
└─2523 /usr/bin/pmxcfs

Jan 11 17:00:25 dell-r730-xd-2 pmxcfs[2523]: [dcdb] notice: data verification successful
Jan 11 18:00:25 dell-r730-xd-2 pmxcfs[2523]: [dcdb] notice: data verification successful
Jan 11 18:00:31 dell-r730-xd-2 pmxcfs[2523]: [status] notice: received log
Jan 11 18:07:29 dell-r730-xd-2 pmxcfs[2523]: [status] notice: received log
Jan 11 18:14:26 dell-r730-xd-2 pmxcfs[2523]: [status] notice: received log
Jan 11 18:14:38 dell-r730-xd-2 pmxcfs[2523]: [status] notice: received log
Jan 11 18:15:54 dell-r730-xd-2 pmxcfs[2523]: [status] notice: received log
Jan 11 18:15:55 dell-r730-xd-2 pmxcfs[2523]: [status] notice: received log
Jan 11 18:22:21 dell-r730-xd-2 pmxcfs[2523]: [status] notice: received log
Jan 11 18:37:22 dell-r730-xd-2 pmxcfs[2523]: [status] notice: received log

● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2020-12-18 20:00:26 IST; 3 weeks 2 days ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 2655 (corosync)
Tasks: 9 (limit: 12287)
Memory: 146.1M
CGroup: /system.slice/corosync.service
└─2655 /usr/sbin/corosync -f

Jan 05 13:44:31 dell-r730-xd-2 corosync[2655]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 05 13:44:34 dell-r730-xd-2 corosync[2655]: [KNET ] link: host: 1 link: 0 is down
Jan 05 13:44:34 dell-r730-xd-2 corosync[2655]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 05 13:44:34 dell-r730-xd-2 corosync[2655]: [KNET ] host: host: 1 has no active links
Jan 05 13:44:34 dell-r730-xd-2 corosync[2655]: [TOTEM ] Token has not been received in 1237 ms
Jan 05 13:44:35 dell-r730-xd-2 corosync[2655]: [TOTEM ] A processor failed, forming new configuration.
Jan 05 13:44:37 dell-r730-xd-2 corosync[2655]: [TOTEM ] A new membership (2.123) was formed. Members left: 1
Jan 05 13:44:37 dell-r730-xd-2 corosync[2655]: [TOTEM ] Failed to receive the leave message. failed: 1
Jan 05 13:44:37 dell-r730-xd-2 corosync[2655]: [QUORUM] Members[2]: 2 3
Jan 05 13:44:37 dell-r730-xd-2 corosync[2655]: [MAIN ] Completed service synchronization, ready to provide service.

---------------------
pvecm status of node1
---------------------



Cluster information
-------------------
Name: ####
Config Version: 3
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Mon Jan 11 18:34:05 2021
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000001
Ring ID: 1.128
Quorate: No

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 1
Quorum: 2 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 1.1.1.1 (local)
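
The "Activity blocked" line follows directly from the votequorum majority rule: with three expected votes, a partition needs floor(3/2) + 1 = 2 votes to stay quorate. A minimal sketch of that arithmetic, using the values from the pvecm status output above:

```python
# Votequorum majority: quorum = floor(expected_votes / 2) + 1.
expected_votes = 3            # "Expected votes" from pvecm status
needed = expected_votes // 2 + 1
print(needed)  # 2

# Node1 is alone with 1 vote (< 2), so its activity is blocked;
# nodes 2+3 together hold 2 votes and remain quorate.
```

This is why node2 and node3 still show "Quorate: Yes" while node1 does not.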

---------------------
pvecm status of node2
---------------------

Cluster information
-------------------
Name: #####
Config Version: 3
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Mon Jan 11 18:38:20 2021
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000002
Ring ID: 2.123
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 2
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 1.1.1.2 (local)
0x00000003 1 1.1.1.3

---------------------
pvecm status of node3
---------------------

Cluster information
-------------------
Name: #####
Config Version: 3
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Mon Jan 11 18:38:20 2021
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000003
Ring ID: 2.123
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 2
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 1.1.1.2
0x00000003 1 1.1.1.3 (local)
 
Yes, we are using Ceph. I am not really sure how to check whether my corosync link is separated from other traffic.
 
Can you post the network tab of one node?
What is in your /etc/pve/corosync.conf?

Or post the output of:
cat /etc/pve/corosync.conf
and
cat /etc/network/interfaces

Does Ceph report a health warning?
 
You could try to restart the Proxmox cluster service with

Code:
systemctl restart corosync

node by node, starting with the one which is now separated.

Have you checked that date and time are synchronized?
 
Can you post the network tab of one node?
What is in your /etc/pve/corosync.conf?

Or post the output of:
cat /etc/pve/corosync.conf
and
cat /etc/network/interfaces

Does Ceph report a health warning?
Hello Christian,

Thank you for looking into the issue. Please find the outputs below.

###### /etc/pve/corosync.conf ######

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: dell-r730-xd-2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 1.1.1.2
  }
  node {
    name: dell-r730-xd-3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 1.1.1.3
  }
  node {
    name: dell-r730-xd-1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 1.1.1.1
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: organisation
  config_version: 3
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}


###### /etc/network/interfaces ######

auto lo
iface lo inet loopback

iface enps0 inet manual

auto eno1
iface eno1 inet static
    address 1.1.1.1/24

iface eno2 inet manual

iface eno3 inet manual

iface eno4 inet manual

auto enps0
iface enps01 inet static
    address 1.1.18.1/24

iface enps02 inet manual

iface enps03 inet manual

auto vmbr0
iface vmbr0 inet static
    address 1.1.5.1/24
    gateway 1.1.5.254
    bridge-ports enps0
    bridge-stp off
    bridge-fd 0
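
One thing the two files above make visible: all three ring0_addr values sit in the same subnet as eno1, so corosync shares that link with whatever else rides it. A small sketch to confirm the addresses share one network (the addresses are copied from the corosync.conf above; the /24 prefix is taken from the eno1 stanza and assumed to apply on nodes 2 and 3 as well):

```python
import ipaddress

# ring0 addresses from the corosync.conf nodelist above
ring0 = ["1.1.1.1", "1.1.1.2", "1.1.1.3"]
# eno1 subnet, assumed /24 on all three nodes
net = ipaddress.ip_network("1.1.1.0/24")

shared = all(ipaddress.ip_address(a) in net for a in ring0)
print(shared)  # True
```

If that prints True, a single saturated or flapping link carries all cluster traffic, which is why a dedicated corosync network (a second link) is generally recommended.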


CEPH Health command output:

Node1:


HEALTH_ERR 1/799659 objects unfound (0.000%); Possible data damage: 1 pg recovery_unfound; Degraded data redundancy: 3/2398977 objects degraded (0.000%), 1 pg degraded; 1 pgs not deep-scrubbed in time; 1 pgs not scrubbed in time; 2 slow ops, oldest one blocked for 31 sec, osd.8 has slow ops; clock skew detected on mon.dell-r730-xd-2, mon.dell-r730-xd-3

Node2:

  cluster:
    id:     6db2a265-b2b6-46e0-93fa-6434898e6218
    health: HEALTH_ERR
            1/799661 objects unfound (0.000%)
            Possible data damage: 1 pg recovery_unfound
            Degraded data redundancy: 3/2398983 objects degraded (0.000%), 1 pg degraded
            1 pgs not deep-scrubbed in time
            1 pgs not scrubbed in time
            2 slow ops, oldest one blocked for 31 sec, osd.8 has slow ops
            clock skew detected on mon.dell-r730-xd-2, mon.dell-r730-xd-3

  services:
    mon: 3 daemons, quorum dell-r730-xd-1,dell-r730-xd-2,dell-r730-xd-3 (age 6d)
    mgr: dell-r730-xd-2 (active, since 4w), standbys: dell-r730-xd-3, dell-r730-xd-1
    osd: 27 osds: 26 up (since 6d), 26 in (since 4w)

  data:
    pools:   3 pools, 384 pgs
    objects: 799.66k objects, 1.7 TiB
    usage:   4.0 TiB used, 9.5 TiB / 13 TiB avail
    pgs:     3/2398983 objects degraded (0.000%)
             1/799661 objects unfound (0.000%)
             383 active+clean
             1 active+recovery_unfound+degraded

  io:
    client: 378 KiB/s rd, 642 KiB/s wr, 2 op/s rd, 58 op/s wr
 
There should be one. It is the one which is down and out.
Look at the OSD tabs on the nodes. There should be one with the status down and out. You should try to bring it up and in again.
Clock skew, as written, comes from a problem with the time synchronization. Maybe search the forum for that failure and bring the clocks back in sync.
I think after starting the OSD and synchronizing the time it should work again.
 
Can you please check what your configuration concerning osd_pool_default_min_size and osd_pool_default_size is?
You can find this under Node > Ceph > Configuration, under [global].

I do not really understand why there is a degraded PG when there is just one OSD missing. Normally it should transfer the data to another OSD.
Please try to mark osd.8, which is now out, and press start and in under the OSD tab.
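
For context on why those two settings matter here: a replicated PG keeps serving I/O only while it has at least min_size active replicas. A toy illustration (the size/min_size values below are the usual replicated-pool defaults, not values read from this cluster):

```python
# Illustrative Ceph replicated-pool gate: a PG serves I/O while its
# number of active replicas is >= min_size; below that, I/O blocks.
size, min_size = 3, 2  # typical defaults, assumed, not from this cluster

statuses = {active: active >= min_size for active in (3, 2, 1)}
for active, ok in statuses.items():
    print(active, "serves I/O" if ok else "blocked")
```

So with one OSD out, a PG with all three replicas placed sensibly should still serve I/O at two replicas and backfill elsewhere, which is why the stuck recovery_unfound PG is surprising.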
 
Can you please check what your configuration concerning osd_pool_default_min_size and osd_pool_default_size is?
You can find this under Node > Ceph > Configuration, under [global].

I do not really understand why there is a degraded PG when there is just one OSD missing. Normally it should transfer the data to another OSD.
Please try to mark osd.8, which is now out, and press start and in under the OSD tab.
Hi Christian,
There is a time sync issue as well.

[screenshot attachment]

Any idea how to get these nodes' time in sync with each other?
 
Time synchronization works with NTP:
https://pve.proxmox.com/wiki/Time_Synchronization
Maybe this is helpful; otherwise you should search the forum concerning time and NTP.
Could it be that port 123 is blocked by a firewall?
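
To reason about the port-123 question: an NTP request is just a 48-byte UDP datagram whose first byte packs the leap indicator, version, and mode fields. This sketch only builds that client packet locally (no server is contacted; sending it to your NTP server on UDP/123 and checking for a reply would be the actual firewall test):

```python
import struct

# First byte of an NTP packet: LI (2 bits) | VN (3 bits) | mode (3 bits).
# Mode 3 = client request, version 4 per the current NTP protocol.
LI, VN, MODE = 0, 4, 3
first_byte = (LI << 6) | (VN << 3) | MODE
packet = struct.pack("!B", first_byte) + b"\x00" * 47  # rest zeroed for a bare query

print(len(packet), hex(first_byte))  # 48 0x23
```

If such a datagram to port 123 never gets an answer while other traffic to the same host works, a firewall rule blocking UDP/123 is a likely culprit.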
I have set up NTP, added the NTP server info, and rebooted all three nodes.

I am getting the following errors on the nodes:

node1:
kernel reports TIME_ERROR: 0x41: Clock Unsynchronized

node2:
kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Jan 19 20:57:52 dell-r730-xd-2 ntpd[2318]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Jan 19 20:57:54 dell-r730-xd-2 ntpd[2318]: Listen normally on 8 eno1 1.1.6.2:123
Jan 19 20:57:54 dell-r730-xd-2 ntpd[2318]: bind(28) AF_INET6 fz80::ac2c:72ff:dio3:78cf%3#123 flags 0x11 failed: Cannot assign requested address
Jan 19 20:57:54 dell-r730-xd-2 ntpd[2318]: unable to create socket on eno1 (9) for fe80::ba2a:72ff:fed3:79bd%3#123
Jan 19 20:57:54 dell-r730-xd-2 ntpd[2318]: failed to init interface for address fz80::ac2c:72ff:dio3:78cf%3
Jan 19 20:57:54 dell-r730-xd-2 ntpd[2318]: new interface(s) found: waking up resolver
Jan 19 20:57:56 dell-r730-xd-2 ntpd[2318]: Listen normally on 10 eno1 [fz80::ac2c:72ff:dio3:78cf%3]:123
Jan 19 20:57:56 dell-r730-xd-2 ntpd[2318]: new interface(s) found: waking up resolver
Jan 19 20:57:58 dell-r730-xd-2 ntpd[2318]: receive: Unexpected origin timestamp 0xe3b178ff.07f48b66 does not match aorg 0000000000.00000000 from server@214.241.31.8 xmt 0xe3b178fe.b2c5

node3:
root@dell-r730-xd-3:~# systemctl status ntp
● ntp.service - Network Time Service
Loaded: loaded (/lib/systemd/system/ntp.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2021-01-19 15:28:22 GMT; 4min 4s ago
Docs: man:ntpd(8)
Process: 2397 ExecStart=/usr/lib/ntp/ntp-systemd-wrapper (code=exited, status=0/SUCCESS)
Main PID: 2415 (ntpd)
Tasks: 2 (limit: 12287)
Memory: 3.8M
CGroup: /system.slice/ntp.service
└─2415 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 110:117

Jan 19 15:28:22 dell-r730-xd-3 ntpd[2415]: Listen normally on 6 enp5s0f1 [fe68::7a05:cbrr:fe81:1ef1%7]:123
Jan 19 15:28:22 dell-r730-xd-3 ntpd[2415]: Listen normally on 7 vmbr0 [fe68::7a05:cbrr:fe81:1ef1%10]:123
Jan 19 15:28:22 dell-r730-xd-3 ntpd[2415]: Listening on routing socket on fd #24 for interface updates
Jan 19 15:28:22 dell-r730-xd-3 ntpd[2415]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Jan 19 15:28:22 dell-r730-xd-3 ntpd[2415]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Jan 19 15:28:25 dell-r730-xd-3 ntpd[2415]: Listen normally on 8 eno1 10.10.60.13:123
Jan 19 15:28:25 dell-r730-xd-3 ntpd[2415]: Listen normally on 9 eno1 [fe68::7a05:cbrr:fe81:1ef1%3]:123
Jan 19 15:28:25 dell-r730-xd-3 ntpd[2415]: new interface(s) found: waking up resolver
Jan 19 15:28:31 dell-r730-xd-3 ntpd[2415]: receive: Unexpected origin timestamp 0xe3b1791f.f048e9e9 does not match aorg 0000000000.00000000 from server@216.239.35.0 xmt 0xe3b1791f.ae6d
Jan 19 15:28:31 dell-r730-xd-3 ntpd[2415]: receive: Unexpected origin timestamp 0xe3b1791f.f0478dd1 does not match aorg 0000000000.00000000 from server@216.239.35.12 xmt 0xe3b1791f.ae8
 
