Proxmox nodes are not showing up status and vote quorum has 2 activity blocked

sai.dasari · Jan 11, 2021

We are using Proxmox with a 3-node cluster setup.

After a power outage, node1 no longer shows node2 and node3, and node2 and node3 no longer show node1.
At first I thought it was a time sync issue, but the time is totally in sync.

after logging into 1.1.1.1

after logging into 1.1.1.2

after logging into 1.1.1.3

This is the output of systemctl status pve-cluster corosync -l on node1

# systemctl status pve-cluster corosync -l
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2021-01-11 18:05:00 IST; 26min ago
Process: 2522 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
Main PID: 2533 (pmxcfs)
Tasks: 12 (limit: 12287)
Memory: 59.8M
CGroup: /system.slice/pve-cluster.service
└─2533 /usr/bin/pmxcfs

Jan 11 18:04:59 dell-r730-xd-1 pmxcfs[2533]: [dcdb] crit: cpg_initialize failed: 2
Jan 11 18:04:59 dell-r730-xd-1 pmxcfs[2533]: [dcdb] crit: can't initialize service
Jan 11 18:04:59 dell-r730-xd-1 pmxcfs[2533]: [status] crit: cpg_initialize failed: 2
Jan 11 18:04:59 dell-r730-xd-1 pmxcfs[2533]: [status] crit: can't initialize service
Jan 11 18:05:00 dell-r730-xd-1 systemd[1]: Started The Proxmox VE cluster filesystem.
Jan 11 18:05:05 dell-r730-xd-1 pmxcfs[2533]: [status] notice: update cluster info (cluster name ####, version = 3)
Jan 11 18:05:05 dell-r730-xd-1 pmxcfs[2533]: [dcdb] notice: members: 1/2533
Jan 11 18:05:05 dell-r730-xd-1 pmxcfs[2533]: [dcdb] notice: all data is up to date
Jan 11 18:05:05 dell-r730-xd-1 pmxcfs[2533]: [status] notice: members: 1/2533
Jan 11 18:05:05 dell-r730-xd-1 pmxcfs[2533]: [status] notice: all data is up to date

● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2021-01-11 18:05:00 IST; 26min ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 2645 (corosync)
Tasks: 9 (limit: 12287)
Memory: 144.7M
CGroup: /system.slice/corosync.service
└─2645 /usr/sbin/corosync -f

Jan 11 18:05:00 dell-r730-xd-1 corosync[2645]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Jan 11 18:05:00 dell-r730-xd-1 corosync[2645]: [KNET ] host: host: 3 has no active links
Jan 11 18:05:00 dell-r730-xd-1 corosync[2645]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Jan 11 18:05:00 dell-r730-xd-1 corosync[2645]: [KNET ] host: host: 3 has no active links
Jan 11 18:05:00 dell-r730-xd-1 corosync[2645]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 0)
Jan 11 18:05:00 dell-r730-xd-1 corosync[2645]: [KNET ] host: host: 1 has no active links
Jan 11 18:05:00 dell-r730-xd-1 corosync[2645]: [TOTEM ] A new membership (1.128) was formed. Members joined: 1
Jan 11 18:05:00 dell-r730-xd-1 corosync[2645]: [QUORUM] Members[1]: 1
Jan 11 18:05:00 dell-r730-xd-1 corosync[2645]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 11 18:05:00 dell-r730-xd-1 systemd[1]: Started Corosync Cluster Engine.

This is the output of systemctl status pve-cluster corosync -l on node2

root@dell-r730-xd-2:~# systemctl status pve-cluster corosync -l
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2020-12-18 20:00:26 IST; 3 weeks 2 days ago
Process: 2513 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
Main PID: 2523 (pmxcfs)
Tasks: 10 (limit: 12287)
Memory: 85.6M
CGroup: /system.slice/pve-cluster.service
└─2523 /usr/bin/pmxcfs

Jan 11 17:00:25 dell-r730-xd-2 pmxcfs[2523]: [dcdb] notice: data verification successful
Jan 11 18:00:25 dell-r730-xd-2 pmxcfs[2523]: [dcdb] notice: data verification successful
Jan 11 18:00:31 dell-r730-xd-2 pmxcfs[2523]: [status] notice: received log
Jan 11 18:07:29 dell-r730-xd-2 pmxcfs[2523]: [status] notice: received log
Jan 11 18:14:26 dell-r730-xd-2 pmxcfs[2523]: [status] notice: received log
Jan 11 18:14:38 dell-r730-xd-2 pmxcfs[2523]: [status] notice: received log
Jan 11 18:15:54 dell-r730-xd-2 pmxcfs[2523]: [status] notice: received log
Jan 11 18:15:55 dell-r730-xd-2 pmxcfs[2523]: [status] notice: received log
Jan 11 18:22:21 dell-r730-xd-2 pmxcfs[2523]: [status] notice: received log
Jan 11 18:37:22 dell-r730-xd-2 pmxcfs[2523]: [status] notice: received log

● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2020-12-18 20:00:26 IST; 3 weeks 2 days ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 2655 (corosync)
Tasks: 9 (limit: 12287)
Memory: 146.1M
CGroup: /system.slice/corosync.service
└─2655 /usr/sbin/corosync -f

Jan 05 13:44:31 dell-r730-xd-2 corosync[2655]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 05 13:44:34 dell-r730-xd-2 corosync[2655]: [KNET ] link: host: 1 link: 0 is down
Jan 05 13:44:34 dell-r730-xd-2 corosync[2655]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 05 13:44:34 dell-r730-xd-2 corosync[2655]: [KNET ] host: host: 1 has no active links
Jan 05 13:44:34 dell-r730-xd-2 corosync[2655]: [TOTEM ] Token has not been received in 1237 ms
Jan 05 13:44:35 dell-r730-xd-2 corosync[2655]: [TOTEM ] A processor failed, forming new configuration.
Jan 05 13:44:37 dell-r730-xd-2 corosync[2655]: [TOTEM ] A new membership (2.123) was formed. Members left: 1
Jan 05 13:44:37 dell-r730-xd-2 corosync[2655]: [TOTEM ] Failed to receive the leave message. failed: 1
Jan 05 13:44:37 dell-r730-xd-2 corosync[2655]: [QUORUM] Members[2]: 2 3
Jan 05 13:44:37 dell-r730-xd-2 corosync[2655]: [MAIN ] Completed service synchronization, ready to provide service.

---------------------
pvecm status of node1
---------------------

Cluster information
-------------------
Name: ####
Config Version: 3
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Mon Jan 11 18:34:05 2021
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000001
Ring ID: 1.128
Quorate: No

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 1
Quorum: 2 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 1.1.1.1 (local)

---------------------
pvecm status of node2
---------------------

Cluster information
-------------------
Name: #####
Config Version: 3
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Mon Jan 11 18:38:20 2021
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000002
Ring ID: 2.123
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 2
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 1.1.1.2 (local)
0x00000003 1 1.1.1.3

---------------------
pvecm status of node3
---------------------

Cluster information
-------------------
Name: #####
Config Version: 3
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Mon Jan 11 18:38:20 2021
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000002
Ring ID: 2.123
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 2
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 1.1.1.2
0x00000003 1 1.1.1.3 (local)
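
For readers following the numbers above: with the default votequorum settings, the quorum threshold is a simple majority of the expected votes. That is why node1 (1 of 3 votes) reports "Activity blocked" while node2 and node3 (2 of 3 votes together) stay quorate. A minimal sketch of that arithmetic follows; the function names are made up for illustration and are not part of any Proxmox or corosync tool.

```shell
# Majority threshold for a given number of expected votes
# (integer division, so 3 expected votes give 2).
quorum_needed() {
    expected=$1
    echo $(( expected / 2 + 1 ))
}

# Prints "quorate" when the votes a partition can see reach the
# majority threshold, "blocked" otherwise.
quorum_state() {
    expected=$1
    total=$2
    if [ "$total" -ge "$(quorum_needed "$expected")" ]; then
        echo "quorate"
    else
        echo "blocked"
    fi
}

quorum_state 3 1   # node1: sees only its own vote
quorum_state 3 2   # node2 and node3: see each other
```

So the cluster itself is not broken; node1 is simply cut off from the other two and cannot reach a majority on its own.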

Christian St. · Jan 11, 2021

Are you using Ceph? Can you tell us a little bit more about your network setup? Is the corosync link separated from other traffic?

sai.dasari · Jan 12, 2021

Yes, we are using Ceph. I am not really sure how to check whether my corosync link is separated from other traffic.

Christian St. · Jan 13, 2021

Can you post your network tab of one node?
What is in your /etc/pve/corosync.conf?

Or post the output of:
cat /etc/pve/corosync.conf
and
cat /etc/network/interfaces

Does ceph report a health warning?
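
One possible way to check whether the corosync link is separated: list the addresses corosync is configured to use and compare them with the addresses on each interface. The helper below is only a sketch. Its name is invented, the sample values are made up, and it assumes that "name:" appears before "ring0_addr:" inside each node block.

```shell
# Print "node-name address" for every node entry in a
# corosync.conf-style file.
list_ring0_addrs() {
    awk '$1 == "name:"       { name = $2 }
         $1 == "ring0_addr:" { print name, $2 }' "$1"
}

# Demo on a small sample file (made-up values); on a node you
# would pass /etc/pve/corosync.conf instead.
sample=$(mktemp)
cat > "$sample" <<'EOF'
nodelist {
  node {
    name: node-a
    nodeid: 1
    ring0_addr: 192.0.2.1
  }
  node {
    name: node-b
    nodeid: 2
    ring0_addr: 192.0.2.2
  }
}
EOF
list_ring0_addrs "$sample"
rm -f "$sample"
```

The printed addresses can then be compared with the output of `ip -br addr` and with the networks Ceph is configured to use. If corosync shares a subnet or NIC with storage or VM traffic, the link is not separated.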

Christian St. · Jan 13, 2021

You could try to restart the Proxmox cluster service with

Code:

systemctl restart corosync

node by node, starting with the one which is now separated.

Have you checked that date and time are synchronized?
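
When restarting corosync node by node, it may help to confirm that the node is quorate again before moving on to the next one. The sketch below reads `pvecm status` style text on stdin and looks for the "Quorate: Yes" line; the function name is hypothetical and the demo input is cut down from the outputs posted in this thread.

```shell
# Succeeds (exit 0) if stdin contains a "Quorate: Yes" line, as printed
# in the "Quorum information" section. The "Flags: Quorate" line is
# deliberately not matched.
is_quorate() {
    grep -Eq '^[[:space:]]*Quorate:[[:space:]]*Yes[[:space:]]*$'
}

# Demo with the two states seen in this thread.
printf 'Nodes: 1\nQuorate: No\n'  | is_quorate && echo "quorate" || echo "not quorate"
printf 'Nodes: 2\nQuorate: Yes\n' | is_quorate && echo "quorate" || echo "not quorate"

# On a node, after restarting corosync, this could be used as:
#   pvecm status | is_quorate && echo "safe to move to the next node"
```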

sai.dasari · Jan 18, 2021

Christian St. said:
Can you post your network tab of one node?
What is in your /etc/pve/corosync.conf?

Or post the output of:
cat /etc/pve/corosync.conf
and
cat /etc/network/interfaces

Does ceph report a health warning?

Hello Christian,

Thank you for looking into the issue. Please find the requested outputs below.

###### /etc/pve/corosync.conf ######

logging {
debug: off
to_syslog: yes
}

nodelist {
node {
name: dell-r730-xd-2
nodeid: 2
quorum_votes: 1
ring0_addr: 1.1.1.2
}
node {
name: dell-r730-xd-3
nodeid: 3
quorum_votes: 1
ring0_addr: 1.1.1.3
}
node {
name: dell-r730-xd-1
nodeid: 1
quorum_votes: 1
ring0_addr: 1.1.1.1
}
}

quorum {
provider: corosync_votequorum
}

totem {
cluster_name: organisation
config_version: 3
interface {
linknumber: 0
}
ip_version: ipv4-6
link_mode: passive
secauth: on
version: 2
}

###### /etc/network/interfaces ######

auto lo
iface lo inet loopback

iface enps0 inet manual

auto eno1
iface eno1 inet static
address 1.1.1.1/24

iface eno2 inet manual

iface eno3 inet manual

iface eno4 inet manual

auto enps0
iface enps01 inet static
address 1.1.18.1/24

iface enps02 inet manual

iface enps03 inet manual

auto vmbr0
iface vmbr0 inet static
address 1.1.5.1/24
gateway 1.1.5.254
bridge-ports enps0
bridge-stp off
bridge-fd 0

CEPH Health command output:

Node1:

HEALTH_ERR 1/799659 objects unfound (0.000%); Possible data damage: 1 pg recovery_unfound; Degraded data redundancy: 3/2398977 objects degraded (0.000%), 1 pg degraded; 1 pgs not deep-scrubbed in time; 1 pgs not scrubbed in time; 2 slow ops, oldest one blocked for 31 sec, osd.8 has slow ops; clock skew detected on mon.dell-r730-xd-2, mon.dell-r730-xd-3

Node2:

cluster:
id: 6db2a265-b2b6-46e0-93fa-6434898e6218
health: HEALTH_ERR
1/799661 objects unfound (0.000%)
Possible data damage: 1 pg recovery_unfound
Degraded data redundancy: 3/2398983 objects degraded (0.000%), 1 pg degraded
1 pgs not deep-scrubbed in time
1 pgs not scrubbed in time
2 slow ops, oldest one blocked for 31 sec, osd.8 has slow ops
clock skew detected on mon.dell-r730-xd-2, mon.dell-r730-xd-3

services:
mon: 3 daemons, quorum dell-r730-xd-1,dell-r730-xd-2,dell-r730-xd-3 (age 6d)
mgr: dell-r730-xd-2(active, since 4w), standbys: dell-r730-xd-3, dell-r730-xd-1
osd: 27 osds: 26 up (since 6d), 26 in (since 4w)

data:
pools: 3 pools, 384 pgs
objects: 799.66k objects, 1.7 TiB
usage: 4.0 TiB used, 9.5 TiB / 13 TiB avail
pgs: 3/2398983 objects degraded (0.000%)
1/799661 objects unfound (0.000%)
383 active+clean
1 active+recovery_unfound+degraded

io:
client: 378 KiB/s rd, 642 KiB/s wr, 2 op/s rd, 58 op/s wr

Christian St. · Jan 18, 2021

sai.dasari said:
osd: 27 osds: 26 up (since 6d), 26 in (since 4w)

What is the status of the 27th osd?

Christian St. · Jan 18, 2021

sai.dasari said:
clock skew detected on mon.dell-r730-xd-2, mon.dell-r730-xd-3

Is the time on all nodes synchronized?

sai.dasari · Jan 18, 2021

Christian St. said:
What is the status of the 27th osd?

There is no 27th OSD

Christian St. · Jan 18, 2021

There should be one. It is the one which is down and out.
Look at the OSD tabs on the nodes. There should be one with the status down and out. You should try to bring it up and in again.
The clock skew, as written, comes from a problem with time synchronization. Maybe search the forum for that error and bring the clocks back to the same time.
I think after starting the OSD and synchronizing the time it should work again.
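
The "27th OSD" follows directly from the summary line in the Ceph status: 27 OSDs exist, but only 26 are up and 26 are in. A small sketch of that arithmetic is shown here. The function name is made up, and it only parses the one-line summary format quoted in this thread; on a real cluster, `ceph osd tree` lists each OSD with its up/down state.

```shell
# Read an "osd: N osds: U up ..., I in ..." summary line on stdin and
# print how many OSDs are down (not up) and out (not in).
osd_counts() {
    awk '{
        for (i = 2; i <= NF; i++) {
            tok = $i
            gsub(/,/, "", tok)            # tolerate "up," and "in,"
            if (tok == "osds:") total = $(i - 1)
            if (tok == "up")    up    = $(i - 1)
            if (tok == "in")    inn   = $(i - 1)
        }
        printf "total=%d down=%d out=%d\n", total, total - up, total - inn
    }'
}

echo "osd: 27 osds: 26 up (since 6d), 26 in (since 4w)" | osd_counts
# prints: total=27 down=1 out=1
```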

Christian St. · Jan 18, 2021

Can you please check what your configuration for osd_pool_default_min_size and osd_pool_default_size is?
You can find this under Node > Ceph > Configuration, in the [global] section.

I do not really understand why there is a degraded PG when there is just one OSD missing. Normally it should transfer the data to another OSD.
Please try to select OSD.8, which is now out, and press Start and In under the OSD tab.

sai.dasari · Jan 19, 2021

Christian St. said:
Can you please check what your configuration for osd_pool_default_min_size and osd_pool_default_size is?
You can find this under Node > Ceph > Configuration, in the [global] section.

I do not really understand why there is a degraded PG when there is just one OSD missing. Normally it should transfer the data to another OSD.
Please try to select OSD.8, which is now out, and press Start and In under the OSD tab.

Hi Christian,
There is a time sync issue as well.

Any idea how to get the nodes' clocks to sync with each other?

sai.dasari · Jan 19, 2021

Christian St. said:
Is the time on all nodes synchronized?

No, the time is not synchronized. There is a delay.
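
To put a number on the delay, one rough approach is to read the clock on every node at about the same moment and compare the readings. The sketch below takes epoch seconds as arguments and prints the largest difference; the function name and the demo values are made up. Whole seconds only catch gross skew, since Ceph monitors by default warn at a drift of roughly 0.05 s.

```shell
# Print the difference between the largest and smallest of the
# given epoch-second readings.
max_skew() {
    min=$1
    max=$1
    for t in "$@"; do
        if [ "$t" -lt "$min" ]; then min=$t; fi
        if [ "$t" -gt "$max" ]; then max=$t; fi
    done
    echo $(( max - min ))
}

# Demo with made-up readings from three nodes.
max_skew 1611069472 1611069475 1611069471

# On a cluster, readings could be collected with something like:
#   max_skew "$(date +%s)" "$(ssh root@1.1.1.2 date +%s)" "$(ssh root@1.1.1.3 date +%s)"
# SSH latency adds error, so treat the result as an estimate.
```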

Christian St. · Jan 19, 2021

Time synchronization works with NTP
https://pve.proxmox.com/wiki/Time_Synchronization
Maybe this is helpful; otherwise you should search the forum for time and NTP.
Could it be that port 123 is blocked by a firewall?

sai.dasari · Jan 19, 2021

Christian St. said:
Time synchronization works with NTP
https://pve.proxmox.com/wiki/Time_Synchronization
Maybe this is helpful; otherwise you should search the forum for time and NTP.
Could it be that port 123 is blocked by a firewall?

I have set up NTP, added the NTP server info, and rebooted all three nodes.

I am getting different errors on each of the three nodes:

node1:
kernel reports TIME_ERROR: 0x41: Clock Unsynchronized

node2:
kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Jan 19 20:57:52 dell-r730-xd-2 ntpd[2318]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Jan 19 20:57:54 dell-r730-xd-2 ntpd[2318]: Listen normally on 8 eno1 1.1.6.2:123
Jan 19 20:57:54 dell-r730-xd-2 ntpd[2318]: bind(28) AF_INET6 fz80::ac2c:72ff:dio3:78cf%3#123 flags 0x11 failed: Cannot assign requested address
Jan 19 20:57:54 dell-r730-xd-2 ntpd[2318]: unable to create socket on eno1 (9) for fe80::ba2a:72ff:fed3:79bd%3#123
Jan 19 20:57:54 dell-r730-xd-2 ntpd[2318]: failed to init interface for address fz80::ac2c:72ff:dio3:78cf%3
Jan 19 20:57:54 dell-r730-xd-2 ntpd[2318]: new interface(s) found: waking up resolver
Jan 19 20:57:56 dell-r730-xd-2 ntpd[2318]: Listen normally on 10 eno1 [fz80::ac2c:72ff:dio3:78cf%3]:123
Jan 19 20:57:56 dell-r730-xd-2 ntpd[2318]: new interface(s) found: waking up resolver
Jan 19 20:57:58 dell-r730-xd-2 ntpd[2318]: receive: Unexpected origin timestamp 0xe3b178ff.07f48b66 does not match aorg 0000000000.00000000 from server@214.241.31.8 xmt 0xe3b178fe.b2c5

node3:
root@dell-r730-xd-3:~# systemctl status ntp
● ntp.service - Network Time Service
Loaded: loaded (/lib/systemd/system/ntp.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2021-01-19 15:28:22 GMT; 4min 4s ago
Docs: man:ntpd(8)
Process: 2397 ExecStart=/usr/lib/ntp/ntp-systemd-wrapper (code=exited, status=0/SUCCESS)
Main PID: 2415 (ntpd)
Tasks: 2 (limit: 12287)
Memory: 3.8M
CGroup: /system.slice/ntp.service
└─2415 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 110:117

Jan 19 15:28:22 dell-r730-xd-3 ntpd[2415]: Listen normally on 6 enp5s0f1 [fe68::7a05:cbrr:fe81:1ef1%7]:123
Jan 19 15:28:22 dell-r730-xd-3 ntpd[2415]: Listen normally on 7 vmbr0 [fe68::7a05:cbrr:fe81:1ef1%10]:123
Jan 19 15:28:22 dell-r730-xd-3 ntpd[2415]: Listening on routing socket on fd #24 for interface updates
Jan 19 15:28:22 dell-r730-xd-3 ntpd[2415]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Jan 19 15:28:22 dell-r730-xd-3 ntpd[2415]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Jan 19 15:28:25 dell-r730-xd-3 ntpd[2415]: Listen normally on 8 eno1 10.10.60.13:123
Jan 19 15:28:25 dell-r730-xd-3 ntpd[2415]: Listen normally on 9 eno1 [fe68::7a05:cbrr:fe81:1ef1%3]:123
Jan 19 15:28:25 dell-r730-xd-3 ntpd[2415]: new interface(s) found: waking up resolver
Jan 19 15:28:31 dell-r730-xd-3 ntpd[2415]: receive: Unexpected origin timestamp 0xe3b1791f.f048e9e9 does not match aorg 0000000000.00000000 from server@216.239.35.0 xmt 0xe3b1791f.ae6d
Jan 19 15:28:31 dell-r730-xd-3 ntpd[2415]: receive: Unexpected origin timestamp 0xe3b1791f.f0478dd1 does not match aorg 0000000000.00000000 from server@216.239.35.12 xmt 0xe3b1791f.ae8

Christian St. · Jan 19, 2021

Can you check

timedatectl status

in the shell of all nodes?
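
The line of interest in that output is the one reporting whether the system clock is synchronized. A hedged sketch for checking it in a script follows: the function name is invented, the demo input is illustrative rather than taken from these nodes, and the label wording differs between systemd versions ("System clock synchronized" on newer ones, "NTP synchronized" on older ones), so both are accepted.

```shell
# Succeeds (exit 0) if timedatectl-style text on stdin reports the
# clock as synchronized.
clock_synced() {
    grep -Eq '^[[:space:]]*(System clock synchronized|NTP synchronized):[[:space:]]*yes[[:space:]]*$'
}

# Demo with illustrative input.
printf 'System clock synchronized: no\n'  | clock_synced && echo "synced" || echo "not synced"
printf 'System clock synchronized: yes\n' | clock_synced && echo "synced" || echo "not synced"

# On a node this could be used as:
#   timedatectl status | clock_synced || echo "clock is not synchronized"
```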

sai.dasari · Apr 9, 2021

The issue got resolved. Some legendary god had pulled the LAN cable in the DC.
