PVE-cluster service not starting

sanjayc

New Member
Nov 24, 2023
I had to power off the host due to load on one container. It shut down, however I could not bring it back up (buffer error).

Steps taken:
1) Powered off the host and rebooted.

The problem now is
root@vmserver2:~# systemctl reset-failed pve-cluster
root@vmserver2:~# systemctl start pve-cluster

pve-cluster not starting
root@vmserver2:~# pvesm status
ipcc_send_rec[1] failed: Connection refused
ipcc_send_rec[2] failed: Connection refused
ipcc_send_rec[3] failed: Connection refused
Unable to load access control list: Connection refused
root@vmserver2:~#
root@vmserver2:~# journalctl -b -u pve-cluster.service
-- Journal begins at Mon 2022-03-21 16:15:56 EDT, ends at Tue 2024-04-02 11:14:48 EDT. --
-- No entries --


root@vmserver2:~# systemctl status pve-cluster.service
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: inactive (dead)


My Ceph cluster is degraded; I'm not sure if the above will be fixed after the Ceph cluster recovers.

Degraded data redundancy: 175343/2807727 objects degraded (6.245%), 51 pgs degraded, 51 pgs undersized
pg 4.0 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [10,15]
pg 4.8 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [5,1]
pg 4.9 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [5,1]
pg 4.f is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [3,2]
pg 4.19 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [13,3]
pg 4.21 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [3,10]
pg 4.23 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [13,1]
pg 4.26 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfilling, last acting [12,2]
pg 4.28 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfilling, last acting [13,4]
pg 4.2c is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfilling, last acting [3,1]
pg 4.2d is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [3,1]
pg 4.36 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [5,3]
pg 4.37 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfilling, last acting [9,13]
pg 4.48 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [2,4]
pg 4.52 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [6,4]
pg 4.53 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [4,10]
pg 4.55 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfilling, last acting [2,1]
pg 4.61 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [6,2]
pg 4.6f is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [13,15]
pg 4.73 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [10,15]
pg 4.79 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [5,4]
pg 4.7b is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [13,10]
pg 4.7d is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [10,3]
pg 4.82 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [3,8]
pg 4.89 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [6,13]
pg 4.8a is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [5,1]
pg 4.8e is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [10,8]
pg 4.91 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [3,13]
pg 4.92 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfilling, last acting [1,4]
pg 4.96 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [10,5]
pg 4.98 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [6,5]
pg 4.9a is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [5,9]
pg 4.9c is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [13,9]
pg 4.9d is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [10,5]
pg 4.9e is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [13,10]
pg 4.ac is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [10,2]
pg 4.b1 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [5,9]
pg 4.b2 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [3,5]
pg 4.b8 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [3,2]
pg 4.bb is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [4,10]
pg 4.bf is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [4,1]
pg 4.c6 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfilling, last acting [4,5]
pg 4.c7 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [13,4]
pg 4.d2 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [4,1]
pg 4.d7 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [2,15]
pg 4.de is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [5,9]
pg 4.e6 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [5,10]
pg 4.ea is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [5,4]
pg 4.eb is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [3,2]
pg 4.ed is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [5,10]
pg 4.fd is active+undersized+degraded+remapped+backfill_wait, acting [13,10]


root@vmserver2:~# ls -l /etc/pve/
total 0
root@vmserver2:~#


Any suggestions on how to fix this problem?
 
My Ceph cluster is degraded; I'm not sure if the above will be fixed after the Ceph cluster recovers.
Ceph will work independently of corosync. Please post a ceph -s to get an overview.

The problem now is
root@vmserver2:~# systemctl reset-failed pve-cluster
root@vmserver2:~# systemctl start pve-cluster
Can you please post the output of the journal for pve-cluster and corosync? Also please check that /etc/corosync/ has the corosync.conf and the authkey.
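For example (assuming the default unit names and paths), something like this should show what both services logged on this boot and confirm the files are in place:

journalctl -b -u pve-cluster.service -u corosync.service
ls -l /etc/corosync/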
 
root@vmserver2:~# ceph -s
cluster:
id: 189d85c6-fe4f-45f7-8452-21b0a34991fe
health: HEALTH_WARN
1/4 mons down, quorum vmserver1,vmserver3,vmserver4

services:
mon: 4 daemons, quorum vmserver1,vmserver3,vmserver4 (age 23h), out of quorum: vmserver2
mgr: vmserver1(active, since 6w), standbys: vmserver3, vmserver4
osd: 16 osds: 12 up (since 23h), 12 in (since 23h)

data:
pools: 3 pools, 289 pgs
objects: 936.10k objects, 3.5 TiB
usage: 11 TiB used, 11 TiB / 22 TiB avail
pgs: 287 active+clean
2 active+clean+scrubbing+deep

io:
client: 2.7 KiB/s rd, 4.0 MiB/s wr, 0 op/s rd, 40 op/s wr


I checked the conf; it is correct and matches the other nodes (I don't want to disclose the IPs, hence I have cut it off).

root@vmserver2:~# ls /etc/corosync/
authkey corosync.conf uidgid.d
root@vmserver2:~# more /etc/corosync/corosync.conf
logging {
debug: off
to_syslog: yes
}

As of yesterday the pvecm status has changed

root@vmserver2:~# pvecm status
Cluster information
-------------------
Name: uwcs
Config Version: 4
Transport: knet
Secure auth: on

Cannot initialize CMAP service

I will try a reboot and see if that fixes it, as Ceph has recovered.
 
I can't say anything without the log snippets.

As general advice, I hope the corosync links and Ceph are on separate physical interfaces. Otherwise such issues will pop up regularly.
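For reference, with two links the per-node entries in /etc/corosync/corosync.conf look roughly like this; the addresses below are only placeholders, not anyone's actual config:

nodelist {
  node {
    name: vmserver2
    nodeid: 2
    quorum_votes: 1
    # link 0 on a dedicated corosync NIC (placeholder address)
    ring0_addr: 172.16.0.2
    # optional second link on another NIC as fallback (placeholder address)
    ring1_addr: 192.168.50.2
  }
}

totem {
  cluster_name: uwcs
  config_version: 4
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  secauth: on
  version: 2
}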
 
Corosync and Ceph are on separate interfaces; Ceph is running on Mellanox NICs with private IPs and corosync is on a class B subnet.


Syslog is as follows:

root@vmserver2:/etc/pve/qemu-server#
root@vmserver2:/etc/pve/qemu-server#
root@vmserver2:/etc/pve/qemu-server# tail -f /var/log/syslog
Apr 3 13:35:42 vmserver2 pmxcfs[1951]: [dcdb] crit: cpg_initialize failed: 2
Apr 3 13:35:42 vmserver2 pmxcfs[1951]: [status] crit: cpg_initialize failed: 2
Apr 3 13:35:48 vmserver2 pmxcfs[1951]: [quorum] crit: quorum_initialize failed: 2
Apr 3 13:35:48 vmserver2 pmxcfs[1951]: [confdb] crit: cmap_initialize failed: 2
Apr 3 13:35:48 vmserver2 pmxcfs[1951]: [dcdb] crit: cpg_initialize failed: 2
Apr 3 13:35:48 vmserver2 pmxcfs[1951]: [status] crit: cpg_initialize failed: 2
Apr 3 13:35:54 vmserver2 pmxcfs[1951]: [quorum] crit: quorum_initialize failed: 2
Apr 3 13:35:54 vmserver2 pmxcfs[1951]: [confdb] crit: cmap_initialize failed: 2
Apr 3 13:35:54 vmserver2 pmxcfs[1951]: [dcdb] crit: cpg_initialize failed: 2
Apr 3 13:35:54 vmserver2 pmxcfs[1951]: [status] crit: cpg_initialize failed: 2
Apr 3 13:36:00 vmserver2 pmxcfs[1951]: [quorum] crit: quorum_initialize failed: 2
Apr 3 13:36:00 vmserver2 pmxcfs[1951]: [confdb] crit: cmap_initialize failed: 2
Apr 3 13:36:00 vmserver2 pmxcfs[1951]: [dcdb] crit: cpg_initialize failed: 2
Apr 3 13:36:00 vmserver2 pmxcfs[1951]: [status] crit: cpg_initialize failed: 2



root@vmserver2:/etc/pve/qemu-server# journalctl -b -u corosync
-- Journal begins at Mon 2022-03-21 16:15:56 EDT, ends at Wed 2024-04-03 13:42:12 EDT. --
-- No entries --
 
Thanks all. The following is the process we followed, and it worked:

systemd-analyze set-log-level debug gave us an idea of what was failing.
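In other words, something along these lines (reverting the log level afterwards, assuming the default of info):

systemd-analyze set-log-level debug
systemctl start pve-cluster.service
journalctl -b -u pve-cluster.service
systemd-analyze set-log-level info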

We were running chrony, and for some reason pve-cluster was waiting on time-sync.target. Upon further investigation we installed systemd-timesyncd, which removed chrony, and we are back in business.
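For anyone hitting the same thing, checking whether a unit is actually ordered after time-sync.target and what state that target is in should be possible with something like:

systemctl cat pve-cluster.service
systemctl list-dependencies --after pve-cluster.service
systemctl status time-sync.target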

Currently we have chrony on all the other nodes and systemd-timesyncd on one node. Should I change all of them to systemd-timesyncd?


Any idea if Cortex XDR can cause trouble? We have to install Cortex XDR on all the containers and VMs as a requirement from our security administrators.
 
You should find out why chrony was not starting/running properly; that's why that system never reached time-sync.target.
The main difference between timesyncd and chrony is that the former only syncs the time at boot rather than periodically as the latter does. Given that servers are not rebooted very often, this allows time to drift between your nodes, so timesyncd isn't best practice. In Ceph, monitor nodes must be within 5 minutes of each other.
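To dig into why chrony never brought that system to time-sync.target, its journal and status would be the first place to look, e.g.:

journalctl -b -u chrony.service
systemctl status chrony.service
chronyc tracking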
 
Any idea if Cortex XDR can cause trouble? We have to install Cortex XDR on all the containers and VMs as a requirement from our security administrators.
I don't know that product. Inside VMs it shouldn't be a problem. I wouldn't think it would cause trouble with containers either, but I'm not sure there.

Currently we have chrony on all the other nodes and systemd-timesyncd on one node. Should I change all of them to systemd-timesyncd?
No. systemd-timesyncd was replaced with chrony, as timesyncd only uses a single time source and more easily led to time skews. In general, the time needs to be synchronized, as Proxmox and Ceph heavily rely on it.
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_time_synchronization
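If you want that node back on chrony, installing the package should be enough; only one NTP daemon can be installed at a time, so it will replace systemd-timesyncd again (assuming a default PVE/Debian setup):

apt update
apt install chrony
chronyc sources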
 
