PVE-cluster service not starting

sanjayc

New Member
Nov 24, 2023
I had to power off the host due to load on one container. It shut down, however I could not bring it back up (buffer error).

Steps taken:
1) Powered off the host and rebooted.

The problem now is
root@vmserver2:~# systemctl reset-failed pve-cluster
root@vmserver2:~# systemctl start pve-cluster

pve-cluster not starting
root@vmserver2:~# pvesm status
ipcc_send_rec[1] failed: Connection refused
ipcc_send_rec[2] failed: Connection refused
ipcc_send_rec[3] failed: Connection refused
Unable to load access control list: Connection refused
root@vmserver2:~#
root@vmserver2:~# journalctl -b -u pve-cluster.service
-- Journal begins at Mon 2022-03-21 16:15:56 EDT, ends at Tue 2024-04-02 11:14:48 EDT. --
-- No entries --


root@vmserver2:~# systemctl status pve-cluster.service
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: inactive (dead)


My Ceph cluster is degraded; I'm not sure if the above will be fixed after the Ceph cluster recovers.

Degraded data redundancy: 175343/2807727 objects degraded (6.245%), 51 pgs degraded, 51 pgs undersized
pg 4.0 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [10,15]
pg 4.8 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [5,1]
pg 4.9 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [5,1]
pg 4.f is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [3,2]
pg 4.19 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [13,3]
pg 4.21 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [3,10]
pg 4.23 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [13,1]
pg 4.26 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfilling, last acting [12,2]
pg 4.28 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfilling, last acting [13,4]
pg 4.2c is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfilling, last acting [3,1]
pg 4.2d is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [3,1]
pg 4.36 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [5,3]
pg 4.37 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfilling, last acting [9,13]
pg 4.48 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [2,4]
pg 4.52 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [6,4]
pg 4.53 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [4,10]
pg 4.55 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfilling, last acting [2,1]
pg 4.61 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [6,2]
pg 4.6f is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [13,15]
pg 4.73 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [10,15]
pg 4.79 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [5,4]
pg 4.7b is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [13,10]
pg 4.7d is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [10,3]
pg 4.82 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [3,8]
pg 4.89 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [6,13]
pg 4.8a is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [5,1]
pg 4.8e is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [10,8]
pg 4.91 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [3,13]
pg 4.92 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfilling, last acting [1,4]
pg 4.96 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [10,5]
pg 4.98 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [6,5]
pg 4.9a is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [5,9]
pg 4.9c is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [13,9]
pg 4.9d is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [10,5]
pg 4.9e is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [13,10]
pg 4.ac is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [10,2]
pg 4.b1 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [5,9]
pg 4.b2 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [3,5]
pg 4.b8 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [3,2]
pg 4.bb is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [4,10]
pg 4.bf is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [4,1]
pg 4.c6 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfilling, last acting [4,5]
pg 4.c7 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [13,4]
pg 4.d2 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [4,1]
pg 4.d7 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [2,15]
pg 4.de is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [5,9]
pg 4.e6 is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [5,10]
pg 4.ea is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [5,4]
pg 4.eb is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [3,2]
pg 4.ed is stuck undersized for 6h, current state active+undersized+degraded+remapped+backfill_wait, last acting [5,10]
pg 4.fd is active+undersized+degraded+remapped+backfill_wait, acting [13,10]


root@vmserver2:~# ls -l /etc/pve/
total 0
root@vmserver2:~#


Any suggestions on how to fix this problem?
 
My Ceph cluster is degraded; I'm not sure if the above will be fixed after the Ceph cluster recovers.
Ceph will work independently of corosync. Please post a ceph -s to get an overview.

The problem now is
root@vmserver2:~# systemctl reset-failed pve-cluster
root@vmserver2:~# systemctl start pve-cluster
Can you please post the output of the journal for pve-cluster and corosync? Also please check that /etc/corosync/ has the corosync.conf and the authkey.
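For example (assuming the default unit names and paths), something like this should show what both services logged on this boot and confirm the files are in place:

journalctl -b -u pve-cluster.service -u corosync.service
ls -l /etc/corosync/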
 
root@vmserver2:~# ceph -s
cluster:
id: 189d85c6-fe4f-45f7-8452-21b0a34991fe
health: HEALTH_WARN
1/4 mons down, quorum vmserver1,vmserver3,vmserver4

services:
mon: 4 daemons, quorum vmserver1,vmserver3,vmserver4 (age 23h), out of quorum: vmserver2
mgr: vmserver1(active, since 6w), standbys: vmserver3, vmserver4
osd: 16 osds: 12 up (since 23h), 12 in (since 23h)

data:
pools: 3 pools, 289 pgs
objects: 936.10k objects, 3.5 TiB
usage: 11 TiB used, 11 TiB / 22 TiB avail
pgs: 287 active+clean
2 active+clean+scrubbing+deep

io:
client: 2.7 KiB/s rd, 4.0 MiB/s wr, 0 op/s rd, 40 op/s wr


I checked the conf; it is correct and matches the other nodes (I don't want to disclose the IPs, hence I have cut it off).

root@vmserver2:~# ls /etc/corosync/
authkey corosync.conf uidgid.d
root@vmserver2:~# more /etc/corosync/corosync.conf
logging {
debug: off
to_syslog: yes
}

As of yesterday the pvecm status has changed

root@vmserver2:~# pvecm status
Cluster information
-------------------
Name: uwcs
Config Version: 4
Transport: knet
Secure auth: on

Cannot initialize CMAP service

I will try a reboot and see if that fixes it, as Ceph has recovered.
 
I can't say anything without the log snippets.

As general advice, I hope the corosync links and Ceph are on separate physical interfaces. Otherwise such issues will pop up regularly.
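For reference, with two links the per-node entries in /etc/corosync/corosync.conf look roughly like this; the addresses below are only placeholders, not anyone's actual config:

nodelist {
  node {
    name: vmserver2
    nodeid: 2
    quorum_votes: 1
    # link 0 on a dedicated corosync NIC (placeholder address)
    ring0_addr: 172.16.0.2
    # optional second link on another NIC as fallback (placeholder address)
    ring1_addr: 192.168.50.2
  }
}

totem {
  cluster_name: uwcs
  config_version: 4
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  secauth: on
  version: 2
}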
 
Corosync and Ceph are on separate interfaces; Ceph is running on Mellanox NICs with private IPs and corosync is on a class B subnet.


Syslog is as follows:

root@vmserver2:/etc/pve/qemu-server#
root@vmserver2:/etc/pve/qemu-server#
root@vmserver2:/etc/pve/qemu-server# tail -f /var/log/syslog
Apr 3 13:35:42 vmserver2 pmxcfs[1951]: [dcdb] crit: cpg_initialize failed: 2
Apr 3 13:35:42 vmserver2 pmxcfs[1951]: [status] crit: cpg_initialize failed: 2
Apr 3 13:35:48 vmserver2 pmxcfs[1951]: [quorum] crit: quorum_initialize failed: 2
Apr 3 13:35:48 vmserver2 pmxcfs[1951]: [confdb] crit: cmap_initialize failed: 2
Apr 3 13:35:48 vmserver2 pmxcfs[1951]: [dcdb] crit: cpg_initialize failed: 2
Apr 3 13:35:48 vmserver2 pmxcfs[1951]: [status] crit: cpg_initialize failed: 2
Apr 3 13:35:54 vmserver2 pmxcfs[1951]: [quorum] crit: quorum_initialize failed: 2
Apr 3 13:35:54 vmserver2 pmxcfs[1951]: [confdb] crit: cmap_initialize failed: 2
Apr 3 13:35:54 vmserver2 pmxcfs[1951]: [dcdb] crit: cpg_initialize failed: 2
Apr 3 13:35:54 vmserver2 pmxcfs[1951]: [status] crit: cpg_initialize failed: 2
Apr 3 13:36:00 vmserver2 pmxcfs[1951]: [quorum] crit: quorum_initialize failed: 2
Apr 3 13:36:00 vmserver2 pmxcfs[1951]: [confdb] crit: cmap_initialize failed: 2
Apr 3 13:36:00 vmserver2 pmxcfs[1951]: [dcdb] crit: cpg_initialize failed: 2
Apr 3 13:36:00 vmserver2 pmxcfs[1951]: [status] crit: cpg_initialize failed: 2



root@vmserver2:/etc/pve/qemu-server# journalctl -b -u corosync
-- Journal begins at Mon 2022-03-21 16:15:56 EDT, ends at Wed 2024-04-03 13:42:12 EDT. --
-- No entries --
 
Thanks all. The following is the process we followed, and it worked:

systemd-analyze set-log-level debug gave us an idea of what was failing.
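In other words, something along these lines (reverting the log level afterwards, assuming the default of info):

systemd-analyze set-log-level debug
systemctl start pve-cluster.service
journalctl -b -u pve-cluster.service
systemd-analyze set-log-level info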

We were running chrony, and for some reason pve-cluster was waiting on time-sync.target. Upon further investigation we installed systemd-timesyncd, which removed chrony, and we are back in business.
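For anyone hitting the same thing, checking whether a unit is actually ordered after time-sync.target and what state that target is in should be possible with something like:

systemctl cat pve-cluster.service
systemctl list-dependencies --after pve-cluster.service
systemctl status time-sync.target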

Currently we have chrony on all the other nodes and systemd-timesyncd on one node. Should I change all of them to systemd-timesyncd?


Any idea if Cortex XDR can cause trouble? We have to install Cortex XDR on all the containers and VMs as a requirement from our security administrators.
 
You should find out why chrony was not starting/running properly; that's why that system never reached time-sync.target.
The main difference between timesyncd and chrony is that the former only syncs the time at boot rather than periodically as the latter does. Given that servers are not rebooted very often, this allows time to drift between your nodes, so timesyncd isn't best practice. In Ceph, monitor nodes must be within 5 minutes of each other.
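To dig into why chrony never brought that system to time-sync.target, its journal and status would be the first place to look, e.g.:

journalctl -b -u chrony.service
systemctl status chrony.service
chronyc tracking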
 
Any idea if Cortex XDR can cause trouble? We have to install Cortex XDR on all the containers and VMs as a requirement from our security administrators.
I don't know that product. Inside VMs it shouldn't be a problem. I wouldn't think it would cause trouble with containers either, but I'm not sure there.

Currently we have chrony on all the other nodes and systemd-timesyncd on one node. Should I change all of them to systemd-timesyncd?
No. systemd-timesyncd was replaced with chrony, as timesyncd only uses a single time source and more easily led to time skews. In general, the time needs to be synchronized, as Proxmox and Ceph heavily rely on it.
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_time_synchronization
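If you want that node back on chrony, installing the package should be enough; only one NTP daemon can be installed at a time, so it will replace systemd-timesyncd again (assuming a default PVE/Debian setup):

apt update
apt install chrony
chronyc sources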
 
