[SOLVED] pve-cluster service does not start PVE 6

Sultan

Member
Oct 1, 2019
After unsuccessfully adding a new node (the bad node) to the cluster and restarting the pve-cluster service on the nodes, the pve-cluster service no longer starts. As expected, pvecm and other commands are not working either.
Code:
systemctl status pve-cluster.service
● pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
   Active: activating (start) since Tue 2019-10-01 10:02:53 +06; 36s ago
Cntrl PID: 6022 (pmxcfs)
    Tasks: 3 (limit: 4915)
   Memory: 2.6M
   CGroup: /system.slice/pve-cluster.service
           ├─6022 /usr/bin/pmxcfs
           └─6024 /usr/bin/pmxcfs

Қаз 01 10:02:53 hp1 systemd[1]: Starting The Proxmox VE cluster filesystem...
Қаз 01 10:02:53 hp1 pmxcfs[6024]: [status] notice: update cluster info (cluster name  cluster, version = 4)
Қаз 01 10:02:54 hp1 pmxcfs[6024]: [dcdb] notice: cpg_join retry 10
Қаз 01 10:02:55 hp1 pmxcfs[6024]: [dcdb] notice: cpg_join retry 20
Қаз 01 10:02:56 hp1 pmxcfs[6024]: [dcdb] notice: cpg_join retry 30
Code:
root@hp1:~# systemctl start pve-cluster.service
Job for pve-cluster.service failed because a timeout was exceeded.
See "systemctl status pve-cluster.service" and "journalctl -xe" for details.
root@hp1:~#
root@hp1:~# journalctl -xe
-- Subject: Automatic restarting of a unit has been scheduled
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- Automatic restarting of the unit pve-cluster.service has been scheduled, as the result for
-- the configured Restart= setting for the unit.
Қаз 01 10:04:24 hp1 systemd[1]: Stopped The Proxmox VE cluster filesystem.
-- Subject: A stop job for unit pve-cluster.service has finished
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- A stop job for unit pve-cluster.service has finished.
--
-- The job identifier is 464212 and the job result is done.
Қаз 01 10:04:24 hp1 systemd[1]: Starting The Proxmox VE cluster filesystem...
-- Subject: A start job for unit pve-cluster.service has begun execution
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- A start job for unit pve-cluster.service has begun execution.
--
-- The job identifier is 464212.
Қаз 01 10:04:24 hp1 systemd[4408]: etc-pve.mount: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- The unit UNIT has successfully entered the 'dead' state.
Қаз 01 10:04:24 hp1 systemd[1]: etc-pve.mount: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- The unit etc-pve.mount has successfully entered the 'dead' state.
Қаз 01 10:04:24 hp1 pmxcfs[6058]: [status] notice: update cluster info (cluster name  cluster, version = 4)
Қаз 01 10:04:25 hp1 pmxcfs[6058]: [dcdb] notice: cpg_join retry 10
Қаз 01 10:04:25 hp1 pve-firewall[1174]: status update error: Connection refused
Қаз 01 10:04:26 hp1 pmxcfs[6058]: [dcdb] notice: cpg_join retry 20
Қаз 01 10:04:26 hp1 pvestatd[1176]: ipcc_send_rec[1] failed: Connection refused
Қаз 01 10:04:26 hp1 pvestatd[1176]: ipcc_send_rec[2] failed: Connection refused
Қаз 01 10:04:26 hp1 pvestatd[1176]: ipcc_send_rec[3] failed: Connection refused
Қаз 01 10:04:26 hp1 pvestatd[1176]: ipcc_send_rec[4] failed: Connection refused
Қаз 01 10:04:26 hp1 pvestatd[1176]: status update error: Connection refused
Қаз 01 10:04:27 hp1 pmxcfs[6058]: [dcdb] notice: cpg_join retry 30
Қаз 01 10:04:28 hp1 pmxcfs[6058]: [dcdb] notice: cpg_join retry 40
Қаз 01 10:04:28 hp1 pve-ha-lrm[1210]: loop take too long (90 seconds)
Қаз 01 10:04:28 hp1 pve-ha-crm[1202]: loop take too long (90 seconds)
Қаз 01 10:04:28 hp1 pve-ha-lrm[1210]: updating service status from manager failed: Connection refused
Қаз 01 10:04:28 hp1 corosync[1157]:   [TOTEM ] A new membership (1:56716) was formed. Members
Қаз 01 10:04:34 hp1 corosync[1157]:   [TOTEM ] A new membership (1:56736) was formed. Members
Қаз 01 10:04:35 hp1 pve-firewall[1174]: status update error: Connection refused
Қаз 01 10:04:36 hp1 pvestatd[1176]: ipcc_send_rec[1] failed: Connection refused
Қаз 01 10:04:36 hp1 pvestatd[1176]: ipcc_send_rec[2] failed: Connection refused
Қаз 01 10:04:36 hp1 pvestatd[1176]: ipcc_send_rec[3] failed: Connection refused
Қаз 01 10:04:36 hp1 pvestatd[1176]: ipcc_send_rec[4] failed: Connection refused
Қаз 01 10:04:36 hp1 pvestatd[1176]: status update error: Connection refused

On the bad node:
Code:
root@dell2:/etc/pve# pvecm status
Quorum information
------------------
Date:             Tue Oct  1 09:54:56 2019
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000004
Ring ID:          4/54576
Quorate:          No

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      1
Quorum:           3 Activity blocked
Flags:           

Membership information
----------------------
    Nodeid      Votes Name
0x00000004          1 192.168.1.19 (local)
corosync.conf:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: dell1
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.1.38
  }
  node {
    name: dell2
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.1.19
  }
  node {
    name: hp1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.4
  }
  node {
    name: hp2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.1.10
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: cluster
  config_version: 4
  interface {
    bindnetaddr: 192.168.1.4
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

The PVE version is the same on all nodes:
Code:
root@dell2:/etc/pve# pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.15-1-pve)
pve-manager: 6.0-4 (running version: 6.0-4/2a719255)
pve-kernel-5.0: 6.0-5
pve-kernel-helper: 6.0-5
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-2
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-5
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-61
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-4
pve-container: 3.0-3
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-5
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-3
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-5
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve1
 
Hi,

it looks like your node has no connection to the other nodes.
I would check your network connection.
Once the connection to the other nodes is back, your cluster will work normally and you can remove the bad node.
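For example, something along these lines (adjust the IPs to your own nodes) should show whether the nodes can reach each other and whether corosync sees its link as up:
Code:
# ping the other cluster nodes from the affected node
ping -c 3 192.168.1.10
ping -c 3 192.168.1.38
ping -c 3 192.168.1.19

# show the local corosync link/ring status
corosync-cfgtool -s

# show quorum and membership as corosync sees it
corosync-quorumtool -s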
 
Hi, thanks for the answer. But my network seems fine: I can ping the nodes from one another and use ssh, though the ssh connection takes too long to establish. And I added the node to the cluster via the web interface.
 
Is a firewall enabled on this network?
Try to restart corosync.service on all nodes
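Roughly like this on each node, one after the other (corosync first, then pve-cluster, so pmxcfs can rejoin):
Code:
systemctl restart corosync.service
systemctl restart pve-cluster.service
systemctl status corosync.service pve-cluster.service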
 
There were no problems with the network when I added the dell1 node.
I restarted corosync.service on all nodes and the pve-cluster service is back to life, but on node hp1 I see this:
Code:
root@hp1:~# systemctl status pve-cluster.service
● pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2019-10-01 11:04:12 +06; 5min ago
  Process: 7137 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
  Process: 7159 ExecStartPost=/usr/bin/pvecm updatecerts --silent (code=exited, status=0/SUCCESS)
Main PID: 7139 (pmxcfs)
    Tasks: 7 (limit: 4915)
   Memory: 30.8M
   CGroup: /system.slice/pve-cluster.service
           └─7139 /usr/bin/pmxcfs

Қаз 01 11:04:14 hp1 pmxcfs[7139]: [quorum] crit: quorum_initialize failed: 2
Қаз 01 11:04:14 hp1 pmxcfs[7139]: [quorum] crit: can't initialize service
Қаз 01 11:04:14 hp1 pmxcfs[7139]: [confdb] crit: cmap_initialize failed: 2
Қаз 01 11:04:14 hp1 pmxcfs[7139]: [confdb] crit: can't initialize service
Қаз 01 11:04:14 hp1 pmxcfs[7139]: [dcdb] notice: start cluster connection
Қаз 01 11:04:14 hp1 pmxcfs[7139]: [dcdb] crit: cpg_initialize failed: 2
Қаз 01 11:04:14 hp1 pmxcfs[7139]: [dcdb] crit: can't initialize service
Қаз 01 11:04:14 hp1 pmxcfs[7139]: [status] notice: start cluster connection
Қаз 01 11:04:14 hp1 pmxcfs[7139]: [status] crit: cpg_initialize failed: 2
Қаз 01 11:04:14 hp1 pmxcfs[7139]: [status] crit: can't initialize service
I can now access the web interface, but only 2 of my nodes are online and the dell1 node's status is unknown.
Screenshot from 2019-10-01 11-13-38.png
Screenshot from 2019-10-01 11-15-29.png
Code:
root@hp1:~# pvecm status
Quorum information
------------------
Date:             Tue Oct  1 11:17:29 2019
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000001
Ring ID:          1/70032
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.1.4 (local)
0x00000002          1 192.168.1.10
0x00000003          1 192.168.1.38
0x00000004          1 192.168.1.19

Update:
I restarted the pve-cluster service on node hp1 and it now seems to work fine.
But in the web interface my dell1 node is listed as unknown, although I can access the VMs on this node from the web.
My bad node shows as connected in the web interface, but I cannot list any of its properties.
Is it right to delete the dell2 node from the cluster, reinstall it and join it to the cluster again?
https://pve.proxmox.com/wiki/Cluster_Manager
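If I understand the manual correctly, the removal step would be something like this, run from one of the nodes that is still in the quorate part of the cluster (with dell2 kept powered off until it is reinstalled)?
Code:
pvecm delnode dell2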
 
I deleted the node as described in the manual. But what do I have to do to change my dell1 node's status from unknown?
Screenshot from 2019-10-01 15-29-06.png
 
You have to delete the dell1 dir in /etc/pve.
If you have configs there, please back them up.
I mean the VM configs (107, 108, ...)
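Roughly like this; the node directories live under /etc/pve/nodes/, and you should only ever do this for a node that has really been removed from the cluster. Back up the guest configs first:
Code:
# adjust the node name to the one that was actually removed from the cluster
mkdir -p /root/node-config-backup
cp -a /etc/pve/nodes/dell1/qemu-server /root/node-config-backup/
cp -a /etc/pve/nodes/dell1/lxc /root/node-config-backup/

# then remove the leftover node directory
rm -r /etc/pve/nodes/dell1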
 
Will deleting the dell1 folder not affect the running virtual machines (VMs 107 to 119) on that node (dell1)?
Because I deleted dell2 from the cluster, not dell1.
Update:
Restarted the pvestatd service on node dell1 and it solved my problem.
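I.e. roughly this on dell1:
Code:
systemctl restart pvestatd.service
systemctl status pvestatd.service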
 
