pve-cluster not starting

aulandsdalen

Jan 19, 2018
I've got a cluster of three nodes: node1, node2 and node3.

For some reason the web UI stopped responding, so I tried to restart the pveproxy service, but it won't start.

Code:
# systemctl status pveproxy
● pveproxy.service - PVE API Proxy Server
   Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled)
   Active: failed (Result: timeout) since Fri 2018-01-19 13:42:21 MSK; 4h 52min ago
 Main PID: 19148 (code=exited, status=0/SUCCESS)

Jan 19 13:39:21 dime-node-0cc47aded8f4 systemd[1]: pveproxy.service start operation timed out. Terminating.
Jan 19 13:40:51 dime-node-0cc47aded8f4 systemd[1]: pveproxy.service stop-final-sigterm timed out. Killing.
Jan 19 13:42:21 dime-node-0cc47aded8f4 systemd[1]: pveproxy.service still around after final SIGKILL. Entering failed mode.
Jan 19 13:42:21 dime-node-0cc47aded8f4 systemd[1]: Failed to start PVE API Proxy Server.
Jan 19 13:42:21 dime-node-0cc47aded8f4 systemd[1]: Unit pveproxy.service entered failed state.

I've also tried restarting the pve-cluster service, but it won't start either:

Code:
# systemctl status pve-cluster -l
● pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
   Active: failed (Result: signal) since Fri 2018-01-19 18:22:48 MSK; 12min ago
  Process: 19674 ExecStart=/usr/bin/pmxcfs $DAEMON_OPTS (code=exited, status=0/SUCCESS)
 Main PID: 19678 (code=killed, signal=KILL)

Jan 19 18:21:17 dime-node-0cc47aded8f4 pmxcfs[19678]: [status] notice: received sync request (epoch 1/19678/00000002)
Jan 19 18:22:07 dime-node-0cc47aded8f4 pmxcfs[19678]: [dcdb] notice: members: 1/19678, 2/4372, 3/14138, 4/14154, 5/14247
Jan 19 18:22:07 dime-node-0cc47aded8f4 pmxcfs[19678]: [status] notice: members: 1/19678, 2/4372, 3/14138, 4/14154, 5/14247
Jan 19 18:22:07 dime-node-0cc47aded8f4 pmxcfs[19678]: [dcdb] notice: received sync request (epoch 1/19678/00000002)
Jan 19 18:22:07 dime-node-0cc47aded8f4 pmxcfs[19678]: [status] notice: received sync request (epoch 1/19678/00000003)
Jan 19 18:22:38 dime-node-0cc47aded8f4 systemd[1]: pve-cluster.service start-post operation timed out. Stopping.
Jan 19 18:22:48 dime-node-0cc47aded8f4 systemd[1]: pve-cluster.service stop-sigterm timed out. Killing.
Jan 19 18:22:48 dime-node-0cc47aded8f4 systemd[1]: pve-cluster.service: main process exited, code=killed, status=9/KILL
Jan 19 18:22:48 dime-node-0cc47aded8f4 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Jan 19 18:22:48 dime-node-0cc47aded8f4 systemd[1]: Unit pve-cluster.service entered failed state.

journalctl -xe outputs numerous ipcc_send_rec errors:

Code:
# journalctl -xe
Jan 19 18:41:09 dime-node-0cc47aded8f4 pve-ha-lrm[4398]: ipcc_send_rec failed: Connection refused
Jan 19 18:41:09 dime-node-0cc47aded8f4 pve-ha-lrm[4398]: ipcc_send_rec failed: Connection refused
Jan 19 18:41:09 dime-node-0cc47aded8f4 pve-ha-crm[4349]: ipcc_send_rec failed: Connection refused
Jan 19 18:41:09 dime-node-0cc47aded8f4 pve-ha-lrm[4398]: ipcc_send_rec failed: Connection refused
Jan 19 18:41:09 dime-node-0cc47aded8f4 pve-ha-crm[4349]: ipcc_send_rec failed: Connection refused
Jan 19 18:41:14 dime-node-0cc47aded8f4 pve-ha-lrm[4398]: ipcc_send_rec failed: Connection refused
Jan 19 18:41:14 dime-node-0cc47aded8f4 pve-ha-crm[4349]: ipcc_send_rec failed: Connection refused
Jan 19 18:41:14 dime-node-0cc47aded8f4 pve-ha-crm[4349]: ipcc_send_rec failed: Connection refused
Jan 19 18:41:14 dime-node-0cc47aded8f4 pve-ha-lrm[4398]: ipcc_send_rec failed: Connection refused
Jan 19 18:41:14 dime-node-0cc47aded8f4 pve-ha-crm[4349]: ipcc_send_rec failed: Connection refused
Jan 19 18:41:14 dime-node-0cc47aded8f4 pve-ha-lrm[4398]: ipcc_send_rec failed: Connection refused
Jan 19 18:41:18 dime-node-0cc47aded8f4 pvestatd[4322]: ipcc_send_rec failed: Connection refused
Jan 19 18:41:18 dime-node-0cc47aded8f4 pvestatd[4322]: ipcc_send_rec failed: Connection refused
Jan 19 18:41:18 dime-node-0cc47aded8f4 pvestatd[4322]: ipcc_send_rec failed: Connection refused
Jan 19 18:41:18 dime-node-0cc47aded8f4 pvestatd[4322]: ipcc_send_rec failed: Connection refused
Jan 19 18:41:18 dime-node-0cc47aded8f4 pvestatd[4322]: ipcc_send_rec failed: Connection refused
Jan 19 18:41:18 dime-node-0cc47aded8f4 pvestatd[4322]: ipcc_send_rec failed: Connection refused
Jan 19 18:41:19 dime-node-0cc47aded8f4 pve-ha-crm[4349]: ipcc_send_rec failed: Connection refused
Jan 19 18:41:19 dime-node-0cc47aded8f4 pve-ha-lrm[4398]: ipcc_send_rec failed: Connection refused
Jan 19 18:41:19 dime-node-0cc47aded8f4 pve-ha-crm[4349]: ipcc_send_rec failed: Connection refused
Jan 19 18:41:19 dime-node-0cc47aded8f4 pve-ha-lrm[4398]: ipcc_send_rec failed: Connection refused
Jan 19 18:41:19 dime-node-0cc47aded8f4 pve-ha-crm[4349]: ipcc_send_rec failed: Connection refused
Jan 19 18:41:19 dime-node-0cc47aded8f4 pve-ha-lrm[4398]: ipcc_send_rec failed: Connection refused

I've also tried restarting pve-cluster on node3, and it won't start there either.

PVE version:

Code:
# pveversion --verbose
proxmox-ve: 4.4-88 (running kernel: 4.4.62-1-pve)
pve-manager: 4.4-13 (running version: 4.4-13/7ea56165)
pve-kernel-4.4.35-2-pve: 4.4.35-79
pve-kernel-4.4.59-1-pve: 4.4.59-87
pve-kernel-4.4.44-1-pve: 4.4.44-84
pve-kernel-4.4.49-1-pve: 4.4.49-86
pve-kernel-4.4.62-1-pve: 4.4.62-88
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-2~pve4+1
libqb0: 1.0.1-1
pve-cluster: 4.0-50
qemu-server: 4.0-110
pve-firmware: 1.1-11
libpve-common-perl: 4.0-95
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-2
pve-docs: 4.4-4
pve-qemu-kvm: 2.7.1-4
pve-container: 1.0-100
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.9.1-1
lxc-pve: 2.0.7-4
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-9
smartmontools: 6.5+svn4324-1~pve80
 
Is corosync running?

# systemctl status corosync.service
Yep.

Code:
# systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
   Active: active (running) since Fri 2018-01-19 17:36:52 MSK; 1h 7min ago
  Process: 13026 ExecStop=/usr/share/corosync/corosync stop (code=killed, signal=TERM)
  Process: 13587 ExecStart=/usr/share/corosync/corosync start (code=exited, status=0/SUCCESS)
 Main PID: 13596 (corosync)
   CGroup: /system.slice/corosync.service
           └─13596 corosync

Jan 19 18:23:10 dime-node-0cc47aded8f4 corosync[13596]: [TOTEM ] Retransmit List: 326eab 326eac
Jan 19 18:23:11 dime-node-0cc47aded8f4 corosync[13596]: [TOTEM ] Retransmit List: 3289d5 3289d6 3289d8
Jan 19 18:23:11 dime-node-0cc47aded8f4 corosync[13596]: [TOTEM ] Retransmit List: 3289d6
Jan 19 18:23:11 dime-node-0cc47aded8f4 corosync[13596]: [TOTEM ] Retransmit List: 32ab20
Jan 19 18:23:11 dime-node-0cc47aded8f4 corosync[13596]: [TOTEM ] Retransmit List: 32b5f1 32b5f6
Jan 19 18:23:11 dime-node-0cc47aded8f4 corosync[13596]: [TOTEM ] Retransmit List: 32b935
Jan 19 18:23:11 dime-node-0cc47aded8f4 corosync[13596]: [TOTEM ] Retransmit List: 32bbd6
Jan 19 18:23:15 dime-node-0cc47aded8f4 corosync[13596]: [TOTEM ] Retransmit List: 33a5fa 33a5fd 33a5fe 33a5ff
Jan 19 18:23:15 dime-node-0cc47aded8f4 corosync[13596]: [TOTEM ] Retransmit List: 33a89d
Jan 19 18:23:20 dime-node-0cc47aded8f4 corosync[13596]: [TOTEM ] Retransmit List: 3551f1 3551f2 3551f3
 
Code:
Jan 19 18:23:20 dime-node-0cc47aded8f4 corosync[13596]: [TOTEM ] Retransmit List: 3551f1 3551f2 3551f3

Lines like this indicate network problems. Check your network, especially whether multicast is working properly.
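To get a rough idea of how often corosync is retransmitting, the matching log lines can be counted. This is only a minimal sketch: `count_retransmits` is a hypothetical helper name (not a Proxmox tool), and the commented usage lines assume the standard `corosync` systemd unit and that `omping` is installed on every node.

```shell
#!/bin/sh
# Count corosync "Retransmit List" log lines read from standard input.
# count_retransmits is a hypothetical helper name, not part of Proxmox VE.
count_retransmits() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # there are none, so "|| true" keeps the function's exit status at 0.
    grep -c 'Retransmit List:' || true
}

# Example usage on a node (assumes the corosync systemd unit):
#   journalctl -u corosync --since "1 hour ago" | count_retransmits
#
# Multicast itself can be tested with omping, started on all nodes at
# roughly the same time (assumes omping is installed everywhere):
#   omping -c 10000 -i 0.001 -F -q node1 node2 node3
```

A count that keeps growing while the cluster is otherwise idle would be consistent with the network problem described above; a handful of isolated retransmits is less conclusive.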
 
