The cluster does not start and /etc/pve reports "Transport endpoint is not connected"

Thread starter: Shazams (Guest)

Hello everyone.
I have Proxmox 6.1 with the latest updates.
There was a node in the cluster, but I removed it with pvecm delnode nodename.
When I tried to add it back with pvecm add nodename, it complained about various leftover cluster files. I deleted those files with the script https://gist.github.com/ianchen06/73acc392c72d6680099b7efac1351f56#gistcomment-3054405
There was a warning when trying to add the node:
* this host already contains virtual guests

WARNING: detected error but forced to continue!

I ignored it, because there were no guest files left and the node was initially empty.
After adding, the task hung waiting for quorum, so I cancelled it with Ctrl+C.
When I then tried to open the web interface, everything hung. I restarted the cluster service on the master node and it did not start. I also could not get into the /etc/pve directory.

systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Fri 2020-05-22 03:29:54 MSK; 7s ago
Process: 23490 ExecStart=/usr/bin/pmxcfs (code=exited, status=255/EXCEPTION)

May 22 03:29:53 VT-SupportMachine1 pmxcfs[23490]: [main] crit: fuse_mount error: Transport endpoint is not connected
May 22 03:29:53 VT-SupportMachine1 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
May 22 03:29:53 VT-SupportMachine1 pmxcfs[23490]: [main] notice: exit proxmox configuration filesystem (-1)
May 22 03:29:53 VT-SupportMachine1 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
May 22 03:29:53 VT-SupportMachine1 systemd[1]: pve-cluster.service: Service RestartSec=100ms expired, scheduling restart.
May 22 03:29:54 VT-SupportMachine1 systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 5.
May 22 03:29:54 VT-SupportMachine1 systemd[1]: Stopped The Proxmox VE cluster filesystem.
May 22 03:29:54 VT-SupportMachine1 systemd[1]: pve-cluster.service: Start request repeated too quickly.
May 22 03:29:54 VT-SupportMachine1 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
May 22 03:29:54 VT-SupportMachine1 systemd[1]: Failed to start The Proxmox VE cluster filesystem.

journalctl -xe
May 22 03:31:00 VT-SupportMachine1 systemd[1]: Starting Proxmox VE replication runner...
-- Subject: A start job for unit pvesr.service has begun execution
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- A start job for unit pvesr.service has begun execution.
--
-- The job identifier is 2589688.
May 22 03:31:01 VT-SupportMachine1 pve-ha-lrm[1440]: updating service status from manager failed: Connection refused
May 22 03:31:01 VT-SupportMachine1 pve-firewall[1396]: status update error: Connection refused
May 22 03:31:01 VT-SupportMachine1 pvesr[23535]: ipcc_send_rec[1] failed: Connection refused
May 22 03:31:01 VT-SupportMachine1 pvesr[23535]: ipcc_send_rec[2] failed: Connection refused
May 22 03:31:01 VT-SupportMachine1 pvesr[23535]: ipcc_send_rec[3] failed: Connection refused
May 22 03:31:01 VT-SupportMachine1 pvesr[23535]: Unable to load access control list: Connection refused
May 22 03:31:01 VT-SupportMachine1 systemd[1]: pvesr.service: Main process exited, code=exited, status=111/n/a
-- Subject: Unit process exited
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- An ExecStart= process belonging to unit pvesr.service has exited.
--
-- The process' exit code is 'exited' and its exit status is 111.
May 22 03:31:01 VT-SupportMachine1 systemd[1]: pvesr.service: Failed with result 'exit-code'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- The unit pvesr.service has entered the 'failed' state with result 'exit-code'.
May 22 03:31:01 VT-SupportMachine1 systemd[1]: Failed to start Proxmox VE replication runner.
-- Subject: A start job for unit pvesr.service has failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- A start job for unit pvesr.service has finished with a failure.
--
-- The job identifier is 2589688 and the job result is failed.
May 22 03:31:01 VT-SupportMachine1 cron[1364]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)
May 22 03:31:06 VT-SupportMachine1 pve-ha-lrm[1440]: updating service status from manager failed: Connection refused
May 22 03:31:11 VT-SupportMachine1 pve-ha-lrm[1440]: updating service status from manager failed: Connection refused
May 22 03:31:11 VT-SupportMachine1 pve-firewall[1396]: status update error: Connection refused
May 22 03:31:16 VT-SupportMachine1 pve-ha-lrm[1440]: updating service status from manager failed: Connection refused
May 22 03:31:21 VT-SupportMachine1 pve-ha-lrm[1440]: updating service status from manager failed: Connection refused
May 22 03:31:21 VT-SupportMachine1 pve-firewall[1396]: status update error: Connection refused
May 22 03:31:26 VT-SupportMachine1 pve-ha-lrm[1440]: updating service status from manager failed: Connection refused

On the master node,
/var/lib/pve-cluster/config.db
still contains all the data. I did not touch the database; I only made a backup of it.
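
A backup of that database can be taken with a plain file copy while pmxcfs is not writing to it; roughly something like this (the destination path here is just an example):

Code:
systemctl stop pve-cluster
cp /var/lib/pve-cluster/config.db /root/config.db.backup
systemctl start pve-cluster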

When I try to view the master node's console from another working node, I get this error:
Connection failed (Error 500: hostname lookup 'VT-SupportMachine1' failed - failed to get address info for: VT-SupportMachine1: Name or service not known)

How can I get the master node working correctly again? The other nodes work correctly (the cluster restarts on them).
 
qm list -full
ipcc_send_rec[1] failed: Connection refused
ipcc_send_rec[2] failed: Connection refused
ipcc_send_rec[3] failed: Connection refused
Unable to load access control list: Connection refused
 
pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.3.18-3-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-helper: 6.1-8
pve-kernel-5.3: 6.1-6
pve-kernel-4.15: 5.4-13
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-1-pve: 5.3.18-1
pve-kernel-4.15.18-25-pve: 4.15.18-53
pve-kernel-4.4.134-1-pve: 4.4.134-112
pve-kernel-4.4.128-1-pve: 4.4.128-111
pve-kernel-4.4.114-1-pve: 4.4.114-108
pve-kernel-4.4.98-6-pve: 4.4.98-107
pve-kernel-4.4.98-2-pve: 4.4.98-101
pve-kernel-4.4.95-1-pve: 4.4.95-99
pve-kernel-4.4.83-1-pve: 4.4.83-96
pve-kernel-4.4.79-1-pve: 4.4.79-95
pve-kernel-4.4.76-1-pve: 4.4.76-94
pve-kernel-4.4.67-1-pve: 4.4.67-92
pve-kernel-4.4.62-1-pve: 4.4.62-88
pve-kernel-4.4.59-1-pve: 4.4.59-87
pve-kernel-4.4.49-1-pve: 4.4.49-86
pve-kernel-4.4.44-1-pve: 4.4.44-84
pve-kernel-4.4.40-1-pve: 4.4.40-82
pve-kernel-4.4.35-2-pve: 4.4.35-79
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 4.0.1-pve1
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-23
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-7
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
 
I rebooted the node and it left the cluster.
Another node did the same.
When trying to add it to the existing cluster:

* authentication key '/etc/corosync/authkey' already exists
* cluster config '/etc/pve/corosync.conf' already exists
* this host already contains virtual guests
Check if node may join a cluster failed!


pvecm status
Cluster information
-------------------
Name: VT-Cluster
Config Version: 7
Transport: knet
Secure auth: on

Cannot initialize CMAP service



What should I do?
 
On the separated node, do:

Code:
systemctl stop corosync
systemctl stop pve-cluster
pmxcfs -l
rm /etc/pve/corosync.conf
rm /etc/corosync/authkey
killall pmxcfs
systemctl start pve-cluster
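After running this, it might be worth checking that pmxcfs starts and mounts /etc/pve again before doing anything else, e.g.:

Code:
systemctl status pve-cluster
ls -l /etc/pve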

Also, the line [icode]ipcc_send_rec[2] failed: Connection refused[/icode] in your journal usually suggests a misconfiguration in the [icode]/etc/hosts[/icode] file. If there are any lines starting with [icode]127.0.1.1[/icode] in the [icode]/etc/hosts[/icode] files on your nodes, please delete or comment out these lines.
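
As a rough illustration (the IP addresses and the second node name below are placeholders, use your real cluster IPs and node names), each node's [icode]/etc/hosts[/icode] should map every node name to its actual cluster address instead of [icode]127.0.1.1[/icode]:

Code:
127.0.0.1      localhost
192.168.1.10   VT-SupportMachine1
192.168.1.11   other-node

You can check what a name currently resolves to with [icode]getent hosts VT-SupportMachine1[/icode].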