Corosync OK but node won't come back up in the cluster

dewangga

Hi,

I am having an issue: one of our four nodes is broken. I tried restarting several components (pve-cluster, pvestatd, pvedaemon and pveproxy) and also made sure corosync was quorate by invoking corosync-quorumtool -s.
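
For reference, the restarts and quorum check described above would look roughly like this (a sketch; the exact invocations were not posted):

Code:
# restart the affected PVE services
systemctl restart pve-cluster pvestatd pvedaemon pveproxy
# confirm corosync still reports the cluster as quorate
corosync-quorumtool -s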

Code:
# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.3.18-2-pve)
pve-manager: 6.1-7 (running version: 6.1-7/13e58d5e)
pve-kernel-helper: 6.1-6
pve-kernel-5.3: 6.1-5
pve-kernel-5.3.18-2-pve: 5.3.18-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-13
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-4
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-21
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.1-3
pve-ha-manager: 3.0-8
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-3
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-6
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve2

The logs from each component are:

# pve-cluster
Code:
systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2021-06-15 10:26:20 WIB; 2h 2min ago
  Process: 26604 ExecStart=/usr/bin/pmxcfs (code=exited, status=255/EXCEPTION)
    Tasks: 1 (limit: 4915)
   Memory: 10.6M
   CGroup: /system.slice/pve-cluster.service
           └─26384 /usr/bin/pmxcfs

Jun 15 10:26:00 yamato-u5-sp101 pmxcfs[26604]: [main] crit: unable to acquire pmxcfs lock: Resource temporarily unavailable
Jun 15 10:26:00 yamato-u5-sp101 pmxcfs[26604]: [main] notice: exit proxmox configuration filesystem (-1)
Jun 15 10:26:00 yamato-u5-sp101 pmxcfs[26604]: [main] notice: exit proxmox configuration filesystem (-1)
Jun 15 10:26:00 yamato-u5-sp101 systemd[1]: pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION
Jun 15 10:26:00 yamato-u5-sp101 systemd[1]: pve-cluster.service: Killing process 26384 (pmxcfs) with signal SIGKILL.
Jun 15 10:26:10 yamato-u5-sp101 systemd[1]: pve-cluster.service: Processes still around after SIGKILL. Ignoring.
Jun 15 10:26:10 yamato-u5-sp101 systemd[1]: pve-cluster.service: Killing process 26384 (pmxcfs) with signal SIGKILL.
Jun 15 10:26:20 yamato-u5-sp101 systemd[1]: pve-cluster.service: Processes still around after final SIGKILL. Entering failed mode.
Jun 15 10:26:20 yamato-u5-sp101 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Jun 15 10:26:20 yamato-u5-sp101 systemd[1]: Stopped The Proxmox VE cluster filesystem.
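
The "Processes still around after final SIGKILL" messages usually mean the old pmxcfs process is stuck in uninterruptible (D) sleep inside the kernel, which would also explain why it cannot be killed. A quick way to check that (a sketch, not taken from the thread):

Code:
# show state (STAT) and kernel wait channel (WCHAN) of the leftover pmxcfs process;
# a 'D' in STAT means it is blocked in the kernel, typically on I/O, and will ignore SIGKILL
ps -C pmxcfs -o pid,stat,wchan:32,cmd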

# corosync
Code:
systemctl status corosync
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2021-06-15 09:43:30 WIB; 2h 46min ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
 Main PID: 14848 (corosync)
    Tasks: 9 (limit: 4915)
   Memory: 128.6M
   CGroup: /system.slice/corosync.service
           └─14848 /usr/sbin/corosync -f

Jun 15 09:43:31 yamato-u5-sp101 corosync[14848]:   [KNET  ] pmtud: PMTUD link change for host: 4 link: 0 from 469 to 1397
Jun 15 09:43:31 yamato-u5-sp101 corosync[14848]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Jun 15 09:43:33 yamato-u5-sp101 corosync[14848]:   [TOTEM ] A new membership (1.9cb) was formed. Members joined: 1 2 4
Jun 15 09:43:33 yamato-u5-sp101 corosync[14848]:   [CPG   ] downlist left_list: 0 received
Jun 15 09:43:33 yamato-u5-sp101 corosync[14848]:   [CPG   ] downlist left_list: 0 received
Jun 15 09:43:33 yamato-u5-sp101 corosync[14848]:   [CPG   ] downlist left_list: 0 received
Jun 15 09:43:33 yamato-u5-sp101 corosync[14848]:   [CPG   ] downlist left_list: 0 received
Jun 15 09:43:33 yamato-u5-sp101 corosync[14848]:   [QUORUM] This node is within the primary component and will provide service.
Jun 15 09:43:33 yamato-u5-sp101 corosync[14848]:   [QUORUM] Members[4]: 1 2 3 4
Jun 15 09:43:33 yamato-u5-sp101 corosync[14848]:   [MAIN  ] Completed service synchronization, ready to provide service.

# pveproxy
Code:
systemctl status pveproxy
● pveproxy.service - PVE API Proxy Server
   Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Tue 2021-06-15 10:26:19 WIB; 2h 3min ago
  Process: 3133 ExecStartPre=/usr/bin/pvecm updatecerts --silent (code=exited, status=111)
  Process: 3156 ExecStart=/usr/bin/pveproxy start (code=exited, status=0/SUCCESS)
  Process: 31394 ExecStop=/usr/bin/pveproxy stop (code=exited, status=0/SUCCESS)
 Main PID: 3183 (code=exited, status=0/SUCCESS)
    Tasks: 1 (limit: 4915)
   Memory: 34.7M
   CGroup: /system.slice/pveproxy.service

Jun 15 10:26:17 yamato-u5-sp101 pveproxy[3183]: worker 31361 started
Jun 15 10:26:17 yamato-u5-sp101 pveproxy[31361]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/
Jun 15 10:26:17 yamato-u5-sp101 systemd[1]: Stopping PVE API Proxy Server...
Jun 15 10:26:18 yamato-u5-sp101 pveproxy[3183]: received signal TERM
Jun 15 10:26:18 yamato-u5-sp101 pveproxy[3183]: server closing
Jun 15 10:26:18 yamato-u5-sp101 pveproxy[3183]: worker 31361 finished
Jun 15 10:26:18 yamato-u5-sp101 pveproxy[3183]: worker 31348 finished
Jun 15 10:26:18 yamato-u5-sp101 pveproxy[3183]: server stopped
Jun 15 10:26:19 yamato-u5-sp101 systemd[1]: pveproxy.service: Succeeded.
Jun 15 10:26:19 yamato-u5-sp101 systemd[1]: Stopped PVE API Proxy Server.

# pvestatd
Code:
systemctl status pvestatd
● pvestatd.service - PVE Status Daemon
   Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2021-06-15 10:23:19 WIB; 2h 7min ago
  Process: 29453 ExecStart=/usr/bin/pvestatd start (code=exited, status=111)

Jun 15 10:23:19 yamato-u5-sp101 pvestatd[29453]: ipcc_send_rec[1] failed: Connection refused
Jun 15 10:23:19 yamato-u5-sp101 pvestatd[29453]: ipcc_send_rec[1] failed: Connection refused
Jun 15 10:23:19 yamato-u5-sp101 pvestatd[29453]: ipcc_send_rec[2] failed: Connection refused
Jun 15 10:23:19 yamato-u5-sp101 pvestatd[29453]: ipcc_send_rec[2] failed: Connection refused
Jun 15 10:23:19 yamato-u5-sp101 pvestatd[29453]: ipcc_send_rec[3] failed: Connection refused
Jun 15 10:23:19 yamato-u5-sp101 pvestatd[29453]: ipcc_send_rec[3] failed: Connection refused
Jun 15 10:23:19 yamato-u5-sp101 pvestatd[29453]: Unable to load access control list: Connection refused
Jun 15 10:23:19 yamato-u5-sp101 systemd[1]: pvestatd.service: Control process exited, code=exited, status=111/n/a
Jun 15 10:23:19 yamato-u5-sp101 systemd[1]: pvestatd.service: Failed with result 'exit-code'.
Jun 15 10:23:19 yamato-u5-sp101 systemd[1]: Failed to start PVE Status Daemon.

Something strange also showed up in the logs: I found [Tue Jun 15 12:23:40 2021] nfs: server 192.168.222.1 not responding, timed out, although all network components looked fine during inspection, including rpcinfo:

Code:
rpcinfo -p
   program vers proto   port  service
    100000    4   tcp    111  portmapper
    100000    3   tcp    111  portmapper
    100000    2   tcp    111  portmapper
    100000    4   udp    111  portmapper
    100000    3   udp    111  portmapper
    100000    2   udp    111  portmapper
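
Since the kernel log points at an unresponsive NFS server, a few extra client-side checks against 192.168.222.1 (the address from the log above) might help narrow it down; these are only suggestions, not commands from the thread:

Code:
# ask the NFS server which RPC services and exports it currently offers
rpcinfo -p 192.168.222.1
showmount -e 192.168.222.1
# list NFS mounts known to this client
findmnt -t nfs,nfs4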

Now this node is in an unhealthy state. Is there any chance to fix it without rebooting? Is running pmxcfs -l safe? And is it possible to rejoin the cluster?
Any help would be appreciated :)
 
The corosync part seems to be OK;

only pve-cluster / pmxcfs seems to hang.

Have you checked that you don't have any files in /etc/pve/ when it is not mounted by pmxcfs?
Yes, the files in /etc/pve are gone; it is probably caused by the pmxcfs hang or something like that.

Code:
ls -lah /etc/pve
total 8.0K
drwxr-xr-x   2 root root 4.0K Oct 14  2020 .
drwxr-xr-x 102 root root 4.0K Jun 15 13:38 ..
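
An empty directory here only shows that the FUSE mount is gone; to be sure nothing is mounted on /etc/pve one could additionally run something like this (a sketch, not part of the original reply):

Code:
# prints the mount entry if /etc/pve is a mountpoint, otherwise reports it as unmounted
findmnt /etc/pve || echo "/etc/pve is not mounted"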
 
Can you try to launch pmxcfs in debug mode and send the result?

"pmxcfs -d"
I got this result:

Code:
pmxcfs -d
[main] notice: unable to acquire pmxcfs lock - trying again (pmxcfs.c:880:main)
^C

And it seems pmxcfs was still hanging when I submitted this post and could not be killed.

Suddenly, after I ran pmxcfs -d, I was able to restart all components and the node is back in the cluster again (I don't know what happened exactly).
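
Based on the description above, the recovery amounted to restarting the stack in dependency order once the stale pmxcfs finally released its lock; a rough sketch (the exact commands were not posted):

Code:
# start the cluster filesystem first, then the services that depend on it
systemctl restart pve-cluster
systemctl restart pvedaemon pveproxy pvestatd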
 
