Corosync OK but node won't come back up in the cluster

dewangga

Hi,

I'm having an issue: one of my four nodes is broken. I tried restarting several components (pve-cluster, pvestatd, pvedaemon, and pveproxy) and also made sure corosync was quorate by invoking corosync-quorumtool -s.
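For reference, on a quorate node corosync-quorumtool -s should report something like the sketch below (the values are illustrative, filled in to match the four-node membership visible in the corosync log further down):

Code:
# corosync-quorumtool -s
Quorum information
------------------
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          3
Ring ID:          1.9cb
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3
Flags:            Quorate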

Code:
# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.3.18-2-pve)
pve-manager: 6.1-7 (running version: 6.1-7/13e58d5e)
pve-kernel-helper: 6.1-6
pve-kernel-5.3: 6.1-5
pve-kernel-5.3.18-2-pve: 5.3.18-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-13
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-4
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-21
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.1-3
pve-ha-manager: 3.0-8
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-3
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-6
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve2

Logs from each component:

# pve-cluster
Code:
systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2021-06-15 10:26:20 WIB; 2h 2min ago
  Process: 26604 ExecStart=/usr/bin/pmxcfs (code=exited, status=255/EXCEPTION)
    Tasks: 1 (limit: 4915)
   Memory: 10.6M
   CGroup: /system.slice/pve-cluster.service
           └─26384 /usr/bin/pmxcfs

Jun 15 10:26:00 yamato-u5-sp101 pmxcfs[26604]: [main] crit: unable to acquire pmxcfs lock: Resource temporarily unavailable
Jun 15 10:26:00 yamato-u5-sp101 pmxcfs[26604]: [main] notice: exit proxmox configuration filesystem (-1)
Jun 15 10:26:00 yamato-u5-sp101 pmxcfs[26604]: [main] notice: exit proxmox configuration filesystem (-1)
Jun 15 10:26:00 yamato-u5-sp101 systemd[1]: pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION
Jun 15 10:26:00 yamato-u5-sp101 systemd[1]: pve-cluster.service: Killing process 26384 (pmxcfs) with signal SIGKILL.
Jun 15 10:26:10 yamato-u5-sp101 systemd[1]: pve-cluster.service: Processes still around after SIGKILL. Ignoring.
Jun 15 10:26:10 yamato-u5-sp101 systemd[1]: pve-cluster.service: Killing process 26384 (pmxcfs) with signal SIGKILL.
Jun 15 10:26:20 yamato-u5-sp101 systemd[1]: pve-cluster.service: Processes still around after final SIGKILL. Entering failed mode.
Jun 15 10:26:20 yamato-u5-sp101 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Jun 15 10:26:20 yamato-u5-sp101 systemd[1]: Stopped The Proxmox VE cluster filesystem.
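A note on the "unable to acquire pmxcfs lock" message above: pmxcfs keeps a lock file under /var/lib/pve-cluster/, so a leftover pmxcfs process that systemd could not kill (see the repeated SIGKILL lines) will block every new instance. Assuming fuser from the psmisc package is installed, something like this should show the holder:

Code:
# list leftover pmxcfs processes and their state (D state = stuck in the kernel, often on I/O such as NFS)
ps -eo pid,stat,wchan:30,cmd | grep '[p]mxcfs'
# show which process still holds the lock file (default path assumed)
fuser -v /var/lib/pve-cluster/.pmxcfs.lockfile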

# corosync
Code:
systemctl status corosync
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2021-06-15 09:43:30 WIB; 2h 46min ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
 Main PID: 14848 (corosync)
    Tasks: 9 (limit: 4915)
   Memory: 128.6M
   CGroup: /system.slice/corosync.service
           └─14848 /usr/sbin/corosync -f

Jun 15 09:43:31 yamato-u5-sp101 corosync[14848]:   [KNET  ] pmtud: PMTUD link change for host: 4 link: 0 from 469 to 1397
Jun 15 09:43:31 yamato-u5-sp101 corosync[14848]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Jun 15 09:43:33 yamato-u5-sp101 corosync[14848]:   [TOTEM ] A new membership (1.9cb) was formed. Members joined: 1 2 4
Jun 15 09:43:33 yamato-u5-sp101 corosync[14848]:   [CPG   ] downlist left_list: 0 received
Jun 15 09:43:33 yamato-u5-sp101 corosync[14848]:   [CPG   ] downlist left_list: 0 received
Jun 15 09:43:33 yamato-u5-sp101 corosync[14848]:   [CPG   ] downlist left_list: 0 received
Jun 15 09:43:33 yamato-u5-sp101 corosync[14848]:   [CPG   ] downlist left_list: 0 received
Jun 15 09:43:33 yamato-u5-sp101 corosync[14848]:   [QUORUM] This node is within the primary component and will provide service.
Jun 15 09:43:33 yamato-u5-sp101 corosync[14848]:   [QUORUM] Members[4]: 1 2 3 4
Jun 15 09:43:33 yamato-u5-sp101 corosync[14848]:   [MAIN  ] Completed service synchronization, ready to provide service.
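Since corosync reports a healthy four-node membership, its knet link state can be cross-checked as well:

Code:
# per-link connectivity status as seen from this node
corosync-cfgtool -s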

# pveproxy
Code:
systemctl status pveproxy
● pveproxy.service - PVE API Proxy Server
   Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Tue 2021-06-15 10:26:19 WIB; 2h 3min ago
  Process: 3133 ExecStartPre=/usr/bin/pvecm updatecerts --silent (code=exited, status=111)
  Process: 3156 ExecStart=/usr/bin/pveproxy start (code=exited, status=0/SUCCESS)
  Process: 31394 ExecStop=/usr/bin/pveproxy stop (code=exited, status=0/SUCCESS)
 Main PID: 3183 (code=exited, status=0/SUCCESS)
    Tasks: 1 (limit: 4915)
   Memory: 34.7M
   CGroup: /system.slice/pveproxy.service

Jun 15 10:26:17 yamato-u5-sp101 pveproxy[3183]: worker 31361 started
Jun 15 10:26:17 yamato-u5-sp101 pveproxy[31361]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/
Jun 15 10:26:17 yamato-u5-sp101 systemd[1]: Stopping PVE API Proxy Server...
Jun 15 10:26:18 yamato-u5-sp101 pveproxy[3183]: received signal TERM
Jun 15 10:26:18 yamato-u5-sp101 pveproxy[3183]: server closing
Jun 15 10:26:18 yamato-u5-sp101 pveproxy[3183]: worker 31361 finished
Jun 15 10:26:18 yamato-u5-sp101 pveproxy[3183]: worker 31348 finished
Jun 15 10:26:18 yamato-u5-sp101 pveproxy[3183]: server stopped
Jun 15 10:26:19 yamato-u5-sp101 systemd[1]: pveproxy.service: Succeeded.
Jun 15 10:26:19 yamato-u5-sp101 systemd[1]: Stopped PVE API Proxy Server.

# pvestatd
Code:
systemctl status pvestatd
● pvestatd.service - PVE Status Daemon
   Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2021-06-15 10:23:19 WIB; 2h 7min ago
  Process: 29453 ExecStart=/usr/bin/pvestatd start (code=exited, status=111)

Jun 15 10:23:19 yamato-u5-sp101 pvestatd[29453]: ipcc_send_rec[1] failed: Connection refused
Jun 15 10:23:19 yamato-u5-sp101 pvestatd[29453]: ipcc_send_rec[1] failed: Connection refused
Jun 15 10:23:19 yamato-u5-sp101 pvestatd[29453]: ipcc_send_rec[2] failed: Connection refused
Jun 15 10:23:19 yamato-u5-sp101 pvestatd[29453]: ipcc_send_rec[2] failed: Connection refused
Jun 15 10:23:19 yamato-u5-sp101 pvestatd[29453]: ipcc_send_rec[3] failed: Connection refused
Jun 15 10:23:19 yamato-u5-sp101 pvestatd[29453]: ipcc_send_rec[3] failed: Connection refused
Jun 15 10:23:19 yamato-u5-sp101 pvestatd[29453]: Unable to load access control list: Connection refused
Jun 15 10:23:19 yamato-u5-sp101 systemd[1]: pvestatd.service: Control process exited, code=exited, status=111/n/a
Jun 15 10:23:19 yamato-u5-sp101 systemd[1]: pvestatd.service: Failed with result 'exit-code'.
Jun 15 10:23:19 yamato-u5-sp101 systemd[1]: Failed to start PVE Status Daemon.

Something strange also appeared in the logs: I found [Tue Jun 15 12:23:40 2021] nfs: server 192.168.222.1 not responding, timed out, even though all network components were fine during inspection, including rpcinfo:

Code:
rpcinfo -p
   program vers proto   port  service
    100000    4   tcp    111  portmapper
    100000    3   tcp    111  portmapper
    100000    2   tcp    111  portmapper
    100000    4   udp    111  portmapper
    100000    3   udp    111  portmapper
    100000    2   udp    111  portmapper

Now this node is in an unhealthy state. Any chance to fix it without rebooting? Is pmxcfs -l safe to run? And is it possible to rejoin the cluster?
Any help would be appreciated :)
 
The corosync part seems to be OK;

only pve-cluster / pmxcfs seems to hang.

Have you checked that you don't have any files in /etc/pve/ while it is not mounted by pmxcfs?
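For reference, a quick way to check whether /etc/pve is actually mounted:

Code:
# shows the pmxcfs fuse mount while pmxcfs is running; no output means /etc/pve is not mounted
findmnt /etc/pve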
Yes, the files in /etc/pve are gone; it must have been caused by the pmxcfs hang or something similar.

Code:
ls -lah /etc/pve
total 8.0K
drwxr-xr-x   2 root root 4.0K Oct 14  2020 .
drwxr-xr-x 102 root root 4.0K Jun 15 13:38 ..
 
Can you try to launch pmxcfs in debug mode and send the result?

"pmxcfs -d"
I got this result:

Code:
pmxcfs -d
[main] notice: unable to acquire pmxcfs lock - trying again (pmxcfs.c:880:main)
^C

And pmxcfs still seems to be hanging as I submit this post; it can't be killed.

Suddenly, after I ran pmxcfs -d, I was able to restart all components and the node is back in the cluster again (I don't know what happened, exactly).
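For anyone who hits the same state, the sequence below is a rough sketch of the usual recovery, assuming a stock PVE 6.x install: first make sure the stale pmxcfs process is really gone, then restart the cluster filesystem and the services that depend on it.

Code:
# check for a leftover pmxcfs process (kill -9 <pid> only as a last resort)
ps -eo pid,stat,cmd | grep '[p]mxcfs'
# restart the cluster filesystem, then its dependents
systemctl restart pve-cluster
systemctl restart pvestatd pvedaemon pveproxy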