Hi,
I am having an issue: one of my four nodes is broken. I tried restarting several components (pve-cluster, pvestatd, pvedaemon and pveproxy) and also made sure corosync was quorate by running corosync-quorumtool -s.
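For context, the restarts and the quorum check were along these lines (a rough sketch from memory, not a verbatim transcript):
Code:
# restart the PVE services on the affected node
systemctl restart pve-cluster pvestatd pvedaemon pveproxy

# confirm the cluster is still quorate from corosync's point of view
corosync-quorumtool -s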
Code:
# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.3.18-2-pve)
pve-manager: 6.1-7 (running version: 6.1-7/13e58d5e)
pve-kernel-helper: 6.1-6
pve-kernel-5.3: 6.1-5
pve-kernel-5.3.18-2-pve: 5.3.18-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-13
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-4
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-21
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.1-3
pve-ha-manager: 3.0-8
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-3
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-6
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve2
Logs from each component are:
# pve-cluster
Code:
systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Tue 2021-06-15 10:26:20 WIB; 2h 2min ago
Process: 26604 ExecStart=/usr/bin/pmxcfs (code=exited, status=255/EXCEPTION)
Tasks: 1 (limit: 4915)
Memory: 10.6M
CGroup: /system.slice/pve-cluster.service
└─26384 /usr/bin/pmxcfs
Jun 15 10:26:00 yamato-u5-sp101 pmxcfs[26604]: [main] crit: unable to acquire pmxcfs lock: Resource temporarily unavailable
Jun 15 10:26:00 yamato-u5-sp101 pmxcfs[26604]: [main] notice: exit proxmox configuration filesystem (-1)
Jun 15 10:26:00 yamato-u5-sp101 pmxcfs[26604]: [main] notice: exit proxmox configuration filesystem (-1)
Jun 15 10:26:00 yamato-u5-sp101 systemd[1]: pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION
Jun 15 10:26:00 yamato-u5-sp101 systemd[1]: pve-cluster.service: Killing process 26384 (pmxcfs) with signal SIGKILL.
Jun 15 10:26:10 yamato-u5-sp101 systemd[1]: pve-cluster.service: Processes still around after SIGKILL. Ignoring.
Jun 15 10:26:10 yamato-u5-sp101 systemd[1]: pve-cluster.service: Killing process 26384 (pmxcfs) with signal SIGKILL.
Jun 15 10:26:20 yamato-u5-sp101 systemd[1]: pve-cluster.service: Processes still around after final SIGKILL. Entering failed mode.
Jun 15 10:26:20 yamato-u5-sp101 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Jun 15 10:26:20 yamato-u5-sp101 systemd[1]: Stopped The Proxmox VE cluster filesystem.
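One more observation: the old pmxcfs process (PID 26384) even survives SIGKILL, which as far as I understand usually means it is stuck in uninterruptible sleep (D state), e.g. on a hung NFS mount. I plan to confirm that with something like:
Code:
# check the state of the leftover pmxcfs process; "D" in the STAT column
# would mean uninterruptible sleep, which SIGKILL cannot interrupt
ps -o pid,stat,wchan:32,cmd -C pmxcfs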
# corosync
Code:
systemctl status corosync
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2021-06-15 09:43:30 WIB; 2h 46min ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 14848 (corosync)
Tasks: 9 (limit: 4915)
Memory: 128.6M
CGroup: /system.slice/corosync.service
└─14848 /usr/sbin/corosync -f
Jun 15 09:43:31 yamato-u5-sp101 corosync[14848]: [KNET ] pmtud: PMTUD link change for host: 4 link: 0 from 469 to 1397
Jun 15 09:43:31 yamato-u5-sp101 corosync[14848]: [KNET ] pmtud: Global data MTU changed to: 1397
Jun 15 09:43:33 yamato-u5-sp101 corosync[14848]: [TOTEM ] A new membership (1.9cb) was formed. Members joined: 1 2 4
Jun 15 09:43:33 yamato-u5-sp101 corosync[14848]: [CPG ] downlist left_list: 0 received
Jun 15 09:43:33 yamato-u5-sp101 corosync[14848]: [CPG ] downlist left_list: 0 received
Jun 15 09:43:33 yamato-u5-sp101 corosync[14848]: [CPG ] downlist left_list: 0 received
Jun 15 09:43:33 yamato-u5-sp101 corosync[14848]: [CPG ] downlist left_list: 0 received
Jun 15 09:43:33 yamato-u5-sp101 corosync[14848]: [QUORUM] This node is within the primary component and will provide service.
Jun 15 09:43:33 yamato-u5-sp101 corosync[14848]: [QUORUM] Members[4]: 1 2 3 4
Jun 15 09:43:33 yamato-u5-sp101 corosync[14848]: [MAIN ] Completed service synchronization, ready to provide service.
# pveproxy
Code:
systemctl status pveproxy
● pveproxy.service - PVE API Proxy Server
Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Tue 2021-06-15 10:26:19 WIB; 2h 3min ago
Process: 3133 ExecStartPre=/usr/bin/pvecm updatecerts --silent (code=exited, status=111)
Process: 3156 ExecStart=/usr/bin/pveproxy start (code=exited, status=0/SUCCESS)
Process: 31394 ExecStop=/usr/bin/pveproxy stop (code=exited, status=0/SUCCESS)
Main PID: 3183 (code=exited, status=0/SUCCESS)
Tasks: 1 (limit: 4915)
Memory: 34.7M
CGroup: /system.slice/pveproxy.service
Jun 15 10:26:17 yamato-u5-sp101 pveproxy[3183]: worker 31361 started
Jun 15 10:26:17 yamato-u5-sp101 pveproxy[31361]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/
Jun 15 10:26:17 yamato-u5-sp101 systemd[1]: Stopping PVE API Proxy Server...
Jun 15 10:26:18 yamato-u5-sp101 pveproxy[3183]: received signal TERM
Jun 15 10:26:18 yamato-u5-sp101 pveproxy[3183]: server closing
Jun 15 10:26:18 yamato-u5-sp101 pveproxy[3183]: worker 31361 finished
Jun 15 10:26:18 yamato-u5-sp101 pveproxy[3183]: worker 31348 finished
Jun 15 10:26:18 yamato-u5-sp101 pveproxy[3183]: server stopped
Jun 15 10:26:19 yamato-u5-sp101 systemd[1]: pveproxy.service: Succeeded.
Jun 15 10:26:19 yamato-u5-sp101 systemd[1]: Stopped PVE API Proxy Server.
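If I understand it correctly, pveproxy cannot read /etc/pve/local/pve-ssl.key simply because /etc/pve is the FUSE mount provided by pmxcfs, which is down. A quick way to confirm would presumably be:
Code:
# /etc/pve is a FUSE filesystem served by pmxcfs; with pmxcfs down the mount
# is either gone or stale, so every PVE daemon loses its config and keys
findmnt /etc/pve
ls -la /etc/pve   # may hang or error out if the mount is stale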
# pvestatd
Code:
systemctl status pvestatd
● pvestatd.service - PVE Status Daemon
Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Tue 2021-06-15 10:23:19 WIB; 2h 7min ago
Process: 29453 ExecStart=/usr/bin/pvestatd start (code=exited, status=111)
Jun 15 10:23:19 yamato-u5-sp101 pvestatd[29453]: ipcc_send_rec[1] failed: Connection refused
Jun 15 10:23:19 yamato-u5-sp101 pvestatd[29453]: ipcc_send_rec[1] failed: Connection refused
Jun 15 10:23:19 yamato-u5-sp101 pvestatd[29453]: ipcc_send_rec[2] failed: Connection refused
Jun 15 10:23:19 yamato-u5-sp101 pvestatd[29453]: ipcc_send_rec[2] failed: Connection refused
Jun 15 10:23:19 yamato-u5-sp101 pvestatd[29453]: ipcc_send_rec[3] failed: Connection refused
Jun 15 10:23:19 yamato-u5-sp101 pvestatd[29453]: ipcc_send_rec[3] failed: Connection refused
Jun 15 10:23:19 yamato-u5-sp101 pvestatd[29453]: Unable to load access control list: Connection refused
Jun 15 10:23:19 yamato-u5-sp101 systemd[1]: pvestatd.service: Control process exited, code=exited, status=111/n/a
Jun 15 10:23:19 yamato-u5-sp101 systemd[1]: pvestatd.service: Failed with result 'exit-code'.
Jun 15 10:23:19 yamato-u5-sp101 systemd[1]: Failed to start PVE Status Daemon.
Another strange thing in the logs: I found "[Tue Jun 15 12:23:40 2021] nfs: server 192.168.222.1 not responding, timed out", although all network components were fine during inspection, including rpcinfo (local output below; my checks against the server itself are sketched after it):
Code:
rpcinfo -p
program vers proto port service
100000 4 tcp 111 portmapper
100000 3 tcp 111 portmapper
100000 2 tcp 111 portmapper
100000 4 udp 111 portmapper
100000 3 udp 111 portmapper
100000 2 udp 111 portmapper
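For completeness, my checks against the NFS server itself (192.168.222.1) were roughly these (again a sketch from memory, not an exact transcript):
Code:
# basic reachability of the NFS server
ping -c 3 192.168.222.1

# RPC services registered on the server side
rpcinfo -p 192.168.222.1

# exports the server claims to offer
showmount -e 192.168.222.1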
Now this node is in an unhealthy state. Is there any chance to fix it without rebooting? Is running pmxcfs -l safe here? And would the node be able to rejoin the cluster afterwards?
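For reference, this is the sequence I have in mind (based on my reading of the pmxcfs docs; please correct me if any step is unsafe while the old process refuses to die):
Code:
# stop the unit so systemd does not fight over a manually started instance
systemctl stop pve-cluster

# the old pmxcfs is supposed to be gone at this point; on this node it is not,
# which is exactly what I am unsure about
ps -C pmxcfs

# start the cluster filesystem in local mode, ignoring quorum
pmxcfs -l

# once /etc/pve is usable again, stop the local instance and return to the unit
killall pmxcfs
systemctl start pve-cluster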
Any help would be appreciated.