9-node setup - 4 without quorum, 1 ran out of space in /boot

Gh0st

Hi,

This morning I found that one of my nodes had run out of space in /boot. I removed some old kernels to free it up, but now I can't start any VMs on that node. Three other nodes show as offline in the GUI even though they are actually online. When I try to restart the cluster services on the node that ran out of space, I see this:

Mar 14 07:47:19 xxxx pmxcfs[897]: [quorum] crit: quorum_initialize failed: 2
Mar 14 07:47:19 xxxx pmxcfs[897]: [quorum] crit: can't initialize service
Mar 14 07:47:19 xxxx pmxcfs[897]: [confdb] crit: cmap_initialize failed: 2
Mar 14 07:47:19 xxxx pmxcfs[897]: [confdb] crit: can't initialize service
Mar 14 07:47:19 xxxx pmxcfs[897]: [dcdb] crit: cpg_initialize failed: 2
Mar 14 07:47:19 xxxx pmxcfs[897]: [dcdb] crit: can't initialize service
Mar 14 07:47:19 xxxx pmxcfs[897]: [status] crit: cpg_initialize failed: 2
Mar 14 07:47:19 xxxx pmxcfs[897]: [status] crit: can't initialize service
Mar 14 07:47:20 xxxx systemd[1]: Started The Proxmox VE cluster filesystem.
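
For reference, this is roughly what I did to free up /boot before trying the restart (the kernel package name below is a placeholder, not the exact one I removed - I kept the running kernel and the newest one):

df -h /boot                           # check how full /boot is
dpkg --list | grep pve-kernel         # list installed kernel packages
apt remove pve-kernel-<old-version>-pve   # substitute an old, unused kernel version from the dpkg list
apt autoremove                        # clean up leftover dependencies
update-grub                           # regenerate the boot menu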

Checking corosync.service doesn't show any errors; it's running. In my syslog I see this:

Mar 14 08:11:09 xxxx pvesr[10870]: cfs-lock 'file-replication_cfg' error: no quorum!
Mar 14 08:11:09 xxxx systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Mar 14 08:11:09 xxxx systemd[1]: pvesr.service: Failed with result 'exit-code'.
Mar 14 08:11:09 xxxx systemd[1]: Failed to start Proxmox VE replication runner.
Mar 14 08:12:00 xxxx systemd[1]: Starting Proxmox VE replication runner...
Mar 14 08:12:01 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 10
Mar 14 08:12:02 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 20
Mar 14 08:12:03 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 30
Mar 14 08:12:04 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 40
Mar 14 08:12:05 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 50
Mar 14 08:12:06 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 60
Mar 14 08:12:07 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 70
Mar 14 08:12:08 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 80
Mar 14 08:12:09 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 90
Mar 14 08:12:10 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 100
Mar 14 08:12:10 xxxx pmxcfs[897]: [status] notice: cpg_send_message retried 100 times
Mar 14 08:12:10 xxxx pmxcfs[897]: [status] crit: cpg_send_message failed: 6
Mar 14 08:12:11 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 10
Mar 14 08:12:12 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 20
Mar 14 08:12:13 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 30
Mar 14 08:12:14 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 40
Mar 14 08:12:15 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 50
Mar 14 08:12:16 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 60
Mar 14 08:12:17 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 70
Mar 14 08:12:18 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 90
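
For what it's worth, this is how I checked corosync on the affected node, and roughly what I mean by "restart the cluster" above (nothing in the corosync output looks wrong to me):

systemctl status corosync --no-pager
journalctl -u corosync -b --no-pager | tail -n 50
systemctl restart corosync pve-cluster    # roughly what I run to restart the cluster stack on this node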

pvecm status

Cluster information
-------------------
Name: Network
Config Version: 21
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Mon Mar 14 08:32:06 2022
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000001
Ring ID: 1.8afc
Quorate: No

Votequorum information
----------------------
Expected votes: 9
Highest expected: 9
Total votes: 1
Quorum: 5 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 xxx.x.xx.xx (local)
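
If I understand the votequorum math right, with 9 expected votes the cluster needs floor(9/2) + 1 = 5 votes, and this node only sees its own single vote, hence "Activity blocked". The same picture should also be visible with corosync's own tool (just a sanity check, no new information):

corosync-quorumtool -s    # shows the same quorate/expected/total vote counts as pvecm status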


pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.162-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-helper: 6.4-15
pve-kernel-5.4: 6.4-12
pve-kernel-5.4.162-1-pve: 5.4.162-2
pve-kernel-5.4.143-1-pve: 5.4.143-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.5-pve2~bpo10+1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve4~bpo10
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.22-pve2~bpo10+1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-4
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.7-pve1

corosync-cfgtool -s
Local node ID 1, transport knet
LINK ID 0 udp
addr = xxx.x.xx.xx
status:
nodeid: 1: localhost
nodeid: 2: connected
nodeid: 3: connected
nodeid: 5: connected
nodeid: 6: connected
nodeid: 7: connected
nodeid: 8: connected
nodeid: 9: connected
nodeid: 10: connected
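
What confuses me is that knet reports every link as connected while votequorum only counts this node's vote. In case it helps, this is what I was planning to compare across the nodes next (I'm assuming corosync.conf and its config_version should be identical everywhere):

corosync-cmapctl | grep -i members               # corosync's runtime view of cluster members
grep config_version /etc/corosync/corosync.conf  # local copy; should match on every node
grep config_version /etc/pve/corosync.conf       # cluster-wide copy in pmxcfs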



Not what you need on a Monday morning! Can anyone help me fix this? I can ping all nodes.
 