9-node setup - 4 without quorum, 1 ran out of space in /boot

Gh0st

Hi,

This morning I found that one of my nodes had run out of space in /boot. I removed some old kernels to free it up, but now I can't start any VMs on that node. Three other nodes show as offline in the GUI even though they are actually online. When I try to restart the cluster services on the node that ran out of space, I see this:

Mar 14 07:47:19 xxxx pmxcfs[897]: [quorum] crit: quorum_initialize failed: 2
Mar 14 07:47:19 xxxx pmxcfs[897]: [quorum] crit: can't initialize service
Mar 14 07:47:19 xxxx pmxcfs[897]: [confdb] crit: cmap_initialize failed: 2
Mar 14 07:47:19 xxxx pmxcfs[897]: [confdb] crit: can't initialize service
Mar 14 07:47:19 xxxx pmxcfs[897]: [dcdb] crit: cpg_initialize failed: 2
Mar 14 07:47:19 xxxx pmxcfs[897]: [dcdb] crit: can't initialize service
Mar 14 07:47:19 xxxx pmxcfs[897]: [status] crit: cpg_initialize failed: 2
Mar 14 07:47:19 xxxx pmxcfs[897]: [status] crit: can't initialize service
Mar 14 07:47:20 xxxx systemd[1]: Started The Proxmox VE cluster filesystem.
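
For reference, this is roughly what I did to free up /boot before trying the restart (the kernel package name below is a placeholder, not the exact one I removed - I kept the running kernel and the newest one):

df -h /boot                           # check how full /boot is
dpkg --list | grep pve-kernel         # list installed kernel packages
apt remove pve-kernel-<old-version>-pve   # substitute an old, unused kernel version from the dpkg list
apt autoremove                        # clean up leftover dependencies
update-grub                           # regenerate the boot menu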

Checking corosync.service doesn't show any errors; it's running. In my syslog I see this:

Mar 14 08:11:09 xxxx pvesr[10870]: cfs-lock 'file-replication_cfg' error: no quorum!
Mar 14 08:11:09 xxxx systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Mar 14 08:11:09 xxxx systemd[1]: pvesr.service: Failed with result 'exit-code'.
Mar 14 08:11:09 xxxx systemd[1]: Failed to start Proxmox VE replication runner.
Mar 14 08:12:00 xxxx systemd[1]: Starting Proxmox VE replication runner...
Mar 14 08:12:01 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 10
Mar 14 08:12:02 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 20
Mar 14 08:12:03 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 30
Mar 14 08:12:04 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 40
Mar 14 08:12:05 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 50
Mar 14 08:12:06 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 60
Mar 14 08:12:07 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 70
Mar 14 08:12:08 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 80
Mar 14 08:12:09 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 90
Mar 14 08:12:10 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 100
Mar 14 08:12:10 xxxx pmxcfs[897]: [status] notice: cpg_send_message retried 100 times
Mar 14 08:12:10 xxxx pmxcfs[897]: [status] crit: cpg_send_message failed: 6
Mar 14 08:12:11 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 10
Mar 14 08:12:12 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 20
Mar 14 08:12:13 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 30
Mar 14 08:12:14 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 40
Mar 14 08:12:15 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 50
Mar 14 08:12:16 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 60
Mar 14 08:12:17 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 70
Mar 14 08:12:18 xxxx pmxcfs[897]: [status] notice: cpg_send_message retry 90
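
For what it's worth, this is how I checked corosync on the affected node, and roughly what I mean by "restart the cluster" above (nothing in the corosync output looks wrong to me):

systemctl status corosync --no-pager
journalctl -u corosync -b --no-pager | tail -n 50
systemctl restart corosync pve-cluster    # roughly what I run to restart the cluster stack on this node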

pvecm status

Cluster information
-------------------
Name: Network
Config Version: 21
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Mon Mar 14 08:32:06 2022
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000001
Ring ID: 1.8afc
Quorate: No

Votequorum information
----------------------
Expected votes: 9
Highest expected: 9
Total votes: 1
Quorum: 5 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 xxx.x.xx.xx (local)
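
If I understand the votequorum math right, with 9 expected votes the cluster needs floor(9/2) + 1 = 5 votes, and this node only sees its own single vote, hence "Activity blocked". The same picture should also be visible with corosync's own tool (just a sanity check, no new information):

corosync-quorumtool -s    # shows the same quorate/expected/total vote counts as pvecm status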


pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.162-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-helper: 6.4-15
pve-kernel-5.4: 6.4-12
pve-kernel-5.4.162-1-pve: 5.4.162-2
pve-kernel-5.4.143-1-pve: 5.4.143-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.5-pve2~bpo10+1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve4~bpo10
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.22-pve2~bpo10+1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-4
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.7-pve1

corosync-cfgtool -s
Local node ID 1, transport knet
LINK ID 0 udp
addr = xxx.x.xx.xx
status:
nodeid: 1: localhost
nodeid: 2: connected
nodeid: 3: connected
nodeid: 5: connected
nodeid: 6: connected
nodeid: 7: connected
nodeid: 8: connected
nodeid: 9: connected
nodeid: 10: connected
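
What confuses me is that knet reports every link as connected while votequorum only counts this node's vote. In case it helps, this is what I was planning to compare across the nodes next (I'm assuming corosync.conf and its config_version should be identical everywhere):

corosync-cmapctl | grep -i members               # corosync's runtime view of cluster members
grep config_version /etc/corosync/corosync.conf  # local copy; should match on every node
grep config_version /etc/pve/corosync.conf       # cluster-wide copy in pmxcfs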



Not what you need on a Monday morning! Can anyone help me fix this? I can ping all nodes.
 