Consequences of Restarting pve-cluster.service?

crt0r

Hello, folks!

Recently, I came across a problem (the cause itself doesn't matter in this context) that involved the disks rapidly filling up to nearly 100% on several independent Proxmox VE servers in different cities.

I've freed a considerable amount of disk space on these servers, so that's ok now.
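
A quick way to double-check the current usage on the affected nodes (a minimal sketch; the mount points listed are just my assumption about what matters here, not the nodes' actual storage layout):

Code:
#!/usr/bin/env python3
"""Print disk usage for a few mount points.
Sketch only: the mount points below are an assumption; adjust to the real layout."""
import shutil

for path in ("/", "/var/lib/vz"):  # assumed mount points, not confirmed
    usage = shutil.disk_usage(path)
    pct = usage.used / usage.total * 100
    free_gib = usage.free / 2**30
    print(f"{path}: {pct:.1f}% used, {free_gib:.1f} GiB free")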

However, the web interface on some of these nodes doesn't allow any users to log in, and the PVE config files can't be edited. The /etc/pve filesystem throws I/O errors on writes, and errors like the following appear regularly in the system journal:

Code:
-- Journal begins at Thu 2022-06-09 00:00:50 <edited>, ends at Thu 2024-02-22 16:13:10 <edit>. --
Feb 13 00:22:30 pve pmxcfs[1569]: [database] crit: commit transaction failed: database or disk is full#010
Feb 13 00:22:30 pve pmxcfs[1569]: [database] crit: rollback transaction failed: cannot rollback - no transaction is active#010
Feb 13 00:22:30 pve pvescheduler[3785489]: unable to delete old temp file: Input/output error
Feb 13 00:22:30 pve rsyslogd[1243]: file '/var/log/syslog'[8] write error - see https://www.rsyslog.com/solving-rsyslog-write-errors/ for help OS error: No space left on device [v8.2102.0 try https://www.rsyslog.com/e/2027 ]
Feb 13 00:22:30 pve rsyslogd[1243]: action 'action-1-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check for additional error messages before this one. [v8.2102.0 try https://www.rsyslog.com/e/2027 ]
Feb 13 00:22:33 pve pve-ha-lrm[1760]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve/lrm_status.tmp.1760' - Input/output error
Feb 13 00:22:34 pve pvescheduler[3785489]: ERROR: Backup of VM 100 failed - vma_queue_write: write error - Broken pipe
Feb 13 00:22:34 pve pvescheduler[3785489]: INFO: Backup job finished with errors
Feb 13 00:22:34 pve pvescheduler[3785489]: job errors
Feb 13 00:22:38 pve pve-ha-lrm[1760]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve/lrm_status.tmp.1760' - Input/output error
Feb 13 00:22:43 pve pve-ha-lrm[1760]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve/lrm_status.tmp.1760' - Input/output error
Feb 13 00:22:48 pve pve-ha-lrm[1760]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve/lrm_status.tmp.1760' - Input/output error
Feb 13 00:22:53 pve pve-ha-lrm[1760]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve/lrm_status.tmp.1760' - Input/output error
Feb 13 00:22:58 pve pve-ha-lrm[1760]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve/lrm_status.tmp.1760' - Input/output error
Feb 13 00:23:03 pve pve-ha-lrm[1760]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve/lrm_status.tmp.1760' - Input/output error
Feb 13 00:23:08 pve pve-ha-lrm[1760]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve/lrm_status.tmp.1760' - Input/output error
Feb 13 00:23:13 pve pvescheduler[559668]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Feb 13 00:23:13 pve pvescheduler[559667]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Feb 13 00:23:13 pve pve-ha-lrm[1760]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve/lrm_status.tmp.1760' - Input/output error
Feb 13 00:23:18 pve pve-ha-lrm[1760]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve/lrm_status.tmp.1760' - Input/output error
Feb 13 00:23:23 pve pve-ha-lrm[1760]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve/lrm_status.tmp.1760' - Input/output error
Feb 13 00:23:28 pve pve-ha-lrm[1760]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve/lrm_status.tmp.1760' - Input/output error
Feb 13 00:23:33 pve pve-ha-lrm[1760]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve/lrm_status.tmp.1760' - Input/output error
Feb 13 00:23:38 pve pve-ha-lrm[1760]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve/lrm_status.tmp.1760' - Input/output error
Feb 13 00:23:43 pve pve-ha-lrm[1760]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve/lrm_status.tmp.1760' - Input/output error
Feb 13 00:23:48 pve pve-ha-lrm[1760]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve/lrm_status.tmp.1760' - Input/output error
Feb 13 00:23:53 pve pve-ha-lrm[1760]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve/lrm_status.tmp.1760' - Input/output error

It seems that people are able to solve similar problems by restarting pve-cluster.service.
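
Before touching the service, I was thinking of doing a read-only sanity check of the database that backs /etc/pve, since the journal shows a failed commit there. A minimal sketch, assuming the default pmxcfs database location /var/lib/pve-cluster/config.db:

Code:
#!/usr/bin/env python3
"""Read-only integrity check of the pmxcfs backing database.
Sketch only: assumes the default path /var/lib/pve-cluster/config.db."""
import sqlite3

DB_PATH = "/var/lib/pve-cluster/config.db"  # default location (assumption)

# Open read-only so the running pmxcfs process isn't disturbed.
conn = sqlite3.connect(f"file:{DB_PATH}?mode=ro", uri=True)
try:
    result = conn.execute("PRAGMA integrity_check;").fetchone()[0]
    print(f"integrity_check: {result}")  # "ok" means the database itself looks intact
finally:
    conn.close()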

Could restarting this service bring down the whole system, or the VMs and LXC containers running on it? Could it cause the system configuration to go haywire, given that the node has been running in this crippled state for more than two weeks?

Someone on the Proxmox forum stated that it should be fine, but still...
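
For reference, this is roughly the restart-and-verify sequence I'd run if it's considered safe (a sketch; it only assumes a systemd-based node and that /etc/pve is the FUSE filesystem provided by pmxcfs):

Code:
#!/usr/bin/env python3
"""Restart pve-cluster and verify that /etc/pve comes back as a mount.
Sketch only: assumes systemd and that pmxcfs provides /etc/pve via FUSE."""
import subprocess
import time

def etc_pve_mounted() -> bool:
    """True if /etc/pve appears as a fuse mount in /proc/mounts."""
    with open("/proc/mounts") as mounts:
        for line in mounts:
            fields = line.split()
            if fields[1] == "/etc/pve" and "fuse" in fields[2]:
                return True
    return False

# Restart only the cluster configuration filesystem service.
subprocess.run(["systemctl", "restart", "pve-cluster"], check=True)
time.sleep(5)  # give pmxcfs a moment to remount /etc/pve

state = subprocess.run(["systemctl", "is-active", "pve-cluster"],
                       capture_output=True, text=True).stdout.strip()
print(f"pve-cluster: {state}")
print(f"/etc/pve mounted: {etc_pve_mounted()}")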

Proxmox VE version: 7.2.1

P.S. I'm not a Proxmox VE admin. I'm a DevOps engineer and was asked to help some folks at my company who aren't very familiar with Linux.
P.P.S. Originally asked this question here.