Login to Proxmox GUI fails in single-node setup due to timeout of pve-cluster service

siebo

Hi forum,

For a couple of months now I have had an intermittent issue with my single-node Proxmox server.
Every once in a while, the login to the Proxmox GUI fails with the following error message:

Login failed. Please try again.

Of course, the password is fine, the realm is fine, ... and the system is also up to date:
pve-manager/8.1.4/ec5affc9e41f1d79 (running kernel: 6.5.11-7-pve)

As I could still log in via SSH, I was able to do some further investigation.
After some research, I found that the issue can be resolved by restarting the pve-cluster service.
However, the messages shown by journalctl -f indicate there might be some bug or misconfiguration present which I cannot identify and mitigate on my own:

Code:
root@pmx:~# systemctl restart pve-cluster.service && journalctl -f
Jan 29 13:56:34 pmx systemd[1]: pve-cluster.service: Killing process 3105109 (cfs_loop) with signal SIGKILL.
Jan 29 13:56:34 pmx systemd[1]: pve-cluster.service: Main process exited, code=killed, status=9/KILL
Jan 29 13:56:34 pmx systemd[1]: pve-cluster.service: Failed with result 'timeout'.
Jan 29 13:56:34 pmx systemd[1]: Stopped pve-cluster.service - The Proxmox VE cluster filesystem.
Jan 29 13:56:34 pmx systemd[1]: pve-cluster.service: Consumed 4min 7.698s CPU time.
Jan 29 13:56:34 pmx systemd[1]: Starting pve-cluster.service - The Proxmox VE cluster filesystem...
Jan 29 13:56:34 pmx pmxcfs[610665]: [main] notice: resolved node name 'pmx' to '192.168.253.2' for default node IP address
Jan 29 13:56:34 pmx pmxcfs[610665]: [main] notice: resolved node name 'pmx' to '192.168.253.2' for default node IP address
Jan 29 13:56:35 pmx systemd[1]: Started pve-cluster.service - The Proxmox VE cluster filesystem.
Jan 29 13:56:35 pmx systemd[1]: corosync.service - Corosync Cluster Engine was skipped because of an unmet condition check (ConditionPathExists=/etc/corosync/corosync.conf).
Jan 29 13:56:35 pmx pvestatd[2442]: status update time (14.695 seconds)
Jan 29 13:56:35 pmx pve-firewall[2440]: firewall update time (8.015 seconds)
Jan 29 13:56:35 pmx pvestatd[2442]: auth key pair too old, rotating..
Jan 29 13:56:57 pmx pvedaemon[2346307]: <root@pam> successful auth for user 'root@pam'
Jan 29 13:57:04 pmx pvescheduler[612014]: <root@pam> starting task UPID:pmx:000956AF:02FCE544:65B7A0A0:vzdump::root@pam:
Jan 29 13:57:04 pmx pvescheduler[612015]: INFO: starting new backup job: vzdump --all 1 --mailto *** --compress zstd --storage backup --quiet 1 --mailnotification failure --mode snapshot --prune-backups 'keep-last=5'
Jan 29 13:57:04 pmx pvescheduler[612015]: INFO: Starting Backup of VM 101 (qemu)
Jan 29 13:57:37 pmx pvescheduler[612015]: INFO: Finished Backup of VM 101 (00:00:33)
Jan 29 13:57:37 pmx pvescheduler[612015]: INFO: Starting Backup of VM 102 (lxc)
[...]

I see that the first part of the restart (shutting down the service if it is still running) seems to fail due to a timeout, which I guess shouldn't happen:
systemd[1]: pve-cluster.service: Failed with result 'timeout'.
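
For reference, the stop timeout systemd applies to the unit can be inspected like this (just a sketch; the actual value depends on the unit file and the systemd defaults on your system):

Code:
root@pmx:~# systemctl cat pve-cluster.service                              # full unit file, incl. TimeoutStopSec if set
root@pmx:~# systemctl show pve-cluster.service --property=TimeoutStopUSec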

The underlying cause of the failing login seems to be an outdated auth key pair, which is renewed immediately after the service restart:
pvestatd[2442]: auth key pair too old, rotating..
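
The age of the key pair used for signing GUI login tickets can be checked directly on the cluster filesystem (standard paths on a PVE install; just a sketch):

Code:
root@pmx:~# stat -c '%y  %n' /etc/pve/authkey.pub /etc/pve/priv/authkey.key   # modification time = last rotation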

My last observation is that immediately after the service restart, backup jobs start which are not supposed to run at this time of day. So I suspect this might also have to do with some interrupted/failed backup job?!
pvescheduler[612014]: <root@pam> starting task UPID:pmx:000956AF:02FCE544:65B7A0A0:vzdump::root@pam:
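
The configured backup schedule (and whether the job is set to catch up on missed runs) can be checked in the job definitions on the cluster filesystem (a sketch; whether a repeat-missed flag is present depends on how the job was created):

Code:
root@pmx:~# cat /etc/pve/jobs.cfg                                # vzdump job definitions incl. schedule
root@pmx:~# grep -E 'schedule|repeat-missed' /etc/pve/jobs.cfg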

I would be happy if you could help me understand what exactly fails here, so that I can find a permanent solution and restore the ability to log in.

Thank you and best regards,
Siebo
 
Please post the logs from before you attempted the restart of pve-cluster.
 
Hi Fabian,

Yes, sorry, you are completely right. After looking through the logs, I am now sure the issue is the backup consuming too much space.

Code:
Jan 27 03:32:59 pmx pvescheduler[4030922]: INFO: Finished Backup of VM 103 (00:00:30)
Jan 27 03:32:59 pmx pvescheduler[4030922]: INFO: Starting Backup of VM 104 (lxc)
Jan 27 03:38:01 pmx pvescheduler[4030922]: INFO: Finished Backup of VM 104 (00:05:02)
Jan 27 03:38:01 pmx pvescheduler[4030922]: INFO: Starting Backup of VM 105 (lxc)
Jan 27 03:38:38 pmx pvescheduler[4030922]: INFO: Finished Backup of VM 105 (00:00:37)
Jan 27 03:38:38 pmx pvescheduler[4030922]: INFO: Starting Backup of VM 106 (lxc)
Jan 27 03:38:54 pmx pvescheduler[4030922]: INFO: Finished Backup of VM 106 (00:00:16)
Jan 27 03:38:54 pmx pvescheduler[4030922]: INFO: Starting Backup of VM 107 (qemu)
Jan 27 03:42:30 pmx pve-firewall[2440]: firewall update time (6.280 seconds)
Jan 27 03:42:30 pmx pvestatd[2442]: status update time (6.809 seconds)
Jan 27 03:42:38 pmx smartd[1812]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 60 to 49
Jan 27 03:46:11 pmx pveproxy[3504272]: Warning: unable to close filehandle GEN5 properly: No space left on device at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1901.
Jan 27 03:46:11 pmx pveproxy[3504272]: error writing access log
Jan 27 03:46:11 pmx rsyslogd[1809]: file '/var/log/syslog'[8] write error - see https://www.rsyslog.com/solving-rsyslog-write-errors/ for help OS error: No space left on device [v8.2302.0 try https://www.rsyslog.com/e/2027 ]
Jan 27 03:46:11 pmx rsyslogd[1809]: action 'action-0-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check for additional error messages before this one. [v8.2302.0 try https://www.rsyslog.com/e/2027 ]
Jan 27 03:46:11 pmx rsyslogd[1809]: rsyslogd[internal_messages]: 569 messages lost due to rate-limiting (500 allowed within 5 seconds)
[...]
Jan 27 03:46:11 pmx pmxcfs[3105107]: [database] crit: commit transaction failed: database or disk is full#010
Jan 27 03:46:11 pmx pmxcfs[3105107]: [database] crit: rollback transaction failed: cannot rollback - no transaction is active#010
Jan 27 03:46:11 pmx pvescheduler[4030922]: unable to delete old temp file: Input/output error
[...]
Jan 27 03:46:12 pmx rsyslogd[1809]: file '/var/log/syslog'[8] write error - see https://www.rsyslog.com/solving-rsyslog-write-errors/ for help OS error: No space left on device [v8.2302.0 try https://www.rsyslog.com/e/2027 ]
Jan 27 03:46:12 pmx rsyslogd[1809]: action 'action-0-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check for additional error messages before this one. [v8.2302.0 try https://www.rsyslog.com/e/2027 ]
Jan 27 03:46:12 pmx rsyslogd[1809]: file '/var/log/syslog'[8] write error - see https://www.rsyslog.com/solving-rsyslog-write-errors/ for help OS error: No space left on device [v8.2302.0 try https://www.rsyslog.com/e/2027 ]
Jan 27 03:46:12 pmx pvescheduler[4030922]: ERROR: Backup of VM 107 failed - vma_queue_write: write error - Broken pipe
Jan 27 03:46:12 pmx pvescheduler[4030922]: ERROR: Backup of VM 108 failed - unable to create temporary directory '/rpool/backup/dump/vzdump-qemu-108-2024_01_27-03_46_12.tmp' at /usr/share/perl5/PVE/VZDump.pm line 1005.
Jan 27 03:46:12 pmx pvescheduler[4030922]: ERROR: Backup of VM 109 failed - unable to create temporary directory '/rpool/backup/dump/vzdump-qemu-109-2024_01_27-03_46_12.tmp' at /usr/share/perl5/PVE/VZDump.pm line 1005.
Jan 27 03:46:12 pmx pvescheduler[4030922]: ERROR: Backup of VM 110 failed - unable to create temporary directory '/var/tmp/vzdumptmp4030922_110' at /usr/share/perl5/PVE/VZDump.pm line 1005.
Jan 27 03:46:12 pmx pvescheduler[4030922]: ERROR: Backup of VM 111 failed - unable to create temporary directory '/var/tmp/vzdumptmp4030922_111' at /usr/share/perl5/PVE/VZDump.pm line 1005.
Jan 27 03:46:12 pmx pvescheduler[4030922]: INFO: Backup job finished with errors
Jan 27 03:46:12 pmx postfix/postdrop[4071802]: warning: mail_queue_enter: create file maildrop/170592.4071802: No space left on device
Jan 27 03:46:13 pmx pveproxy[3504272]: worker exit
Jan 27 03:46:13 pmx pveproxy[3504274]: Warning: unable to close filehandle GEN5 properly: No space left on device at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1901.
Jan 27 03:46:13 pmx pveproxy[3504274]: error writing access log
Jan 27 03:46:13 pmx pveproxy[3037345]: worker 3504272 finished
Jan 27 03:46:13 pmx pveproxy[3037345]: starting 1 worker(s)
Jan 27 03:46:13 pmx pveproxy[3037345]: worker 4071881 started
Jan 27 03:46:13 pmx pve-ha-lrm[2486]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pmx/lrm_status.tmp.2486' - Input/output error
Jan 27 03:46:18 pmx pve-ha-lrm[2486]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pmx/lrm_status.tmp.2486' - Input/output error
Jan 27 03:46:22 pmx pvescheduler[4030922]: job errors
Jan 27 03:46:22 pmx postfix/pickup[3913091]: 2D1E81279B: uid=0 from=<***>
Jan 27 03:46:22 pmx postfix/cleanup[4072259]: 2D1E81279B: message-id=<20240127024622.2D1E81279B@pmx.dev.null>
Jan 27 03:46:22 pmx postfix/qmgr[2428]: 2D1E81279B: from=<***>, size=24637, nrcpt=1 (queue active)
Jan 27 03:46:22 pmx postfix/smtp[4072261]: Trusted TLS connection established to smtp.gmail.com[64.233.184.109]:465: TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256
Jan 27 03:46:22 pmx postfix/smtp[4072261]: 2D1E81279B: to=<***>, relay=smtp.gmail.com[64.233.184.109]:465, delay=11, delays=10/0.05/0.21/0.42, dsn=2.0.0, status=sent (250 2.0.0 OK  1706323582 m5-20020a05600c3b0500b0040d81ca11casm3324746wms.28 - gsmtp)
Jan 27 03:46:22 pmx postfix/qmgr[2428]: 2D1E81279B: removed
Jan 27 03:46:23 pmx pve-ha-lrm[2486]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pmx/lrm_status.tmp.2486' - Input/output error
Jan 27 03:46:28 pmx pve-ha-lrm[2486]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pmx/lrm_status.tmp.2486' - Input/output error
[...]
Jan 29 13:39:20 pmx pve-ha-lrm[2486]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pmx/lrm_status.tmp.2486' - Input/output error
Jan 29 13:39:25 pmx pve-ha-lrm[2486]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pmx/lrm_status.tmp.2486' - Input/output error
Jan 29 13:39:29 pmx pvestatd[2442]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Jan 29 13:39:29 pmx pvestatd[2442]: status update time (9.243 seconds)
Jan 29 13:39:30 pmx pve-ha-lrm[2486]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pmx/lrm_status.tmp.2486' - Input/output error
Jan 29 13:39:35 pmx pve-ha-lrm[2486]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pmx/lrm_status.tmp.2486' - Input/output error
Jan 29 13:39:40 pmx pvestatd[2442]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Jan 29 13:39:40 pmx pvestatd[2442]: status update time (9.246 seconds)
Jan 29 13:39:40 pmx pve-ha-lrm[2486]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pmx/lrm_status.tmp.2486' - Input/output error
Jan 29 13:39:45 pmx pve-ha-lrm[2486]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pmx/lrm_status.tmp.2486' - Input/output error
Jan 29 13:39:49 pmx pvestatd[2442]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
[...]
Jan 29 13:56:05 pmx systemd[1]: Started user@0.service - User Manager for UID 0.
Jan 29 13:56:05 pmx systemd[1]: Started session-4289.scope - Session 4289 of User root.
Jan 29 13:56:05 pmx sshd[609296]: pam_env(sshd:session): deprecated reading of user environment enabled
Jan 29 13:56:06 pmx pve-ha-lrm[2486]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pmx/lrm_status.tmp.2486' - Input/output error
Jan 29 13:56:09 pmx pvestatd[2442]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Jan 29 13:56:09 pmx pvestatd[2442]: status update time (9.260 seconds)
Jan 29 13:56:11 pmx pve-ha-lrm[2486]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pmx/lrm_status.tmp.2486' - Input/output error
Jan 29 13:56:13 pmx pvescheduler[609376]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Jan 29 13:56:13 pmx pvescheduler[609375]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Jan 29 13:56:16 pmx pve-ha-lrm[2486]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pmx/lrm_status.tmp.2486' - Input/output error
Jan 29 13:56:19 pmx pvestatd[2442]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Jan 29 13:56:19 pmx pvestatd[2442]: status update time (9.242 seconds)
Jan 29 13:56:21 pmx pve-ha-lrm[2486]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pmx/lrm_status.tmp.2486' - Input/output error
Jan 29 13:56:24 pmx systemd[1]: Stopping pve-cluster.service - The Proxmox VE cluster filesystem...
Jan 29 13:56:24 pmx pmxcfs[3105107]: [main] notice: teardown filesystem
Jan 29 13:56:24 pmx systemd[1]: etc-pve.mount: Deactivated successfully.
Jan 29 13:56:29 pmx pvestatd[2442]: authkey rotation error: cfs-lock 'authkey' error: no quorum!
Jan 29 13:56:34 pmx systemd[1]: pve-cluster.service: State 'stop-sigterm' timed out. Killing.
Jan 29 13:56:34 pmx systemd[1]: pve-cluster.service: Killing process 3105107 (pmxcfs) with signal SIGKILL.
Jan 29 13:56:34 pmx systemd[1]: pve-cluster.service: Killing process 3105109 (cfs_loop) with signal SIGKILL.
Jan 29 13:56:34 pmx systemd[1]: pve-cluster.service: Main process exited, code=killed, status=9/KILL
Jan 29 13:56:34 pmx systemd[1]: pve-cluster.service: Failed with result 'timeout'.
Jan 29 13:56:34 pmx systemd[1]: Stopped pve-cluster.service - The Proxmox VE cluster filesystem.
Jan 29 13:56:34 pmx systemd[1]: pve-cluster.service: Consumed 4min 7.698s CPU time.
Jan 29 13:56:34 pmx systemd[1]: Starting pve-cluster.service - The Proxmox VE cluster filesystem...
Jan 29 13:56:34 pmx pmxcfs[610665]: [main] notice: resolved node name 'pmx' to '192.168.253.2' for default node IP address
Jan 29 13:56:34 pmx pmxcfs[610665]: [main] notice: resolved node name 'pmx' to '192.168.253.2' for default node IP address
Jan 29 13:56:35 pmx systemd[1]: Started pve-cluster.service - The Proxmox VE cluster filesystem.
Jan 29 13:56:35 pmx systemd[1]: corosync.service - Corosync Cluster Engine was skipped because of an unmet condition check (ConditionPathExists=/etc/corosync/corosync.conf).
Jan 29 13:56:35 pmx pvestatd[2442]: status update time (14.695 seconds)
Jan 29 13:56:35 pmx pve-firewall[2440]: firewall update time (8.015 seconds)
Jan 29 13:56:35 pmx pvestatd[2442]: auth key pair too old, rotating..
Jan 29 13:56:57 pmx pvedaemon[2346307]: <root@pam> successful auth for user 'root@pam'
Jan 29 13:57:04 pmx pvescheduler[612014]: <root@pam> starting task UPID:pmx:000956AF:02FCE544:65B7A0A0:vzdump::root@pam:
Jan 29 13:57:04 pmx pvescheduler[612015]: INFO: starting new backup job: vzdump --all 1 --mailto *** --compress zstd --storage backup --quiet 1 --mailnotification failure --mode snapshot --prune-backups 'keep-last=5'
Jan 29 13:57:04 pmx pvescheduler[612015]: INFO: Starting Backup of VM 101 (qemu)
Jan 29 13:57:37 pmx pvescheduler[612015]: INFO: Finished Backup of VM 101 (00:00:33)
Jan 29 13:57:37 pmx pvescheduler[612015]: INFO: Starting Backup of VM 102 (lxc)
[...]
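
For anyone running into the same thing: a quick way to confirm which filesystem actually filled up would be something like this (a sketch, assuming a default ZFS install with the rpool and the backup path from the log above):

Code:
root@pmx:~# df -h / /var/lib/pve-cluster /rpool/backup    # /var/lib/pve-cluster holds the pmxcfs database that failed to commit
root@pmx:~# zfs list -r -o space rpool                    # per-dataset space accounting on the pool
root@pmx:~# pvesm status                                  # free space as seen by the PVE storage layer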

I will investigate this further; so far I was sure I had set things up in a way that backups shouldn't interfere with any operational storage space, but we'll see...
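
One way to enforce that separation (just a sketch, assuming rpool/backup is its own ZFS dataset as the dump path in the log suggests; the 200G figure is only an example) would be a quota on the backup dataset, so a runaway backup fails early instead of filling the whole pool:

Code:
root@pmx:~# zfs set quota=200G rpool/backup             # example cap on the backup dataset
root@pmx:~# zfs get quota,used,available rpool/backup

Combined with the existing 'keep-last=5' prune setting, that should stop the dumps from eating into space the rest of the system depends on.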

Best regards,
Siebo
 
yeah that sounds like a plausible cause ;)
 
