Login failure

megusta

Jun 20, 2020
After a few days I can't log in to any account; only SSH still works. After a reboot it works fine again.
I have a fresh install of the latest Proxmox with ZFS on an NVMe SSD (maybe I have to make some special settings for that?).

Restarting pvesr.service doesn't work, I just get an error. ZFS status and scrub show no errors. Manually deleting as root doesn't work either.
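
Roughly what I run to restart it (reconstructed from memory; the journal output below shows the resulting errors):

Code:
sudo systemctl restart pvesr.service
sudo systemctl status pvesr.service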

Code:
$ sudo journalctl -r
-- Logs begin at Thu 2020-06-18 08:56:34 CEST, end at Sat 2020-06-20 09:42:33 CEST. --
Jun 20 09:42:33 prox sudo[9060]: pam_unix(sudo:session): session opened for user root by prox(uid=0)
Jun 20 09:42:33 prox sudo[9060]:     prox : TTY=pts/0 ; PWD=/home/prox ; USER=root ; COMMAND=/usr/bin/journalctl -r
Jun 20 09:42:33 prox pvestatd[1927]: status update time (9.034 seconds)
Jun 20 09:42:33 prox pvestatd[1927]: authkey rotation error: error during cfs-locked 'authkey' operation: got lock request timeout
Jun 20 09:42:32 prox sudo[7764]: pam_unix(sudo:session): session closed for user root
Jun 20 09:42:31 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jun 20 09:42:26 prox pvedaemon[1946]: authentication failure; rhost=192.168.178.2 user=prox@pam msg=Authentication failure
Jun 20 09:42:26 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jun 20 09:42:24 prox IPCC.xs[1946]: pam_unix(common-auth:auth): authentication failure; logname= uid=0 euid=0 tty= ruser= rhost=  user=prox
Jun 20 09:42:23 prox pvestatd[1927]: status update time (9.034 seconds)
Jun 20 09:42:23 prox pvestatd[1927]: authkey rotation error: error during cfs-locked 'authkey' operation: got lock request timeout
Jun 20 09:42:21 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jun 20 09:42:16 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jun 20 09:42:13 prox pvestatd[1927]: status update time (9.033 seconds)
Jun 20 09:42:13 prox pvestatd[1927]: authkey rotation error: error during cfs-locked 'authkey' operation: got lock request timeout
Jun 20 09:42:11 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jun 20 09:42:09 prox systemd[1]: Failed to start Proxmox VE replication runner.
Jun 20 09:42:09 prox systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jun 20 09:42:09 prox systemd[1]: pvesr.service: Main process exited, code=exited, status=5/NOTINSTALLED
Jun 20 09:42:09 prox pvesr[8850]: error during cfs-locked 'file-replication_cfg' operation: got lock request timeout
Jun 20 09:42:08 prox pvesr[8850]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 20 09:42:07 prox pvesr[8850]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 20 09:42:06 prox pvesr[8850]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 20 09:42:06 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jun 20 09:42:05 prox pvesr[8850]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 20 09:42:04 prox pvesr[8850]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 20 09:42:03 prox pvesr[8850]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 20 09:42:03 prox pvestatd[1927]: status update time (9.033 seconds)
Jun 20 09:42:03 prox pvestatd[1927]: authkey rotation error: error during cfs-locked 'authkey' operation: got lock request timeout
Jun 20 09:42:02 prox pvesr[8850]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 20 09:42:01 prox pvesr[8850]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 20 09:42:01 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jun 20 09:42:00 prox pvesr[8850]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 20 09:42:00 prox systemd[1]: Starting Proxmox VE replication runner...
Jun 20 09:41:56 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jun 20 09:41:53 prox pvestatd[1927]: status update time (9.034 seconds)
Jun 20 09:41:53 prox pvestatd[1927]: authkey rotation error: error during cfs-locked 'authkey' operation: got lock request timeout
Jun 20 09:41:51 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jun 20 09:41:46 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jun 20 09:41:43 prox pvestatd[1927]: status update time (9.033 seconds)
Jun 20 09:41:43 prox pvestatd[1927]: authkey rotation error: error during cfs-locked 'authkey' operation: got lock request timeout
Jun 20 09:41:41 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jun 20 09:41:36 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jun 20 09:41:33 prox pvestatd[1927]: status update time (9.033 seconds)
Jun 20 09:41:33 prox pvestatd[1927]: authkey rotation error: error during cfs-locked 'authkey' operation: got lock request timeout
Jun 20 09:41:31 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jun 20 09:41:26 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jun 20 09:41:23 prox pvestatd[1927]: status update time (9.033 seconds)
Jun 20 09:41:23 prox pvestatd[1927]: authkey rotation error: error during cfs-locked 'authkey' operation: got lock request timeout
Jun 20 09:41:21 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jun 20 09:41:16 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
 
hi,

to me it seems like a problem with the cluster filesystem.

maybe you can try:
Code:
systemctl stop pve-cluster
rm -f /var/lib/pve-cluster/.pmxcfs.lockfile
systemctl start pve-cluster

edit: do each step on all nodes separately if you're using a cluster
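
a slightly more defensive variant (just a sketch; it only removes the lock file once pmxcfs has actually exited):

Code:
systemctl stop pve-cluster
# only touch the lock file if no pmxcfs process is left running
pgrep -a pmxcfs || rm -f /var/lib/pve-cluster/.pmxcfs.lockfile
systemctl start pve-cluster
systemctl status pve-cluster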
 

Great, that works, but it isn't very convenient! I could run a cronjob, but that's only a crutch. Do you have a better solution?
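
For what it's worth, the cronjob crutch I have in mind would look roughly like this (purely illustrative; it only restarts pve-cluster when the lock timeout shows up in the journal and doesn't fix the root cause):

Code:
#!/bin/sh
# hypothetical /usr/local/sbin/pve-login-watchdog.sh
# example /etc/cron.d entry: */5 * * * * root /usr/local/sbin/pve-login-watchdog.sh
if journalctl --since "10 minutes ago" | grep -q "got lock request timeout"; then
    systemctl restart pve-cluster
fi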
 
well, i'm not sure about the root cause behind this beyond the fact that the /etc/pve cluster filesystem (backed by /var/lib/pve-cluster/config.db) is likely getting full... (this file is limited to 30M)

how long does it usually last until this problem reoccurs?

are you holding any big files in /etc/pve ? du -h /etc/pve

what is the size of the config.db file? ls -al /var/lib/pve-cluster/
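
you can also check which entries actually take up space inside config.db by querying a copy of it with sqlite3 (assuming sqlite3 is installed and the usual pmxcfs schema with a 'tree' table; work on a copy, never on the live file):

Code:
cp /var/lib/pve-cluster/config.db /tmp/config.db.copy
sqlite3 /tmp/config.db.copy "SELECT name, length(data) AS bytes FROM tree ORDER BY bytes DESC LIMIT 10;"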
 
how long does it usually last until this problem reoccurs?

I'm not sure, maybe 2-3 days. (I'm noting a timestamp now: 29.06.2020.)
At the moment it works fine; I'm only running 4 VMs. I've disabled automatic backups in Proxmox; could backups cause config.db to fill up?

Code:
$ du -h /etc/pve

0       /etc/pve/ha
0       /etc/pve/priv/lock
0       /etc/pve/priv/acme
2.5K    /etc/pve/priv
0       /etc/pve/sdn
0       /etc/pve/nodes/prox/priv
0       /etc/pve/nodes/prox/openvz
0       /etc/pve/nodes/prox/lxc
2.0K    /etc/pve/nodes/prox/qemu-server
4.5K    /etc/pve/nodes/prox
4.5K    /etc/pve/nodes
0       /etc/pve/virtual-guest
11K     /etc/pve

Code:
$ ls -al /var/lib/pve-cluster/

total 278
drwxr-xr-x  2 root root       6 Jun 29 14:07 .
drwxr-xr-x 37 root root      37 Jun  1 11:13 ..
-rw-------  1 root root   36864 Jun 30 13:34 config.db
-rw-------  1 root root   32768 Jun 30 13:34 config.db-shm
-rw-------  1 root root 4120032 Jun 30 13:34 config.db-wal
-rw-------  1 root root       0 Jun 29 14:07 .pmxcfs.lockfile
 
I've disabled automatic backups in Proxmox; could backups cause config.db to fill up?

no, i don't think so (since backups are stored separately)

looks good for now. if the problem reoccurs, can you check the last two commands' output again?
 
Sorry to bump an old thread, but I started to have the same error. I followed the instructions from post #2 and boom, web GUI access is restored. Thank you.
 
Same issue here and the solution above worked. Now, why is it happening? Makes me nervous.
 
Hi,
please post the output of
Code:
journalctl -b -u pve-cluster.service -u corosync.service
pveversion -v
If you already rebooted since the issue happened use -b-<number of boots> for the first command instead. Did you try restarting the service without touching the lock file before? Otherwise we can't be sure it's even related to the lock.

If this ever happens again and if restarting pve-cluster.service alone doesn't help, you can check with lsof /var/lib/pve-cluster/.pmxcfs.lockfile and fuser -vau /var/lib/pve-cluster/.pmxcfs.lockfile if something else is holding the lock file.
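
For example (assuming exactly one reboot since the issue; adjust the boot offset otherwise):

Code:
# previous boot instead of the current one
journalctl -b -1 -u pve-cluster.service -u corosync.service
# check what is holding the pmxcfs lock file
lsof /var/lib/pve-cluster/.pmxcfs.lockfile
fuser -vau /var/lib/pve-cluster/.pmxcfs.lockfile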
 
I had issues logging in to the web UI today. Via SSH, su - was working fine.

Standalone node - no cluster.

No obvious space issues or kernel messages (checked with # journalctl --since='5 days ago' -k).

Based on the advice from post #2 I decided to run # systemctl restart pve-cluster before anything else, and that resolved the issue. The journal was clean afterwards and logins started working again.

It would be cool if this could attempt to self-heal?

Looks like the following log entry could be related to the root cause?

Code:
Jan 20 14:25:48 viper pmxcfs[40572]: [database] crit: commit transaction failed: disk I/O error#010

Noteworthy diagnostic commands and messages from the journal:

Code:
root@viper:~# pveversion
pve-manager/7.3-3/c3928077 (running kernel: 5.15.74-1-pve)

root@viper:~# journalctl -f
-- Journal begins at Sun 2022-02-06 19:34:41 UTC. --
Jan 21 18:38:37 viper pve-ha-lrm[42292]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jan 21 18:38:42 viper pve-ha-lrm[42292]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jan 21 18:38:44 viper pvestatd[41034]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Jan 21 18:38:44 viper pvestatd[41034]: status update time (9.175 seconds)
Jan 21 18:38:47 viper pve-ha-lrm[42292]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jan 21 18:38:52 viper pve-ha-lrm[42292]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jan 21 18:38:54 viper pvestatd[41034]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Jan 21 18:38:54 viper pvestatd[41034]: status update time (9.207 seconds)

... this kept repeating

root@viper:~# ls -alh /var/lib/pve-cluster/config.db
-rw------- 1 root root 82K Jan 20 14:24 /var/lib/pve-cluster/config.db


root@viper:~# journalctl -b -u pve-cluster.service -u corosync.service
-- Journal begins at Sun 2022-02-06 19:34:41 UTC, ends at Sat 2023-01-21 18:43:02 UTC. --
Jan 06 17:32:04 viper systemd[1]: Starting The Proxmox VE cluster filesystem...
Jan 06 17:32:05 viper systemd[1]: Started The Proxmox VE cluster filesystem.
Jan 06 17:32:05 viper systemd[1]: Condition check resulted in Corosync Cluster Engine being skipped.
Jan 06 17:32:06 viper pmxcfs[40572]: [main] notice: ignore insert of duplicate cluster log
Jan 20 14:25:48 viper pmxcfs[40572]: [database] crit: commit transaction failed: disk I/O error#010
Jan 20 14:25:48 viper pmxcfs[40572]: [database] crit: rollback transaction failed: cannot rollback - no transaction is active#010


root@viper:~# lsof /var/lib/pve-cluster/.pmxcfs.lockfile
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
pmxcfs  40572 root    3u   REG   0,26        0 57704 /var/lib/pve-cluster/.pmxcfs.lockfile
root@viper:~# fuser -vau /var/lib/pve-cluster/.pmxcfs.lockfile
                     USER        PID ACCESS COMMAND
/var/lib/pve-cluster/.pmxcfs.lockfile:
                     root      40572 F.... (root)pmxcfs
 
Hi,
Getting I/O errors could mean that one of your disks is nearing its end. Please check e.g. /var/log/syslog for further information and whether your disks are still healthy using e.g. smartctl. There are tools like ddrescue to salvage data.
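
For example (device names are placeholders, adjust them to your disks):

Code:
# SMART health summary and full attribute dump (example device names)
smartctl -H /dev/nvme0n1
smartctl -a /dev/nvme0n1
# for a ZFS setup, also check the pool itself
zpool status -v rpool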
 
Hi,

Thanks @fiona - I checked a few relevant metrics for the rpool mirror - everything seems OK.
 
Glad to hear! Let's hope the IO error was just transient or not low-level.
 
