Login failure

megusta

Jun 20, 2020
After a few days I can't log in to any account; only SSH still works. After a reboot it works fine again.
I have a fresh install of the latest Proxmox with ZFS on an NVMe SSD (maybe I have to make some special settings for that?).

Restarting pvesr.service doesn't work, I just get an error. ZFS status and scrub show no errors. Manually deleting as root doesn't work either.
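
Roughly what I run to restart it (reconstructed from memory; the journal output below shows the resulting errors):

Code:
sudo systemctl restart pvesr.service
sudo systemctl status pvesr.service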

Code:
$ sudo journalctl -r
-- Logs begin at Thu 2020-06-18 08:56:34 CEST, end at Sat 2020-06-20 09:42:33 CEST. --
Jun 20 09:42:33 prox sudo[9060]: pam_unix(sudo:session): session opened for user root by prox(uid=0)
Jun 20 09:42:33 prox sudo[9060]:     prox : TTY=pts/0 ; PWD=/home/prox ; USER=root ; COMMAND=/usr/bin/journalctl -r
Jun 20 09:42:33 prox pvestatd[1927]: status update time (9.034 seconds)
Jun 20 09:42:33 prox pvestatd[1927]: authkey rotation error: error during cfs-locked 'authkey' operation: got lock request timeout
Jun 20 09:42:32 prox sudo[7764]: pam_unix(sudo:session): session closed for user root
Jun 20 09:42:31 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jun 20 09:42:26 prox pvedaemon[1946]: authentication failure; rhost=192.168.178.2 user=prox@pam msg=Authentication failure
Jun 20 09:42:26 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jun 20 09:42:24 prox IPCC.xs[1946]: pam_unix(common-auth:auth): authentication failure; logname= uid=0 euid=0 tty= ruser= rhost=  user=prox
Jun 20 09:42:23 prox pvestatd[1927]: status update time (9.034 seconds)
Jun 20 09:42:23 prox pvestatd[1927]: authkey rotation error: error during cfs-locked 'authkey' operation: got lock request timeout
Jun 20 09:42:21 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jun 20 09:42:16 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jun 20 09:42:13 prox pvestatd[1927]: status update time (9.033 seconds)
Jun 20 09:42:13 prox pvestatd[1927]: authkey rotation error: error during cfs-locked 'authkey' operation: got lock request timeout
Jun 20 09:42:11 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jun 20 09:42:09 prox systemd[1]: Failed to start Proxmox VE replication runner.
Jun 20 09:42:09 prox systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jun 20 09:42:09 prox systemd[1]: pvesr.service: Main process exited, code=exited, status=5/NOTINSTALLED
Jun 20 09:42:09 prox pvesr[8850]: error during cfs-locked 'file-replication_cfg' operation: got lock request timeout
Jun 20 09:42:08 prox pvesr[8850]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 20 09:42:07 prox pvesr[8850]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 20 09:42:06 prox pvesr[8850]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 20 09:42:06 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jun 20 09:42:05 prox pvesr[8850]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 20 09:42:04 prox pvesr[8850]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 20 09:42:03 prox pvesr[8850]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 20 09:42:03 prox pvestatd[1927]: status update time (9.033 seconds)
Jun 20 09:42:03 prox pvestatd[1927]: authkey rotation error: error during cfs-locked 'authkey' operation: got lock request timeout
Jun 20 09:42:02 prox pvesr[8850]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 20 09:42:01 prox pvesr[8850]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 20 09:42:01 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jun 20 09:42:00 prox pvesr[8850]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 20 09:42:00 prox systemd[1]: Starting Proxmox VE replication runner...
Jun 20 09:41:56 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jun 20 09:41:53 prox pvestatd[1927]: status update time (9.034 seconds)
Jun 20 09:41:53 prox pvestatd[1927]: authkey rotation error: error during cfs-locked 'authkey' operation: got lock request timeout
Jun 20 09:41:51 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jun 20 09:41:46 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jun 20 09:41:43 prox pvestatd[1927]: status update time (9.033 seconds)
Jun 20 09:41:43 prox pvestatd[1927]: authkey rotation error: error during cfs-locked 'authkey' operation: got lock request timeout
Jun 20 09:41:41 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jun 20 09:41:36 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jun 20 09:41:33 prox pvestatd[1927]: status update time (9.033 seconds)
Jun 20 09:41:33 prox pvestatd[1927]: authkey rotation error: error during cfs-locked 'authkey' operation: got lock request timeout
Jun 20 09:41:31 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jun 20 09:41:26 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jun 20 09:41:23 prox pvestatd[1927]: status update time (9.033 seconds)
Jun 20 09:41:23 prox pvestatd[1927]: authkey rotation error: error during cfs-locked 'authkey' operation: got lock request timeout
Jun 20 09:41:21 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jun 20 09:41:16 prox pve-ha-lrm[1961]: unable to write lrm status file - unable to delete old temp file: Input/output error
 
hi,

to me it seems like a problem with the cluster filesystem.

maybe you can try:
Code:
systemctl stop pve-cluster
rm -f /var/lib/pve-cluster/.pmxcfs.lockfile
systemctl start pve-cluster

edit: do each step on all nodes separately if you're using a cluster
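
a slightly more defensive variant (just a sketch; it only removes the lock file once pmxcfs has actually exited):

Code:
systemctl stop pve-cluster
# only touch the lock file if no pmxcfs process is left running
pgrep -a pmxcfs || rm -f /var/lib/pve-cluster/.pmxcfs.lockfile
systemctl start pve-cluster
systemctl status pve-cluster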
 

Great, that works, but it isn't very convenient! I could run a cronjob, but that's only a crutch. Do you have a better solution?
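
For what it's worth, the cronjob crutch I have in mind would look roughly like this (purely illustrative; it only restarts pve-cluster when the lock timeout shows up in the journal and doesn't fix the root cause):

Code:
#!/bin/sh
# hypothetical /usr/local/sbin/pve-login-watchdog.sh
# example /etc/cron.d entry: */5 * * * * root /usr/local/sbin/pve-login-watchdog.sh
if journalctl --since "10 minutes ago" | grep -q "got lock request timeout"; then
    systemctl restart pve-cluster
fi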
 
well, i'm not sure about the root cause behind this beyond the fact that the /etc/pve cluster filesystem (backed by /var/lib/pve-cluster/config.db) is likely getting full... (this file is limited to 30M)

how long does it usually last until this problem reoccurs?

are you holding any big files in /etc/pve ? du -h /etc/pve

what is the size of the config.db file? ls -al /var/lib/pve-cluster/
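
you can also check which entries actually take up space inside config.db by querying a copy of it with sqlite3 (assuming sqlite3 is installed and the usual pmxcfs schema with a 'tree' table; work on a copy, never on the live file):

Code:
cp /var/lib/pve-cluster/config.db /tmp/config.db.copy
sqlite3 /tmp/config.db.copy "SELECT name, length(data) AS bytes FROM tree ORDER BY bytes DESC LIMIT 10;"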
 
how long does it usually last until this problem reoccurs?

I'm not sure, maybe 2-3 days. (I'm noting a timestamp now: 29.06.2020.)
At the moment it works fine; I'm only running 4 VMs. I've disabled automatic backups in Proxmox; could backups cause config.db to fill up?

Code:
$ du -h /etc/pve

0       /etc/pve/ha
0       /etc/pve/priv/lock
0       /etc/pve/priv/acme
2.5K    /etc/pve/priv
0       /etc/pve/sdn
0       /etc/pve/nodes/prox/priv
0       /etc/pve/nodes/prox/openvz
0       /etc/pve/nodes/prox/lxc
2.0K    /etc/pve/nodes/prox/qemu-server
4.5K    /etc/pve/nodes/prox
4.5K    /etc/pve/nodes
0       /etc/pve/virtual-guest
11K     /etc/pve

Code:
$ ls -al /var/lib/pve-cluster/

total 278
drwxr-xr-x  2 root root       6 Jun 29 14:07 .
drwxr-xr-x 37 root root      37 Jun  1 11:13 ..
-rw-------  1 root root   36864 Jun 30 13:34 config.db
-rw-------  1 root root   32768 Jun 30 13:34 config.db-shm
-rw-------  1 root root 4120032 Jun 30 13:34 config.db-wal
-rw-------  1 root root       0 Jun 29 14:07 .pmxcfs.lockfile
 
I've disabled automatic backups in Proxmox; could backups cause config.db to fill up?

no, i don't think so (since backups are stored separately)

looks good for now. if the problem reoccurs, can you check the last two commands' output again?
 
Sorry to bump an old thread, but I started to have the same error. I followed the instructions from post #2 and boom, web GUI access is restored. Thank you.
 
Same issue here and the solution above worked. Now, why is it happening? Makes me nervous.
 
Hi,
please post the output of
Code:
journalctl -b -u pve-cluster.service -u corosync.service
pveversion -v
If you already rebooted since the issue happened use -b-<number of boots> for the first command instead. Did you try restarting the service without touching the lock file before? Otherwise we can't be sure it's even related to the lock.

If this ever happens again and if restarting pve-cluster.service alone doesn't help, you can check with lsof /var/lib/pve-cluster/.pmxcfs.lockfile and fuser -vau /var/lib/pve-cluster/.pmxcfs.lockfile if something else is holding the lock file.
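
For example (assuming exactly one reboot since the issue; adjust the boot offset otherwise):

Code:
# previous boot instead of the current one
journalctl -b -1 -u pve-cluster.service -u corosync.service
# check what is holding the pmxcfs lock file
lsof /var/lib/pve-cluster/.pmxcfs.lockfile
fuser -vau /var/lib/pve-cluster/.pmxcfs.lockfile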
 
I had issues logging in to the web UI today. Via SSH, su - was working fine.

Standalone node - no cluster.

No obvious space issues or kernel messages (checked with # journalctl --since='5 days ago' -k).

Based on the advice from post #2 I decided to run # systemctl restart pve-cluster before anything else, and that resolved the issue. The journal was clean afterwards and logins started working again.

It would be cool if this could attempt to self-heal?

Looks like the following log entry could be related to the root cause?

Code:
Jan 20 14:25:48 viper pmxcfs[40572]: [database] crit: commit transaction failed: disk I/O error#010

Noteworthy diagnostic commands and messages from the journal:

Code:
root@viper:~# pveversion
pve-manager/7.3-3/c3928077 (running kernel: 5.15.74-1-pve)

root@viper:~# journalctl -f
-- Journal begins at Sun 2022-02-06 19:34:41 UTC. --
Jan 21 18:38:37 viper pve-ha-lrm[42292]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jan 21 18:38:42 viper pve-ha-lrm[42292]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jan 21 18:38:44 viper pvestatd[41034]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Jan 21 18:38:44 viper pvestatd[41034]: status update time (9.175 seconds)
Jan 21 18:38:47 viper pve-ha-lrm[42292]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jan 21 18:38:52 viper pve-ha-lrm[42292]: unable to write lrm status file - unable to delete old temp file: Input/output error
Jan 21 18:38:54 viper pvestatd[41034]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Jan 21 18:38:54 viper pvestatd[41034]: status update time (9.207 seconds)

... this kept repeating

root@viper:~# ls -alh /var/lib/pve-cluster/config.db
-rw------- 1 root root 82K Jan 20 14:24 /var/lib/pve-cluster/config.db


root@viper:~# journalctl -b -u pve-cluster.service -u corosync.service
-- Journal begins at Sun 2022-02-06 19:34:41 UTC, ends at Sat 2023-01-21 18:43:02 UTC. --
Jan 06 17:32:04 viper systemd[1]: Starting The Proxmox VE cluster filesystem...
Jan 06 17:32:05 viper systemd[1]: Started The Proxmox VE cluster filesystem.
Jan 06 17:32:05 viper systemd[1]: Condition check resulted in Corosync Cluster Engine being skipped.
Jan 06 17:32:06 viper pmxcfs[40572]: [main] notice: ignore insert of duplicate cluster log
Jan 20 14:25:48 viper pmxcfs[40572]: [database] crit: commit transaction failed: disk I/O error#010
Jan 20 14:25:48 viper pmxcfs[40572]: [database] crit: rollback transaction failed: cannot rollback - no transaction is active#010


root@viper:~# lsof /var/lib/pve-cluster/.pmxcfs.lockfile
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
pmxcfs  40572 root    3u   REG   0,26        0 57704 /var/lib/pve-cluster/.pmxcfs.lockfile
root@viper:~# fuser -vau /var/lib/pve-cluster/.pmxcfs.lockfile
                     USER        PID ACCESS COMMAND
/var/lib/pve-cluster/.pmxcfs.lockfile:
                     root      40572 F.... (root)pmxcfs
 
Hi,
Getting I/O errors could mean that one of your disks is nearing its end. Please check e.g. /var/log/syslog for further information and whether your disks are still healthy using e.g. smartctl. There are tools like ddrescue to salvage data.
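
For example (device names are placeholders, adjust them to your disks):

Code:
# SMART health summary and full attribute dump (example device names)
smartctl -H /dev/nvme0n1
smartctl -a /dev/nvme0n1
# for a ZFS setup, also check the pool itself
zpool status -v rpool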
 
Hi,

Thanks @fiona - I checked a few relevant metrics for the rpool mirror - everything seems OK.
 
Glad to hear! Let's hope the IO error was just transient or not low-level.
 
