Login failed. Please try again

Riesling.Dry

Hi all,
after an apt update earlier today (I don't know whether the update is related, though) I can no longer log in via the GUI/browser and get the above error message.

systemctl status pvedaemon gives me:

msg=Can't call method "sign" on an undefined value at /usr/share/perl5/PVE/Ticket.pm line 73

Code:
● pvedaemon.service - PVE API Daemon
     Loaded: loaded (/lib/systemd/system/pvedaemon.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2022-03-01 15:00:35 CET; 3h 54min ago
   Main PID: 7595 (pvedaemon)
      Tasks: 4 (limit: 9393)
     Memory: 138.0M
        CPU: 5.651s
     CGroup: /system.slice/pvedaemon.service
             ├─7595 pvedaemon
             ├─7596 pvedaemon worker
             ├─7603 pvedaemon worker
             └─7606 pvedaemon worker

Mar 01 15:02:00 golem pvedaemon[7596]: auth key new enough, skipping rotation
Mar 01 15:02:00 golem pvedaemon[7596]: authentication failure; rhost=::ffff:10.0.0.123 user=root@pam msg=Can't call method "sign" on an undefined value at /usr/share/perl5/PVE/Ticket.pm line 73.
Mar 01 15:34:30 golem pvedaemon[7596]: auth key new enough, skipping rotation
Mar 01 15:34:30 golem pvedaemon[7596]: authentication failure; rhost=::ffff:10.0.0.123 user=root@pam msg=Can't call method "sign" on an undefined value at /usr/share/perl5/PVE/Ticket.pm line 73.
Mar 01 18:35:29 golem pvedaemon[7603]: auth key new enough, skipping rotation
Mar 01 18:35:29 golem pvedaemon[7603]: authentication failure; rhost=::ffff:10.0.0.123 user=root@pam msg=Can't call method "sign" on an undefined value at /usr/share/perl5/PVE/Ticket.pm line 73.
Mar 01 18:44:01 golem pvedaemon[7596]: auth key new enough, skipping rotation
Mar 01 18:44:01 golem pvedaemon[7596]: authentication failure; rhost=::ffff:10.0.0.123 user=root@pam msg=Can't call method "sign" on an undefined value at /usr/share/perl5/PVE/Ticket.pm line 73.
Mar 01 18:45:26 golem pvedaemon[7603]: auth key new enough, skipping rotation
Mar 01 18:45:26 golem pvedaemon[7603]: authentication failure; rhost=::ffff:10.0.0.123 user=root@pam msg=Can't call method "sign" on an undefined value at /usr/share/perl5/PVE/Ticket.pm line 73.

What is broken and (how) can it be fixed?
This is a simple PVE 7.1-10 install with 4 LXC containers, none of them currently running.
If I can post anything that helps track down the issue, please do not hesitate to ask :)

Cheers,
~R.
 
Hi,
could you check whether systemctl reload-or-restart pvedaemon.service helps? If not, please share the output of pveversion -v and check (but don't share!) your /etc/pve/priv/authkey.key.
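For reference, a rough sketch of those checks from the shell (the key path is the standard PVE location; don't post the key's contents):

Code:
# restart (or reload) the API daemon
systemctl reload-or-restart pvedaemon.service

# collect version information
pveversion -v

# verify the signing key exists and is non-empty (do not post its contents)
ls -l /etc/pve/priv/authkey.key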
 
Hi,
could you check whether systemctl reload-or-restart pvedaemon.service helps? If not, please share the output of pveversion -v and check (but don't share!) your /etc/pve/priv/authkey.key.

Thank you Fabi!
Yes,
systemctl reload-or-restart pvedaemon.service
fixed it :)
Curious, as several reboots didn't...
Best,
~R.
 
Hi all,
I'm afraid I have to follow up... :(
Different box, different error, same symptoms, i.e. I cannot log in via the GUI/browser anymore and get the above error message:

systemctl status pvedaemon now gives me:
authentication failure; rhost=::ffff:10.0.0.123 user=root@pam msg=cfs-lock 'file-priv_tfa_cfg' error: got lock request timeout
Code:
● pvedaemon.service - PVE API Daemon
Loaded: loaded (/lib/systemd/system/pvedaemon.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2022-03-24 11:11:03 CET; 5 days ago
Process: 2130 ExecStart=/usr/bin/pvedaemon start (code=exited, status=0/SUCCESS)
Process: 1959292 ExecReload=/usr/bin/pvedaemon restart (code=exited, status=0/SUCCESS)
Main PID: 2236 (pvedaemon)
Tasks: 4 (limit: 9406)
Memory: 144.5M
CPU: 1min 114ms
CGroup: /system.slice/pvedaemon.service
├─ 2236 pvedaemon
├─1959363 pvedaemon worker
├─1959364 pvedaemon worker
└─1959365 pvedaemon worker

Mar 27 16:14:43 skynet pvedaemon[2236]: worker 2237 finished
Mar 27 16:14:49 skynet pvedaemon[2239]: authentication failure; rhost=::ffff:10.0.0.125 user=root@pam msg=cfs-lock 'file-priv_tfa_cfg' error: got lock request timeout
Mar 27 16:14:49 skynet pvedaemon[2239]: worker exit
Mar 27 16:14:49 skynet pvedaemon[2236]: worker 2239 finished
Mar 27 16:15:42 skynet pvedaemon[1959363]: <root@pam> successful auth for user 'root@pam'
Mar 27 17:22:58 skynet pvedaemon[1959365]: <root@pam> successful auth for user 'root@pam'
Mar 27 17:23:12 skynet pvedaemon[1959365]: <root@pam> successful auth for user 'root@pam'
Mar 29 11:13:52 skynet pvedaemon[1959365]: authentication failure; rhost=::ffff:10.0.0.123 user=root@pam msg=cfs-lock 'file-priv_tfa_cfg' error: got lock request timeout
Mar 29 11:14:31 skynet pvedaemon[1959364]: authentication failure; rhost=::ffff:10.0.0.123 user=root@pam msg=cfs-lock 'file-priv_tfa_cfg' error: got lock request timeout
lines 1-24

systemctl reload-or-restart pvedaemon.service does NOT fix it this time; a reboot does, though...

Code:
proxmox-ve: 7.1-1 (running kernel: 5.13.19-6-pve)
pve-manager: 7.1-11 (running version: 7.1-11/8d529482)
pve-kernel-helper: 7.1-13
pve-kernel-5.13: 7.1-9
pve-kernel-5.11: 7.0-10
pve-kernel-5.13.19-6-pve: 5.13.19-14
pve-kernel-5.13.19-4-pve: 5.13.19-9
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-1-pve: 5.11.22-2
ceph-fuse: 15.2.13-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-5
libpve-guest-common-perl: 4.1-1
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.1-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-2
proxmox-backup-client: 2.1.5-1
proxmox-backup-file-restore: 2.1.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-7
pve-cluster: 7.1-3
pve-container: 4.1-4
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-6
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.1-2
pve-xtermjs: 4.16.0-1
qemu-server: 7.1-4
smartmontools: 7.2-1
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.2-pve1

Code:
:~# l /etc/pve/priv/authkey.key
15 512 -rw------- 1 www-data 1.7K Mar 27 16:15 /etc/pve/priv/authkey.key
:~# cat /etc/pve/priv/authkey.key
-----BEGIN RSA PRIVATE KEY-----
[RSA PRIVATE DATA goes here.... :]
-----END RSA PRIVATE KEY-----

The box is 85% idle (3.3 wa), so it should answer in time.
Also, I don't think the timeout in PVE is too short - the spinner runs for ~8 sec. before the error is thrown.

What could be the cause, and how can it be fixed (without a reboot :°)?

Cheers,
~R.
 
What kind of second factor are you using? Did anybody/anything (e.g. script) attempt to log in (or do something with the TFA config) at the same time? Does it work if you wait for two minutes or so?
 
Or maybe your system is part of a cluster and you (temporarily/intermittently) lost quorum? Obtaining a cluster-wide lock only works if the cluster is quorate.
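A quick way to check quorum and whether /etc/pve accepts writes (a sketch; pmxcfs only accepts writes while the node is quorate):

Code:
# quorum state - look for "Quorate: Yes" in the output
pvecm status

# write test against the cluster filesystem; remove the file again afterwards
touch /etc/pve/something && rm /etc/pve/something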
 
@fabian what does your reply mean? I just created three nodes and was able to access all three. I created a cluster on the first and added the other two. Now I cannot log in to the WebUI at all. Trying a command such as pveum group add gets a similar error to the one above:

create group failed: cfs-lock 'file-user_cfg' error: got lock request timeout

pvecm status says the cluster is Quorate.

Finally, the documentation here: https://pve.proxmox.com/wiki/Cluster_Manager mentions having to log out of a cluster node and log back in with "cluster credentials." What exactly does that mean? I created no users other than root, which is the user I was logging into the WebUI with.

At the moment, it seems I can do nothing with my cluster except ask for a status.
 
@fabian what does your reply mean? I just created three nodes and was able to access all three. I created a cluster on the first and added the other two. Now I cannot log in to the WebUI at all. Trying a command such as pveum group add gets a similar error to the one above:
It means that one potential cause of that error message is the cluster not being quorate (another one would be lock contention, like @Fabian_E wrote ;)).
create group failed: cfs-lock 'file-user_cfg' error: got lock request timeout

pvecm status says the cluster is Quorate.
Can you write a test file to /etc/pve/something?
Finally, the documentation here: https://pve.proxmox.com/wiki/Cluster_Manager mentions having to logout of a cluster node and log back in with "cluster credentials." What exactly does that mean? I created no users other than the root user at login, which is how I was logging into the WebUI.

At the moment, it seems I can do nothing with my cluster except ask for a status.
It means that when you join a cluster, you might need to log in again afterwards (because the joining node loses its config and gets the one from the cluster, including things like the user.cfg file and the key used to sign session tickets). If you don't have any special config on the joining node, this just means logging in again with the same root user.
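A small sketch for checking whether those files were replaced at join time (the paths are the standard ones under /etc/pve; the timestamps should roughly match the time of the join):

Code:
# user/ACL definitions and the ticket-signing key pair that a cluster join replaces
ls -l /etc/pve/user.cfg /etc/pve/authkey.pub /etc/pve/priv/authkey.key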
 
What kind of second factor are you using? Did anybody/anything (e.g. script) attempt to log in (or do something with the TFA config) at the same time? Does it work if you wait for two minutes or so? Or maybe is your system part of a cluster, and you (temporarily/intermittently) lost quorum? obtaining a cluster wide lock only works if the cluster is quorate..

Thanks, Fabian.

No cluster, no 2FA here, nobody else around who could be trying to log in, and the box is secured behind a firewall...
Waiting (even trying again the next day) doesn't help. Only a reboot seems to fix it, and only temporarily: I rebooted when I posted the follow-up above and it worked then, but now I'm getting the same error again...

This could be relevant:

Code:
:~# touch /etc/pve/something
touch: cannot touch '/etc/pve/something': Input/output error
:oops:

Somehow the directory is unwritable?
Funny dates too:
Code:
:/etc# l pve/

total 17K
   1   0 drwxr-xr-x  2 www-data    0 Jan  1  1970 .
 146 12K drwxr-xr-x 88 root      183 Mar 29 16:06 ..
  40 512 -rw-r-----  1 www-data  451 Mar 29 16:17 authkey.pub
  38 512 -rw-r-----  1 www-data  451 Mar 29 16:17 authkey.pub.old
7269   0 -r--r-----  1 www-data 4.4K Jan  1  1970 .clusterlog
7275 512 -rw-r-----  1 www-data   13 Sep 29  2021 datacenter.cfg
7264   0 -rw-r-----  1 www-data    2 Jan  1  1970 .debug
  22   0 drwxr-xr-x  2 www-data    0 Sep 29  2021 ha
7276 512 -rw-r-----  1 www-data  261 Mar 21 14:01 jobs.cfg
   3   0 lrwxr-xr-x  1 www-data    0 Jan  1  1970 local -> nodes/skynet
2287   0 lrwxr-xr-x  1 www-data    0 Jan  1  1970 lxc -> nodes/skynet/lxc
7266   0 -r--r-----  1 www-data   39 Jan  1  1970 .members
   5   0 drwxr-xr-x  2 www-data    0 Sep 29  2021 nodes
7270   0 lrwxr-xr-x  1 www-data    0 Jan  1  1970 openvz -> nodes/skynet/openvz
   9   0 drwx------  2 www-data    0 Sep 29  2021 priv
  17 512 -rw-r-----  1 www-data 2.1K Sep 29  2021 pve-root-ca.pem
  14 512 -rw-r-----  1 www-data 1.7K Sep 29  2021 pve-www.key
7271   0 lrwxr-xr-x  1 www-data    0 Jan  1  1970 qemu-server -> nodes/skynet/qemu-server
7274   0 -rw-r-----  1 www-data    0 Mar 21 13:57 replication.cfg
7267   0 -r--r-----  1 www-data 1.9K Jan  1  1970 .rrd
  24   0 drwxr-xr-x  2 www-data    0 Sep 29  2021 sdn
7272 512 -rw-r-----  1 www-data  167 Mar 27 16:16 storage.cfg
7273 512 -rw-r-----  1 www-data   32 Sep 29  2021 user.cfg
7268   0 -r--r-----  1 www-data  777 Jan  1  1970 .version
   4   0 drwxr-xr-x  2 www-data    0 Sep 29  2021 virtual-guest
7265   0 -r--r-----  1 www-data  791 Jan  1  1970 .vmlist
   2 512 -rw-r-----  1 www-data  120 Mar 21 13:57 vzdump.cron

Checked against another PVE box: there /etc/pve/ IS writable, but the dates are similar...

Could the cause of the problem be hiding here? How can it be fixed?

Cheers,
~R.
 
So that would indicate that pmxcfs is blocking writes, which is really strange on a stand-alone node. Any errors/warnings visible in journalctl -u pve-cluster?
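A sketch for pulling those out (journalctl's priority filter keeps only warnings and above, and -b limits the output to the current boot):

Code:
# full unit log
journalctl -u pve-cluster

# only warnings/errors since the last boot
journalctl -u pve-cluster -b -p warning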
 
any errors/warnings visible in journalctl -u pve-cluster?

This took ~10 sec. to complete; no current errors, though:

Code:
-- Journal begins at Wed 2021-09-29 14:28:05 CEST, ends at Thu 2022-03-31 14:04:43 CEST. --
Sep 29 14:28:12 skynet systemd[1]: Starting The Proxmox VE cluster filesystem...
Sep 29 14:28:13 skynet systemd[1]: Started The Proxmox VE cluster filesystem.
Sep 29 23:11:36 skynet pmxcfs[1469]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/skynet/local: -1
Sep 29 23:11:36 skynet pmxcfs[1469]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/skynet/local-zfs: -1
Sep 29 23:56:48 skynet pmxcfs[1469]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/skynet/local: -1
Sep 29 23:56:48 skynet pmxcfs[1469]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/skynet/local-zfs: -1
Oct 06 04:42:10 skynet pmxcfs[1469]: [database] crit: commit transaction failed: database or disk is full#010
Oct 06 04:42:10 skynet pmxcfs[1469]: [database] crit: rollback transaction failed: cannot rollback - no transaction is active#010
Oct 06 11:17:50 skynet systemd[1]: Stopping The Proxmox VE cluster filesystem...
Oct 06 11:17:50 skynet pmxcfs[1469]: [main] notice: teardown filesystem
Oct 06 11:17:51 skynet pmxcfs[1469]: [main] notice: exit proxmox configuration filesystem (0)
Oct 06 11:17:51 skynet systemd[1]: pve-cluster.service: Succeeded.
Oct 06 11:17:51 skynet systemd[1]: Stopped The Proxmox VE cluster filesystem.
Oct 06 11:17:51 skynet systemd[1]: pve-cluster.service: Consumed 12min 39.929s CPU time.
-- Boot 0fbffc03125c4e2e92fdab3598558b45 --
Oct 06 11:20:29 skynet systemd[1]: Starting The Proxmox VE cluster filesystem...
Oct 06 11:20:30 skynet systemd[1]: Started The Proxmox VE cluster filesystem.
Oct 13 09:48:34 skynet systemd[1]: Stopping The Proxmox VE cluster filesystem...
Oct 13 09:48:34 skynet pmxcfs[2450]: [main] notice: teardown filesystem
Oct 13 09:48:34 skynet pmxcfs[2450]: [main] notice: exit proxmox configuration filesystem (0)
Oct 13 09:48:34 skynet systemd[1]: pve-cluster.service: Succeeded.
Oct 13 09:48:34 skynet systemd[1]: Stopped The Proxmox VE cluster filesystem.
Oct 13 09:48:34 skynet systemd[1]: pve-cluster.service: Consumed 9min 57.325s CPU time.
-- Boot dc45f4dca3eb42cbb1508a25a6218f7a --
Oct 13 09:51:42 skynet systemd[1]: Starting The Proxmox VE cluster filesystem...
Oct 13 09:51:43 skynet systemd[1]: Started The Proxmox VE cluster filesystem.
Oct 15 09:46:44 skynet systemd[1]: Stopping The Proxmox VE cluster filesystem...
Oct 15 09:46:44 skynet pmxcfs[3106]: [main] notice: teardown filesystem
Oct 15 09:46:44 skynet pmxcfs[3106]: [main] notice: exit proxmox configuration filesystem (0)
Oct 15 09:46:44 skynet systemd[1]: pve-cluster.service: Succeeded.
Oct 15 09:46:44 skynet systemd[1]: Stopped The Proxmox VE cluster filesystem.
Oct 15 09:46:44 skynet systemd[1]: pve-cluster.service: Consumed 3min 25.716s CPU time.
-- Boot 222f15ec190049c3a03af81fa0b241a9 --
Oct 15 09:49:20 skynet systemd[1]: Starting The Proxmox VE cluster filesystem...
Oct 15 09:49:22 skynet systemd[1]: Started The Proxmox VE cluster filesystem.
lines 1-36
 
Code:
Oct 06 04:42:10 skynet pmxcfs[1469]: [database] crit: commit transaction failed: database or disk is full#010
Oct 06 04:42:10 skynet pmxcfs[1469]: [database] crit: rollback transaction failed: cannot rollback - no transaction is active#010

Possibly the backing DB got corrupted back then? It would be kind of strange not to get any errors for that now, though. Could you post the output of mount as well, please?
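If corruption is the suspicion, one way to test the backing database directly is sqlite's built-in integrity check - a sketch, assuming the sqlite3 CLI is installed and pve-cluster is stopped first (the path is the standard pmxcfs database location):

Code:
systemctl stop pve-cluster
# should print "ok" if the database structure is intact
sqlite3 /var/lib/pve-cluster/config.db 'PRAGMA integrity_check;'
systemctl start pve-cluster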
 
sure:
Code:
mount
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,relatime)
udev on /dev type devtmpfs (rw,nosuid,relatime,size=4013612k,nr_inodes=1003403,mode=755,inode64)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,noexec,relatime,size=808792k,mode=755,inode64)
rpool/ROOT/pve-1 on / type zfs (rw,relatime,xattr,noacl)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,inode64)
tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k,inode64)
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
none on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=30,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=22279)
tracefs on /sys/kernel/tracing type tracefs (rw,nosuid,nodev,noexec,relatime)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
sunrpc on /run/rpc_pipefs type rpc_pipefs (rw,relatime)
nfsd on /proc/fs/nfsd type nfsd (rw,relatime)
rpool on /rpool type zfs (rw,noatime,xattr,noacl)
rpool/ROOT on /rpool/ROOT type zfs (rw,noatime,xattr,noacl)
rpool/data on /rpool/data type zfs (rw,noatime,xattr,noacl)
rpool/data/subvol-400-disk-0 on /rpool/data/subvol-400-disk-0 type zfs (rw,noatime,xattr,posixacl)
rpool/data/subvol-222-disk-0 on /rpool/data/subvol-222-disk-0 type zfs (rw,noatime,xattr,posixacl)
rpool/data/subvol-1010-disk-0 on /rpool/data/subvol-1010-disk-0 type zfs (rw,noatime,xattr,posixacl)
rpool/data/subvol-111-disk-0 on /rpool/data/subvol-111-disk-0 type zfs (rw,noatime,xattr,posixacl)
rpool/data/subvol-311-disk-0 on /rpool/data/subvol-311-disk-0 type zfs (rw,noatime,xattr,posixacl)
rpool/data/subvol-444-disk-0 on /rpool/data/subvol-444-disk-0 type zfs (rw,noatime,xattr,posixacl)
rpool/data/subvol-333-disk-0 on /rpool/data/subvol-333-disk-0 type zfs (rw,noatime,xattr,posixacl)
rpool/data/subvol-1111-disk-0 on /rpool/data/subvol-1111-disk-0 type zfs (rw,noatime,xattr,posixacl)
rpool/data/subvol-777-disk-0 on /rpool/data/subvol-777-disk-0 type zfs (rw,noatime,xattr,posixacl)
rpool/data/subvol-600-disk-1 on /rpool/data/subvol-600-disk-1 type zfs (rw,noatime,xattr,posixacl)
rpool/data/subvol-501-disk-0 on /rpool/data/subvol-501-disk-0 type zfs (rw,noatime,xattr,posixacl)
rpool/data/subvol-888-disk-0 on /rpool/data/subvol-888-disk-0 type zfs (rw,noatime,xattr,posixacl)
rpool/data/subvol-555-disk-0 on /rpool/data/subvol-555-disk-0 type zfs (rw,noatime,xattr,posixacl)
lxcfs on /var/lib/lxcfs type fuse.lxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
/dev/fuse on /etc/pve type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=808788k,nr_inodes=202197,mode=700,inode64)
root@10.2.0.0:/backup/skynet/ on /mnt/colossus type fuse.sshfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0)
tracefs on /sys/kernel/debug/tracing type tracefs (rw,nosuid,nodev,noexec,relatime)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,nosuid,nodev,noexec,relatime)
 
Just wanted to ensure no shenanigans like a double-mounted / are in play. My best guess is that the sqlite DB got corrupted a while ago and now randomly causes pmxcfs to go into read-only mode when it encounters some inconsistency. Maybe it depends on accessing some file/dir/path? You could try enabling debug mode (systemctl stop pve-cluster; pmxcfs -d -f - but beware, this will produce LOTS of output..), then run something like find /etc/pve -type f -print -exec md5sum {} \; to read all files and see if it starts to go wrong at a certain file and what the debug output contains.
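Since the debug output is huge, capturing it to a file while reading everything back may help - a rough sketch (the log path is just an example):

Code:
systemctl stop pve-cluster
# run pmxcfs in the foreground with debug output, captured to a file
pmxcfs -d -f 2>&1 | tee /root/pmxcfs-debug.log &
# read every file on the cluster filesystem to trigger the problem
find /etc/pve -type f -print -exec md5sum {} \;
# when done: stop the debug instance and bring the service back
kill %1
systemctl start pve-cluster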
 
Thank you Fabian,
here's what I got - indeed a LOT ;)
I can't post it all ("The submitted message is too long to be processed. Please shorten it." ;°) so here's what fits into one post.
Anything useful?

Code:
Last login: Thu Mar 31 15:05:13 on ttys004
You have new mail.
root@skynet:~# systemctl stop pve-cluster; pmxcfs -d -f
[database] debug: name __version__ (inode = 0000000000000000, parent = 0000000000000000) (database.c:375:bdb_backend_load_index)
[database] debug: name user.cfg (inode = 0000000000000004, parent = 0000000000000000) (database.c:375:bdb_backend_load_index)
[database] debug: name datacenter.cfg (inode = 0000000000000006, parent = 0000000000000000) (database.c:375:bdb_backend_load_index)
[database] debug: name virtual-guest (inode = 0000000000000008, parent = 0000000000000000) (database.c:375:bdb_backend_load_index)
[database] debug: name priv (inode = 0000000000000009, parent = 0000000000000000) (database.c:375:bdb_backend_load_index)
[database] debug: name nodes (inode = 000000000000000B, parent = 0000000000000000) (database.c:375:bdb_backend_load_index)
[database] debug: name skynet (inode = 000000000000000C, parent = 000000000000000B) (database.c:375:bdb_backend_load_index)
[database] debug: name lxc (inode = 000000000000000D, parent = 000000000000000C) (database.c:375:bdb_backend_load_index)
[database] debug: name qemu-server (inode = 000000000000000E, parent = 000000000000000C) (database.c:375:bdb_backend_load_index)
[database] debug: name openvz (inode = 000000000000000F, parent = 000000000000000C) (database.c:375:bdb_backend_load_index)
[database] debug: name priv (inode = 0000000000000010, parent = 000000000000000C) (database.c:375:bdb_backend_load_index)
[database] debug: name lock (inode = 0000000000000011, parent = 0000000000000009) (database.c:375:bdb_backend_load_index)
[database] debug: name pve-www.key (inode = 0000000000000018, parent = 0000000000000000) (database.c:375:bdb_backend_load_index)
[database] debug: name pve-ssl.key (inode = 000000000000001A, parent = 000000000000000C) (database.c:375:bdb_backend_load_index)
[database] debug: name pve-root-ca.key (inode = 000000000000001C, parent = 0000000000000009) (database.c:375:bdb_backend_load_index)
[database] debug: name pve-root-ca.pem (inode = 000000000000001E, parent = 0000000000000000) (database.c:375:bdb_backend_load_index)
[database] debug: name pve-root-ca.srl (inode = 0000000000000020, parent = 0000000000000009) (database.c:375:bdb_backend_load_index)
[database] debug: name pve-ssl.pem (inode = 0000000000000023, parent = 000000000000000C) (database.c:375:bdb_backend_load_index)
[database] debug: name ha (inode = 0000000000000031, parent = 0000000000000000) (database.c:375:bdb_backend_load_index)
[database] debug: name acme (inode = 0000000000000032, parent = 0000000000000009) (database.c:375:bdb_backend_load_index)
[database] debug: name sdn (inode = 0000000000000033, parent = 0000000000000000) (database.c:375:bdb_backend_load_index)
[database] debug: name lrm_status.tmp.1609 (inode = 0000000000057F60, parent = 000000000000000C) (database.c:375:bdb_backend_load_index)
[database] debug: name lrm_status.tmp.2570099 (inode = 00000000007B9B8B, parent = 000000000000000C) (database.c:375:bdb_backend_load_index)
[database] debug: name storage (inode = 00000000007F567E, parent = 0000000000000009) (database.c:375:bdb_backend_load_index)
[database] debug: name 222.conf~ (inode = 000000000081025F, parent = 000000000000000D) (database.c:375:bdb_backend_load_index)
[database] debug: name replication.cfg (inode = 00000000008BE6E8, parent = 0000000000000000) (database.c:375:bdb_backend_load_index)
[database] debug: name vzdump.cron (inode = 00000000008BE6EC, parent = 0000000000000000) (database.c:375:bdb_backend_load_index)
[database] debug: name jobs.cfg (inode = 00000000008BE79C, parent = 0000000000000000) (database.c:375:bdb_backend_load_index)
[database] debug: name 888.conf (inode = 00000000008E6E9C, parent = 000000000000000D) (database.c:375:bdb_backend_load_index)
[database] debug: name 777.conf (inode = 00000000008E6EA2, parent = 000000000000000D) (database.c:375:bdb_backend_load_index)
[database] debug: name 555.conf (inode = 00000000008E6EA5, parent = 000000000000000D) (database.c:375:bdb_backend_load_index)
[database] debug: name 1010.conf (inode = 00000000008E6EA8, parent = 000000000000000D) (database.c:375:bdb_backend_load_index)
[database] debug: name 600.conf (inode = 00000000008E6EAB, parent = 000000000000000D) (database.c:375:bdb_backend_load_index)
[database] debug: name 501.conf (inode = 00000000008E6EBA, parent = 000000000000000D) (database.c:375:bdb_backend_load_index)
[database] debug: name 444.conf (inode = 00000000008E6EBD, parent = 000000000000000D) (database.c:375:bdb_backend_load_index)
[database] debug: name 400.conf (inode = 00000000008E6EE4, parent = 000000000000000D) (database.c:375:bdb_backend_load_index)
[database] debug: name 111.conf (inode = 00000000008E6EFC, parent = 000000000000000D) (database.c:375:bdb_backend_load_index)
[database] debug: name storage.cfg (inode = 00000000008EF145, parent = 0000000000000000) (database.c:375:bdb_backend_load_index)
[database] debug: name 333.conf.tmp.2521576 (inode = 00000000008F4243, parent = 000000000000000D) (database.c:375:bdb_backend_load_index)
[database] debug: name authorized_keys (inode = 00000000008F4245, parent = 0000000000000009) (database.c:375:bdb_backend_load_index)
[database] debug: name known_hosts (inode = 00000000008F4248, parent = 0000000000000009) (database.c:375:bdb_backend_load_index)
[database] debug: name 333.conf (inode = 00000000008F4254, parent = 000000000000000D) (database.c:375:bdb_backend_load_index)
[database] debug: name authkey.pub.old (inode = 00000000008F425E, parent = 0000000000000000) (database.c:375:bdb_backend_load_index)
[database] debug: name authkey.pub (inode = 00000000008F4261, parent = 0000000000000000) (database.c:375:bdb_backend_load_index)
[database] debug: name authkey.key (inode = 00000000008F4264, parent = 0000000000000009) (database.c:375:bdb_backend_load_index)
[database] debug: name 1111.conf (inode = 00000000008F5B5B, parent = 000000000000000D) (database.c:375:bdb_backend_load_index)
[database] debug: name 222.conf (inode = 00000000008F8E78, parent = 000000000000000D) (database.c:375:bdb_backend_load_index)
[database] debug: name lrm_status (inode = 00000000008F92D5, parent = 000000000000000C) (database.c:375:bdb_backend_load_index)
[database] debug: name 311.conf (inode = 00000000008F92D9, parent = 000000000000000D) (database.c:375:bdb_backend_load_index)
[database] debug: name lrm_status.tmp.2294 (inode = 00000000008F92DC, parent = 000000000000000C) (database.c:375:bdb_backend_load_index)
[main] debug: memdb open '/var/lib/pve-cluster/config.db' successful (version = 00000000008F92DE) (memdb.c:536:memdb_open)
[main] debug: using local mode (corosync.conf does not exist) (pmxcfs.c:921:main)
[loop] debug: start loop (loop.c:328:cfs_loop_worker_thread)
[loop] debug: worker started (loop.c:356:cfs_loop_start_worker)
[ipcs] debug: start event loop (server.c:522:worker_thread)
[libqb] info: server name: pve2 (ipc_setup.c:539:qb_ipcs_us_publish)
[ipcs] debug: server started (server.c:568:server_start)
[ipcs] debug: authenticated connection 0/0 (server.c:110:s1_connection_accept_fn)
[libqb] debug: IPC credentials authenticated (/dev/shm/qb-3870687-2137-14-XSCv3l/qb) (ipc_setup.c:706:handle_new_connection)
[libqb] debug: connecting to client [2137] (ipc_shm.c:348:qb_ipcs_shm_connect)
[libqb] debug: shm size:1048589; real_size:1052672; rb->word_size:263168 (ringbuffer.c:236:qb_rb_open_2)
[libqb] debug: shm size:1048589; real_size:1052672; rb->word_size:263168 (ringbuffer.c:236:qb_rb_open_2)
[libqb] debug: shm size:1048589; real_size:1052672; rb->word_size:263168 (ringbuffer.c:236:qb_rb_open_2)
[ipcs] debug: Connection created > active:1 > closed:0 (server.c:134:s1_connection_created_fn)
[ipcs] debug: process msg:1, size:16 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 0 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:2, size:16 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 0 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:3, size:16 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 0 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:4, size:5439 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 0 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:6, size:27 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result -2 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:4, size:415 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 0 (server.c:403:s1_msg_process_fn)
[ipcs] debug: authenticated connection 0/0 (server.c:110:s1_connection_accept_fn)
[libqb] debug: IPC credentials authenticated (/dev/shm/qb-3870687-3380887-16-fyBo3o/qb) (ipc_setup.c:706:handle_new_connection)
[libqb] debug: connecting to client [3380887] (ipc_shm.c:348:qb_ipcs_shm_connect)
[libqb] debug: shm size:1048589; real_size:1052672; rb->word_size:263168 (ringbuffer.c:236:qb_rb_open_2)
[libqb] debug: shm size:1048589; real_size:1052672; rb->word_size:263168 (ringbuffer.c:236:qb_rb_open_2)
[libqb] debug: shm size:1048589; real_size:1052672; rb->word_size:263168 (ringbuffer.c:236:qb_rb_open_2)
[ipcs] debug: Connection created > active:2 > closed:0 (server.c:134:s1_connection_created_fn)
[ipcs] debug: process msg:6, size:28 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 167 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:1, size:16 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 0 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:2, size:16 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 0 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:3, size:16 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 0 (server.c:403:s1_msg_process_fn)
[main] debug: enter cfs_fuse_getattr / (pmxcfs.c:127:cfs_fuse_getattr)
[main] debug: find_plug start  (pmxcfs.c:103:find_plug)
[main] debug: cfs_plug_base_lookup_plug  (cfs-plug.c:52:cfs_plug_base_lookup_plug)
[main] debug: find_plug end  = 0x55eed9c97140 () (pmxcfs.c:110:find_plug)
[main] debug: enter cfs_plug_base_getattr  (cfs-plug.c:84:cfs_plug_base_getattr)
[main] debug: leave cfs_plug_base_getattr  (cfs-plug.c:103:cfs_plug_base_getattr)
[main] debug: leave cfs_fuse_getattr / (0) (pmxcfs.c:150:cfs_fuse_getattr)
[main] debug: enter cfs_fuse_getattr /local (pmxcfs.c:127:cfs_fuse_getattr)
[main] debug: find_plug start local (pmxcfs.c:103:find_plug)
[main] debug: cfs_plug_base_lookup_plug local (cfs-plug.c:52:cfs_plug_base_lookup_plug)
[main] debug: cfs_plug_base_lookup_plug name = local new path = (null) (cfs-plug.c:59:cfs_plug_base_lookup_plug)
[main] debug: find_plug end local = 0x55eed9c971d0 ((null)) (pmxcfs.c:110:find_plug)
[main] debug: enter cfs_plug_link_getattr  (cfs-plug-link.c:58:cfs_plug_link_getattr)
[main] debug: leave cfs_fuse_getattr /local (0) (pmxcfs.c:150:cfs_fuse_getattr)
[ipcs] debug: process msg:6, size:34 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result -2 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:6, size:33 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result -2 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:6, size:42 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 267 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:6, size:42 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 616 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:6, size:42 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 269 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:6, size:42 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 574 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:6, size:42 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 268 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:6, size:42 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 268 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:6, size:42 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 255 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:6, size:42 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 560 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:6, size:42 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 247 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:6, size:42 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 274 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:6, size:43 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 610 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:6, size:42 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 259 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:6, size:43 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 257 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:4, size:392 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 0 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:4, size:398 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 0 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:4, size:399 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 0 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:4, size:401 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 0 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:4, size:393 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 0 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:4, size:391 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 0 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:4, size:404 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 0 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:4, size:344 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 0 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:4, size:393 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 0 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:4, size:399 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 0 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:4, size:345 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 0 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:4, size:405 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 0 (server.c:403:s1_msg_process_fn)
[ipcs] debug: process msg:4, size:344 (server.c:180:s1_msg_process_fn)
[ipcs] debug: process result 0 (server.c:403:s1_msg_process_fn)
[ipcs] debug: authenticated connection 0/0 (server.c:110:s1_connection_accept_fn)
[libqb] debug: IPC credentials authenticated (/dev/shm/qb-3870687-3380459-17-8HJkZn/qb) (ipc_setup.c:706:handle_new_connection)
[libqb] debug: connecting to client [3380459] (ipc_shm.c:348:qb_ipcs_shm_connect)
[libqb] debug: shm size:1048589; real_size:1052672; rb->word_size:263168 (ringbuffer.c:236:qb_rb_open_2)
[libqb] debug: shm size:1048589; real_size:1052672; rb->word_size:263168 (ringbuffer.c:236:qb_rb_open_2)
[libqb] debug: shm size:1048589; real_size:1052672; rb->word_size:263168 (ringbuffer.c:236:qb_rb_open_2)
[ipcs] debug: Connection created > active:3 > closed:0 (server.c:134:s1_connection_created_fn)
...
root@skynet:~# find /etc/pve -type f -print -exec md5sum {} \;
root@skynet:~#
 
Could you leave it running for a bit and collect the output once it starts to hang again?
 
The last (few hundred ;)) lines of the debug output, and then any lines that get added if you do another 'touch /etc/pve/something'.
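If the debug output was captured to a file as sketched above, something like this would extract the relevant parts (the log path is just the example name used earlier):

Code:
# last few hundred lines of the captured debug log
tail -n 500 /root/pmxcfs-debug.log > /root/pmxcfs-tail.log

# watch what gets appended while running 'touch /etc/pve/something' in another shell
tail -f /root/pmxcfs-debug.log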
 
I had this running for like 2 hrs.

Now I CAN touch /etc/pve/something.

The GUI in one browser gives me an empty page; the other lets me log in but gives:

Connection error 500: RPCEnvironment init request failed: Unable to load access control list: Connection refused
I'll have to reboot.
What shall I do with the 170k lines of logs?
 
just wanted to ensure no shenanigans like double-mounted / are in play. my best guess is that the sqlite DB got corrupted some while ago and now randomly causes pmxcfs to go into ro mode when it encounters some inconsistency. maybe it depends on accessing some file/dir/path? you could try enabling debug mode (systemctl stop pve-cluster; pmxcfs -d -f - but beware, this will produce LOTS of output..), then run something like find /etc/pve -type f -print -exec md5sum {} \; to read all files and see if it starts to go wrong at a certain file and what the debug output contains.

So I did this, and the find command itself seemed to lock up on the /etc/pve/priv/lock directory. ls doesn't like it either. Neither command likes /etc/pve/nodes/pve01; pve01 is the one I created the cluster on. The pve02 and pve03 directories are fine. Not sure if that is meaningful.

I also tried creating a file with the touch command as suggested. On pve02 it worked, but took two minutes to execute. On pve01 it created the file, but the command never returned (waited 20 minutes or more). The files did not seem to replicate to the other hosts.

In the GUI, I never get all nodes green. At best, the main node has a green check mark and the others have a grey question mark. Right now, for some reason I don't understand, everything shows red. I killed the pmxcfs -d -f command and restarted pve-cluster. systemctl status pve-cluster looks fine, but the GUI disagrees. [Note: by the time I finished writing this, they went back to green for pve01 and grey for the other two.]

Finally, the documentation says the cluster needs ports 5404 and 5405 to communicate. Is there a way to test this from the command line? In the old days I would telnet to an open port, but nothing seems to be listening. ss and lsof return the following:

Code:
ss -tulpn | grep 54
udp   UNCONN  0  0  10.1.120.202:5405  0.0.0.0:*  users:(("corosync",pid=956,fd=28))

lsof -i -P -n | grep coro
corosync  956  root  28u  IPv4  22199  0t0  UDP 10.1.120.202:5405

Telnet gets a connection refused, but I guess it would if the port is only listening for UDP? I don't know netcat well enough to do anything useful with it:

Code:
netcat localhost 5405 -v -v -u
localhost.localdomain [127.0.0.1] 5405 (?) open
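Corosync uses UDP, so a TCP tool like telnet won't tell you much. A sketch of how one might probe it, assuming the OpenBSD netcat that Debian ships; note that an "open" result for UDP is not reliable, since UDP has no handshake, so corosync's own tool is the more telling check:

Code:
# zero-I/O UDP probe of the corosync port (result is only indicative)
nc -z -u -v 10.1.120.202 5405

# ask corosync itself about link/ring status on each node
corosync-cfgtool -s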
 
