[SOLVED] Unable to login to web GUI, console shows errors

iGadget

Member
Apr 9, 2020
Running VE 6.1-7 on a single SSD for my home environment. It worked like a charm for several months, but for the past few days I've been experiencing issues.
First I noticed by chance that an automatic snapshot of one of my LXC containers had failed. The log showed (among other things) 'I/O error'.
Unfortunately I was unable to dive into it at that moment, but the problem has since worsened.
As of today I'm unable to log in to the web GUI - 'Login failed. Please try again'.
Fortunately I can still SSH into the machine.
/var/log/daemon.log shows this:
Bash:
Nov 23 10:00:10 server systemd[1]: Failed to start Proxmox VE replication runner.
Nov 23 10:00:14 server pve-ha-lrm[1472]: unable to write lrm status file - unable to open file '/etc/pve/nodes/simba/lrm_status.tmp.1472' - Input/output error
Nov 23 10:00:18 server pvestatd[1374]: authkey rotation error: error with cfs lock 'authkey': got lock request timeout
Nov 23 10:00:18 server pvestatd[1374]: status update time (9.213 seconds)
Nov 23 10:00:19 server pve-ha-lrm[1472]: unable to write lrm status file - unable to open file '/etc/pve/nodes/simba/lrm_status.tmp.1472' - Input/output error
Nov 23 10:00:24 server pve-ha-lrm[1472]: unable to write lrm status file - unable to open file '/etc/pve/nodes/simba/lrm_status.tmp.1472' - Input/output error
Nov 23 10:00:29 server pvestatd[1374]: authkey rotation error: error with cfs lock 'authkey': got lock request timeout
Nov 23 10:00:29 server pvestatd[1374]: status update time (9.212 seconds)
Nov 23 10:00:29 server pve-ha-lrm[1472]: unable to write lrm status file - unable to open file '/etc/pve/nodes/simba/lrm_status.tmp.1472' - Input/output error
Nov 23 10:00:34 server pve-ha-lrm[1472]: unable to write lrm status file - unable to open file '/etc/pve/nodes/simba/lrm_status.tmp.1472' - Input/output error

So to me, all these I/O errors sound pretty much like my SSD is having serious issues. However, I have no idea whether that is really the case.
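(Side note, in case it helps anyone with the same symptoms: a quick way to check whether the kernel itself is logging I/O errors against the disk - plain dmesg/journalctl greps, assuming the SSD is /dev/sda like mine:)
Bash:
# look for block-layer / ATA errors in the kernel ring buffer
dmesg | grep -iE 'i/o error|ata|sda'
# same search, but from the persistent kernel journal
journalctl -k | grep -iE 'i/o error|sda'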
smartctl -t short -a /dev/sda shows this:
Bash:
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.3.18-2-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 840 EVO 250GB
Serial Number:    *
LU WWN Device Id: 5 002538 8a01e180b
Firmware Version: EXT0DB6Q
User Capacity:    250,059,350,016 bytes [250 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Nov 23 10:07:41 2020 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:         ( 4800) seconds.
Offline data collection
capabilities:              (0x53) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    No Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (  80) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       18728
 12 Power_Cycle_Count       0x0032   098   098   000    Old_age   Always       -       1858
177 Wear_Leveling_Count     0x0013   087   087   000    Pre-fail  Always       -       157
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   072   044   000    Old_age   Always       -       28
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       157
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       25446615681

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Mon Nov 23 10:09:41 2020 CET
Use smartctl -X to abort test.
Nothing above tells me the disk is dying, but perhaps I'm missing something?
Anything else I should do / check?
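(Once the short test has finished - about 2 minutes according to the output above - the results can be re-read, and an extended test gives broader coverage. Standard smartctl options:)
Bash:
# re-read the self-test log after the short test completes
smartctl -l selftest /dev/sda
# optionally start the extended self-test (~80 minutes on this drive)
smartctl -t long /dev/sda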
 
Hi,

Have you restarted your node?

Please provide your PVE version (pveversion -v) and the status of the pvedaemon and pveproxy services.

If the services are running, please try restarting them:

Bash:
~ systemctl restart pvedaemon.service
~ systemctl restart pveproxy.service
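A quick way to check the status of both services (standard systemd command):
Bash:
~ systemctl status pvedaemon.service pveproxy.service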
 
Thanks for your reply, @Moayad.
I have not tried restarting the machine yet, for fear of a possible boot failure (since the problem is getting worse by the day). I will do so anyway if required.
Output of pveversion -v:
Code:
proxmox-ve: 6.1-2 (running kernel: 5.3.18-2-pve)
pve-manager: 6.1-7 (running version: 6.1-7/13e58d5e)
pve-kernel-helper: 6.1-6
pve-kernel-5.3: 6.1-5
pve-kernel-5.3.18-2-pve: 5.3.18-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-13
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-4
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-21
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-6
pve-ha-manager: 3.0-8
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-3
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-6
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
Both pvedaemon and pveproxy seem to be running. I've restarted both, but I still can't log in to the web GUI.
Output of pvedaemon.service after service restart:
Code:
 pvedaemon.service - PVE API Daemon
   Loaded: loaded (/lib/systemd/system/pvedaemon.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2020-11-23 12:01:32 CET; 51s ago
  Process: 25950 ExecStart=/usr/bin/pvedaemon start (code=exited, status=0/SUCCESS)
 Main PID: 25977 (pvedaemon)
    Tasks: 4 (limit: 4915)
   Memory: 127.3M
   CGroup: /system.slice/pvedaemon.service
           ├─25977 pvedaemon
           ├─25978 pvedaemon worker
           ├─25979 pvedaemon worker
           └─25980 pvedaemon worker

nov 23 12:01:31 simba systemd[1]: Starting PVE API Daemon...
nov 23 12:01:32 simba pvedaemon[25977]: starting server
nov 23 12:01:32 simba pvedaemon[25977]: starting 3 worker(s)
nov 23 12:01:32 simba pvedaemon[25977]: worker 25978 started
nov 23 12:01:32 simba pvedaemon[25977]: worker 25979 started
nov 23 12:01:32 simba pvedaemon[25977]: worker 25980 started
nov 23 12:01:32 simba systemd[1]: Started PVE API Daemon.
nov 23 12:02:19 simba pvedaemon[25979]: authentication failure; rhost=10.0.0.9 user=root@pam msg=error with cfs lock 'authkey': got lock request timeout
I can try rebooting the machine, if you think it's safe to do so. Or is there anything else I should check before rebooting?
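(In case it helps: the 'cfs lock' errors seem to point at pmxcfs, the cluster filesystem mounted on /etc/pve. These are checks I could run over SSH before rebooting - standard systemd/mount commands:)
Bash:
~ systemctl status pve-cluster.service
~ mount | grep /etc/pve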
 
You probably need to reboot the node, but first could you check the output of journalctl -u pve-cluster? Please also consider upgrading the PVE node to the latest version.
 
Output of journalctl -u pve-cluster:
Code:
-- Logs begin at Sun 2020-11-22 08:12:46 CET, end at Mon 2020-11-23 12:29:48 CET. --
-- No entries --
Re upgrading - I was under the assumption this happened automagically. Will do an apt update / apt dist-upgrade.

Edit - there were no Proxmox updates available when I did the apt update / apt dist-upgrade. How should I upgrade?
Edit2 - it seems the required apt repo is missing. Will fix that first.
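(For reference, a sketch of the kind of repository entry that was missing - the pve-no-subscription repo for PVE 6.x on Debian Buster. The file name below is my own choice, and a subscription setup would use the enterprise repo instead:)
Bash:
# /etc/apt/sources.list.d/pve-no-subscription.list
deb http://download.proxmox.com/debian/pve buster pve-no-subscription

# then refresh and upgrade
apt update && apt dist-upgrade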
 
@Moayad - after upgrading and rebooting the node, everything seems to be working again. No more I/O errors and the Web GUI works as before. Thanks a lot!
 
Great!

Please mark the thread as [SOLVED] to help other people who have the same issue. Thanks!

Have a nice day :)
 