Proxmox goes ape s**t after a reboot

Ralms

Hi all,

I've been running a single-node Proxmox box for over a year at this point.

Yesterday I started getting the error `proxmox unable to open file '/etc/pve/nodes/ Input/output error` when trying to modify a VM configuration.
After searching around, I decided to try rebooting the host to see if the error would go away.

Well, it got even worse...
When I try to log in to the portal, I get `Connection error 401: permission denied - invalid PVE ticket`

I've cleared the portal's temporary files, but that didn't help.

Then I found this thread, which suggests deleting the certificates and keys and recreating them:
https://forum.proxmox.com/threads/3...ied-invalid-pve-ticket-401.56038/#post-306529

Knowing certificates, I decided to rename the files and keep them instead of just removing them.
When I ran `root@prox1:~# mv /etc/pve/pve-root-ca.pem /etc/pve/pve-root-ca.pem_bck` I got:
`mv: cannot move '/etc/pve/pve-root-ca.pem' to '/etc/pve/pve-root-ca.pem_bck': Input/output error`

I haven't done anything to cause this. The only thing I can remember is installing iperf, which shouldn't touch anything.
Apologies for the frustration, but it's really annoying to be almost locked out of your own machine as root without having done anything.

Any suggestions?

Thank you.

EDIT:

Disk is not full:

Code:
root@prox1:~# df -h
Filesystem                                Size  Used Avail Use% Mounted on
udev                                       16G     0   16G   0% /dev
tmpfs                                     3.2G  9.3M  3.2G   1% /run
/dev/mapper/pve-root                       28G  3.8G   22G  15% /
tmpfs                                      16G   40M   16G   1% /dev/shm
tmpfs                                     5.0M     0  5.0M   0% /run/lock
tmpfs                                      16G     0   16G   0% /sys/fs/cgroup
/dev/fuse                                  30M   24K   30M   1% /etc/pve
tmpfs                                     3.2G     0  3.2G   0% /run/user/0
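
A side note on the output above: `/etc/pve` is listed as a `/dev/fuse` mount. On Proxmox VE this mount is provided by pmxcfs (the `pve-cluster` service), which keeps its database under `/var/lib/pve-cluster/` on the root filesystem, so an Input/output error on `/etc/pve` may point at that service or at the storage below it rather than at free space. A minimal sketch for checking what backs a mount point follows; the helper name `fs_source_for` is hypothetical and not part of any Proxmox tooling.

```shell
# Hypothetical helper: read `df`-style output on stdin and print the
# "Filesystem" column for the given mount point ($1).
fs_source_for() {
    awk -v mp="$1" 'NR > 1 && $NF == mp { print $1 }'
}

# On a live host one would typically run (as root):
#   df -h | fs_source_for /etc/pve
#   systemctl status pve-cluster
#   journalctl -b -u pve-cluster
# Demonstration on two lines copied from the df output in this post:
printf '%s\n' \
    'Filesystem                                Size  Used Avail Use% Mounted on' \
    '/dev/fuse                                  30M   24K   30M   1% /etc/pve' \
    | fs_source_for /etc/pve
```

If the `pve-cluster` journal shows database or write errors, that would fit the failing-storage theory better than a certificate problem.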
 
Hi,

The disk may not be full, but it is probably failing or has already failed. Check the smartctl data and the dmesg/kernel log.
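
As a rough sketch of that check: the filter below keeps only kernel-log lines that commonly accompany storage trouble. The helper name `storage_errors` and its pattern list are assumptions for illustration, not an exhaustive or official set, and `/dev/sda` is a placeholder device name.

```shell
# Hypothetical helper: keep only kernel-log lines that often show up
# alongside storage problems. The pattern list is an assumption.
storage_errors() {
    grep -E -i 'I/O error|failed to resume link|link down|EXT4-fs error|remounting filesystem read-only'
}

# On a live host (as root) one would run:
#   dmesg | storage_errors
#   smartctl -H -a /dev/sda    # /dev/sda is a placeholder
# Demonstration on two sample kernel-log lines:
printf '%s\n' \
    '[ 0.110704] ACPI: SPCR: Unexpected SPCR Access Width. Defaulting to byte size' \
    '[ 2.889597] ata1.00: SATA link down (SStatus 0 SControl 300)' \
    | storage_errors
```

A match is only a hint: link-down messages can also appear for ports with nothing attached, so the surrounding log lines still need reading.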
Hi t.lamprecht,
thank you for reaching out, appreciated.

Unfortunately, smartctl won't see much, because this drive is a volume on the RAID controller:
1617638294153.png

But iLO reports it as OK.
1617638328529.png
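
For reference, smartctl can sometimes reach the physical disks behind a controller with its `-d` device-type option. Assuming the controller here is an HP Smart Array (an assumption, since the thread only says it is a RAID controller in a DL380p Gen8), the `cciss,N` type addresses disk number N on the controller. The sketch below only prints the candidate commands and runs nothing; `cciss_smart_cmds` is a hypothetical helper, and the device node is a placeholder.

```shell
# Hypothetical helper: print (do not run) one smartctl command per
# physical disk index behind the controller.
#   $1: device node of the logical volume (placeholder, e.g. /dev/sda)
#   $2: number of physical disks to probe
cciss_smart_cmds() {
    dev="$1"
    count="$2"
    i=0
    while [ "$i" -lt "$count" ]; do
        printf 'smartctl -a -d cciss,%s %s\n' "$i" "$dev"
        i=$((i + 1))
    done
}

# A single disk, as in this setup:
cciss_smart_cmds /dev/sda 1
```

Whether this returns useful SMART data depends on the controller and driver, so treat an empty or failed result as inconclusive.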

Regarding dmesg (output attached), on a quick first pass I saw:

  1. "[ 0.261468] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330)"
    HP says this can be ignored. https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-c03265132
    I'm using a DL380p Gen8.

  2. "[ 0.110704] ACPI: SPCR: Unexpected SPCR Access Width. Defaulting to byte size"
    I don't really know what this means.

  3. I see many lines regarding "SATA"
    [ 2.562574] ata2.00: failed to resume link (SControl 0)
    [ 2.878565] ata1.01: failed to resume link (SControl 0)
    [ 2.889597] ata1.00: SATA link down (SStatus 0 SControl 300)
    [ 2.889613] ata1.01: SATA link down (SStatus 4 SControl 0)
    [ 3.602570] ata2.01: failed to resume link (SControl 0)
    [ 3.613577] ata2.00: SATA link down (SStatus 4 SControl 0)
    [ 3.613594] ata2.01: SATA link down (SStatus 4 SControl 0)

    I don't know what this refers to either.
    I have only a single disk.

Any idea?

Thank you.
 

t.lamprecht said:
The disk may not be full, but it is probably failing or has already failed. Check the smartctl data and the dmesg/kernel log.

Additionally, I want to add that if the drive had already failed, I wouldn't expect Proxmox to boot at all, and it does.
I can log in via SSH, but that is about it.
 
Ralms said:
Additionally, I want to add that if the drive had already failed, I wouldn't expect Proxmox to boot at all, and it does.
I can log in via SSH, but that is about it.

That does not necessarily mean that the drive(s) are OK; reading existing data can often still work fine for a time. Booting does not require much, only a few hundred MiB.

Ralms said:
[ 2.562574] ata2.00: failed to resume link (SControl 0)
[ 2.878565] ata1.01: failed to resume link (SControl 0)
[ 2.889597] ata1.00: SATA link down (SStatus 0 SControl 300)
[ 2.889613] ata1.01: SATA link down (SStatus 4 SControl 0)
[ 3.602570] ata2.01: failed to resume link (SControl 0)
[ 3.613577] ata2.00: SATA link down (SStatus 4 SControl 0)
[ 3.613594] ata2.01: SATA link down (SStatus 4 SControl 0)

Normally that means a (partially) failed controller or disk.
And this really happened suddenly, after a year of working fine with Proxmox VE, with no other change involved (besides installing iperf, which I think we can ignore)? No big updates (new kernel), power loss, ...?

You can try booting an older kernel to see if that helps (in case of a regression between your controller and a newer kernel version). I honestly do not see a big chance there, but it's cheap to try. Otherwise I'd still suspect the hardware: first the controller, then the disks.
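
To see which kernels are available to boot, something like the following could be used. It is a minimal sketch that assumes kernel images are installed as `vmlinuz-<version>` under `/boot`, as on a typical Debian-based install; `list_kernels` is a hypothetical helper name, and `sort -V` assumes GNU coreutils.

```shell
# Hypothetical helper: list kernel versions found as vmlinuz-<version>
# in a boot directory (default /boot), oldest version first.
list_kernels() {
    boot_dir="${1:-/boot}"
    for f in "$boot_dir"/vmlinuz-*; do
        [ -e "$f" ] || continue            # skip if the glob matched nothing
        basename "$f" | sed 's/^vmlinuz-//'
    done | sort -V
}

echo "running: $(uname -r)"
list_kernels
```

Any version listed besides the running one can then be picked from the boot loader menu at the next reboot.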
 
