Proxmox goes ape s**t after a reboot

Ralms · Apr 5, 2021

Hi all,

I've been running a single-node Proxmox box for over a year at this point.

Yesterday I started getting errors `proxmox unable to open file '/etc/pve/nodes/ Input/output error` when trying to modify a VM configuration.
After searching around, I decided to try rebooting the host to see if the error would go away.

Well, now it got even worse...
Trying to log in to the portal, I get `Connection error 401: permission denied - invalid PVE ticket`

I cleared the portal's temporary files, but that didn't help.

Then I found this thread suggesting deleting the certificates and keys and recreating them:
https://forum.proxmox.com/threads/3...ied-invalid-pve-ticket-401.56038/#post-306529

Knowing certificates, I decided to rename the files and keep them instead of just removing them.
When I ran `root@prox1:~# mv /etc/pve/pve-root-ca.pem /etc/pve/pve-root-ca.pem_bck` I got:
`mv: cannot move '/etc/pve/pve-root-ca.pem' to '/etc/pve/pve-root-ca.pem_bck': Input/output error`

I haven't done anything to cause this; the only thing I can remember is installing iperf, which shouldn't touch anything.
Apologies for the frustration, but it's really annoying to be almost locked out of your own machine as root without having done anything.

Any suggestions?

Thank you.

EDIT:

Disk is not full:

Code:

root@prox1:~# df -h
Filesystem                                Size  Used Avail Use% Mounted on
udev                                       16G     0   16G   0% /dev
tmpfs                                     3.2G  9.3M  3.2G   1% /run
/dev/mapper/pve-root                       28G  3.8G   22G  15% /
tmpfs                                      16G   40M   16G   1% /dev/shm
tmpfs                                     5.0M     0  5.0M   0% /run/lock
tmpfs                                      16G     0   16G   0% /sys/fs/cgroup
/dev/fuse                                  30M   24K   30M   1% /etc/pve
tmpfs                                     3.2G     0  3.2G   0% /run/user/0

Ralms · Apr 5, 2021

Update: I've now tried running

Code:

pvecm updatecerts -f

but no luck.

t.lamprecht · Apr 5, 2021

Hi,

Ralms said:
Disk is not full:

But it is probably failing or has even already failed. Check the smartctl data and dmesg/kernel log.
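To make the kernel-log check a bit more concrete, here is a minimal sketch of a filter for lines that commonly accompany disk or controller trouble. The function name `disk_errors` and the pattern list are assumptions for illustration, not a Proxmox tool or a complete list of failure messages.

```shell
# Hypothetical helper (not part of Proxmox): read kernel-log text on stdin
# and keep only lines that often point at disk or controller problems.
# The pattern list is an assumption; extend it for your hardware.
disk_errors() {
    grep -iE 'i/o error|failed to resume link|link down|medium error|remounting filesystem read-only'
}
```

It could be used as `dmesg | disk_errors` or `journalctl -k | disk_errors`. For the SMART side, something like `smartctl -a /dev/sdX` applies, where `/dev/sdX` is a placeholder for the actual device.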

Ralms · Apr 5, 2021

t.lamprecht said:
Hi,

But it is probably failing or has even already failed. Check the smartctl data and dmesg/kernel log.

Hi t.lamprecht,
thank you for reaching out, appreciated.

Unfortunately, smartctl won't see much, because this drive is a volume on the RAID controller.

Checking iLO, though, it's reported as OK.

Regarding dmesg (output attached), on a quick first pass I saw:

"[ 0.261468] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330)"
HP says this can be ignored. https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-c03265132
I'm using a DL380p Gen8.
"[ 0.110704] ACPI: SPCR: Unexpected SPCR Access Width. Defaulting to byte size"
I don't really know what this means.
I also see many lines regarding SATA:
[ 2.562574] ata2.00: failed to resume link (SControl 0)
[ 2.878565] ata1.01: failed to resume link (SControl 0)
[ 2.889597] ata1.00: SATA link down (SStatus 0 SControl 300)
[ 2.889613] ata1.01: SATA link down (SStatus 4 SControl 0)
[ 3.602570] ata2.01: failed to resume link (SControl 0)
[ 3.613577] ata2.00: SATA link down (SStatus 4 SControl 0)
[ 3.613594] ata2.01: SATA link down (SStatus 4 SControl 0)

I don't know what this refers to either.
I have a single disk only.

Any idea?

Thank you.

Ralms · Apr 5, 2021

t.lamprecht said:
Hi,

But it is probably failing or has even already failed. Check the smartctl data and dmesg/kernel log.

Additionally, I want to add that if the drive had already failed, I wouldn't expect Proxmox to boot at all, and it does.
I can log in via SSH, but that is about it.

t.lamprecht · Apr 6, 2021

Ralms said:
Additionally, I want to add that if the drive had already failed, I wouldn't expect Proxmox to boot at all, and it does.
I can log in via SSH, but that is about it.

That does not necessarily mean that the drive(s) are OK; reading existing data can often still work for a while. Booting does not require much, only a few hundred MiB.

Ralms said:
[ 2.562574] ata2.00: failed to resume link (SControl 0)
[ 2.878565] ata1.01: failed to resume link (SControl 0)
[ 2.889597] ata1.00: SATA link down (SStatus 0 SControl 300)
[ 2.889613] ata1.01: SATA link down (SStatus 4 SControl 0)
[ 3.602570] ata2.01: failed to resume link (SControl 0)
[ 3.613577] ata2.00: SATA link down (SStatus 4 SControl 0)
[ 3.613594] ata2.01: SATA link down (SStatus 4 SControl 0)

Normally that means a (partially) failed controller or disk.
And this really happened suddenly, after a year of working fine with Proxmox VE, with no other change involved (besides installing iperf, which I think we can ignore)? No big updates (new kernel), power loss, ...?

You can try booting an older kernel to see if that helps (in case of a regression between your controller and a newer kernel version). I honestly do not see a big chance there, but it's cheap to try. Otherwise I'd still suspect the hardware: the controller first, or else the disks.
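To see which older kernels are available to boot, a small sketch like the following can help. The function name `list_kernels` is hypothetical, and it assumes kernel images follow the usual Debian `vmlinuz-<version>` naming under `/boot` and that GNU `sort -V` is available.

```shell
# Hypothetical helper: print the versions of the kernel images found in a
# boot directory (default /boot), lowest version first.
# Assumes Debian-style "vmlinuz-<version>" file names and GNU sort.
list_kernels() {
    boot_dir="${1:-/boot}"
    for image in "$boot_dir"/vmlinuz-*; do
        # Skip the unexpanded glob when no kernel images exist.
        [ -e "$image" ] || continue
        basename "$image" | sed 's/^vmlinuz-//'
    done | sort -V
}
```

Compare the output of `list_kernels` with `uname -r` (the running kernel). Assuming the host boots via GRUB, an older entry can then be picked under "Advanced options" in the boot menu.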
