[SOLVED] Why do I get "pve-root needs manual fsck" at INITRAMFS?

JonMikelV

Member
Feb 13, 2022
I've been using Proxmox VE 7.0.1 for about 6 months and have just run into the second instance of the GUI not working (it starts to load but nothing is displayed, and ExtJS errors show in the browser console).

Connecting over SSH and using systemctl to try to restart the UI just returned errors (sorry, I don't have them available now), and attempting to update complained about a read-only drive somewhere (again, no actual message available to share).

From the physical console I am able to reboot, but it just shows these messages:
/dev/mapper/pve-root contains a file system with errors, check forced.
/dev/mapper/pve-root:
Inodes that were part of a corrupted orphan linked list found.

/dev/mapper/pve-root: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
fsck exited with status code 54
the root filesystem on /dev/mapper/pve-root requires a manual fsck

I ran fsck -f -c -y /dev/mapper/pve-root and it scrolled through a bunch of "Free blocks count wrong" and "Inode bitmap differences" messages, each followed by "Fix? yes", and ultimately said /dev/mapper/pve-root: ***** FILE SYSTEM WAS MODIFIED *****.

I ran it again just to be sure and didn't get any of the scary "count wrong" or "bitmap differences" stuff, but still got the final FILE SYSTEM WAS MODIFIED for some reason. The console reboot command didn't work at that point, so I did a manual power cycle.
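
For anyone scripting around this: per the fsck(8) man page the exit status is a bit mask, so a small helper can spell out what a given status means. This is only a minimal sketch; the function name describe_fsck_status is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: decode the fsck exit status bit mask documented in fsck(8).
describe_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    out=""
    # Each documented bit and its meaning, lowest bit first.
    for pair in "1:errors corrected" "2:system should be rebooted" \
                "4:errors left uncorrected" "8:operational error" \
                "16:usage or syntax error" "32:checking canceled by user request" \
                "128:shared-library error"; do
        bit=${pair%%:*}
        text=${pair#*:}
        if [ $(( status & bit )) -ne 0 ]; then
            out="${out:+$out; }$text"
        fi
    done
    # Bits that fsck(8) does not document end up here.
    [ -n "$out" ] || out="unknown status"
    echo "$out"
}
```

It would be called right after a check on an unmounted filesystem, e.g. "fsck -f /dev/mapper/pve-root; describe_fsck_status $?".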

While this does seem to have resolved my current issue (GUI & VMs work and I'm now on v7.1-10) without having to re-install, does anybody have thoughts on why this has happened to me twice now and/or how I can avoid it in the future?

Thanks.
 
This sounds like a failing disk.
Have you checked your disks with smartctl?
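
If it helps, here is a minimal sketch for checking every disk in one go. It assumes smartmontools is installed and that the disks show up as /dev/sd? or /dev/nvme?n1; the helper name smart_verdict is made up. It only looks at the overall health line, so the full "smartctl -a" output is still worth reading.

```shell
#!/bin/sh
# Hypothetical helper: reduce `smartctl -H` output (read from stdin) to one word.
# ATA/NVMe drives print "... self-assessment test result: PASSED",
# SCSI/SAS drives print "SMART Health Status: OK".
smart_verdict() {
    awk '
        /self-assessment test result:/ { print $NF; found = 1; exit }
        /SMART Health Status:/         { print $NF; found = 1; exit }
        END { if (!found) print "UNKNOWN" }
    '
}

# Usage (as root), assuming the device naming mentioned above:
#   for dev in /dev/sd? /dev/nvme?n1; do
#       [ -e "$dev" ] || continue
#       printf '%s: %s\n' "$dev" "$(smartctl -H "$dev" | smart_verdict)"
#   done
```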
 
Thanks for the smart suggestion. All disks come up with no errors, which matches the PASSED result that I'm seeing in the Disks UI.

I'm still new to Proxmox so I don't know all the places to look, but I did find the Syslog in the GUI, and it has some older instances of:
proxmox kernel: EDAC MC0: 1 UE x38 UE on mc#0csrow#1channel#0 or mc#0csrow#1channel#1 (csrow:1 page:0x0 offset:0x0 grain:1073741824)

If I'm reading things correctly, that indicates a memory issue, which is odd as I've run Memtest86+ at boot recently without any errors. Hopefully it's a thermal issue with a bad DIMM and not a motherboard problem.
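
Besides the syslog, the kernel's EDAC driver also keeps running error counters in sysfs, which makes it easy to see whether new errors keep arriving. A rough sketch, assuming the EDAC driver for the board is loaded and uses the usual /sys/devices/system/edac/mc layout; the function name edac_counts is made up.

```shell
#!/bin/sh
# Hypothetical helper: print corrected (ce) and uncorrected (ue) error counts
# per memory controller. The base directory is an argument so the function can
# be pointed at a copy of the tree; it defaults to the usual sysfs location.
edac_counts() {
    base=${1:-/sys/devices/system/edac/mc}
    for mc in "$base"/mc*; do
        [ -r "$mc/ue_count" ] || continue
        printf '%s ce=%s ue=%s\n' "${mc##*/}" \
            "$(cat "$mc/ce_count" 2>/dev/null || echo '?')" \
            "$(cat "$mc/ue_count")"
    done
}
```

Running edac_counts before and after a memory test shows whether the ue count has moved.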

[A little while later....] I installed memtester and ran it on the available memory while the system was actively hosting a VM, and it reported failures (along with more entries in the syslog). Guess it's time for a new stick.

Any other thoughts / suggestions (or corrections on my assumptions) are welcome. :-)
 
So the memtest errors occurred while the host was running with a typical load?
Usually it takes a lot of passes over the complete memory to find something (booting into memtest makes sure the whole memory can be tested).

I'd suggest replacing your faulty DIMM(s) as soon as possible.
 
It's odd: multiple runs of Memtest86+ from GRUB didn't find any issue, but using memtester in the Proxmox console I was able to trigger the error reliably enough that I'm confident I got the bad DIMM replaced (no more memtester errors).

While the csrow# part of the error helped pinpoint the specific stick with issues, I did the following to make sure the error was triggerable before and NOT triggerable after replacing it.
- log into the console (local or shell via the UI)
- run "vmstat" (to determine how much memory I could play with)
- run "memtester #####k 1" (##### is the value from the vmstat "free" column, in KB; the 1 means run one pass)
- check that "EDAC MC0" errors did (before the memory swap) or did not (after the memory swap) appear in the syslog
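
The steps above can be sketched as a small script. This is only a sketch under assumptions: it expects vmstat, memtester and dmesg on the host and needs root, the names free_kb_from_vmstat and run_dimm_check are made up, and the 10% headroom is an arbitrary margin added here (not part of the steps above) so the host and its VMs are not starved.

```shell
#!/bin/sh
# Hypothetical helper: read vmstat output on stdin and print the "free" column (KiB).
# The column is located by its header name instead of a fixed position.
free_kb_from_vmstat() {
    awk '
        col == 0 { for (i = 1; i <= NF; i++) if ($i == "free") col = i; next }
        { print $col; exit }
    '
}

# Hypothetical wrapper for the manual procedure: count EDAC kernel messages,
# run one memtester pass over most of the free memory, then count again.
run_dimm_check() {
    free_kb=$(vmstat | free_kb_from_vmstat)
    test_kb=$(( free_kb * 9 / 10 ))     # leave ~10% headroom (arbitrary margin)
    before=$(dmesg | grep -c 'EDAC MC')
    memtester "${test_kb}K" 1           # one pass over the chosen amount
    after=$(dmesg | grep -c 'EDAC MC')
    echo "EDAC messages in kernel log: before=$before after=$after"
}
```

Calling run_dimm_check once before and once after swapping the DIMM gives the before/after comparison described in the list; a count that grows during the run points at memory that is still faulty.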

I'm considering this SOLVED under the assumption that memory issues were causing bad writes, thus corrupting the pve-root filesystem mentioned in the original error.

Thanks for your assistance!
 
Hello, I got the same error and ran the fsck command, and the system booted again. But I keep getting complete system crashes where the system is totally unresponsive and I need to hard reboot to get it running again. Sometimes the crash happens the same day, sometimes after 2 or 3 days. Before that the system had been running fine for 6 months, but then I decided to update and restart it. Since then I get this weird behavior. I also looked in the logs but couldn't find anything.
I checked the boot NVMe drive and all SMART data are fine. I ran memtest86 twice and all tests passed. I also ran memtester and all is fine. I have no clue where else I should look.
On a side note: in one of these crashes even a long press of the power button didn't shut down the system and I had to pull the power cable. I got worried, but the system booted fine afterwards.
 
Omar, the long-press power-off involves just the power button, the firmware on the motherboard, and the power supply. It should always work if those three things are functioning. A long press of the power button sends no signal to the OS; it just turns off the power supply.
 
@bwdavis thank you for your reply. So you suggest that the root problem could be the motherboard or the power supply? In that case, should I be able to find any kind of log? When I checked the logs again I found that the workers were simply running tasks and then suddenly nothing more: the entries stop until I reboot the system, and then the log starts again. For example, the last log entry I had today was the temperature reading for one of the HDDs, and then it goes blank.
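
In case it helps with the log hunt: the systemd journal of the boot that crashed is only kept across the reboot if journald stores it on disk, which with the default Storage=auto setting depends on /var/log/journal existing. A small sketch to check that; the function name journal_is_persistent is made up, and the optional path argument exists only so it can be tried against another directory.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the directory used for persistent journal
# storage exists (with Storage=auto, journald then keeps logs across reboots).
journal_is_persistent() {
    [ -d "${1:-/var/log/journal}" ]
}

# Usage, assuming systemd-journald with its default Storage=auto:
#   journal_is_persistent || echo "journal is volatile; previous boots are not kept"
#   journalctl --list-boots      # boots the journal knows about
#   journalctl -b -1 -e          # jump to the end of the previous boot's log
```

A log that simply stops with no error is typical of a hard hang or power loss, so the last lines before the gap are usually the only hint the journal can give.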
 