FSCK failure on reboot (with fix)

ijcd

Nov 20, 2008
I received this from one of our operations staff:

Two of the grid servers rebooted due to a power outage and failed to come back up properly.

The reason in both cases was a failed root filesystem fsck.

On looking at the systems, root is an LVM volume, and in /etc/rcS.d/, checkroot.sh runs before LVM is configured. At that point the LVM device-mapper nodes don't exist yet, so fsck cannot find the root filesystem.

The fix is to manually start services required for LVM, then run fsck (checking that root is mounted read-only), and reboot.

I've written a script to do this. It currently lives on grid02 and is pasted inline below, since the forum rejected the attachment as an invalid file.

=========================================
#!/bin/sh
# Karsten M. Self
# Thu Jan 22 23:18:04 GMT 2009
#
# If we fail fsck on boot we need to enable LVM before we can fsck.
#

/etc/init.d/glibc.sh start
/etc/init.d/mountkernfs.sh start
/etc/init.d/udev start
/etc/init.d/mountdevsubfs.sh start
/etc/init.d/libdevmapper1.02 start
/etc/init.d/lvm start

# Hail Mary fsck
mount -o remount,ro / && e2fsck -y /dev/mapper/pve-root

echo "Check for the smell of burning rubber. You should probably reboot."
 
Code:
/etc/init.d/glibc.sh start
/etc/init.d/mountkernfs.sh start
/etc/init.d/udev start
/etc/init.d/mountdevsubfs.sh start

Those things are already started?

/etc/rcS.d/S01glibc.sh
/etc/rcS.d/S02mountkernfs.sh
/etc/rcS.d/S03udev
/etc/rcS.d/S04mountdevsubfs.sh

How can I reproduce the bug?
 
The bug manifested itself during power failure for us. The machines weren't able to fsck when they came back up because LVM wasn't enabled before fsck was attempted.

I think the problem is that checkroot runs before LVM is started, but I will point our ops person at this thread and see if he can give us more information.

/etc/rcS.d/S10checkroot.sh
/etc/rcS.d/S26lvm
/etc/rcS.d/S30checkfs.sh
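
The ordering above can also be checked mechanically. What follows is only a minimal sketch, assuming the usual `SNN<name>` naming of the links in /etc/rcS.d; the helper names `seq_of` and `runs_before` are made up for illustration and are not part of any init system.

```shell
#!/bin/sh
# Hypothetical helper: print the two-digit sequence number of an rcS.d link,
# e.g. "/etc/rcS.d/S10checkroot.sh" -> "10". Prints nothing if the name
# does not follow the SNN<name> pattern.
seq_of() {
    name=${1##*/}    # strip any leading directory
    printf '%s\n' "$name" | sed -n 's/^S\([0-9][0-9]\).*/\1/p'
}

# Hypothetical helper: succeed if the first link is ordered before the second.
runs_before() {
    a=$(seq_of "$1")
    b=$(seq_of "$2")
    [ -n "$a" ] && [ -n "$b" ] && [ "$a" -lt "$b" ]
}

# The two link names are the ones listed in this thread.
if runs_before /etc/rcS.d/S10checkroot.sh /etc/rcS.d/S26lvm; then
    echo "checkroot.sh is ordered before lvm"
fi
```

The comparison only looks at the link names, so it says nothing about whether the scripts succeed when they run.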

To reproduce, I think you would have to shut the machine down uncleanly (for example, by killing power abruptly) so that it tries to fsck on boot.
 
From our ops staff:
================================
Waiting for account to be activated.

Answering the question (pass this on to the forum if you wish): to reproduce, shut down uncleanly. Killing power, an IPMI reset, or other forced host kill should do the trick.

The system will attempt fsck at boot. Fsck fails for the reasons previously stated (the LVM subsystem hasn't been initialized, so fsck can't identify the devices as specified in /etc/fstab).

I may have restarted a couple of basic services that were already started. The script restarts a set of services that is sufficient, though perhaps not all of them are necessary. There's no harm in restarting the ones selected.

Cheers.
 
Ok, I will try to reproduce that on Monday.

I also wonder how that is handled by other distributions?
 
Simply use '/lib/init/rw/rootdev' as device name.

Code:
fsck.ext3 /lib/init/rw/rootdev

No need to run any scripts.

- Dietmar
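
The two suggestions in this thread (the /lib/init/rw/rootdev node and the /dev/mapper/pve-root node) can be combined into one guard. This is only a sketch: the helper name `pick_root_device` is hypothetical, and the actual remount and fsck commands are left commented out so that nothing touches a disk by accident.

```shell
#!/bin/sh
# Hypothetical helper: print the first argument that exists as a block
# device and succeed; fail if none of the candidates exists.
pick_root_device() {
    for candidate in "$@"; do
        if [ -b "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Candidate paths are the two device names mentioned in this thread.
if dev=$(pick_root_device /lib/init/rw/rootdev /dev/mapper/pve-root); then
    echo "root device to check: $dev"
    # From the maintenance shell, with root remounted read-only:
    # mount -o remount,ro / && fsck.ext3 "$dev"
else
    echo "no root block device found; LVM may not be started yet"
fi
```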
 
Should this be built into the normal startup routine then, or are you saying to run this manually if there is a problem after a power blip?
 
fsck does automatically run, but it can fail if there are serious errors on the disk. In such cases you need to run fsck manually.
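
When running fsck by hand, its exit status tells you whether the errors were fixed or manual work remains. The sketch below decodes that status using the bit values documented in fsck(8); the function name `describe_fsck_status` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: decode an fsck exit status, which is a bitmask
# (values as documented in fsck(8)), into one readable line per set bit.
describe_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    for pair in \
        "1:filesystem errors corrected" \
        "2:system should be rebooted" \
        "4:filesystem errors left uncorrected" \
        "8:operational error" \
        "16:usage or syntax error" \
        "32:checking canceled by user request" \
        "128:shared-library error"
    do
        bit=${pair%%:*}    # number before the first colon
        msg=${pair#*:}     # text after the first colon
        if [ $((status & bit)) -ne 0 ]; then
            echo "$msg"
        fi
    done
    return 0
}

# Example: decode a sample status value.
describe_fsck_status 4
```

In practice you would pass `$?` right after the fsck call, for example `fsck.ext3 /lib/init/rw/rootdev; describe_fsck_status $?`.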
 
