Errors after reboot

voidindigo

Well-Known Member
Sep 18, 2018
I ran into a fairly serious problem today on our development cluster.

I upgraded my 3-node Proxmox 8.3.4 cluster the other day, and the upgrade pulled in a new dbus package, which required a system reboot. After the reboot, the VMs failed to start, and I saw errors like:

proxmox TASK ERROR: activating LV 'images/images' failed: Activation of logical volume images/images is prohibited while logical volume images/images_tmeta is active.

The VMs themselves failed to start with this message:
cannot perform fix without a full examination
Usage: thin_check [options] {device|file}
Options:
{-q|--quiet}
{-h|--help}
{-V|--version}
{-m|--metadata-snap}
{--auto-repair}
{--override-mapping-root}
{--clear-needs-check-flag}
{--ignore-non-fatal-errors}
{--skip-mappings}
{--super-block-only}
TASK ERROR: activating LV 'images/images' failed: Check of pool images/images failed (status:1). Manual repair required!
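For anyone hitting the same thing, these are the read-only checks I'd run first to see whether the _tmeta/_tdata sub-LVs got activated ahead of the pool. The `try_cmd` helper and the `images` VG name are just from my setup; the helper skips any tool that isn't installed, so it's harmless to paste anywhere:

```shell
#!/bin/sh
# Sketch: read-only diagnostics for the activation failure. try_cmd only
# runs a tool if it is installed, and never aborts on a failing command.
try_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "== $* =="
        "$@" || echo "($1 exited with status $?)"
    else
        echo "($1 not installed here)"
    fi
}

# 'a' in the lv_attr column means the sub-LV is already active
try_cmd lvs -a -o lv_name,lv_attr,pool_lv images
# device-mapper entries that may be holding the pool's components
try_cmd dmsetup ls
```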

I found multiple threads, the bulk of which said to try things like:
lvchange -an images/images_tdata
lvchange -an images/images_tmeta
lvchange -ay images/images
Or:
lvchange -an images/images
lvconvert --repair images/images
lvchange -ay images/images
Neither of those worked for me. I also saw some people suggesting the image storage might be full... which may well be my problem. EDIT: I don't think that's my problem after all...

The only thing that I can find in journalctl is:
Feb 28 13:18:55 proxmox19 pvestatd[3992]: activating LV 'images/images' failed: Check of pool images/images failed (status:1). Manual repair required!

I found a link to THIS THREAD that talks about changing parameters for thin_check ... but that didn't work for me either. Eventually I found a NOTE HERE to try this:
lvconvert --repair images/images
but that results in the message:
truncating metadata device to 4161600 4k blocks
which is apparently problematic.

Eventually I even found a post (can't find it now) that said REMOVING the thin_check options solved their problem.
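For reference, the thin_check knob those threads are talking about lives in /etc/lvm/lvm.conf, in the `global` section. This is just a sketch of the stock setting plus the `--skip-mappings` variant people suggested (that flag appears in the thin_check usage output above), not something I'm recommending:

```
# /etc/lvm/lvm.conf (global section) -- sketch, not a recommendation
global {
    # Stock default:
    thin_check_options = [ "-q", "--clear-needs-check-flag" ]

    # Variant some threads suggest for large _tmeta volumes; it skips
    # the block-mapping scan during activation:
    # thin_check_options = [ "-q", "--clear-needs-check-flag", "--skip-mappings" ]
}
```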

So, after adding, converting, rebooting, removing, and rebooting again, I finally have the systems back up. I honestly have no idea why they came back, except that I suspect it's more a timing problem than anything else. I really don't know...
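For what it's worth, reconstructing what I ran, the repair sequence looked roughly like the following. I've wrapped it in a DRY_RUN guard so it only prints the commands unless you explicitly opt in; the pool name is from my setup, and the _meta0 cleanup reflects the fact that lvconvert --repair keeps the old metadata around as <pool>_meta0:

```shell
#!/bin/sh
# Hedged sketch of the thin-pool repair path. DRY_RUN=1 (the default)
# only prints the commands; set DRY_RUN=0 on the affected node to run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    echo "+ $*"
    [ "$DRY_RUN" = "0" ] && "$@"
    return 0
}

run lvchange -an images/images        # pool must be inactive before repair
run lvconvert --repair images/images  # swaps repaired metadata into place;
                                      # old metadata kept as images/images_meta0
run lvchange -ay images/images        # reactivate, then check the VMs
# Only after verifying everything works:
# run lvremove -y images/images_meta0
```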

My questions:

  1. Does anyone know definitively what causes this, and what the correct steps are to avoid or repair it should it happen again?
  2. How do you "manually repair" a system in that state?
  3. Is there any way to get a warning when thin pool space is running low, rather than having VMs fail to start after a reboot?
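On question 3, the kind of early warning I have in mind would be something like this from cron: parse data_percent and metadata_percent out of lvs and complain above a threshold. The 80% limit, the pool name, and the canned demo line are placeholders:

```shell
#!/bin/sh
# Sketch: warn when a thin pool's data or metadata usage crosses a threshold.
# The 80% limit is an arbitrary example.
THRESHOLD=80

check_pool_usage() {
    # Reads "name data% meta%" lines on stdin; prints a warning per pool
    # whose data or metadata usage exceeds $THRESHOLD.
    awk -v limit="$THRESHOLD" '
        NF >= 3 {
            gsub(/%/, "", $2); gsub(/%/, "", $3)
            if ($2 + 0 > limit) printf "WARNING: %s data at %s%%\n", $1, $2
            if ($3 + 0 > limit) printf "WARNING: %s metadata at %s%%\n", $1, $3
        }'
}

# On a real node you would pipe in live numbers, e.g.:
#   lvs --noheadings -o lv_name,data_percent,metadata_percent images | check_pool_usage
# Demo with canned output so the sketch runs anywhere:
printf 'images 91.20 85.10\n' | check_pool_usage
```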
I've been managing this cluster for a few years now, but I'm no Proxmox guru by any stretch. I feel like I'm missing some basic steps here; any help is appreciated.
Thanks
 
BUMP: No thoughts on this? This is a mission-critical system for us. Does nobody have ideas on how to predict or resolve this?