local-LVM not available after Kernel update on PVE 7

Jul 27, 2020
Updated our Proxmox PVE 7 server this morning, and upon reboot local-lvm was not available and the VMs would not start. Below are the updates applied:
libpve-common-perl: 7.0-6 ==> 7.0-9
pve-container: 4.0-9 ==> 4.0-10
pve-kernel-helper: 7.0-7 ==> 7.1-2
qemu-server: 7.0-13 ==> 7.0-14

PVE said a reboot was required.

Once rebooted, local-lvm showed 0 GB in the web GUI, and I got a "start failed" error when I tried to start VMs (see attachments).

I ran pvdisplay, lvdisplay, and vgdisplay (see attachments); lvdisplay shows all the logical volumes as "Not available".

I ran lvchange -ay pve/data to activate pve/data. local-lvm then showed as active with a usage percentage, but the VMs still wouldn't start.

I then ran lvchange -ay for the "LV Path" of each VM disk (e.g. lvchange -ay /dev/pve/vm-209002-disk-0). After that, the logical volumes showed as available and the VMs started. I have to run lvchange -ay for every VM disk's "LV Path" each time the PVE server is rebooted. Why? How can I resolve this permanently?
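As a stopgap until the root cause is found, the per-disk commands can be collapsed into one step. The sketch below is a POSIX shell helper; the function names inactive_lvs and activate_inactive_lvs are made up for illustration and are not part of LVM or Proxmox VE. It assumes the lvs fields vg_name, lv_name and lv_attr, where the 5th character of lv_attr is "a" for an active LV. Running vgchange -ay pve, which activates all LVs in the volume group at once, should have much the same effect in a single command. Neither approach fixes the missing autoactivation itself.

```shell
#!/bin/sh
# Hypothetical helpers sketching the manual workaround from this thread.

# Read "vg lv attr" lines (as printed by
#   lvs --noheadings -o vg_name,lv_name,lv_attr)
# and print "vg/lv" for every LV whose state flag (5th char of lv_attr)
# is not 'a', i.e. every LV that is currently not active.
inactive_lvs() {
    awk 'NF >= 3 && substr($3, 5, 1) != "a" { print $1 "/" $2 }'
}

# Activate every inactive LV in the given volume group (default: pve).
# Must run as root on the PVE host; lvchange is called once per LV.
activate_inactive_lvs() {
    local vg="${1:-pve}"
    lvs --noheadings -o vg_name,lv_name,lv_attr "$vg" \
        | inactive_lvs \
        | xargs -r -n 1 lvchange -ay
}
```

After a reboot, running activate_inactive_lvs pve as root would replace the long list of individual lvchange -ay calls.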
 


Hi,
Is there anything interesting in /var/log/syslog about lvm/pvscan? Please also share your /etc/lvm/lvm.conf and the output of pveversion -v.
 
Hi Fabian

I have attached the syslog and lvm.conf. The only thing that I noted in the syslog was the following line:
Oct 12 06:50:27 pve lvm[825]: pvscan[825] VG pve skip autoactivation.

Here is the output of pveversion -v:
pveversion -v
proxmox-ve: 7.0-2 (running kernel: 5.11.22-5-pve)
pve-manager: 7.0-11 (running version: 7.0-11/63d82f4e)
pve-kernel-helper: 7.1-2
pve-kernel-5.11: 7.0-8
pve-kernel-5.11.22-5-pve: 5.11.22-10
pve-kernel-5.11.22-4-pve: 5.11.22-9
pve-kernel-5.11.22-1-pve: 5.11.22-2
ceph-fuse: 15.2.13-pve1
corosync: 3.1.5-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve1
libproxmox-acme-perl: 1.3.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-9
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-2
libpve-storage-perl: 7.0-11
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.9-2
proxmox-backup-file-restore: 2.0.9-2
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.0-10
pve-docs: 7.0-5
pve-edk2-firmware: 3.20200531-1
pve-firewall: 4.2-4
pve-firmware: 3.3-2
pve-ha-manager: 3.3-1
pve-i18n: 2.5-1
pve-qemu-kvm: 6.0.0-4
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-14
smartmontools: 7.2-1
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.5-pve1
 


I also get "VG pve skip autoactivation." in my logs, and my /etc/lvm/lvm.conf is identical, so both of those should be fine.

I noticed two things. First, the output of lvdisplay contains an error:
LV Name data
VG Name pve
LV UUID 56Hxkt-8ifA-xPKr-8a5t-qW25-Nung-HEb9Pa
LV Write Access read/write
LV Creation host, time proxmox, 2021-09-09 14:40:33 -0700
Expected thin-pool segment type but got NULL instead.
LV Pool metadata data_tmeta
LV Pool data data_tdata
LV Status NOT available
LV Size 1.67 TiB
Current LE 437892
Segments 1
Allocation inherit
Read ahead sectors auto

Second, and I'm not sure if this is related, LVM reads from /dev/urandom before the kernel's random number generator has been initialized:
Oct 12 06:50:27 pve kernel: [ 3.432021] random: lvm: uninitialized urandom read (4 bytes read)

Please provide the output of lvs -a and lvdisplay pve/data -vvv. Maybe there's more information there.
 
Do you want the output of lvs -a and lvdisplay pve/data -vvv while the error is occurring or after I activate the logical volumes manually? Activating the logical volumes manually has been the only way that I have been able to get the VMs on the PVE to start. This is a production server so I can't keep it offline.
 
Could this be an issue with HP hardware and Proxmox 7? These servers worked fine with Proxmox 6, though. All of our test PVEs run on various Dell hardware using the no-subscription package repository, and we haven't had an issue with them. That is why I went ahead and upgraded two of our five production PVE servers to Proxmox 7. I have since done a clean rebuild of these same two PVE servers, and the problem persists.
 
It appears that the "Expected thin-pool segment type but got NULL instead." error only shows up before the activation commands are run, so it's probably better to collect the output right after the next reboot.
So both servers show the same symptoms? And the thin pool was re-created from scratch too?

I also noticed in the syslog
Code:
Oct 10 02:38:53 pve smartd[929]: Device: /dev/sda [SAT], SMART Usage Attribute: 7 Seek_Error_Rate changed from 100 to 200
which looks like the drive is not fully healthy anymore.
 
Yes, both servers are showing the same symptoms. One is a HP Proliant Gen 9 Xeon server and the other is a HP Z620 Workstation. They have both been nuked and paved so yes the thin pools were re-created too. I rebooted the HP Z620 Workstation today so I could run the requested lvs -a and lvdisplay pve/data -vvv commands. The output is attached. As for the SMART attribute, I'm still digging into that but the drive is fairly new and is enterprise grade.

It just seems odd that all of our test PVEs on old Dell PCs run fine with PVE 7, while the HP production servers have issues with it. We don't have an HP machine outside of production to test on. Next week, we may try moving the HDD into a Dell to see whether the issue still happens. These servers worked fine with PVE 6, so we may need to go back to it. If so, is there an easy way to do that?
 


@bdfy @ShEV Did you also experience the issues after upgrading to Proxmox VE 7 or did it coincide with a kernel update? In the latter case, can you try booting with an earlier kernel and see if it works?
 
I upgraded about a month ago and didn't run into any problems right after the upgrade. Everything works fine after a manual `vgchange -ay`, but LVM is not autoactivated on boot. I don't have a chance to test with an older kernel right now, sorry.
 
I originally had Proxmox 7 installed. The problem arose after updating the system, which changed the kernel from version 5.11.22-4 to 5.11.22-5. Booting the previous kernel doesn't help. After boot, I have to run the lvchange -ay command manually and then start the machines.


Update log:
Code:
Start-Date: 2021-10-23  10:18:24
Commandline: apt dist-upgrade
Install: swtpm-libs:amd64 (0.7.0~rc1+2, automatic), swtpm-tools:amd64 (0.7.0~rc1+2, automatic), libopts25:amd64 (1:5.18.16-4, automatic), swtpm:amd64 (0.7.0~rc1+2, automatic), libjson-glib-1.0-common:amd64 (1.6.2-1, automatic), libtpms0:amd64 (0.9.0+1, automatic), gnutls-bin:amd64 (3.7.1-5, automatic), libunbound8:amd64 (1.13.1-1, automatic), libjson-glib-1.0-0:amd64 (1.6.2-1, automatic), pve-kernel-5.11.22-5-pve:amd64 (5.11.22-10, automatic), libgnutls-dane0:amd64 (3.7.1-5, automatic)
Upgrade: reportbug:amd64 (7.10.3, 7.10.3+deb11u1), libperl5.32:amd64 (5.32.1-4+deb11u1, 5.32.1-4+deb11u2), libpam-runtime:amd64 (1.4.0-9, 1.4.0-9+deb11u1), krb5-locales:amd64 (1.18.3-6, 1.18.3-6+deb11u1), libgssapi-krb5-2:amd64 (1.18.3-6, 1.18.3-6+deb11u1), corosync:amd64 (3.1.2-pve2, 3.1.5-pve1), pve-firmware:amd64 (3.3-1, 3.3-2), perl:amd64 (5.32.1-4+deb11u1, 5.32.1-4+deb11u2), tzdata:amd64 (2021a-1, 2021a-1+deb11u1), pve-qemu-kvm:amd64 (6.0.0-3, 6.0.0-4), libproxmox-acme-perl:amd64 (1.3.0, 1.4.0), python3-reportbug:amd64 (7.10.3, 7.10.3+deb11u1), libpve-storage-perl:amd64 (7.0-10, 7.0-12), libvotequorum8:amd64 (3.1.2-pve2, 3.1.5-pve1), libkrb5support0:amd64 (1.18.3-6, 1.18.3-6+deb11u1), libquorum5:amd64 (3.1.2-pve2, 3.1.5-pve1), libcmap4:amd64 (3.1.2-pve2, 3.1.5-pve1), proxmox-backup-file-restore:amd64 (2.0.9-2, 2.0.11-1), libc6:amd64 (2.31-13, 2.31-13+deb11u2), locales:amd64 (2.31-13, 2.31-13+deb11u2), libcfg7:amd64 (3.1.2-pve2, 3.1.5-pve1), libkrb5-3:amd64 (1.18.3-6, 1.18.3-6+deb11u1), libpam-modules:amd64 (1.4.0-9, 1.4.0-9+deb11u1), qemu-server:amd64 (7.0-13, 7.0-16), libpve-access-control:amd64 (7.0-4, 7.0-5), pve-container:amd64 (4.0-9, 4.1-1), libproxmox-acme-plugins:amd64 (1.3.0, 1.4.0), libcpg4:amd64 (3.1.2-pve2, 3.1.5-pve1), pve-i18n:amd64 (2.4-1, 2.5-1), base-files:amd64 (11.1, 11.1+deb11u1), libk5crypto3:amd64 (1.18.3-6, 1.18.3-6+deb11u1), rsync:amd64 (3.2.3-4, 3.2.3-4+deb11u1), proxmox-backup-client:amd64 (2.0.9-2, 2.0.11-1), libpam-modules-bin:amd64 (1.4.0-9, 1.4.0-9+deb11u1), libpve-http-server-perl:amd64 (4.0-2, 4.0-3), pve-manager:amd64 (7.0-11, 7.0-13), libpve-common-perl:amd64 (7.0-6, 7.0-10), perl-base:amd64 (5.32.1-4+deb11u1, 5.32.1-4+deb11u2), libpam0g:amd64 (1.4.0-9, 1.4.0-9+deb11u1), libc-l10n:amd64 (2.31-13, 2.31-13+deb11u2), libc-bin:amd64 (2.31-13, 2.31-13+deb11u2), pve-kernel-5.11.22-4-pve:amd64 (5.11.22-8, 5.11.22-9), pve-kernel-5.11:amd64 (7.0-7, 7.0-8), pve-firewall:amd64 (4.2-2, 4.2-4), libcorosync-common4:amd64 (3.1.2-pve2, 3.1.5-pve1), perl-modules-5.32:amd64 (5.32.1-4+deb11u1, 5.32.1-4+deb11u2), libnozzle1:amd64 (1.21-pve1, 1.22-pve1), libknet1:amd64 (1.21-pve1, 1.22-pve1), pve-edk2-firmware:amd64 (3.20200531-1, 3.20210831-1), pve-kernel-helper:amd64 (7.0-7, 7.1-2)
End-Date: 2021-10-23  10:19:56

The boot logs contain the following:
Code:
Oct 23 11:03:24 pve1 lvm[961]:   pvscan[961] PV /dev/sdc online, VG vms is complete.
Oct 23 11:03:24 pve1 lvm[961]:   pvscan[961] VG vms skip autoactivation.
 
What might be happening is the following:
  1. pvscan is executed early during boot (initrd/initramfs stage).
    1. It creates the /run/lvm/vgs_online/<vgname> file.
    2. Autoactivation is attempted (and in your case it fails for some reason).
  2. Later during boot, pvscan is started by systemd.
    1. It sees that /run/lvm/vgs_online/<vgname> exists and skips autoactivation (hence the message in the syslog). This happens regardless of whether autoactivation worked or not earlier.
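To check whether a host is in the state described above (online marker present, LVs still inactive), a small diagnostic can compare the marker files with the lvs output. This is a sketch that assumes the markers are named /run/lvm/vgs_online/<vgname> as described in the steps above; skipped_vgs is a hypothetical name. An inactive LV on its own is not necessarily a fault, so the result is only a hint.

```shell
#!/bin/sh
# Hypothetical diagnostic: list VGs that have an autoactivation marker
# (so a later pvscan will skip them) but still contain inactive LVs.
#   $1     directory holding the markers (default: /run/lvm/vgs_online)
#   stdin  "vg lv attr" lines from: lvs --noheadings -o vg_name,lv_name,lv_attr
skipped_vgs() {
    local dir="${1:-/run/lvm/vgs_online}"
    # VG names with at least one LV whose 5th lv_attr char is not 'a'
    awk 'NF >= 3 && substr($3, 5, 1) != "a" { print $1 }' \
        | sort -u \
        | while read -r vg; do
            # Print the VG only if its "online" marker file exists
            if [ -e "$dir/$vg" ]; then
                echo "$vg"
            fi
        done
}
```

On an affected host right after boot, lvs --noheadings -o vg_name,lv_name,lv_attr | skipped_vgs would be expected to print pve if the sequence above is what happens.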
Could you try adding the debug parameter to the kernel command line (see here at the bottom for how to edit it) and share the contents of /run/initramfs/initramfs.debug after rebooting?
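When editing the command line by hand, it is easy to end up with the parameter twice. A minimal POSIX sh helper (add_debug_param is a hypothetical name) that appends debug to a command-line string unless it is already one of its words:

```shell
#!/bin/sh
# Hypothetical helper: print the kernel command line given in $1 with
# "debug" appended, unless "debug" is already one of its words.
add_debug_param() {
    case " $1 " in
        *" debug "*) printf '%s\n' "$1" ;;               # already present
        *)           printf '%s\n' "${1:+$1 }debug" ;;   # append, no leading space if empty
    esac
}
```

For example, add_debug_param "quiet" prints quiet debug. On a GRUB-booted host, the result goes into GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by update-grub. On a host booted via systemd-boot, it goes into /etc/kernel/cmdline, followed by proxmox-boot-tool refresh. Remove debug again once the log has been collected.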
 
I won't be able to do the debugging on the affected production servers until next week. However, I did run it on one of our test servers that runs PVE 7 with kernel 5.11.22-5 but doesn't have the LVM issue. Comparing the debug file from that working test server with ShEV's debug file, I found "Volume group "PVE" not found - cannot process volume group pve". I have uploaded the debug file from our working test server and will upload the one from the problem server early next week.

 


We are thinking about adding activation in Proxmox VE's storage layer, when a thin pool/volume group is not active for whatever reason. But to my knowledge, autoactivation always worked in the past (and still does for most other people), so it still would be interesting to further debug the issue:

For the log where it doesn't work, I think the device might just not be there (or not be ready) yet when the LVM2 script is executed. The question remains why pvscan, which should be triggered once the device shows up, fails.
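The "activate it if it isn't active" idea for the storage layer can be illustrated in shell. This only sketches the logic and is not the Proxmox VE implementation, whose storage library is written in Perl. lv_is_active and ensure_lv_active are hypothetical names, and the sketch relies on the 5th character of lv_attr being "a" for active LVs.

```shell
#!/bin/sh
# Hypothetical sketch of "activate if not active"; not Proxmox VE code.

# Succeeds if the lv_attr string in $1 (e.g. "twi-aotz--") marks the LV
# as active: the 5th character is the state flag, 'a' = active.
lv_is_active() {
    case "$1" in
        ????a*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Activate the LV given as "vg/lv" (e.g. pve/data) unless it is already active.
ensure_lv_active() {
    attr=$(lvs --noheadings -o lv_attr "$1" | tr -d '[:space:]')
    [ -n "$attr" ] || return 1      # LV not found or lvs failed
    if ! lv_is_active "$attr"; then
        lvchange -ay "$1"
    fi
}
```

Calling ensure_lv_active pve/data before starting a guest would mirror what was done by hand in this thread.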

Could you change the following in your /etc/lvm/lvm.conf file:
Code:
# Configuration section log.
# How LVM log information is reported.
log {
        # ----snip----
        # Configuration option log/file.
        # Write error and debug log messages to a file specified here.
        # This configuration option does not have a default value defined.
        file = "/run/lvm-log"

        # Configuration option log/overwrite.
        # Overwrite the log file each time the program is run.
        overwrite = 0

        # Configuration option log/level.
        # The level of log messages that are sent to the log file or syslog.
        # There are 6 syslog-like log levels currently in use: 2 to 7 inclusive.
        # 7 is the most verbose (LOG_DEBUG).
        level = 7
        # ----snip----
}

Afterwards, run update-initramfs -u so that the new configuration is also available early during boot. Please share the contents of /run/lvm-log after the next reboot.
 
Code:
15:04:45.899648 lvchange[471] misc/lvm-flock.c:113  _do_flock /run/lock/lvm/V_pve:aux WB
15:04:45.899673 lvchange[471] misc/lvm-flock.c:113  _do_flock /run/lock/lvm/V_pve WB
15:04:45.986404 pvscan[474] device_mapper/libdm-common.c:2494  Udev cookie 0xd4d8e9e (semid 0) destroyed
15:04:45.986435 pvscan[474] device_mapper/libdm-common.c:1484  pve-swap: Skipping NODE_ADD (253,0) 0:6 0660 [trust_udev]
15:04:45.986448 pvscan[474] device_mapper/libdm-common.c:1495  pve-swap: Processing NODE_READ_AHEAD 256 (flags=1)
15:04:45.986485 pvscan[474] device_mapper/libdm-common.c:1249  pve-swap (253:0): read ahead is 256
15:04:45.986500 pvscan[474] device_mapper/libdm-common.c:1373  pve-swap: retaining kernel read ahead of 256 (requested 256)
15:04:45.986522 pvscan[474] device_mapper/libdm-common.c:1484  pve-root: Skipping NODE_ADD (253,1) 0:6 0660 [trust_udev]
15:04:45.986536 pvscan[474] device_mapper/libdm-common.c:1495  pve-root: Processing NODE_READ_AHEAD 256 (flags=1)
15:04:45.986561 pvscan[474] device_mapper/libdm-common.c:1249  pve-root (253:1): read ahead is 256
15:04:45.986573 pvscan[474] device_mapper/libdm-common.c:1373  pve-root: retaining kernel read ahead of 256 (requested 256)
15:04:45.986583 pvscan[474] device_mapper/libdm-common.c:1484  pve-data_tmeta: Skipping NODE_ADD (253,2) 0:6 0660 [trust_udev]
15:04:45.986596 pvscan[474] device_mapper/libdm-common.c:1495  pve-data_tmeta: Processing NODE_READ_AHEAD 256 (flags=1)
15:04:45.986622 pvscan[474] device_mapper/libdm-common.c:1249  pve-data_tmeta (253:2): read ahead is 256
15:04:45.986634 pvscan[474] device_mapper/libdm-common.c:1373  pve-data_tmeta: retaining kernel read ahead of 256 (requested 256)
15:04:45.986651 pvscan[474] device_mapper/libdm-common.c:1484  pve-data_tdata: Skipping NODE_ADD (253,3) 0:6 0660 [trust_udev]
15:04:45.986661 pvscan[474] device_mapper/libdm-common.c:1495  pve-data_tdata: Processing NODE_READ_AHEAD 256 (flags=1)
15:04:45.986685 pvscan[474] device_mapper/libdm-common.c:1249  pve-data_tdata (253:3): read ahead is 256
15:04:45.986697 pvscan[474] device_mapper/libdm-common.c:1373  pve-data_tdata: retaining kernel read ahead of 256 (requested 256)
15:04:45.986715 pvscan[474] device_mapper/libdm-config.c:986  global/thin_check_executable not found in config: defaulting to /usr/sbin/thin_check
15:04:45.986728 pvscan[474] activate/dev_manager.c:2297  Running check command on /dev/mapper/pve-data_tmeta
15:04:46.5731 pvscan[474] config/config.c:1474  global/thin_check_options not found in config: defaulting to thin_check_options = [ "-q" ]
15:04:46.5756 pvscan[474] misc/lvm-exec.c:71  Executing: /usr/sbin/thin_check -q /dev/mapper/pve-data_tmeta
15:04:46.5966 pvscan[474] misc/lvm-flock.c:37  _drop_shared_flock /run/lock/lvm/V_pve.
15:04:46.6054 pvscan[474] mm/memlock.c:694  memlock reset.
15:07:46.645996 lvchange[471] misc/lvm-flock.c:47  _undo_flock /run/lock/lvm/V_pve:aux
These are the last messages from pvscan with PID 474 in the log. AFAICT, based on the source code, it should print more after this. The fact that it doesn't likely means that it crashes (or is killed). There is also an lvchange process running at the same time, which might be relevant, but there are locks, so I'm not sure.
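Finding the last line each process wrote, as was done by hand for PID 474 above, can be automated. A sketch as an awk program wrapped in a shell function (last_line_per_process is a hypothetical name). It assumes the /run/lvm-log format shown above, where the second field is the command name and PID, e.g. pvscan[474], and prints processes in the order they first appear.

```shell
#!/bin/sh
# Hypothetical helper: for each "command[pid]" in an LVM debug log read
# from stdin, print the last line that process wrote.
last_line_per_process() {
    awk 'NF >= 2 {
            if (!($2 in last)) order[++n] = $2   # remember first appearance
            last[$2] = $0                        # keep the latest line
        }
        END { for (i = 1; i <= n; i++) print last[order[i]] }'
}
```

Running last_line_per_process < /run/lvm-log after a failed boot gives one line per lvchange/pvscan process. A pvscan whose final line stops mid-operation, like PID 474 here, is the one worth a closer look.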

If it does crash, I'd say that's a bug in LVM2 and might be worth reporting upstream. I tried to reproduce the issue and find a potential root cause, but wasn't successful. As for Proxmox VE, as mentioned earlier, I'll look into adding activation for volume groups/thin pools to our storage library.
 
