Proxmox not cleaning up LVM volume references

[NUT]

We have a shared storage server using iSCSI with LVM on top, shared between 8 hypervisors. Whenever we remove a VM, its disks are removed from the LVM, but its references in device-mapper and the symlinks in /dev/mapper and /dev/volgroup/ are not cleaned up.

Once I clean up the unused dm nodes in /dev and remove the symlinks to them, we can re-use the VMID again.

Deletion of the VM is done using the Proxmox Management web interface.

VM100, VM101 and VM102 have been removed; this is the output from lvscan:

ACTIVE '/dev/vmstorage/vm-103-disk-2' [50,00 GiB] inherit
ACTIVE '/dev/vmstorage/vm-104-disk-1' [40,00 GiB] inherit
[...]
ACTIVE '/dev/pve/swap' [3,62 GiB] inherit
ACTIVE '/dev/pve/root' [25,94 GiB] inherit

The contents of /dev/mapper:

drwxr-xr-x 2 root root 520 okt 24 15:56 .
drwxr-xr-x 21 root root 4620 okt 24 15:56 ..
crw------- 1 root root 10, 236 okt 24 15:49 control
lrwxrwxrwx 1 root root 7 okt 24 15:49 pve-root -> ../dm-1
lrwxrwxrwx 1 root root 7 okt 24 15:49 pve-swap -> ../dm-0
lrwxrwxrwx 1 root root 7 okt 24 15:50 vmstorage-vm--100--disk--1 -> ../dm-2
lrwxrwxrwx 1 root root 7 okt 24 15:50 vmstorage-vm--101--disk--1 -> ../dm-3
lrwxrwxrwx 1 root root 8 okt 24 15:50 vmstorage-vm--101--disk--2 -> ../dm-12
lrwxrwxrwx 1 root root 7 okt 24 15:50 vmstorage-vm--102--disk--1 -> ../dm-4
lrwxrwxrwx 1 root root 7 okt 24 15:56 vmstorage-vm--103--disk--2 -> ../dm-5
lrwxrwxrwx 1 root root 7 okt 24 15:50 vmstorage-vm--104--disk--1 -> ../dm-6
lrwxrwxrwx 1 root root 8 okt 24 15:50 vmstorage-vm--104--disk--2 -> ../dm-13

The contents of /dev/vmstorage:

lrwxrwxrwx 1 root root 7 okt 24 15:50 vm-100-disk-1 -> ../dm-2
lrwxrwxrwx 1 root root 7 okt 24 15:50 vm-101-disk-1 -> ../dm-3
lrwxrwxrwx 1 root root 8 okt 24 15:50 vm-101-disk-2 -> ../dm-12
lrwxrwxrwx 1 root root 7 okt 24 15:50 vm-102-disk-1 -> ../dm-4
lrwxrwxrwx 1 root root 7 okt 24 15:56 vm-103-disk-2 -> ../dm-5
lrwxrwxrwx 1 root root 7 okt 24 15:50 vm-104-disk-1 -> ../dm-6
lrwxrwxrwx 1 root root 8 okt 24 15:50 vm-104-disk-2 -> ../dm-13

Sometimes when I remove the volume by hand (e.g. dmsetup remove vmstorage-vm--100--disk--1), even the symlinks still don't get removed, although in general that works fine. Whenever that happens I can clear them with a simple rm command.

I have to repeat this by hand on all hypervisors for every removed VM that had storage on this iSCSI-backed LVM storage. I have not tried any other storage type yet; I am unable to switch over because some of the VMs are in production.
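
For reference, this is roughly what that manual cleanup looks like for a single VMID on one node. It is only a sketch: it assumes the VG is called vmstorage as in the listings above, and you should confirm with lvs that the LV really is gone before removing anything.

#!/bin/bash
# Rough sketch of the manual cleanup for one deleted guest on one node.
VG=vmstorage
VMID=100

# stale device-mapper nodes left behind for this VMID
for dm in $(dmsetup ls | awk '{print $1}' | grep "^${VG}-vm--${VMID}--disk--"); do
    dmsetup remove "$dm"        # drop the dm node the LV removal left behind
    rm -f "/dev/mapper/$dm"     # and its symlink, if udev did not clean it up
done

# the per-VG symlinks sometimes linger as well
rm -f "/dev/${VG}/vm-${VMID}-disk-"*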

Does anyone have any clue as to why Proxmox does not clean up all traces of a deleted VM? I cannot find anything going 'wrong' in the logs; we only notice it has happened again when we try to recycle a VM ID and its creation fails with this error:

device-mapper: create ioctl on vmstorage-vm--100--disk--1LVM-7Nd10YZoG8pewnLgyFMc26XDHDCT2ADiOI6ug5mWqZqblX1ZBYeTT0gX030VsffK failed: Device or resource busy
TASK ERROR: create failed - lvcreate 'vmstorage/pve-vm-100' error: Failed to activate new LV.

Rebooting all the hypervisors also clears the issue, but only temporarily...
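
For anyone who wants to see whether a node is affected before a creation fails, this is a rough way to list dm nodes that no longer have a matching LV. Again just a sketch: it assumes the VG is named vmstorage and relies on the fact that hyphens inside LV names are doubled in the dm name.

#!/bin/bash
# List device-mapper names for the VG that no longer correspond to an LV.
VG=vmstorage

# names the kernel still knows about
dmsetup ls | awk '{print $1}' | grep "^${VG}-" | sort > /tmp/dm_names

# names that should exist, derived from the LVs that are still present
# (vm-100-disk-1 in VG vmstorage maps to vmstorage-vm--100--disk--1)
lvs --noheadings -o lv_name "$VG" | tr -d ' ' \
    | awk -v vg="$VG" '{gsub(/-/, "--"); print vg "-" $0}' | sort > /tmp/lv_names

# anything only in the first file is a stale mapping
comm -23 /tmp/dm_names /tmp/lv_names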
 
what is your 'pveversion -v' ?
 
I should have added that, shouldn't I? ;)

proxmox-ve: 5.1-25 (running kernel: 4.13.4-1-pve)
pve-manager: 5.1-36 (running version: 5.1-36/131401db)
pve-kernel-4.13.4-1-pve: 4.13.4-25
pve-kernel-4.4.62-1-pve: 4.4.62-88
pve-kernel-4.10.17-4-pve: 4.10.17-24
pve-kernel-4.10.17-2-pve: 4.10.17-20
pve-kernel-4.10.15-1-pve: 4.10.15-15
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.67-1-pve: 4.4.67-92
pve-kernel-4.4.59-1-pve: 4.4.59-87
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-15
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-20
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-16
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-2
pve-container: 2.0-17
pve-firewall: 3.0-3
pve-ha-manager: 2.0-3
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.0-2
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.2-pve1~bpo90

I am going to update to the latest 5.1 packages available today, so the problem 'will go away' once I do, as the update contains a new kernel (and therefore a reboot). But after some time it returns; we'll do a quick test once the update is installed and report back here.

apt list --upgradable
Listing... Done
libnvpair1linux/stable 0.7.3-pve1~bpo9 amd64 [upgradable from: 0.7.2-pve1~bpo90]
libssl1.0.2/stable 1.0.2l-2+deb9u1 amd64 [upgradable from: 1.0.2l-2]
libssl1.1/stable 1.1.0f-3+deb9u1 amd64 [upgradable from: 1.1.0f-3]
libuutil1linux/stable 0.7.3-pve1~bpo9 amd64 [upgradable from: 0.7.2-pve1~bpo90]
libzfs2linux/stable 0.7.3-pve1~bpo9 amd64 [upgradable from: 0.7.2-pve1~bpo90]
libzpool2linux/stable 0.7.3-pve1~bpo9 amd64 [upgradable from: 0.7.2-pve1~bpo90]
openssl/stable 1.1.0f-3+deb9u1 amd64 [upgradable from: 1.1.0f-3]
proxmox-ve/stable 5.1-26 all [upgradable from: 5.1-25]
pve-kernel-4.13.4-1-pve/stable 4.13.4-26 amd64 [upgradable from: 4.13.4-25]
spl/stable 0.7.3-pve1~bpo9 amd64 [upgradable from: 0.7.2-pve1~bpo90]
zfs-initramfs/stable 0.7.3-pve1~bpo9 all [upgradable from: 0.7.2-pve1~bpo90]
zfsutils-linux/stable 0.7.3-pve1~bpo9 amd64 [upgradable from: 0.7.2-pve1~bpo90]

But besides the kernel it does not contain any updates for LVM, udev or dmsetup, so I do not expect this update to resolve the issue.
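
To double-check, here is one way to confirm that the update does not touch the storage tooling; the package names are just the usual Debian ones, so treat this as a sketch.

# pending upgrades that would touch the storage stack
apt list --upgradable 2>/dev/null | grep -Ei 'lvm2|dmsetup|udev|device-mapper'

# versions currently installed
dpkg -l lvm2 dmsetup udev | awk '/^ii/ {print $2, $3}'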
 
Do you maybe have the line

use_lvmetad = 1

in your /etc/lvm/lvm.conf ?

it should read

use_lvmetad = 0
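
For example, either of these should show the effective value on a node (lvmconfig ships with recent lvm2; otherwise grep the config file directly):

# value as the LVM tools actually see it
lvmconfig global/use_lvmetad

# or look at the config file, skipping commented-out lines
grep -E '^[[:space:]]*use_lvmetad[[:space:]]*=' /etc/lvm/lvm.conf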

If this is not the problem, can you post the full task log of a removal that does not work, and the output of 'lvs' afterwards?
 

The LVM configuration is the same on all hypervisors: 'use_lvmetad = 0'. (I came across that possible cause many times while googling for people with similar issues, also with lvm2/dmsetup in general without Proxmox, but it was set correctly from the start.)

I will try to gather the logs, but the removal is not the problem by itself: it does remove the VM from everywhere except device-mapper, and you only notice that when you attempt to create a VM with the same ID.
 
After some testing yesterday the problem has not returned yet, but that was expected: it is a problem that returns after a while, not directly after a reboot.

We're also having issues with memory modules going bad (down to a bad production run from Hynix, unfortunately), so from time to time a hypervisor will take a segfault tumble and reboot, which makes it take even longer for the issue to show up again.
 
There we go again :eek:

I've run a pvereport and included it here (sensitive info removed, of course). Also included are the logs from /var/log concerning PVE. If you need anything else, please let me know where to find it.
 


The journal output from that time would also be good:
 
journalctl --since="<INSERT DATE/TIME>" --until="<INSERT DATE/TIME>" > file.log
 

Thanks, I never knew you could do that with journalctl; never too old to learn ;)

But unfortunately the journal entries from that day and time frame have already been rotated away... so all I got was:

"-- Logs begin at Sat 2017-11-18 03:01:58 CET, end at Mon 2017-11-20 09:58:53 CET. --"

So I gathered the logs myself; they are too big to upload to the forum, so I created a link on our download server.

Included are syslog.5.gz, messages.1, daemon.log.1, debug.1 and kern.log.1

Please tell me if you need anything else.
 
I looked at the logs, but there is nothing indicating anything wrong with LVM. The only thing that stood out was the constant

sd 11:0:0:0: [sdb] Very big device. Trying to use READ CAPACITY(16).

but I am not sure if this is the problem...
 

Well, from my understanding this is because the storage is big: the default READ CAPACITY(10) command cannot report the full size, so the kernel retries with READ CAPACITY(16). It apparently never remembers that it did, so the message repeats a lot.
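
If you want to double-check that it really is only the size, something like the following should do (assuming the iSCSI LUN is /dev/sdb as in that log line; with 512-byte sectors, anything above 2 TiB makes the kernel fall back to READ CAPACITY(16)):

blockdev --getsize64 /dev/sdb | numfmt --to=iec    # device size in human-readable form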

Everything else is working fine, up until we remove a VM...
 
Alright, the fix is now known, though the root cause of this issue is still not clear:

Proxmox Support was unable to recreate the issue on a clean install of PVE 5.1, which prompted me to reinstall the nodes one by one. We initially started with PVE 4.4 and upgraded along the way. While it is still unclear why the issues happened, we now know they are fixed with a clean installation. :eek:

Small note for those who don't read release notes (like me :D): if you do a clean install of PVE 5.x, the network interfaces will use the predictable naming convention, whereas if you upgrade they will keep the 'eth' naming convention and will not change, even once you are eventually running 5.x.
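
If you are not sure which convention a freshly installed or upgraded node ended up with, a quick look at the interface names tells you (just a sketch):

ip -o link show | awk -F': ' '{print $2}'   # 'ethX' = old naming, 'enoX'/'enpXsY' = predictable naming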

A small 'heads-up' ;)