LVM: "vgs" takes 5 minutes on one cluster node

May 17, 2019
Hey there,

we have a Proxmox cluster with six nodes. Everything worked fine before my vacation, but now I'm having some trouble with LVM on one node: "vgs" takes 5 minutes to complete and shows this:

Code:
 vgs
  Couldn't find device with uuid rptkmO-sIg8-pFcT-FD0M-1RXf-a6zP-eZTw1p.
  Couldn't find device with uuid ladH8H-TTYn-oFVY-WJ0V-ZwhU-bh5K-jHPKNa.
  VG                               #PV #LV #SN Attr   VSize   VFree 
  IWSVA                              2   7   0 wz-pn- 104,84g  60,72g
  pve                                1   3   0 wz--n- 223,00g  16,00g
  rhel                               2   2   0 wz-pn- 548,99g  14,00g
  rhel                               2   2   0 wz-pn-   2,05t  60,00g
  vg-cluster01-XYZ              1   2   0 wz--n-   5,00t   3,00t
  vg-cluster01-demovolume            1   0   0 wz--n- 200,00g 200,00g
  vg-cluster01-ZYX                1   2   0 wz--n-   1,50t 536,00g
  vg-cluster01-online                1   1   0 wz--n-   1,50t 786,00g
  vg-cluster01-s4h                   1   6   0 wz--n-   3,00t 202,00g
  vg-cluster01-storage01             1  74   0 wz--n-  20,00t   5,46t
  vg-cluster01-storage01-YXZ   1   5   0 wz--n-   5,00t 824,00g

pvs:

Code:
root@pm-06:/etc/lvm/backup# pvs
  Couldn't find device with uuid rptkmO-sIg8-pFcT-FD0M-1RXf-a6zP-eZTw1p.
  Couldn't find device with uuid 7gYIcd-rEvM-w6TD-RXxn-tCoS-balO-7yNxQb.
  Couldn't find device with uuid ladH8H-TTYn-oFVY-WJ0V-ZwhU-bh5K-jHPKNa.
  PV                                           VG                               Fmt  Attr PSize   PFree
  /dev/mapper/pm-cluster01-XYZ            vg-cluster01-XYZ            lvm2 a--    5,00t   3,00t
  /dev/mapper/pm-cluster01-demovolume          vg-cluster01-demovolume          lvm2 a--  200,00g 200,00g
  /dev/mapper/pm-cluster01-ZYX              vg-cluster01-ZYX              lvm2 a--    1,50t 536,00g
  /dev/mapper/pm-cluster01-online              vg-cluster01-online              lvm2 a--    1,50t 786,00g
  /dev/mapper/pm-cluster01-s4h                 vg-cluster01-s4h                 lvm2 a--    3,00t 222,00g
  /dev/mapper/pm-cluster01-storage01           vg-cluster01-storage01           lvm2 a--   20,00t   5,46t
  /dev/mapper/pm-cluster01-storage01-YXZ vg-cluster01-storage01-YXZ lvm2 a--    5,00t 824,00g
  /dev/sda3                                    pve                              lvm2 a--  223,00g  16,00g
  /dev/vg-cluster01-s4h/vm-179-disk-1          rhel                             lvm2 a--   50,00g  14,00g
  /dev/vg-cluster01-s4h/vm-183-disk-2          rhel                             lvm2 a--  100,00g  60,00g
  /dev/vg-cluster01-storage01/vm-165-disk-1    IWSVA                            lvm2 a--   79,97g  60,72g
  [unknown]                                    rhel                             lvm2 a-m  499,00g      0
  [unknown]                                    rhel                             lvm2 a-m    1,95t      0
  [unknown]                                    IWSVA                            lvm2 a-m   24,88g      0

I do not have an explanation for how this happened. My colleague created two VMs and installed RHEL on them, but I can't think of anything that could have gone wrong. What's my best option here to get rid of those [unknown] devices? Or to find out which physical devices were used? I can't find anything via blkid/lsblk or findfs. I can find some of those UUIDs in /etc/lvm/backup, but only listed with an [unknown] device. We connected our Proxmox cluster to a SAN via FC and are using shared LVM (thick) volumes for our VMs. Luckily those two RHEL VMs have their own volume.

Any advice on what to do now? If I run vgreduce --removemissing rhel, I'm told there is more than one VG named rhel and that I have to use --select. But if I try --select:

Code:
root@pm-06:/etc/lvm/backup# vgreduce --removemissing rhel
  Multiple VGs found with the same name: skipping rhel
  Use --select vg_uuid=<uuid> in place of the VG name.
root@pm-06:/etc/lvm/backup# vgreduce --removemissing --select vg_uuid=rptkmO-sIg8-pFcT-FD0M-1RXf-a6zP-eZTw1p
vgreduce: unrecognized option '--select'
  Error during parsing of command line.
root@pm-06:/etc/lvm/backup#
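Since vgreduce here apparently doesn't know --select, one way to at least tell the two 'rhel' VGs apart should be by their UUIDs via the plain reporting options (just a sketch, assuming these output fields exist in this lvm2 version):

Code:
# list VG name, UUID, PV count and attributes; a 'p' in the vg_attr
# column marks a VG with missing PVs (partial)
vgs -o vg_name,vg_uuid,pv_count,vg_attr

# the same per PV, to see which PV UUIDs belong to which VG
pvs -o pv_name,pv_uuid,vg_name,vg_uuid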

Thanks in advance!

Cheers,
Daniel
 
Hi,

to be honest I don't know what exactly went wrong here, but I'm not sure removing these is a good idea (at least before trying to find out what they are and how they got there), so I'd go that route first. They also look pretty big, and there might be something important on them.

How did your colleague create these VMs?

Please show me your:
- `pveversion -v`
- `qm config VMID` (where VMID is a VM your colleague created)

Also, this link [0] might be useful for ideas, in case you decide to try removing/recreating them.

[0]: https://www.suse.com/support/kb/doc/?id=3803380
 
Hi,

Thanks for the fast reply. If those volumes belong to those two VMs, the data on them is not *that* important, because those are only test instances.

Code:
pveversion -v
proxmox-ve: 5.4-1 (running kernel: 4.15.18-16-pve)
pve-manager: 5.4-6 (running version: 5.4-6/aa7856c5)
pve-kernel-4.15: 5.4-4
pve-kernel-4.15.18-16-pve: 4.15.18-41
pve-kernel-4.15.18-15-pve: 4.15.18-40
pve-kernel-4.15.18-12-pve: 4.15.18-36
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-10
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-52
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-43
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-28
pve-cluster: 5.0-37
pve-container: 2.0-39
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-22
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-2
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-52
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2

Code:
# qm config 183
agent: 1
bootdisk: scsi0
cores: 32
cpu: host
ide2: pm-isostorage:iso/rhel-server-7.6-x86_64-dvd.iso,media=cdrom,size=4289M
memory: 262144
name: s4hsrv.XXX.de
net0: virtio=D2:E3:96:73:AE:61,bridge=vmbr171,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: pm-cluster01-s4h:vm-183-disk-0,size=2100G
scsi1: pm-cluster01-s4h:vm-183-disk-2,size=100G
scsihw: virtio-scsi-pci
smbios1: uuid=6969e35f-72fd-40ef-ad64-05a03f914f01
sockets: 1
unused0: pm-cluster01-s4h:vm-183-disk-1
vmgenid: dc47dee5-a037-4fb1-bada-25aa0eea6d1b

Code:
root@pm-06:/etc/lvm/backup# qm config 179
agent: 1
boot: cdn
bootdisk: scsi0
cores: 16
cpu: host
ide2: pm-isostorage:iso/rhel-server-7.6-x86_64-dvd.iso,media=cdrom
memory: 266240
name: smtsrv
net0: virtio=EA:78:BB:69:AF:B8,bridge=vmbr171,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: pm-cluster01-s4h:vm-179-disk-0,size=500G
scsi1: pm-cluster01-s4h:vm-179-disk-1,size=50G
scsihw: virtio-scsi-pci
smbios1: uuid=3dd8944c-da5c-4dd3-bc0f-290396022ce1
sockets: 1
vmgenid: 0ab9dfda-7447-4ef7-8990-ec41392d7fca
 
To me, those [unknown] devices seem to be the `scsi0` disks of the two VMs (judging from the sizes). Are these used/activated? You may do with them as you see fit if they are just test instances anyway.
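If you're not sure whether they are activated on the host, something like the following should show it (just a sketch: the VG name is taken from your config above, and note that device-mapper doubles the dashes in names):

Code:
# list the LVs in the VG backing those disks; an 'a' in the lv_attr
# state field means the LV is currently active on this host
lvs -o vg_name,lv_name,lv_attr,lv_size vg-cluster01-s4h

# cross-check with device-mapper which mappings actually exist
dmsetup ls | grep vg--cluster01--s4h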
 
Both VMs are running and using the disks. I need to talk to our SAP guy to see whether I can recreate these VMs or whether he is still running tests on them.

From inside the VMs:
Code:
245 root@smtsrv: ~# vgs
  VG   #PV #LV #SN Attr   VSize   VFree 
  rhel   2   2   0 wz--n- 548,99g <14,00g
246 root@smtsrv: ~# pvs
  PV         VG   Fmt  Attr PSize    PFree 
  /dev/sda2  rhel lvm2 a--  <499,00g      0
  /dev/sdb   rhel lvm2 a--   <50,00g <14,00g

640 root@s4hsrv: ~# vgs
  VG   #PV #LV #SN Attr   VSize VFree 
  rhel   2   2   0 wz--n- 2,05t <60,00g
641 root@s4hsrv: ~# pvs
  PV         VG   Fmt  Attr PSize    PFree 
  /dev/sda3  rhel lvm2 a--     1,95t      0
  /dev/sdb   rhel lvm2 a--  <100,00g <60,00g
 
If I shut down one of the VMs, one [unknown] device disappears. It reappears when I start the VM.

Code:
root@pm-05:~# pvs
  Couldn't find device with uuid ladH8H-TTYn-oFVY-WJ0V-ZwhU-bh5K-jHPKNa.
  PV                                           VG                               Fmt  Attr PSize   PFree 
  /dev/mapper/pm-cluster01-XYZ            vg-cluster01-XYZ            lvm2 a--    5,00t   3,00t
  /dev/mapper/pm-cluster01-demovolume          vg-cluster01-demovolume          lvm2 a--  200,00g 200,00g
  /dev/mapper/pm-cluster01-ZYX              vg-cluster01-ZYX              lvm2 a--    1,50t 536,00g
  /dev/mapper/pm-cluster01-online              vg-cluster01-online              lvm2 a--    1,50t 786,00g
  /dev/mapper/pm-cluster01-s4h                 vg-cluster01-s4h                 lvm2 a--    3,00t 170,00g
  /dev/mapper/pm-cluster01-storage01           vg-cluster01-storage01           lvm2 a--   20,00t   5,46t
  /dev/mapper/pm-cluster01-storage01-YZX vg-cluster01-storage01-YZX lvm2 a--    5,00t 824,00g
  /dev/sda3                                    pve                              lvm2 a--  223,00g  16,00g
  /dev/vg-cluster01-storage01/vm-165-disk-1    IWSVA                            lvm2 a--   79,97g  60,72g
  [unknown]                                    IWSVA                            lvm2 a-m   24,88g      0

root@pm-05:~# pvs
  Couldn't find device with uuid 7gYIcd-rEvM-w6TD-RXxn-tCoS-balO-7yNxQb.
  Couldn't find device with uuid ladH8H-TTYn-oFVY-WJ0V-ZwhU-bh5K-jHPKNa.
  PV                                           VG                               Fmt  Attr PSize   PFree 
  /dev/mapper/pm-cluster01-XYZ            vg-cluster01-XYZ            lvm2 a--    5,00t   3,00t
  /dev/mapper/pm-cluster01-demovolume          vg-cluster01-demovolume          lvm2 a--  200,00g 200,00g
  /dev/mapper/pm-cluster01-ZYX              vg-cluster01-ZYX              lvm2 a--    1,50t 536,00g
  /dev/mapper/pm-cluster01-online              vg-cluster01-online              lvm2 a--    1,50t 786,00g
  /dev/mapper/pm-cluster01-s4h                 vg-cluster01-s4h                 lvm2 a--    3,00t 170,00g
  /dev/mapper/pm-cluster01-storage01           vg-cluster01-storage01           lvm2 a--   20,00t   5,46t
  /dev/mapper/pm-cluster01-storage01-YZX vg-cluster01-storage01-YZX lvm2 a--    5,00t 824,00g
  /dev/sda3                                    pve                              lvm2 a--  223,00g  16,00g
  /dev/vg-cluster01-s4h/vm-183-disk-2          rhel                             lvm2 a--  100,00g  60,00g
  /dev/vg-cluster01-storage01/vm-165-disk-1    IWSVA                            lvm2 a--   79,97g  60,72g
  [unknown]                                    rhel                             lvm2 a-m    1,95t      0
  [unknown]                                    IWSVA                            lvm2 a-m   24,88g      0
 
I could not migrate the VM with ID 179 to another host when it was shut down.
Code:
2019-06-26 16:44:50 starting migration of VM 179 to node 'pm-05' (192.168.52.87)
2019-06-26 16:44:50 copying disk images
can't deactivate LV '/dev/vg-cluster01-s4h/vm-179-disk-1':   Logical volume vg-cluster01-s4h/vm-179-disk-1 in use.
2019-06-26 16:44:55 ERROR: volume deactivation failed: pm-cluster01-s4h:vm-179-disk-1 at /usr/share/perl5/PVE/Storage.pm line 1087.
2019-06-26 16:44:55 ERROR: migration finished with problems (duration 00:00:05)
TASK ERROR: migration problems

I also tried rebooting the node; it comes up, and as soon as it is up I see "vgs" processes in top eating 100% CPU.

Code:
 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                                                  
21048 root      20   0   97104  66888   4116 R 100,0  0,0   2:01.35 vgs                                                                                                                                                                      
21333 root      20   0   66388  36208   4296 R 100,0  0,0   1:00.21 vgs                                                                                                                                                                      
21386 root      20   0   76564  46572   4276 R  99,7  0,0   0:56.36 vgs                                                                                                                                                                      
21422 root      20   0   66116  35988   4080 R  99,7  0,0   0:47.61 vgs                                                                                                                                                                      
17198 root      rt   0  198372  73544  51484 S   1,0  0,0   0:08.93 corosync

Code:
root@pm-06:~# ps aux|grep -i vgs
root     21333 99.9  0.0  97640 67604 ?        R    17:23   2:21 /sbin/vgs --separator : --noheadings --units b --unbuffered --nosuffix --options vg_name,vg_size,vg_free,lv_count
root     21386  100  0.0  98724 68636 ?        R    17:23   2:18 /sbin/vgs --separator : --noheadings --units b --unbuffered --nosuffix --options vg_name,vg_size,vg_free,lv_count
root     21422  100  0.0  96560 66468 ?        R    17:23   2:09 /sbin/vgs --separator : --noheadings --units b --unbuffered --nosuffix --options vg_name,vg_size,vg_free,lv_count
root     21966  103  0.0  59320 29172 ?        R    17:25   0:05 /sbin/vgs --separator : --noheadings --units b --unbuffered --nosuffix --options vg_name,vg_size,vg_free,lv_count
root     21996  0.0  0.0  14320   936 pts/0    S+   17:25   0:00 grep -i vgs
 
Please update your /etc/lvm/lvm.conf so that it filters out all devices on which you have created guest disk images that themselves have LVM configured inside.

We've seen these very long timeouts with the LVM utilities (vgs, pvs, lvs, ...) when a VG name is used more than once - e.g. 'rhel' in your case.
When running on one host, the LVM utilities would prevent creating two VGs with the same name, but since the VGs are created inside the guests, which know nothing about each other, no error is shown.

From a quick glance at your output you need to add fitting regexes for:
* '/dev/mapper/pm-cluster01'
* '/dev/vg-cluster01-storage01'
* '/dev/vg-cluster01-s4h'

to the 'global_filter' directive in '/etc/lvm/lvm.conf' - see the sketch below.
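A rough sketch of how such entries could look (the exact patterns are an assumption based on the device names above; 'r|...|' rejects matching paths, and the entries are regular expressions, so adjust them to your actual names):

Code:
# /etc/lvm/lvm.conf - sketch only, merge with any existing entries
global_filter = [ "r|^/dev/mapper/pm-cluster01.*|", "r|^/dev/vg-cluster01-storage01/.*|", "r|^/dev/vg-cluster01-s4h/.*|" ]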

Hope this helps!
 
Thanks, this speeds things up.
The line now looks like this:

Code:
global_filter = [ "r|/dev/zd.*|", "r|/dev/mapper/pve-.*|", "r|/dev/mapper/pm-cluster01*|", "r|/dev/vg-cluster01*|" ]

I can confirm that it speeds up vgs, especially "r|/dev/vg-cluster01*|".

But now I'm unable to see any content in the WebUI when I click on a storage and then on "Content". I guess this is something I have to live with now?
 
Try adding a '/*' after '/dev/vg-cluster01*' - that should still list the VG contents for the storages actually used by PVE, but exclude the VGs created on disk images inside those VGs:
'/dev/vg-cluster01*/*'
 
Sadly, that didn't change the output of lvs/pvs/vgs, but I'll read up on the filter syntax and try a few things.
Thanks for pointing me in the right direction! Proxmox is an awesome project!
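(Note for anyone finding this later: the global_filter entries are regular expressions rather than shell globs, so 'cluster01*' matches 'cluster0' followed by any number of '1' characters. A regex-anchored sketch of the suggestion above - untested here, with the other entries kept as before:)

Code:
global_filter = [ "r|/dev/zd.*|", "r|/dev/mapper/pve-.*|", "r|/dev/mapper/pm-cluster01.*|", "r|/dev/vg-cluster01.*/.*|" ]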
 
