1-4x vgs at 100% CPU?

mo_ · Aug 16, 2012

On one of the nodes in our 2-node cluster I can see 1-4 instances of "vgs" running at 100% CPU.

"ps -ef" tells me:

root 21183 6134 99 16:48 ? 00:01:30 /sbin/vgs --separator : --noheadings --units b --unbuffered --nosuffix --options vg_name,vg_size,vg_free

Apparently these calls fail, because every 2 minutes there's a new attempt with a different PID. Also, vgdisplay needs roughly 5 minutes to display its results, which look fine though.

Furthermore, there are no errors in /var/log/messages or dmesg.

This makes machine starts/stops take VERY long, and the web service on this node is unresponsive, so the other node cannot display stats about the VMs on the faulty node #2.

The weird thing is: you can interact with the VMs on that node in the usual way and you can also shut them down from inside the VM just fine.

Any pointers to help resolve this?

PS: we are using SAN storage, and it should go without saying that it's connected via FC, including a shared storage for KVM guests.

udo · Aug 17, 2012

Hi,
any hint in /var/log/syslog?
Is any storage inaccessible? Does "lvs" report anything?
What does the I/O wait look like in top?

Udo

mo_ · Aug 17, 2012

udo said:
Hi,
1) any hint in /var/log/syslog?
2) Is any storage inaccessible? Does "lvs" report anything?
3) What does the I/O wait look like in top?

1) nothing that shouldn't be there
2) storage is accessible, lvs looks clean
3) 0.0%wa.

I've tracked the problem down some more. Apparently those instances of vgs are being started by the web interface. Namely, if you click on the storage "local" of the faulty node #2, then while it is trying (and failing) to fill in the table I get one additional instance of vgs. The problem is that the web interface of node #2 is (I guess) waiting for these processes to terminate and thus reacts VERY slowly to anything involving it (like migration, starting/stopping VMs, even getting stats).

The storage "local" is merely a directory on the node's root partition, which is no SAN, no RAID, no LVM, no nothing, just one plain SSD.
Also the root partition is fully accessible.
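
To check whether vgs processes are piling up in the way described above, a filter over the process list can help. This is only a rough sketch: the function name long_running_vgs is made up for illustration, and the ps field "etimes" (elapsed seconds) is assumed to be available, which may not be the case on older procps versions.

```shell
# long_running_vgs MIN_SECONDS
# Reads lines of the form "pid etimes pcpu args..." on stdin and prints
# pid, elapsed time and CPU share of every vgs process that has been
# running for at least MIN_SECONDS.
long_running_vgs() {
    awk -v min="$1" '
        ($4 == "vgs" || $4 ~ /\/vgs$/) && ($2 + 0 >= min + 0) {
            print $1, $2 "s", $3 "%"
        }'
}

# On a real node (assumes a ps that knows "etimes"):
#   ps -e -o pid= -o etimes= -o pcpu= -o args= | long_running_vgs 60
```

Anything it prints is a vgs call that has been spinning for a minute or more, which matches the symptom of the web interface waiting on it.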

mo_ · Aug 20, 2012

Anyone? I'm sorry for the bump, but I feel this got buried over the weekend. Apparently I'm not the first one to encounter this problem, but alas no one ever replied to the other question at http://forum.proxmox.com/threads/9905-vgs-runs-at-100-for-days .

VladVons · Nov 5, 2016

If you have duplicate volume group names, 'vgs' can end up using 100% CPU.

Check the volume group names for duplicates:
# vgdisplay

If there is a duplicate name, rename one of the groups by its UUID:
# vgrename -v 'VG UUID' vgdata2
# vgchange -ay
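
Spotting duplicates by eye in vgdisplay output gets tedious with many groups, so the check can be scripted. The following is a minimal sketch, not a tested procedure: find_duplicate_vgs is a hypothetical helper name, and it assumes its input is the two-column output of "vgs --noheadings -o vg_name,vg_uuid".

```shell
# find_duplicate_vgs
# Reads "vg_name vg_uuid" lines on stdin and prints every VG name that
# appears with more than one distinct UUID.
find_duplicate_vgs() {
    awk 'NF >= 2 {
             # count each (name, uuid) pair only once
             if (!seen[$1 SUBSEP $2]++) count[$1]++
         }
         END {
             for (name in count)
                 if (count[name] > 1) print name
         }' | sort
}

# On a real host (as root):
#   vgs --noheadings -o vg_name,vg_uuid | find_duplicate_vgs
```

Each name it prints exists more than once; the matching UUIDs from the same vgs call are what vgrename would need.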

chrisalavoine · Nov 25, 2016

I too am having this problem. I have a 6-host cluster that has worked fine for some time now, connected to a Compellent SC4020 SAN via 10Gb X520s on the storage network (with a separate 1Gb Ethernet network for LAN). I recently added 2 new hosts (identical hardware to the existing 4), and the 2 newer hosts keep failing with vgs processes at 100% CPU. As per the OP, the VMs work fine, but I can't access any of the iSCSI LVM mounts, and there are no clues in syslog or dmesg.

The original 4 hosts are still fine, all are running the following:

pveversion -v
proxmox-ve: 4.3-71 (running kernel: 4.4.21-1-pve)
pve-manager: 4.3-10 (running version: 4.3-10/7230e60f)
pve-kernel-4.4.21-1-pve: 4.4.21-71
pve-kernel-4.4.19-1-pve: 4.4.19-66
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-47
qemu-server: 4.0-94
pve-firmware: 1.1-10
libpve-common-perl: 4.0-80
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-68
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.3-14
pve-qemu-kvm: 2.7.0-8
pve-container: 1.0-81
pve-firewall: 2.0-31
pve-ha-manager: 1.0-35
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.5-1
lxcfs: 2.0.4-pve2
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80

I've tried running vgdisplay, which does show a difference between the working and non-working hosts. The non-working ones appear to have some VGs that look like ones I've created inside VMs (which seems a bit crazy); they have names like "main" and "data", which we use regularly inside VMs. Has anyone else seen this?

Any help much appreciated.

c

fabian · Nov 25, 2016

check your lvm.conf in /etc/lvm, especially the global_filter option.

chrisalavoine · Nov 25, 2016

fabian said:
check your lvm.conf in /etc/lvm, especially the global_filter option.

Hi Fabian,

Thanks for the quick reply. I've checked /etc/lvm/lvm.conf and it is identical on both the working and non-working hosts (I have never edited it). The global_filter section looks like this:

# Do not scan ZFS zvols (to avoid problems on ZFS zvols snapshots)
global_filter = [ "r|/dev/zd.*|", "r|/dev/mapper/pve-.*|" ]

All other filter examples are commented out.

The VGs I'm seeing are definitely names that we use inside our VMs. Interestingly, if I reboot the host with no VMs on it, it comes up fine with the correct VGs displayed. If I then live-migrate some VMs onto it and wait a while, the problems start. I don't understand why this only happens on 2 hosts, as their setups are identical.

I am loath to start renaming VGs as suggested by VladVons, but maybe I can add some filters to exclude the known VGs?

c

fabian · Nov 25, 2016

you should filter out the path of the block devices where your VGs in VMs are - the two filters that we put in place are for stuff that our installer (potentially) sets up. if you use other storages that are accessed as block devices on the hypervisor nodes (and which might contain PVs/VGs/LVs) you should add appropriate filters.
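
One way to find out which paths need such a filter is to look for PVs that sit on top of an LV belonging to another VG on the same host, since those are typically guest disks. The sketch below is a heuristic under stated assumptions: find_stacked_pvs is a made-up helper, the input is assumed to be "pvs --noheadings -o pv_name,vg_name", and only PV paths of the form /dev/<VG>/<LV> are recognised.

```shell
# find_stacked_pvs
# Reads "pv_name vg_name" lines on stdin and prints every PV whose path
# is /dev/<VG>/<LV> where <VG> is itself a VG name seen in the input.
# Rows without a device path (e.g. "unknown device") are skipped.
find_stacked_pvs() {
    awk '$1 ~ /^\// && NF >= 2 { pv[NR] = $1; owner[NR] = $2; vgs[$2] = 1 }
         END {
             for (i = 1; i <= NR; i++) {
                 if (!(i in pv)) continue
                 n = split(pv[i], part, "/")
                 # part[1] is empty, part[2] is "dev", part[3] the directory
                 if (n == 4 && part[2] == "dev" && (part[3] in vgs))
                     print pv[i] " (PV of VG " owner[i] ", on an LV of VG " part[3] ")"
             }
         }'
}

# On a real host (as root):
#   pvs --noheadings -o pv_name,vg_name | find_stacked_pvs
```

The directory part of each reported path (/dev/<VG>/) is a candidate for a reject entry in global_filter.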

chrisalavoine · Nov 25, 2016

fabian said:
you should filter out the path of the block devices where your VGs in VMs are - the two filters that we put in place are for stuff that our installer (potentially) sets up. if you use other storages that are accessed as block devices on the hypervisor nodes (and which might contain PVs/VGs/LVs) you should add appropriate filters.

Hi Fabian

Thanks for the tip. I have a feeling this may be coming from an NFS NAS server I've got mounted, which does have a few VM disks on it. The device shows up as 192.168.35.60:/volume1/vmdatanfs, so I've added the following to the global_filter in my lvm.conf:

global_filter = [ "r|/dev/zd.*|", "r|/dev/mapper/pve-.*|", "r|192.168.35.60:/volume1/*|" ]

Does that look about right?

c

fabian · Nov 25, 2016

no, you don't get block devices over NFS. more likely culprits would be ceph (with KRBD) or iSCSI / LUNs

chrisalavoine · Nov 25, 2016

fabian said:
no, you don't get block devices over NFS. more likely culprits would be ceph (with KRBD) or iSCSI / LUNs

OK, got to the bottom of this one, thanks for the tips. It seems these hosts were picking up PVs from VM disks. I'm still not sure how that was happening, but when I ran pvs I saw this:

# pvs
  Couldn't find device with uuid d6io1C-mYOD-WS9D-Z9ig-AXh2-NVXY-pzCpQA.
  Couldn't find device with uuid 2380px-Idkf-MbqL-D7ZH-l3Fg-1Oib-8UF2DL.
  Couldn't find device with uuid TnOlMa-ocJE-Ftlz-CIeQ-eJrX-B6nD-7qUqx1.
  Couldn't find device with uuid ZMUobw-K3bc-0mwl-Nj9X-WI1t-qPHE-nhfUxT.
  Couldn't find device with uuid ItzSME-WleF-chMd-QrID-ZYeM-dHms-ESPG5c.
  Couldn't find device with uuid Jpaj8N-CITz-FEg8-2qsX-99Sl-tumZ-lKAVDc.
  Couldn't find device with uuid VBbctG-KB3s-B84D-QDtZ-f4ER-JnuM-RAq88y.
  PV                       VG   Fmt  Attr    PSize    PFree
  /dev/LUN2/vm-515-disk-4  main lvm2 a--    45.00g        0
  /dev/LUN3/vm-1202-disk-2 main lvm2 a--   400.00g        0
  /dev/LUN3/vm-1202-disk-5 main lvm2 a--   100.00g        0
  /dev/LUN3/vm-515-disk-1  main lvm2 a--    40.00g        0
  /dev/mapper/LUN1         LUN1 lvm2 a--     4.39t  171.00g
  /dev/mapper/LUN2         LUN2 lvm2 a--     4.98t  537.81g
  /dev/mapper/LUN3         LUN3 lvm2 a--     5.00t  457.00g
  /dev/mapper/LUN4         LUN4 lvm2 a--    10.00t    7.74t
  /dev/sda3                pve  lvm2 a--   111.12g   13.75g
  unknown device           main lvm2 a-m    75.00g        0
  unknown device           main lvm2 a-m    30.00g 1012.00m
  unknown device           main lvm2 a-m    30.00g        0
  unknown device           main lvm2 a-m   270.00g        0
  unknown device           main lvm2 a-m   300.00g        0
  unknown device           main lvm2 a-m   500.00g 1020.00m
  unknown device           main lvm2 a-m  1024.00g 1024.00g

Obviously not good: the /dev/LUN* devices should not be there, so I added this to lvm.conf:

global_filter = [ "r|/dev/zd.*|", "r|/dev/mapper/pve-.*|", "r|/dev/LUN*|" ]

And hey presto everything is working nicely again. I'd like someone to explain to me how this could only happen on 2 hosts and not all 6 identical setups but for now I'm a happy camper.

Cheers,
c
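
A note on the pattern itself: global_filter entries are regular expressions, not shell globs, so in "r|/dev/LUN*|" the star applies only to the preceding "N" and the pattern matches any path containing "/dev/LU". That is enough to reject /dev/LUN2/... here, but it is broader than it looks. The sketch below approximates the matching with grep -E, which is an assumption (LVM uses its own regex engine); would_reject is a made-up helper and /dev/LUKS-test is a hypothetical path added only to show the over-match.

```shell
# would_reject PATTERN
# Prints the device paths from stdin that PATTERN matches, i.e. the ones
# an "r|PATTERN|" filter entry would be expected to reject.
would_reject() {
    grep -E -- "$1" || true
}

paths='/dev/LUN2/vm-515-disk-4
/dev/mapper/LUN1
/dev/sda3
/dev/LUKS-test'

# Pattern as used in the post: also catches the hypothetical /dev/LUKS-test
printf '%s\n' "$paths" | would_reject '/dev/LUN*'
# prints /dev/LUN2/vm-515-disk-4 and /dev/LUKS-test

# Tighter, anchored variant: only LVs inside the LUN<n> volume groups
printf '%s\n' "$paths" | would_reject '^/dev/LUN[0-9]+/'
# prints /dev/LUN2/vm-515-disk-4

# Untested suggestion for lvm.conf based on the tighter variant:
#   global_filter = [ "r|/dev/zd.*|", "r|/dev/mapper/pve-.*|", "r|^/dev/LUN[0-9]+/|" ]
```

In both cases /dev/mapper/LUN1 stays visible, which is what the host needs in order to keep using the LUNs themselves as PVs.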

gurubert · Sep 19, 2018
