[SOLVED] LXC Containers Time Out on Startup, vgs Hangs

BrianTillman

New Member
Oct 20, 2017
Hello all,
I have a 3-node cluster. On the second node, none of the LXC containers will start; all of them hit the startup timeout. The few that I tested start up normally when launched outside of Proxmox with lxc-start. KVM VMs are working as expected.

This must be something storage-related. For each attempt to start a container, there is an instance of vgs consuming 100% CPU. When I run the vgs command myself, it hangs for a few minutes before returning; the same goes for vgdisplay. Once a few vgs processes are hanging around, the cluster node appears offline in the UI, the container names stop displaying, and the cluster member names are replaced with IP addresses in the pvecm output.

I've rebooted (one at a time) the three Proxmox hosts in the cluster, as well as each of the NFS and iSCSI servers used for storage. No luck.

How do I go about figuring out which storage device is causing me grief?

Proxmox newbie here. Let me know if I'm not providing the right or sufficient data!
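The only generic checks I can think of are pvesm status (which lists each configured storage and whether it is active) and lsblk (which shows whether the local block devices still respond):

Code:
# list configured Proxmox storages and their status
pvesm status

# check that the local block devices are still visible and responsive
lsblk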

Code:
root@LPHV2:~# pveversion
pve-manager/5.0-32/2560e073 (running kernel: 4.10.17-4-pve)

Code:
root@LPHV2:~# pvecm status
Quorum information
------------------
Date:             Thu Oct 19 19:33:50 2017
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000002
Ring ID:          1/176
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.0.2.10
0x00000002          1 10.0.2.11 (local)
0x00000003          1 10.0.2.12

This pattern repeats in dmesg:
Code:
[ 1477.254702] EXT4-fs (dm-15): mounted filesystem with ordered data mode. Opts: (null)
[ 1477.317568] IPv6: ADDRCONF(NETDEV_UP): veth120i0: link is not ready
[ 1478.200105] vmbr5: port 2(veth120i0) entered blocking state
[ 1478.200108] vmbr5: port 2(veth120i0) entered disabled state
[ 1478.200247] device veth120i0 entered promiscuous mode
[ 1478.200821] device vlan5 entered promiscuous mode
[ 1478.383017] eth0: renamed from vethTYS09V
[ 1478.869950] vmbr5: port 2(veth120i0) entered disabled state
[ 1478.872021] device veth120i0 left promiscuous mode
[ 1478.872028] vmbr5: port 2(veth120i0) entered disabled state
[ 1478.983105] device vlan5 left promiscuous mode

Example LXC container config:

Code:
root@LPHV2:~# cat /etc/pve/lxc/116.conf
arch: amd64
cores: 4
hostname: LCZM
memory: 4096
mp1: LocalRaid0:vm-116-disk-1,mp=/mnt/ZMData,size=128G
net0: name=eth0,bridge=vmbr1,gw=10.0.2.1,hwaddr=1A:61:A8:AA:2E:69,ip=10.0.2.101/24,type=veth
onboot: 1
ostype: ubuntu
rootfs: LXC:vm-116-disk-1,size=3G
swap: 512

Starting up 116
Code:
root@LPHV2:~# pct start 116
Job for lxc@116.service failed because a timeout was exceeded.
See "systemctl status lxc@116.service" and "journalctl -xe" for details.
command 'systemctl start lxc@116' failed: exit code 1

Our new vgs process
Code:
root     18329 99.7  0.1  96776 67444 ?        R    19:35   2:48 /sbin/vgs --separator : --noheadings --units b --unbuffered --nosuffix --options vg_name,vg_size,vg_free
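To see where a scan like that stalls, the generic approaches (standard LVM and strace usage, nothing Proxmox-specific) are a verbose run of the scan, or an strace of the same command from the process list above to watch which /dev node the reads hang on:

Code:
# verbose scan; the debug output stops at the device it is stuck on
vgs -vvvv --options vg_name,vg_size,vg_free

# or trace the exact command shown above and watch where the syscalls stall
strace -f -tt /sbin/vgs --separator : --noheadings --units b --unbuffered --nosuffix --options vg_name,vg_size,vg_free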

systemctl status
Code:
root@LPHV2:~# systemctl status lxc@116.service
● lxc@116.service - LXC Container: 116
   Loaded: loaded (/lib/systemd/system/lxc@.service; disabled; vendor preset: enabled)
  Drop-In: /usr/lib/systemd/system/lxc@.service.d
           └─pve-reboot.conf
   Active: failed (Result: timeout) since Thu 2017-10-19 19:39:14 MST; 1min 0s ago
     Docs: man:lxc-start
           man:lxc
  Process: 19083 ExecStart=/usr/bin/lxc-start -n 116 (code=killed, signal=TERM)
    Tasks: 0 (limit: 4915)
   CGroup: /system.slice/system-lxc.slice/lxc@116.service

Oct 19 19:37:44 LPHV2 systemd[1]: Starting LXC Container: 116...
Oct 19 19:39:14 LPHV2 systemd[1]: lxc@116.service: Start operation timed out. Terminating.
Oct 19 19:39:14 LPHV2 systemd[1]: Failed to start LXC Container: 116.
Oct 19 19:39:14 LPHV2 systemd[1]: lxc@116.service: Unit entered failed state.
Oct 19 19:39:14 LPHV2 systemd[1]: lxc@116.service: Failed with result 'timeout'.

LXC debug log attached; it was generated with:
Code:
root@LPHV2:~# lxc-start -n 116 -F -l DEBUG -o 116.log

Code:
root@LPHV2:~# vgs
  VG         #PV #LV #SN Attr   VSize   VFree
  Data         1   1   0 wz--n-   8.00g     0
  Data         1   3   0 wz--n-   8.00g  5.00g
  Data-1       1   7   0 wz--n- 256.00g 48.00g
  LXC          1  17   0 wz--n-  64.00g 25.00g
  LocalRaid0   1   1   0 wz--n- 136.70g  8.70g
  OS-1         1   6   0 wz--n- 256.00g 32.00g
  pve          1   5   0 wz--n- 136.45g 15.84g

Code:
root@LPHV2:~# lvs
  LV             VG         Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  AtlDB          Data       -wi-------   8.00g                                                   
  BitbucketData  Data       -wi-------   1.00g                                                   
  ConfluenceData Data       -wi-------   1.00g                                                   
  JiraData       Data       -wi-------   1.00g                                                   
  vm-108-disk-1  Data-1     -wi-ao----   8.00g                                                   
  vm-110-disk-1  Data-1     -wi-ao----   8.00g                                                   
  vm-111-disk-1  Data-1     -wi-------  32.00g                                                   
  vm-113-disk-1  Data-1     -wi-------  32.00g                                                   
  vm-123-disk-1  Data-1     -wi-a-----  32.00g                                                   
  vm-131-disk-1  Data-1     -wi-------  32.00g                                                   
  vm-134-disk-1  Data-1     -wi-------  64.00g                                                   
  vm-116-disk-1  LXC        -wi-a-----   3.00g                                                   
  vm-117-disk-1  LXC        -wi-------   2.00g                                                   
  vm-118-disk-1  LXC        -wi-------   1.00g                                                   
  vm-120-disk-1  LXC        -wi-a-----   2.00g                                                   
  vm-121-disk-1  LXC        -wi-a-----   2.00g                                                   
  vm-122-disk-1  LXC        -wi-a-----   5.00g                                                   
  vm-123-disk-1  LXC        -wi-a-----   2.00g                                                   
  vm-124-disk-1  LXC        -wi-a-----   2.00g                                                   
  vm-125-disk-1  LXC        -wi-a-----   2.00g                                                   
  vm-126-disk-1  LXC        -wi-a-----   2.00g                                                   
  vm-127-disk-1  LXC        -wi-------   2.00g                                                   
  vm-128-disk-1  LXC        -wi-a-----   2.00g                                                   
  vm-129-disk-1  LXC        -wi-------   2.00g                                                   
  vm-130-disk-1  LXC        -wi-------   2.00g                                                   
  vm-132-disk-1  LXC        -wi-a-----   2.00g                                                   
  vm-133-disk-1  LXC        -wi-a-----   2.00g                                                   
  vm-134-disk-1  LXC        -wi-------   4.00g                                                   
  vm-116-disk-1  LocalRaid0 -wi-a----- 128.00g                                                   
  vm-103-disk-1  OS-1       -wi-------  32.00g                                                   
  vm-105-disk-1  OS-1       -wi-------  32.00g                                                   
  vm-108-disk-1  OS-1       -wi-ao----  16.00g                                                   
  vm-110-disk-1  OS-1       -wi-ao----  16.00g                                                   
  vm-114-disk-1  OS-1       -wi-------  96.00g                                                   
  vm-131-disk-1  OS-1       -wi-------  32.00g                                                   
  data           pve        twi-aotz--  78.45g             43.19  21.75                           
  root           pve        -wi-ao----  34.00g                                                   
  swap           pve        -wi-ao----   8.00g                                                   
  vm-101-disk-1  pve        Vwi-a-tz--  32.00g data        45.05                                 
  vm-109-disk-1  pve        Vwi-aotz--  32.00g data        60.84
 

Code:
  VG         #PV #LV #SN Attr   VSize   VFree
  Data         1   1   0 wz--n-   8.00g     0
  Data         1   3   0 wz--n-   8.00g  5.00g
  Data-1       1   7   0 wz--n- 256.00g 48.00g
You have two volume groups with the same name ('Data'). I have seen that LVM does not like this at all and will do things like stall for seconds or minutes.
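You can see the duplicates side by side, together with their UUIDs and the physical volumes behind them, using the standard lvm2 reporting options, for example:

Code:
# one row per VG/PV pair, including the VG UUID needed to tell the two 'Data' groups apart
vgs -o vg_name,vg_uuid,pv_name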
 
Dominik, thank you!

I had noticed the duplicate VG named 'Data' but dismissed it; since it wasn't something I had created intentionally, I assumed it was something Proxmox had done on purpose.

Renaming one of the 'Data' volume groups returned everything to normal.

I'm still unsure of the root cause. The four volumes split between the two 'Data' volume groups were originally in a single VG named 'Data'. I don't know what action(s) caused a second 'Data' VG to appear holding one of the original four volumes. This is a lab environment without any change management, so I don't expect to figure out how I got into this state.

In case others run into this condition, here's what I did to sort it out:

1) You can figure out which logical volumes belong to which of the duplicate-named volume groups by selecting on the volume group's UUID, for example:
Code:
lvdisplay -v --select vg_uuid=MigNgO-W1L1-uKQS-Xpmd-am3G-1gi7-p28k8Y

2) Similarly, since the name is ambiguous, you rename a volume group by its UUID:
Code:
vgrename MigNgO-W1L1-uKQS-Xpmd-am3G-1gi7-p28k8Y "DataFoo"
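3) To double-check, list the volume groups again and, if the volumes in the renamed group are needed, activate them under the new name ('DataFoo' is just the example name from step 2):

Code:
# confirm both groups now show up under distinct names
vgs

# activate the logical volumes in the renamed group, if they are needed
vgchange -ay DataFoo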