Hello all,
I have a 3 node cluster. On the second node, none of the LXC containers will start; all of them hit the startup timeout. Of the few that I tested, they start up normally when started outside of Proxmox with lxc-start. KVM VMs are working as expected.
This must be something storage related. For each attempt to start a container, there is an instance of vgs consuming 100% CPU. When I run the vgs command myself, it hangs for a few minutes before returning; the same goes for vgdisplay. When a few vgs processes are stuck like this, the cluster node appears offline in the UI, the names of the containers don't display, and the cluster member names are replaced with IP addresses in pvecm output.
I've rebooted (in series) the three Proxmox hosts in the cluster, as well as each of the NFS and iSCSI servers used for storage. No luck.
How do I go about figuring out which storage device is causing me grief?
Proxmox newbie here. Let me know if I'm not providing the right/sufficient data!
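Here's what I've got so far for narrowing it down (sanity check welcome): while a vgs is hung, list the block devices that still have I/O outstanding. Field 9 of /sys/block/&lt;dev&gt;/stat is "I/Os currently in progress", so a device that stays non-zero is probably the one the scan is stuck on. The script and its stuck_devices name are just my own sketch:

```shell
#!/bin/sh
# Sketch: list block devices with outstanding I/O while a vgs is hung.
# Field 9 of /sys/block/<dev>/stat is "I/Os currently in progress"; a
# device that stays non-zero here is likely what the LVM scan is stuck on.
# Takes an optional directory argument (defaults to /sys/block) so it can
# be exercised against fake stat files.
stuck_devices() {
    base="${1:-/sys/block}"
    for stat in "$base"/*/stat; do
        [ -f "$stat" ] || continue
        # 9th field of the stat file = I/Os currently in progress
        inflight=$(awk '{print $9}' "$stat")
        if [ "${inflight:-0}" -gt 0 ]; then
            dev="${stat%/stat}"
            echo "${dev##*/}: $inflight in flight"
        fi
    done
}

stuck_devices "$@"
```

I'd run it a few times, a couple of seconds apart, and watch for a device whose count never drains.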
Code:
root@LPHV2:~# pveversion
pve-manager/5.0-32/2560e073 (running kernel: 4.10.17-4-pve)
Code:
root@LPHV2:~# pvecm status
Quorum information
------------------
Date: Thu Oct 19 19:33:50 2017
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000002
Ring ID: 1/176
Quorate: Yes
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.0.2.10
0x00000002 1 10.0.2.11 (local)
0x00000003 1 10.0.2.12
This pattern repeats in dmesg:
Code:
[ 1477.254702] EXT4-fs (dm-15): mounted filesystem with ordered data mode. Opts: (null)
[ 1477.317568] IPv6: ADDRCONF(NETDEV_UP): veth120i0: link is not ready
[ 1478.200105] vmbr5: port 2(veth120i0) entered blocking state
[ 1478.200108] vmbr5: port 2(veth120i0) entered disabled state
[ 1478.200247] device veth120i0 entered promiscuous mode
[ 1478.200821] device vlan5 entered promiscuous mode
[ 1478.383017] eth0: renamed from vethTYS09V
[ 1478.869950] vmbr5: port 2(veth120i0) entered disabled state
[ 1478.872021] device veth120i0 left promiscuous mode
[ 1478.872028] vmbr5: port 2(veth120i0) entered disabled state
[ 1478.983105] device vlan5 left promiscuous mode
Example LXC container config:
Code:
root@LPHV2:~# cat /etc/pve/lxc/116.conf
arch: amd64
cores: 4
hostname: LCZM
memory: 4096
mp1: LocalRaid0:vm-116-disk-1,mp=/mnt/ZMData,size=128G
net0: name=eth0,bridge=vmbr1,gw=10.0.2.1,hwaddr=1A:61:A8:AA:2E:69,ip=10.0.2.101/24,type=veth
onboot: 1
ostype: ubuntu
rootfs: LXC:vm-116-disk-1,size=3G
swap: 512
Starting up 116
Code:
root@LPHV2:~# pct start 116
Job for lxc@116.service failed because a timeout was exceeded.
See "systemctl status lxc@116.service" and "journalctl -xe" for details.
command 'systemctl start lxc@116' failed: exit code 1
Our new vgs process
Code:
root 18329 99.7 0.1 96776 67444 ? R 19:35 2:48 /sbin/vgs --separator : --noheadings --units b --unbuffered --nosuffix --options vg_name,vg_size,vg_free
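Side note on what I'm planning to try next (tell me if this is wrongheaded): since that vgs scans every device, I want to re-run it against one device at a time using --config with a devices filter, adding devices back until it hangs again. A tiny helper to build the filter string; lvm_filter_for and the device paths are just my examples:

```shell
#!/bin/sh
# Example helper (my own naming): build an LVM --config string whose
# filter accepts only the given device and rejects everything else.
lvm_filter_for() {
    # "a|^<dev>|" accepts paths starting with <dev>; "r|.*|" rejects the rest.
    printf 'devices { filter = [ "a|^%s|", "r|.*|" ] }' "$1"
}

# Usage (example) -- scan only /dev/sda; if this vgs hangs, that's the culprit:
# vgs --config "$(lvm_filter_for /dev/sda)"
```

The idea is that a per-invocation --config override avoids touching /etc/lvm/lvm.conf while testing.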
systemctl status
Code:
root@LPHV2:~# systemctl status lxc@116.service
● lxc@116.service - LXC Container: 116
Loaded: loaded (/lib/systemd/system/lxc@.service; disabled; vendor preset: enabled)
Drop-In: /usr/lib/systemd/system/lxc@.service.d
└─pve-reboot.conf
Active: failed (Result: timeout) since Thu 2017-10-19 19:39:14 MST; 1min 0s ago
Docs: man:lxc-start
man:lxc
Process: 19083 ExecStart=/usr/bin/lxc-start -n 116 (code=killed, signal=TERM)
Tasks: 0 (limit: 4915)
CGroup: /system.slice/system-lxc.slice/lxc@116.service
Oct 19 19:37:44 LPHV2 systemd[1]: Starting LXC Container: 116...
Oct 19 19:39:14 LPHV2 systemd[1]: lxc@116.service: Start operation timed out. Terminating.
Oct 19 19:39:14 LPHV2 systemd[1]: Failed to start LXC Container: 116.
Oct 19 19:39:14 LPHV2 systemd[1]: lxc@116.service: Unit entered failed state.
Oct 19 19:39:14 LPHV2 systemd[1]: lxc@116.service: Failed with result 'timeout'.
LXC Debug log attached.
Code:
root@LPHV2:~# lxc-start -n 116 -F -l DEBUG -o 116.log
Code:
root@LPHV2:~# vgs
VG         #PV #LV #SN Attr   VSize   VFree
Data         1   1   0 wz--n-   8.00g      0
Data         1   3   0 wz--n-   8.00g  5.00g
Data-1       1   7   0 wz--n- 256.00g 48.00g
LXC          1  17   0 wz--n-  64.00g 25.00g
LocalRaid0   1   1   0 wz--n- 136.70g  8.70g
OS-1         1   6   0 wz--n- 256.00g 32.00g
pve          1   5   0 wz--n- 136.45g 15.84g
Code:
root@LPHV2:~# lvs
LV             VG         Attr         LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
AtlDB          Data       -wi-------   8.00g
BitbucketData  Data       -wi-------   1.00g
ConfluenceData Data       -wi-------   1.00g
JiraData       Data       -wi-------   1.00g
vm-108-disk-1  Data-1     -wi-ao----   8.00g
vm-110-disk-1  Data-1     -wi-ao----   8.00g
vm-111-disk-1  Data-1     -wi-------  32.00g
vm-113-disk-1  Data-1     -wi-------  32.00g
vm-123-disk-1  Data-1     -wi-a-----  32.00g
vm-131-disk-1  Data-1     -wi-------  32.00g
vm-134-disk-1  Data-1     -wi-------  64.00g
vm-116-disk-1  LXC        -wi-a-----   3.00g
vm-117-disk-1  LXC        -wi-------   2.00g
vm-118-disk-1  LXC        -wi-------   1.00g
vm-120-disk-1  LXC        -wi-a-----   2.00g
vm-121-disk-1  LXC        -wi-a-----   2.00g
vm-122-disk-1  LXC        -wi-a-----   5.00g
vm-123-disk-1  LXC        -wi-a-----   2.00g
vm-124-disk-1  LXC        -wi-a-----   2.00g
vm-125-disk-1  LXC        -wi-a-----   2.00g
vm-126-disk-1  LXC        -wi-a-----   2.00g
vm-127-disk-1  LXC        -wi-------   2.00g
vm-128-disk-1  LXC        -wi-a-----   2.00g
vm-129-disk-1  LXC        -wi-------   2.00g
vm-130-disk-1  LXC        -wi-------   2.00g
vm-132-disk-1  LXC        -wi-a-----   2.00g
vm-133-disk-1  LXC        -wi-a-----   2.00g
vm-134-disk-1  LXC        -wi-------   4.00g
vm-116-disk-1  LocalRaid0 -wi-a----- 128.00g
vm-103-disk-1  OS-1       -wi-------  32.00g
vm-105-disk-1  OS-1       -wi-------  32.00g
vm-108-disk-1  OS-1       -wi-ao----  16.00g
vm-110-disk-1  OS-1       -wi-ao----  16.00g
vm-114-disk-1  OS-1       -wi-------  96.00g
vm-131-disk-1  OS-1       -wi-------  32.00g
data           pve        twi-aotz--  78.45g             43.19  21.75
root           pve        -wi-ao----  34.00g
swap           pve        -wi-ao----   8.00g
vm-101-disk-1  pve        Vwi-a-tz--  32.00g data        45.05
vm-109-disk-1  pve        Vwi-aotz--  32.00g data        60.84