SNMP monitoring

caplam

I am trying to set up a LibreNMS monitoring server in an LXC container.
I can monitor my switch, a QNAP NAS and a VM, but I'm unable to get it working for a Proxmox host.
I installed snmpd with the standard Debian config. The service starts without problems and is active.
But when I try snmpwalk from the Proxmox host or from the NMS server, I get a timeout.
Still, the snmpd service remains active.
When I tried to restart the container, I saw this in /var/log/messages on the host:
Code:
pve kernel: [511726.443462] audit: type=1400 audit(1543264741.686:173): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-108_</var/lib/lxc>" name="/" pid=953 comm="(ionclean)" flags="rw, rslave
Here is the config file of the container:
Code:
arch: amd64
cores: 1
hostname: nms
memory: 512
net0: name=eth0,bridge=vmbr0,hwaddr=2E:76:DB:26:95:E2,ip=dhcp,type=veth
ostype: debian
rootfs: local-lvm:vm-108-disk-0,size=16G
swap: 512

Edit:
I can add that starting and stopping the snmpd service is fast.
Once I have run an snmpwalk, stopping the service or getting its status is very slow.

Edit 2: with ss -ulnp I can see the Proxmox host is not listening on UDP port 161. It's not listening at all, and yet systemctl status snmpd tells me the snmpd service is running fine.
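For reference, the sanity check I'm running on the host looks roughly like this (the community string public and the subtree are just placeholders for whatever is configured in snmpd.conf):
Code:
# is anything bound to UDP port 161?
ss -ulnp | grep ':161'
# walk the system subtree locally (SNMP v2c; community string is a placeholder)
snmpwalk -v2c -c public 127.0.0.1 1.3.6.1.2.1.1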
 
I made some changes in my SNMP config, copied from https://www.svennd.be/how-to-install-snmp-service-on-proxmox/
Now I see that there is something listening on port 161.
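As far as I understand it, the main change is listening on all interfaces instead of localhost only. A minimal sketch of such an snmpd.conf (community string and subnet are placeholders, not necessarily exactly what the linked guide uses):
Code:
# /etc/snmp/snmpd.conf (sketch)
agentAddress udp:161                 # listen on all interfaces, not just 127.0.0.1
rocommunity public 192.168.0.0/24    # read-only community, limited to the LAN
sysLocation server room
sysContact  admin@example.com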
With ss -au I can see that Recv-Q is growing, which means that nothing is reading data from the socket. It stops growing at 166656.
I suppose that's the size of the buffer.
systemctl status snmpd shows the service as running.
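If I understand correctly, the ceiling for Recv-Q is the socket receive buffer, so something like this should show the kernel limits and the per-socket memory (just a guess on my side):
Code:
# default and maximum receive buffer sizes
sysctl net.core.rmem_default net.core.rmem_max
# socket memory details (skmem) for whatever is bound to UDP 161
ss -ulnm 'sport = :161'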
 
I wonder if the issue has to do with running LibreNMS as a container and trying to poll the host.

I have LibreNMS installed as a full VM and it monitors the various Proxmox nodes through SNMP (I am actually in the middle of moving LibreNMS to a container...).

You may want to try installing the LibreNMS VM from their website and see if it connects without issue, just to check whether your Proxmox SNMP is set up properly.
 
I migrated the container to another host.
Even when I try snmpwalk from the Proxmox host itself I get a timeout.
I don't think it's related to LibreNMS.
It also has no problem getting data from a ProCurve switch, a QNAP NAS or a Debian VM on the host.
 
I finally got it working with... a reboot, but only for a few days.
Last night my 2 nodes were unreachable by SNMP.
I tried to stop/start the snmpd daemon. I tried to kill snmpd and restart it. I tried to restart networking.service.
When I restarted networking, all my guests on the host became unreachable, even by ping.
vmbr0 was not bridging any guest interface.
The only thing that worked was to reboot the host.

I dug around a bit with my very limited knowledge. I found that when my host becomes unreachable by SNMP, snmpd is still running.
ss -au shows me that Recv-Q is growing.
There might be something Proxmox-related, because my 2 hosts are the only machines which behave like that. Guests continue to answer SNMP requests (LXC or QEMU with Debian).
Last night my 2 hosts went down (SNMP check) within a 5 minute interval.
I will continue to dig around.
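A way to see what snmpd is actually blocked on when this happens might be something like this (sketch; reading the kernel stack needs root):
Code:
# process state (D = uninterruptible sleep) and the kernel function it waits in
ps -o pid,stat,wchan:32,cmd -C snmpd
# kernel stack of the snmpd process
cat /proc/$(pidof snmpd)/stack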
 
I tried to restart networking.service.
When I restarted networking, all my guests on the host became unreachable, even by ping.
vmbr0 was not bridging any guest interface.
This is (sadly) the normal behaviour of ifupdown - you could try switching to ifupdown2 (in our repositories since a short while ago), which should support network reloading.

As for snmpd - do you see anything from it in the logs (by default it logs to the journal on Debian and thus PVE)?
Otherwise you could configure a higher log level and see if it shows something relevant:

https://prefetch.net/blog/2009/04/16/debugging-net-snmp-problems/
http://net-snmp.sourceforge.net/wiki/index.php/Debug_tokens
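For example, something along these lines (just a sketch - -DALL is extremely verbose, so you may want to pick specific tokens from the second link instead):
Code:
# stop the service and run snmpd in the foreground, logging to stderr,
# with all debug tokens enabled
systemctl stop snmpd
snmpd -f -Le -DALL
# on Debian the daemon options usually live in /etc/default/snmpd (SNMPDOPTS)
# if you want to make this permanent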
 
In the journal I see an error on both hosts, which does not correlate in time with the message in the NMS.
The error is:
Code:
inetNetToMediaTable:_add_or_update_arpentry: unsupported address type, len = 0
I didn't find other examples of this error.
It seems linked to a change in the bridge config which doesn't update the ARP table, so a message arrives with no IP address and should be dropped but isn't.
I don't understand why the daemon doesn't read (and empty) the buffer.
I need to find more debug messages.
I'm not sure about what I wrote above; this is far from my field of expertise.
 
I may have an idea.
I ran strace on the host and snmpwalk in another shell. This host is not responsive to SNMP. snmpwalk returns a few dozen lines and then hangs until the timeout.
The last line of the trace is:
Code:
statfs("/mnt/pve/TS639",
where /mnt/pve/TS639 is the mount point of the NFS shared storage which is unresponsive. I deactivated it in the GUI.

On the other host, which has been rebooted and is SNMP responsive, I tried the same.
snmpwalk ends normally and in the trace there is no occurrence of /mnt/pve/TS639.

My guess is that when I deactivate the share, the host acts as if it were still active until a reboot.
I guess that if I reboot the first host it will be OK.
I suppose that both my hosts became unresponsive at the time when the share became unresponsive.
This shared storage is a very old QNAP NAS which is 99% full of backups. It only answers to ping; SSH is not possible. I think I will have to hard reboot it, which I don't want to do because I'll have to run fsck, and there is not enough memory, so I'll have to mount a USB key to make a swapfile on it.
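To check which mount points hang without freezing the shell, something like this seems to work (sketch; the 5 second timeout is arbitrary):
Code:
# stat -f does the same statfs() call that shows up in the trace;
# timeout keeps the shell from hanging on a dead NFS server
for m in $(awk '$3 ~ /^nfs/ {print $2}' /proc/mounts); do
    timeout 5 stat -f "$m" > /dev/null && echo "$m ok" || echo "$m hanging?"
done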

And that makes me think about the initial installation of snmpd. I had to reboot both my hosts to make it work.
I think it's because I had another network share, a Synology DS1815, that went down because of the Intel Atom bug. I had deactivated it as a network share in the Proxmox GUI but hadn't rebooted the hosts.

So I have a question: is it possible to really deactivate a network share storage without rebooting?

Edit: I found the same error on the second host (inetNetToMediaTable:_add_or_update_arpentry: unsupported address type, len = 0) with no effect on the SNMP communication.
 
* check the output of `mount` on your host - if the nfs-share is still mounted, you can try to unmount it (see the example below).
* hanging nfs-shares are quite tricky (the kernel can hang indefinitely waiting for them) and can require a reboot to be removed - YMMV
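for example (no guarantees on a truly hung NFS mount - even a force or lazy unmount can leave the kernel waiting):
Code:
umount /mnt/pve/TS639       # normal unmount
umount -f /mnt/pve/TS639    # force - intended for unreachable NFS servers
umount -l /mnt/pve/TS639    # lazy - detach now, clean up once no longer busy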
 
Here is the output of mount:
Code:
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,relatime)
udev on /dev type devtmpfs (rw,nosuid,relatime,size=3965716k,nr_inodes=991429,mode=755)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,noexec,relatime,size=803944k,mode=755)
/dev/mapper/pve-root on / type ext4 (rw,relatime,errors=remount-ro,data=ordered)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=37,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=12836)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
mqueue on /dev/mqueue type mqueue (rw,relatime)
sunrpc on /run/rpc_pipefs type rpc_pipefs (rw,relatime)
configfs on /sys/kernel/config type configfs (rw,relatime)
fusectl on /sys/fs/fuse/connections type fusectl (rw,relatime)
lxcfs on /var/lib/lxcfs type fuse.lxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)
192.168.0.2:/Proxmox on /mnt/pve/TS639 type nfs (rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.0.2,mountvers=3,mountport=30000,mountproto=udp,local_lock=none,addr=192.168.0.2)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=803940k,mode=700)
/dev/fuse on /etc/pve type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)
tracefs on /sys/kernel/debug/tracing type tracefs (rw,relatime)
So it seems the share is still mounted.
If I run the same command on the host which has been rebooted, I don't see the share.
I deactivated the share in the datacenter GUI yesterday.
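A shorter way to check only the NFS mounts, instead of going through the whole mount output, seems to be:
Code:
findmnt -t nfs,nfs4
# or simply
mount | grep nfs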

My NAS (TS639) has been back online for an hour (e2fsck finished and the volume is remounted). It's still deactivated for the time being.
 
Try to unmount it:
`umount /mnt/pve/TS639`
 
The share has been properly unmounted and the snmpd daemon can now read (and empty) the buffer normally.
snmpwalk is OK, and I guess in 5 minutes I will see the host in my NMS.