SNMP monitoring

caplam

I am trying to set up a LibreNMS monitoring server in an LXC container.
I can monitor my switch, a QNAP NAS and a VM, but I'm unable to get it working for a Proxmox host.
I installed snmpd with the standard Debian config. The service starts without problems and is active.
But when I try snmpwalk from the Proxmox host or from the NMS server, I get a timeout.
Still, the snmpd service remains active.
When I tried to restart the container, I saw this in /var/log/messages on the host:
Code:
pve kernel: [511726.443462] audit: type=1400 audit(1543264741.686:173): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-108_</var/lib/lxc>" name="/" pid=953 comm="(ionclean)" flags="rw, rslave
Here is the config file of the container:
Code:
arch: amd64
cores: 1
hostname: nms
memory: 512
net0: name=eth0,bridge=vmbr0,hwaddr=2E:76:DB:26:95:E2,ip=dhcp,type=veth
ostype: debian
rootfs: local-lvm:vm-108-disk-0,size=16G
swap: 512

Edit:
I can add that starting and stopping the snmpd service is fast.
Once I have run an snmpwalk, stopping the service or getting its status is very slow.

Edit 2: with ss -ulnp I can see the Proxmox host is not listening on UDP port 161. It's not listening at all, and yet systemctl status snmpd tells me the snmpd service is running fine.
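For reference, the sanity check I'm running on the host looks roughly like this (the community string public and the subtree are just placeholders for whatever is configured in snmpd.conf):
Code:
# is anything bound to UDP port 161?
ss -ulnp | grep ':161'
# walk the system subtree locally (SNMP v2c; community string is a placeholder)
snmpwalk -v2c -c public 127.0.0.1 1.3.6.1.2.1.1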
 
I made some changes in my SNMP config, copied from https://www.svennd.be/how-to-install-snmp-service-on-proxmox/
Now I see that there is something listening on port 161.
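As far as I understand it, the main change is listening on all interfaces instead of localhost only. A minimal sketch of such an snmpd.conf (community string and subnet are placeholders, not necessarily exactly what the linked guide uses):
Code:
# /etc/snmp/snmpd.conf (sketch)
agentAddress udp:161                 # listen on all interfaces, not just 127.0.0.1
rocommunity public 192.168.0.0/24    # read-only community, limited to the LAN
sysLocation server room
sysContact  admin@example.com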
With ss -au I can see that Recv-Q is growing, which means that nothing is reading data from the socket. It stops growing at 166656.
I suppose that's the size of the buffer.
systemctl status snmpd shows the service as running.
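If I understand correctly, the ceiling for Recv-Q is the socket receive buffer, so something like this should show the kernel limits and the per-socket memory (just a guess on my side):
Code:
# default and maximum receive buffer sizes
sysctl net.core.rmem_default net.core.rmem_max
# socket memory details (skmem) for whatever is bound to UDP 161
ss -ulnm 'sport = :161'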
 
I wonder if the issue has to do with running LibreNMS as a container and trying to poll the host.

I have LibreNMS installed as a full VM and it monitors the various Proxmox nodes through SNMP (I am actually in the middle of moving LibreNMS to a container...).

You may want to try installing the LibreNMS VM from their website and see if it connects without issue, just to check whether your Proxmox SNMP is set up properly.
 
I migrated the container to another host.
Even when I try snmpwalk from the Proxmox host itself I get a timeout.
I don't think it's related to LibreNMS.
It also has no problem getting data from a ProCurve switch, a QNAP NAS or a Debian VM on the host.
 
I finally got it working with... a reboot, but only for a few days.
Last night my 2 nodes were unreachable by SNMP.
I tried to stop/start the snmpd daemon. I tried to kill snmpd and restart it. I tried to restart networking.service.
When I restarted networking, all my guests on the host became unreachable, even by ping.
vmbr0 was not bridging any guest interface.
The only thing that worked was to reboot the host.

I dug around a bit with my very limited knowledge. I found that when my host becomes unreachable by SNMP, snmpd is still running.
ss -au shows me that Recv-Q is growing.
There might be something Proxmox-related, because my 2 hosts are the only machines which behave like that. Guests continue to answer SNMP requests (LXC or QEMU with Debian).
Last night my 2 hosts went down (SNMP check) within a 5 minute interval.
I will continue to dig around.
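A way to see what snmpd is actually blocked on when this happens might be something like this (sketch; reading the kernel stack needs root):
Code:
# process state (D = uninterruptible sleep) and the kernel function it waits in
ps -o pid,stat,wchan:32,cmd -C snmpd
# kernel stack of the snmpd process
cat /proc/$(pidof snmpd)/stack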
 
I tried to restart networking.service.
When I restarted networking, all my guests on the host became unreachable, even by ping.
vmbr0 was not bridging any guest interface.
This is (sadly) the normal behaviour of ifupdown - you could try switching to ifupdown2 (in our repositories since a short while ago), which should support network reloading.

As for snmpd - do you see anything from it in the logs (by default it logs to the journal on Debian and thus PVE)?
Otherwise you could configure a higher log level and see if it shows something relevant:

https://prefetch.net/blog/2009/04/16/debugging-net-snmp-problems/
http://net-snmp.sourceforge.net/wiki/index.php/Debug_tokens
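For example, something along these lines (just a sketch - -DALL is extremely verbose, so you may want to pick specific tokens from the second link instead):
Code:
# stop the service and run snmpd in the foreground, logging to stderr,
# with all debug tokens enabled
systemctl stop snmpd
snmpd -f -Le -DALL
# on Debian the daemon options usually live in /etc/default/snmpd (SNMPDOPTS)
# if you want to make this permanent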
 
In the journal I see an error on both hosts, which does not correlate in time with the message in the NMS.
The error is:
Code:
inetNetToMediaTable:_add_or_update_arpentry: unsupported address type, len = 0
I didn't find other examples of this error.
It seems linked to a change in the bridge config which doesn't update the ARP table, so a message arrives with no IP address and should be dropped but isn't.
I don't understand why the daemon doesn't read (and empty) the buffer.
I need to find more debug messages.
I'm not sure about what I wrote above; this is far from my field of expertise.
 
I may have an idea.
I ran strace on the host and snmpwalk in another shell. This host is not responsive to SNMP. snmpwalk returns a few dozen lines and then hangs until the timeout.
The last line of the trace is:
Code:
statfs("/mnt/pve/TS639",
where /mnt/pve/TS639 is the mount point of the NFS shared storage which is unresponsive. I deactivated it in the GUI.

On the other host, which has been rebooted and is SNMP responsive, I tried the same.
snmpwalk ends normally and in the trace there is no occurrence of /mnt/pve/TS639.

My guess is that when I deactivate the share, the host acts as if it were still active until a reboot.
I guess that if I reboot the first host it will be OK.
I suppose that both my hosts became unresponsive at the time when the share became unresponsive.
This shared storage is a very old QNAP NAS which is 99% full of backups. It only answers to ping; SSH is not possible. I think I will have to hard reboot it, which I don't want to do because I'll have to run fsck, and there is not enough memory, so I'll have to mount a USB key to make a swapfile on it.
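To check which mount points hang without freezing the shell, something like this seems to work (sketch; the 5 second timeout is arbitrary):
Code:
# stat -f does the same statfs() call that shows up in the trace;
# timeout keeps the shell from hanging on a dead NFS server
for m in $(awk '$3 ~ /^nfs/ {print $2}' /proc/mounts); do
    timeout 5 stat -f "$m" > /dev/null && echo "$m ok" || echo "$m hanging?"
done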

And that makes me think about the initial installation of snmpd. I had to reboot both my hosts to make it work.
I think it's because I had another network share, a Synology DS1815, that went down because of the Intel Atom bug. I had deactivated it as a network share in the Proxmox GUI but hadn't rebooted the hosts.

So I have a question: is it possible to really deactivate a network share storage without rebooting?

Edit: I found the same error on the second host (inetNetToMediaTable:_add_or_update_arpentry: unsupported address type, len = 0) with no effect on the SNMP communication.
 
* check the output of `mount` on your host - if the nfs-share is still mounted, you can try to unmount it (see the example below).
* hanging nfs-shares are quite tricky (the kernel can hang indefinitely waiting for them) and can require a reboot to be removed - YMMV
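for example (no guarantees on a truly hung NFS mount - even a force or lazy unmount can leave the kernel waiting):
Code:
umount /mnt/pve/TS639       # normal unmount
umount -f /mnt/pve/TS639    # force - intended for unreachable NFS servers
umount -l /mnt/pve/TS639    # lazy - detach now, clean up once no longer busy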
 
Here is the output of mount:
Code:
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,relatime)
udev on /dev type devtmpfs (rw,nosuid,relatime,size=3965716k,nr_inodes=991429,mode=755)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,noexec,relatime,size=803944k,mode=755)
/dev/mapper/pve-root on / type ext4 (rw,relatime,errors=remount-ro,data=ordered)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=37,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=12836)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
mqueue on /dev/mqueue type mqueue (rw,relatime)
sunrpc on /run/rpc_pipefs type rpc_pipefs (rw,relatime)
configfs on /sys/kernel/config type configfs (rw,relatime)
fusectl on /sys/fs/fuse/connections type fusectl (rw,relatime)
lxcfs on /var/lib/lxcfs type fuse.lxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)
192.168.0.2:/Proxmox on /mnt/pve/TS639 type nfs (rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.0.2,mountvers=3,mountport=30000,mountproto=udp,local_lock=none,addr=192.168.0.2)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=803940k,mode=700)
/dev/fuse on /etc/pve type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)
tracefs on /sys/kernel/debug/tracing type tracefs (rw,relatime)
So it seems the share is still mounted.
If I run the same command on the host which has been rebooted, I don't see the share.
I deactivated the share in the datacenter GUI yesterday.
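A shorter way to check only the NFS mounts, instead of going through the whole mount output, seems to be:
Code:
findmnt -t nfs,nfs4
# or simply
mount | grep nfs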

My NAS (TS639) has been back online for an hour (e2fsck finished and the volume is remounted). It's still deactivated for the time being.
 
Try to unmount it:
`umount /mnt/pve/TS639`
 
The share has been properly unmounted and the snmpd daemon can now read (and empty) the buffer normally.
snmpwalk is OK, and I guess in 5 minutes I will see the host in my NMS.