Hi,
we noticed a strange issue on our Proxmox nodes.
On 30/08 we added a new node to our Proxmox cluster (3 nodes before, 4 after); this is the load average graph of the cluster since that date. As you can see, the load has been increasing steadily, day by day, since then. It was increasing only on the 3 "old" nodes, not on the new one (the new node is the lighter green band at the bottom of the graph). We only noticed the situation today, because it wasn't causing any particular trouble.
Today a trigger on our monitoring system alerted us that the number of processes on the first 3 nodes was over the threshold. This is the graph of the processes: today the count was over 600 per node. You can clearly see, in the bottom part of the graph, that node 4 is not affected. For some reason, at a precise time of day, about 10 processes per day are added on each of the affected nodes.
I also discovered that this process is created every morning on the first 3 nodes:
nobody 3193922 3193921 0 06:25 ? 00:00:00 /usr/bin/find / -ignore_readdir_race ( -fstype NFS -o -fstype nfs -o -fstype nfs4 -o -fstype afs -o -fstype binfmt_misc -o -fstype proc -o -fstype smbfs -o -fstype autofs -o -fstype iso9660 -o -fstype ncpfs -o -fstype coda -o -fstype devpts -o -fstype ftpfs -o -fstype devfs -o -fstype mfs -o -fstype shfs -o -fstype sysfs -o -fstype cifs -o -fstype lustre_lite -o -fstype tmpfs -o -fstype usbfs -o -fstype udf -o -fstype ocfs2 -o -type d -regex \(^/tmp$\)\|\(^/usr/tmp$\)\|\(^/var/tmp$\)\|\(^/afs$\)\|\(^/amd$\)\|\(^/alex$\)\|\(^/var/spool$\)\|\(^/sfs$\)\|\(^/media$\)\|\(^/var/lib/schroot/mount$\) ) -prune -o -print0
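To me that find invocation looks like the nightly updatedb run from the locate package (which would explain the fixed time every morning), but that is only my assumption; walking up the parent chain of one of the hanging processes should confirm what actually starts them. A minimal check, reusing the PIDs from the listing above:

ps -o pid=,ppid=,cmd= -p 3193921   # the find's parent process: likely a daily job started from cron
ls /etc/cron.daily/                # look for a locate/updatedb style daily script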
Something told me to check the filesystem mounts... and I discovered this:
10.50.0.160:/var/nfs/general on /mnt/pve/nfs_backup type nfs (rw,relatime,vers=3,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.50.0.160,mountvers=3,mountport=55917,mountproto=udp,local_lock=none,addr=10.50.0.160)
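For anyone who wants to check their own nodes, this is how I would list any leftover NFS mounts (nothing Proxmox-specific, just standard tools already on the nodes):

findmnt -t nfs,nfs4        # show all currently mounted NFS shares
grep nfs /proc/mounts      # alternative view straight from the kernel mount table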
This mount points to an old backup storage that we removed on 30/08 because we decommissioned the physical machine hosting the NFS server (before decommissioning it and shutting down the NFS service on the old backup storage, we removed the storage from the GUI).
As I said, we did this from the Proxmox GUI, and I would have expected Proxmox to also unmount the NFS folder; I was wrong. The mount is still there, and every morning something tries to access this folder to perform a backup task that had already been removed, which is what makes the number of processes (and consequently the load) increase. For some reason the processes are never killed; they just stay there, hanging.
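To be clear, the storage removal itself presumably went through on the Proxmox side: only the kernel mount was left behind. A quick way to check both sides (a sketch; nfs_backup is the storage ID taken from the mount output above):

grep -A3 nfs_backup /etc/pve/storage.cfg   # storage definition: should return nothing after removal
findmnt /mnt/pve/nfs_backup                # kernel mount: still listed until unmounted or rebooted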
Trying to manually unmount the folder doesn't work either... At this point we will try a reboot during the next maintenance window. Node 4 is not affected because it was rebooted several times after being added to the cluster and after the NFS storage had been removed from the cluster.
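In case it helps someone hitting the same thing, these are the unmount variants to try before resorting to a reboot (a sketch, using the mount point from above; with a hard NFS mount and the server gone, the plain umount tends to hang, and even the force/lazy variants are not guaranteed to succeed):

umount /mnt/pve/nfs_backup        # plain unmount: hangs or reports the target is busy
umount -f /mnt/pve/nfs_backup     # force unmount, intended for unreachable NFS servers
umount -l /mnt/pve/nfs_backup     # lazy unmount: detach from the tree now, clean up when no longer busy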
I think a reboot will fix this, but I wanted to report it: maybe someone else is in the same situation.