High boot disk usage on Proxmox 9.2 node (Ext4)

DerekG

Hi all,

Can anyone explain why the boot disks of my nodes (8-node cluster) are using over 80% of the 32GB I've allocated to each?
I don't store any VMs / CTs / templates etc. on the boot partition; they are stored on the LVM partition. I have installed the agents for monitoring / patching tools etc., but these can't add more than 1GB.
When I run commands like ~# du -sh /* .[^.]* | sort -hr, the output indicates that some 13GB is in use, so I'm trying to find another 13GB or so which is unaccounted for.
I've been through the standard process of running apt autoremove and limiting the journal size/days stored.

The one node which is indicating less disk used (17GB) is a node with no VMs or CTs running.

Maybe I'm missing something, but I don't believe that this is just a Proxmox reporting issue. Any advice would be welcome.
 
Try this to find out what is using the space:
Bash:
apt install -yU gdu
gdu -x /
It could also be a mount point that shadows something, so please share the output of this as well:
Bash:
df -hT
If there are any foreign mounts, unmount them and run the above again.
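
To spot which of the mounts are "foreign", a minimal sketch like the one below may help. It assumes the /proc/mounts field order (device, mount point, type, options, ...). The function name network_mounts and the list of filesystem types are illustrative choices, not anything provided by Proxmox.

```shell
# Print the mount points of network filesystems found in a /proc/mounts-style
# file (whitespace-separated fields: device, mount point, type, options, ...).
# network_mounts is a hypothetical helper name; extend the type list as needed.
network_mounts() {
    awk '$3 ~ /^(nfs|nfs4|cifs|smb3)$/ { print $2 }' "${1:-/proc/mounts}"
}

# Usage on a live node: network_mounts
```

Each path it prints is a directory whose local content is hidden while the share is mounted.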
 
Thanks Impact, that command seems to be working the same as ncdu -x, except it also shows the mounted stores.
The total usage is, as I said, 13GB of the 26GB indicated in the Proxmox GUI. The problem is that the usage seems to have grown after the many updates over the last month.


Code:
root@pve-2:~# df -hT
Filesystem                              Type      Size  Used Avail Use% Mounted on
udev                                    devtmpfs   18G     0   18G   0% /dev
tmpfs                                   tmpfs     4.0G  4.0M  3.9G   1% /run
/dev/mapper/pve-root                    ext4       32G   27G  3.6G  89% /
tmpfs                                   tmpfs      20G   66M   20G   1% /dev/shm
efivarfs                                efivarfs  192K   37K  151K  20% /sys/firmware/efi/efivars
tmpfs                                   tmpfs     5.0M     0  5.0M   0% /run/lock
tmpfs                                   tmpfs     1.0M     0  1.0M   0% /run/credentials/systemd-journald.service
tmpfs                                   tmpfs      20G     0   20G   0% /tmp
/dev/nvme0n1p2                          vfat     1022M  9.1M 1013M   1% /boot/efi
/dev/fuse                               fuse      128M  192K  128M   1% /etc/pve
192.168.XX.XX:/mnt/data                 nfs4      7.7T  5.4G  7.7T   1% /mnt/data_store
tmpfs                                   tmpfs     1.0M     0  1.0M   0% /run/credentials/getty@tty1.service
//192.168.XX.XXX/Backup1                cifs      9.3T  773G  8.6T   9% /mnt/pve/TrueNAS
//192.168.XX.XX/proxmox                 cifs      3.6T   55G  3.5T   2% /mnt/pve/proxmox
//192.168.XX.XXX/backup                 cifs      8.9T  399G  8.5T   5% /mnt/pve/Backup
//192.168.XX.XX/archivist               cifs      7.7T  9.7G  7.7T   1% /mnt/archivist_host
tmpfs                                   tmpfs     4.0G  8.0K  4.0G   1% /run/user/0
root@pve-2:~#

The only thing that looks strange is those tmpfs entries, but I'm seeing similar usage across multiple nodes.
 
What's the output of du -sh /lib/modules /boot /var/cache/apt/archives/?

Assuming the system is running a 7.0.x kernel now, I feel another apt purge proxmox-kernel-6* && apt autoclean coming up. ;)
 
Try to disable and unmount all these /mnt mounts and run gdu again without -x.
Also see: https://serverfault.com/questions/5...ng-data-to-an-unmounted-mount-point-directory
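
If unmounting the shares on a busy node is inconvenient, a bind mount of / gives a view of the root filesystem without anything mounted on top of it, so files hidden under the /mnt directories become visible. The following is only a sketch: peek_under_mounts and the /mnt/rootfs-view directory are hypothetical names, and the function only prints the commands unless RUN is set to an empty string (and it is run as root).

```shell
# Show what sits on the root filesystem underneath active mount points.
# A plain (non-recursive) bind mount of / does not carry the NFS/CIFS mounts
# along, so the directories below them appear with their local content.
# By default the commands are only printed; call with RUN= (empty) as root
# to execute them.
peek_under_mounts() {
    local view=${1:-/mnt/rootfs-view}   # hypothetical scratch directory
    ${RUN-echo} mkdir -p "$view"
    ${RUN-echo} mount --bind / "$view"
    ${RUN-echo} du -xh -d2 "$view/mnt"
    ${RUN-echo} umount "$view"
}

# Dry run:  peek_under_mounts
# Real run: RUN= peek_under_mounts
```

Any directory under the view's mnt/ that shows more than a few kilobytes is holding data on the boot disk rather than on the share.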

Having read up on the provided link, I understand the reason for unmounting and checking, except that the same is occurring on 7 nodes, all of which have different CIFS mounts. Even the node with nothing at all running is using 17GB (13GB after autoclean).

Before autoclean

Code:
root@pve-2:~# du -sh /lib/modules /boot /var/cache/apt/archives/
4.7G    /lib/modules
1.1G    /boot
24K     /var/cache/apt/archives/
root@pve-2:~#

After autoclean:
Code:
root@pve-2:~# du -sh /lib/modules /boot /var/cache/apt/archives/
2.9G    /lib/modules
480M    /boot
24K     /var/cache/apt/archives/
root@pve-2:~#

This does not account for the massive discrepancy: the boot drive is still over 80% full and triggering my monitoring alarms. I can override the alarms, but if the used space on the drive continues to rise, I'll be in trouble.
 
Your df -hT showed 27G out of 32G used on root (/) before.
With about 2.5G cleared by removing 6.* kernels and autoclean, df -h / should now show roughly 24.5G used. Correct?

Did you already traverse through the folder structure to find out where this 24.5G is used with ncdu, gdu or just:
du -d3 -xh / | sort -rh | head -n 20 ?
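
The reasoning here boils down to comparing two numbers: what the filesystem reports as used (df) and what du can actually reach on that filesystem. A rough sketch of that comparison follows. The helper names gap_kib and fs_gap_kib are made up for illustration, and GNU df and du are assumed. A large positive gap usually points to files hidden under a mount point or to deleted files that a process still holds open.

```shell
# Difference between "used" as reported by the filesystem and the total that
# du can see, both in KiB. Hypothetical helper names, GNU coreutils assumed.
gap_kib() {
    echo $(( $1 - $2 ))
}

# Measure the gap for one mounted filesystem (default: /). Run as root so du
# can read everything; -x keeps du from descending into other filesystems.
fs_gap_kib() {
    local mnt=${1:-/} used seen
    used=$(df -k --output=used "$mnt" | tail -n 1 | tr -d ' ')
    seen=$(du -sxk "$mnt" 2>/dev/null | cut -f1)
    gap_kib "$used" "$seen"
}

# Numbers from this thread: df showed 27G used, du found about 13G.
# gap_kib $((27 * 1024 * 1024)) $((13 * 1024 * 1024))   # prints 14680064
```

14680064 KiB is 14 GiB, which is the amount that still has to be located.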
 
A really common way this happens is backing up to a remote server using NFS or CIFS. The mount fails for whatever reason and the backups get written to the local directory where the remote is supposed to be mounted. That's why everyone is suggesting you unmount those shares before checking space.

When I say "common", I mean multiple threads per month here. There is an "is_mountpoint" flag you can set on the storage config to prevent that from happening and generate an error instead.
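
For jobs or apps that write to a share outside of Proxmox's storage layer, similar protection can be approximated by checking that the target really is a mount point before writing. A minimal sketch, assuming the mountpoint utility from util-linux is installed; require_mount is a hypothetical helper name, and the path in the usage comment is simply one of the /mnt paths from the df output earlier in the thread.

```shell
# Succeed only if the directory is currently a mount point; otherwise print a
# message to stderr and return non-zero so the caller can abort before any
# data lands on the root filesystem. require_mount is a hypothetical helper.
require_mount() {
    local dir=$1
    if ! mountpoint -q "$dir"; then
        echo "refusing to write: $dir is not mounted" >&2
        return 1
    fi
}

# Usage in a download or backup script:
#   require_mount /mnt/archivist_host || exit 1
```

Failing loudly is the point: a job that aborts is easier to notice than a boot disk that slowly fills up.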
 
Oh, what an idiot I am. I unmounted the shares as many have suggested and found 14GB of YouTube videos hiding in one of the /mnt directories. I've deleted them and now the boot drive is down to 9GB. Now I'm going to have to double-check all nodes, because there are a few nodes with similar usage (although none have the same app installed).

Thank you all for putting me straight on this one.
 