[SOLVED] "Error: No space left on device"

mattlach

Renowned Member
Mar 23, 2016
Boston, MA
Hey all,

I am having an odd problem I'm hoping someone might help me solve.

I keep randomly getting "Error: No space left on device" error messages in console.

Examples:
Code:
# ifup vmbr3

Waiting for vmbr3 to get ready (MAXWAIT is 2 seconds).
Error: No space left on device
Error: No space left on device

This happens with other random commands as well, like when restarting the nfs server, etc. etc.

Thing is, I can't seem to find ANY device that is full.

Code:
# df -h
Filesystem                 Size  Used Avail Use% Mounted on
udev                        10M     0   10M   0% /dev
tmpfs                       38G   18M   38G   1% /run
rpool/ROOT/pve-1           435G  7.6G  427G   2% /
tmpfs                       95G   43M   95G   1% /dev/shm
tmpfs                      5.0M  4.0K  5.0M   1% /run/lock
tmpfs                       95G     0   95G   0% /sys/fs/cgroup
rpool                      427G  128K  427G   1% /rpool
rpool/ROOT                 427G     0  427G   0% /rpool/ROOT
rpool/subvol-101-disk-1    8.0G  601M  7.5G   8% /rpool/subvol-101-disk-1
rpool/subvol-102-disk-1    8.0G  439M  7.6G   6% /rpool/subvol-102-disk-1
rpool/subvol-110-disk-1     16G  2.2G   14G  14% /rpool/subvol-110-disk-1
rpool/subvol-111-disk-1    8.0G  368M  7.7G   5% /rpool/subvol-111-disk-1
rpool/subvol-120-disk-1     16G  422M   16G   3% /rpool/subvol-120-disk-1
rpool/subvol-125-disk-1    8.0G  355M  7.7G   5% /rpool/subvol-125-disk-1
rpool/subvol-130-disk-1     16G  4.5G   12G  28% /rpool/subvol-130-disk-1
rpool/subvol-140-disk-1     16G  1.4G   15G   9% /rpool/subvol-140-disk-1
zfshome                     18T  2.2T   16T  13% /zfshome
zfshome/media               24T  7.7T   16T  33% /zfshome/media
zfshome/mythtv_recordings   19T  2.8T   16T  15% /zfshome/mythtv_recordings
/dev/sds1                  917G  769G  149G  84% /mnt/mythbuntu/scheduled
/dev/sdt1                  118G   60M  118G   1% /mnt/mythbuntu/live1
tmpfs                      100K     0  100K   0% /run/lxcfs/controllers
cgmfs                      100K     0  100K   0% /run/cgmanager/fs
/dev/fuse                   30M   16K   30M   1% /etc/pve
rpool/subvol-150-disk-1     32G  264M   32G   1% /rpool/subvol-150-disk-1

The host seems to be working just fine, other than this strange error message.

Does anyone know what might be going on?

Thanks,
Matt
 
Paste output of
Code:
 df -hi

Thanks for your help.

Inodes look good to me too:

Code:
~# df -hi
Filesystem                Inodes IUsed IFree IUse% Mounted on
udev                         24M   884   24M    1% /dev
tmpfs                        24M  1.4K   24M    1% /run
rpool/ROOT/pve-1            854M   84K  854M    1% /
tmpfs                        24M    67   24M    1% /dev/shm
tmpfs                        24M    24   24M    1% /run/lock
tmpfs                        24M    18   24M    1% /sys/fs/cgroup
rpool                       854M    16  854M    1% /rpool
rpool/ROOT                  854M     7  854M    1% /rpool/ROOT
rpool/subvol-101-disk-1      15M   34K   15M    1% /rpool/subvol-101-disk-1
rpool/subvol-102-disk-1      16M   25K   16M    1% /rpool/subvol-102-disk-1
rpool/subvol-110-disk-1      28M   95K   28M    1% /rpool/subvol-110-disk-1
rpool/subvol-111-disk-1      16M   22K   16M    1% /rpool/subvol-111-disk-1
rpool/subvol-120-disk-1      32M   25K   32M    1% /rpool/subvol-120-disk-1
rpool/subvol-125-disk-1      16M   22K   16M    1% /rpool/subvol-125-disk-1
rpool/subvol-130-disk-1      23M   38K   23M    1% /rpool/subvol-130-disk-1
rpool/subvol-140-disk-1      30M   24K   30M    1% /rpool/subvol-140-disk-1
zfshome                      32G  1.3M   32G    1% /zfshome
zfshome/media                32G   39K   32G    1% /zfshome/media
zfshome/mythtv_recordings    32G  2.8K   32G    1% /zfshome/mythtv_recordings
/dev/sds1                    59M   656   59M    1% /mnt/mythbuntu/scheduled
/dev/sdt1                   7.5M    10  7.5M    1% /mnt/mythbuntu/live1
tmpfs                        24M    12   24M    1% /run/lxcfs/controllers
cgmfs                        24M    14   24M    1% /run/cgmanager/fs
/dev/fuse                   9.8K    35  9.8K    1% /etc/pve
rpool/subvol-150-disk-1      63M   30K   63M    1% /rpool/subvol-150-disk-1

Any idea what else could be causing this puzzling out of disk space message?
 
So, googling around I found someone else with an identical problem to mine, on a Debian Webserver.

Both df -h and df -hi show plenty of free space, but he, like me, is still getting "no space left on device" error messages.

He seems to have solved his issue by making a change to fs.inotify.max_user_watches. I have no idea what this is, or what it does. Can anyone explain this to me? I don't want to touch it before I understand it.

Does anyone have any thoughts?

Another thought I had was that it might be drive corruption, but I am running off of a ZFS mirror, and a scrub shows no issues...
 
you can check whether there are still inotify watches available by trying to "watch" a file ;)

if the "ENOSPC" error is reproducible and caused by hitting the inotify limit, you should get an error message when trying "tail -f /var/log/messages": "tail: cannot watch '/var/log/messages': No space left on device". if it only shows up intermittently, you will need to try that test while the issue is occurring (or just proactively try with an increased inotify limit).

if you run a lot of (containers with) processes that utilize inotify, you might need to increase the watch limit. the only downside is that inotify watches need some unswappable kernel memory, but the amount is pretty much irrelevant on any modern server (about 1 KB for each used inotify watch). you can definitely bump the default 8k limit to at least 64k without any worries, unless your system is really really memory constrained (in which case, you should probably not use ZFS like you are ;))

unfortunately, a lot of programs that use inotify and fail don't propagate the root cause and instead just report "ENOSPC" ("No space left on device")..
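
To see which processes are actually holding the watches, one option is to count the "inotify" lines in each process's fdinfo files under /proc. The following is only a rough sketch: count_inotify_watches is a made-up helper name, the optional directory argument exists purely so the function can be pointed at a test tree, and you need to run it as root on the host to see processes owned by other users (including those inside containers).

```shell
#!/bin/sh
# Hypothetical helper: print "<watch count> <pid>" for every process that
# holds at least one inotify watch, highest count first.
# $1 is the proc root to scan (defaults to /proc).
count_inotify_watches() {
    proc_root="${1:-/proc}"
    for dir in "$proc_root"/[0-9]*; do
        [ -d "$dir/fdinfo" ] || continue
        # Each watch shows up as one line starting with "inotify"
        # in the fdinfo file of the inotify file descriptor.
        n=$(cat "$dir"/fdinfo/* 2>/dev/null | grep -c '^inotify' || true)
        if [ "${n:-0}" -gt 0 ]; then
            printf '%s %s\n' "$n" "${dir##*/}"
        fi
    done | sort -rn
}

# Show the ten heaviest users on this machine.
count_inotify_watches | head -n 10
```

If one PID accounts for most of the watches, that is the process to look at before (or in addition to) raising the limit.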
 
Ahh, I think that is the cause.

I have Crashplan running in an LXC container, and it seems to want to use A LOT of watches.

I'm probably going to boost it up to 1048576, and see if that helps. RAM is really not an issue for me, I have 192GB in my server. I upgraded it from 96GB before I switched from ESXi. Now with Proxmox and containers, my system is so much more RAM efficient, that I have more RAM than I know what to do with!

Am I correct in assuming that I only have to do this for the host, and all the LXC containers are automatically included?

Is the appropriate way to do this to edit /etc/sysctl.conf, or is there a built in way in the management interface to deal with this?

Since Proxmox is geared towards these types of activities, maybe it makes sense for it to have a higher max_user_watches than the Debian default?

Thanks,
Matt
 
yes, just like all other sysctl values, you can temporarily set it by echoing some value to /proc/sys/... and persist it by adding a line to /etc/sysctl.conf . setting it on the host also affects the containers, yes (shared kernel). I am not sure whether you need a (container) reboot or not - but that should be easy enough to empirically find out ;) increasing the default seems like a good idea, I'll file a tracking bug for it.
 
Thanks for the help.

If anyone else is trying to solve the same problem, here is how I wound up doing it: (from the console in the host)

Check the current limit:
Code:
cat /proc/sys/fs/inotify/max_user_watches
For me this was 8192

Temporarily test if increasing the value fixes things: (this will reset to the default of 8192 next reboot)
Code:
echo 1048576 > /proc/sys/fs/inotify/max_user_watches
Keep in mind that at roughly 1 KB per watch, this raises the maximum RAM that watches can consume from ~8 MB to ~1 GB. The memory is only taken by watches that are actually in use, but you may not need a limit this large. Try something smaller to conserve RAM.
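
For anyone picking a different limit, the worst-case figure is just the limit multiplied by the per-watch size. A small sketch of that arithmetic, assuming the ~1 KB per watch mentioned earlier in the thread (the real per-watch cost depends on kernel and architecture, and watch_mem_mib is only an illustrative name):

```shell
#!/bin/sh
# Hypothetical helper: worst-case kernel memory in MiB for a watch limit.
# $1 = watch limit, $2 = assumed bytes per watch (default 1024).
watch_mem_mib() {
    limit="$1"
    bytes_per_watch="${2:-1024}"
    echo $(( limit * bytes_per_watch / 1024 / 1024 ))
}

watch_mem_mib 8192      # prints 8
watch_mem_mib 65536     # prints 64
watch_mem_mib 1048576   # prints 1024
```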

If the above makes the problem go away, we can now make it permanent through reboots:

Edit /etc/sysctl.conf
Code:
nano /etc/sysctl.conf

Add the following line (or edit it if present already)
Code:
fs.inotify.max_user_watches=1048576
Edit the 1048576 value to suit your needs as discussed above

Reboot, or do the following:
Code:
sysctl -p /etc/sysctl.conf
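
To double-check that the value in the file and the value the kernel is running with actually match, something like the following can help. It is only a sketch: configured_watch_limit is a made-up helper, it only looks at the one file you give it (not at anything under /etc/sysctl.d), and it assumes that when the key appears more than once the last uncommented line is the one that ends up applied.

```shell
#!/bin/sh
# Hypothetical helper: print the fs.inotify.max_user_watches value set in
# a sysctl config file ($1, default /etc/sysctl.conf). Commented-out lines
# are ignored; if the key appears several times, the last one is printed.
configured_watch_limit() {
    conf="${1:-/etc/sysctl.conf}"
    sed -n 's/^[[:space:]]*fs\.inotify\.max_user_watches[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*$/\1/p' \
        "$conf" 2>/dev/null | tail -n 1
}

want=$(configured_watch_limit)
have=$(cat /proc/sys/fs/inotify/max_user_watches 2>/dev/null || true)
echo "configured: ${want:-unset}, running: ${have:-unknown}"
```

If the two numbers differ, the file has not been loaded yet (or something else is overriding it).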
 
I am unable to make a new post so I am bumping here.

I have the same issue
But the steps taken did not resolve the issue for me.

I increased to 2048576 after my first try increasing to 1048576.

I get the error

No space left on device at /usr/share/perl5/PVE/RESTEnvironment.pm

pveversion -v
proxmox-ve: 5.2-2 (running kernel: 4.15.18-8-pve)
pve-manager: 5.2-10 (running version: 5.2-10/6f892b40)
pve-kernel-4.15: 5.2-11
pve-kernel-4.15.18-8-pve: 4.15.18-28
pve-kernel-4.15.18-7-pve: 4.15.18-27
pve-kernel-4.15.17-2-pve: 4.15.17-10
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: not correctly installed
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-41
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-30
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-3
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-29
pve-docs: 5.2-9
pve-firewall: 3.0-14
pve-firmware: 2.0-6
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-38
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3

df -h

Filesystem      Size  Used Avail Use% Mounted on
udev            7.8G     0  7.8G   0% /dev
tmpfs           1.6G   19M  1.6G   2% /run
/dev/md2        3.7T  1.7T  1.8T  49% /
tmpfs           7.9G   43M  7.8G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           7.9G     0  7.9G   0% /sys/fs/cgroup
/dev/md1        488M  173M  289M  38% /boot
/dev/sdc1       954G  538G  417G  57% /var/lib/lxc-ssd
/dev/fuse        30M   24K   30M   1% /etc/pve
tmpfs           1.6G     0  1.6G   0% /run/user/0



Any idea what could be the issue?

The system seems stable until I make any change on the VM; then the HOST reports the error. The change can be as small as adding a user inside the VM, for example.
 
After a reboot I was able to run commands from CLI on HOST

I ran df -hi
Filesystem     Inodes IUsed IFree IUse% Mounted on
udev             2.0M   513  2.0M    1% /dev
tmpfs            2.0M  2.5K  2.0M    1% /run
/dev/md2         233M  193K  233M    1% /
tmpfs            2.0M   103  2.0M    1% /dev/shm
tmpfs            2.0M    15  2.0M    1% /run/lock
tmpfs            2.0M    17  2.0M    1% /sys/fs/cgroup
/dev/md1         128K   368  128K    1% /boot
/dev/sdc1        477M   100  477M    1% /var/lib/lxc-ssd
/dev/fuse        9.8K    55  9.8K    1% /etc/pve
tmpfs            2.0M    10  2.0M    1% /run/user/0