[SOLVED] ENOSPC No space left on device during garbage collect

carsten2

How much free space does garbage collection need? I got the following error during garbage collection, while having 16 GB free on root and 6 GB free on the backup drive:

Code:
2021-07-31T14:34:18+02:00: starting garbage collection on store pvebackup
2021-07-31T14:34:18+02:00: Start GC phase1 (mark used chunks)
2021-07-31T14:47:14+02:00: TASK ERROR: update atime failed for chunk/file "/data1/pvebackup/.chunks/8500/xxxxxxxxx" - ENOSPC: No space left on device
 
GC only updates the "atime" metadata or removes files, so it does not really need space.
When I start the garbage collection and watch "df -h", the free space counts down by about 100 MB/s until the 6 GB of free disk space are exhausted and the GC fails. After the failure, the free disk space is back at 6 GB.
What does it use the 6 GB for, and how can I solve this problem?
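(For reference, this is roughly how I am watching the free space; the interval and path are just my setup:)
Code:
watch -n 1 df -h /data1/pvebackup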
 
Again, it does not store any data. What kind of storage do you use? If you use ZFS, make sure there are no snapshots on the volume - otherwise the snapshots will use up the space.
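(For example, something like this, using the dataset path from the error message above, should show whether any snapshots exist:)
Code:
zfs list -t snapshot -r data1/pvebackup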
 
There are NO snapshots on the volume, and it definitely consumes 50-100 MB/s right after the garbage collection starts, until the garbage collection fails. Should I make a video to prove this?
 
What kind of storage is it?
Can you please post the output of 'df -h'
and, if it's ZFS, a 'zfs list -t all' (or the equivalent of that for your storage)?
 
It is ZFS. Below you can see the free space on the datastore while the GC phase is running; you can also see that after it fails, 5.9 GB are free again.


Code:
data1/pvebackup  4.6T  4.6T  5.9G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  5.9G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  5.9G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  5.7G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  5.6G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  5.4G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  5.3G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  5.2G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  5.1G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  4.9G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  4.8G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  4.6G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  4.5G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  4.3G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  4.1G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  3.9G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  3.7G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  3.5G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  3.5G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  3.5G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  3.4G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  3.3G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  3.1G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  2.7G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  2.4G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  2.1G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  1.8G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  1.4G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  1.3G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  1.2G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  1.1G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  946M 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  758M 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  556M 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  350M 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  143M 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T   56M 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T   14M 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  5.9G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  5.9G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  5.9G 100% /data1/pvebackup

 
Can you please post the output of the command I suggested?

Code:
zfs list -t all
 
Code:
root@xxx:~# zfs list -t all
NAME                                                 USED  AVAIL     REFER  MOUNTPOINT
data1                                               7.02T  5.89G      112K  /data1
data1/pvebackup                                     4.52T  5.89G     4.52T  /data1/pvebackup
data1/vm-106-disk-0                                  202G  5.89G      202G  -
data1/vm-106-disk-0@__replicate_106-0_1616916620__     0B      -      202G  -
data1/vm-106-disk-1                                 2.30T  5.89G     2.30T  -
data1/vm-106-disk-1@__replicate_106-0_1616916620__     0B      -     2.30T  -

The only snapshots are on another filesystem/volume; there are no snapshots on data1/pvebackup. And even if there were snapshots: how would that cause the pvebackup filesystem to fill up? Could you please explain? IMHO, the filling up can only be caused by something writing to the filesystem, and that must be more than atime changes. Besides, that space would not be freed up after the GC fails, but it is freed, so I suspect that a temporary file is written which gets deleted after the failure.
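(One way to check the temporary-file theory would be to look for anything recently written below the datastore while the GC is running, e.g. something like:)
Code:
find /data1/pvebackup -mmin -5 -size +1M -ls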
 
And even if there were snapshots: how would that cause the pvebackup filesystem to fill up?
Because the updated inodes with the new timestamps need space.

Besides, that space would not be freed up after the GC fails, but it is freed, so I suspect that a temporary file is written which gets deleted after the failure.
No, there is definitely no temp file that gets written during GC; I just tested it again. The only time it uses up space is when I make a snapshot beforehand...
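(For context, GC phase 1 only updates the access time of every chunk that is still referenced, which is conceptually similar to the following; the chunk path here is just a placeholder:)
Code:
touch -a /data1/pvebackup/.chunks/8500/<chunk-digest>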

Can you post the version you are on?

Code:
proxmox-backup-manager versions --verbose

Also, can you post the output of
Code:
zfs get all
?
 
Code:
NAME PROPERTY VALUE SOURCE
data1/pvebackup type filesystem -
data1/pvebackup creation Fri Aug 21 23:04 2020 -
data1/pvebackup used 4.52T -
data1/pvebackup available 2.72G -
data1/pvebackup referenced 4.52T -
data1/pvebackup compressratio 1.02x -
data1/pvebackup mounted yes -
data1/pvebackup quota none default
data1/pvebackup reservation none default
data1/pvebackup recordsize 128K default
data1/pvebackup mountpoint /data1/pvebackup default
data1/pvebackup sharenfs off default
data1/pvebackup checksum on default
data1/pvebackup compression on inherited from data1
data1/pvebackup atime on default
data1/pvebackup devices on default
data1/pvebackup exec on default
data1/pvebackup setuid on default
data1/pvebackup readonly off default
data1/pvebackup zoned off default
data1/pvebackup snapdir hidden default
data1/pvebackup aclmode discard default
data1/pvebackup aclinherit restricted default
data1/pvebackup createtxg 8750459 -
data1/pvebackup canmount on default
data1/pvebackup xattr on default
data1/pvebackup copies 1 default
data1/pvebackup version 5 -
data1/pvebackup utf8only off -
data1/pvebackup normalization none -
data1/pvebackup casesensitivity sensitive -
data1/pvebackup vscan off default
data1/pvebackup nbmand off default
data1/pvebackup sharesmb off default
data1/pvebackup refquota none default
data1/pvebackup refreservation none default
data1/pvebackup guid 1534747767796120345 -
data1/pvebackup primarycache all default
data1/pvebackup secondarycache all default
data1/pvebackup usedbysnapshots 0B -
data1/pvebackup usedbydataset 4.52T -
data1/pvebackup usedbychildren 0B -
data1/pvebackup usedbyrefreservation 0B -
data1/pvebackup logbias latency default
data1/pvebackup objsetid 1962 -
data1/pvebackup dedup off default
data1/pvebackup mlslabel none default
data1/pvebackup sync standard default
data1/pvebackup dnodesize legacy default
data1/pvebackup refcompressratio 1.02x -
data1/pvebackup written 4.52T -
data1/pvebackup logicalused 4.63T -
data1/pvebackup logicalreferenced 4.63T -
data1/pvebackup volmode default default
data1/pvebackup filesystem_limit none default
data1/pvebackup snapshot_limit none default
data1/pvebackup filesystem_count none default
data1/pvebackup snapshot_count none default
data1/pvebackup snapdev hidden default
data1/pvebackup acltype off default
data1/pvebackup context none default
data1/pvebackup fscontext none default
data1/pvebackup defcontext none default
data1/pvebackup rootcontext none default
data1/pvebackup relatime off default
data1/pvebackup redundant_metadata all default
data1/pvebackup overlay on default
data1/pvebackup encryption off default
data1/pvebackup keylocation none default
data1/pvebackup keyformat none default
data1/pvebackup pbkdf2iters 0 default
data1/pvebackup special_small_blocks 0 default
 
because the updated inodes with the new timestamps need space
The space for the changed timestamps would not be freed up after the process fails, but it does get freed. How do you explain that?
proxmox-backup-manager versions --verbose
proxmox-backup unknown running kernel: 5.4.128-1-pve
proxmox-backup-server 1.1.12-1 running version: 1.1.12
pve-kernel-5.4 6.4-5
pve-kernel-helper 6.4-5
pve-kernel-5.4.128-1-pve 5.4.128-1
pve-kernel-5.4.106-1-pve 5.4.106-1
pve-kernel-5.4.78-2-pve 5.4.78-2
pve-kernel-5.4.73-1-pve 5.4.73-1
pve-kernel-5.4.65-1-pve 5.4.65-1
pve-kernel-5.4.60-1-pve 5.4.60-2
pve-kernel-5.4.41-1-pve 5.4.41-1
pve-kernel-4.15 5.4-18
pve-kernel-4.15.18-29-pve 4.15.18-57
pve-kernel-4.15.18-10-pve 4.15.18-32
ifupdown2 3.0.0-1+pve4~bpo10
libjs-extjs 6.0.1-10
proxmox-backup-docs 1.1.12-1
proxmox-backup-client 1.1.12-1
proxmox-mini-journalreader 1.1-1
proxmox-widget-toolkit 2.6-1
pve-xtermjs 4.7.0-3
smartmontools 7.2-pve2
zfsutils-linux 2.0.5-pve1~bpo10+1
 
Does this help?

lsof | grep data1
Code:
proxmox-b  1864                           backup   19u      REG               0,54        0          3 /data1/pvebackup/.lock
proxmox-b  1864  1905 tokio-run           backup   19u      REG               0,54        0          3 /data1/pvebackup/.lock
proxmox-b  1864  1909 tokio-run           backup   19u      REG               0,54        0          3 /data1/pvebackup/.lock
proxmox-b  1864  8594 tokio-run           backup   19u      REG               0,54        0          3 /data1/pvebackup/.lock
proxmox-b  1864  9841 tokio-run           backup   19u      REG               0,54        0          3 /data1/pvebackup/.lock
proxmox-b  1864 11897 tokio-run           backup   19u      REG               0,54        0          3 /data1/pvebackup/.lock
proxmox-b  1864 16972 tokio-run           backup   19u      REG               0,54        0          3 /data1/pvebackup/.lock
proxmox-b  1864 19562 tokio-run           backup   19u      REG               0,54        0          3 /data1/pvebackup/.lock
proxmox-b  1864 28768 tokio-run           backup   19u      REG               0,54        0          3 /data1/pvebackup/.lock
proxmox-b  1864 30813 tokio-run           backup   19u      REG               0,54        0          3 /data1/pvebackup/.lock
 
OK, it seems the problem is more about how ZFS behaves.
A colleague told me that ZFS (since it is a copy-on-write filesystem) probably first allocates the space for the new inodes and then simply runs into an ENOSPC error.

In general, it is recommended to keep the free space on a zpool above 10% [0][1][2].

In your case, I guess the only way out of it is to delete enough data on the zpool to let a GC run successfully (e.g. by removing one of the 2 VM images).

0: https://openzfs.github.io/openzfs-docs/Performance and Tuning/Workload Tuning.html#free-space
1: https://openzfs.github.io/openzfs-d...uning/Workload Tuning.html#metaslab-allocator
2: https://openzfs.github.io/openzfs-docs/Performance and Tuning/Module Parameters.html#spa-slop-shift
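(Two quick checks related to that: how full the pool actually is, and the current slop-space reservation described in [2]; the pool name is taken from this thread:)
Code:
zpool list data1
cat /sys/module/zfs/parameters/spa_slop_shift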
 
I found the problem, and it was my fault!

It was pure coincidence that the free space started shrinking when I started the GC: at the same time, a still active replication from another Proxmox host started writing to the other filesystem on the same pool. I tested it three times, and each time it was the same coincidence. After stopping this replication on the other server, the GC works fine.

Thank you for your support. It was still helpful, because your statement that almost nothing gets written during GC made me search for the real cause.
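(In case someone else runs into this: watching the pool-wide write activity while the GC runs makes such a hidden writer easy to spot, e.g. with a 5 second interval:)
Code:
zpool iostat -v data1 5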
 
We are running into this issue on a Proxmox Backup Server. The ZFS pool has filled up, and we did not realize we needed to schedule garbage collection. I could not find any replication from the PVE cluster, so I am not sure we have the exact same root cause. Attempts to free space, mostly by trashing VM backups from the GUI, have not had any effect. We have moved some chunks off of the datastore, also to no effect. We have no ZFS snapshots to remove.

Suggestions?

Code:
root@pbs:~# df -h
Filesystem            Size  Used Avail Use% Mounted on
udev                   16G     0   16G   0% /dev
tmpfs                 3.2G  972K  3.2G   1% /run
/dev/mapper/pbs-root  188G  5.3G  173G   3% /
tmpfs                  16G     0   16G   0% /dev/shm
tmpfs                 5.0M     0  5.0M   0% /run/lock
efivarfs               56K   55K     0 100% /sys/firmware/efi/efivars
/dev/sde2             511M  352K  511M   1% /boot/efi
datastore             3.4T  3.4T     0 100% /mnt/datastore/datastore
tmpfs                 3.2G     0  3.2G   0% /run/user/0
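(For reference, this is roughly how we looked for replication jobs on the PVE nodes; pvesr is the PVE storage replication tool:)
Code:
pvesr status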
 
