[SOLVED] ENOSPC No space left on device during garbage collect

carsten2

How much free space does garbage collection need? I got the following error during garbage collection, while having 16 GB free on root and 6 GB free on the backup drive:

Code:
2021-07-31T14:34:18+02:00: starting garbage collection on store pvebackup
2021-07-31T14:34:18+02:00: Start GC phase1 (mark used chunks)
2021-07-31T14:47:14+02:00: TASK ERROR: update atime failed for chunk/file "/data1/pvebackup/.chunks/8500/xxxxxxxxx" - ENOSPC: No space left on device
 
GC only updates the "atime" metadata or removes files, so it does not really need space.
When I start the garbage collection and watch "df -h", the free space counts down by about 100 MB/s until the 6 GB of free disk space are exhausted and the GC fails. After the failure, the free disk space is back at 6 GB.
What does it use the 6 GB for, and how can I solve this problem?
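(For reference, this is roughly how I am watching the free space; the interval and path are just my setup:)
Code:
watch -n 1 df -h /data1/pvebackup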
 
Again, it does not store any data. What kind of storage do you use? If you use ZFS, make sure there are no snapshots on the volume - otherwise the snapshots will use up the space.
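(For example, something like this, using the dataset path from the error message above, should show whether any snapshots exist:)
Code:
zfs list -t snapshot -r data1/pvebackup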
 
There are NO snapshots on the volume, and it definitely consumes 50-100 MB/s right after the garbage collection starts, until the garbage collection fails. Should I make a video to prove this?
 
What kind of storage is it?
Can you please post the output of 'df -h'
and, if it's ZFS, a 'zfs list -t all' (or the equivalent of that for your storage)?
 
It is ZFS. Below you can see the free space on the datastore while the GC phase is running; you can also see that after it fails, 5.9 GB are free again.


Code:
data1/pvebackup  4.6T  4.6T  5.9G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  5.9G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  5.9G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  5.7G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  5.6G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  5.4G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  5.3G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  5.2G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  5.1G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  4.9G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  4.8G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  4.6G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  4.5G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  4.3G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  4.1G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  3.9G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  3.7G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  3.5G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  3.5G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  3.5G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  3.4G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  3.3G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  3.1G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  2.7G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  2.4G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  2.1G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  1.8G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  1.4G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  1.3G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  1.2G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  1.1G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  946M 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  758M 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  556M 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  350M 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  143M 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T   56M 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T   14M 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  5.9G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  5.9G 100% /data1/pvebackup
data1/pvebackup  4.6T  4.6T  5.9G 100% /data1/pvebackup

 
Can you please post the output of the command I suggested?

Code:
zfs list -t all
 
Code:
root@xxx:~# zfs list -t all
NAME                                                 USED  AVAIL     REFER  MOUNTPOINT
data1                                               7.02T  5.89G      112K  /data1
data1/pvebackup                                     4.52T  5.89G     4.52T  /data1/pvebackup
data1/vm-106-disk-0                                  202G  5.89G      202G  -
data1/vm-106-disk-0@__replicate_106-0_1616916620__     0B      -      202G  -
data1/vm-106-disk-1                                 2.30T  5.89G     2.30T  -
data1/vm-106-disk-1@__replicate_106-0_1616916620__     0B      -     2.30T  -

The only snapshots are on another filesystem/volume; there are no snapshots on data1/pvebackup. And even if there were snapshots: how would that cause the pvebackup filesystem to fill up? Could you please explain? IMHO, the filling up can only be caused by something writing to the filesystem, and that must be more than atime changes. Besides, that space would not be freed up after the GC fails, but it is freed, so I suspect that a temporary file is written which gets deleted after the failure.
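(One way to check the temporary-file theory would be to look for anything recently written below the datastore while the GC is running, e.g. something like:)
Code:
find /data1/pvebackup -mmin -5 -size +1M -ls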
 
And even if there were snapshots: how would that cause the pvebackup filesystem to fill up?
Because the updated inodes with the new timestamps need space.

Besides, that space would not be freed up after the GC fails, but it is freed, so I suspect that a temporary file is written which gets deleted after the failure.
No, there is definitely no temp file that gets written during GC; I just tested it again. The only time it uses up space is when I make a snapshot beforehand...
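(For context, GC phase 1 only updates the access time of every chunk that is still referenced, which is conceptually similar to the following; the chunk path here is just a placeholder:)
Code:
touch -a /data1/pvebackup/.chunks/8500/<chunk-digest>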

Can you post the version you are on?

Code:
proxmox-backup-manager versions --verbose

Also, can you post the output of
Code:
zfs get all
?
 
Code:
NAME PROPERTY VALUE SOURCE
data1/pvebackup type filesystem -
data1/pvebackup creation Fri Aug 21 23:04 2020 -
data1/pvebackup used 4.52T -
data1/pvebackup available 2.72G -
data1/pvebackup referenced 4.52T -
data1/pvebackup compressratio 1.02x -
data1/pvebackup mounted yes -
data1/pvebackup quota none default
data1/pvebackup reservation none default
data1/pvebackup recordsize 128K default
data1/pvebackup mountpoint /data1/pvebackup default
data1/pvebackup sharenfs off default
data1/pvebackup checksum on default
data1/pvebackup compression on inherited from data1
data1/pvebackup atime on default
data1/pvebackup devices on default
data1/pvebackup exec on default
data1/pvebackup setuid on default
data1/pvebackup readonly off default
data1/pvebackup zoned off default
data1/pvebackup snapdir hidden default
data1/pvebackup aclmode discard default
data1/pvebackup aclinherit restricted default
data1/pvebackup createtxg 8750459 -
data1/pvebackup canmount on default
data1/pvebackup xattr on default
data1/pvebackup copies 1 default
data1/pvebackup version 5 -
data1/pvebackup utf8only off -
data1/pvebackup normalization none -
data1/pvebackup casesensitivity sensitive -
data1/pvebackup vscan off default
data1/pvebackup nbmand off default
data1/pvebackup sharesmb off default
data1/pvebackup refquota none default
data1/pvebackup refreservation none default
data1/pvebackup guid 1534747767796120345 -
data1/pvebackup primarycache all default
data1/pvebackup secondarycache all default
data1/pvebackup usedbysnapshots 0B -
data1/pvebackup usedbydataset 4.52T -
data1/pvebackup usedbychildren 0B -
data1/pvebackup usedbyrefreservation 0B -
data1/pvebackup logbias latency default
data1/pvebackup objsetid 1962 -
data1/pvebackup dedup off default
data1/pvebackup mlslabel none default
data1/pvebackup sync standard default
data1/pvebackup dnodesize legacy default
data1/pvebackup refcompressratio 1.02x -
data1/pvebackup written 4.52T -
data1/pvebackup logicalused 4.63T -
data1/pvebackup logicalreferenced 4.63T -
data1/pvebackup volmode default default
data1/pvebackup filesystem_limit none default
data1/pvebackup snapshot_limit none default
data1/pvebackup filesystem_count none default
data1/pvebackup snapshot_count none default
data1/pvebackup snapdev hidden default
data1/pvebackup acltype off default
data1/pvebackup context none default
data1/pvebackup fscontext none default
data1/pvebackup defcontext none default
data1/pvebackup rootcontext none default
data1/pvebackup relatime off default
data1/pvebackup redundant_metadata all default
data1/pvebackup overlay on default
data1/pvebackup encryption off default
data1/pvebackup keylocation none default
data1/pvebackup keyformat none default
data1/pvebackup pbkdf2iters 0 default
data1/pvebackup special_small_blocks 0 default
 
because the updated inodes with the new timestamps need space
The space for the changed timestamps would not be freed up after the process fails, but it does get freed. How do you explain that?
proxmox-backup-manager versions --verbose
proxmox-backup unknown running kernel: 5.4.128-1-pve
proxmox-backup-server 1.1.12-1 running version: 1.1.12
pve-kernel-5.4 6.4-5
pve-kernel-helper 6.4-5
pve-kernel-5.4.128-1-pve 5.4.128-1
pve-kernel-5.4.106-1-pve 5.4.106-1
pve-kernel-5.4.78-2-pve 5.4.78-2
pve-kernel-5.4.73-1-pve 5.4.73-1
pve-kernel-5.4.65-1-pve 5.4.65-1
pve-kernel-5.4.60-1-pve 5.4.60-2
pve-kernel-5.4.41-1-pve 5.4.41-1
pve-kernel-4.15 5.4-18
pve-kernel-4.15.18-29-pve 4.15.18-57
pve-kernel-4.15.18-10-pve 4.15.18-32
ifupdown2 3.0.0-1+pve4~bpo10
libjs-extjs 6.0.1-10
proxmox-backup-docs 1.1.12-1
proxmox-backup-client 1.1.12-1
proxmox-mini-journalreader 1.1-1
proxmox-widget-toolkit 2.6-1
pve-xtermjs 4.7.0-3
smartmontools 7.2-pve2
zfsutils-linux 2.0.5-pve1~bpo10+1
 
Does this help?

lsof | grep data1
Code:
proxmox-b  1864                           backup   19u      REG               0,54        0          3 /data1/pvebackup/.lock
proxmox-b  1864  1905 tokio-run           backup   19u      REG               0,54        0          3 /data1/pvebackup/.lock
proxmox-b  1864  1909 tokio-run           backup   19u      REG               0,54        0          3 /data1/pvebackup/.lock
proxmox-b  1864  8594 tokio-run           backup   19u      REG               0,54        0          3 /data1/pvebackup/.lock
proxmox-b  1864  9841 tokio-run           backup   19u      REG               0,54        0          3 /data1/pvebackup/.lock
proxmox-b  1864 11897 tokio-run           backup   19u      REG               0,54        0          3 /data1/pvebackup/.lock
proxmox-b  1864 16972 tokio-run           backup   19u      REG               0,54        0          3 /data1/pvebackup/.lock
proxmox-b  1864 19562 tokio-run           backup   19u      REG               0,54        0          3 /data1/pvebackup/.lock
proxmox-b  1864 28768 tokio-run           backup   19u      REG               0,54        0          3 /data1/pvebackup/.lock
proxmox-b  1864 30813 tokio-run           backup   19u      REG               0,54        0          3 /data1/pvebackup/.lock
 
OK, it seems the problem is more about how ZFS behaves.
A colleague told me that ZFS (since it is a copy-on-write filesystem) probably first allocates the space for the new inodes and then simply runs into an ENOSPC error.

In general, it is recommended to keep the free space on a zpool above 10% [0][1][2].

In your case, I guess the only way out of it is to delete enough data on the zpool to let a GC run successfully (e.g. by removing one of the 2 VM images).

0: https://openzfs.github.io/openzfs-docs/Performance and Tuning/Workload Tuning.html#free-space
1: https://openzfs.github.io/openzfs-d...uning/Workload Tuning.html#metaslab-allocator
2: https://openzfs.github.io/openzfs-docs/Performance and Tuning/Module Parameters.html#spa-slop-shift
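(Two quick checks related to that: how full the pool actually is, and the current slop-space reservation described in [2]; the pool name is taken from this thread:)
Code:
zpool list data1
cat /sys/module/zfs/parameters/spa_slop_shift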
 
I found the problem, and it was my fault!

It was pure coincidence that the free space started shrinking when I started the GC: at the same time, a still active replication from another Proxmox host started writing to the other filesystem on the same pool. I tested it three times, and each time it was the same coincidence. After stopping this replication on the other server, the GC works fine.

Thank you for your support. It was still helpful, because your statement that almost nothing gets written during GC made me search for the real cause.
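(In case someone else runs into this: watching the pool-wide write activity while the GC runs makes such a hidden writer easy to spot, e.g. with a 5 second interval:)
Code:
zpool iostat -v data1 5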
 
We are running into this issue on a Proxmox Backup Server. The ZFS pool has filled up, and we did not realize we needed to schedule garbage collection. I could not find any replication from the PVE cluster, so I am not sure we have the exact same root cause. Attempts to free space, mostly by trashing VM backups from the GUI, have not had any effect. We have moved some chunks off of the datastore, also to no effect. We have no ZFS snapshots to remove.

Suggestions?

Code:
root@pbs:~# df -h
Filesystem            Size  Used Avail Use% Mounted on
udev                   16G     0   16G   0% /dev
tmpfs                 3.2G  972K  3.2G   1% /run
/dev/mapper/pbs-root  188G  5.3G  173G   3% /
tmpfs                  16G     0   16G   0% /dev/shm
tmpfs                 5.0M     0  5.0M   0% /run/lock
efivarfs               56K   55K     0 100% /sys/firmware/efi/efivars
/dev/sde2             511M  352K  511M   1% /boot/efi
datastore             3.4T  3.4T     0 100% /mnt/datastore/datastore
tmpfs                 3.2G     0  3.2G   0% /run/user/0
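(For reference, this is roughly how we looked for replication jobs on the PVE nodes; pvesr is the PVE storage replication tool:)
Code:
pvesr status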
 
