PBS server full: two days later almost empty?!?!?!?

Rainerle

[Screenshot: datastore usage statistics showing only ~206 GB used]

No idea how that happened! But 206 GByte is definitely not the size of my five-node source PVE cluster.

Any idea how that could happen?

So on 27.01.2021 it went full and two days later it is almost empty...

I am pretty shocked right now...

Rainer
 
what is your underlying storage? is the directory maybe not mounted correctly? did a prune job + garbage collect run? anything in the syslog/journal?
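for example (the path is just a placeholder, adapt to your setup):
Code:
# check that the datastore directory is really a mounted dataset
findmnt /path/to/datastore

# look for recent prune/gc related messages in the journal
journalctl --since "2 days ago" | grep -i -e prune -e garbage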
 
Still letting off steam here...

Underlying storage is a ZFS pool, which looks ok:

[Screenshot: zpool status, pool healthy]

There are so many warnings that the browser stops responding when I try to look at them:

[Screenshot: task log flooded with warnings]

Mount looks ok:
Code:
root@pbs01:~# mount | grep pbs
rpool/ROOT/pbs-1 on / type zfs (rw,relatime,xattr,noacl)
backup/pbs on /backup/pbs type zfs (rw,noatime,xattr,noacl)
backup/pbs/pve-prod on /backup/pbs/pve-prod type zfs (rw,noatime,xattr,noacl)
backup/pbs/pve-infra on /backup/pbs/pve-infra type zfs (rw,noatime,xattr,noacl)
backup/pbs/pve-muc01 on /backup/pbs/pve-muc01 type zfs (rw,noatime,xattr,noacl)
root@pbs01:~# ls -al /backup/pbs/pve-prod
total 16508
drwxr-xr-x     5 backup backup     7 Jan 29 14:02 .
drwxr-xr-x     5 root   root       5 Nov 26 21:22 ..
drwxr-x--- 65538 backup backup 65538 Nov 27 12:18 .chunks
drwxr-xr-x     7 backup backup     7 Jan 23 06:47 ct
-rw-r--r--     1 backup backup   338 Jan 29 14:02 .gc-status
-rw-r--r--     1 backup backup     0 Nov 27 12:18 .lock
drwxr-xr-x   141 backup backup   141 Dec 22 07:51 vm
root@pbs01:~# zfs list
NAME                                   USED  AVAIL     REFER  MOUNTPOINT
backup                                28.9T  13.2T      176K  /backup
...
backup/pbs                            17.7T  13.2T      192K  /backup/pbs
backup/pbs/pve-infra                  12.8T  13.2T     12.8T  /backup/pbs/pve-infra
backup/pbs/pve-muc01                  4.72T  13.2T     4.72T  /backup/pbs/pve-muc01
backup/pbs/pve-prod                    207G  13.2T      207G  /backup/pbs/pve-prod
rpool                                 7.33G  20.3G       96K  /rpool
rpool/ROOT                            7.32G  20.3G       96K  /rpool/ROOT
rpool/ROOT/pbs-1                      7.32G  20.3G     7.32G  /
root@pbs01:~#
 
mhmm, could you show (part of) the task log of one of the garbage collection runs?
you should be able to get it via the command line:
Code:
proxmox-backup-manager task list --all # get the 'UPID' 
proxmox-backup-manager task log 'UPID' # shows the content

you may want to limit the output of the task log, e.g. with 'tail -n 100':

Code:
proxmox-backup-manager task log 'UPID:....' | tail -n 100
 
Here are parts of the log:
Code:
starting garbage collection on store pve-prod
task triggered by schedule '15:40'
Start GC phase1 (mark used chunks)
marked 1% (17 of 1651 index files)
WARN: warning: unable to access chunk 2cd5a53b5d8aa3c9d530c2ee2b89ccf7ed238ad7bf97afb1a3424666784656d0, required by "/backup/pbs/pve-prod/vm/116/2021-01-17T21:00:13Z/drive-scsi2.img.fidx" - update atime failed for chunk/file "/backup/pbs/pve-prod/.chunks/2cd5/2cd5a53b5d8aa3c9d530c2ee2b89ccf7ed238ad7bf97afb1a3424666784656d0" - ENOSPC: No space left on device
WARN: warning: unable to access chunk 70941e056497f76929f70327f35c854cc0e6b6547c0d58277d9296c39208b11c, required by "/backup/pbs/pve-prod/vm/116/2021-01-17T21:00:13Z/drive-scsi2.img.fidx" - update atime failed for chunk/file "/backup/pbs/pve-prod/.chunks/7094/70941e056497f76929f70327f35c854cc0e6b6547c0d58277d9296c39208b11c" - ENOSPC: No space left on device
...
WARN: warning: unable to access chunk 786afeafded9e5dbe893c014fdd80fc893a9c77af332e6a0e06811a1d52f85f9, required by "/backup/pbs/pve-prod/vm/161/2021-01-23T00:36:35Z/drive-scsi2.img.fidx" - update atime failed for chunk/file "/backup/pbs/pve-prod/.chunks/786a/786afeafded9e5dbe893c014fdd80fc893a9c77af332e6a0e06811a1d52f85f9" - ENOSPC: No space left on device
WARN: warning: unable to access chunk 6cbbc6cc2c0dd33e1c1f91e10524a63c81e6bbef2ea73363c98788e87deefb79, required by "/backup/pbs/pve-prod/vm/161/2021-01-23T00:36:35Z/drive-scsi2.img.fidx" - update atime failed for chunk/file "/backup/pbs/pve-prod/.chunks/6cbb/6cbbc6cc2c0dd33e1c1f91e10524a63c81e6bbef2ea73363c98788e87deefb79" - ENOSPC: No space left on device
WARN: warning: unable to access chunk c67d56926be2e45b42f2548bb5a135cd9b36104455a5b513cc0a62f405298fd4, required by "/backup/pbs/pve-prod/vm/161/2021-01-23T00:36:35Z/drive-scsi2.img.fidx" - update atime failed for chunk/file "/backup/pbs/pve-prod/.chunks/c67d/c67d56926be2e45b42f2548bb5a135cd9b36104455a5b513cc0a62f405298fd4" - ENOSPC: No space left on device

After letting the log write to a file for a few minutes, I cancelled the process:

Code:
root@pbs01:~# proxmox-backup-manager task log "UPID:pbs01:000069FC:01710831:00000EFB:6012CCC0:garbage_collection:pve\x2dprod:root@pam:" > log.txt
^C
root@pbs01:~# ls -alh log.txt
-rw-r--r-- 1 root root 92M Jan 29 15:27 log.txt
root@pbs01:~# less log.txt

Does that help?
 
So I let the log output continue to run, and after about 2 hours and about one GByte I stopped it again.

The content is just about the same for the whole file.
 
ok, i looked at the logs and did a bit of investigation:

it seems you ran into a bug in the garbage collection.
the gc relies on updating the atime of chunks to determine whether it may delete them.
for missing chunks, we intended to only warn [0] (since they cannot be removed anyway),
but the code currently warns on any error, including 'out of space', which is what happened in your case (your subvolume was full).
that means that in the second phase of the gc, it deleted all chunks that were not touched in > 24 hours.
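conceptually, phase 1 is just an atime update on each referenced chunk, roughly like this (illustrative sketch only, not the actual pbs code; the chunk path is shortened):
Code:
# phase 1: mark a chunk that is still referenced by an index file
touch -a /backup/pbs/pve-prod/.chunks/2cd5/2cd5a53b...

# phase 2 (conceptual): drop chunks whose atime is older than the grace period
# find /backup/pbs/pve-prod/.chunks -type f -atime +1 -delete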

i have sent a patch to the list [1] to correct that behaviour in the future, but i fear that all backups where chunks are now missing are broken.
you can sync them back though, if you have synced them somewhere else, or restore from a zfs snapshot or similar.

you can test which backups are affected by running a verification
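for example, something like this should work (syntax from memory, 'pve-prod' being the affected datastore):
Code:
# verify all backups in the datastore; snapshots with missing chunks will fail verification
proxmox-backup-manager verify pve-prod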

0: https://bugzilla.proxmox.com/show_bug.cgi?id=2909
1: https://lists.proxmox.com/pipermail/pbs-devel/2021-February/002039.html
 
So, here comes the next uncertainty on my side:
I generally disable atime on my ZFS pools, and the volumes then inherit this setting.

If you rely on atime, does this still work?
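This is how I check the setting and where each dataset inherits it from (pool name from my setup):
Code:
# show atime for the pool and all child datasets, including the inheritance source
zfs get -r -o name,value,source atime backup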
 
If you rely on atime, does this still work?
a short test here revealed: on zpools with noatime, setting the atime manually (which is what the gc does) still works, but it does not get updated on open/read.
so it could work, but i am not sure it would work under all circumstances (certain actions together might interfere: restore/backup/prune/delete, etc.)

so i would leave atime set to 'on' for datasets that are pbs datastores
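e.g. (dataset name taken from your output above):
Code:
# re-enable atime on the datastore dataset; child datasets inherit it unless overridden
zfs set atime=on backup/pbs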
 
