ESTALE: Stale file handle

cpulove · Oct 2, 2024

waltar said:
Maybe the short version would solve such lock cases also: "mount -o remount /mnt/yourmountpoint"

If someone clould try this, this would be nice! I have to make some important backups first, which will take a while and are already running.

Maybe @molnart who had a similar problem?

molnart · Oct 2, 2024

i did try to unmount and remount several times and also i think during reboot basically the same happens. my case is a bit different than @cpulove 's, i understood he was unable to add the datastores and make any backups. for me backups work fine (there is 20+ of them running each day), its the garbage collection that fails, often after 10+ hours. sometimes at 15%, sometimes at 40%, its always a different file.

also I am always getting the "update atime failed" error. while i had stale file handle errors from other apps as well, it was never the atime and it was always related to the NFS server disconnecting improperly.
i am leaning towards some improper settings on the zfs dataset, but I have no idea what it could be...

waltar · Oct 2, 2024

@molnart: Maybe you did on your nfs-server zfs dataset set atime=off while PBS is relying on ? Do atime=on and relatime=on on zfs !!
Even GC >10h daily is nearly unacceptable - or ?! You need zfs or xfs metadata special device ...

molnart · Oct 2, 2024

my ZFS config is posted here https://forum.proxmox.com/threads/estale-stale-file-handle.120000/post-696875 both options are enabled.

also i really dont know why does it take so long.... in increased in april from 10 minutes to 9 hours between two runs. back then the datastore was on a single non-raid drive. and i think around that time i have moved the datastore to a different disk. maybe rsyncing the data between the drives broke something pbs needs to work properly???
i am thinking of setting up a new datastore from scratch, because apparently there is something wrong with this one...

waltar · Oct 2, 2024

Yes, new pbs inst. may look best option if runtimes were so short in the past vs. today. Good luck

molnart · Oct 3, 2024

something's definitely wrong here... i have created a new datastore. just creating it took 42 minutes. then i have started a garbage collection job on the newly created empty datastore and its running for an hour already.
the wait IO on the NFS server is constantly high for months, but I dont really know what is causing it. when i look at iotop, there is a 3 MB/s read and write mostly caused by the storj storage node I am hosting. this is definitely not caused by ZFS either, as I have migrated to ZFS only in august, before that it was a mergerFS setup.

running some benchmarks on the NFS share from PBS the performance seems OK:
RANDOM WRITES: WRITE: bw=40.8MiB/s (42.8MB/s), 40.8MiB/s-40.8MiB/s (42.8MB/s-42.8MB/s), io=8192MiB (8590MB), run=200807-200807msec
RANDOM READS: READ: bw=132MiB/s (138MB/s), 132MiB/s-132MiB/s (138MB/s-138MB/s), io=8192MiB (8590MB), run=62053-62053msec
SEQ WRITES: WRITE: bw=214MiB/s (224MB/s), 214MiB/s-214MiB/s (224MB/s-224MB/s), io=30.0GiB (32.2GB), run=143832-143832msec
SEQ READ: READ: bw=283MiB/s (297MB/s), 283MiB/s-283MiB/s (297MB/s-297MB/s), io=30.0GiB (32.2GB), run=108527-108527msec

The sequential speeds indeed feel a tad slow, but there was the garbage collection running on the array

waltar · Oct 3, 2024

Mmh, you should further invest ... your cores have mostly i/o wait most probably to your to slow nfs answering.
Zfs without special device accessed via nfs is not a performance winner with pbs as the gc runs trash the arc.

molnart · Oct 3, 2024

so according to my current investigation on the high wait ios:
- currently are caused by storj storage node. turning the storagenode off make wait io drop immediately form 40 to 2-3
- the wait io increase in april (when i was not running zfs or storagenode) was caused by moving the datastore from an ext4 storage to a btrfs one

waltar · Oct 4, 2024

Yeah, my experience is like btrfs should just be used as a worm filesystem else in daily usage of remove, manipulate, rewrite files ... than it gets ... somethink like horribly fragmentated (or the hell what ever ?!?) and slow. Most need of a admin ever, complicated handling and uncomplete features, in my eyes nice for enthusiast usege.

molnart · Oct 7, 2024

looks like my problems are caused by high wait io. PBS tries to update the access time on the files, but it times out due to high wait IO. i did some tweaks in the ZFS caching and allocated some more ram to ZFS host, so my wait io has decreased and garbage collection can finish. still takes 4-7 hours, but finally have could free up a few hundred GBs of data.
Have ordered some extra RAM, hopefully will help with caching, as I have no physical space left for a ZFS special device.

proxmont · Dec 17, 2024

my solution is to run this script every 30 seconds using cron
what it does is list mount points in dir XXX if it returns a "stale.." error then unmount and remount all nfs mount point in that dir that's it.

Bash:

list=$(ls -la /mnt/XXX/ 2>&1 | grep 'Stale file handle' | awk '{print ""$4"" }' | tr -d \: | tr -d \')

echo "running stale destroyer script"

for directory in $list

do

        umount -l "$directory"

        mount -a

        echo "$directory"

done

NDev · Dec 18, 2024

Remounting sometimes works on my PVE machine, but never on my PBS.
My PVE is doing the remount by itself i think. My pbs doesn't.

But thank you for the script, i will just do a reboot of my pbs in this script instead. This should always work, and i think it will solve it.

hieu · Feb 8, 2025

cpulove said:
I might have found the solution for the ESTALE error.

Cause: This problem occurs when an application opens or creates a file, deletes and closes it, and then attempts to access or delete the same file again.

My solution: unmount the file system and then remount it. This may require using the -f flag in the umount command.

Example:

Code:

umount -f 10.x.x.x:/nfs-export-path /mnt/yourmountpoint mount 10.x.x.x:/nfs-export-path /mnt/yourmountpoint

After unmount with -f and a remount of the share, the datastore was able to create the chunk files on the nfs share!

This works with my system as well. All I need to do is to unmount state file handle mountpoints.

For example

Bash:

umount -f 10.x.y.z:/exports/nfs /mnt/pve/nfs

All NFS mountpoints are automatically restored. Files are accessible after that.

Search

Search

ESTALE: Stale file handle

cpulove

Member

molnart

Active Member

waltar

Renowned Member

molnart

Active Member

waltar

Renowned Member

molnart

Active Member

waltar

Renowned Member

molnart

Active Member

waltar

Renowned Member

molnart

Active Member

proxmont

New Member

NDev

Member

hieu

New Member

We value your privacy