zfs send/receive fails to unmount pool

LBX_Blackjack

Jul 24, 2020
Hey all,
I have my backups in a zfs pool ("s_pool") of which I take regular snapshots. These are then streamed elsewhere with zfs send ... | zfs recv .... Today, however, the stream is failing because zfs can't unmount s_pool. Running an export fails as well (reporting that the device is busy), unless run in recovery mode. I cannot figure out what has its hooks in the pool, and the folks at /r/zfs were unable to help me. I tried running lsof | grep s_pool but nothing came up. Any ideas? This is preventing me from making backups of several production VMs.
 
Is there anything regarding the issue in the `dmesg` or `journalctl -b` output?
 
I just read your posts on reddit. Yes, this is really weird.
So here are some things to check.
  • Make sure you don't use s_pool as swap device
  • Make sure you did not run out of space: zfs list -o space s_pool. Are quotas used?
  • Make sure that the pool is not shared via NFS or SMB
  • Check zfs versions. For send/receive, the receiving side must support the pool version and features used by the sender. Are LZ4 or zstd in use, and available on both sides?
  • Do you run scrubs regularly?
 
  • It's definitely not set up as swap. There is an NVMe drive that holds all of the operating partitions.
  • The pool is purely storage. I have over 4TB still available, but I did run out of space a couple of weeks ago: I was out sick, no one was cleaning up the old snapshots in the pool, and I had to run my cleanup script.
  • I'm not sure how to check how the pool is mounted. Google isn't turning up anything so I must not understand what to search for exactly.
  • None of the pools return a version when I prompt it. The "Value" field is just "--"
  • I've never run a scrub as far as I remember
I am new to ZFS so there is probably a ton I'm doing wrong, so thanks for your help and patience. The setup I've put together seems to be fragile so I likely did that wrong, too.
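
For the mount and version questions above, one way to see how the pool is mounted and shared is to ask ZFS for the relevant properties. This is a minimal sketch, assuming the pool name s_pool from this thread; the function names are made up for illustration. On pools that use feature flags, zpool get version reports "-" instead of a number, so an empty version value is expected there and the feature@ properties are the thing to compare.

```shell
#!/bin/sh
# Hypothetical helpers; only the zfs/zpool invocations are real commands.

# Print mount and share properties for a pool and all child datasets.
# -r recurses, -H emits tab-separated rows without a header.
pool_mount_report() {
    zfs get -r -H -o name,property,value \
        mounted,mountpoint,sharenfs,sharesmb "$1"
}

# List datasets that are shared over NFS or SMB.
# A value of "off" means not shared; "-" means the property does not apply.
shared_datasets() {
    pool_mount_report "$1" |
        awk -F '\t' '($2 == "sharenfs" || $2 == "sharesmb") &&
                     $3 != "off" && $3 != "-" { print $1 }' |
        sort -u
}

# List the feature flags of a pool (compare these between sender and receiver).
pool_features() {
    zpool get -H -o property,value all "$1" | grep '^feature@'
}
```

Running pool_mount_report s_pool shows the mountpoint and whether each dataset is currently mounted; shared_datasets s_pool prints nothing if no dataset is shared.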
 
I've never run a scrub as far as I remember
It might be possible that a drive has a write problem. If it is an enterprise drive it'll recover by default; consumer drives must be "convinced". This might cause zfs to hang rather than report an error. Anyway, to rule out drive problems I'd suggest running a short SMART self-test on them.
Code:
# short smart test
smartctl -t short <drive>
# check results
smartctl -a <drive>

In the long term you should set up a cron job to run a scrub at least once a month.
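
A scrub job along those lines could look like the sketch below. It assumes the pool name s_pool from this thread; the function name, the script path and the schedule in the comment are placeholders. Some distributions' ZFS packages already install a periodic scrub job, so it is worth checking /etc/cron.d first.

```shell
#!/bin/sh
# Start a scrub unless one is already running on the pool.
scrub_pool() {
    pool="$1"
    # "zpool status" prints "scrub in progress" in its scan line while scrubbing
    if zpool status "$pool" | grep -q 'scrub in progress'; then
        echo "scrub already running on $pool"
        return 0
    fi
    zpool scrub "$pool" && echo "scrub started on $pool"
}

# A script wrapping this would end with: scrub_pool "$1"
# Example crontab entry (hypothetical path), 02:00 on the 1st of each month:
# 0 2 1 * * root /usr/local/sbin/scrub-pool.sh s_pool
```

The result of the last scrub shows up in the scan line of zpool status s_pool.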

To make sure nothing is shared over NFS or SMB, run zfs unshare s_pool (or zfs unshare -a to unshare all datasets).


zpool export supports a force option and an undocumented hardforce option:

-f This option is not supported on Linux. This command will forcefully export the pool even if it has a shared spare that is currently being used. This may lead to potential data corruption.

Makes me wonder what -F does...

Code:
# -f means force and -F means hardforce
zpool export|destroy <pool>
zpool export|destroy -f <pool>
zpool export|destroy -F <pool>

The setup I've put together seems to be fragile so I likely did that wrong, too.
I don't think so. I had exactly the same issue as you a couple of months back, but that was on my desktop and just rebooting fixed the problem. zfs is just buggy. The many features zfs has don't make things easier for developers, and there are also many different Linux kernels.

Backups are very difficult, and it takes a decade for software to mature IMHO...
But people want the latest features and just use their new hardware.
 
All drives passed SMART, so that's good. unshare complained that the share is not nfs or smb, so that answers that as well. I'll certainly set up a cronjob for scrubs, but for now should I use that force/hardforce export?
 
Both zpool export -f s_pool and zpool export -F s_pool failed to unmount the pool. Even forced and lazy unmounts fail. I'm at a loss here.
 
I have the same problem and my only solution so far was to do a

sleep 2
systemctl try-reload-or-restart proxmox-backup proxmox-backup-proxy

after the

proxmox-backup-manager datastore remove XXX

(without the sleep it sometimes failed anyway)


Take care: "proxmox-backup-manager datastore remove" only removes the datastore from the GUI, not the zpool and not the hard drive.
 
At this point I'm thinking of nuking the datastore and pool and starting fresh. I just wish I knew what happened so I could try to prevent it in the future.
 
Coming back to this because I'm having trouble again. I ran lsof to try to find the process locking the mountpoint and got this:
Code:
:~$ sudo lsof | grep /mnt/datastore/storage/
proxmox-b  2325                           backup   19u      REG               0,58        0          3 /mnt/datastore/storage/.lock
proxmox-b  2325  3524 tokio-run           backup   19u      REG               0,58        0          3 /mnt/datastore/storage/.lock
proxmox-b  2325  5496 tokio-run           backup   19u      REG               0,58        0          3 /mnt/datastore/storage/.lock
proxmox-b  2325  6106 tokio-run           backup   19u      REG               0,58        0          3 /mnt/datastore/storage/.lock
proxmox-b  2325  6484 tokio-run           backup   19u      REG               0,58        0          3 /mnt/datastore/storage/.lock
proxmox-b  2325 32620 tokio-run           backup   19u      REG               0,58        0          3 /mnt/datastore/storage/.lock

Any idea what's going on and why this mountpoint is getting locked? It's preventing me from doing my backup routine.
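
The lsof output above can be boiled down to the distinct processes holding files open. A small sketch; unique_holders is a made-up helper name, and the mount path is the one from the output above.

```shell
#!/bin/sh
# Read lsof output on stdin and print each distinct "COMMAND PID" pair once.
# Rows whose second column is not numeric (such as the header) are skipped.
unique_holders() {
    awk '$2 ~ /^[0-9]+$/ { print $1, $2 }' | sort -u
}

# Usage: passing a mount point to lsof lists open files on that filesystem.
# sudo lsof /mnt/datastore/storage | unique_holders
```

For the listing above this collapses to a single process (PID 2325); the other rows are threads of that same process, which can then be inspected with ps -o pid,user,args -p 2325.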
 
Like @mike2012 said above, you need to reload the PBS services and wait for the old processes to exit if they are still executing tasks. After that the lock files shouldn't be open anymore, provided no datastore definition for that path exists anymore.
 
I restarted the service, but it's not unlocking. The datastore has been locked for weeks at this point, preventing me from taking snapshots and "send|recv"ing them. It's persisted through reboots and every attempt to fix it.
 
Replying again because the issue persists. Restarting the services has no effect. Stopping them changes the error to "Dataset is busy." All new backup jobs are failing but I can't troubleshoot it until I can resolve this first issue. Also, there is a partial recv but I can't destroy it.
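
For the partial receive mentioned above: if it was started as a resumable receive (zfs recv -s), the saved state shows up in the receive_resume_token property and can be discarded with zfs recv -A. A hedged sketch; the helper name and the dataset argument are placeholders, and it does not apply if the receive was not resumable.

```shell
#!/bin/sh
# Return 0 if the dataset has saved state from an interrupted resumable receive.
# receive_resume_token is "-" when there is no such state.
has_partial_recv() {
    token=$(zfs get -H -o value receive_resume_token "$1") || return 2
    [ -n "$token" ] && [ "$token" != "-" ]
}

# Usage (dataset name is a placeholder):
# has_partial_recv <dataset> && zfs recv -A <dataset>
```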
 
Update: While restarting the services and waiting a month did not work, stopping the services and waiting a couple of days did. I don't know why this is happening but it is causing a lot of downtime. If someone comes across this and has a suggestion on how to improve robustness to mitigate these incidents, please let me know.
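
The sequence that eventually worked (stop the services, wait, then retry) could be scripted roughly as below, so the backup routine waits for the lock to clear instead of failing outright. This is a sketch under assumptions: the service names are the ones from the restart command earlier in the thread, wait_for_release is a made-up helper, and a clean lsof result does not rule out the kernel-side "Dataset is busy" state described above.

```shell
#!/bin/sh
# Poll until no process has a file open on the given mount point.
# Arguments: mount point, number of attempts (default 30), delay in seconds (default 2).
# Returns 0 once lsof reports no holders, 1 if the attempts run out.
wait_for_release() {
    mnt="$1"
    tries="${2:-30}"
    delay="${3:-2}"
    while [ "$tries" -gt 0 ]; do
        # lsof -t prints only PIDs; empty output means nothing is holding a file
        if [ -z "$(lsof -t "$mnt" 2>/dev/null)" ]; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}

# Usage, with the names from this thread:
# systemctl stop proxmox-backup proxmox-backup-proxy
# wait_for_release /mnt/datastore/storage 60 5 && echo "no open files left"
```

Only after the function returns 0 would the snapshot and send|recv steps be attempted; on a non-zero return the script can log the holders and skip the run.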