zfs send/receive fails to unmount pool

LBX_Blackjack

Member
Jul 24, 2020
Hey all,
I keep my backups in a ZFS pool ("s_pool") of which I take regular snapshots. These are then streamed elsewhere with `zfs send ... | zfs recv ...`. Today, however, the stream is failing because ZFS can't unmount s_pool. Running an export fails as well (reporting that the device is busy), unless run in recovery mode. I cannot figure out what has its hooks in the pool, and the folks at /r/zfs were unable to help me. I tried running `lsof | grep s_pool` but nothing came up. Any ideas? This is preventing me from making backups of several production VMs.
 
Anything regarding the issue in `dmesg` or `journalctl -b` output?
 
I just read your postings on reddit. Yes, this is really weird.
So just some things to check:
  • Make sure you don't use s_pool as a swap device
  • Make sure you did not run out of space: `zfs list -o space s_pool`. Are quotas used?
  • Make sure that the pool is not shared via NFS or SMB
  • Check ZFS versions. For send/receive both pools must have a compatible version and feature set. Are LZ4 or zstd used and installed?
  • Do you run scrubs regularly?
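A quick way to run through the checks above from a shell (s_pool is the pool name from the thread; this is just a sketch using standard ZFS tooling, adjust as needed):

```shell
# 1. swap: look for any zvol-backed swap devices
swapon --show

# 2. space and quotas on the pool
zfs list -o space s_pool
zfs get quota,refquota s_pool

# 3. NFS/SMB sharing state
zfs get sharenfs,sharesmb s_pool

# 4. pool version and enabled feature flags
zpool get version s_pool
zpool get all s_pool | grep feature@

# 5. last scrub (shown in the status output)
zpool status s_pool
```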
 
  • It's definitely not set up as swap. There is an NVMe drive that holds all of the operating partitions.
  • The pool is purely storage. I have over 4 TB still available, but I did run out of space a couple of weeks ago: I was out sick, no one was cleaning up the old snapshots in the pool, and I had to run my cleanup script.
  • I'm not sure how to check how the pool is shared. Google isn't turning up anything, so I must not understand what to search for exactly.
  • None of the pools return a version when I query it. The "Value" field is just "--".
  • I've never run a scrub as far as I remember
I am new to ZFS, so there is probably a ton I'm doing wrong; thanks for your help and patience. The setup I've put together seems fragile, so I likely did that wrong, too.
 
I've never run a scrub as far as I remember
It might be possible that a drive has a write problem. If it is an enterprise drive it will recover by default; consumer drives must be "convinced". This might cause ZFS to hang rather than report an error. Anyway, to rule out drive problems I'd suggest running a short SMART self-test on them.
Code:
# short smart test
smartctl -t short <drive>
# check results
smartctl -a <drive>

In the long term you should set up a cron job to run a scrub at least once a month.
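For example, a minimal cron entry for a monthly scrub (the schedule and pool name are just one possibility):

```shell
# /etc/cron.d/zfs-scrub -- scrub s_pool on the 1st of every month at 03:00
0 3 1 * * root /usr/sbin/zpool scrub s_pool
```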

To make sure the pool is not shared via NFS or SMB, run `zfs unshare s_pool` (or `zfs unshare -a` to unshare everything).


zpool export supports a force (-f) and an undocumented hardforce (-F) option:

-f This option is not supported on Linux. This command will forcefully export the pool even if it has a shared spare that is currently being used. This may lead to potential data corruption.

Makes me wonder what -F does...

Where -f means force and -F means hardforce.

Code:
zpool export|destroy <pool>
zpool export|destroy -f <pool>
zpool export|destroy -F <pool>

The setup I've put together seems to be fragile so I likely did that wrong, too.
I don't think so. I had exactly the same issue as you a couple of months back, but that was on my desktop and just rebooting fixed the problem. ZFS is just buggy: the many features ZFS has don't make things easier for developers, and there are also many different Linux kernels.

Backups are very difficult, and it takes a decade for software to mature IMHO...
But people want the latest features and just use their new hardware.
 
All drives passed SMART, so that's good. `zfs unshare` complained that the share is not NFS or SMB, so that answers that as well. I'll certainly set up a cron job for scrubs, but for now, should I use that force/hardforce export?
 
Both `zpool export -f s_pool` and `zpool export -F s_pool` failed to unmount the pool. Even forced and lazy unmounts fail. I'm at a loss here.
 
I have the same problem, and my only solution so far was to do a

Code:
sleep 2
systemctl try-reload-or-restart proxmox-backup proxmox-backup-proxy

after the

Code:
proxmox-backup-manager datastore remove XXX

(without the sleep it sometimes failed anyway)

Take care: `proxmox-backup-manager datastore remove` only removes the datastore from the GUI, not the zpool and not the hard drive.
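The workaround above, put into one script (the datastore name XXX is the placeholder from the post; service names as shipped by PBS):

```shell
#!/bin/sh
# Remove the datastore definition from PBS, then kick the services
# so they drop their open .lock file handles.
set -e

proxmox-backup-manager datastore remove XXX

# without a short pause, the reload sometimes raced the removal
sleep 2
systemctl try-reload-or-restart proxmox-backup proxmox-backup-proxy
```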
 
At this point I'm thinking of nuking the datastore and pool and starting fresh. I just wish I knew what happened so I could try to prevent it in the future.
 
Coming back to this because I'm having trouble again. I ran lsof to try to find the process locking the mountpoint and got this:
Code:
:~$ sudo lsof | grep /mnt/datastore/storage/
proxmox-b  2325                           backup   19u      REG               0,58        0          3 /mnt/datastore/storage/.lock
proxmox-b  2325  3524 tokio-run           backup   19u      REG               0,58        0          3 /mnt/datastore/storage/.lock
proxmox-b  2325  5496 tokio-run           backup   19u      REG               0,58        0          3 /mnt/datastore/storage/.lock
proxmox-b  2325  6106 tokio-run           backup   19u      REG               0,58        0          3 /mnt/datastore/storage/.lock
proxmox-b  2325  6484 tokio-run           backup   19u      REG               0,58        0          3 /mnt/datastore/storage/.lock
proxmox-b  2325 32620 tokio-run           backup   19u      REG               0,58        0          3 /mnt/datastore/storage/.lock

Any idea what's going on and why this mountpoint is getting locked? It's preventing me from doing my backup routine.
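Those entries are all threads of a single proxmox-backup-proxy process (PID 2325) holding the datastore's .lock file open. A sketch for mapping such holders back to their systemd units (lock path taken from the lsof output; `ps -o unit=` assumes a systemd system with procps):

```shell
# list the unique PIDs with the lock file open,
# then show the owning systemd unit and command line for each
lsof -t /mnt/datastore/storage/.lock | sort -u | while read -r pid; do
    ps -o pid=,unit=,cmd= -p "$pid"
done
```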
 
Like @mike2012 said above, you need to reload the PBS services and wait for the old workers to exit if they are still executing tasks; after that the lock files shouldn't be open anymore, provided no datastore definition for that path exists anymore.
 
I restarted the service, but it's not unlocking. The datastore has been locked for weeks at this point, preventing me from taking snapshots and "send|recv"ing them. It's persisted through reboots and every attempt to fix it.
 
Replying again because the issue persists. Restarting the services has no effect. Stopping them changes the error to "Dataset is busy". All new backup jobs are failing, but I can't troubleshoot that until I resolve this first issue. Also, there is a partial recv that I can't destroy.
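On the partial recv: if the interrupted receive was started as a resumable one (`zfs recv -s`), ZFS keeps its partially received state around, which can block a destroy. That state can be abandoned explicitly (the dataset name below is a placeholder, not from the thread):

```shell
# abort an interrupted resumable receive and delete its partial state
zfs receive -A s_pool/target
```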
 
Update: While restarting the services and waiting a month did not work, stopping the services and waiting a couple of days did. I don't know why this is happening but it is causing a lot of downtime. If someone comes across this and has a suggestion on how to improve robustness to mitigate these incidents, please let me know.
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.