ZFS not detecting disk removal in pools

I am running Proxmox VE 7.2-11 with:
Code:
zfs-zed/stable,now 2.1.6-pve1 amd64
zfs-initramfs/stable,now 2.1.6-pve1 all
zfsutils-linux/stable,now 2.1.6-pve1 amd64

After pulling the only disk of a single-disk pool out of its hot-swap bay,
zpool status still listed the pool as ONLINE, including the disk I had just removed, and consequently no emails were sent by ZED.

The OS did detect that the disk was gone; only ZFS seems to have missed the event.

Is this cause for concern?
 
ZFS probably logged errors, but they could not be written to any disk. Did you enable logging to a remote system, and did it not arrive there?
Maybe the e-mailing failed because some configuration could not be read while all disks were missing. Do you perhaps have a remote log about this?
Unless there are remote logs about why some of those things did not happen, it is hard to determine how the various software components failed to warn you.
Running a system with only one drive is always risky, as it could fail by itself at any time. I wouldn't worry about this, and would use a raid1/mirror next time.
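(If remote logging isn't set up, a minimal sketch with rsyslog, using a placeholder hostname, would be something like the following; restart rsyslog afterwards.)
Code:
# /etc/rsyslog.d/90-remote.conf -- forward all syslog messages to a remote collector
# logserver.example.com is a placeholder; @@ = TCP, a single @ = UDP
*.* @@logserver.example.com:514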

EDIT: I completely misjudged the situation, assuming it was the rpool.
 
Last edited:
I can run the experiment again tomorrow.
It was a test pool, created to test exactly these kinds of scenarios.
syslog:
Code:
Nov 23 20:14:15 pveasus smartd[3939]: Device: /dev/sda [SAT], removed ATA device: No such device.
But ZFS, and thus also ZED, did not detect it, so no email was sent.
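(For reference, ZED's e-mail side is configured in /etc/zfs/zed.d/zed.rc; a minimal sketch of the options that matter here, with the address as a placeholder:)
Code:
# /etc/zfs/zed.d/zed.rc -- options relevant to e-mail notifications (sketch)
ZED_EMAIL_ADDR="root"          # placeholder; where ZED sends notifications
ZED_NOTIFY_INTERVAL_SECS=3600  # minimum seconds between notifications for similar events
ZED_NOTIFY_VERBOSE=1           # also notify about events that finished without errors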

After that I tried pulling a drive out of a multi-disk pool, and then the OS, ZFS, and ZED all detected it and an email was sent.

I hope this clarifies a bit more.
EDIT: Also, my root pool is mirrored and is not used for experiments :)
 
Last edited:
I am getting really worried.

I just tried the drive removal from a 3-way mirrored pool on a different system.

I pulled out one of the disks.
syslog:
Code:
Nov 24 08:37:24 pver1 kernel: ata6: SATA link down (SStatus 0 SControl 300)
Nov 24 08:37:29 pver1 kernel: ata6: SATA link down (SStatus 0 SControl 300)
Nov 24 08:37:29 pver1 kernel: ata6.00: disabled
Nov 24 08:37:29 pver1 kernel: ata6.00: detaching (SCSI 5:0:0:0)
Nov 24 08:37:29 pver1 kernel: sd 5:0:0:0: [sdd] Synchronizing SCSI cache
Nov 24 08:37:29 pver1 kernel: sd 5:0:0:0: [sdd] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Nov 24 08:37:29 pver1 kernel: sd 5:0:0:0: [sdd] Stopping disk
Nov 24 08:37:29 pver1 kernel: sd 5:0:0:0: [sdd] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK

But now, almost 8 minutes later, zpool status still reports the pool as fine and dandy, with the pulled disk shown as ONLINE.
This system has the same specs as the one that only misbehaved with the single-drive pool.

pveversion
Code:
pve-manager/7.2-11/b76d3178 (running kernel: 5.15.64-1-pve)

apt list zfs* | grep installed
Code:
zfs-initramfs/stable,now 2.1.6-pve1 all [installed]
zfs-zed/stable,now 2.1.6-pve1 amd64 [installed]
zfsutils-linux/stable,now 2.1.6-pve1 amd64 [installed]

I'd like to escalate this issue if I can somehow.
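If it helps, two standard checks I can run next time (consider this a sketch, nothing Proxmox-specific) to see whether ZFS ever generated a removal event at all:
Code:
# Show the ZFS event log that ZED consumes; look for statechange/remove events
zpool events -v

# Check that the ZED daemon is running and see what it logged since boot
systemctl status zfs-zed
journalctl -b -u zfs-zed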
 
I changed the title of this thread to indicate it is no longer only about single-disk pools.
 
Hi,
stupid question, but did you try to write anything to the pool after removing the disk?
 
We had such a thread before, so this is not the first time this has been observed.
And if I remember right, ZFS will only complain when it tries to write to that missing disk, so fiona has a good point.

Are you maybe also using a RAID card rather than a dumb HBA, so there could be some unnecessary abstraction layer in between?

Tried to find that older thread, but I can't find it...
 
Last edited:
Interesting point.

No, I never did write anything while the pools were in operation. The pools where I ran the tests were not in use, so no changes were made.

Hold on, and I will do the same test but then force a change to the pool.
 
I have not even started the test yet (I will soon), but having a pool under active data flux should not be the only way to have ZFS tell you when things might be going bad.

I will have more to say once I have done my test.
 
We had such a thread before, so this is not the first time this has been observed.
And if I remember right, ZFS will only complain when it tries to write to that missing disk, so fiona has a good point.

Are you maybe also using a RAID card rather than a dumb HBA, so there could be some unnecessary abstraction layer in between?

Tried to find that older thread, but I can't find it...
I have a SuperMicro and a SilverStone case, both with HDD backplanes configured to let the motherboard do all the "thinking".
Hence the removal of a drive does get noticed by the OS in both cases, on whichever of the two systems I tried this.

More to follow
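(To double-check that the disks really sit on the onboard SATA controller and not behind some RAID firmware, a quick sketch of what I can run on both boxes:)
Code:
# List storage controllers; a plain AHCI SATA controller or a SAS HBA is what I expect to see
lspci -nn | grep -iE 'sata|sas|raid'

# Show which controller/port each disk hangs off
ls -l /dev/disk/by-path/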
 
I had to break down my test because of... reasons ;(

OK, please use me as your test subject, then, to settle this debate once and for all.

What test shall I run to determine whether there is an issue worth exploring, or whether this is just a misunderstanding on my part as a user?
 
While I get my thoughts together again, can one of you please explain why it would be a good thing for ZFS (so in no way am I involving the Proxmox team here) to only do health checks when there are IO operations?

The more I think about it, the less sense it makes to me, though I am just a newcomer to the field.
 
Last edited:
In an effort to shine more light on the matter, I am trying to create a disk pull-out scenario while the pool is in operation
..
zpool status
Code:
  pool: notimportant
 state: ONLINE
  scan: resilvered 1.18T in 02:42:31 with 0 errors on Thu Nov 24 15:48:51 2022
config:

        NAME                                          STATE     READ WRITE CKSUM
        notimportant                                  ONLINE       0     0     0
          mirror-0                                    ONLINE       0     0     0
            ata-WDC_WD60EZAZ-00SF3B0_WD-WX32DB1EN69Y  ONLINE       0     0     0
            sda                                       ONLINE       0     0     0
            ata-ST6000DM003-2U9186_WSB05H6W           ONLINE       0     0     0
..
rsync --times --atimes --acls --owner --group --perms --xattrs --links --mkpath --recursive --progress root@x.x.x.x/notimportant /notimportant

but I fail to get the IO operation to start

Code:
rsync: [Receiver] Failed to exec s: No such file or directory (2)
rsync error: error in IPC code (code 14) at pipe.c(85) [Receiver=3.2.3]
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: error in IPC code (code 14) at io.c(228) [Receiver=3.2.3]

But as far as I know there is such a directory:

ls -l /
drwxr-xr-x 7 root root 7 Nov 23 10:53 notimportant

I am most likely just running into my own ignorance here in not being able to get a long-running IO operation going.

Only once I have that working can I continue with pulling the disk out mid-operation.

Can anyone please help me get through this? I am quite new to the field.
I would love to leave the evil clutches of Microsoft behind.
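(One thing I notice re-reading my own command: a remote rsync source needs a colon between host and path, i.e. root@x.x.x.x:/notimportant. I am not sure that is the only problem here, but as a sketch, either a corrected rsync or a plain dd write into the pool should give me the IO I need:)
Code:
# Corrected remote-source syntax (x.x.x.x stays a placeholder)
rsync --archive --progress root@x.x.x.x:/notimportant/ /notimportant/

# Or skip rsync entirely and just write a large file into the pool
dd if=/dev/urandom of=/notimportant/io-test.bin bs=1M count=10240 status=progress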
 
Last edited:
Anyone? How do I create an intensive IO operation so that I can pull a disk from the pool involved in said operation?

I do stress, though, that this should not be needed in the first place. It would help me a lot to learn why the current mechanism works the way it does.
 
Also, if I were running a FreeBSD system and did the same, would it behave similarly by staying silent some of the time?

The point I am trying to make is that I no longer believe it is the ZFS code. It might be that the Proxmox team has made some slight adaptations that allow for this weird and unwanted behavior.
 
Also, if I were running a FreeBSD system and did the same, would it behave similarly by staying silent some of the time?

The point I am trying to make is that I no longer believe it is the ZFS code. It might be that the Proxmox team has made some slight adaptations that allow for this weird and unwanted behavior.
You can see all modifications we do here. None of it touches anything related to disk availability detection. So how do you reach that conclusion?
 
Anyone? How do I create an intensive IO operation so that I can pull a disk from the pool involved in said operation?

I do stress, though, that this should not be needed in the first place. It would help me a lot to learn why the current mechanism works the way it does.
Just try to touch/write a file on the ZFS after pulling the disk, no need for intensive IO.
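For example, something along these lines (pool name taken from the output above; just a sketch):
Code:
# Any small write is enough; syncing it forces ZFS to touch the disks and notice the missing one
touch /notimportant/probe
sync
zpool status notimportant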
 
Just try to touch/write a file on the ZFS after pulling the disk, no need for intensive IO.
Yes, fiona, a file-create operation did trigger the pool to notice that one of its drives was missing.

Is this intended behavior in the source code? If so, I would like to follow up with the OpenZFS people who build the code to learn the reasoning behind this decision. It might be perfectly valid, but I just do not see any benefits yet, only potential dangers.
 
Please note that ZFS is an advanced filesystem, but not a disk management system (like a NAS). ZFS just uses the block devices it's configured for and does not monitor system events, as far as I know. You could argue that Proxmox should report on (problematic) changes to its storage (which I would have expected, given its frequent updating of the usage graphs). Or maybe administrators should install software (of their choice) that is explicitly designed for monitoring servers and/or drives.
Personally, I wouldn't worry too much about errors not being detected before they actually become a problem, if some kind of redundancy is already in place (like raid1). And if there is no redundancy, every problem eventually becomes fatal, and an earlier notification does not add much.
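As an aside, a periodic scrub forces reads on every disk and will surface a missing device even on an otherwise idle pool. If I remember correctly the Debian/Proxmox packages already ship a monthly scrub job, but an explicit sketch (pool name and schedule are placeholders) could look like:
Code:
# /etc/cron.d/zfs-scrub -- scrub the pool "tank" at 03:00 on the 1st of every month
0 3 1 * * root /usr/sbin/zpool scrub tank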
 
