Proxmox fails to repair ZFS rpool; after reaching 93.69% the scrub freezes

iHostART

Member
Nov 26, 2021
Hello everyone, this morning I encountered the following problem.

One of our virtualization servers started crashing completely at random, and after a reboot the ZFS pool started a repair (scrub):

Code:
root@LV4:~# zpool status
  pool: rpool
 state: ONLINE
  scan: scrub in progress since Sun May 14 00:24:01 2023
        42.1T scanned at 167M/s, 41.7T issued at 123M/s, 44.6T total
        0B repaired, 93.69% done, 06:38:12 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        rpool                                           ONLINE       0     0     0
          scsi-3644a8420223a11002aadd866048e61ae-part3  ONLINE       0     0     0


After waiting approximately 3-4 hours, the repair reached 93.69% and did not progress any further; shortly afterwards the server started to crash and no command worked. We already had iotop open and there was no intensive IOPS usage (see the attached screenshot).

How can this error be fixed? Any ideas?
At the same time, the Proxmox web panel stopped working. I tried rebooting the server 2-3 times and waited for the pool repair to finish, but it did not work.

Regards,
Calin
 

Attachments

  • XyHOkJF.png (146.3 KB)
That process hangs in an uninterruptible sleep state:

Code:
root         966  9.2  0.0      0     0 ?        D    04:03  54:51  \_ [txg_sync]

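As a generic check (not specific to this box), all tasks currently stuck in uninterruptible sleep can be listed like this; the sysrq line is optional and assumes sysrq is enabled:

Code:
# list every task in state D (uninterruptible sleep) and the kernel function it is waiting in
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'
# optionally dump all blocked tasks with their kernel stacks to the kernel log
echo w > /proc/sysrq-trigger && dmesg | tail -n 50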
Try to reboot in order to solve the problem. Maybe it'll hang on reboot, too, so you need to reset the machine. I have no other idea than that.

After the machine is back online, try updating all packages and probably do another reboot in order to run the new kernel.
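On a standard Proxmox VE installation that boils down to roughly the following (a sketch; it assumes your package repositories are already configured correctly):

Code:
apt update          # refresh the package lists
apt full-upgrade    # pull in the newer kernel and ZFS packages
reboot              # boot into the new kernel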
 
Hello @LnxBil, I already tried a reboot today without any result; the process went into uninterruptible sleep again.

Code:
  pool: rpool
 state: ONLINE
  scan: scrub in progress since Fri Jun 30 04:23:27 2023
        8.81T scanned at 264M/s, 8.44T issued at 253M/s, 55.4T total
        0B repaired, 15.24% done, 2 days 05:58:43 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        rpool                                           ONLINE       0     0     0
          scsi-3644a8420223a11002aadd866048e61ae-part3  ONLINE       0     0     0


This time the scrub is at 15.24%.

But thanks again for the help, I really appreciate it.


Regards,
Calin
 
Code:
  pool: rpool
 state: ONLINE
  scan: scrub in progress since Sun May 14 00:24:01 2023
        42.1T scanned at 167M/s, 41.7T issued at 123M/s, 44.6T total
        0B repaired, 93.69% done, 06:38:12 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        rpool                                           ONLINE       0     0     0
          scsi-3644a8420223a11002aadd866048e61ae-part3  ONLINE       0     0     0

Code:
  pool: rpool
 state: ONLINE
  scan: scrub in progress since Fri Jun 30 04:23:27 2023
        8.81T scanned at 264M/s, 8.44T issued at 253M/s, 55.4T total
        0B repaired, 15.24% done, 2 days 05:58:43 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        rpool                                           ONLINE       0     0     0
          scsi-3644a8420223a11002aadd866048e61ae-part3  ONLINE       0     0     0

What exact storage-hardware and storage-structure is underneath this pool / behind that single device (scsi-3644a8420223a11002aadd866048e61ae)?
 
Hello @Neobin, it is 6x 12 TB SATA HDDs in RAID 0 with ZFS partitions on top. It is the default Proxmox install; I did not change any ZFS settings (cache, ARC, etc.)...

Regards,
Calin
 
Hello @Neobin, it is 6x 12 TB SATA HDDs in RAID 0 with ZFS partitions on top. It is the default Proxmox install; I did not change any ZFS settings (cache, ARC, etc.)...

So, you are using a (pseudo-)hardware or software raid underneath ZFS? This should not be done: [1]!
Really a (r)aid0?
Can the controller not be switched/flashed into IT-mode or why did you not use the ZFS-raid?

I would probably check all the different storage layers (disks, controller, cables, RAID) to rule those out.
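For the disk layer, a rough starting point could look like the sketch below (the device name is a placeholder; with disks hidden behind a RAID controller you may additionally need smartctl's -d option to reach them):

Code:
# SMART health, error counters and self-test log of one physical disk
smartctl -a /dev/sda
# kernel messages hinting at I/O errors, resets or controller timeouts
dmesg -T | grep -iE 'error|reset|timeout'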

But to be honest, in my humble opinion and sorry to say that, but if this is really a (r)aid0 (with ZFS on top), this setup is scuffed af and a ticking time bomb...

[1] https://openzfs.github.io/openzfs-d...uning/Hardware.html#hardware-raid-controllers
 
Reactions: Dunuin
But to be honest, in my humble opinion and sorry to say that, but if this is really a (r)aid0 (with ZFS on top), this setup is scuffed af and a ticking time bomb...
Probably zfs set copies=2 <pool> may disarm such a bomb...
 
Probably zfs set copies=2 <pool> may disarm such a bomb...
And it would also be a waste, as a stripe of three mirrors would offer the same usable capacity, better IOPS performance, better reliability, more versatile management options, easier expandability, and would allow for smaller blocksizes...
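Purely as an illustration of that layout (the pool name "tank" and the disk paths are placeholders, and the six disks would of course need to be exposed to ZFS directly rather than through the RAID controller), such a stripe of three mirrors could be created roughly like this:

Code:
# "raid10"-like pool: a stripe of three 2-disk mirrors, six disks in total
zpool create -o ashift=12 tank \
  mirror /dev/disk/by-id/DISK1 /dev/disk/by-id/DISK2 \
  mirror /dev/disk/by-id/DISK3 /dev/disk/by-id/DISK4 \
  mirror /dev/disk/by-id/DISK5 /dev/disk/by-id/DISK6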
 
Reactions: Neobin
Probably zfs set copies=2 <pool> may disarm such a bomb...

Like @leesteken already said, not if we talk about disk failure: [1]. (Besides the fact that ZFS is, in this case here, not even aware of the individual disks underneath; it cannot see them.)

And like @Dunuin already said, if we talk about data redundancy across multiple disks (e.g. against corruption), you are way better off with e.g. a raid10.

[1] https://jrs-s.net/2016/05/02/zfs-copies-equals-n
 
Not when using a raid0 of 6 drives and hiding that fact from ZFS; both copies might end up on the same drive. And the raid0 is 6 times more likely to fail than a single drive...
That is why I said "probably". I was talking about a HW RAID 10 of 4-8 disks (all consistency checks, patrol reads and disk-replacement logic are done by the HW RAID firmware), so I think the scrub at the ZFS level could be disabled (see the sketch at the end of this post) :) ZFS sees such a config as a "raid0" with a single disk.
And the raid0 is 6 times more likely to fail than a single drive...
That is true if it is a single drive, but not if it is a set of stripes formed by a HW RAID card in "raid10".

Like @leesteken already said, not if we talk about disk failure
I see, we are talking about a "dead hardware raid card" :))
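For what it is worth, on a stock Debian/PVE install the monthly scrub is kicked off by the zfsutils cron job, and a running scrub can be paused or cancelled, roughly like this:

Code:
# Debian/PVE schedules the monthly scrub here
cat /etc/cron.d/zfsutils-linux
# pause a running scrub (resume later with 'zpool scrub rpool')
zpool scrub -p rpool
# or cancel it completely
zpool scrub -s rpool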
 
