Proxmox fails to repair ZFS rpool; after reaching 93.69% the scrub freezes

iHostART

Member
Nov 26, 2021
Hello everyone, this morning I encountered the following problem.

One of our virtualization servers started crashing completely at random, and after a reboot the ZFS pool started a repair (scrub):

Code:
root@LV4:~# zpool status
  pool: rpool
 state: ONLINE
  scan: scrub in progress since Sun May 14 00:24:01 2023
        42.1T scanned at 167M/s, 41.7T issued at 123M/s, 44.6T total
        0B repaired, 93.69% done, 06:38:12 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        rpool                                           ONLINE       0     0     0
          scsi-3644a8420223a11002aadd866048e61ae-part3  ONLINE       0     0     0


After waiting approximately 3-4 hours, the repair reached 93.69% and did not progress any further; shortly afterwards the server started to crash and no command worked. We already had iotop open and there was no intensive IOPS usage (see the attached screenshot).

How can this error be fixed? Any ideas?
At the same time, the Proxmox web panel stopped working. I tried rebooting the server 2-3 times and waited for the pool repair to finish, but it did not work.

Regards,
Calin
 

Attachments

  • XyHOkJF.png (146.3 KB)
That process hangs in an uninterruptible sleep state:

Code:
root         966  9.2  0.0      0     0 ?        D    04:03  54:51  \_ [txg_sync]

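As a generic check (not specific to this box), all tasks currently stuck in uninterruptible sleep can be listed like this; the sysrq line is optional and assumes sysrq is enabled:

Code:
# list every task in state D (uninterruptible sleep) and the kernel function it is waiting in
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'
# optionally dump all blocked tasks with their kernel stacks to the kernel log
echo w > /proc/sysrq-trigger && dmesg | tail -n 50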
Try to reboot in order to solve the problem. Maybe it'll hang on reboot, too, so you need to reset the machine. I have no other idea than that.

After the machine is back online, try updating all packages and probably do another reboot in order to run the new kernel.
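On a standard Proxmox VE installation that boils down to roughly the following (a sketch; it assumes your package repositories are already configured correctly):

Code:
apt update          # refresh the package lists
apt full-upgrade    # pull in the newer kernel and ZFS packages
reboot              # boot into the new kernel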
 
Hello @LnxBil, I already tried a reboot today without any result; the process went into uninterruptible sleep again.

Code:
  pool: rpool
 state: ONLINE
  scan: scrub in progress since Fri Jun 30 04:23:27 2023
        8.81T scanned at 264M/s, 8.44T issued at 253M/s, 55.4T total
        0B repaired, 15.24% done, 2 days 05:58:43 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        rpool                                           ONLINE       0     0     0
          scsi-3644a8420223a11002aadd866048e61ae-part3  ONLINE       0     0     0


This time the scrub is at 15.24%.

But thanks again for the help, I really appreciate it.


Regards,
Calin
 
Code:
  pool: rpool
 state: ONLINE
  scan: scrub in progress since Sun May 14 00:24:01 2023
        42.1T scanned at 167M/s, 41.7T issued at 123M/s, 44.6T total
        0B repaired, 93.69% done, 06:38:12 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        rpool                                           ONLINE       0     0     0
          scsi-3644a8420223a11002aadd866048e61ae-part3  ONLINE       0     0     0

Code:
  pool: rpool
 state: ONLINE
  scan: scrub in progress since Fri Jun 30 04:23:27 2023
        8.81T scanned at 264M/s, 8.44T issued at 253M/s, 55.4T total
        0B repaired, 15.24% done, 2 days 05:58:43 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        rpool                                           ONLINE       0     0     0
          scsi-3644a8420223a11002aadd866048e61ae-part3  ONLINE       0     0     0

What exact storage-hardware and storage-structure is underneath this pool / behind that single device (scsi-3644a8420223a11002aadd866048e61ae)?
 
Hello @Neobin, it is 6x 12 TB SATA HDDs in RAID 0 with ZFS partitions on top. It is the default Proxmox install; I did not change any ZFS settings (cache, ARC, etc.)...

Regards,
Calin
 
Hello @Neobin, it is 6x 12 TB SATA HDDs in RAID 0 with ZFS partitions on top. It is the default Proxmox install; I did not change any ZFS settings (cache, ARC, etc.)...

So, you are using a (pseudo-)hardware or software raid underneath ZFS? This should not be done: [1]!
Really a (r)aid0?
Can the controller not be switched/flashed into IT-mode or why did you not use the ZFS-raid?

I would probably check all the different storage layers (disks, controller, cables, RAID) to rule those out.
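For the disk layer, a rough starting point could look like the sketch below (the device name is a placeholder; with disks hidden behind a RAID controller you may additionally need smartctl's -d option to reach them):

Code:
# SMART health, error counters and self-test log of one physical disk
smartctl -a /dev/sda
# kernel messages hinting at I/O errors, resets or controller timeouts
dmesg -T | grep -iE 'error|reset|timeout'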

But to be honest, in my humble opinion and sorry to say that, but if this is really a (r)aid0 (with ZFS on top), this setup is scuffed af and a ticking time bomb...

[1] https://openzfs.github.io/openzfs-d...uning/Hardware.html#hardware-raid-controllers
 
Reactions: Dunuin
But to be honest, in my humble opinion and sorry to say that, but if this is really a (r)aid0 (with ZFS on top), this setup is scuffed af and a ticking time bomb...
Probably zfs set copies=2 <pool> may disarm such a bomb...
 
Probably zfs set copies=2 <pool> may disarm such a bomb...
And it would also be a waste, as a stripe of three mirrors would offer the same usable capacity, better IOPS performance, better reliability, more versatile management options, easier expandability, and would allow for smaller blocksizes...
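Purely as an illustration of that layout (the pool name "tank" and the disk paths are placeholders, and the six disks would of course need to be exposed to ZFS directly rather than through the RAID controller), such a stripe of three mirrors could be created roughly like this:

Code:
# "raid10"-like pool: a stripe of three 2-disk mirrors, six disks in total
zpool create -o ashift=12 tank \
  mirror /dev/disk/by-id/DISK1 /dev/disk/by-id/DISK2 \
  mirror /dev/disk/by-id/DISK3 /dev/disk/by-id/DISK4 \
  mirror /dev/disk/by-id/DISK5 /dev/disk/by-id/DISK6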
 
Reactions: Neobin
Probably zfs set copies=2 <pool> may disarm such a bomb...

Like @leesteken already said, not if we talk about disk failure: [1]. (Besides the fact that ZFS is, in this case here, not even aware of the individual disks underneath; it cannot see them.)

And like @Dunuin already said, if we talk about data redundancy across multiple disks (e.g. against corruption), you are way better off with e.g. a raid10.

[1] https://jrs-s.net/2016/05/02/zfs-copies-equals-n
 
Not when using a raid0 of 6 drives and hiding that fact from ZFS; both copies might end up on the same drive. And the raid0 is 6 times more likely to fail than a single drive...
That is why I said "probably". I was talking about a HW RAID 10 of 4-8 disks (all consistency checks, patrol reads and disk-replacement logic are done by the HW RAID firmware), so I think the scrub at the ZFS level could be disabled (see the sketch at the end of this post) :) ZFS sees such a config as a "raid0" with a single disk.
And the raid0 is 6 times more likely to fail than a single drive...
That is true if it is a single drive, but not if it is a set of stripes formed by a HW RAID card in "raid10".

Like @leesteken already said, not if we talk about disk failure
I see, we are talking about a "dead hardware raid card" :))
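For what it is worth, on a stock Debian/PVE install the monthly scrub is kicked off by the zfsutils cron job, and a running scrub can be paused or cancelled, roughly like this:

Code:
# Debian/PVE schedules the monthly scrub here
cat /etc/cron.d/zfsutils-linux
# pause a running scrub (resume later with 'zpool scrub rpool')
zpool scrub -p rpool
# or cancel it completely
zpool scrub -s rpool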
 
