ZFS corrupted - repair? (RAIDZ1)

seatiger91

New Member
Oct 16, 2021
I have a very large VM disk on my pool. I wanted to move the data, but after some time the transfer rate dropped from ~120 MB/s to ~20 MB/s.
I restarted the VM, but it stops during boot since it cannot mount the virtual disk:

[Screenshot: 1671308721756.png]

It is also not possible to mount it manually after boot:
Code:
root@fileserver:~# mount /dev/sda1 /mnt/vhd0/
mount: /mnt/vhd0: can't read superblock on /dev/sda1.
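The kernel log would normally show the underlying read error behind that superblock message; a generic way to check, not tied to this setup:
Code:
# recent kernel messages around the failed mount usually contain the real I/O error
dmesg | tail -n 20
# confirm the partition and filesystem signature are still detected at all
blkid /dev/sda1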

One HDD showed degraded and zpool status showed errors (Read, Write and CKSUM; SMART was still fine though).
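For reference, SMART health on a drive is typically checked like this (the device path is just a placeholder):
Code:
# overall health verdict
smartctl -H /dev/sdX
# the attributes most relevant for failing sectors
smartctl -A /dev/sdX | grep -E 'Reallocated|Pending|Uncorrectable'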

So what I did was first clear the errors:
Code:
zpool clear -F storagepool0
And then start a scrub job:
Code:
zpool scrub storagepool0

The same HDD then showed degraded again; the pool was still ONLINE though. I already have a potential replacement HDD, but the hardware actually seems fine. The Proxmox server has always been gracefully shut down/rebooted.

This is what I currently get (once again running scrub, just for the record):
Code:
root@proxmox:~# sudo zpool status -v storagepool0
  pool: storagepool0
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Sat Dec 17 14:40:43 2022
        11.1T scanned at 507M/s, 10.0T issued at 457M/s, 63.0T total
        0B repaired, 15.95% done, 1 days 09:43:40 to go
config:

        NAME                                          STATE     READ WRITE CKSUM
        storagepool0                                  ONLINE       0     0     0
          raidz1-0                                    ONLINE       0     0     0
            ata-WDC_WD60EFRX-68MYMN1_WD-WX31D743YL1P  ONLINE       0     0     6
            ata-WDC_WD60EFRX-68MYMN1_WD-WXK1H641YJ3U  ONLINE       0     0     6
            ata-WDC_WD60EFRX-68L0BN1_WD-WX11D86HUHAJ  ONLINE       0     0     6
            ata-WDC_WD60EFRX-68MYMN1_WD-WX11DC4FKEYC  ONLINE       0     0     6
            ata-WDC_WD60EFRX-68MYMN1_WD-WX61D65NNCXV  ONLINE       0     0     6
            ata-WDC_WD60EFRX-68MYMN1_WD-WXK1H645X9YJ  ONLINE       0     0     6
            ata-WDC_WD60EFRX-68L0BN1_WD-WX11D76EPRCF  ONLINE       0     0     6
            ata-WDC_WD60EFRX-68MYMN1_WD-WX11D259V25T  ONLINE       0     0     6
            ata-WDC_WD60EFRX-68MYMN1_WD-WX31D743YN91  ONLINE       0     0     6
            ata-WDC_WD60EFRX-68MYMN1_WD-WX11DB4H8277  ONLINE       0     0     6
            ata-WDC_WD60EFRX-68MYMN1_WD-WXK1H6432F0C  ONLINE       0     0     6
            ata-WDC_WD60EFRX-68TGBN1_WD-WX31DC46JKJK  ONLINE       0     0     6

errors: Permanent errors have been detected in the following files:

        storagepool0/vm-209-disk-0:<0x1>

The zpool history is attached as a file; this happened in December 2022. I do not have any backup since the VM disk mostly contains media data (~40 TB though).
My fileserver cannot access the vm disk at all.

Is there any possibility to save it? Also, I just do not see what happened here since no hardware faulted and ZFS is known to have remarkable resilience.
 

Attachments

  • zpool history.txt
    22.5 KB
I just tried to move the disk again (from one pool to another). This time it started, but fails with this:

[Screenshot: 1671309995148.png]

Edit: the above was tried with the VM online. When it is offline, I get this:

Code:
root@proxmox:~# qm move_disk 209 scsi1 storagepool3
create full clone of drive scsi1 (storagepool0:vm-209-disk-0)
transferred 0.0 B of 46.7 TiB (0.00%)
qemu-img: error while reading at byte 25920792576: Input/output error
qemu-img: error while reading at byte 25929181184: Input/output error
qemu-img: error while reading at byte 25927084032: Input/output error
qemu-img: error while reading at byte 25924986880: Input/output error
qemu-img: error while reading at byte 25922889728: Input/output error
storage migration failed: copy failed: command '/usr/bin/qemu-img convert -p -n -t none -T none -f raw -O raw /dev/zvol/storagepool0/vm-209-disk-0 zeroinit:/dev/zvol/storagepool3/vm-209-disk-1' failed: exit code 1
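One generic way around such read errors would be a copy tool that tolerates them, e.g. GNU ddrescue; just a sketch, assuming ddrescue is installed and a target zvol of at least the source size already exists on storagepool3:
Code:
# copy everything readable from the source zvol, skip unreadable blocks and log them in a map file
ddrescue -f /dev/zvol/storagepool0/vm-209-disk-0 /dev/zvol/storagepool3/vm-209-disk-1 /root/vm-209-disk-0.map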
 
Looks like more than one drive (or cable, or the drive controller) is failing. It could also be a memory failure. The pool can only handle one failing drive, and ZFS cannot recover one or more blocks because the errors overlap. Either all drives are going bad, you suffered an unexpected power loss, or the problem is higher up the chain.
Since all drives are reporting checksum errors, maybe it's the controller or memory instead. Maybe try the drives on another, similar system?
Find out which part of the chain is failing and replace it. Recover the VM from backups.
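A rough way to narrow down which layer is failing (generic commands, nothing specific to this pool beyond its name):
Code:
# per-device error counters and the list of affected files
zpool status -v storagepool0
# recent ZFS error events with checksum/IO details
zpool events -v | tail -n 60
# kernel messages often reveal whether the disk, the cable/link or the controller is resetting
dmesg -T | grep -i -E 'ata|i/o error|reset'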
 
Yup, looks like that VM is permanently damaged. Next time you shouldn't be that close-fisted and should choose at least a raidz2.
And RAID doesn't replace a backup. You should still have backups of all data you care about.

I would also recommend running memtest86+ overnight after that scrub job has finished.
 
Yup, looks like that VM is permanently damaged. Next time you shouldn't be that close-fisted and should choose at least a raidz2.
And RAID doesn't replace a backup. You should still have backups of all data you care about.

I would also recommend running memtest86+ overnight after that scrub job has finished.
I don't see how raidz2 would have helped prevent this, since there is no hardware fault. I can still access every HDD directly over the chipset's bus.
Even further: if this wasn't in a raidz1, just one disk would currently be unavailable rather than all of the data being gone into the abyss.
I refuse to accept that ZFS can just randomly fault logically, regardless of any parity mechanism, and still be considered reliable.
I'll do a memtest, though the server runs entirely on ECC RAM. Is there really no mechanism that prevents a RAM module producing errors from faulting the entire storage?
 
I'll do a memtest, though the server runs entirely on ECC RAM. Is there really no mechanism that prevents a RAM module producing errors from faulting the entire storage?
Yup, that is what ECC is for. It can fix minor RAM errors, but a slowly dying RAM module can still corrupt data.
Did you check your EDAC counters to see if your RAM has encountered correctable or uncorrectable errors?
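For reference, on most Linux systems those counters can be read like this (the sysfs paths depend on the platform's EDAC driver; just a sketch):
Code:
# correctable (ce_count) and uncorrectable (ue_count) errors per memory controller
grep . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count
# or, with the edac-utils package installed
edac-util -v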
I don't see how raidz2 would have helped prevent this, since there is no hardware fault. I can still access every HDD directly over the chipset's bus.
Even further: if this wasn't in a raidz1, just one disk would currently be unavailable rather than all of the data being gone into the abyss.
I refuse to accept that ZFS can just randomly fault logically, regardless of any parity mechanism, and still be considered reliable.
With a raidz1 your data is safe as long as no single disk goes degraded. Once one is degraded, the smallest additional error will permanently corrupt data, as there is no redundancy left. With a raidz2 and a single degraded disk, single errors could still be fixed, as one parity disk is still available.

And ZFS can't do anything if the error happens in RAM, or if the error hits all disks at the same time... for example a failing disk controller, a failing PSU, a power outage without a UPS, a kernel error, ...

That's why you back up your data, so you can restore it when the whole pool fails.
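For comparison, a raidz2 vdev is created like this (pool and disk names are placeholders only) and keeps one disk worth of parity even while another disk is already degraded:
Code:
# example only: a 6-wide raidz2 still tolerates a second failure after one disk has dropped out
zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf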
 
On a side note:
I do not know what the general experience is, but I personally feel really uncomfortable with such large multi-TB vdisks. Not only are they unwieldy, but, as you may be experiencing now, a single small error in or with that additional vdisk layer can potentially cost you all the data on it.

That does not help you now, but maybe you might consider another concept for storing your multiple TBs of cold data in the future.
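One such concept would be to keep the media on a plain ZFS dataset on the host and export it to the fileserver over NFS instead of packing everything into one huge zvol; a sketch only, assuming an NFS server is installed on the host and using a made-up dataset name and subnet:
Code:
# a normal dataset instead of a zvol; files stay directly accessible on the host
zfs create storagepool0/media
# export it to the fileserver VM's network (the subnet is a placeholder)
zfs set sharenfs="rw=@192.168.1.0/24" storagepool0/media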
 
I might want to clarify this:
On a side note:
I do not know what the general experience is, but I personally feel really uncomfortable with such large multi-TB vdisks. Not only are they unwieldy, but, as you may be experiencing now, a single small error in or with that additional vdisk layer can potentially cost you all the data on it.

That does not help you now, but maybe you might consider another concept for storing your multiple TBs of cold data in the future.
It's true, the handling is not perfect. It's just that Proxmox does not offer a better solution. I was thinking about passing the disks through to a server with software that is able to handle storage better.
 
Against all claims that this would not be possible, I am currently able to access the data and I am migrating it. I'd like to add that all the smart-assery about backups etc. is what I expected from a forum, but it's not helpful and I consider it misplaced.

I will post a solution once I'm done. Any helpful suggestions about what happened here are still welcome.
 
Against all claims that this would not be possible, I am currently able to access the data and I am migrating it. I'd like to add that all the smart-assery about backups etc. is what I expected from a forum, but it's not helpful and I consider it misplaced.

I will post a solution once I'm done. Any helpful suggestions about what happened here are still welcome.
So what was the solution?
 
It's just that Proxmox does not offer a better solution. I was thinking about passing the disks through to a server with software that is able to handle storage better.
Better solution than ZFS? You had a failure across multiple disks and therefore a RAID write-hole scenario; how can any software prevent this hardware fault? A hardware RAID controller would have prevented this main-memory error, but it yields other failure types like RAID controller malfunction (which in my experience occurs more often than a main-memory error).

Is there really no mechanism that prevents a RAM module producing errors from faulting the entire storage?
An ECC RAM failure should always trigger a machine check exception (MCE), implying a cold reset, deactivation of the memory module and a reboot into the OS. At least that is my experience with enterprise-grade hardware. If you have a hardware error like a memory malfunction or a CPU compute error... how would any software detect this? Everything running inside your OS is already abstracted and, in the case of memory, already virtualized by the MMU. You should also check your PVE logs for recorded MCEs and check the SEL in your management controller.
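A quick way to look for both (assuming ipmitool is installed and the machine has a BMC):
Code:
# machine check / memory related messages in the kernel log
journalctl -k | grep -i -E 'mce|machine check|edac'
# hardware system event log from the management controller
ipmitool sel elist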
 
