VMs on ZFS not booting after disk replacement

decibel83

Renowned Member
Oct 15, 2008
Hi,
on my Proxmox 6 system I have some ZFS pools, one of which is a raidz1 pool with 6 drives.
One drive had a failure prediction on SMART so I replaced it with a new one following the guide at https://pve.proxmox.com/wiki/ZFS:_Tips_and_Tricks#Replacing_a_failed_disk_in_the_root_pool.
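For reference, these are roughly the steps I ran per that guide (sdh is the new disk in my case; device names will differ on other systems):

Code:
  zpool offline sas sdh
  # physically swap the drive, then:
  zpool replace sas sdh
  zpool status sas    # shows the resilver progress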
After that the pool began to resilver, but unfortunately all virtual machines on that ZFS pool no longer boot:

[screenshot: the VMs failing to start]

ZFS volumes are still here:
[screenshot: list of the ZFS volumes]

And this is the pool status:

Code:
  pool: sas
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Feb 28 16:57:23 2020
    3.31G scanned at 2.63M/s, 1.34G issued at 1.06M/s, 110G total
    26.6M resilvered, 1.22% done, no estimated completion time
config:

    NAME                       STATE     READ WRITE CKSUM
    sas                        DEGRADED     0     0     0
      raidz1-0                 DEGRADED     0     0     0
        sdb                    DEGRADED     0     0  617K  too many errors  (resilvering)
        sdc                    DEGRADED     0     0  616K  too many errors  (resilvering)
        sde                    DEGRADED     0     0  617K  too many errors  (resilvering)
        sdf                    DEGRADED     0     0  616K  too many errors  (resilvering)
        sdg                    ONLINE       0     0  620K  (resilvering)
        replacing-5            DEGRADED     0     0  505K
          1098362286105437803  UNAVAIL      0     0     0  was /dev/sdh1/old
          sdh                  ONLINE       0     0     0  (resilvering)

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x284>
        sas/vm-204-disk-0:<0x1>
        sas/vm-208-disk-0:<0x0>
        sas/vm-205-disk-0:<0x1>
        sas/vm-208-disk-1:<0x0>
        sas/vm-207-disk-0:<0x1>

Could you help me please?
 
Your pool is corrupt, as the output of the command shows. Every disk is showing errors; this is often caused by total disk damage from heat (all disks ran too warm) or by a bad cable. You cannot do anything besides reinstalling and restoring from backup.
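If you have vzdump backups, restoring after the reinstall is quick; roughly like this (archive path and name are just an example):

Code:
  # restore VM 204 from a vzdump archive onto the rebuilt storage
  qmrestore /mnt/backup/vzdump-qemu-204-2020_02_27-00_00_01.vma.lzo 204 --storage sas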
 
Your pool is corrupt, as the output of the command shows. Every disk is showing errors; this is often caused by total disk damage from heat (all disks ran too warm) or by a bad cable. You cannot do anything besides reinstalling and restoring from backup.

I understand that, but it's very strange that this happened after a disk replacement, isn't it?
 
Have you turned off your server by any chance?
How old are your drives?

I was in the very same situation a few weeks ago (mirrored vdevs though): after a RAM upgrade of my server (for which I shut it down), the drives started showing erratic behavior, creating read and write errors.
After a lot of wasted time I went and inspected all my drives. Many of them had been running for over 7 years. One of them died (in a mirror) and the second drive produced read errors. That was it for me.
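If you want to check drive age and error history quickly, smartmontools can show it per drive; for example (attribute names vary by vendor):

Code:
  # power-on hours and reallocated/pending sectors for one drive
  smartctl -a /dev/sdb | grep -i -e power_on -e reallocated -e pending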

I think you are in a similar (if not the same) situation. You could check your cables, but I guess you are really in trouble.
Code:
errors: Permanent errors have been detected in the following files:
Once you see that, I think there is no going back. Expect to lose the pool (and your data); you are not getting it back.
You can run "zpool status -v" to see which datasets have unrecoverable data. In my case all major datasets were affected.

Aside from that: recovery stresses the remaining devices, so they often fail during the rebuild. This is a common problem with RAID and RAID-like storage technologies (even with ZFS). The only good thing about it: ZFS tells you there is a problem, instead of letting you rely on the array until the moment you try to read your data and can't.
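That is also why regular scrubs are worthwhile: they surface latent errors while the redundancy is still intact, rather than in the middle of a rebuild. For example:

Code:
  zpool scrub sas
  zpool status sas    # check progress and the error counters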
Depending on your setup, you could also be the victim of a memory error. If you are not using ECC memory, a defective memory module can also cause similar behavior.
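Without ECC you only find out by testing; a userspace pass with memtester (or a memtest86+ boot) can catch gross failures, for example:

Code:
  # test 2 GiB of RAM for one pass (needs that much free memory)
  memtester 2G 1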
 
You could check your cables, but I guess you are really in trouble.

Unfortunately yes, you're in trouble.
I never had a bad cable and was always puzzled when someone talked about it, until 4 months ago: I built a new system and one SAS connector (with 4 drives behind it) was flaky. After wiggling it a bit it worked, but I had the strangest I/O problems.

If you are not using ECC memory, a defective memory module can also cause similar behavior.

I just want to state: in any setup, not just ZFS.
It's often written that ZFS requires ECC, which is not true. Every filesystem works better with ECC, but they also work without it.
 
I just want to state: in any setup, not just ZFS.
It's often written that ZFS requires ECC, which is not true. Every filesystem works better with ECC, but they also work without it.
I just pointed out that with non-ECC memory you can have a defective module without realizing it, causing you all sorts of trouble (including disk corruption; have been there, didn't like it). That is a problem in any situation. That's why I would always recommend ECC memory, since I am concerned about my data.
The same goes for storage that doesn't use checksums: you think you are fine, but you may not be. Choose what best suits you ;)
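For what it's worth, ZFS checksums are enabled by default; you can verify the setting on a pool like this:

Code:
  zfs get checksum sas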
 
I just pointed out that with non-ECC memory you can have a defective module without realizing it, causing you all sorts of trouble (including disk corruption; have been there, didn't like it). That is a problem in any situation. That's why I would always recommend ECC memory, since I am concerned about my data.
The same goes for storage that doesn't use checksums: you think you are fine, but you may not be. Choose what best suits you ;)

Yes, sure, you're totally right. I just wanted to point out that this misconception about ZFS is often repeated and many people still believe it.
 
Many years, more than 5, almost 7-8.
There you go. This is likely the problem (combined with you powering off your system).
You are in the very same situation I was in 4 weeks ago. You have my sympathy.
I would recommend not wasting any more time on this gear. Get new drives and start over.
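Once the new drives are in, recreating the pool is quick; roughly like this (device names are examples, and this wipes them):

Code:
  zpool destroy sas
  zpool create sas raidz1 sdb sdc sde sdf sdg sdh

Using /dev/disk/by-id paths instead of sdX names is more robust across reboots.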
 
