Unavailable ZFS pool after restart. What should I do?

glider

New Member
Sep 22, 2021
Today, when I restarted my Proxmox VE host, I couldn't reach the web GUI, so I connected to the video output to check the POST and noticed that the boot process was stuck at the very beginning.

The message I was getting was:

"Failed to import pool 'rpool'"

If I run "zpool import" I get this:

   pool: rpool
     id: 52589027502762957852
  state: UNAVAIL
 status: One or more devices contain corrupted data.
 action: The pool cannot be imported due to damaged devices or data.
 config:

        rpool                                      UNAVAIL  insufficient replicas
          mirror-0                                 ONLINE
            ata-CT500MX500SSD1_1918E1FF2394-part3  ONLINE
            ata-CT500MX500SSD1_1918E1FF2349-part3  ONLINE
          mirror-1                                 UNAVAIL  insufficient replicas
            sdc                                    UNAVAIL
            sdd                                    UNAVAIL



I need to know where to go from here. From what I understand, it seems like sda and sdb loaded correctly while both sdc and sdd are "unavailable".
This is the first time I'm using ZFS. Is it possible that just one of sdc and sdd is faulty and that is taking down the whole pool, or do the two UNAVAIL entries on sdc and sdd mean that both drives are unreachable?

The only other strange thing I noticed is this:
If I type

ls /dev/disk/by-id

I don't get 4 drives; I get 8 entries (plus another 8 wwn-* entries, but let's set those aside for now), like:

ata-CT500MX500SSD1_1918E1FF9032
ata-CT500MX500SSD1_1918E1FF9032-part1
ata-CT500MX500SSD1_1918E1FF9032-part2
ata-CT500MX500SSD1_1918E1FF9032-part3
ata-CT500MX500SSD1_1918E1FFB0D3
ata-CT500MX500SSD1_1918E1FFB0D3-part1
ata-CT500MX500SSD1_1918E1FFB0D3-part2
ata-CT500MX500SSD1_1918E1FFB0D3-part3

Is this normal for a setup like mine, or is there something wrong?

At this point I'm considering removing sdc and sdd to see if the pool comes up again. Is that possible?
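Or would it make more sense to first try a manual import from the rescue shell, pointing ZFS at the by-id names instead of sdc/sdd? I'm only guessing at the syntax here, so please correct me if this is wrong:

zpool import -d /dev/disk/by-id -N rpool
zpool status rpool

(If I read the man page correctly, -d tells it where to look for the devices and -N imports without mounting anything.)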


This is a small staging server and nothing mission-critical runs on it, but I would like to know what is causing the issue and how to recover my zpools in situations like this.
The disks are MX500 SSDs, and I know they are consumer grade, but they are about a year old and the server has been used VERY sporadically, so I find it hard to believe this is an actual hardware problem.
I'm more inclined to believe this has something to do with the update, but I want to hear your opinions on this.
I kindly thank you for your help, and I hope I can share my own little bit of knowledge with you in the future.
 
This is the first time I'm using ZFS. Is it possible that just one of sdc and sdd is faulty and that is taking down the whole pool, or do the two UNAVAIL entries on sdc and sdd mean that both drives are unreachable?
It sounds like both disks aren't available.
ata-CT500MX500SSD1_1918E1FF9032
ata-CT500MX500SSD1_1918E1FF9032-part1
ata-CT500MX500SSD1_1918E1FF9032-part2
ata-CT500MX500SSD1_1918E1FF9032-part3
ata-CT500MX500SSD1_1918E1FFB0D3
ata-CT500MX500SSD1_1918E1FFB0D3-part1
ata-CT500MX500SSD1_1918E1FFB0D3-part2
ata-CT500MX500SSD1_1918E1FFB0D3-part3
Those are just two disks; six of those entries are just partitions on those two drives. You can ignore the wwn-* entries, they refer to the same drives.
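If you want to double-check that, resolve the symlinks, e.g. (the by-id name is just taken from your listing above):

ls -la /dev/disk/by-id | grep -v wwn
readlink -f /dev/disk/by-id/ata-CT500MX500SSD1_1918E1FF9032

Each ata-* entry should point at one whole disk (/dev/sdX) and each -partN entry at the matching partition (/dev/sdXN).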
Is this normal for a setup like mine, or is there something wrong?
You should see all four drives, not just these two.
At this point I'm considering removing sdc and sdd to see if the pool comes up again. Is that possible?
If both sdc and sdd are dead, you have lost all your data, even the data on sda and sdb. A striped mirror can only tolerate one failing drive per mirror. You have one mirror where both drives failed, so the whole pool stays unavailable as long as you don't get those two disks to work again (check cables and so on).
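For reference, a striped mirror like yours is the layout you would get from something of this general form (placeholders only, not something to run now):

zpool create <pool> mirror <disk1> <disk2> mirror <disk3> <disk4>

Data is striped across the two mirrors, so every mirror needs at least one working member for the pool to be importable.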
This is a small staging server and nothing mission-critical runs on it, but I would like to know what is causing the issue and how to recover my zpools in situations like this.
The disks are MX500 SSDs, and I know they are consumer grade, but they are about a year old and the server has been used VERY sporadically, so I find it hard to believe this is an actual hardware problem.
Run smartctl -a /dev/sda to see details about the drive's health. How much have they written? With ZFS you get a lot of write amplification, so it is still possible that you wrote them to death after only one year. Depending on your workload, you can kill consumer SSDs within months.
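For example, to pull the wear-related values from all four disks at once (the exact attribute names vary by vendor; on those Crucial drives it should be something like Total_LBAs_Written and Percent_Lifetime_Remain):

for d in /dev/sd[abcd]; do
  echo "== $d =="
  smartctl -a "$d" | grep -iE 'total_lbas_written|percent_lifetime|wear|realloc'
done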
I'm more inclined to believe this has something to do with the update, but I want to hear your opinions on this.
Did you update from PVE 6 to 7 shortly before this happened?
 
Did you update from PVE 6 to 7 shortly before this happened?
Yes. Sorry, I forgot to mention that in my first message; I must have deleted the sentence somehow.
I updated from PVE 6 to 7 just a week ago.
I will now check the health status of those disks.
Anything can happen, but I find it so strange that two disks would fail at the same time!
If that's what happened, I will definitely go for enterprise disks in my next setup.
Thank you very much for your input.
 
I don't get 4 drives; I get 8 entries (plus another 8 wwn-* entries, but let's set those aside for now), like:

ata-CT500MX500SSD1_1918E1FF9032
ata-CT500MX500SSD1_1918E1FF9032-part1
ata-CT500MX500SSD1_1918E1FF9032-part2
ata-CT500MX500SSD1_1918E1FF9032-part3
ata-CT500MX500SSD1_1918E1FFB0D3
ata-CT500MX500SSD1_1918E1FFB0D3-part1
ata-CT500MX500SSD1_1918E1FFB0D3-part2
ata-CT500MX500SSD1_1918E1FFB0D3-part3
Please post the complete `lsblk` and `ls -la /dev/disk/by-id` output - it looks to me as if the disks might still be there (the serial numbers here (1918E1FF9032, 1918E1FFB0D3) are different from the ones zpool finds (1918E1FF2394, 1918E1FF2349)).
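Something like this should make the comparison easy, since lsblk can print model and serial directly:

lsblk -o NAME,MODEL,SERIAL,SIZE,TYPE
ls -la /dev/disk/by-id | grep -v part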


On a hunch - try setting a root delay on the kernel command line - see https://pve.proxmox.com/wiki/ZFS:_Tips_and_Tricks#Boot_fails_and_goes_into_busybox
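Roughly - depending on whether the host boots via GRUB or systemd-boot - it should be along these lines (double-check against the wiki page above):

# GRUB: add e.g. rootdelay=10 to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then
update-grub
# systemd-boot (ZFS on UEFI): append it to /etc/kernel/cmdline, then
proxmox-boot-tool refresh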

Otherwise - check `dmesg` and `journalctl -b` in the rescue shell for further hints about what's going wrong.
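For example:

dmesg | grep -iE 'ata|sd[cd]|zfs'
journalctl -b | grep -iE 'zfs|zpool|import'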

I hope this helps!
 
I need to update this post because today the server started without problems and the zpool has all its devices available and functioning.
I didn't change anything in the hardware or the software; it simply started as if nothing had ever happened.
I will now dig further into dmesg and SMART to try to understand more, and I will try to add the boot delay that several people seem to advise; I will also try to update my initramfs.
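If I understood the docs correctly, that should be something like the following (please correct me if I got it wrong):

update-initramfs -u -k all
proxmox-boot-tool refresh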
I'll keep the post updated if I manage to understand more about this problem, in case someone else ends up here with the same issue.
 