ZFS drive replaced and new one added, error

Manny Vazquez

Hi,

One of the drives in my 5-drive raidz2 failed.
I replaced that one drive perfectly fine, and it resilvered in about 35 minutes.
[screenshot: zpool status after the resilver]
BUT my ambition got the best of me and I tried to add a 6th drive to the pool, and I thought it was added just fine...
This is the command I used:
zpool add rpool ata-WDC_WDS100T2B0A_191809802355

But the drive ended up at a different tab level (highlighted):
[screenshot: GUI storage view with the new drive at a different level]
Also here in the terminal:
[screenshot: zpool status in the terminal]
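
(In hindsight, a dry run with the -n flag would have shown the resulting layout without actually changing the pool; something like this, with the same device name:)

Code:
# -n only prints the configuration that would result, nothing is added
zpool add -n rpool ata-WDC_WDS100T2B0A_191809802355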

Now I am told: "Your whole pool is now with no redundancy at all."

How can I fix this without having to reinstall the whole system? I really do not want to drive 200 miles to the datacenter again.

Please.. any idea is appreciated.
 
Hi,

You can't extend an existing raidz2 vdev by just adding another disk; due to how the data is stored on the devices, ZFS doesn't support that directly.
But you could add another whole vdev to the pool to increase the pool's capacity.
You'll need more than one additional disk if you want to retain redundancy. For example, you could use two disks to add a mirror vdev, or 5+ disks for another raidz2 vdev.
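
A quick sketch of what that would look like with a mirror; the device names here are just placeholders, not disks from your system:

Code:
# adds a new mirrored top-level vdev to the existing pool
zpool add rpool mirror ata-DISK_A ata-DISK_B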

For now I'd drop the newly added vdev again with

Code:
zpool remove rpool ata-WDC_WDS100T2B0A_191809802355
 

Thanks, appreciate the reply..

The remove does not work.
[screenshot: error output from zpool remove]
 
Oh, maybe it would be worth getting some remote console, IPMI/iKVM/... :)
What I meant by that was not having to drive there to reinstall Proxmox from scratch, for which I would need to physically connect a USB drive to boot from. I have 100% terminal and web access, but I do not have "hands on" access to actually put a USB drive in the server.
 

It sure doesn't. As soon as ZFS has put data on the drive, which happens almost instantly, ZFS won't let you remove it. You will have no choice but to perform a clean backup of the pool and destroy it. If you can pop in another drive which is big enough to hold all the data, I'd suggest that you perform a ZFS send/receive to the new drive and then destroy the rpool.
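
A minimal sketch of that send/receive approach; the snapshot name and the target pool name 'backup' are placeholders, adjust them to whatever you create on the spare drive:

Code:
# recursive snapshot of the whole pool (snapshot name is arbitrary)
zfs snapshot -r rpool@migrate

# replicate all datasets, snapshots and properties to the new pool,
# then verify the copy before destroying the source
zfs send -R rpool@migrate | zfs receive -F backup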

Btw, whenever I see rpool as a zpool name, I immediately associate that zpool with the boot drive on Illumos-type systems, and I really hope that this is not the case here, because you wouldn't be able to destroy the volume you booted your host from while running live on it…

I'd never, never ever have my guests on the same volume as the one where I installed the OS - this is just asking for trouble…
 

Oops, you got me there..

This was the default setup; I didn't know better, and it had been running with no issue at all for almost 400 days, until that one drive died.
So, no, this is the boot drive, nothing else on this server... Does that mean I am out of luck and my only option is to reformat the server?

I have no data at all on this server, only the replications from the other servers. In theory, I could live like this indefinitely, since the other 2 servers are plenty to host my VMs.

So, what do you recommend I do?
Please, whatever it is, I will do it, even if I have to travel the 200 miles (which would be next week).
If you don't mind, please give me the 'best practices' for reinstalling this server. It only has room for 6 drives, which I replaced about a year ago with SSDs, and this is the first one that has ever given me any problem at all.

So, what do you recommend I set up, if I have to reinstall the server, with only 6 drive bays available in the server?

Thanks
 
Sorry to have hit the nail on the head… Okay, first I'd check whether this server really doesn't have some extra option for using a DOM or other means of local storage to boot from. If it has, use that and install PVE on it. You may also get away with a durable USB thumb drive on which you can install PVE; the amount of logging shouldn't wear it out, if your system is able to boot from a USB device.

If this server is only on the receiving end of replications and you don't rely on the data, you can of course have those copies shipped again.
Running SSDs on a replication target seems a bit over the top, since you will mostly have streaming data from your replication sources and won't benefit from the SSDs' high IOPS, so I'd rather put big spinning disks inside that box, unless you plan on actually running guests on that server.

However you approach it, separate the zpool for local storage from the boot stuff. Even if you lose your entire boot disk, the zpool with your data will still be there and can easily be imported after you have re-installed PVE.

Last tip: practice that scenario. After I set up my two-node PVE at home, I deliberately destroyed the PVE on my replication server to check my assumptions and to see for myself that this worked - and it did, nicely. I re-installed PVE, removed the orphaned node from my cluster setup, imported my zpool for local storage again, and simply re-joined my repaired PVE to the cluster, and everything worked smoothly from then on.
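
Roughly, that drill looked like this; node name, pool name and peer IP are placeholders for whatever your setup uses:

Code:
# on a remaining cluster node: drop the orphaned node from the cluster
pvecm delnode pve-replica

# on the freshly re-installed node: bring the local data pool back
zpool import tank

# re-join the node to the cluster via one of the existing members
pvecm add 192.0.2.10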
 
I appreciate your comments, but there are several points to 'validate':

This is one node of 3. I do put VMs on this node, but my load calculation is that any one of my 3 nodes should be able to handle all the VMs, so I only have 3 because I can :)
I have 3 clusters of 3 servers each, all more or less in the same config:
256 GB RAM, dual CPUs with multiple cores (this one has 6 cores per CPU)

When I said that the only data on this server was some replication info, it is because at this moment, having just replaced the drive, I had moved all the VMs to the other 2 nodes.
[screenshot: cluster overview]
This is the cluster with the node in 'trouble' (pve1, which as you see above has no VMs on it).

[screenshot: another cluster overview]
This is another cluster, with no issues.
[screenshot: another cluster overview]

So, in summary, I could just reinstall the server and be done with it, but I was looking for an alternative so as NOT to have to do that, since the original setup had worked perfectly fine for a long time.

I share all this info in the hope that, seeing the whole picture, you may be able to come up with another alternative that does not involve reformatting, which means driving 200 miles one way, or paying a person (an engineer at a $150-per-hour support fee) just to burn a USB drive, put it in the server, and push the power button.

I am in your savvy hands :) if you can come up with an alternative...

In reality, the only important thing on this node at this moment is the boot drive; I do not care about anything else.

But also, is it true that at this moment I have no redundancy? In other words, if ONE drive fails, the whole system goes? Which I guess I do not care about, since replication is working fine at 3-minute intervals and we do not store data on ANY of the VMs on these drives anyhow; all data storage for the VMs is on the NAS.

Again, sorry for the lengthy reply; I figured you would do a better assessment having more info.
 
Please provide the output of zpool status rpool…
 
I was actually thinking about that: just pulling the drive (it would be 10 minutes of engineer support) and hopefully getting the raidz back to normal after removing that one...

Is that what you mean? Just pull the drive from the server?

And just to make sure, I cannot add a 6th drive to a 5-drive raidz... correct?
 
Sorry, I obviously hit the reply button when I was still researching the issue. Please get me the output of

Code:
zpool status rpool

from the PVE console.
 
It was already posted, but here it is again:

Code:
root@pve1:~# zpool status rpool
  pool: rpool
 state: ONLINE
  scan: resilvered 268G in 0h56m with 0 errors on Wed Mar 18 18:21:56 2020
config:

        NAME                                           STATE     READ WRITE CKSUM
        rpool                                          ONLINE       0     0     0
          raidz2-0                                     ONLINE       0     0     0
            sda3                                       ONLINE       0     0     0
            sde3                                       ONLINE       0     0     0
            sdf3                                       ONLINE       0     0     0
            sdc3                                       ONLINE       0     0     0
            ata-Samsung_SSD_860_QVO_1TB_S4PGNF0M600442P  ONLINE     0     0     0
          ata-WDC_WDS100T2B0A_191809802355             ONLINE       0     0     0

errors: No known data errors
root@pve1:~#
 

Yeah - sorry, I somewhat expected to see a vdev with an underlying device… What ZFS version is on your PVE host? The removal of a top-level vdev should be possible in recent versions of ZFS.
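
If you want to check quickly, something like this; with the caveat that device removal needs the device_removal feature from ZFS 0.8+ and, as far as I know, is not supported on pools that contain a raidz top-level vdev:

Code:
# version of the loaded ZFS kernel module
cat /sys/module/zfs/version

# attempt to remove the single-disk top-level vdev (0.8+ only)
zpool remove rpool ata-WDC_WDS100T2B0A_191809802355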
 
Code:
root@pve1:~# dpkg-query -s zfsutils-linux
Package: zfsutils-linux
Status: install ok installed
Priority: optional
Section: contrib/admin
Installed-Size: 1049
Maintainer: Proxmox Support Team <support@proxmox.com>
Architecture: amd64
Source: zfs-linux
Version: 0.7.13-pve1~bpo2
Provides: zfsutils
Depends: python3, init-system-helpers (>= 1.18~), python3:any, libblkid1 (>= 2.16), libc6 (>= 2.17), libnvpair1linux (>= 0.7.13), libudev1 (>= 183), libuuid1 (>= 2.16), libuutil1linux (>= 0.7.13), libzfs2linux (>= 0.7.13), libzpool2linux (>= 0.7.13), zlib1g (>= 1:1.1.4)
Recommends: lsb-base, zfs-zed
Suggests: nfs-kernel-server, samba-common-bin (>= 3.0.23), zfs-initramfs
Conflicts: zfs, zfs-fuse
Conffiles:
/etc/cron.d/zfsutils-linux 27db3c2e738f030ab15d717cfda8261a
/etc/default/zfs 5483fced1e3de63c0d35f433131c4f26
/etc/sudoers.d/zfs ad829cd4055ab2a8d41f8b5f2b564a3f
/etc/zfs/zfs-functions 87bbaa4b01e66f8d802795e6b5fe8b59
Description: command-line tools to manage OpenZFS filesystems
The Z file system is a pooled filesystem designed for maximum data
integrity, supporting data snapshots, multiple copies, and data
checksums.
.
This package provides the zfs and zpool commands to create and administer
OpenZFS filesystems.
Homepage: http://www.zfsonlinux.org/
root@pve1:~#
 
