PVE Wiki: improving ZFS "Changing a failed device"?

Sep 13, 2022
Hi,

how can I suggest improvements to the Wiki, like ZFS::Changing a failed device? If I see correctly, the Admin Manual references the Wiki, and the information there is very brief. In particular, I think users could end up replacing with something like /dev/sdX, which is not recommended.

I suggest adding some guidance on how to find the ID. I used a combination of dmesg, fdisk -l and ls -l /dev/disk/by-id/|grep sdX. There is probably a more elegant way, but with this I successfully replaced a device, and ZFS automatically partitioned the disk (with the 8 MB safety margin).

In detail:

Changing a failed device

First look up the failed device using the command "zpool status". The bad disk should be marked as "FAULTED". Remember the device name (e.g. "ata-TOSHIBA_MG04ACA400E_1234567800").
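
For orientation, the status of a degraded pool could look roughly like this (pool layout, device names and error counters are only illustrative):
Bash:
root@pve:~# zpool status dpool
  pool: dpool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
config:

        NAME                                    STATE     READ WRITE CKSUM
        dpool                                   DEGRADED     0     0     0
          mirror-0                              DEGRADED     0     0     0
            ata-TOSHIBA_MG04ACA400E_1234567800  FAULTED      3   121     0  too many errors
            ata-TOSHIBA_MG04ACA400E_0987654321  ONLINE       0     0     0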

Now replace the disk and find out what it is called. When using hotplug, dmesg should show a very recent new disk device, like
Bash:
root@pve:~# dmesg
[1234.943372] ata5.00: ATA-8: TOSHIBA MG04ACA400E, FP4B, max UDMA/100
[1234.943898] ata5.00: 7814037168 sectors, multi 16: LBA48 NCQ (depth 32), AA
[1234.945993] ata5.00: configured for UDMA/100
[1234.946136] scsi 4:0:0:0: Direct-Access     ATA      TOSHIBA MG04ACA4 FP4B PQ: 0 ANSI: 5
[1234.946284] sd 4:0:0:0: Attached scsi generic sg0 type 0
[1234.946581] sd 4:0:0:0: [sda] 7814037168 512-byte logical blocks: (4.00 TB/3.64 TiB)
[1234.946585] sd 4:0:0:0: [sda] 4096-byte physical blocks

If the disk was replaced before a reboot, /var/log/kern.log contains the device name among all the others; be careful not to mix them up.
To get further information, fdisk -l, gdisk -l, sfdisk -l and sgdisk -l can also be helpful; others suggest lsblk.
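
For example, lsblk can also show model and serial number, which often makes the new disk easy to spot (the column selection is just one possibility, output shortened and illustrative):
Bash:
root@pve:~# lsblk -o NAME,SIZE,MODEL,SERIAL
NAME   SIZE MODEL            SERIAL
sda    3.6T TOSHIBA MG04ACA4 1234567801
...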

It is recommended not to use this possibly dynamic device name, but a fixed one instead, preferably the Linux disk ID. A simple way to find it for /dev/sda could be
Bash:
root@pve:~# ls -l /dev/disk/by-id/|grep sda
lrwxrwxrwx 1 root root  9 Nov  9 15:10 ata-TOSHIBA_MG04ACA400E_1234567801 -> ../../sda

Finally, replace it, wait a few seconds and query the status. Assuming the pool is called dpool, it could look like this:

Bash:
root@pve:~# zpool replace dpool ata-TOSHIBA_MG04ACA400E_1234567800 ata-TOSHIBA_MG04ACA400E_1234567801
root@pve:~# zpool status dpool
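
While the resilver is running, the status should show the old and the new device together under a "replacing" vdev, roughly like this (layout and progress are only illustrative); once the resilver finishes, the old device disappears from the list:
Bash:
root@pve:~# zpool status dpool
  pool: dpool
 state: DEGRADED
  scan: resilver in progress since ...
config:

        NAME                                      STATE     READ WRITE CKSUM
        dpool                                     DEGRADED     0     0     0
          mirror-0                                DEGRADED     0     0     0
            replacing-0                           DEGRADED     0     0     0
              ata-TOSHIBA_MG04ACA400E_1234567800  FAULTED      3   121     0
              ata-TOSHIBA_MG04ACA400E_1234567801  ONLINE       0     0     0  (resilvering)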
 
I think users could end up replacing with something like /dev/sdX, which is not recommended.
This is a myth. I have never had any problems with sdXX (yes, I run a system with sdcX), and why should you? Even 15 years ago you could use USB drives in your pool, swap them around on different USB ports, and everything went smoothly. In general, you should identify your drives by their bus number. That is why "real" servers with caddies have that number printed on them, corresponding directly to their port and therefore their bus ID.
 
This is a myth. I have never had any problems with sdXX (yes, I run a system with sdcX), and why should you? Even 15 years ago you could use USB drives in your pool, swap them around on different USB ports, and everything went smoothly. In general, you should identify your drives by their bus number. That is why "real" servers with caddies have that number printed on them, corresponding directly to their port and therefore their bus ID.
Good to hear that you happened to never have problems. For LVM, device names usually do not matter either. Are you saying ZFS does not use the device names but performs some autodetection that works as long as all devices can be found? The OpenZFS FAQ suggests the opposite. Or maybe you are not using a cache file (reference)?
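
By the way, whether a pool uses a cache file can be checked via the cachefile pool property (the pool name here is just an example):
Bash:
root@pve:~# zpool get cachefile dpool
NAME   PROPERTY   VALUE      SOURCE
dpool  cachefile  -          default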

EDIT: Why should I identify disks to ZFS (as opposed to the admin) by bus number? If devices were accidentally swapped between slots, it could fail.
EDIT ANSWER: To lower the risk of mistakes when designing redundancy groups, especially with many disks (>10).

However, in case ZFS does use the device names, which AFAIK many sources suggest, others may find themselves using multiple controllers, or controllers that do not map to stable device names, or needing to connect the disks to another system after a controller failure, for example; or consider virtualized environments where a controller is passed through via PCI but is not the boot device, or whatever. Having this "ata" part IMHO could already cause trouble. I think that is why some suggest GPT partition UUIDs, but on the other hand those are the hardest for an operator to use when a disk needs to be replaced, I think.
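
For comparison, the kernel offers several stable naming schemes under /dev/disk/, e.g. by-id (model + serial), by-path (controller/port) and by-partuuid (GPT partition UUID); which entries exist depends on the system:
Bash:
root@pve:~# ls /dev/disk/
by-id  by-label  by-partuuid  by-path  by-uuid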

But in general, such trouble seems possible and does happen; a quick Google search found this and this, for example.
 
But in general, such trouble seems possible and does happen

Probably yes. Perhaps only on "old" systems.

Because I wanted to know what behavior to expect on my systems, I ran a quick test with Ubuntu 22.04 a moment ago. I created a simple VM and added a zpool using sdb, c, d (sda being the system's OS disk). Then I did several reboots, flipping the disk images around in between. Each and every try succeeded in finding the right devices. No surprise here.

Then I added some more disks and rearranged them so they got mixed into the sequence; instead of b, c, d my pool disks were now sdc, e, g. And this also worked without losing the pool. It seems to be more robust than it used to be in the far past :-)
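
For reference, the pool creation was roughly along these lines (exact command from memory; raidz1 and the device names are inferred from the status output further below):
Bash:
root@zfstest:~# zpool create ppp raidz1 /dev/sdb /dev/sdc /dev/sdd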
 
Probably yes. Perhaps only on "old" systems.

Because I wanted to know what behavior to expect on my systems, I ran a quick test with Ubuntu 22.04 a moment ago. I created a simple VM and added a zpool using sdb, c, d (sda being the system's OS disk). Then I did several reboots, flipping the disk images around in between. Each and every try succeeded in finding the right devices. No surprise here.

Then I added some more disks and rearranged them so they got mixed into the sequence; instead of b, c, d my pool disks were now sdc, e, g. And this also worked without losing the pool. It seems to be more robust than it used to be in the far past :)
Great that you simply tested it - thank you.
Did you use a cache file? I've read that problems occur if a cache file is used (see my previous post for a link).
 
And this also worked without losing the pool.
I have never seen it otherwise. ZFS, like LVM, uses identifiers on disk, so it just looks for them and loads the disks. This is not the case for fstab; there you need the filesystem UUID to get a similar "slot-independent" setup.
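
A minimal fstab entry of that kind, with the UUID and mount point as placeholders:
Code:
# /etc/fstab: mount by filesystem UUID instead of a /dev/sdX name
UUID=0a1b2c3d-4e5f-6789-abcd-ef0123456789  /mnt/data  ext4  defaults  0  2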
 
Did you use a cache file?
My test VM was still available, so I added one - but please keep in mind: this is a very simplified test without any load:

Code:
My testpool "ppp" consists of 8GB disks:
root@zfstest:~# zpool status
  pool: ppp
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        ppp         ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdc     ONLINE       0     0     0

root@zfstest:~# ls -al /dev/disk/by-partuuid/
total 0
drwxr-xr-x 2 root root 200 Nov 15 07:37 .
drwxr-xr-x 8 root root 160 Nov 15 07:37 ..
lrwxrwxrwx 1 root root  10 Nov 15 07:37 13e683de-03aa-eb49-9fc5-e70e59b639e3 -> ../../sdg1
lrwxrwxrwx 1 root root  10 Nov 15 07:37 1af2524c-7abc-8646-bb4f-e60ccffe44d8 -> ../../sdc1
lrwxrwxrwx 1 root root  10 Nov 15 07:37 25a1c7b2-cfaa-cd41-b193-8d4da64d9bed -> ../../sde1
lrwxrwxrwx 1 root root  10 Nov 15 07:37 48cf3cc8-c8a6-8f4b-9068-98ef0ebecc43 -> ../../sdg9
lrwxrwxrwx 1 root root  10 Nov 15 07:37 8e492a3a-64c8-f74b-85c6-4a6afaef0563 -> ../../sde9
lrwxrwxrwx 1 root root  10 Nov 15 07:37 acac595c-f342-48c1-870e-1c799570fe0c -> ../../sda2
lrwxrwxrwx 1 root root  10 Nov 15 07:37 eb19dcc1-9f25-4403-b775-98e27cd00013 -> ../../sda1
lrwxrwxrwx 1 root root  10 Nov 15 07:37 f7ca3d71-2327-ff42-86d4-d4bc5b053520 -> ../../sdc9

The 2 GB disks are dummies (not used), only present to shuffle them around:
root@zfstest:~# lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda      8:0    0    6G  0 disk
├─sda1   8:1    0    1M  0 part
└─sda2   8:2    0    6G  0 part /
sdb      8:16   0    2G  0 disk
sdc      8:32   0    8G  0 disk
├─sdc1   8:33   0    8G  0 part
└─sdc9   8:41   0    8M  0 part
sdd      8:48   0    2G  0 disk
sde      8:64   0    8G  0 disk
├─sde1   8:65   0    8G  0 part
└─sde9   8:73   0    8M  0 part
sdf      8:80   0    2G  0 disk
sdg      8:96   0    8G  0 disk
├─sdg1   8:97   0    8G  0 part
└─sdg9   8:105  0    8M  0 part

Now add another disk, 1 GB in size for simple recognition:
Code:
New disk became sdb
root@zfstest:~# lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda      8:0    0    6G  0 disk
├─sda1   8:1    0    1M  0 part
└─sda2   8:2    0    6G  0 part /
sdb      8:16   0    1G  0 disk
sdc      8:32   0    2G  0 disk
sdd      8:48   0    8G  0 disk
├─sdd1   8:49   0    8G  0 part
└─sdd9   8:57   0    8M  0 part
sde      8:64   0    8G  0 disk
├─sde1   8:65   0    8G  0 part
└─sde9   8:73   0    8M  0 part
sdf      8:80   0    2G  0 disk
sdg      8:96   0    2G  0 disk
sdh      8:112  0    8G  0 disk
├─sdh1   8:113  0    8G  0 part
└─sdh9   8:121  0    8M  0 part

Add cache without partition table. 
root@zfstest:~# zpool  add ppp cache /dev/sdb

root@zfstest:~# zpool status
  pool: ppp
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        ppp         ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sdh     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
        cache
          sdb       ONLINE       0     0     0

root@zfstest:~# lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda      8:0    0    6G  0 disk
├─sda1   8:1    0    1M  0 part
└─sda2   8:2    0    6G  0 part /
sdb      8:16   0    1G  0 disk 
├─sdb1   8:17   0 1014M  0 part     # new cache
└─sdb9   8:25   0    8M  0 part
sdc      8:32   0    2G  0 disk
sdd      8:48   0    8G  0 disk
├─sdd1   8:49   0    8G  0 part
└─sdd9   8:57   0    8M  0 part
sde      8:64   0    8G  0 disk
├─sde1   8:65   0    8G  0 part
└─sde9   8:73   0    8M  0 part
sdf      8:80   0    2G  0 disk
sdg      8:96   0    2G  0 disk
sdh      8:112  0    8G  0 disk
├─sdh1   8:113  0    8G  0 part
└─sdh9   8:121  0    8M  0 part

Flip some disks:

root@zfstest:~# lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda      8:0    0    6G  0 disk
├─sda1   8:1    0    1M  0 part
└─sda2   8:2    0    6G  0 part /
sdb      8:16   0    8G  0 disk
├─sdb1   8:17   0    8G  0 part
└─sdb9   8:25   0    8M  0 part
sdc      8:32   0    2G  0 disk
sdd      8:48   0    1G  0 disk    # cache sdd
├─sdd1   8:49   0 1014M  0 part
└─sdd9   8:57   0    8M  0 part
sde      8:64   0    2G  0 disk
sdf      8:80   0    8G  0 disk
├─sdf1   8:81   0    8G  0 part
└─sdf9   8:89   0    8M  0 part
sdg      8:96   0    2G  0 disk
sdh      8:112  0    8G  0 disk
├─sdh1   8:113  0    8G  0 part
└─sdh9   8:121  0    8M  0 part

root@zfstest:~# zpool status
  pool: ppp
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        ppp         ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sdh     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
        cache
          sdd       ONLINE       0     0     0


So it seems that my test setup can handle flipping sda through sdh around, with three data disks and one cache device.
 
