[SOLVED] 92 Hard Disks - Improving the naming practice?

mikeinnyc

I have a D3284 with 84 HDDs, and my primary NODE1 with its 8 SSDs already named.

My plan is 16 RAIDZ3 pools from the 84 disks: each RAIDZ3 pool needs at least 5 disks, so 84 disks divide into 16 pools of 5 disks each, with 4 left over for other use.
How can this naming process be improved?
No matter what - even after fdisking and deleting the pools - when I start creating NEW pools, the disk naming jumps to different disks after about 8 pools have been created.
The IDs are different too, so I'm at a loss. When I reboot Node 1 first and then start my direct-attached storage, the HDD naming goes back to normal. Then I start creating pools, but only after fdisking, deleting the pools, rebooting, and checking the pool status do I begin. I thought ordering didn't matter with storage pools.

What I want to know is whether the naming process runs from sda all the way to sdz.

Since the first 8 SSDs are already named sda to sdh,

should I start the new pools with the next device after sdh - sdi, sdj, sdk, sdl, sdm - and continue all the way to sdz, then on to sdaa, sdab, and so on?

I'm thinking this is why this is happening to me. I'll let everyone know whether continuing to sdz pays off.
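
(For anyone reading along: ZFS doesn't actually care which sdX letter a disk ends up with - pool members are tracked by the GUIDs in their on-disk labels. A quick way to see that, as a sketch only - the pool name here is just the one used later in this thread, and sdi1 is an example partition:)

Code:
# show the pool members by GUID rather than by device name
zpool status -g DASPool1
# dump the ZFS label on one member partition and look at its stored guid
zdb -l /dev/sdi1 | grep -m1 guid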
 

Attachments

  • Screenshot 2023-01-30 at 15-43-15 Screenshot.png (119 KB)
  • HDD.pdf (246.3 KB)
Last edited:
Pilot error - I left a straggler and didn't fdisk it fully.
I assumed the Proxmox GUI was correct; maybe it was, but I haven't tried to replicate this.
It was only when I ran the following that I spotted it:
Code:
ls -lF /dev/disk/by-id/

lrwxrwxrwx 1 root root 10 Jan 30 16:46 wwn- -> ../../sddb
lrwxrwxrwx 1 root root 10 Jan 30 10:56 wwn- -> ../../sdeq
lrwxrwxrwx 1 root root 10 Jan 30 10:56 wwn- -> ../../sded
lrwxrwxrwx 1 root root 10 Jan 30 16:34 wwn- -> ../../sdau
lrwxrwxrwx 1 root root 11 Jan 30 16:34 wwn--part1 -> ../../sdea1
lrwxrwxrwx 1 root root 11 Jan 30 16:34 wwn--part9 -> ../../sdea9
lrwxrwxrwx 1 root root 10 Jan 30 16:30 wwn- -> ../../sdah
lrwxrwxrwx 1 root root 10 Jan 30 16:43 wwn- -> ../../sdcw

I'm going to assume that because this disk wasn't fully wiped of its partitions, that's the reason for my errors when creating new pools - and since it only shows up a few pools in, that's exactly where I'd hit this mistake. Man, I need coffee.
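
(For reference, rather than fdisking each drive by hand, leftover partition tables and old ZFS labels can be cleared in one go - a sketch only, using the sdea straggler from the listing above as the example device and assuming gdisk/sgdisk is installed:)

Code:
# clear any old ZFS label first (partition device, example only)
zpool labelclear -f /dev/sdea1
# then wipe remaining filesystem/RAID/ZFS signatures and the partition table
wipefs -a /dev/sdea
sgdisk --zap-all /dev/sdea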
 
Last edited:
OK, we have a problem. After all 84 disks were fdisked and I verified that no partitions remained, the same error happens again.

I created a Z3 pool (Pool1) with disks added from sdi to sdm.
After verifying the disk order, here's what the randomness shows: the pool was created in order, sdi to sdm, as a 5-disk RAIDZ3, but it reads back with out-of-order disks.
sda/sdb = OS (2 SSDs)
sdc-sdh = localzfspool (6 SSDs)

I created DASPool1 as a 5-disk RAIDZ3 starting at sdi through sdm... didn't I? I know I did. So why is this happening?

Code:
root@NODE1:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 372.6G 0 disk
├─sda1 8:1 0 1007K 0 part
├─sda2 8:2 0 512M 0 part
└─sda3 8:3 0 372.1G 0 part
sdb 8:16 0 372.6G 0 disk
├─sdb1 8:17 0 1007K 0 part
├─sdb2 8:18 0 512M 0 part
└─sdb3 8:19 0 372.1G 0 part
sdc 8:32 0 372.6G 0 disk
├─sdc1 8:33 0 372.6G 0 part
└─sdc9 8:41 0 8M 0 part
sdd 8:48 0 372.6G 0 disk
├─sdd1 8:49 0 372.6G 0 part
└─sdd9 8:57 0 8M 0 part
sde 8:64 0 372.6G 0 disk
├─sde1 8:65 0 372.6G 0 part
└─sde9 8:73 0 8M 0 part
sdf 8:80 0 372.6G 0 disk
├─sdf1 8:81 0 372.6G 0 part
└─sdf9 8:89 0 8M 0 part
sdg 8:96 0 372.6G 0 disk
├─sdg1 8:97 0 372.6G 0 part
└─sdg9 8:105 0 8M 0 part
sdh 8:112 0 372.6G 0 disk
├─sdh1 8:113 0 372.6G 0 part
└─sdh9 8:121 0 8M 0 part
sdi 8:128 0 1.8T 0 disk
├─sdi1 8:129 0 1.8T 0 part
└─sdi9 8:137 0 8M 0 part
sdj 8:144 0 1.8T 0 disk
sdk 8:160 0 1.8T 0 disk
sdl 8:176 0 1.8T 0 disk
├─sdl1 8:177 0 1.8T 0 part
└─sdl9 8:185 0 8M 0 part
sdm 8:192 0 1.8T 0 disk
sdn 8:208 0 1.8T 0 disk
sdo 8:224 0 1.8T 0 disk
sdp 8:240 0 1.8T 0 disk
sdq 65:0 0 1.8T 0 disk
sdr 65:16 0 1.8T 0 disk
sds 65:32 0 1.8T 0 disk
sdt 65:48 0 1.8T 0 disk
sdu 65:64 0 1.8T 0 disk
sdv 65:80 0 1.8T 0 disk
sdw 65:96 0 1.8T 0 disk
sdx 65:112 0 1.8T 0 disk
├─sdx1 65:113 0 1.8T 0 part
└─sdx9 65:121 0 8M 0 part
sdy 65:128 0 1.8T 0 disk
├─sdy1 65:129 0 1.8T 0 part
└─sdy9 65:137 0 8M 0 part
sdz 65:144 0 1.8T 0 disk
sdaa 65:160 0 1.8T 0 disk
├─sdaa1 65:161 0 1.8T 0 part
└─sdaa9 65:169 0 8M 0 part
sdab 65:176 0 1.8T 0 disk
sdac 65:192 0 1.8T 0 disk
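
(A way to see which physical disks the pool actually grabbed, regardless of the sdX letters - a sketch, using the pool name from above and sdi as the example member:)

Code:
zpool status DASPool1
# map any sdX member back to its persistent id (whole-disk links only)
ls -l /dev/disk/by-id/ | grep -w sdi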
 

Last edited:
I'm going to go with the recommendation of a RAIDZ-3 configuration that maximizes disk space and can withstand 3 disk failures: triple-parity RAID-Z (raidz3). I should have done this first. The damn OpenAI crap was wrong after all.

I will use a 9-disk configuration (6 data + 3 parity), but I'll do it tomorrow from the CLI.
Code:
# zpool create rzpool raidz3 c0t0d0 c1t0d0 c2t0d0 c3t0d0 c4t0d0
  c5t0d0 c6t0d0 c7t0d0 c8t0d0
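
(Those are Solaris-style device names from the Oracle docs; on Proxmox the equivalent create would look roughly like this - a sketch only, with placeholder by-id paths standing in for the nine real disks:)

Code:
# 9-disk raidz3 (6 data + 3 parity); substitute the nine stable by-id
# (or multipath) paths for the EXAMPLE placeholders below
D=/dev/disk/by-id
zpool create rzpool raidz3 \
  $D/wwn-0xEXAMPLE1 $D/wwn-0xEXAMPLE2 $D/wwn-0xEXAMPLE3 \
  $D/wwn-0xEXAMPLE4 $D/wwn-0xEXAMPLE5 $D/wwn-0xEXAMPLE6 \
  $D/wwn-0xEXAMPLE7 $D/wwn-0xEXAMPLE8 $D/wwn-0xEXAMPLE9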
 
Last edited:
a few things to consider:
- raidz has a lot of downsides that you should be aware of (high overhead with zvols, lower iops compared to mirrored setups)
- if you can live with those, you might want to investigate draid for faster resilvering

you can give your disks stable names/paths, the /dev/sdX ones are just assigned by the kernel in the order the disks are enumerated. one possibility is udev (it already assigns alternative names, check /dev/disk/...), which also supports custom rules. ZFS can also hook into that via vdev_id - see its man page and the example configs in /usr/share/doc/zfsutils-linux/examples.
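
(To illustrate the vdev_id route - a sketch of /etc/zfs/vdev_id.conf using its alias form; the alias names and wwn paths below are made-up examples, and the resulting links show up under /dev/disk/by-vdev/:)

Code:
# /etc/zfs/vdev_id.conf - map stable device links to human-readable aliases
#       alias-name       device-link
alias   shelf1-slot01    /dev/disk/by-id/wwn-0xEXAMPLE01
alias   shelf1-slot02    /dev/disk/by-id/wwn-0xEXAMPLE02

# after editing, regenerate the udev links; they appear under /dev/disk/by-vdev/
udevadm trigger && udevadm settle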
 
I just had a good 5 hours of light sleep... again. Last year I had no problems creating fresh Z3 pools, but after some sleep I notice I'm seeing 2x my hard drives - not 84, but double that.
draid - I love the fact that the rebuild is spread across basically all the disks in the vdev instead of going to a single disk. Since the writes occur to all disks in the vdev simultaneously, it's faster than rebuilding onto one drive. My only concern is whether it's stable - that's my fear; it looks good, but does it bite? Fabian, you have my interest, since I'm at ground zero again.
 
Last edited:
it seems like your storage enclosure is multipath/dual-channel, so you "see" each disk twice (because there's two connections to it). you either need to disable this on the enclosure side or configure multipath so that PVE knows to treat each "pair" as a single disk.

regarding the other question - draid is considered stable by the upstream ZFS developers. like I said, it suffers from similar downsides as raidz w.r.t. performance and potential space overhead, depending on workload.
 
Well, the serial numbers are different in the CLI but not in the PVE GUI? I pulled my node's cables down to one. Stumped.

root@NODE1:~# echo /dev/sdi;udevadm info -p block/sdi --query all|grep ID_SERIAL
/dev/sdi
E: ID_SERIAL=4825
E: ID_SERIAL_SHORT=4825
root@NODE1:~# echo /dev/sdj;udevadm info -p block/sdj --query all|grep ID_SERIAL
/dev/sdj
E: ID_SERIAL=a809
E: ID_SERIAL_SHORT=a809
root@NODE1:~# echo /dev/sdk;udevadm info -p block/sdk --query all|grep ID_SERIAL
/dev/sdk
E: ID_SERIAL=c9f5
E: ID_SERIAL_SHORT=c9f5
root@NODE1:~#
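
(A quick way to dump the serial for every sdX device and show only the ones that appear twice - a sketch built on the same udevadm query as above:)

Code:
# print "<device> <serial>" for every sd* disk, then keep only repeated serials
for d in /sys/block/sd*; do
    dev=${d##*/}
    echo "$dev $(udevadm info -p "block/$dev" --query=property | sed -n 's/^ID_SERIAL=//p')"
done | sort -k2 | uniq -D -f1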
 
Last edited:
@fabian has the right of it, on both counts.

it does look like you're multipathed. how are you connected to the chassis? do you have two cables? if so, you need to install and configure multipathd. once you do, use the multipath devs to create your zpool.
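
(For what that usually looks like on PVE - a sketch only; the multipath.conf below is a common minimal starting point, not something prescribed in this thread:)

Code:
apt install multipath-tools
cat > /etc/multipath.conf <<'EOF'
defaults {
    user_friendly_names no
    find_multipaths yes
}
EOF
systemctl enable --now multipathd
multipath -ll   # each physical disk should now show one map with two paths

The pool would then be created against the /dev/mapper/<wwid> devices instead of the raw sdX names.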

As for raid level- with so many drives, this is what draid exists for. see https://openzfs.github.io/openzfs-docs/Basic Concepts/dRAID Howto.html
 
I just backed up my containers. Time to remove all my pools and fdisk everything except the OS. Let's try again... more coffee.

If this works, I'll just do a striped mirror with my 6 local SSDs for my fast LXCs, and small vdevs of HDDs for my storage backups and slower LXCs. I was going to use RAIDZ3 with just 5 disks each, but now you have me thinking.

I'm still torn about using dRAID since I've never used it before. It's really the rebuild time I'd be saving, and it seems I'd have to go big on the number of disks to see the benefit. I'd rather have the most vdevs for performance than a large number of disks tied up in dRAID for LXC/storage? No? Yes? Who's on first?
I need more coffee, because it's time to make a decision...
 
do you have two cables? yes
one mystery down. you have the option of either splitting the disks between the channels, or leaving it as is and installing multipath (the option I'd go with).

Setting up your SSDs in a striped mirror makes tons of sense, but everything is dependent on your disk mix and risk tolerance.
I'd rather have the most vdevs for performance than a large number of disks used for draid for lxc/storage?
the number of vdevs, regardless of draid, has to do with how many stripes fill your available disk pool before repeating - e.g.

let's say you set up your 84 drives in an 8:2:1 arrangement (8 data : 2 parity : 1 spare). that means you have room for 7 full stripes and half of the next, or ~7.5 "vdevs." it doesn't perform as well as a true compound raid (e.g., multiple individual vdevs), but the discussion is pretty moot if your disks are 3.5" spinners, as VM storage is not their use case in the first place - it's bulk storage.
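
(As an illustration of that 8:2:1 layout in dRAID terms - a sketch; the pool name is made up, and it assumes multipathd is already set up so /dev/mapper holds only the 84 disk maps:)

Code:
# draid2 = double parity, 8d = 8 data disks per redundancy group, 1s = 1 distributed spare;
# the child count is taken from the devices passed on the command line
DISKS=$(ls /dev/mapper/ | grep -v '^control$' | sed 's|^|/dev/mapper/|')
zpool create bigdas draid2:8d:1s $DISKS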
 
OK, this problem is now solved. What I did was just pull the backup plug, leaving only one cable. Now everything is back to normal.
I know what happened: last year I added a backup cable to my system, and then when I fdisked all the storage disks, everything went "Mr. Magoo with double vision." I'll set up multipath so this doesn't happen again. How do I set up multipath?

if your disks are 3.5" spinners, VM storage is not their use case in the first place - it's bulk storage
Yes, correct. The VMs that count are on SSD; the rest are backups of those LXCs plus a few combined LXCs. My main concern is storage backups.
Thank you guys
 
Last edited:
Why aren't you just using /dev/disk/by-id/scsi-THE_ACTUAL_NAME_OF_THE_DEVICE ?? That's much much easier and is deterministic. (If it's not there, install 'sg3-utils-udev')
Hmm, I thought that because my disk IDs were already unique, multipath (or physically disabling the second path in my DAS box) would be the better choice. So if I named each disk, for example 0001, 0002, 0003, etc., that would be another way to solve the problem and avoid duplicates again?
 
Last edited:
Hmm, I thought that because my disk IDs were already unique, multipath (or physically disabling the second path in my DAS box) would be the better choice. So if I named each disk, for example 0001, 0002, 0003, etc., that would be another way to solve the problem and avoid duplicates again?

Multipath doesn't really do anything on the same machine. You're always going to be IO bound at the disk itself ('iostat -xy 1' and check the busy stats). I'm more suggesting that you use the name so you can tell which disk is which! To me this is a lot easier to read:

Code:
root@nas-194:~# zpool status
  pool: bigpool
 state: ONLINE
  scan: scrub repaired 0B in 14:15:42 with 0 errors on Sun Jan  8 14:39:45 2023
config:

        NAME                                                       STATE     READ WRITE CKSUM
        bigpool                                                    ONLINE       0     0     0
          raidz2-0                                                 ONLINE       0     0     0
            scsi-SSEAGATE_ST6000NM0095_ZAD8MVSK0000C942MF53        ONLINE       0     0     0
            scsi-SSEAGATE_ST6000NM0095_ZAD2X2LE0000C8180YGZ        ONLINE       0     0     0
            scsi-SSEAGATE_ST6000NM0095_ZAD2X39C0000C8179ULY        ONLINE       0     0     0
            scsi-SSEAGATE_ST6000NM0095_ZAD1NJSM0000C7458K63        ONLINE       0     0     0
            scsi-SSEAGATE_ST6000NM0095_ZAD5ZWNY0000C90705B7        ONLINE       0     0     0
            scsi-SSEAGATE_ST6000NM0095_ZAD8VN340000C944FGQA        ONLINE       0     0     0
            scsi-SSEAGATE_ST6000NM0095_ZAD1ZAPY0000C8027ANP        ONLINE       0     0     0
            scsi-SSEAGATE_ST6000NM0095_ZAD210890000C75286LY        ONLINE       0     0     0
            scsi-SSEAGATE_ST6000NM0095_ZAD8AVEZ0000C9401UL3        ONLINE       0     0     0
            scsi-SSEAGATE_ST6000NM0095_ZAD1SBQJ0000C7458J3N        ONLINE       0     0     0
        cache
          nvme-Samsung_SSD_970_EVO_Plus_1TB_S4EWNF0M317581N-part2  ONLINE       0     0     0

errors: No known data errors
root@nas-194:~#
 
Last edited:
All is good on Node 1, which now round-robins across both connected Lenovo ESMs. I had to turn off user_friendly_names. I did get something strange, but all pools show OK and online. I'll assume I need more coffee.

Feb 02 18:52:18 | 35000c50030256957: failback = "immediate" (setting: multipath.conf defaults/devices section)
Feb 02 18:52:18 | 35000c50030256957: path_grouping_policy = multibus (setting: multipath.conf defaults/devices section)
Feb 02 18:52:18 | 35000c50030256957: path_selector = "round-robin 0" (setting: multipath.conf defaults/devices section)
Feb 02 18:52:18 | 35000c50030256957: no_path_retry = "queue" (setting: multipath.conf defaults/devices section)
Feb 02 18:52:18 | 35000c50030256957: retain_attached_hw_handler = yes (setting: implied in kernel >= 4.3.0)
Feb 02 18:52:18 | 35000c50030256957: features = "0" (setting: multipath internal)
Feb 02 18:52:18 | 35000c50030256957: hardware_handler = "0" (setting: multipath internal)
Feb 02 18:52:18 | 35000c50030256957: rr_weight = "uniform" (setting: multipath internal)
Feb 02 18:52:18 | 35000c50030256957: minio = 1 (setting: multipath internal)
Feb 02 18:52:18 | 35000c50030256957: fast_io_fail_tmo = 5 (setting: multipath internal)
Feb 02 18:52:18 | 35000c50030256957: deferred_remove = no (setting: multipath internal)
Feb 02 18:52:18 | 35000c50030256957: marginal_path_err_sample_time = "no" (setting: multipath internal)
Feb 02 18:52:18 | 35000c50030256957: marginal_path_err_rate_threshold = "no" (setting: multipath internal)
Feb 02 18:52:18 | 35000c50030256957: marginal_path_err_recheck_gap_time = "no" (setting: multipath internal)
Feb 02 18:52:18 | 35000c50030256957: marginal_path_double_failed_time = "no" (setting: multipath internal)
Feb 02 18:52:18 | 35000c50030256957: san_path_err_threshold = "no" (setting: multipath internal)
Feb 02 18:52:18 | 35000c50030256957: san_path_err_forget_rate = "no" (setting: multipath internal)
Feb 02 18:52:18 | 35000c50030256957: san_path_err_recovery_time = "no" (setting: multipath internal)
Feb 02 18:52:18 | 35000c50030256957: skip_kpartx = no (setting: multipath internal)
Feb 02 18:52:18 | 35000c50030256957: ghost_delay = "no" (setting: multipath.conf defaults/devices section)
Feb 02 18:52:18 | 35000c50030256957: flush_on_last_del = no (setting: multipath internal)
Feb 02 18:52:18 | 35000c50030256957: setting dev_loss_tmo is unsupported for protocol scsi:unspec
Feb 02 18:52:18 | 35000c50030256957: set ACT_CREATE (map does not exist)
Feb 02 18:52:18 | 35000c50030256957: addmap [0 781422768 multipath 1 queue_if_no_path 0 1 1 round-robin 0 1 1 8:16 1]
Feb 02 18:52:18 | libdevmapper: ioctl/libdm-iface.c(1927): device-mapper: reload ioctl on 35000c50030256957 (253:258) failed: Device or resource busy ????
Feb 02 18:52:18 | dm_addmap: libdm task=0 error: Success
Feb 02 18:52:18 | 35000c50030256957: failed to load map, error 16
Feb 02 18:52:18 | 35000c50030256957: domap (0) failure for create/reload map
Feb 02 18:52:18 | 35000c50030256957: ignoring map
Feb 02 18:52:18 | sdb: orphan path, map removed internally ????



Screenshot from 2023-02-02 18-31-13.png

Screenshot from 2023-02-02 18-51-45.png
 
Last edited:
Solved: libdevmapper: ioctl/libdm-iface.c(1927)
This was the unkillable multipath error on my Proxmox OS drives sda and sdb. No matter what I did, I could not get past it: libdevmapper: ioctl/libdm-iface.c(1927): device-mapper: reload ioctl on 35000c50030256957 (253:258) failed: Device or resource busy
The best solution was to grab a backup of my /etc files with rsync and just do a clean install, without any LXCs or VMs.

It seems that a permanent locked/busy state happens once you create ANY VM or LXC. Either way, by reinstalling a fresh OS and fdisking my disks, I was able to get multipath working 100% properly, since this time the OS disks were not in a locked/busy state.

If you have this weird, strange, rare locked error on the OS disks, just reinstall fresh, but take your selected /etc files with you first. Reinstalling the OS was a hell of a lot faster than troubleshooting this unkillable error; force, delayed start, exporting the zpool, and the other methods - I just couldn't crack this one.
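
(One thing that may be worth a look for anyone hitting this on local OS disks: multipath.conf supports a blacklist, so the boot disks never get pulled into a map in the first place. A sketch only - I can't say it would have avoided this particular lock - using the WWID from the log above as the example:)

Code:
# /etc/multipath.conf
blacklist {
    # keep multipath away from the local OS disks (example WWID from the log above)
    wwid 35000c50030256957
}
# apply with: multipathd reconfigure   (or restart multipathd)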

I hope this helps someone solve this libdevmapper: ioctl/libdm-iface.c(1927)

Now, back to eating crayons.

 
I just stumbled over this ticket and I'm glad you solved your issue.

The "resource busy" happens when the device is locked by another kernel subsystem, and it can be debugged further with dmsetup. It is, however, not that easy to dig into - yet possible.
Reinstalling - like in Windows - solves some issues, and I see why you did it.
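
(For the curious, the dmsetup angle usually starts with something like this - a sketch of the commands, not a recipe for this exact lock; dm-0 is an example device:)

Code:
# which device-mapper maps exist and how they stack
dmsetup ls --tree
dmsetup info -c
# see what still holds a given dm device open
ls -l /sys/block/dm-0/holders/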

For the naming issue: we name our devices by SCSI bus ID, which (in our case) corresponds to the shelf ID, so that we have e.g. S1R1C1 (for shelf, row and column). With this it is much easier to "see" a failed drive.