[SOLVED] 92 Hard Disks - Improving the naming practice?

mikeinnyc

I have a D3284 with 84 HDDs, and my primary NODE1 with its 8 SSDs already named.

My plan is 16 RAIDZ3 pools from the 84 disks: each RAIDZ3 pool needs at least 5 disks, so 84 disks divide into 16 pools of 5 disks each, with 4 left over for other use.
How can this naming process be improved?
No matter what - even after fdisking and deleting the pools - when I start creating NEW pools, the disk naming jumps to different disks after about 8 pools have been created.
The IDs are different too, so I'm at a loss. When I reboot Node 1 first and then start my direct-attached storage, the HDD naming goes back to normal. Then I start creating pools, but only after fdisking, deleting the pools, rebooting, and checking the pool status do I begin. I thought ordering didn't matter with storage pools.

What I want to know is whether the naming process runs from sda all the way to sdz.

Since the first 8 SSDs are already named sda to sdh,

should I start the new pools with the next device after sdh - sdi, sdj, sdk, sdl, sdm - and continue all the way to sdz, then on to sdaa, sdab, and so on?

I'm thinking this is why this is happening to me. I'll let everyone know whether continuing to sdz pays off.
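
(For anyone reading along: ZFS doesn't actually care which sdX letter a disk ends up with - pool members are tracked by the GUIDs in their on-disk labels. A quick way to see that, as a sketch only - the pool name here is just the one used later in this thread, and sdi1 is an example partition:)

Code:
# show the pool members by GUID rather than by device name
zpool status -g DASPool1
# dump the ZFS label on one member partition and look at its stored guid
zdb -l /dev/sdi1 | grep -m1 guid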
 

Attachments

  • Screenshot 2023-01-30 at 15-43-15 Screenshot.png (119 KB)
  • HDD.pdf (246.3 KB)
Last edited:
Pilot error - I left a straggler and didn't fdisk it fully.
I assumed the Proxmox GUI was correct; maybe it was, but I haven't tried to replicate this.
It was only when I ran the following that I spotted it:
Code:
ls -lF /dev/disk/by-id/

lrwxrwxrwx 1 root root 10 Jan 30 16:46 wwn- -> ../../sddb
lrwxrwxrwx 1 root root 10 Jan 30 10:56 wwn- -> ../../sdeq
lrwxrwxrwx 1 root root 10 Jan 30 10:56 wwn- -> ../../sded
lrwxrwxrwx 1 root root 10 Jan 30 16:34 wwn- -> ../../sdau
lrwxrwxrwx 1 root root 11 Jan 30 16:34 wwn--part1 -> ../../sdea1
lrwxrwxrwx 1 root root 11 Jan 30 16:34 wwn--part9 -> ../../sdea9
lrwxrwxrwx 1 root root 10 Jan 30 16:30 wwn- -> ../../sdah
lrwxrwxrwx 1 root root 10 Jan 30 16:43 wwn- -> ../../sdcw

I'm going to assume that because this disk wasn't fully wiped of its partitions, that's the reason for my errors when creating new pools - and since it only shows up a few pools in, that's exactly where I'd hit this mistake. Man, I need coffee.
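
(For reference, rather than fdisking each drive by hand, leftover partition tables and old ZFS labels can be cleared in one go - a sketch only, using the sdea straggler from the listing above as the example device and assuming gdisk/sgdisk is installed:)

Code:
# clear any old ZFS label first (partition device, example only)
zpool labelclear -f /dev/sdea1
# then wipe remaining filesystem/RAID/ZFS signatures and the partition table
wipefs -a /dev/sdea
sgdisk --zap-all /dev/sdea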
 
Last edited:
OK, we have a problem. After all 84 disks were fdisked and I verified that no partitions remained, the same error happens again.

I created a Z3 pool (Pool1) with disks added from sdi to sdm.
After verifying the disk order, here's what the randomness shows: the pool was created in order, sdi to sdm, as a 5-disk RAIDZ3, but it reads back with out-of-order disks.
sda/sdb = OS (2 SSDs)
sdc-sdh = localzfspool (6 SSDs)

I created DASPool1 as a 5-disk RAIDZ3 starting at sdi through sdm... didn't I? I know I did. So why is this happening?

Code:
root@NODE1:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 372.6G 0 disk
├─sda1 8:1 0 1007K 0 part
├─sda2 8:2 0 512M 0 part
└─sda3 8:3 0 372.1G 0 part
sdb 8:16 0 372.6G 0 disk
├─sdb1 8:17 0 1007K 0 part
├─sdb2 8:18 0 512M 0 part
└─sdb3 8:19 0 372.1G 0 part
sdc 8:32 0 372.6G 0 disk
├─sdc1 8:33 0 372.6G 0 part
└─sdc9 8:41 0 8M 0 part
sdd 8:48 0 372.6G 0 disk
├─sdd1 8:49 0 372.6G 0 part
└─sdd9 8:57 0 8M 0 part
sde 8:64 0 372.6G 0 disk
├─sde1 8:65 0 372.6G 0 part
└─sde9 8:73 0 8M 0 part
sdf 8:80 0 372.6G 0 disk
├─sdf1 8:81 0 372.6G 0 part
└─sdf9 8:89 0 8M 0 part
sdg 8:96 0 372.6G 0 disk
├─sdg1 8:97 0 372.6G 0 part
└─sdg9 8:105 0 8M 0 part
sdh 8:112 0 372.6G 0 disk
├─sdh1 8:113 0 372.6G 0 part
└─sdh9 8:121 0 8M 0 part
sdi 8:128 0 1.8T 0 disk
├─sdi1 8:129 0 1.8T 0 part
└─sdi9 8:137 0 8M 0 part
sdj 8:144 0 1.8T 0 disk
sdk 8:160 0 1.8T 0 disk
sdl 8:176 0 1.8T 0 disk
├─sdl1 8:177 0 1.8T 0 part
└─sdl9 8:185 0 8M 0 part
sdm 8:192 0 1.8T 0 disk
sdn 8:208 0 1.8T 0 disk
sdo 8:224 0 1.8T 0 disk
sdp 8:240 0 1.8T 0 disk
sdq 65:0 0 1.8T 0 disk
sdr 65:16 0 1.8T 0 disk
sds 65:32 0 1.8T 0 disk
sdt 65:48 0 1.8T 0 disk
sdu 65:64 0 1.8T 0 disk
sdv 65:80 0 1.8T 0 disk
sdw 65:96 0 1.8T 0 disk
sdx 65:112 0 1.8T 0 disk
├─sdx1 65:113 0 1.8T 0 part
└─sdx9 65:121 0 8M 0 part
sdy 65:128 0 1.8T 0 disk
├─sdy1 65:129 0 1.8T 0 part
└─sdy9 65:137 0 8M 0 part
sdz 65:144 0 1.8T 0 disk
sdaa 65:160 0 1.8T 0 disk
├─sdaa1 65:161 0 1.8T 0 part
└─sdaa9 65:169 0 8M 0 part
sdab 65:176 0 1.8T 0 disk
sdac 65:192 0 1.8T 0 disk
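
(A way to see which physical disks the pool actually grabbed, regardless of the sdX letters - a sketch, using the pool name from above and sdi as the example member:)

Code:
zpool status DASPool1
# map any sdX member back to its persistent id (whole-disk links only)
ls -l /dev/disk/by-id/ | grep -w sdi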
 

Last edited:
I'm going to go with the recommendation of a RAIDZ-3 configuration that maximizes disk space and can withstand 3 disk failures: triple-parity RAID-Z (raidz3). I should have done this first. The damn OpenAI crap was wrong after all.

I will use a 9-disk configuration (6 data + 3 parity), but I'll do it tomorrow from the CLI.
Code:
# zpool create rzpool raidz3 c0t0d0 c1t0d0 c2t0d0 c3t0d0 c4t0d0
  c5t0d0 c6t0d0 c7t0d0 c8t0d0
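
(Those are Solaris-style device names from the Oracle docs; on Proxmox the equivalent create would look roughly like this - a sketch only, with placeholder by-id paths standing in for the nine real disks:)

Code:
# 9-disk raidz3 (6 data + 3 parity); substitute the nine stable by-id
# (or multipath) paths for the EXAMPLE placeholders below
D=/dev/disk/by-id
zpool create rzpool raidz3 \
  $D/wwn-0xEXAMPLE1 $D/wwn-0xEXAMPLE2 $D/wwn-0xEXAMPLE3 \
  $D/wwn-0xEXAMPLE4 $D/wwn-0xEXAMPLE5 $D/wwn-0xEXAMPLE6 \
  $D/wwn-0xEXAMPLE7 $D/wwn-0xEXAMPLE8 $D/wwn-0xEXAMPLE9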
 
Last edited:
a few things to consider:
- raidz has a lot of downsides that you should be aware of (high overhead with zvols, lower iops compared to mirrored setups)
- if you can live with those, you might want to investigate draid for faster resilvering

you can give your disks stable names/paths, the /dev/sdX ones are just assigned by the kernel in the order the disks are enumerated. one possibility is udev (it already assigns alternative names, check /dev/disk/...), which also supports custom rules. ZFS can also hook into that via vdev_id - see its man page and the example configs in /usr/share/doc/zfsutils-linux/examples.
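
(To illustrate the vdev_id route - a sketch of /etc/zfs/vdev_id.conf using its alias form; the alias names and wwn paths below are made-up examples, and the resulting links show up under /dev/disk/by-vdev/:)

Code:
# /etc/zfs/vdev_id.conf - map stable device links to human-readable aliases
#       alias-name       device-link
alias   shelf1-slot01    /dev/disk/by-id/wwn-0xEXAMPLE01
alias   shelf1-slot02    /dev/disk/by-id/wwn-0xEXAMPLE02

# after editing, regenerate the udev links; they appear under /dev/disk/by-vdev/
udevadm trigger && udevadm settle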
 
I just had a good 5 hours of light sleep... again. Last year I had no problems creating fresh Z3 pools, but after some sleep I notice I'm seeing 2x my hard drives - not 84, but double that.
draid - I love the fact that the rebuild is spread across basically all the disks in the vdev instead of going to a single disk. Since the writes occur to all disks in the vdev simultaneously, it's faster than rebuilding onto one drive. My only concern is whether it's stable - that's my fear; it looks good, but does it bite? Fabian, you have my interest, since I'm at ground zero again.
 
Last edited:
it seems like your storage enclosure is multipath/dual-channel, so you "see" each disk twice (because there's two connections to it). you either need to disable this on the enclosure side or configure multipath so that PVE knows to treat each "pair" as a single disk.

regarding the other question - draid is considered stable by the upstream ZFS developers. like I said, it suffers from similar downsides as raidz w.r.t. performance and potential space overhead, depending on workload.
 
Well, the serial numbers are different in the CLI but not in the PVE GUI? I pulled my node's cables down to one. Stumped.

root@NODE1:~# echo /dev/sdi;udevadm info -p block/sdi --query all|grep ID_SERIAL
/dev/sdi
E: ID_SERIAL=4825
E: ID_SERIAL_SHORT=4825
root@NODE1:~# echo /dev/sdj;udevadm info -p block/sdj --query all|grep ID_SERIAL
/dev/sdj
E: ID_SERIAL=a809
E: ID_SERIAL_SHORT=a809
root@NODE1:~# echo /dev/sdk;udevadm info -p block/sdk --query all|grep ID_SERIAL
/dev/sdk
E: ID_SERIAL=c9f5
E: ID_SERIAL_SHORT=c9f5
root@NODE1:~#
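
(A quick way to dump the serial for every sdX device and show only the ones that appear twice - a sketch built on the same udevadm query as above:)

Code:
# print "<device> <serial>" for every sd* disk, then keep only repeated serials
for d in /sys/block/sd*; do
    dev=${d##*/}
    echo "$dev $(udevadm info -p "block/$dev" --query=property | sed -n 's/^ID_SERIAL=//p')"
done | sort -k2 | uniq -D -f1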
 
Last edited:
@fabian has the right of it, on both counts.

it does look like you're multipathed. how are you connected to the chassis? do you have two cables? if so, you need to install and configure multipathd. once you do, use the multipath devs to create your zpool.
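
(For what that usually looks like on PVE - a sketch only; the multipath.conf below is a common minimal starting point, not something prescribed in this thread:)

Code:
apt install multipath-tools
cat > /etc/multipath.conf <<'EOF'
defaults {
    user_friendly_names no
    find_multipaths yes
}
EOF
systemctl enable --now multipathd
multipath -ll   # each physical disk should now show one map with two paths

The pool would then be created against the /dev/mapper/<wwid> devices instead of the raw sdX names.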

As for raid level- with so many drives, this is what draid exists for. see https://openzfs.github.io/openzfs-docs/Basic Concepts/dRAID Howto.html
 
I just backed up my containers. Time to remove all my pools and fdisk everything except the OS. Let's try again... more coffee.

If this works, I'll just do a striped mirror with my 6 local SSDs for my fast LXCs, and small vdevs of HDDs for my storage backups and slower LXCs. I was going to use RAIDZ3 with just 5 disks each, but now you have me thinking.

I'm still torn about using dRAID since I've never used it before. It's really the rebuild time I'd be saving, and it seems I'd have to go big on the number of disks to see the benefit. I'd rather have the most vdevs for performance than a large number of disks tied up in dRAID for LXC/storage? No? Yes? Who's on first?
I need more coffee, because it's time to make a decision...
 
do you have two cables? yes
one mystery down. you have the option of either splitting the disks between the channels, or leaving it as is and installing multipath (the option I'd go with).

Setting up your SSDs in a striped mirror makes tons of sense, but everything is dependent on your disk mix and risk tolerance.
I'd rather have the most vdevs for performance than a large number of disks used for draid for lxc/storage?
the number of vdevs, regardless of draid, has to do with how many stripes fill your available disk pool before repeating - e.g.

let's say you set up your 84 drives in an 8:2:1 arrangement (8 data : 2 parity : 1 spare). that means you have room for 7 full stripes and half of the next, or ~7.5 "vdevs." it doesn't perform as well as a true compound raid (e.g., multiple individual vdevs), but the discussion is pretty moot if your disks are 3.5" spinners, as VM storage is not their use case in the first place - it's bulk storage.
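
(As an illustration of that 8:2:1 layout in dRAID terms - a sketch; the pool name is made up, and it assumes multipathd is already set up so /dev/mapper holds only the 84 disk maps:)

Code:
# draid2 = double parity, 8d = 8 data disks per redundancy group, 1s = 1 distributed spare;
# the child count is taken from the devices passed on the command line
DISKS=$(ls /dev/mapper/ | grep -v '^control$' | sed 's|^|/dev/mapper/|')
zpool create bigdas draid2:8d:1s $DISKS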
 
OK, this problem is now solved. What I did was just pull the backup plug, leaving only one cable. Now everything is back to normal.
I know what happened: last year I added a backup cable to my system, and then when I fdisked all the storage disks, everything went "Mr. Magoo with double vision." I'll set up multipath so this doesn't happen again. How do I set up multipath?

if your disks are 3.5" spinners, VM storage is not their use case in the first place - it's bulk storage
Yes, correct. The VMs that count are on SSD; the rest are backups of those LXCs plus a few combined LXCs. My main concern is storage backups.
Thank you guys
 
Last edited:
Why aren't you just using /dev/disk/by-id/scsi-THE_ACTUAL_NAME_OF_THE_DEVICE ?? That's much much easier and is deterministic. (If it's not there, install 'sg3-utils-udev')
Hmm, I thought that because my disk IDs were already unique, multipath (or physically disabling the second path in my DAS box) would be the better choice. So if I named each disk, for example 0001, 0002, 0003, etc., that would be another way to solve the problem and avoid duplicates again?
 
Last edited:
Hmm, I thought that because my disk IDs were already unique, multipath (or physically disabling the second path in my DAS box) would be the better choice. So if I named each disk, for example 0001, 0002, 0003, etc., that would be another way to solve the problem and avoid duplicates again?

Multipath doesn't really do anything on the same machine. You're always going to be IO bound at the disk itself ('iostat -xy 1' and check the busy stats). I'm more suggesting that you use the name so you can tell which disk is which! To me this is a lot easier to read:

Code:
root@nas-194:~# zpool status
  pool: bigpool
 state: ONLINE
  scan: scrub repaired 0B in 14:15:42 with 0 errors on Sun Jan  8 14:39:45 2023
config:

        NAME                                                       STATE     READ WRITE CKSUM
        bigpool                                                    ONLINE       0     0     0
          raidz2-0                                                 ONLINE       0     0     0
            scsi-SSEAGATE_ST6000NM0095_ZAD8MVSK0000C942MF53        ONLINE       0     0     0
            scsi-SSEAGATE_ST6000NM0095_ZAD2X2LE0000C8180YGZ        ONLINE       0     0     0
            scsi-SSEAGATE_ST6000NM0095_ZAD2X39C0000C8179ULY        ONLINE       0     0     0
            scsi-SSEAGATE_ST6000NM0095_ZAD1NJSM0000C7458K63        ONLINE       0     0     0
            scsi-SSEAGATE_ST6000NM0095_ZAD5ZWNY0000C90705B7        ONLINE       0     0     0
            scsi-SSEAGATE_ST6000NM0095_ZAD8VN340000C944FGQA        ONLINE       0     0     0
            scsi-SSEAGATE_ST6000NM0095_ZAD1ZAPY0000C8027ANP        ONLINE       0     0     0
            scsi-SSEAGATE_ST6000NM0095_ZAD210890000C75286LY        ONLINE       0     0     0
            scsi-SSEAGATE_ST6000NM0095_ZAD8AVEZ0000C9401UL3        ONLINE       0     0     0
            scsi-SSEAGATE_ST6000NM0095_ZAD1SBQJ0000C7458J3N        ONLINE       0     0     0
        cache
          nvme-Samsung_SSD_970_EVO_Plus_1TB_S4EWNF0M317581N-part2  ONLINE       0     0     0

errors: No known data errors
root@nas-194:~#
 
Last edited:
All is good on Node 1, which now round-robins across both connected Lenovo ESMs. I had to turn off user_friendly_names. I did get something strange, but all pools show OK and online. I'll assume I need more coffee.

Feb 02 18:52:18 | 35000c50030256957: failback = "immediate" (setting: multipath.conf defaults/devices section)
Feb 02 18:52:18 | 35000c50030256957: path_grouping_policy = multibus (setting: multipath.conf defaults/devices section)
Feb 02 18:52:18 | 35000c50030256957: path_selector = "round-robin 0" (setting: multipath.conf defaults/devices section)
Feb 02 18:52:18 | 35000c50030256957: no_path_retry = "queue" (setting: multipath.conf defaults/devices section)
Feb 02 18:52:18 | 35000c50030256957: retain_attached_hw_handler = yes (setting: implied in kernel >= 4.3.0)
Feb 02 18:52:18 | 35000c50030256957: features = "0" (setting: multipath internal)
Feb 02 18:52:18 | 35000c50030256957: hardware_handler = "0" (setting: multipath internal)
Feb 02 18:52:18 | 35000c50030256957: rr_weight = "uniform" (setting: multipath internal)
Feb 02 18:52:18 | 35000c50030256957: minio = 1 (setting: multipath internal)
Feb 02 18:52:18 | 35000c50030256957: fast_io_fail_tmo = 5 (setting: multipath internal)
Feb 02 18:52:18 | 35000c50030256957: deferred_remove = no (setting: multipath internal)
Feb 02 18:52:18 | 35000c50030256957: marginal_path_err_sample_time = "no" (setting: multipath internal)
Feb 02 18:52:18 | 35000c50030256957: marginal_path_err_rate_threshold = "no" (setting: multipath internal)
Feb 02 18:52:18 | 35000c50030256957: marginal_path_err_recheck_gap_time = "no" (setting: multipath internal)
Feb 02 18:52:18 | 35000c50030256957: marginal_path_double_failed_time = "no" (setting: multipath internal)
Feb 02 18:52:18 | 35000c50030256957: san_path_err_threshold = "no" (setting: multipath internal)
Feb 02 18:52:18 | 35000c50030256957: san_path_err_forget_rate = "no" (setting: multipath internal)
Feb 02 18:52:18 | 35000c50030256957: san_path_err_recovery_time = "no" (setting: multipath internal)
Feb 02 18:52:18 | 35000c50030256957: skip_kpartx = no (setting: multipath internal)
Feb 02 18:52:18 | 35000c50030256957: ghost_delay = "no" (setting: multipath.conf defaults/devices section)
Feb 02 18:52:18 | 35000c50030256957: flush_on_last_del = no (setting: multipath internal)
Feb 02 18:52:18 | 35000c50030256957: setting dev_loss_tmo is unsupported for protocol scsi:unspec
Feb 02 18:52:18 | 35000c50030256957: set ACT_CREATE (map does not exist)
Feb 02 18:52:18 | 35000c50030256957: addmap [0 781422768 multipath 1 queue_if_no_path 0 1 1 round-robin 0 1 1 8:16 1]
Feb 02 18:52:18 | libdevmapper: ioctl/libdm-iface.c(1927): device-mapper: reload ioctl on 35000c50030256957 (253:258) failed: Device or resource busy ????
Feb 02 18:52:18 | dm_addmap: libdm task=0 error: Success
Feb 02 18:52:18 | 35000c50030256957: failed to load map, error 16
Feb 02 18:52:18 | 35000c50030256957: domap (0) failure for create/reload map
Feb 02 18:52:18 | 35000c50030256957: ignoring map
Feb 02 18:52:18 | sdb: orphan path, map removed internally ????



Screenshot from 2023-02-02 18-31-13.png

Screenshot from 2023-02-02 18-51-45.png
 
Last edited:
Solved: libdevmapper: ioctl/libdm-iface.c(1927)
This was the unkillable multipath error on my Proxmox OS drives sda and sdb. No matter what I did, I could not get past it: libdevmapper: ioctl/libdm-iface.c(1927): device-mapper: reload ioctl on 35000c50030256957 (253:258) failed: Device or resource busy
The best solution was to grab a backup of my /etc files with rsync and just do a clean install, without any LXCs or VMs.

It seems that a permanent locked/busy state happens once you create ANY VM or LXC. Either way, by reinstalling a fresh OS and fdisking my disks, I was able to get multipath working 100% properly, since this time the OS disks were not in a locked/busy state.

If you have this weird, strange, rare locked error on the OS disks, just reinstall fresh, but take your selected /etc files with you first. Reinstalling the OS was a hell of a lot faster than troubleshooting this unkillable error; force, delayed start, exporting the zpool, and the other methods - I just couldn't crack this one.
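
(One thing that may be worth a look for anyone hitting this on local OS disks: multipath.conf supports a blacklist, so the boot disks never get pulled into a map in the first place. A sketch only - I can't say it would have avoided this particular lock - using the WWID from the log above as the example:)

Code:
# /etc/multipath.conf
blacklist {
    # keep multipath away from the local OS disks (example WWID from the log above)
    wwid 35000c50030256957
}
# apply with: multipathd reconfigure   (or restart multipathd)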

I hope this helps someone solve this libdevmapper: ioctl/libdm-iface.c(1927)

Now, back to eating crayons.

 
I just stumbled over this ticket and I'm glad you solved your issue.

The "resource busy" happens when the device is locked by another kernel subsystem, and it can be debugged further with dmsetup. It is, however, not that easy to dig into - yet possible.
Reinstalling - like in Windows - solves some issues, and I see why you did it.
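
(For the curious, the dmsetup angle usually starts with something like this - a sketch of the commands, not a recipe for this exact lock; dm-0 is an example device:)

Code:
# which device-mapper maps exist and how they stack
dmsetup ls --tree
dmsetup info -c
# see what still holds a given dm device open
ls -l /sys/block/dm-0/holders/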

For the naming issue: we name our devices by SCSI bus ID, which (in our case) corresponds to the shelf ID, so that we have e.g. S1R1C1 (for shelf, row and column). With this it is much easier to "see" a failed drive.