Issues adding storage

dlroth

New Member
May 2, 2024
Hi All,
I'm having an issue adding SSD drives to my hosts. The current setup is 3 Dell hosts, each running a few SSDs. I'm looking to create some Ceph storage and was able to create some OSDs successfully.

The drives that are giving me issues are all Intel 400 GB (SSDSC2BA400G4). I have a total of 4 of these drives across the hosts, and all of them fail when I try to add them. The other SSDs work fine.
See the error messages below. I tried creating an LVM-Thin volume instead of a Ceph OSD, but that fails too. I can, however, create a regular LVM device.

Running PVE 8.2.2

Code:
create OSD on /dev/sdb (bluestore)
wiping block device /dev/sdb
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 0.788577 s, 266 MB/s
Running command: /bin/ceph-authtool --gen-print-key
Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 0dbcd843-4671-4371-9265-1767f39291f8
Running command: vgcreate --force --yes ceph-575d7486-966e-4912-b0b1-b29870b3c609 /dev/sdb
 stdout: Physical volume "/dev/sdb" successfully created.
 stdout: Volume group "ceph-575d7486-966e-4912-b0b1-b29870b3c609" successfully created
Running command: lvcreate --yes -l 95388 -n osd-block-0dbcd843-4671-4371-9265-1767f39291f8 ceph-575d7486-966e-4912-b0b1-b29870b3c609
 stderr: Failed to initialize logical volume ceph-575d7486-966e-4912-b0b1-b29870b3c609/osd-block-0dbcd843-4671-4371-9265-1767f39291f8 at position 0 and size 4096.
  Aborting. Failed to wipe start of new LV.
 stderr: Error writing device /dev/sdb at 7168 length 1024.
 stderr: WARNING: bcache_invalidate: block (0, 0) still dirty.
  Failed to write metadata to /dev/sdb.
  Failed to write VG ceph-575d7486-966e-4912-b0b1-b29870b3c609.
  Manual intervention may be required to remove abandoned LV(s) before retrying.
--> Was unable to complete a new OSD, will rollback changes
Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd purge-new osd.0 --yes-i-really-mean-it
 stderr: 2024-05-01T19:40:52.238-0400 7ab76fe006c0 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.bootstrap-osd.keyring: (2) No such file or directory
2024-05-01T19:40:52.238-0400 7ab76fe006c0 -1 AuthRegistry(0x7ab768063ec8) no keyring found at /etc/pve/priv/ceph.client.bootstrap-osd.keyring, disabling cephx
 stderr: purged osd.0
-->  RuntimeError: Unable to find any LV for zapping OSD: 0
TASK ERROR: command 'ceph-volume lvm create --crush-device-class nvme --data /dev/sdb' failed: exit code 1

And here are the errors if I just try to create an LVM-Thin storage device:

Code:
  Physical volume "/dev/sdb" successfully created.
  Volume group "IntelTemp" successfully created
  Rounding up size to full physical extent <365.04 GiB
  Rounding up size to full physical extent <3.73 GiB
  Thin pool volume with chunk size 64.00 KiB can address at most <15.88 TiB of data.
  Failed to initialize logical volume IntelTemp/lvol0 at position 0 and size 1048576.
  Aborting. Failed to wipe start of new LV.
  Error writing device /dev/sdb at 7168 length 1024.
  WARNING: bcache_invalidate: block (0, 0) still dirty.
  Failed to write metadata to /dev/sdb.
  Failed to write VG IntelTemp.
  Manual intervention may be required to remove abandoned LV(s) before retrying.
TASK ERROR: command '/sbin/lvcreate --type thin-pool -L382766612.48K --poolmetadatasize 3905781.76K -n IntelTemp IntelTemp' failed: exit code 5
 
You should examine the state of those disks: are there any existing partitions, volumes, or labels on them? Are the disks healthy? Is the firmware up to date? Have they been locked cryptographically? Or do they perhaps have proprietary firmware or an unusual block size?
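Something along these lines would show the current state (illustrative only; substitute the affected device for /dev/sdb):

Code:
# lsblk -o NAME,SIZE,TYPE,FSTYPE /dev/sdb     # leftover partitions or filesystems?
# wipefs -n /dev/sdb                          # list existing signatures without erasing them
# sgdisk -p /dev/sdb                          # partition table state
# smartctl -H -i /dev/sdb                     # health summary, model, and firmware revision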

As you noticed, PVE uses LVM for some of its underlying disk management. It is not modified; it is the same LVM package that is widely deployed elsewhere.
Try running the same LVM commands directly, add debug options, and examine the logs.
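Roughly like this (a sketch only; -vvvv turns on verbose/debug output, and "testvg"/"testlv" are throwaway names):

Code:
# pvcreate -vvvv /dev/sdb
# vgcreate -vvvv testvg /dev/sdb
# lvcreate -vvvv -l 100%FREE -n testlv testvg
# journalctl -k | tail -n 50                  # kernel messages around the failure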

Good luck


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
The disks are wiped fresh and were previously used in a different Proxmox server. I'll try running the commands directly to see what else I can find out.
 
So I tried creating a plain file system on one of the disks, and it fails with the error below.
I wonder if the disks are going bad, which would be strange because they were pulled from a working Proxmox Ceph cluster.

Code:
# /sbin/sgdisk -n1 -t1:8300 /dev/sdc
The operation has completed successfully.
# /sbin/mkfs -t ext4 /dev/sdc1
mke2fs 1.47.0 (5-Feb-2023)
Discarding device blocks: done
Creating filesystem with 97677585 4k blocks and 24420352 inodes
Filesystem UUID: f55caf10-8e9f-4b3b-b6a5-74cf65f3e4ac
Superblock backups stored on blocks:
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
    4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968

Allocating group tables:    0/2981         done                           
Writing inode tables:    0/2981   1/2981         done                           
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information:    0/2981         mkfs.ext4: Input/output error while writing out and closing file system
TASK ERROR: command '/sbin/mkfs -t ext4 /dev/sdc1' failed: exit code 1
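I'll also check the kernel log the next time it fails; something like this should surface the underlying I/O error (assuming the disk is still /dev/sdc):

Code:
# dmesg -T | grep -i -E 'sdc|i/o error' | tail -n 20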
 
SMART stats look fine; the 4 drives are at 7-10% wearout. I'm going to look into firmware updates for the drives.
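For reference, this is roughly what I'm looking at (illustrative; /dev/sdb stands in for each of the four Intel drives):

Code:
# smartctl -i /dev/sdb | grep -i firmware                  # current drive firmware revision
# smartctl -A /dev/sdb | grep -i -E 'wear|realloc|error'   # wear level and error counters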
 
Firmware looks to be up to date. I tried booting into an older kernel (6.5.11-8-pve), but I'm still running into the same issues. The strange thing is what the Dell backplane shows (screenshot attached below), and I need to reboot the server for the disk to show up again.
[attached screenshot of the Dell backplane disk status]
 
I think your troubleshooting has moved somewhat out of the scope of "PVE installation and configuration".

I can say that we've seen some surprising things caused by disk, controller, BIOS, and firmware issues. For example, mixing disks from two particular vendors on the same PCIe switch would cause disk1 to drop out when disk2 was physically removed; this did not happen when only one vendor's disks were present.
Do make sure that all possible firmware is updated, iDRAC included.

Good luck


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Thanks for all your helpful suggestions. Really appreciate it! I'll try a few more things, and it may just be time for some new SSDs as the Intels are getting old anyway.
 
