Where did my drive go?

bzdigblig

Member
Aug 6, 2021
I have a couple of clustered Proxmox nodes that are using shared storage. The shared storage uses LVM on an iSCSI LUN, connected via a couple of multipath iSCSI connections, and it's all been working fine for the last couple of months...

I noticed that the LVM storage was showing a question mark on its disk icon, and mousing over it just showed "Status: Unknown".

When I started trying to figure out what the deal was, I checked my multipath connections and found that my multipath device didn't even exist anymore. The physical volumes/volume groups that depended on that device had disappeared as well, which I'm guessing is what was causing that weird "Status: Unknown" on the LVM storage.

pvscan, vgscan, lvscan only show my local Proxmox info...nothing to do with the iSCSI storage at all.

If I check /etc/lvm/backup or /etc/lvm/archive, I can see the info for my missing VG.

I have two questions:

What could have possibly caused this?
What's the best way forward without losing any data?
 
You appear to have indirectly confirmed that your iSCSI storage is no longer functioning properly - i.e., multipath and LVM are not there.
What you should do next is investigate your iSCSI storage:
1) Are you using the PVE iSCSI plugin to establish the connection, or external tooling, e.g. iscsiadm?
2) Is your iSCSI storage device up?
3) Can you ping it?
4) Are there any iscsi sessions: iscsiadm -m node ; iscsiadm -m session
5) What is the content of your /etc/pve/storage.cfg? Does it look correct to you?
6) Are there any kernel/iscsi events in journalctl or /var/log/messages? You may need to go back a bit.

There may be other information required, but this should be a good start
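Collecting those checks into one place, a rough sketch of the shell side might look like this. It is only a sketch: the NAS address is a placeholder, and the time window and unit name (iscsid) may need adjusting for your setup.

Code:
# Is the storage device reachable? (replace <nas-ip> with your NAS address)
ping -c 3 <nas-ip>

# Configured iSCSI nodes and active sessions
iscsiadm -m node
iscsiadm -m session

# PVE storage definitions
cat /etc/pve/storage.cfg

# Recent iscsi/kernel events; widen the window as needed
journalctl -u iscsid --since "-14 days"
journalctl -k --since "-14 days" | grep -i iscsi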


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
You appear to have indirectly confirmed that your iSCSI storage is no longer functioning properly - i.e., multipath and LVM are not there.
What you should do next is investigate your iSCSI storage:
1) Are you using the PVE iSCSI plugin to establish the connection, or external tooling, e.g. iscsiadm?
I set up the iSCSI connections with the Proxmox GUI, and then followed https://pve.proxmox.com/wiki/ISCSI_Multipath to set up multipath
2) Is your iSCSI storage device up?
Yup
3) Can you ping it?
yup, it responds to pings just fine
4) Are there any iscsi sessions: iscsiadm -m node ; iscsiadm -m session
Yup. I've got 4 sessions in total: 2 for my SSD multipath connection, and 2 for my HDD multipath connection. I hadn't gotten the HDD stuff fully set up yet, so it's not part of the scope of this issue.
5) What is the content of your /etc/pve/storage.cfg? Does it look correct to you?
Yeah, it's all as I'd expect it to be. In this particular case, my busted LVM storage is LVMSynSSD, and the vgname SynSSD refers to my missing volume group.

Code:
lvm: LVMSynSSD
        vgname SynSSD
        content rootdir,images
        shared 1
6) Are there any kernel/iscsi events in journalctl or /var/log/messages? You may need to go back a bit.
I've been looking in those logs and have yet to find anything worthwhile, but I'm still looking.....parsing that stuff is AWFUL.

I've found a few of these...I really don't know if that has anything to do with the actual iSCSI connections to the NAS, or if it's just my test VM's iSCSI hard drive and it's not happy about something...

Code:
Aug 16 22:28:37 node1 iscsid[1146]: Kernel reported iSCSI connection 3:0 error (1022 - ISCSI_ERR_NOP_TIMEDOUT: A NOP has timed out) state (3)
Aug 16 22:28:39 node1 iscsid[1146]: connection3:0 is operational after recovery (1 attempts)
Aug 25 10:03:22 node1 iscsid[1146]: Kernel reported iSCSI connection 3:0 error (1022 - ISCSI_ERR_NOP_TIMEDOUT: A NOP has timed out) state (3)
Aug 25 10:03:24 node1 iscsid[1146]: connection3:0 is operational after recovery (1 attempts)
Aug 31 00:29:40 node1 iscsid[1146]: Kernel reported iSCSI connection 3:0 error (1022 - ISCSI_ERR_NOP_TIMEDOUT: A NOP has timed out) state (3)
Aug 31 00:29:42 node1 iscsid[1146]: connection3:0 is operational after recovery (1 attempts)

EDIT: This is from an iSCSI connection that's not part of the multipath config that has the issue, so I don't know that this info is really relevant...

There may be other information required, but this should be a good start


I checked the uptime on the NAS to see if it had rebooted somehow, and it's got 100+ days uptime. The NAS is a Synology UC3200, so it's got two independent redundant controllers that each have their own 10G connection, and each connection runs to its own switch. Each switch also has 100+ days uptime. Each Proxmox node has two 10G NICs, one to each switch, and each of the two iSCSI connections use separate NICs. So, if a NIC failed, or a switch rebooted, or anything like that, it would just drop the number of paths to 1, rather than causing an outright failure.
 
What is the output of:
lsscsi
Code:
lsscsi
[0:2:0:0]    disk    DELL      PERC H730P Mini  4.30  /dev/sda
[11:0:0:1]   disk    SYNOLOGY  Storage          4.0   /dev/sdb
[12:0:0:1]   disk    SYNOLOGY  Storage          4.0   /dev/sdc
[13:0:0:1]   disk    SYNOLOGY  Storage          4.0   /dev/sdd
[14:0:0:1]   disk    SYNOLOGY  Storage          4.0   /dev/sde
Code:
lsblk
NAME                 MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                    8:0    0 446.6G  0 disk
├─sda1                 8:1    0  1007K  0 part
├─sda2                 8:2    0   512M  0 part  /boot/efi
└─sda3                 8:3    0 446.1G  0 part
  ├─pve-swap         253:0    0     8G  0 lvm   [SWAP]
  ├─pve-root         253:1    0    96G  0 lvm   /
  ├─pve-data_tmeta   253:2    0   3.3G  0 lvm
  │ └─pve-data       253:4    0 319.6G  0 lvm
  └─pve-data_tdata   253:3    0 319.6G  0 lvm
    └─pve-data       253:4    0 319.6G  0 lvm
sdb                    8:16   0  20.9T  0 disk
└─HDDmpath0          253:5    0  20.9T  0 mpath
sdc                    8:32   0   6.7T  0 disk
└─SSDmpath0          253:6    0   6.7T  0 mpath
sdd                    8:48   0  20.9T  0 disk
└─HDDmpath0          253:5    0  20.9T  0 mpath
sde                    8:64   0   6.7T  0 disk
└─SSDmpath0          253:6    0   6.7T  0 mpath
multipath -ll
Code:
multipath -ll
HDDmpath0 (36001405653161c1d1b01d4703da591db) dm-5 SYNOLOGY,Storage
size=21T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=30 status=active
  |- 11:0:0:1 sdb 8:16 active ready running
  `- 13:0:0:1 sdd 8:48 active ready running
SSDmpath0 (3600140537333770d1c41d4277db8eed4) dm-6 SYNOLOGY,Storage
size=6.7T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=30 status=active
  |- 12:0:0:1 sdc 8:32 active ready running
  `- 14:0:0:1 sde 8:64 active ready running

It may be worth noting that neither of the paths ever dropped offline for SSDmpath0, as far as I could tell, but SSDmpath0 had disappeared and didn't initially reappear until I stopped/started the multipath service.
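For anyone hitting the same thing, the stop/start mentioned above would be something along these lines. This is only a sketch: on PVE/Debian the service is multipathd, and multipath -r simply asks for the device maps to be reloaded.

Code:
# Restart the multipath daemon (restarting it is what re-created SSDmpath0 here)
systemctl restart multipathd

# Or just ask for the device maps to be reloaded without a full restart
multipath -r

# Confirm the maps exist again
multipath -ll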

 
It may be worth noting that neither of the paths ever dropped offline for SSDmpath0, as far as I could tell, but SSDmpath0 had disappeared and didn't initially reappear until I stopped/started the multipath service.
So you restarted the multipath daemon and everything is working now? Your physical and multipath devices seem to be present based on your output.


 
So you restarted the multipath daemon and everything is working now? Your physical and multipath devices seem to be present based on your output.
No. I restarted the multipath daemon, and SSDmpath0 reappeared. Everything that relied on it is still broken.

If I do a vgcfgrestore --list SynSSD, it'll show a handful of restore points.
If I then do vgcfgrestore SynSSD --test, it'll tell me this:

Code:
TEST MODE: Metadata will NOT be updated and volumes will not be (de)activated.
  WARNING: Couldn't find device with uuid Av9NVa-Xnc8-Cins-o1Rq-FUoz-p5Rh-8uGKfw.
  Cannot restore Volume Group SynSSD with 1 PVs marked as missing.
  Restore failed.

If I look at the contents of /etc/lvm/backup/SynSSD, I get:

Code:
# Generated by LVM2 version 2.03.11(2) (2021-01-08): Wed May 11 15:47:25 2022

contents = "Text Format Volume Group"
version = 1

description = "Created *after* executing '/sbin/lvcreate -aly -Wy --yes --size 528k --name vm-100-disk-1 --addtag pve-vm-100 SynSSD'"

creation_host = "node1"        # Linux node1 5.13.19-2-pve #1 SMP PVE 5.13.19-4 (Mon, 29 Nov 2021 12:10:09 +0100) x86_64
creation_time = 1652305645     # Wed May 11 15:47:25 2022

SynSSD {
        id = "NdyhX8-yMGW-zubR-TDFd-bWId-9jua-W2evZc"
        seqno = 3
        format = "lvm2"                 # informational
        status = ["RESIZEABLE", "READ", "WRITE"]
        flags = []
        extent_size = 8192              # 4 Megabytes
        max_lv = 0
        max_pv = 0
        metadata_copies = 0

        physical_volumes {

                pv0 {
                        id = "Av9NVa-Xnc8-Cins-o1Rq-FUoz-p5Rh-8uGKfw"
                        device = "/dev/mapper/SSDmpath0"        # Hint only

                        status = ["ALLOCATABLE"]
                        flags = []
                        dev_size = 14365491200  # 6.68945 Terabytes
                        pe_start = 2048
                        pe_count = 1753599      # 6.68945 Terabytes
                }
        }

        logical_volumes {

                vm-100-disk-0 {
                        id = "dbpWv6-lUPo-TLxm-YduL-P0dO-f4Mq-CPVes7"
                        status = ["READ", "WRITE", "VISIBLE"]
                        flags = []
                        tags = ["pve-vm-100"]
                        creation_time = 1652305645      # 2022-05-11 15:47:25 -0600
                        creation_host = "node1"
                        segment_count = 1

                        segment1 {
                                start_extent = 0
                                extent_count = 10240    # 40 Gigabytes

                                type = "striped"
                                stripe_count = 1        # linear

                                stripes = [
                                        "pv0", 0
                                ]
                        }
                }

                vm-100-disk-1 {
                        id = "AWgHdl-GRWM-RBpK-Kmh2-SR2c-9hts-40hCFd"
                        status = ["READ", "WRITE", "VISIBLE"]
                        flags = []
                        tags = ["pve-vm-100"]
                        creation_time = 1652305645      # 2022-05-11 15:47:25 -0600
                        creation_host = "node1"
                        segment_count = 1

                        segment1 {
                                start_extent = 0
                                extent_count = 1        # 4 Megabytes

                                type = "striped"
                                stripe_count = 1        # linear

                                stripes = [
                                        "pv0", 10240
                                ]
                        }
                }
        }
}

I have no idea what pv0 is, and nothing seems to know about UUID Av9NVa-Xnc8-Cins-o1Rq-FUoz-p5Rh-8uGKfw, yet vgcfgrestore seems to want it. I really don't know how to proceed from here...

 
Now that you have the disk back, you may want to run "systemctl try-reload-or-restart pvedaemon pveproxy pvestatd"; perhaps PVE will re-scan/activate your volume groups.
Beyond that, you should rerun "pvs, vgs, lvs" with the mpath device back and reassess what exactly is not working before you try to restore/change the configuration.
Posting the output of the above commands may also spark some ideas in the community.
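In other words, something like this (the same commands as above, just gathered into one sequence):

Code:
# Nudge the PVE services so they re-scan storage
systemctl try-reload-or-restart pvedaemon pveproxy pvestatd

# Then re-check what LVM actually sees
pvs
vgs
lvs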


 
I appreciate your help. Still no joy though..

Code:
pvs
  PV         VG  Fmt  Attr PSize   PFree
  /dev/sda3  pve lvm2 a--  446.12g 16.00g

vgs
  VG  #PV #LV #SN Attr   VSize   VFree
  pve   1   3   0 wz--n- 446.12g 16.00g

lvs
  LV   VG  Attr       LSize    Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  data pve twi-a-tz-- <319.60g             0.00   0.52                           
  root pve -wi-ao----   96.00g                                                   
  swap pve -wi-ao----    8.00g

I'm not sure if I should just create a new physical volume on the multipath device and then try and figure out how to change the UUID, or if I can change the UUID in the backup file to just look for whatever the UUID of the new physical volume is. I'm sure there's a much smarter approach, I just have no friggin clue what it is.

And I still have no idea what caused this issue in the first place..
 
OK, so presumably you now have the system in the state shown in comment #5.
The pvs and lsblk output shows no LVM structure on your mpath disks? What does "fdisk -l" think about those devices? What about "pvscan, vgscan, lvscan": has anything changed since you restarted multipath?

The only way to learn what happened is to carefully analyze the system log. Everything else would be wild guessing.


 
OK, so presumably you now have the system in the state shown in comment #5.
The pvs and lsblk output shows no LVM structure on your mpath disks?
That is correct.
What does "fdisk -l" think about those devices?
Code:
fdisk -l
Disk /dev/sda: 446.63 GiB, 479559942144 bytes, 936640512 sectors
Disk model: PERC H730P Mini
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: D5596B76-6862-474F-B1A6-2C8395D7392B

Device       Start       End   Sectors   Size Type
/dev/sda1       34      2047      2014  1007K BIOS boot
/dev/sda2     2048   1050623   1048576   512M EFI System
/dev/sda3  1050624 936640478 935589855 446.1G Linux LVM

Partition 1 does not start on physical sector boundary.


Disk /dev/mapper/pve-swap: 8 GiB, 8589934592 bytes, 16777216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/mapper/pve-root: 96 GiB, 103079215104 bytes, 201326592 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sdc: 6.69 TiB, 7355131494400 bytes, 14365491200 sectors
Disk model: Storage         
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 1048576 bytes


Disk /dev/sdb: 20.95 TiB, 23030688382976 bytes, 44981813248 sectors
Disk model: Storage         
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 1048576 bytes


Disk /dev/mapper/HDDmpath0: 20.95 TiB, 23030688382976 bytes, 44981813248 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 1048576 bytes


Disk /dev/mapper/SSDmpath0: 6.69 TiB, 7355131494400 bytes, 14365491200 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 1048576 bytes


Disk /dev/sdd: 20.95 TiB, 23030688382976 bytes, 44981813248 sectors
Disk model: Storage         
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 1048576 bytes


Disk /dev/sde: 6.69 TiB, 7355131494400 bytes, 14365491200 sectors
Disk model: Storage         
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 1048576 bytes
What about "pvscan, vgscan, lvscan " has anything changed since you restarted the multipath?
Nope.
Code:
pvscan
  PV /dev/sda3   VG pve             lvm2 [446.12 GiB / 16.00 GiB free]
  Total: 1 [446.12 GiB] / in use: 1 [446.12 GiB] / in no VG: 0 [0   ]
 
vgscan
  Found volume group "pve" using metadata type lvm2
 
lvscan
  ACTIVE            '/dev/pve/swap' [8.00 GiB] inherit
  ACTIVE            '/dev/pve/root' [96.00 GiB] inherit
  ACTIVE            '/dev/pve/data' [<319.60 GiB] inherit
The only way to learn what happened is to carefully analyze the system log. Everything else would be wild guessing.
Yeaah, I'm working my way through the last few months of logs... it's just taking waaaay longer than I'd like because "alua: supports implicit TPGS" and other spammy messages make for a ton of friggin noise to sift through, and I suck at trying to filter out all the crap I don't wanna see :(
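In case it helps someone else wading through the same logs, a rough filter along these lines cuts most of the spam. The date range and grep patterns are only examples based on what's mentioned in this thread; adjust them to the actual window you need.

Code:
# Dump the relevant window, drop the ALUA spam, keep iscsi/multipath/device-mapper lines
journalctl --since "2022-08-01" --until "2022-09-01" \
    | grep -v "supports implicit TPGS" \
    | grep -Ei "iscsi|multipath|device-mapper"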

 
Based on all the information, you are indeed in the recovery phase now. You may want to take a snapshot of your disks on the storage device. Beyond that, look into recovery articles, such as https://serverfault.com/questions/1016744/recover-deleted-lvm-signature
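For the record, the usual recovery path for a PV whose LVM label has gone missing looks roughly like the sketch below. This is untested against this particular setup; it reuses the UUID and backup file quoted earlier in the thread, and should only be attempted after snapshotting the LUN, since these commands rewrite on-disk metadata.

Code:
# Recreate the PV label on the multipath device using the UUID from the backup
pvcreate --uuid "Av9NVa-Xnc8-Cins-o1Rq-FUoz-p5Rh-8uGKfw" \
         --restorefile /etc/lvm/backup/SynSSD \
         /dev/mapper/SSDmpath0

# Restore the VG metadata from the same backup, then activate it
vgcfgrestore SynSSD
vgchange -ay SynSSD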
good luck


Thank you. I appreciate your time and help. I'm sure I'll add more to this if/when I either solve it, or keep smashing my head against the wall.
 
As a final update, I never ended up figuring out either the cause or the solution to this.

I just nuked all my network storage and started over. Not really an ideal solution in a production environment, but what can ya do..
 
