Where did my drive go?

bzdigblig

Member
Aug 6, 2021
I have a couple of clustered Proxmox nodes that are using shared storage. The shared storage uses LVM on an iSCSI LUN, connected via a couple of multipath iSCSI connections, and it's all been working fine for the last couple of months...

I noticed that the LVM storage was showing a question mark on its disk icon, and mousing over it just showed "Status: Unknown".

When I started trying to figure out what the deal was, I checked my multipath connections and found that my multipath device didn't even exist anymore. The physical volumes/volume groups that depended on that device had disappeared as well, which I'm guessing is what was causing that weird "Status: Unknown" on the LVM storage.

pvscan, vgscan, lvscan only show my local Proxmox info...nothing to do with the iSCSI storage at all.

If I check /etc/lvm/backup or /etc/lvm/archive, I can see the info for my missing VG.

I have two questions:

What could have possibly caused this?
What's the best way forward without losing any data?
 
You appear to have indirectly confirmed that your iSCSI storage is no longer functioning properly - i.e., multipath and LVM are not there.
What you should do next is investigate your iSCSI storage:
1) Are you using the PVE iSCSI plugin to establish the connection, or external tooling, e.g. iscsiadm?
2) Is your iSCSI storage device up?
3) Can you ping it?
4) Are there any iscsi sessions: iscsiadm -m node ; iscsiadm -m session
5) What is the content of your /etc/pve/storage.cfg? Does it look correct to you?
6) Are there any kernel/iscsi events in journalctl or /var/log/messages? You may need to go back a bit.

There may be other information required, but this should be a good start
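Collecting those checks into one place, a rough sketch of the shell side might look like this. It is only a sketch: the NAS address is a placeholder, and the time window and unit name (iscsid) may need adjusting for your setup.

Code:
# Is the storage device reachable? (replace <nas-ip> with your NAS address)
ping -c 3 <nas-ip>

# Configured iSCSI nodes and active sessions
iscsiadm -m node
iscsiadm -m session

# PVE storage definitions
cat /etc/pve/storage.cfg

# Recent iscsi/kernel events; widen the window as needed
journalctl -u iscsid --since "-14 days"
journalctl -k --since "-14 days" | grep -i iscsi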


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
You appear to have indirectly confirmed that your iSCSI storage is no longer functioning properly - i.e., multipath and LVM are not there.
What you should do next is investigate your iSCSI storage:
1) Are you using the PVE iSCSI plugin to establish the connection, or external tooling, e.g. iscsiadm?
I set up the iSCSI connections with the Proxmox GUI, and then followed https://pve.proxmox.com/wiki/ISCSI_Multipath to set up multipath
2) Is your iSCSI storage device up?
Yup
3) Can you ping it?
yup, it responds to pings just fine
4) Are there any iscsi sessions: iscsiadm -m node ; iscsiadm -m session
Yup. I've got 4 sessions in total: 2 for my SSD multipath connection, and 2 for my HDD multipath connection. I hadn't gotten the HDD stuff fully set up yet, so it's not part of the scope of this issue.
5) What is the content of your /etc/pve/storage.cfg? Does it look correct to you?
Yeah, it's all as I'd expect it to be. In this particular case, my busted LVM storage is LVMSynSSD, and the vgname SynSSD refers to my missing volume group.

Code:
lvm: LVMSynSSD
        vgname SynSSD
        content rootdir,images
        shared 1
6) Are there any kernel/iscsi events in journalctl or /var/log/messages? You may need to go back a bit.
I've been looking in those logs and have yet to find anything worthwhile, but I'm still looking.....parsing that stuff is AWFUL.

I've found a few of these...I really don't know if that has anything to do with the actual iSCSI connections to the NAS, or if it's just my test VM's iSCSI hard drive and it's not happy about something...

Code:
Aug 16 22:28:37 node1 iscsid[1146]: Kernel reported iSCSI connection 3:0 error (1022 - ISCSI_ERR_NOP_TIMEDOUT: A NOP has timed out) state (3)
Aug 16 22:28:39 node1 iscsid[1146]: connection3:0 is operational after recovery (1 attempts)
Aug 25 10:03:22 node1 iscsid[1146]: Kernel reported iSCSI connection 3:0 error (1022 - ISCSI_ERR_NOP_TIMEDOUT: A NOP has timed out) state (3)
Aug 25 10:03:24 node1 iscsid[1146]: connection3:0 is operational after recovery (1 attempts)
Aug 31 00:29:40 node1 iscsid[1146]: Kernel reported iSCSI connection 3:0 error (1022 - ISCSI_ERR_NOP_TIMEDOUT: A NOP has timed out) state (3)
Aug 31 00:29:42 node1 iscsid[1146]: connection3:0 is operational after recovery (1 attempts)

EDIT: This is from an iSCSI connection that's not part of the multipath config that has the issue, so I don't know that this info is really relevant...

There may be other information required, but this should be a good start


I checked the uptime on the NAS to see if it had rebooted somehow, and it's got 100+ days uptime. The NAS is a Synology UC3200, so it's got two independent redundant controllers that each have their own 10G connection, and each connection runs to its own switch. Each switch also has 100+ days uptime. Each Proxmox node has two 10G NICs, one to each switch, and each of the two iSCSI connections use separate NICs. So, if a NIC failed, or a switch rebooted, or anything like that, it would just drop the number of paths to 1, rather than causing an outright failure.
 
What is the output of:
lsscsi
Code:
lsscsi
[0:2:0:0]    disk    DELL      PERC H730P Mini  4.30  /dev/sda
[11:0:0:1]   disk    SYNOLOGY  Storage          4.0   /dev/sdb
[12:0:0:1]   disk    SYNOLOGY  Storage          4.0   /dev/sdc
[13:0:0:1]   disk    SYNOLOGY  Storage          4.0   /dev/sdd
[14:0:0:1]   disk    SYNOLOGY  Storage          4.0   /dev/sde
Code:
lsblk
NAME                 MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                    8:0    0 446.6G  0 disk
├─sda1                 8:1    0  1007K  0 part
├─sda2                 8:2    0   512M  0 part  /boot/efi
└─sda3                 8:3    0 446.1G  0 part
  ├─pve-swap         253:0    0     8G  0 lvm   [SWAP]
  ├─pve-root         253:1    0    96G  0 lvm   /
  ├─pve-data_tmeta   253:2    0   3.3G  0 lvm
  │ └─pve-data       253:4    0 319.6G  0 lvm
  └─pve-data_tdata   253:3    0 319.6G  0 lvm
    └─pve-data       253:4    0 319.6G  0 lvm
sdb                    8:16   0  20.9T  0 disk
└─HDDmpath0          253:5    0  20.9T  0 mpath
sdc                    8:32   0   6.7T  0 disk
└─SSDmpath0          253:6    0   6.7T  0 mpath
sdd                    8:48   0  20.9T  0 disk
└─HDDmpath0          253:5    0  20.9T  0 mpath
sde                    8:64   0   6.7T  0 disk
└─SSDmpath0          253:6    0   6.7T  0 mpath
multipath -ll
Code:
multipath -ll
HDDmpath0 (36001405653161c1d1b01d4703da591db) dm-5 SYNOLOGY,Storage
size=21T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=30 status=active
  |- 11:0:0:1 sdb 8:16 active ready running
  `- 13:0:0:1 sdd 8:48 active ready running
SSDmpath0 (3600140537333770d1c41d4277db8eed4) dm-6 SYNOLOGY,Storage
size=6.7T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=30 status=active
  |- 12:0:0:1 sdc 8:32 active ready running
  `- 14:0:0:1 sde 8:64 active ready running

It may be worth noting that neither of the paths ever dropped offline for SSDmpath0, as far as I could tell, but SSDmpath0 had disappeared and didn't initially reappear until I stopped/started the multipath service.
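For anyone hitting the same thing, the stop/start mentioned above would be something along these lines. This is only a sketch: on PVE/Debian the service is multipathd, and multipath -r simply asks for the device maps to be reloaded.

Code:
# Restart the multipath daemon (restarting it is what re-created SSDmpath0 here)
systemctl restart multipathd

# Or just ask for the device maps to be reloaded without a full restart
multipath -r

# Confirm the maps exist again
multipath -ll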

 
It may be worth noting that neither of the paths ever dropped offline for SSDmpath0, as far as I could tell, but SSDmpath0 had disappeared and didn't initially reappear until I stopped/started the multipath service.
So you restarted the multipath daemon and everything is working now? Your physical and multipath devices seem to be present based on your output.


 
So you restarted the multipath daemon and everything is working now? Your physical and multipath devices seem to be present based on your output.
No. I restarted the multipath daemon, and SSDmpath0 reappeared. Everything that relied on it is still broken.

If I do a vgcfgrestore --list SynSSD, it'll show a handful of restore points.
If I then do vgcfgrestore SynSSD --test, it'll tell me this:

Code:
TEST MODE: Metadata will NOT be updated and volumes will not be (de)activated.
  WARNING: Couldn't find device with uuid Av9NVa-Xnc8-Cins-o1Rq-FUoz-p5Rh-8uGKfw.
  Cannot restore Volume Group SynSSD with 1 PVs marked as missing.
  Restore failed.

If I look at the contents of /etc/lvm/backup/SynSSD, I get:

Code:
# Generated by LVM2 version 2.03.11(2) (2021-01-08): Wed May 11 15:47:25 2022

contents = "Text Format Volume Group"
version = 1

description = "Created *after* executing '/sbin/lvcreate -aly -Wy --yes --size 528k --name vm-100-disk-1 --addtag pve-vm-100 SynSSD'"

creation_host = "node1"        # Linux node1 5.13.19-2-pve #1 SMP PVE 5.13.19-4 (Mon, 29 Nov 2021 12:10:09 +0100) x86_64
creation_time = 1652305645     # Wed May 11 15:47:25 2022

SynSSD {
        id = "NdyhX8-yMGW-zubR-TDFd-bWId-9jua-W2evZc"
        seqno = 3
        format = "lvm2"                 # informational
        status = ["RESIZEABLE", "READ", "WRITE"]
        flags = []
        extent_size = 8192              # 4 Megabytes
        max_lv = 0
        max_pv = 0
        metadata_copies = 0

        physical_volumes {

                pv0 {
                        id = "Av9NVa-Xnc8-Cins-o1Rq-FUoz-p5Rh-8uGKfw"
                        device = "/dev/mapper/SSDmpath0"        # Hint only

                        status = ["ALLOCATABLE"]
                        flags = []
                        dev_size = 14365491200  # 6.68945 Terabytes
                        pe_start = 2048
                        pe_count = 1753599      # 6.68945 Terabytes
                }
        }

        logical_volumes {

                vm-100-disk-0 {
                        id = "dbpWv6-lUPo-TLxm-YduL-P0dO-f4Mq-CPVes7"
                        status = ["READ", "WRITE", "VISIBLE"]
                        flags = []
                        tags = ["pve-vm-100"]
                        creation_time = 1652305645      # 2022-05-11 15:47:25 -0600
                        creation_host = "node1"
                        segment_count = 1

                        segment1 {
                                start_extent = 0
                                extent_count = 10240    # 40 Gigabytes

                                type = "striped"
                                stripe_count = 1        # linear

                                stripes = [
                                        "pv0", 0
                                ]
                        }
                }

                vm-100-disk-1 {
                        id = "AWgHdl-GRWM-RBpK-Kmh2-SR2c-9hts-40hCFd"
                        status = ["READ", "WRITE", "VISIBLE"]
                        flags = []
                        tags = ["pve-vm-100"]
                        creation_time = 1652305645      # 2022-05-11 15:47:25 -0600
                        creation_host = "node1"
                        segment_count = 1

                        segment1 {
                                start_extent = 0
                                extent_count = 1        # 4 Megabytes

                                type = "striped"
                                stripe_count = 1        # linear

                                stripes = [
                                        "pv0", 10240
                                ]
                        }
                }
        }
}

I have no idea what pv0 is, and nothing seems to know about UUID Av9NVa-Xnc8-Cins-o1Rq-FUoz-p5Rh-8uGKfw, yet vgcfgrestore seems to want it. I really don't know how to proceed from here...

 
Now that you have the disk back, you may want to run "systemctl try-reload-or-restart pvedaemon pveproxy pvestatd"; perhaps PVE will re-scan/activate your volume groups.
Beyond that, you should rerun "pvs, vgs, lvs" with the mpath device back and reassess what exactly is not working before you try to restore/change the configuration.
Posting the output of the above commands may also spark some ideas in the community.
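In other words, something like this (the same commands as above, just gathered into one sequence):

Code:
# Nudge the PVE services so they re-scan storage
systemctl try-reload-or-restart pvedaemon pveproxy pvestatd

# Then re-check what LVM actually sees
pvs
vgs
lvs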


 
I appreciate your help. Still no joy though..

Code:
pvs
  PV         VG  Fmt  Attr PSize   PFree
  /dev/sda3  pve lvm2 a--  446.12g 16.00g

vgs
  VG  #PV #LV #SN Attr   VSize   VFree
  pve   1   3   0 wz--n- 446.12g 16.00g

lvs
  LV   VG  Attr       LSize    Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  data pve twi-a-tz-- <319.60g             0.00   0.52                           
  root pve -wi-ao----   96.00g                                                   
  swap pve -wi-ao----    8.00g

I'm not sure if I should just create a new physical volume on the multipath device and then try and figure out how to change the UUID, or if I can change the UUID in the backup file to just look for whatever the UUID of the new physical volume is. I'm sure there's a much smarter approach, I just have no friggin clue what it is.

And I still have no idea what caused this issue in the first place..
 
OK, so presumably you now have the system in the state shown in comment #5.
The pvs and lsblk output shows no LVM structure on your mpath disks? What does "fdisk -l" think about those devices? What about "pvscan, vgscan, lvscan": has anything changed since you restarted multipath?

The only way to learn what happened is to carefully analyze the system log. Everything else would be wild guessing.


 
OK, so presumably you now have the system in the state shown in comment #5.
The pvs and lsblk output shows no LVM structure on your mpath disks?
That is correct.
What does "fdisk -l" think about those devices?
Code:
fdisk -l
Disk /dev/sda: 446.63 GiB, 479559942144 bytes, 936640512 sectors
Disk model: PERC H730P Mini
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: D5596B76-6862-474F-B1A6-2C8395D7392B

Device       Start       End   Sectors   Size Type
/dev/sda1       34      2047      2014  1007K BIOS boot
/dev/sda2     2048   1050623   1048576   512M EFI System
/dev/sda3  1050624 936640478 935589855 446.1G Linux LVM

Partition 1 does not start on physical sector boundary.


Disk /dev/mapper/pve-swap: 8 GiB, 8589934592 bytes, 16777216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/mapper/pve-root: 96 GiB, 103079215104 bytes, 201326592 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sdc: 6.69 TiB, 7355131494400 bytes, 14365491200 sectors
Disk model: Storage         
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 1048576 bytes


Disk /dev/sdb: 20.95 TiB, 23030688382976 bytes, 44981813248 sectors
Disk model: Storage         
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 1048576 bytes


Disk /dev/mapper/HDDmpath0: 20.95 TiB, 23030688382976 bytes, 44981813248 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 1048576 bytes


Disk /dev/mapper/SSDmpath0: 6.69 TiB, 7355131494400 bytes, 14365491200 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 1048576 bytes


Disk /dev/sdd: 20.95 TiB, 23030688382976 bytes, 44981813248 sectors
Disk model: Storage         
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 1048576 bytes


Disk /dev/sde: 6.69 TiB, 7355131494400 bytes, 14365491200 sectors
Disk model: Storage         
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 1048576 bytes
What about "pvscan, vgscan, lvscan " has anything changed since you restarted the multipath?
Nope.
Code:
pvscan
  PV /dev/sda3   VG pve             lvm2 [446.12 GiB / 16.00 GiB free]
  Total: 1 [446.12 GiB] / in use: 1 [446.12 GiB] / in no VG: 0 [0   ]
 
vgscan
  Found volume group "pve" using metadata type lvm2
 
lvscan
  ACTIVE            '/dev/pve/swap' [8.00 GiB] inherit
  ACTIVE            '/dev/pve/root' [96.00 GiB] inherit
  ACTIVE            '/dev/pve/data' [<319.60 GiB] inherit
The only way to learn what happened is to carefully analyze the system log. Everything else would be wild guessing.
Yeaah, I'm working my way through the last few months of logs... it's just taking waaaay longer than I'd like because "alua: supports implicit TPGS" and other spammy messages make for a ton of friggin noise to sift through, and I suck at trying to filter out all the crap I don't wanna see :(
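In case it helps someone else wading through the same logs, a rough filter along these lines cuts most of the spam. The date range and grep patterns are only examples based on what's mentioned in this thread; adjust them to the actual window you need.

Code:
# Dump the relevant window, drop the ALUA spam, keep iscsi/multipath/device-mapper lines
journalctl --since "2022-08-01" --until "2022-09-01" \
    | grep -v "supports implicit TPGS" \
    | grep -Ei "iscsi|multipath|device-mapper"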

 
Based on all the information, you are indeed in the recovery phase now. You may want to take a snapshot of your disks on the storage device. Beyond that, look into recovery articles, such as https://serverfault.com/questions/1016744/recover-deleted-lvm-signature
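For the record, the usual recovery path for a PV whose LVM label has gone missing looks roughly like the sketch below. This is untested against this particular setup; it reuses the UUID and backup file quoted earlier in the thread, and should only be attempted after snapshotting the LUN, since these commands rewrite on-disk metadata.

Code:
# Recreate the PV label on the multipath device using the UUID from the backup
pvcreate --uuid "Av9NVa-Xnc8-Cins-o1Rq-FUoz-p5Rh-8uGKfw" \
         --restorefile /etc/lvm/backup/SynSSD \
         /dev/mapper/SSDmpath0

# Restore the VG metadata from the same backup, then activate it
vgcfgrestore SynSSD
vgchange -ay SynSSD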
good luck


Thank you. I appreciate your time and help. I'm sure I'll add more to this if/when I either solve it, or keep smashing my head against the wall.
 
As a final update, I never ended up figuring out either the cause or the solution to this.

I just nuked all my network storage and started over. Not really an ideal solution in a production environment, but what can ya do..
 
