storage is not online 500 after upgrade

jamacdon

I was running 4.2 for quite some time. My shared storage is a Dell MD3620i with LVM over iSCSI.

Everything had been running fine until I upgraded to 4.4 using the community updates. The servers now report that the storage is not online, and I get an error 500 when I click on the storage device in the dashboard.

I can see the storage devices on the command line using vgs and lvs, however I am unable to start containers.

I have the MD3620i configured with multipath, which appears to be reporting properly.

I am unable to start any containers or VMs and am worried this may be causing some corruption.
I am also concerned that this may be a bug in the stable repository. Or is there a technical change that was not documented, or that I overlooked?

Any help to identify the problem would be greatly appreciated.
Thank you
Joe

Here are the multipath results:
# multipath -ll
md36xxi0_dg1_vd1 (36d4ae5200097a2c800000e5a5683f025) dm-3 DELL,MD36xxi
size=1.1T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
`-+- policy='round-robin 0' prio=11 status=active
|- 6:0:0:0 sdb 8:16 active ready running
|- 7:0:0:0 sdc 8:32 active ready running
|- 8:0:0:0 sdd 8:48 active ready running
`- 9:0:0:0 sde 8:64 active ready running
md36xxi0_dg2_storage1 (36d4ae5200097a36a000010595683abc5) dm-4 DELL,MD36xxi
size=5.5T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
`-+- policy='round-robin 0' prio=11 status=active
|- 6:0:0:1 sdf 8:80 active ready running
|- 7:0:0:1 sdg 8:96 active ready running
|- 8:0:0:1 sdi 8:128 active ready running
`- 9:0:0:1 sdh 8:112 active ready running

Here are the vgs results:
# vgs
VG        #PV  #LV  #SN  Attr    VSize    VFree
pve         1    3    0  wz--n-  465.63g        0
storage0    1   19    0  wz--n-    1.09t  335.58g
storage1    1    8    0  wz--n-    5.46t    4.86t

Here are the lvs results:
# lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
data pve -wi-ao---- 361.63g
root pve -wi-ao---- 96.00g
swap pve -wi-ao---- 8.00g
vm-100-disk-1 storage0 -wi------- 50.00g
vm-103-disk-1 storage0 -wi------- 8.00g
vm-105-disk-2 storage0 -wi------- 10.00g
vm-106-disk-1 storage0 -wi------- 12.00g
vm-107-disk-1 storage0 -wi------- 10.00g
vm-108-disk-1 storage0 -wi------- 10.00g
vm-109-disk-1 storage0 -wi------- 50.00g
vm-110-disk-1 storage0 -wi------- 60.00g
vm-111-disk-1 storage0 -wi------- 50.00g
vm-113-disk-1 storage0 -wi------- 20.00g
vm-114-disk-1 storage0 -wi------- 60.00g
vm-115-disk-1 storage0 -wi------- 15.00g
vm-117-disk-1 storage0 -wi------- 15.00g
vm-121-disk-1 storage0 -wi------- 64.00g
vm-122-disk-1 storage0 -wi------- 8.00g
vm-200-disk-1 storage0 -wi------- 10.00g
vm-217-disk-1 storage0 -wi------- 64.00g
vm-317-disk-1 storage0 -wi------- 64.00g
vm-601-disk-1 storage0 -wi------- 200.00g
vm-101-disk-1 storage1 -wi------- 15.00g
vm-102-disk-1 storage1 -wi------- 15.00g
vm-116-disk-1 storage1 -wi------- 30.00g
vm-119-disk-1 storage1 -wi------- 260.00g
vm-123-disk-1 storage1 -wi------- 30.00g
vm-124-disk-1 storage1 -wi------- 50.00g
vm-125-disk-1 storage1 -wi------- 15.00g
vm-501-disk-1 storage1 -wi------- 200.00g
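
In the lvs output above, the fifth character of the Attr column is the activation state: the pve volumes show "a" (active) while every volume on storage0 and storage1 shows "-". A minimal sketch for listing the inactive volumes follows. The function name inactive_lvs is made up for illustration, and it assumes the LV name, VG name, and attribute string are the first three columns, as in the listing above. Note that on a shared LVM storage an inactive volume may simply belong to a stopped guest, so this is a hint, not proof of a fault.

```shell
# List logical volumes whose lv_attr state flag (5th character) is not "a".
# Reads lvs-style output on stdin: LV name, VG name, attribute string, ...
# Header lines are skipped because "Attr" is shorter than five characters.
inactive_lvs() {
    awk 'NF >= 3 && length($3) >= 5 && substr($3, 5, 1) != "a" { print $2 "/" $1 }'
}

# Usage on a live node (not executed here):
#   lvs --noheadings -o lv_name,vg_name,lv_attr | inactive_lvs
```

Each line printed is in VG/LV form, for example storage0/vm-100-disk-1.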

Here is a tail of syslog showing the error:
# tail /var/log/syslog
Jan 23 00:58:08 sv7n2 pvestatd[23783]: status update time (6.062 seconds)
Jan 23 00:58:14 sv7n2 pvestatd[23783]: storage 'md3620i_dg1_vg1' is not online
Jan 23 00:58:16 sv7n2 pvestatd[23783]: storage 'md3620i_dg1_vg1' is not online
Jan 23 00:58:18 sv7n2 pvestatd[23783]: storage 'md3620i_dg1_vg1' is not online
Jan 23 00:58:18 sv7n2 pvestatd[23783]: status update time (6.059 seconds)
Jan 23 00:58:24 sv7n2 pvestatd[23783]: storage 'md3620i_dg1_vg1' is not online
 
Hi
What is the output of pvesm status?
 
As requested

# pvesm status
storage 'md3620i_dg1_vg1' is not online
storage 'md3620i_dg1_vg1' is not online
storage 'md3620i_dg1_vg1' is not online
local dir 1 98952796 18080684 75822564 19.75%
local-data dir 1 373116720 20043116 334097248 6.16%
md3620i_dg1_vg1 iscsi 0 0 0 0 100.00%
nfs_datastore01 nfs 1 9607442432 4143456256 4975949824 45.94%
storage0 lvm 0 0 0 0 100.00%
storage1 lvm 0 0 0 0 100.00%
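
The third column in the output above appears to be an active flag (1 for local and nfs_datastore01, 0 for the iSCSI storage and both LVM storages that depend on it). A small sketch for pulling out only the inactive entries follows. The function name inactive_storages is hypothetical, and the column layout is assumed from the output shown here; other Proxmox releases may print the status differently.

```shell
# Print storages whose active flag (3rd column) is 0.
# Reads `pvesm status` output on stdin; assumes the seven-column layout
# shown in this thread: name, type, active, total, used, available, percent.
# The "storage '...' is not online" warning lines have fewer fields and are skipped.
inactive_storages() {
    awk 'NF >= 7 && $3 == "0" { print $1 " (" $2 ")" }'
}

# Usage on a live node (not executed here):
#   pvesm status 2>/dev/null | inactive_storages
```

For the output in this post, that would list md3620i_dg1_vg1, storage0, and storage1, which fits the LVM storages being marked offline because their iSCSI base is.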
 
Could it be that the iSCSI connection is not active or not working?

Verify that you are logged in to your iSCSI portal with

iscsiadm -m session -P 1 | grep 'iSCSI.*State'

It should display LOGGED_IN. If not, you need to fix your iSCSI connection.
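
With several sessions (one per path), it is easy to miss a single bad line when reading the grep output by eye. A minimal sketch that turns the check into a pass/fail result follows. The function name all_sessions_logged_in is made up for illustration, and it only looks at the "iSCSI Session State" lines of the iscsiadm output.

```shell
# Succeed only if at least one session is reported and every
# "iSCSI Session State" line reads LOGGED_IN.
# Reads `iscsiadm -m session -P 1` output on stdin.
all_sessions_logged_in() {
    awk -F': *' '
        /iSCSI Session State/ {
            state = $2
            sub(/[ \t\r]+$/, "", state)   # drop trailing whitespace
            total++
            if (state != "LOGGED_IN") bad++
        }
        END { if (total > 0 && bad == 0) exit 0; exit 1 }
    '
}

# Usage on a live node (not executed here):
#   iscsiadm -m session -P 1 | all_sessions_logged_in && echo "all sessions logged in"
```

A passing result only says the initiator still has its sessions. It does not say that the portal address configured in storage.cfg is reachable.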
 
iSCSI is working. I had already checked that, but here are the results:

# iscsiadm -m session -P 1 | grep 'iSCSI.*State'
iSCSI Connection State: LOGGED IN
iSCSI Session State: LOGGED_IN
iSCSI Connection State: LOGGED IN
iSCSI Session State: LOGGED_IN
iSCSI Connection State: LOGGED IN
iSCSI Session State: LOGGED_IN
iSCSI Connection State: LOGGED IN
iSCSI Session State: LOGGED_IN

Could it be a conflict in naming convention?

In /etc/pve/storage.cfg I see the following:
iscsi: md3620i_dg1_vg1
portal 10.3.0.200
target iqn.1984-05.com.dell:powervault.md3600i.6d4ae5200097a36a0000000052441d68
content none

lvm: storage0
vgname storage0
base md3620i_dg1_vg1:0.0.0.scsi-36d4ae5200097a2c800000e5a5683f025
shared
content images,rootdir

In my multipath I see
# multipath -ll
md36xxi0_dg1_vd1 (36d4ae5200097a2c800000e5a5683f025) dm-3 DELL,MD36xxi
size=1.1T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
`-+- policy='round-robin 0' prio=11 status=active
|- 6:0:0:0 sdb 8:16 active ready running
|- 7:0:0:0 sdc 8:32 active ready running
|- 8:0:0:0 sdd 8:48 active ready running
`- 9:0:0:0 sde 8:64 active ready running

The multipath name (md36xxi0_dg1_vd1) and the storage0 base name (md3620i_dg1_vg1) do not match exactly. It has been like this for the last 6 months, and the problems only appeared after upgrading to 4.4. I would expect the path to be based on the WWID, not on the human-readable names.

Systems that are running appear to be working fine, but I have 3 containers down and I am worried about corruption.
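
In the config above, md3620i_dg1_vg1 is the ID of the iscsi storage entry, while md36xxi0_dg1_vd1 is the multipath alias, so the two names would not be expected to match. The value that can be compared is the WWID: it appears after "scsi-" in the base line and in parentheses in the multipath -ll header. A rough sketch of that comparison follows. The function names base_wwid and multipath_wwids are made up for illustration, and the multipath pattern assumes an alias is configured, as in the output shown here.

```shell
# Extract the WWID from an LVM "base" line in storage.cfg, e.g.
#   base md3620i_dg1_vg1:0.0.0.scsi-36d4ae52...
base_wwid() {
    sed -n 's/^[[:space:]]*base[[:space:]].*scsi-\([0-9a-f]*\).*$/\1/p'
}

# Extract the WWIDs (the value in parentheses) from `multipath -ll` header lines.
multipath_wwids() {
    sed -n 's/^[^ ]* (\([0-9a-f]*\)) dm-[0-9]*.*$/\1/p'
}

# Usage on a live node (not executed here):
#   base_wwid < /etc/pve/storage.cfg
#   multipath -ll | multipath_wwids
```

For the values posted in this thread, both sides give 36d4ae5200097a2c800000e5a5683f025, so the storage0 base does point at the same LUN that multipath reports.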
 
It looks like I may have a failed controller on the Dell iSCSI server. Multipath is keeping the running machines working by routing data over the remaining paths, but it looks like Proxmox uses a single IP address to determine the status of the volumes. As the primary path is down, I am unable to start or stop servers.

Is it possible for me to change the "portal 10.3.0.200" in storage.cfg to "portal 10.3.0.201" without having adverse side effects on currently running machines?
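
Before editing storage.cfg, it may help to confirm which portal addresses actually answer. The sketch below assumes that the "not online" state comes from a plain TCP check against the configured portal, which fits the behaviour described in this post but is not confirmed here. It also assumes nc (netcat) is installed; probe_portal and first_reachable_portal are hypothetical helper names. Port 3260 is the default iSCSI port.

```shell
# Probe one portal with a 2 second timeout on the default iSCSI port.
# Assumes nc (netcat) is available on the node.
probe_portal() {
    nc -z -w 2 "$1" 3260 >/dev/null 2>&1
}

# Print the first portal from the argument list that answers.
# Returns non-zero if none of them does.
first_reachable_portal() {
    for portal in "$@"; do
        if probe_portal "$portal"; then
            printf '%s\n' "$portal"
            return 0
        fi
    done
    return 1
}

# Usage on a live node (not executed here):
#   first_reachable_portal 10.3.0.200 10.3.0.201
```

If only 10.3.0.201 answers, that would support the failed-controller theory, independent of what the already established iSCSI sessions report.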

Joe
 
I have confirmed that one of the controllers on the iSCSI server locked up. After a reset/reboot it came back up and everything recovered.

It is great that multipath worked for my running machines, however I was still having issues due to 3 machines being offline and unable to reboot.

It seems that Proxmox only gets its information regarding storage from the original IP address put in the iSCSI connection. It would be great if an alternate address could be entered for a true iSCSI failover scenario.

Joe
 
Interesting... This may be related to an issue that I've been chasing for a few days now.

I'm using NFS with a cluster of 4 servers. Originally I had the storage configured with one of its 1G interfaces just to test things before going live. After making sure everything was running fine, I configured a 10G interface on the same storage device (with a new IP). Since I was using an FQDN for the server in storage.cfg and there were now TWO (multiple) IP addresses registered to the SAME NFS SHARE, I started receiving the same "storage is not online" messages in the syslog that you mentioned above.

If both interfaces are connected, everything is good. As soon as one of them is disconnected, or something on ONE path fails, you'll start seeing these errors.
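
To see whether an FQDN used in storage.cfg resolves to more than one address, the resolver output can be reduced to its distinct addresses. A minimal sketch follows. The function name distinct_addresses and the hostname in the usage line are made up for illustration.

```shell
# Reduce `getent ahosts` output (address, socket type, name) to the
# distinct addresses, one per line.
distinct_addresses() {
    awk 'NF { print $1 }' | sort -u
}

# Usage on a live node (not executed here; the hostname is a placeholder):
#   getent ahosts nfs-storage.example.internal | distinct_addresses
```

More than one line of output means the name maps to several interfaces, which is the setup described above.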

Another symptom in the GUI is if you look at one of your node's summary page of the affected storage you'll see the "Active" information quickly fluttering between YES and NO.

So... you're right. Proxmox doesn't seem to handle multiple storage paths very well.

Thoughts from someone on the Proxmox side?

Thanks!
<D>
 
