Two Node Cluster with iSCSI

wdatkinson

New Member · Mar 16, 2021
What I have:
2 x HP DL380 G7s
1 x EMC SAN presenting 4.7TB via iSCSI
Cisco-based network

Each server has four NICs connected as follows (a rough sketch of the bridge configuration is shown after the list):
enp3s0f0/vmbr0 - Management
enp3s0f1/vmbr1 - 802.1q trunk for VM connectivity
enp4s0f0/vmbr2 - iSCSI (an L2-only VLAN, no routing in or out)
enp4s0f1/vmbr3 - Migration (an L2-only VLAN, no routing in or out)
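
For context, a minimal sketch of what the corresponding /etc/network/interfaces entries might look like for two of these bridges. The addresses below are placeholders (only the 10.5.180.0/24 storage subnet is mentioned later in the thread), not the actual configuration:

Code:
auto vmbr0
iface vmbr0 inet static
    address 10.5.10.11/24        # management address (placeholder)
    gateway 10.5.10.1            # placeholder gateway
    bridge-ports enp3s0f0
    bridge-stp off
    bridge-fd 0

auto vmbr2
iface vmbr2 inet static
    address 10.5.180.11/24       # iSCSI, L2-only VLAN, no gateway configured
    bridge-ports enp4s0f0
    bridge-stp off
    bridge-fd 0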

The EMC SAN has two NICs configured:
NIC1 - Management and SMB/CIFS
NIC4 - iSCSI (an L2-only VLAN, no routing in or out)

From a connectivity standpoint, both hosts can ping each other on all three networks in question (Management/iSCSI/Migration).

I built up host #1, managed to figure out how to get iSCSI working, and have ~50 VMs currently running with their disks stored on the SAN. After this had run for a few days, I built the cluster on this node.

I then built host #2 and, once I had the networking configured, joined it to the cluster. This is where the problems started.

When the join was complete, I could see the storage in node #2's web interface, but it was not mounted. So I used lsblk and blkid to get my UUIDs and then added lines to /etc/fstab to mount the iSCSI filesystem to a directory (/mnt/emc-nas). After doing that, I could see the VMs' disks on host #2, so I attempted a migration. I was warned that it was making a local copy of the disk images, and it started a transfer. This made no sense, as the disks were clearly already there.
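
For reference, the steps described above amount to roughly the following (the UUID and device names are placeholders; as the replies below point out, mounting a shared LUN like this is not safe):

Code:
lsblk                                   # identify the iSCSI disk, e.g. /dev/sdb
blkid /dev/sdb1                         # read the filesystem UUID of its partition
mkdir -p /mnt/emc-nas
# /etc/fstab entry (placeholder UUID):
# UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /mnt/emc-nas  ext4  defaults,_netdev  0  2
mount /mnt/emc-nas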

I've been messing around with it since, based on things I've been able to find on Google or in these forums. That leads me to some questions/observations:

1) Can what I'm trying to do be done from the GUI, or will it be a mix of shell work and the GUI? I'm very comfortable in a shell, albeit not necessarily with PVE. When there is a GUI, I tend to hesitate to jack around at the shell level, because I've seen issues in the past doing that on other platforms.

2) Do the iSCSI devices (/dev/sdX) have to be the same between hosts? For instance, the iSCSI LUN happens to be /dev/sdb on each node. Since I mount the iSCSI drive by UUID, the answer would seem to be no, but I found a couple of posts in my searching that indicated that might not be the case. (See the by-id example at the end of this post.)

3) Do I need an LVM for the iSCSI? Not LVM over iSCSI, but a straight iSCSI storage entry placed in an LVM. Right now I'm running without LVM.

I think that's enough info to get things started.
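
Regarding question 2, a quick way to see device names that stay stable across nodes, independent of the /dev/sdX letters (purely illustrative):

Code:
ls -l /dev/disk/by-id/ | grep scsi-
# the scsi-<WWID> symlinks point at whatever /dev/sdX the kernel assigned,
# so the by-id path is the same on every node even if the letters differ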
 
If I read your post correctly, you have a 'normal' filesystem (e.g. ext4/xfs) on your iSCSI LUN and have mounted it on multiple nodes?

This will never work correctly, and sooner or later the filesystem will become corrupted.

The correct way to use a LUN on multiple nodes is either to use LVM (our cluster stack can share this safely)
or to use a cluster-aware filesystem like OCFS2 or GFS2 (not to be confused with GlusterFS; that's a totally different thing).
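
For illustration, the LVM route could look roughly like this, run on one node only (the device path, VG name, and storage ID are example placeholders; with multipath in play you would point at the /dev/mapper device instead, as discussed further down):

Code:
pvcreate /dev/disk/by-id/scsi-<lun-wwid>            # placeholder device path
vgcreate vg_emc /dev/disk/by-id/scsi-<lun-wwid>
# add it to the cluster as shared LVM storage (or via Datacenter -> Storage in the GUI):
pvesm add lvm emc-lvm --vgname vg_emc --shared 1 --content images,rootdir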
 
Yes, I am using ext4, mainly because I couldn't create an LVM on the iSCSI LUN.

I created a new iSCSI instance on the EMC and have been experimenting with it. I have it added as storage, but I cannot create an LVM on it. When I try, I get:

create storage failed: device '/dev/disk/by-id/scsi-35005907f8049da9c' is already used by volume group '[unknown]' (500)

If this is a new iSCSI instance that I just added, how can it be in use? Attached is a screenshot of the LVM creation dialog. Issuing vgdisplay only shows the pve VG.
 

Attachments

  • DeepinScreenshot_select-area_20210617090712.png (26.4 KB)
Can you post the output of the following commands? (To check what might be using it / what triggers that error.)
Code:
pvs
pvs /dev/disk/by-id/scsi-35005907f8049da9c
vgs
lvs
lsblk
 
Posting output from both nodes, as there are differences.

Code:
root@proxmox-1:~# pvs
WARNING: Not using device /dev/sde for PV l12ZPk-uvn7-5B0E-4eIh-XdGA-IPe2-eGsGs8.
WARNING: PV l12ZPk-uvn7-5B0E-4eIh-XdGA-IPe2-eGsGs8 prefers device /dev/sdd because device was seen first.
PV VG Fmt Attr PSize PFree
/dev/sda3 pve lvm2 a-- <14.09g 1.75g
/dev/sdd lvm2 --- <4.69t <4.69t
root@proxmox-1:~# pvs /dev/disk/by-id/scsi-35005907f8049da9c
WARNING: Not using device /dev/sde for PV l12ZPk-uvn7-5B0E-4eIh-XdGA-IPe2-eGsGs8.
WARNING: PV l12ZPk-uvn7-5B0E-4eIh-XdGA-IPe2-eGsGs8 prefers device /dev/sdd because device was seen first.
PV VG Fmt Attr PSize PFree
/dev/sde [unknown] lvm2 d-- 0 0
root@proxmox-1:~# vgs
WARNING: Not using device /dev/sde for PV l12ZPk-uvn7-5B0E-4eIh-XdGA-IPe2-eGsGs8.
WARNING: PV l12ZPk-uvn7-5B0E-4eIh-XdGA-IPe2-eGsGs8 prefers device /dev/sdd because device was seen first.
VG #PV #LV #SN Attr VSize VFree
pve 1 3 0 wz--n- <14.09g 1.75g
root@proxmox-1:~# lvs
WARNING: Not using device /dev/sde for PV l12ZPk-uvn7-5B0E-4eIh-XdGA-IPe2-eGsGs8.
WARNING: PV l12ZPk-uvn7-5B0E-4eIh-XdGA-IPe2-eGsGs8 prefers device /dev/sdd because device was seen first.
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
data pve twi-a-tz-- <5.09g 0.00 1.57
root pve -wi-ao---- 3.50g
swap pve -wi-ao---- 1.75g
root@proxmox-1:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 1 14.6G 0 disk
├─sda1 8:1 1 1007K 0 part
├─sda2 8:2 1 512M 0 part
└─sda3 8:3 1 14.1G 0 part
├─pve-swap 253:0 0 1.8G 0 lvm [SWAP]
├─pve-root 253:1 0 3.5G 0 lvm /
├─pve-data_tmeta 253:2 0 1G 0 lvm
│ └─pve-data 253:4 0 5.1G 0 lvm
└─pve-data_tdata 253:3 0 5.1G 0 lvm
└─pve-data 253:4 0 5.1G 0 lvm
sdb 8:16 0 4.7T 0 disk
sdc 8:32 0 4.7T 0 disk
sdd 8:48 0 4.7T 0 disk
sde 8:64 0 4.7T 0 disk
sr0 11:0 1 1024M 0 rom
root@proxmox-1:~#

Code:
root@proxmox-2:~# pvs
WARNING: Not using device /dev/sde for PV l12ZPk-uvn7-5B0E-4eIh-XdGA-IPe2-eGsGs8.
WARNING: PV l12ZPk-uvn7-5B0E-4eIh-XdGA-IPe2-eGsGs8 prefers device /dev/sdd because device was seen first.
PV VG Fmt Attr PSize PFree
/dev/sda3 pve lvm2 a-- <558.38g 15.99g
/dev/sdd lvm2 --- <4.69t <4.69t
root@proxmox-2:~# pvs /dev/disk/by-id/scsi-35005907f8049da9c
WARNING: Not using device /dev/sde for PV l12ZPk-uvn7-5B0E-4eIh-XdGA-IPe2-eGsGs8.
WARNING: PV l12ZPk-uvn7-5B0E-4eIh-XdGA-IPe2-eGsGs8 prefers device /dev/sdd because device was seen first.
PV VG Fmt Attr PSize PFree
/dev/sdd lvm2 --- <4.69t <4.69t
root@proxmox-2:~# vgs
WARNING: Not using device /dev/sde for PV l12ZPk-uvn7-5B0E-4eIh-XdGA-IPe2-eGsGs8.
WARNING: PV l12ZPk-uvn7-5B0E-4eIh-XdGA-IPe2-eGsGs8 prefers device /dev/sdd because device was seen first.
VG #PV #LV #SN Attr VSize VFree
pve 1 5 0 wz--n- <558.38g 15.99g
root@proxmox-2:~# lvs
WARNING: Not using device /dev/sde for PV l12ZPk-uvn7-5B0E-4eIh-XdGA-IPe2-eGsGs8.
WARNING: PV l12ZPk-uvn7-5B0E-4eIh-XdGA-IPe2-eGsGs8 prefers device /dev/sdd because device was seen first.
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
data pve twi-aotz-- <429.61g 1.70 0.48
root pve -wi-ao---- 96.00g
swap pve -wi-ao---- 8.00g
vm-100-disk-0 pve Vwi-a-tz-- 16.00g data 42.08
vm-201-disk-0 pve Vwi-a-tz-- 10.00g data 5.71
root@proxmox-2:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 558.9G 0 disk
├─sda1 8:1 0 1007K 0 part
├─sda2 8:2 0 512M 0 part
└─sda3 8:3 0 558.4G 0 part
├─pve-swap 253:0 0 8G 0 lvm [SWAP]
├─pve-root 253:1 0 96G 0 lvm /
├─pve-data_tmeta 253:2 0 4.4G 0 lvm
│ └─pve-data-tpool 253:4 0 429.6G 0 lvm
│ ├─pve-data 253:5 0 429.6G 0 lvm
│ ├─pve-vm--100--disk--0 253:6 0 16G 0 lvm
│ └─pve-vm--201--disk--0 253:8 0 10G 0 lvm
└─pve-data_tdata 253:3 0 429.6G 0 lvm
└─pve-data-tpool 253:4 0 429.6G 0 lvm
├─pve-data 253:5 0 429.6G 0 lvm
├─pve-vm--100--disk--0 253:6 0 16G 0 lvm
└─pve-vm--201--disk--0 253:8 0 10G 0 lvm
sdb 8:16 0 4.7T 0 disk
└─sdb1 8:17 0 4.7T 0 part /mnt/emc-nas
sdc 8:32 0 4.7T 0 disk
sdd 8:48 0 4.7T 0 disk
sde 8:64 0 4.7T 0 disk
sr0 11:0 1 1024M 0 rom
root@proxmox-2:~#
 
Based on your output, it seems that you have a single 4.7TB disk presented via four paths to each host, and it is also mounted on one of the hosts.
As @dcsapak said, you are seconds away from data corruption.

You need to get rid of that mount, filesystem, and partition. Then you need to configure multipath properly, so that you end up with a single multipath (mpath) device that you can address and create your LVM on.
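
As a rough sketch of what that multipath setup could look like (the WWID is the one shown in the by-id name earlier in the thread; the alias and the rest are assumptions, so check the multipath documentation for your setup):

Code:
apt install multipath-tools

# minimal /etc/multipath.conf example (placeholder alias):
# multipaths {
#     multipath {
#         wwid   35005907f8049da9c
#         alias  emc-lun
#     }
# }

systemctl restart multipathd
multipath -ll                        # should show one mpath device with all four paths
pvcreate /dev/mapper/emc-lun         # build the LVM on the multipath device, not on /dev/sdX
vgcreate vg_emc /dev/mapper/emc-lun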
 
I do not have multipath. The only connection to the SAN portal IP, as defined in the iSCSI dialog, is via a Layer 2 VLAN. That said, I do see two connections on the SAN for each node, which would seem to imply multipath. But based on the network design, the unexpected connection is from the Proxmox management interface, which is obviously Layer 3. That shouldn't be possible.

My storage network is on 10.5.180.0/24 and it is not in the core switch's routing table:

Code:
core-switch#sh ip route 10.5.180.0
% Subnet not in table

I do have two 4.7TB iSCSI instances configured: one that all VMs are running on today, albeit as ext4 with no LVM, and another that I was building with LVM and that would eventually become the primary store for both nodes.

Initially, when I found the two connections per node, I also thought multipath; however, after installing the multipath tools, they reported no multipath devices present, so I uninstalled them.
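
One way to see which local interface each iSCSI session is actually using, and to pin the initiator to the storage bridge if needed (the iface record name 'storage0' is just an example):

Code:
iscsiadm -m session -P 3 | grep -E 'Target:|Iface Netdev|Current Portal'
# if a session is riding over the management interface, bind a dedicated iface record
# and log in to the target through it instead:
iscsiadm -m iface -I storage0 --op new
iscsiadm -m iface -I storage0 --op update -n iface.net_ifacename -v vmbr2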
 
