Hello!
I'm running a PVE 5.2 cluster with 4 nodes. The cluster is attached to a SAN, an HP P2000 G3 iSCSI. VMs are hosted on the SAN.
The first controller of the SAN failed. Everything is running on the second controller, but I can't manage PVE anymore.
Although the VMs are still running, Proxmox seems to be stuck on the failed controller: it tries to manage the virtual disks only through the first SAN IP address (which belongs to the failed controller), probably because that is the "portal" set during the initial configuration.
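If it helps, this is roughly how I'm checking which portal each node is actually logged in to versus what is configured in storage.cfg (just a sketch; names and IPs are placeholders):
Code:
# portal configured for the iSCSI storage definition
grep -A 3 '^iscsi:' /etc/pve/storage.cfg

# portals/targets the node is actually logged in to right now
iscsiadm -m session -P 1 | grep -E 'Target|Portal'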
In particular:
1) The web interface is locked (no login possible), even after restarting pvestatd and other services (as suggested in other threads). The lockup affects 3 of the 4 nodes and happened in sequence; I'm fairly sure that if I start using the 4th node now, it will lock up as well.
2) I can log in via SSH, see the VMs running and manage them via qm monitor, but I cannot back them up or migrate them, because I receive the message
storage '<name>' is not online
3) I found in an old thread that PVE checks the availability of iSCSI storage with the command
iscsiadm -m session --rescan
In my case the command runs successfully (see also the portal reachability sketch right after this list).
Ping also works with all 4 SAN IPs.
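From what I can tell from the storage plugin code, the "is not online" state seems to come from a connectivity test against the configured portal on TCP port 3260 rather than from iscsiadm itself, so I'm testing reachability of each controller like this (a sketch; the IPs are placeholders, with <SAN_IP1>/<SAN_IP2> on the failed controller):
Code:
# check whether each SAN controller still answers on the iSCSI port (3260)
for ip in <SAN_IP1> <SAN_IP2> <SAN_IP3> <SAN_IP4>; do
        nc -vz -w 2 "$ip" 3260
done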
Does anyone know of a way to unlock Proxmox (web interface, backup, migration) without shutting everything down and restarting? I'm going to replace the failed controller, but I'd like to take a backup first, and right now I can't...
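One idea I had (untested, so please correct me if this is a bad idea) is to repoint the storage definition at a portal IP of the surviving controller and then restart the GUI/status daemons, roughly like this:
Code:
# /etc/pve/storage.cfg is cluster-wide, so a single edit should be enough;
# the idea is to change the portal of the iscsi storage to an IP of the
# surviving controller, i.e.
#
#     iscsi: <name>
#             portal <SAN_IP3>     <-- instead of <SAN_IP1>
#             target iqn.1986-03.com.hp:storage.p2000g3.131819bad6
#             content none
#
# and then restart the daemons behind the web interface on each node:
systemctl restart pvestatd pvedaemon pveproxy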
Many thanks in advance!
Some command outputs follow.
Code:
#pvecm status
Quorum information
------------------
Date:             Tue Mar 16 11:25:37 2021
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000004
Ring ID:          1/3524
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 <IP1>
0x00000002          1 <IP2>
0x00000003          1 <IP3>
0x00000004          1 <IP4> (local)
Code:
#pveversion -v
proxmox-ve: 5.2-2 (running kernel: 4.15.17-1-pve)
pve-manager: 5.2-1 (running version: 5.2-1/0fcd7879)
pve-kernel-4.15: 5.2-1
pve-kernel-4.15.17-1-pve: 4.15.17-9
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-4
libpve-common-perl: 5.0-31
libpve-guest-common-perl: 2.0-16
libpve-http-server-perl: 2.0-8
libpve-storage-perl: 5.0-23
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-3
lxcfs: 3.0.0-1
novnc-pve: 0.6-4
proxmox-widget-toolkit: 1.0-18
pve-cluster: 5.0-27
pve-container: 2.0-23
pve-docs: 5.2-3
pve-firewall: 3.0-8
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-5
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.1-5
pve-xtermjs: 1.0-5
qemu-server: 5.0-26
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.8-pve1~bpo9
Code:
#pvesm status
storage '<name>' is not online
storage '<name>' is not online
storage '<name>' is not online
[Ctrl+C because it hangs]
Code:
#cat /etc/pve/storage.cfg
dir: local
        path /var/lib/vz
        content vztmpl,iso,backup

lvmthin: local-lvm
        thinpool data
        vgname pve
        content images,rootdir

iscsi: <name>
        portal <SAN_IP1>
        target iqn.1986-03.com.hp:storage.p2000g3.131819bad6
        content none

lvm: <lvm_name>
        vgname <lvm_group_name>
        base <name>:0.0.100.scsi-3600c0ff00019cb32b080f65b01000000
        content rootdir,images
        shared 1

lvm: <lvm2_name>
        vgname <lvm_group2_name>
        base <name>:0.0.101.scsi-3600c0ff00019cc68a192f65b01000000
        content images,rootdir
        shared 1

nfs: <NAS_name>
        export /vol_backup_vms_08
        path /mnt/pve/netapp-backup-nfs08
        server <NAS_IP>
        content backup,images
        maxfiles 1
        options vers=3
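For what it's worth, the LVM layer defined on top of these LUNs can still be queried directly with the LVM tools, bypassing the PVE storage layer; this is the kind of check I'm using (group names are the placeholders from the config above):
Code:
# the volume groups from storage.cfg sit on top of the multipath devices,
# so they can be inspected without triggering the PVE "online" check
pvs -o pv_name,vg_name,pv_size
vgs <lvm_group_name> <lvm_group2_name>
lvs <lvm_group_name>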
Code:
#multipath -ll [only the part related to the HP SAN]
3600c0ff00019cc68333d415d01000000 dm-6 HP,P2000 G3 iSCSI
size=931G features='2 queue_if_no_path retain_attached_hw_handler' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 3:0:0:101 sdg 8:96  active ready  running
| `- 5:0:0:101 sdi 8:128 active ready  running
`-+- policy='service-time 0' prio=0 status=enabled
  |- 2:0:0:101 sdf 8:80  failed faulty running
  `- 4:0:0:101 sdh 8:112 failed faulty running
3600c0ff00019cb32b080f65b01000000 dm-5 HP,P2000 G3 iSCSI
size=931G features='2 queue_if_no_path retain_attached_hw_handler' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=0 status=enabled
| |- 2:0:0:100 sdb 8:16  failed faulty running
| `- 4:0:0:100 sdd 8:48  failed faulty running
`-+- policy='service-time 0' prio=50 status=active
  |- 3:0:0:100 sdc 8:32  active ready  running
  `- 5:0:0:100 sde 8:64  active ready  running
There's no /etc/multipath.conf; this SAN seems to be covered by the built-in defaults of multipath-tools.
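Related to the failed paths above: I'm also wondering whether logging out of just the iSCSI sessions that point at the dead controller's portals would clean things up, something along these lines (untested; placeholder IPs):
Code:
# list node records and current sessions first
iscsiadm -m node
iscsiadm -m session

# log out only from the portals of the failed controller
iscsiadm -m node -p <SAN_IP1>:3260 --logout
iscsiadm -m node -p <SAN_IP2>:3260 --logout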
This is the config of one of the VMs hosted on the SAN, which I can no longer back up or migrate.
Code:
#qm config 121
balloon: 0
bootdisk: virtio0
cores: 6
memory: 24576
name: <VM 121 NAME>
net0: virtio=76:D6:C3:9F:34:F9,bridge=vmbr0
net1: virtio=36:56:53:14:78:D9,bridge=vmbr1
numa: 0
ostype: win7
scsihw: virtio-scsi-pci
smbios1: uuid=eaed5964-aa1a-4c8e-8557-1317712a9df7
sockets: 2
virtio0: <lvm_name>:vm-121-disk-1,size=100G
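For completeness, these are the backup and migration commands that currently fail for this VM with the "is not online" message (names are placeholders):
Code:
# backup of VM 121 to the NFS storage
vzdump 121 --storage <NAS_name> --mode snapshot

# online migration of VM 121 to another node
qm migrate 121 <other_node_name> --online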