Unable to connect to iSCSI, or otherwise read LVs?

jw6677

Active Member
Oct 19, 2019
Well, I was trying to get some outdated network cards (ConnectX-2) working. In the process, I tried to compile and install Mellanox OFED, which failed.

I am fairly confident that what happened was the installer removed packages that were otherwise required before it failed, specifically packages relating to networking.
Now I am unable to connect to any of my LVs, iSCSI or otherwise. It's the iSCSI that matters most to me, since it contains the majority of my VMs.
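
My plan, unless someone has a better idea, is to dig through apt's logs to see exactly what the OFED installer removed and then put those packages back. Something along these lines (the package names at the end are just my guess at what's missing):
Code:
# list everything dpkg removed (look at the timestamps around the failed install)
grep " remove " /var/log/dpkg.log

# apt's own transaction history shows the same thing grouped per run
less /var/log/apt/history.log

# then reinstall the likely candidates, e.g. the iSCSI initiator and LVM tooling
apt-get install open-iscsi lvm2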

I am hopeful I can recover without needing to reinstall Proxmox. (While this is just a learning machine, not production, I do have enough time invested that I would prefer not to start over.) However, if there is a way to otherwise preserve the VMs on my iSCSI target, then a fresh install isn't a major issue; I would still like to learn what went wrong, how to prevent it, and how to repair it.

Help!


Here is a snippet from my syslog which appears pertinent:
Code:
Nov 12 07:12:00 server systemd[1]: Starting Proxmox VE replication runner...
Nov 12 07:12:01 server systemd[1]: pvesr.service: Succeeded.
Nov 12 07:12:01 server systemd[1]: Started Proxmox VE replication runner.
Nov 12 07:13:00 server systemd[1]: Starting Proxmox VE replication runner...
Nov 12 07:13:01 server systemd[1]: pvesr.service: Succeeded.
Nov 12 07:13:01 server systemd[1]: Started Proxmox VE replication runner.
Nov 12 07:13:10 server kernel: [ 858.785573] INFO: task lvdisplay:3800 blocked for more than 120 seconds.
Nov 12 07:13:10 server kernel: [ 858.785713] Tainted: P O 5.0.15-1-pve #1
Nov 12 07:13:10 server kernel: [ 858.785828] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 12 07:13:10 server kernel: [ 858.785976] lvdisplay D 0 3800 3783 0x80000004
Nov 12 07:13:10 server kernel: [ 858.785992] Call Trace:
Nov 12 07:13:10 server kernel: [ 858.786019]  __schedule+0x2d4/0x870
Nov 12 07:13:10 server kernel: [ 858.786026]  schedule+0x2c/0x70
Nov 12 07:13:10 server kernel: [ 858.786032]  schedule_timeout+0x258/0x360
Nov 12 07:13:10 server kernel: [ 858.786043]  ? call_rcu+0x10/0x20
Nov 12 07:13:10 server kernel: [ 858.786054]  ? __percpu_ref_switch_mode+0xdb/0x180
Nov 12 07:13:10 server kernel: [ 858.786059]  wait_for_completion+0xb7/0x140
Nov 12 07:13:10 server kernel: [ 858.786068]  ? wake_up_q+0x80/0x80
Nov 12 07:13:10 server kernel: [ 858.786077]  exit_aio+0xeb/0x100
Nov 12 07:13:10 server kernel: [ 858.786086]  mmput+0x2b/0x130
Nov 12 07:13:10 server kernel: [ 858.786091]  do_exit+0x28a/0xb30
Nov 12 07:13:10 server kernel: [ 858.786095]  ? __schedule+0x2dc/0x870
Nov 12 07:13:10 server kernel: [ 858.786098]  do_group_exit+0x43/0xb0
Nov 12 07:13:10 server kernel: [ 858.786106]  get_signal+0x12e/0x6d0
Nov 12 07:13:10 server kernel: [ 858.786114]  ? wait_woken+0x80/0x80
Nov 12 07:13:10 server kernel: [ 858.786124]  do_signal+0x34/0x710
Nov 12 07:13:10 server kernel: [ 858.786129]  ? do_io_getevents+0x81/0xd0
Nov 12 07:13:10 server kernel: [ 858.786140]  exit_to_usermode_loop+0x8e/0x100
Nov 12 07:13:10 server kernel: [ 858.786144]  do_syscall_64+0xf0/0x110
Nov 12 07:13:10 server kernel: [ 858.786149]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Nov 12 07:13:10 server kernel: [ 858.786156] RIP: 0033:0x7fb20b2a5f59
Nov 12 07:13:10 server kernel: [ 858.786168] Code: Bad RIP value.
Nov 12 07:13:10 server kernel: [ 858.786170] RSP: 002b:00007ffd05fee078 EFLAGS: 00000246 ORIG_RAX: 00000000000000d0
Nov 12 07:13:10 server kernel: [ 858.786173] RAX: fffffffffffffffc RBX: 00007fb20af46700 RCX: 00007fb20b2a5f59
Nov 12 07:13:10 server kernel: [ 858.786175] RDX: 0000000000000040 RSI: 0000000000000001 RDI: 00007fb20b900000
Nov 12 07:13:10 server kernel: [ 858.786177] RBP: 00007fb20b900000 R08: 0000000000000000 R09: 0000000205fee100
Nov 12 07:13:10 server kernel: [ 858.786179] R10: 00007ffd05fee100 R11: 0000000000000246 R12: 0000000000000001
Nov 12 07:13:10 server kernel: [ 858.786181] R13: 0000000000000000 R14: 0000000000000040 R15: 00007ffd05fee100
Nov 12 07:13:51 server systemd[1]: Starting Cleanup of Temporary Directories...
Nov 12 07:13:51 server systemd[1]: systemd-tmpfiles-clean.service: Succeeded.
Nov 12 07:13:51 server systemd[1]: Started Cleanup of Temporary Directories.
Nov 12 07:14:00 server systemd[1]: Starting Proxmox VE replication runner...
Nov 12 07:14:01 server systemd[1]: pvesr.service: Succeeded.
Nov 12 07:14:01 server systemd[1]: Started Proxmox VE replication runner.
Nov 12 07:14:35 server postfix/qmgr[2555]: 8FEB13A0648: from=<>, size=2874, nrcpt=1 (queue active)
Nov 12 07:14:35 server postfix/qmgr[2555]: D47043A1014: from=<root@server.joshuawest.ca>, size=732, nrcpt=1 (queue active)
Nov 12 07:14:35 server postfix/local[4816]: error: open database /etc/aliases.db: No such file or directory
Nov 12 07:14:35 server postfix/local[4816]: warning: hash:/etc/aliases is unavailable. open database /etc/aliases.db: No such file or directory
Nov 12 07:14:35 server postfix/local[4816]: error: open database /etc/aliases.db: No such file or directory
Nov 12 07:14:35 server postfix/local[4816]: warning: hash:/etc/aliases is unavailable. open database /etc/aliases.db: No such file or directory
Nov 12 07:14:35 server postfix/local[4816]: warning: hash:/etc/aliases: lookup of 'root' failed
Nov 12 07:14:35 server postfix/local[4819]: error: open database /etc/aliases.db: No such file or directory
Nov 12 07:14:35 server postfix/local[4819]: warning: hash:/etc/aliases is unavailable. open database /etc/aliases.db: No such file or directory
Nov 12 07:14:35 server postfix/local[4819]: warning: hash:/etc/aliases: lookup of 'root' failed
Nov 12 07:14:35 server postfix/local[4816]: 8FEB13A0648: to=<root@server.joshuawest.ca>, relay=local, delay=91906, delays=91906/0.08/0/0.04, dsn=4.3.0, status=deferred (alias database unavailable)
Nov 12 07:14:35 server postfix/local[4816]: using backwards-compatible default setting relay_domains=$mydestination to update fast-flush logfile for domain "server.joshuawest.ca"
Nov 12 07:14:35 server postfix/local[4819]: D47043A1014: to=<root@server.joshuawest.ca>, orig_to=<root>, relay=local, delay=262174, delays=262174/0.06/0/0.05, dsn=4.3.0, status=deferred (alias database unavailable)
Nov 12 07:14:35 server postfix/local[4819]: using backwards-compatible default setting relay_domains=$mydestination to update fast-flush logfile for domain "server.joshuawest.ca"
Nov 12 07:15:00 server systemd[1]: Starting Proxmox VE replication runner...
Nov 12 07:15:01 server systemd[1]: pvesr.service: Succeeded.
Nov 12 07:15:01 server systemd[1]: Started Proxmox VE replication runner.
Nov 12 07:15:01 server CRON[4859]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Nov 12 07:15:10 server kernel: [ 979.615650] INFO: task lvdisplay:3800 blocked for more than 120 seconds.
Nov 12 07:15:10 server kernel: [ 979.615789] Tainted: P O 5.0.15-1-pve #1
Nov 12 07:15:10 server kernel: [ 979.615908] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 12 07:15:10 server kernel: [ 979.616054] lvdisplay D 0 3800 3783 0x80000004
Nov 12 07:15:10 server kernel: [ 979.616069] Call Trace:
Nov 12 07:15:10 server kernel: [ 979.616098]  __schedule+0x2d4/0x870
Nov 12 07:15:10 server kernel: [ 979.616104]  schedule+0x2c/0x70
Nov 12 07:15:10 server kernel: [ 979.616111]  schedule_timeout+0x258/0x360
Nov 12 07:15:10 server kernel: [ 979.616132]  ? call_rcu+0x10/0x20
Nov 12 07:15:10 server kernel: [ 979.616142]  ? __percpu_ref_switch_mode+0xdb/0x180
Nov 12 07:15:10 server kernel: [ 979.616147]  wait_for_completion+0xb7/0x140
Nov 12 07:15:10 server kernel: [ 979.616157]  ? wake_up_q+0x80/0x80
Nov 12 07:15:10 server kernel: [ 979.616170]  exit_aio+0xeb/0x100
Nov 12 07:15:10 server kernel: [ 979.616180]  mmput+0x2b/0x130
Nov 12 07:15:10 server kernel: [ 979.616184]  do_exit+0x28a/0xb30
Nov 12 07:15:10 server kernel: [ 979.616189]  ? __schedule+0x2dc/0x870
Nov 12 07:15:10 server kernel: [ 979.616192]  do_group_exit+0x43/0xb0
Nov 12 07:15:10 server kernel: [ 979.616200]  get_signal+0x12e/0x6d0
Nov 12 07:15:10 server kernel: [ 979.616210]  ? wait_woken+0x80/0x80
Nov 12 07:15:10 server kernel: [ 979.616220]  do_signal+0x34/0x710
Nov 12 07:15:10 server kernel: [ 979.616225]  ? do_io_getevents+0x81/0xd0
Nov 12 07:15:10 server kernel: [ 979.616236]  exit_to_usermode_loop+0x8e/0x100
Nov 12 07:15:10 server kernel: [ 979.616239]  do_syscall_64+0xf0/0x110
Nov 12 07:15:10 server kernel: [ 979.616244]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Nov 12 07:15:10 server kernel: [ 979.616251] RIP: 0033:0x7fb20b2a5f59
Nov 12 07:15:10 server kernel: [ 979.616263] Code: Bad RIP value.
Nov 12 07:15:10 server kernel: [ 979.616265] RSP: 002b:00007ffd05fee078 EFLAGS: 00000246 ORIG_RAX: 00000000000000d0
Nov 12 07:15:10 server kernel: [ 979.616269] RAX: fffffffffffffffc RBX: 00007fb20af46700 RCX: 00007fb20b2a5f59
Nov 12 07:15:10 server kernel: [ 979.616271] RDX: 0000000000000040 RSI: 0000000000000001 RDI: 00007fb20b900000
Nov 12 07:15:10 server kernel: [ 979.616273] RBP: 00007fb20b900000 R08: 0000000000000000 R09: 0000000205fee100
Nov 12 07:15:10 server kernel: [ 979.616275] R10: 00007ffd05fee100 R11: 0000000000000246 R12: 0000000000000001
Nov 12 07:15:10 server kernel: [ 979.616276] R13: 0000000000000000 R14: 0000000000000040 R15: 00007ffd05fee100
 
Edit / Part 2:

Just thinking ahead to what I suspect is the inevitable "you need to start over", a few questions down that path:

This is one of three machines in a cluster:
1) What considerations need to be made when reinstalling so as not to negatively impact the cluster?
2) How about the Ceph shared file storage?

3) The VMs themselves are bound to this machine too. Is there a method to "move" them to another machine without needing to actually access the iSCSI storage the VMs reside on, or what is the process there? (A rough guess at this is sketched below.)
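
(For question 3, my rough understanding is that the VM definitions are just small config files under /etc/pve, which is shared across the cluster, so something like the following might be enough once this node is out of the picture. The VMID 100 and node names are placeholders; I haven't tried this.)
Code:
# run from a healthy cluster node; this only moves the definition, not the disks
mv /etc/pve/nodes/brokennode/qemu-server/100.conf \
   /etc/pve/nodes/healthynode/qemu-server/100.conf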
 
Just following up on this a little bit. Apparently the OFED installer does a LOT of automatic purging of packages, including things like pve-manager (!?), so don't do that. I think the main drivers for these cards are already in Proxmox anyway, because they're covered by the in-kernel mlx4_en driver.

First thought: I would consider using ZFS as your root filesystem and setting up snapshots for situations like this. You could just roll back to an earlier snapshot.
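
If your root is on ZFS, the snapshot/rollback is a one-liner each. (A minimal sketch; rpool/ROOT/pve-1 is what I believe the Proxmox installer names the root dataset by default, so adjust to your layout.)
Code:
# take a snapshot before doing anything risky
zfs snapshot rpool/ROOT/pve-1@pre-ofed

# see what you have
zfs list -t snapshot

# roll back if it goes sideways (discards everything written after the snapshot;
# for the root dataset you'd normally do this from a rescue environment or reboot right after)
zfs rollback rpool/ROOT/pve-1@pre-ofed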

In my case, I was first trying to put this card:
Code:
 Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
into IPoIB mode, so I was following this thread https://forum.proxmox.com/threads/40gbs-mellanox-infinityband.57118/#post-276375 - where people said highly varying things and I couldn't confirm any of them.

However this post on Reddit helped me a lot: https://www.reddit.com/r/homelab/comments/cf0ae3/still_rocking_the_connectx2_qdr_and_proxmox_ve_v60/

I think this indicates you simply extract a *script*, connectx_port_config, from the OFED driver bundle (without installing it). Here are the steps I worked out:

Note: that OFED version is simply the latest version currently available for Debian 10; there's nothing special about it.

Code:
# download the OFED bundle (the forum shortened this URL; use the full MLNX_OFED 5.0 Debian 10 tgz link)
wget http://content.mellanox.com/ofed/ML..._OFED_LINUX-5.0-1.0.0.0-debian10.0-x86_64.tgz

# unpack into ~/ofed and extract just the kernel-utils package (we only want one script from it)
mkdir -p ~/ofed/utils
tar xvf MLNX_OFED_LINUX-5.0-1.0.0.0-debian10.0-x86_64.tgz -C ~/ofed/
dpkg -x ~/ofed/DEBS/Common/mlnx-ofed-kernel-utils_5.0-OFED.5.0.1.0.0.0.1.g34c46d3_amd64.deb ~/ofed/utils

# the script expects this config file to exist
mkdir -p /etc/infiniband
touch /etc/infiniband/connectx.conf

# install the port-mode script and run it
cp ~/ofed/utils/sbin/connectx_port_config /sbin/
chmod +x /sbin/connectx_port_config
/sbin/connectx_port_config

output looks like this:

ConnectX PCI devices :
|----------------------------|
| 1 0000:05:00.0 |
|----------------------------|

Before port change:
ib
ib

|----------------------------|
| Possible port modes: |
| 1: Infiniband |
| 2: Ethernet |
| 3: AutoSense |
|----------------------------|
Select mode for port 1 (1,2,3): 2
Select mode for port 2 (1,2,3): 2

After port change:
eth
eth

At this point you should be able to set an IP address, bring the link up, and ping other hosts on the network. I can.
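
For example (the interface name is whatever `ip link` shows for the card after the mode switch, so treat the name and addresses here as placeholders):
Code:
ip link set enp5s0 up
ip addr add 192.168.2.2/24 dev enp5s0
ping 192.168.2.1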

However, it doesn't persist across reboots, which I'm still working out.
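
One idea I haven't verified yet: the in-kernel mlx4_core driver exposes the port type through sysfs, so a small oneshot unit that writes "eth" to both ports at boot might be all that's needed. A sketch (the PCI address 0000:05:00.0 is taken from the output above; adjust for your card):
Code:
cat > /etc/systemd/system/mlx4-port-mode.service <<'EOF'
[Unit]
Description=Set ConnectX-2 ports to Ethernet mode
Before=network-pre.target
Wants=network-pre.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo eth > /sys/bus/pci/devices/0000:05:00.0/mlx4_port1'
ExecStart=/bin/sh -c 'echo eth > /sys/bus/pci/devices/0000:05:00.0/mlx4_port2'

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable mlx4-port-mode.service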
 
Thanks for the reply,

To close the loop and share my own learnings since the OP:

I did end up reinstalling Proxmox from scratch and torched the VMs previously on the machine. Chalked it up to a learning opportunity (the whole point of this for me anyway), and set in place much more rigorous backup standards for myself.
--> That itself was an opportunity to learn how to use a tape library, so that's cool.

I believe that instead of switching to eth mode, I ended up with IP over InfiniBand (modprobe ib_ipoib) as my "settled in" solution.


I have a file full of notes that slowly turned into a "scripts to run when setting up a new node" list. Specific to getting these ConnectX-2 cards running on Proxmox, here are the related lines. There is some other stuff installed here, but this gets the ConnectX cards working for me consistently, even after reboots:

Code:
## Install driver and IB stuff --> forum note: some of this is likely not needed for your setup (some probably not for mine either)
apt-get -y install mdadm libibverbs1 ibverbs-utils librdmacm1 rdmacm-utils libdapl2 ibsim-utils ibutils libcxgb3-1 libibmad5 libibumad3 libmlx4-1 libmthca1 libnes1 infiniband-diags mstflint opensm perftest srptools

## Example /etc/network/interfaces
auto ibp6s0
iface ibp6s0 inet static
    address  192.168.2.2
    netmask  255.255.255.0
    mtu 65520
    pre-up modprobe ib_ipoib
    pre-up echo connected > /sys/class/net/ibp6s0/mode
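
One related note, in case anyone copies this for a back-to-back or unmanaged-switch setup: my understanding is that an IPoIB link only goes active if a subnet manager is running somewhere on the fabric. opensm (installed in the apt line above) covers that:
Code:
# run a subnet manager on one node of the fabric
systemctl enable --now opensm

# port state should show Active once the SM has swept the fabric
ibstat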
 
