I tried to set up multipath today on a cluster of old servers that I've been using for testing.
I have a four-bay NAS with two RAID-1 targets/LUNs. Currently there is one LUN per target, and both network interfaces are enabled for each target.
The multipath.conf file I came up with is as follows:
Code:
defaults {
    polling_interval 2
    path_selector "round-robin 0"
    path_grouping_policy multibus
    uuid_attribute ID_SERIAL
    rr_min_io 100
    failback immediate
    no_path_retry queue
    user_friendly_names yes
}

blacklist {
    wwid .*
}

blacklist_exceptions {
    wwid "35000c500a33316d0"
    wwid "35000c500a33313e8"
}
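After copying the file into place, the quickest sanity check I know of is to make multipathd re-read the config and dump what it actually assembled (a sketch; these need root, and the map names will differ per system):

```
# Reload multipathd so it re-reads /etc/multipath.conf
systemctl restart multipathd

# Show the assembled maps; each LUN should appear exactly once,
# with one path per NAS interface grouped beneath it
multipath -ll

# Dump the merged running config to confirm the blacklist and
# blacklist_exceptions sections were parsed as intended
multipathd show config
```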
I pulled the WWIDs using /lib/udev/scsi_id -g -u -d /dev/sdX; those are the two targets on my NAS. I copied this file to all the servers and restarted the multipath- and iSCSI-related services, and when that didn't work I restarted the servers.
The only thing I can say for sure is that this setup does not work. I get errors suggesting that the same LVs are showing up on multiple PVs, and Proxmox refuses to start them for that reason. I also start getting block devices as high as sdm, and I don't have nearly that many disks in my system, so the disks may be getting duplicated.
Unfortunately, the corresponding entry in the PM GUI log is now blank, and I'm not able to find the error in any of the server logs. Renaming multipath.conf to multipath.conf.bak and restarting the servers solved that problem. If someone sees the problem with my config, please advise.
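In case it helps anyone reproduce this, the one-liner I used to list the WWID of every SCSI disk (so the iSCSI LUNs can be told apart from local disks before whitelisting) was roughly this sketch:

```
# Print each block device alongside the WWID multipath will key on
for dev in /dev/sd?; do
    printf '%s  ' "$dev"
    /lib/udev/scsi_id -g -u -d "$dev"
done
```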
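From what I've read, duplicate-PV warnings usually mean LVM is scanning both the raw per-path sdX devices and the multipath map built on top of them, so it sees the same PV signature twice. A common mitigation is a global_filter in /etc/lvm/lvm.conf that accepts only the multipath maps and the local boot disk (a sketch; /dev/sda as the local disk is an assumption, and the patterns need adjusting to your layout):

```
devices {
    # Accept dm-multipath maps and the local boot disk; reject the
    # raw per-path sdX devices so each PV is seen exactly once
    global_filter = [ "a|/dev/mapper/mpath.*|", "a|/dev/sda.*|", "r|.*|" ]
}
```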
Also, and most confusingly, my NAS started crashing around the same time I made this change, and it keeps crashing now that I've reverted it. Reverting the change did allow me to start my VMs, but the NAS kept going offline. Right now I have shut down three of my five nodes and am running on two nodes with minimal VMs, with some success: I have 30 minutes of uptime, which is the most I've had all day.
I realize this is probably just a fantastic coincidence, but on the off chance that my change to the PM multipath setup is somehow related, I figured I'd mention it here. It seems that shutting down 3/5 nodes has helped my NAS stay online, but it could just be that the device is failing under load. There is nothing in the NAS's logs to indicate what is going on. The problem manifests as the NAS simply becoming unreachable until it is cold restarted. Nothing is connected to it aside from PM VMs.
So, not having a known-good NAS or a known-good multipath config, I'm not sure what the problem might be.
This is my storage.cfg file:
Code:
dir: local
        path /var/lib/vz
        content backup,vztmpl,iso,images,rootdir
        maxfiles 25
        shared 0

zfspool: local-zfs
        pool rpool/data
        content images,rootdir
        sparse 1

iscsi: iscsi_backend
        portal 10.5.0.20
        target iqn.2000-01.com.synology:ADSNAS.Target-1.72d5c4b37e
        content images

lvm: hapool
        vgname HA
        base iscsi_backend:0.0.0.scsi-3600140529fca684d358ad4823daeabd7
        content images,rootdir
        shared 1

zfspool: local-vmdata
        pool local-vmdata
        content rootdir,images
        sparse 0

iscsi: iscsi_backend_2
        portal 10.5.0.21
        target iqn.2000-01.com.synology:ADSNAS.Target-2.72d5c4b37e
        content images

lvm: hapool_2
        vgname HA_2
        base iscsi_backend_2:0.0.1.scsi-36001405e757b1a8d9d5ed47badaf55d3
        content images,rootdir
        shared 1
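For context on what I was trying to end up with: my understanding is that once multipath is working, the LVM storages would no longer reference the per-session scsi-3... base device at all; instead the VG would be created on the /dev/mapper multipath device and the storage entry would just name the VG, something like this sketch (assuming the VG has been recreated on the multipath map):

```
lvm: hapool
        vgname HA
        content images,rootdir
        shared 1
```

Happy to be corrected if the base line is still needed in that setup.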
Any thoughts or advice are much appreciated.
UPDATE: I ssh'ed into the NAS to see if I could identify anything in the device's logs. One thing I notice is a lot of this in dmesg:
Code:
[ 4817.857939] iSCSI:iscsi_target_parameters.c:44:iscsi_login_rx_data rx_data returned 0, expecting 48.
[ 4817.867149] iSCSI_F:iscsi_target_login.c:1253:iscsi_target_login_sess_out iSCSI Login negotiation failed - I[][10.5.0.252:43228], T[][10.5.0.20:3260]
[ 4820.581714] iSCSI:iscsi_target_parameters.c:44:iscsi_login_rx_data rx_data returned 0, expecting 48.
[ 4820.590903] iSCSI_F:iscsi_target_login.c:1253:iscsi_target_login_sess_out iSCSI Login negotiation failed - I[][10.5.0.254:36212], T[][10.5.0.20:3260]
[ 4820.803964] iSCSI:iscsi_target_parameters.c:44:iscsi_login_rx_data rx_data returned 0, expecting 48.
[ 4820.813143] iSCSI_F:iscsi_target_login.c:1253:iscsi_target_login_sess_out iSCSI Login negotiation failed - I[][10.5.0.254:59680], T[][10.5.0.21:3260]
[ 4821.632146] iSCSI:iscsi_target_parameters.c:44:iscsi_login_rx_data rx_data returned 0, expecting 48.
[ 4821.641338] iSCSI_F:iscsi_target_login.c:1253:iscsi_target_login_sess_out iSCSI Login negotiation failed - I[][10.5.0.250:35730], T[][10.5.0.21:3260]
[ 4821.697788] iSCSI:iscsi_target_parameters.c:44:iscsi_login_rx_data rx_data returned 0, expecting 48.
[ 4821.706977] iSCSI_F:iscsi_target_login.c:1253:iscsi_target_login_sess_out iSCSI Login negotiation failed - I[][10.5.0.250:34676], T[][10.5.0.20:3260]
[ 4827.004090] iSCSI:iscsi_target_parameters.c:44:iscsi_login_rx_data rx_data returned 0, expecting 48.
[ 4827.013269] iSCSI_F:iscsi_target_login.c:1253:iscsi_target_login_sess_out iSCSI Login negotiation failed - I[][10.5.0.252:58856], T[][10.5.0.21:3260]
[ 4827.038026] iSCSI:iscsi_target_parameters.c:44:iscsi_login_rx_data rx_data returned 0, expecting 48.
[ 4827.047227] iSCSI_F:iscsi_target_login.c:1253:iscsi_target_login_sess_out iSCSI Login negotiation failed - I[][10.5.0.252:43232], T[][10.5.0.20:3260]
The odd thing about that is that all my running nodes are successfully connected to their targets.
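I checked that with open-iscsi on each node; something like the following shows the session and connection state (output omitted since it varies per node):

```
# List active iSCSI sessions on this node, with per-session detail
iscsiadm -m session -P 1
```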
Speaking of running nodes, I have three nodes up right now, and the NAS still hasn't crashed. One node has always been offline, so that means 3/4 available nodes are online now, albeit only running a couple of VMs. I plan on methodically starting VMs, then finally starting the last node, and seeing if the NAS starts crashing again.