Hi,
I've run into a very strange problem.
In January I installed 6 Dell servers and set up one storage array as an experimental cluster.
The servers are directly connected to the storage array: 3 to controller A, 3 to controller B. Each server has only one path, since redundancy is not needed for an experimental cluster.
After about 2 months I found that 4 of the 6 servers could not access the shared storage. So I rebooted them, upgraded the whole cluster from Proxmox 6.0 to 6.1 (via apt-get), and set up some scripts to let me know if the error occurs again.
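For reference, the per-node check I run from cron is roughly along these lines (the map name matches my setup further down; the mail address is just a placeholder and assumes a local MTA):
Code:
#!/bin/sh
# Alert if the multipath map no longer reports a healthy path.
MAP=proxmox-storage
ALERT=admin@example.com   # placeholder address

if ! multipath -ll "$MAP" | grep -q "active ready running"; then
    multipath -ll "$MAP" | mail -s "$(hostname): multipath map $MAP degraded" "$ALERT"
fi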
Yesterday (roughly a month after I fixed it the first time) the first server reported the same error. I did a little investigation and some googling, but found nothing relevant, and there was nothing in the storage array event log.
Today a second server reported the same error. I did the same investigation again. The only thing I was able to find is that restarting multipathd with
systemctl restart multipathd
fixes the problem. I also found this message in the log while multipathd was going down:
Code:
Apr 22 16:24:26 node04 multipathd[957]: exit (signal)
Apr 22 16:24:26 node04 multipathd[957]: uxsock: poll failed with 24
Apr 22 16:24:26 node04 systemd[1]: Stopping Device-Mapper Multipath Device Controller...
So I blame multipathd itself for the error. I found this mail where the same error is mentioned in a patch: https://www.mail-archive.com/dm-devel@redhat.com/msg08954.html
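One guess on my side: if the 24 in "uxsock: poll failed with 24" is an errno, that would be EMFILE ("Too many open files"). Next time it happens, before restarting the daemon, I plan to compare multipathd's open descriptors against its limit, something like:
Code:
PID=$(pidof multipathd)
# number of descriptors multipathd currently holds open
ls /proc/$PID/fd | wc -l
# limit the daemon is actually running with (max_fds 8192 in my multipath.conf)
grep "Max open files" /proc/$PID/limits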
This was written to journalctl when the problem began:
Code:
Apr 22 15:56:01 node04 multipathd[957]: proxmox-storage: load table [0 4679106560 multipath 3 pg_init_retries 50 queue_if_no_path 1 rdac 1 1 round-robin 0 1 1 8:48 1]
Apr 22 15:56:21 node04 multipathd[957]: proxmox-storage: load table [0 4679106560 multipath 3 pg_init_retries 50 queue_if_no_path 1 rdac 1 1 round-robin 0 1 1 8:48 1]
Apr 22 15:56:21 node04 multipathd[957]: failed getting dm events: Bad file descriptor
Apr 22 16:07:01 node04 multipathd[957]: sdd: No SAS end device for 'end_device-3:0'
Apr 22 16:07:01 node04 multipathd[957]: proxmox-storage: load table [0 4679106560 multipath 3 pg_init_retries 50 queue_if_no_path 1 rdac 1 1 round-robin 0 1 1 8:48 1]
Apr 22 16:07:21 node04 multipathd[957]: checker failed path 8:48 in map proxmox-storage
Apr 22 16:07:21 node04 multipathd[957]: proxmox-storage: Entering recovery mode: max_retries=30
Apr 22 16:07:21 node04 multipathd[957]: proxmox-storage: remaining active paths: 0
Apr 22 16:07:21 node04 kernel: device-mapper: multipath: Failing path 8:48.
The complete journalctl log is attached.
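Next time a node hits this, I also want to see whether only the path checker is stuck before restarting the whole daemon; both of these are regular multipathd commands:
Code:
# current checker state of every path as multipathd sees it
multipathd show paths
# re-read the config and rescan the maps without a full daemon restart
multipathd reconfigure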
Output of
multipath -ll
when the status is OK:
Code:
proxmox-storage (36782bcb00053557b000008d05e39e10b) dm-2 DELL,MD32xx
size=2.2T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
`-+- policy='round-robin 0' prio=14 status=active
`- 3:0:0:0 sdd 8:48 active ready running
My
/etc/multipath.conf
:
Code:
defaults {
    polling_interval 5
    path_selector "round-robin 0"
    path_grouping_policy group_by_prio
    uid_attribute ID_SERIAL
    rr_min_io 100
    failback immediate
    no_path_retry 30
    max_fds 8192
    user_friendly_names yes
    find_multipaths no
}
blacklist {
    wwid .*
    device {
        vendor DELL.*
        product Universal.*
    }
    device {
        vendor DELL.*
        product Virtual.*
    }
}
blacklist_exceptions {
    wwid 36782bcb00053557b000008d05e39e10b
}
devices {
    device {
        vendor "DELL"
        product "MD32xxi"
        path_grouping_policy group_by_prio
        prio rdac
        path_checker rdac
        path_selector "round-robin 0"
        hardware_handler "1 rdac"
        failback immediate
        features "2 pg_init_retries 50"
        no_path_retry 30
        rr_min_io 100
    }
    device {
        vendor "DELL"
        product "MD32xx"
        path_grouping_policy group_by_prio
        prio rdac
        path_checker rdac
        path_selector "round-robin 0"
        hardware_handler "1 rdac"
        failback immediate
        features "2 pg_init_retries 50"
        no_path_retry 30
        rr_min_io 100
    }
}
multipaths {
    multipath {
        wwid 36782bcb00053557b000008d05e39e10b
        alias proxmox-storage
    }
}
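In case it matters, this is how I check what multipathd actually ends up with after merging the file with its built-in hardware table (both are standard multipath-tools commands; the grep just narrows the output to the MD32xx entry):
Code:
# configuration the running daemon is actually using
multipathd show config | grep -A 15 'MD32xx'
# parse the config files without talking to the daemon
multipath -t | grep -A 15 'MD32xx'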
If anyone can point me in the right direction on how to resolve this, it would be great.
I also have another experimental cluster with the same setup, but only 2 nodes and an HP storage array instead of the Dell. There, path_checker is set to tur and hardware_handler to "0", as HP recommends in their manual, and that cluster works fine. So maybe the problem is multipathd relying on rdac? But setting it to rdac is what the Dell guys recommend.
Maybe I will try mixing those two configurations (a sketch of what I'd test is below), but any ideas are welcome.
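If I do try it, the first thing I'd change is only the checker, while keeping the rdac prio/handler that Dell documents; roughly this stanza in the devices section (my own experiment, not anything Dell recommends):
Code:
device {
    vendor "DELL"
    product "MD32xx"
    path_grouping_policy group_by_prio
    prio rdac
    hardware_handler "1 rdac"
    # only difference from my current stanza: tur instead of rdac
    path_checker tur
    path_selector "round-robin 0"
    failback immediate
    features "2 pg_init_retries 50"
    no_path_retry 30
    rr_min_io 100
}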