Reboot hanging in multipath environment

jhammer

Member
Dec 21, 2009
On reboot, my Proxmox server hangs at the following:

Code:
Cleaning up ifupdown....
Deactivating swap...done.
Unmounting local filesystems...done.
Shutting down LVM Volume Groupsdevice-mapper: multipath: Failing path 8:16.
device-mapper: multipath: Failing path 8:32.
device-mapper: multipath: Failing path 8:48.
It hangs there indefinitely until I do a hard reset. I'm not seeing any useful info in the logs, because kernel logging seems to stop before that point. The device in question is not in use at all for VMs; I have not even created a volume group on it yet.

Any ideas why this may be happening?

Here are some details on my multipath setup:

/etc/multipath.conf:
Code:
defaults {
        type = ["device-mapper",1]
        filter = ["a\|/dev/disk/by-id.*\|", "r\|.*\|"]
        udev_dir                /dev
        polling_interval        10
        selector                "round-robin 0"
        path_grouping_policy    multibus
        getuid_callout          "/lib/udev/scsi_id -g -u -d /dev/%n"
        prio_callout            /bin/true
        path_checker            tur
        rr_min_io               1000
        rr_weight               uniform
        failback                immediate
        no_path_retry           queue
        user_friendly_names     yes
}
blacklist {
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode "^hd[a-z][[0-9]*]"
        devnode "^sd[a-z][[0-9]*]"
        devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"
}
Output of various commands (I've only listed related info):

# multipath -ll

CYBERDISK_4 (20000000000000000000b560021b518e3) dm-3 CYBERNET,iSAN Vault
[size=18G][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=0][active]
\_ 8:0:0:0 sdb 8:16 [active][ready]
\_ 9:0:0:0 sdc 8:32 [active][ready]
\_ 10:0:0:0 sdd 8:48 [active][ready]

# pvscan
PV /dev/dm-7 VG CYBERDISK5 lvm2 [10.00 GB / 10.00 GB free]
PV /dev/dm-6 VG VolumeTD2 lvm2 [20.00 GB / 19.00 GB free]
PV /dev/dm-5 VG TestVol lvm2 [10.00 GB / 3.00 GB free]
PV /dev/sda2 VG pve lvm2 [465.26 GB / 4.00 GB free]
Total: 4 [505.25 GB] / in use: 4 [505.25 GB] / in no VG: 0 [0 ]

*NOTE: /dev/dm-3 is not listed in the pvscan output, and should not be, as I've not done anything with it yet.

Thanks.
 
I found that if I set no_path_retry to its default value of '0' instead of 'queue' then the machine reboots without a problem. Is it possible to get the reboot working with no_path_retry set to queue?
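
As a sketch of a related knob (not something tested in this thread): queueing can also be switched off on a live map just before shutdown, without editing multipath.conf, via dmsetup. The map name below is the one shown by multipath -ll above; substitute your own.

Code:
# tell the running map to fail outstanding I/O instead of queueing it forever
dmsetup message CYBERDISK_4 0 "fail_if_no_path"
# and to switch queueing back on afterwards:
# dmsetup message CYBERDISK_4 0 "queue_if_no_path"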
 
As far as I can tell from my research on the issue (see the post linked below), the system hangs on reboot because the iSCSI node sessions are ended before the multipath device is taken down. Does anyone know how to make the multipath device go away before the iSCSI node sessions in the shutdown sequence?

http://forum.proxmox.com/threads/3578-Removal-of-iSCSI-disk-causes-system-to-hang?p=20160#post20160

Thanks.

Hi,
look at rc0.d - links starting with S run when changing to runlevel 0. K-scripts run when you leave the runlevel - I'm not sure about shutdown, perhaps you must try.
Code:
ls -l /etc/rc0.d
If you change the number, you change the order.
But I think iSCSI must shut down before multipath. Are you sure that your network is still active until multipath stops?

Udo
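
As a rough illustration of what Udo describes (the script names and numbers here are examples only; check your own listing):

Code:
# see where the multipath and iscsi kill entries sit in the shutdown order
ls -l /etc/rc0.d | grep -Ei 'multipath|iscsi'
# with classic SysV ordering, renaming a link to a lower number makes it
# run earlier, e.g. (illustrative name only):
# mv /etc/rc0.d/S35multipath-tools /etc/rc0.d/S25multipath-tools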
 
Thanks Udo. I ran the whole shutdown sequence manually, i.e. running each script in /etc/rc0.d by hand, in order (with no_path_retry set to queue). Between each shutdown script, I ran 'multipath -f mpath5' to try to remove the multipath device manually. Each time I got this result:

Code:
mpath5: map in use
All the way down until I got to the last 3 scripts:

Code:
S50lvm2 -> ../init.d/lvm2
S60umountroot -> ../init.d/umountroot
S90halt -> ../init.d/halt
When that lvm2 script is run to shut down lvm2, I again get the "multipath: Failing path" messages:

Code:
Shutting down LVM Volume Groupsdevice-mapper: multipath: Failing path 8:48.
device-mapper: uevent: dm_send_uevents: kobject_uevent_env failed
device-mapper: multipath: Failing path 8:80.
device-mapper: uevent: dm_send_uevents: kobject_uevent_env failed
device-mapper: multipath: Failing path 8:64.
device-mapper: uevent: dm_send_uevents: kobject_uevent_env failed
That hangs indefinitely.
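
For reference, the manual walk-through described above amounts to roughly the following. It is only a sketch: mpath5 and the rc0.d contents are specific to this machine, and running it takes the system down step by step.

Code:
#!/bin/sh
# replay the shutdown sequence by hand; in runlevels 0 and 6 the
# S-prefixed scripts are invoked with the 'stop' argument
for s in /etc/rc0.d/S*; do
        echo "running $s stop"
        "$s" stop
        # check whether the multipath map can be flushed yet
        multipath -f mpath5 && echo "map released after $s"
done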

Now, if I do the same thing with no_path_retry set to fail, the sequence goes similarly, except that when I run /etc/init.d/lvm2 stop I get the same output as above, followed by a few of these lines:

Code:
/dev/dm-9: read failed after 0 of 2048 at 0: Input/output error
end_request: I/O error, dev dm-9, sector 20971776
Then the script finishes and the reboot can proceed.

So the key seems to be the no_path_retry setting.

From my tests, things go much better when the connection to the iSCSI server is interrupted if no_path_retry is set to queue.

Any other thoughts on what I can do to get reboot working with no_path_retry set to queue?

Thanks!
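
One possible middle ground, noted here as an untested assumption rather than anything from the thread, is a numeric no_path_retry: the map queues I/O for a limited number of path-checker intervals and then fails it, instead of queueing forever or failing immediately.

Code:
# in the defaults (or a device) section of /etc/multipath.conf;
# 30 is an example value, not a recommendation
no_path_retry           30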
 
This is what I was told on the open-iscsi mailing list:

<snip>
This is a reported bug in device-mapper on Debian.
There is a patch available in Debian's bug tracker, but as far as I remember
it has been refused by the upstream developers.

We're also running the open-iscsi/dm-multipath/lvm/clvm stack on virtualization
hosts. Because of this behavior, one big point is to never let multipath lose
all paths.
Try adding
features "1 queue_if_no_path"
to the related device section of your multipath.conf.
</snip>
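
For anyone following along, that suggestion would look roughly like this in /etc/multipath.conf. The vendor and product strings are copied from the multipath -ll output earlier in the thread, so treat it as a sketch rather than a verified config.

Code:
devices {
        device {
                vendor          "CYBERNET"
                product         "iSAN Vault"
                features        "1 queue_if_no_path"
        }
}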
 
I received another suggestion on the open-iscsi mailing list:

What you (or the Debian scripts) want to do is shut down multipath first, so the higher-level queues have flushed their data out. Then shut down iscsi.

Or do something to flush the multipath queues and shut multipath down, then shut down iscsi.

The multipath daemon was being shut down before iscsi in the init scripts. However, the multipath queues were not being flushed and shut down. I added a script to run 'multipath -F' between the multipath and iscsi shutdown scripts. That seems to flush the queues and shut down the multipath devices OK. The server no longer hangs on reboot, and I see no glaring errors.
 
Hey jhammer

I have been trying this morning to replicate what you put in your last post.

Essentially I have created a script which literally contains:

Code:
#!/bin/sh
# Flush multipath

echo "flushing multipath"
multipath -F
echo "multipath flushed"

exit 0

Is this correct? And how do I add it to the shutdown sequence, between the iscsi and multipath kill scripts in /etc/rc0.d/?

I have tried to symlink it manually, and tried using update-rc.d, but I can't seem to get it right.

Any help would be greatly appreciated.
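
In case it helps, on the Debian base underneath Proxmox one way to register such a script is with update-rc.d. The script name flush-multipath and the sequence number 31 below are examples only; the number should land between the multipath and iscsi kill entries shown by ls -l /etc/rc0.d.

Code:
# script saved as /etc/init.d/flush-multipath and made executable
update-rc.d flush-multipath start 31 0 6 .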
 
Hey..

Managed to figure this out. I tried with your command, but I needed to unmount my /dev/mapper/mpath* device as well.

I created a .sh script in /etc/init.d/ and symlinked it in /etc/rc0.d and /etc/rc6.d.

Thanks anyway everyone :)
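
For completeness, a version of that approach might look like the following. Every name, path, and sequence number here is illustrative rather than taken from the posts above.

Code:
#!/bin/sh
# /etc/init.d/flush-multipath (example name)
# unmount anything still mounted from the multipath maps, then flush them
echo "unmounting multipath devices"
umount /dev/mapper/mpath* 2>/dev/null
echo "flushing multipath"
multipath -F
echo "multipath flushed"
exit 0
Symlinked so it runs during halt and reboot:

Code:
ln -s ../init.d/flush-multipath /etc/rc0.d/S31flush-multipath
ln -s ../init.d/flush-multipath /etc/rc6.d/S31flush-multipath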
 
