md raid 1 + lvm2 + snapshot => volumes hang

philten

Hello,

I am aware that software RAID has been discussed many times and that the Proxmox team does not support it.

However, I know it interests many Proxmox users, and I would therefore like to report my difficult experience with md RAID 1 + LVM2 + snapshot.

Also, this may be an LVM2-related issue :)

I am running Proxmox 1.3:
2.6.24-7-pve #1 SMP PREEMPT Fri Aug 21 09:07:39 CEST 2009 x86_64 GNU/Linux

The system was installed a couple of weeks ago, and snapshots worked fine until the problem occurred, despite no changes to the disk configuration.

In short, the main symptom is that the LVM volume group stalled completely after reading about 1 GB while executing a vzdump snapshot backup.

My configuration:
md1: md RAID 1 + ext3, mounted as /
md0: md RAID 1 + LVM2, divided into two ext3 volumes, vmdata and vmbackups, mounted as /var/lib/vz and /backups.
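
(For reference, the storage layout was built roughly like this; sdb3's mirror partner and the sizes below are written from memory, so treat it as a sketch rather than the exact history:)

Create the RAID 1 array for the LVM storage
# mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3

Put LVM2 on top of it, volume group "data"
# pvcreate /dev/md0
# vgcreate data /dev/md0

Create the two ext3 volumes (sizes are illustrative; some free space is left in the VG so vzdump can create its vzsnap snapshot)
# lvcreate --name vmdata --size 200G data
# lvcreate --name vmbackups --size 100G data
# mkfs.ext3 /dev/data/vmdata
# mkfs.ext3 /dev/data/vmbackups

They are then mounted as /var/lib/vz and /backups via /etc/fstab.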

Symptoms:
- Snapshot (vzsnap) creation: OK.
- The backup to vmbackups started OK.
- After backing up about 1 GB, everything stalled: any request to read a file on any LVM volume hangs, including /backups, which is not involved in the snapshot. "ls" and "cd" still work and I can get directory listings, but any command that reads file contents (e.g. cat, cp, mv) hangs the SSH session. A simple "cat /backups/phil.log" is enough to hang it.
- smartctl does not report any problem (including the long test).
- "wa" in "top" is stuck at 99%, CPU usage is near zero.
- The snapshot is visible in /dev/mapper.
- The snapshot cannot be removed (lvremove -f): again no error is reported, the command just hangs with no output at all.
- The system seems to work fine as long as nothing tries to read from one of the two LVM2 volumes.
- No errors are reported in messages or syslog.
- It seems an md check started after the snapshot was created. This check also stalled at 29% (speed=0K/sec), again with no error reported.
- A soft reboot did not work.
- A hard reboot worked, but an md resync started and stalled at 0.1%, leaving the system in the same state as before the hard reboot.

To recover a working system, I marked sdb3 as faulty, removed it from the RAID 1 array and hard rebooted. That worked: I could remove the snapshot and access the data on both LVM volumes. Since then I have not tried to create a snapshot, and the system seems to work fine.

Any comments or suggestions would be very much appreciated.

Greetings,

Phil Ten
 
I was hoping for some posts from other Proxmox users.

In particular, is anyone running the same configuration (software RAID + LVM + snapshot) without problems?

Phil Ten
 
I have been able to gather more information on this problem, and it seems I can now reproduce it.

Each time I tried the simple scenario below, it hung the LVM volumes:

I launch checkarray (as executed by cron on the first Sunday of each month):

# /usr/share/mdadm/checkarray --cron --all --quiet

and while checkarray is running, I start a snapshot
backup:

# vzdump --snapshot --dumpdir /backups/ 201

After a short delay, the LVM volumes hang (before the backup completes, but after the creation of the snapshot).

To recover access to the LVM volumes, I proceed this way:

Mark one disk member of the RAID array as faulty
# mdadm --manage /dev/md64 --fail /dev/sdb3

Power-cycle the machine (a soft reboot won't work)

Remove the snapshot
# lvremove /dev/data/vzsnap

Re-add the disk to the RAID array
# mdadm --manage /dev/md0 --re-add /dev/sdb3

At this point, the RAID volume seems to be back in order.
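
To double-check that the array is clean again, the usual status commands are enough:

# cat /proc/mdstat
# mdadm --detail /dev/md0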

--------------

I should mention that I had exactly the same symptom on two different servers with the same configuration, so I exclude a hardware failure.

Each time I saw the symptom, I could verify that a checkarray was running (listed in /proc/mdstat).

--------------

Dear Proxmox team, I know you do not support software RAID, but since I have provided a procedure to reproduce the symptom, I was hoping you could check it? Is there any chance?

In any case, thanks for the great job with Proxmox. I love it.

Phil Ten
 
Dear Proxmox team, I know you do not support software RAID, but since I have provided a procedure to reproduce the symptom, I was hoping you could check it? Is there any chance?

No, because we do not support software raid (we do not have the resources to debug software raid).
 
No, because we do not support software raid (we do not have the resources to debug software raid).

Bad news but I understand. Thanks.

For now, I will make sure the RAID is not in "check" or "recover" state before launching a snapshot backup.
It's a far-from-perfect solution, but it should help.
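
Something along these lines should do it. This is only a sketch: it assumes the array is md0 and hard-codes the vzdump call for VM 201 as above.

#!/bin/sh
# Skip the snapshot backup while md0 is checking, resyncing or recovering
state=$(cat /sys/block/md0/md/sync_action)
if [ "$state" != "idle" ]; then
    echo "md0 is busy ($state), skipping snapshot backup" >&2
    exit 1
fi
vzdump --snapshot --dumpdir /backups/ 201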

Also, if I have some spare time, I will try to set up a server with the latest Debian and try to reproduce the symptom there.

(Just for your information: at OVH you get a server with software RAID for €50/month, and at least €135/month for hardware RAID! In many cases I can't convince my clients to go for hardware RAID if they want an OVH server...)
 
[Update]:

I just found that the symptom is probably related to this known bug:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/212684

According to that thread, it should be fixed in kernel
linux-image-2.6.27-11-generic 2.6.27-11.26.

In short, it is hardware related and seems to occur with
Intel SATA AHCI controllers.

Indeed, the server showing the symptom runs:

00:1f.2 SATA controller [0106]: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA AHCI Controller [8086:2922] (rev 02) (prog-if 01 [AHCI 1.0])
Subsystem: Intel Corporation Device [8086:5044]
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 2298
Region 0: I/O ports at 2428
Region 1: I/O ports at 243c
Region 2: I/O ports at 2420
Region 3: I/O ports at 2438
Region 4: I/O ports at 2020
Region 5: Memory at e83a1000 (32-bit, non-prefetchable) [size=2K]
Capabilities: [80] Message Signalled Interrupts: Mask- 64bit- Queue=0/4 Enable+
Address: fee0300c Data: 4199
Capabilities: [70] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [a8] SATA HBA <?>
Capabilities: [b0] Vendor Specific Information <?>
Kernel driver in use: ahci
Kernel modules: ahci
 
I've seen this same issue on many servers with Proxmox: "random" LVM-on-mdadm hangs. The 2.6.32 kernel does not fix it.
 
Hrm... after doing further testing, this seems to occur only when the mdadm array is rebuilding or checking for consistency (that happens monthly, which is how I spotted the pattern)... And it occurs randomly, not only when snapshots exist.
 
obrienmd,

That's bad news about kernel 2.6.32 :(

What HDD controller do you have? Intel 82801?

I administer a few servers running this configuration. I have seen the problem a few times (I think it's the same issue) at the end of a VM backup in suspend mode; the hang occurs while processing the "resume" command.

I have never seen the problem outside these two contexts (snapshot + rebuild/check, and during VM resume).

Could you provide more information about these "random" cases you mentioned?

Phil Ten
 
Drives are on:

00:1f.2 SATA controller: Intel Corporation Ibex Peak 6 port SATA AHCI Controller (rev 05)
01:00.0 SCSI storage controller: Marvell Technology Group Ltd. 88SX7042 PCI-e 4-port SATA-II (rev 02)

I have never seen the problem outside of a rebuild / check, but the vast majority of hangs occur when there's no snapshot or VM suspend or resume action.

What happens is this: the md system starts its monthly consistency check on the array md0. Most of the time, at some point during this check, if KVM machines are running, a couple of the LVM volumes that host KVM VM storage shoot up to 100% utilization in iostat (without any blocks being read or written), and the check speed in /proc/mdstat slowly drops to 0-1 KB/s. At this point the host cannot be shut down or restarted without physically powering it down. The KVM processes that use the 100%-utilization LVM volumes cannot be killed, even with kill -9. Even a remote magic SysRq will not restart the machine. This is not ideal, because unless I want to spend an hour running to the datacenter, I have to pay remote hands to manually restart the machine :)
*While iostat shows 100% usage for the LVM volumes, the sdX drives and the md0 device show very low or 0% usage.
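
For reference, I don't watch for this with anything fancier than the standard tools (md0 is my array; the dm-X names vary per host):

# watch -n 5 cat /proc/mdstat
# iostat -x 5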

Once the machine is manually restarted, the md array rebuilds itself because it detects an inconsistency. Because of this, I'm afraid to bring up any KVM machines with LVM storage, and the host is essentially useless for the 3-5 hours a rebuild takes.

I prefer mdadm to hardware raid, but I can't possibly move our production vm hosts to mdadm until I get this figured out!
 
I prefer mdadm to hardware raid, but I can't possibly move our production vm hosts to mdadm until I get this figured out!
Have you tried using the LVM2 mirror functionality instead of MD RAID? I ask because LVM2 is already used on Proxmox VE, while getting MD RAID/software RAID to work on Proxmox VE involves changing so many things that every time you update Proxmox VE you are forced to redo those changes. And you write "production VM host": if this is production, why all the extra hassle with something that needs so many changes, when you can have almost the same thing by using basic LVM2 functionality?

Besides that, using software RAID is often a cheap way of doing RAID, and production-grade systems should be using hardware RAID instead.
 
Have you tried using the LVM2 mirror functionality instead of MD RAID? I ask because LVM2 is already used on Proxmox VE, while getting MD RAID/software RAID to work on Proxmox VE involves changing so many things that every time you update Proxmox VE you are forced to redo those changes. And you write "production VM host": if this is production, why all the extra hassle with something that needs so many changes, when you can have almost the same thing by using basic LVM2 functionality?
LVM2 doesn't perform nearly as well as mdadm, especially in large stripe+mirror arrays. And actually, because we boot off SSD and use LVM on top of mdadm for VM storage / backup, there are _no_ configuration changes required upon update.

Besides that, using software RAID is often a cheap way of doing RAID, and production-grade systems should be using hardware RAID instead.
That's an odd statement. Other than Proxmox, we use mdadm without issue in many mission-critical storage roles. It's extremely mature, and until we ran into this issue, we'd never had a stability issue with it. "Software RAID", as in fakeraid, deserves its reputation as "cheap", but Linux mdadm is no consumer toy.

That being said, I respect the Proxmox team's right to choose whatever technologies they would like to support, of course. I just wish they took mdadm more seriously, as its use is quite widespread in the industry (even internally on many SAN boxes).
 
LVM2 doesn't perform nearly as well as mdadm, especially in large stripe+mirror arrays.
Most users use a RAID system because they want to survive a disk failure; I don't know what you need RAID for. If you insist on using RAID and cannot go the hardware path, then the LVM2 mirror/stripe option could be a possibility, since MD RAID has this issue on your Proxmox VE installation. So why not try it? Have a look at the LVM mirror/stripe options; I would not be surprised if they turned out to be faster than MD RAID.
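
Just as an illustration (the volume group name, devices and size below are made up, not taken from your setup), a mirrored logical volume is created like this:

# pvcreate /dev/sda3 /dev/sdb3
# vgcreate data /dev/sda3 /dev/sdb3

Two mirror legs; --mirrorlog core keeps the mirror log in memory so no third device is needed
# lvcreate --mirrors 1 --mirrorlog core --size 100G --name vmdata data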

And actually, because we boot off SSD and use LVM on top of mdadm for VM storage / backup, there are _no_ configuration changes required upon update.
That is good.

That's an odd statement. Other than Proxmox, we use mdadm without issue in many mission-critical storage roles. It's extremely mature, and until we ran into this issue, we'd never had a stability issue with it. "Software RAID", as in fakeraid, deserves its reputation as "cheap", but Linux mdadm is no consumer toy.
Don't get me wrong. I have used MD RAID for ages; you don't need to sell me on its benefits.

That being said, I respect the Proxmox team's right to choose whatever technologies they would like to support, of course. I just wish they took mdadm more seriously, as its use is quite widespread in the industry (even internally on many SAN boxes).
I don't use Proxmox VE myself; I just installed it on a Hetzner box for a friend, and he insisted on mirroring the storage. Of course, it is as easy as 1-2-3 to convert that Proxmox installation to sit on top of MD RAID, but since I read here that MD RAID is not well supported, I decided to go the LVM2 path. He wanted the additional protection against a disk failure, and since MD RAID is somehow not an option with Proxmox VE, and hardware RAID is not something he wants to go for (for now), using LVM2 mirror/stripe is at least better than no RAID at all.