shared storage: LVM2 or CLVM ?

jinjer

Oct 4, 2010
I have a question regarding LVM (for KVM machines) and shared storage.

I've seen that LVM on top of shared storage is routinely recommended for KVM live migration and the like, but my understanding (please correct me if I'm wrong) is that shared LVM needs an extension to LVM (clustered LVM, CLVM) so that metadata is propagated from one cluster node to the others.

This is done by the clvmd daemon, which is responsible for replicating metadata changes from one node to the others. Failure to do this would produce a "split brain" condition for LVM, with obvious consequences.

I fail to understand why LVM (VG/LV) is routinely built on top of iSCSI shared storage and how this setup is able to run across cluster nodes with no problems (i.e. is the metadata stored on the underlying iSCSI device, and is current LVM aware of this)?

jinjer
 
We implemented our own mechanism for 1.x to make sure that only one VM accesses the volume, so no CLVM is needed. But this will change for the 2.x series.
 
Would you like to elaborate a little more?

The PV is on shared storage (drbd/iscsi etc) but the VG and LV need synchronization between the nodes... well unless you only create the LV from a single node (the master). But even in this case the updated metadata need to be migrated to other nodes (or is it enough to pvscan/vgscan/lvscan on the other nodes to get updates)?

I'm referring to this doc: http://www.centos.org/docs/5/html/Cluster_Logical_Volume_Manager/LVM_Cluster_Overview.html

jinjer
 
I had problems with shared storage and the plain Proxmox solution. I have a cluster of a few Proxmox machines and a bunch of VMs. After I migrated one virtual machine to another box, it booted up with the disk image of another machine. I had to dd all volumes to another LUN (this time with CLVM), as I could not risk losing some volumes after the Proxmox boxes were rebooted, because it seemed that a few nodes had different LVM metadata than the others.
If I remember correctly, my assumption was that I deleted a few virtual machines while a vzdump backup was dumping volumes used by those machines (using LVM snapshots), and on the nodes that were running those vzdump jobs LVM refused to remove the volumes because their snapshots were held open. That is just a guess; I didn't dig anything up from the logs, as I don't know when the LVM corruption happened.
One downside of using CLVM with Proxmox is that it uses cman 2 (that's what you get with lenny), which gives me trouble and which I don't like working with.

P.S. You can't use lvm snapshots with CLVM.
 
The PV is on shared storage (drbd/iscsi etc) but the VG and LV need synchronization between the nodes...

Well, it is on 'shared' storage.

well unless you only create the LV from a single node (the master).

That is not necessary - you just need a proper lock.

But even in this case the updated metadata need to be migrated to other nodes

why?

(or is it enough to pvscan/vgscan/lvscan on the other nodes to get updates) ?

AFAIK yes.
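
The "proper lock" idea can be sketched roughly like this. It is a toy, single-process model and not the Proxmox implementation: `SharedVG`, `Node` and `create_lv` are made-up names, and a `threading.Lock` merely stands in for whatever cluster-wide lock a real setup would use. The point is that any node may change the VG, as long as it takes the lock and re-reads the metadata from the shared device first.

```python
import threading


class SharedVG:
    """Toy stand-in for VG metadata kept on the shared device (hypothetical)."""

    def __init__(self):
        self.lvs = {}                 # LV name -> size in GiB
        self.lock = threading.Lock()  # stands in for a cluster-wide lock


class Node:
    def __init__(self, name, vg):
        self.name = name
        self.vg = vg

    def create_lv(self, lv_name, size_gib):
        # Take the lock, re-read the current state, then modify it.
        with self.vg.lock:
            current = dict(self.vg.lvs)  # fresh read, nothing kept between calls
            if lv_name in current:
                raise ValueError(f"{lv_name} already exists")
            current[lv_name] = size_gib
            self.vg.lvs = current        # write back while still holding the lock


vg = SharedVG()
node_a, node_b = Node("a", vg), Node("b", vg)
node_a.create_lv("vm-101-disk-1", 32)  # no node has to be "the master"
node_b.create_lv("vm-102-disk-1", 16)
print(sorted(vg.lvs))
```

Because each change starts from a fresh read under the lock, node B sees the LV that node A created without anything having been pushed to it.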
 
@dietmar: I've not read the source for LVM, however:

The point is that the kernel (almost certainly) caches some information about the VG/LV even if the source is stored on shared storage. This information can become outdated through operations on another node and hence has to be refreshed on each change. This is what clvmd does (again a supposition, but a likely one).

The official docs on clustered LVM seem to support this, and l.mierzwa's experience agrees (i.e. some nodes had outdated information).

This is the same situation that you get with shared storage and a plain FS (ext3 etc.). If one node updates metadata or disk contents that are cached in another node's cache, you're in trouble. That's why you need a cluster filesystem.
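
To make the concern concrete, here is a small toy model of the failure mode. It is only an illustration with made-up names (`shared_metadata`, `CachingNode`); it says nothing about how LVM or the kernel actually cache things.

```python
# What is really stored on the shared device (hypothetical contents).
shared_metadata = {"vm-101-disk-1": 32}


class CachingNode:
    """A node that remembers the first view it read until told to rescan."""

    def __init__(self):
        self.cache = None

    def list_lvs(self):
        if self.cache is None:
            self.cache = dict(shared_metadata)
        return self.cache

    def rescan(self):
        # Throw away the cached view and read the shared copy again.
        self.cache = dict(shared_metadata)
        return self.cache


node_b = CachingNode()
node_b.list_lvs()                      # node B reads and caches the metadata

shared_metadata["vm-102-disk-1"] = 16  # another node creates a new LV

stale = sorted(node_b.list_lvs())      # node B still sees its old view
fresh = sorted(node_b.rescan())        # after a rescan it sees the new LV
print(stale)
print(fresh)
```

In this model the stale view lists only the first volume, while the rescanned view lists both.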

I have scratched my head over this issue and I think there's a safer (although less efficient) way of doing this, namely Oracle's way.

What Oracle does in its Oracle VM product is share a mount point via OCFS2. Server images are simply files on top of this shared filesystem.

The filesystem is optimized for handling huge files and, while it does not support snapshots, it does support reflinks. A reflink is a copy-on-write snapshot of a disk file, or in other words a very efficient snapshot of a big file. The reflink is Oracle's answer to backup and quick cloning of (running) VMs.
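
Roughly, the copy-on-write idea behind a reflink looks like the toy model below. It is only a sketch with made-up names (`CowFile`, `reflink`, `write_block`) and not how OCFS2 implements it: the clone gets its own block table that points at the same block data, and a block stops being shared only when one side writes to it.

```python
class CowFile:
    """Toy file: a table of references to immutable data blocks."""

    def __init__(self, blocks):
        self.blocks = list(blocks)

    def reflink(self):
        # New block table, same underlying block objects: no data is copied.
        return CowFile(self.blocks)

    def write_block(self, index, data):
        # Only this file's pointer changes; the other file keeps the old block.
        self.blocks[index] = data

    def shared_with(self, other):
        # Count the positions where both files point at the very same block.
        return sum(1 for a, b in zip(self.blocks, other.blocks) if a is b)


disk = CowFile(bytes([i]) * 4096 for i in range(4))  # a 4-block "disk image"
snap = disk.reflink()                                # instant clone
shared_before = disk.shared_with(snap)

disk.write_block(1, b"\xff" * 4096)                  # the VM keeps writing
shared_after = disk.shared_with(snap)
print(shared_before, shared_after)
```

The clone still holds the old contents of the overwritten block, which is what makes it usable as a backup source or as the base of a new VM.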

I'm currently experimenting with an OCFS2 cluster and KVM VMs with raw files on top of it. I have no benchmarks... but the VMs feel pretty responsive.

There's an issue with OCFS2 in the Proxmox kernel, so I'll probably resort to compiling my own clean kernel and using Proxmox only as a GUI.

I'd like some feedback on this...

jinjer
 
The LVM tools write metadata without using a cache (O_DIRECT). Before we make any changes or migrate, we rescan. So there is no problem.
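
As a rough illustration of "rescan first, then act" (this is not the actual Proxmox code): the helper name `rescan_then`, the choice of `vgscan` followed by `lvscan` as the rescan step, and the volume name `vg0/vm-101-disk-1` are all assumptions for the example. By default it only returns the commands it would run, so it can be tried on a machine without LVM.

```python
import subprocess

# Assumed rescan step: re-read VG and LV metadata from the shared device.
RESCAN_COMMANDS = [["vgscan"], ["lvscan"]]


def rescan_then(action, dry_run=True):
    """Return the rescan commands followed by the action; run them unless dry_run."""
    commands = [list(cmd) for cmd in RESCAN_COMMANDS] + [list(action)]
    if not dry_run:
        for argv in commands:
            subprocess.run(argv, check=True)  # stop at the first failure
    return commands


# Example: activate a (hypothetical) volume on the migration target node.
plan = rescan_then(["lvchange", "-ay", "vg0/vm-101-disk-1"])
for argv in plan:
    print(" ".join(argv))
```

Doing the same thing by hand means running the rescan on a node before touching a volume there.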
 
Thanks for the explanation. I see the potential problem: as soon as you use the LVM tools by hand, the cluster can get out of sync. But knowing how you handle things, it's easy to avoid problems (manually rescan on all nodes).

Care to comment on the OCFS2 way of doing things?

jinjer
 
I've a Proxmox cluster with shared DRBD storage up and running, and I would like to take a snapshot backup of individual logical volumes manually. How should I do this backup on a cluster node, step by step? What does rescan mean? Just "lvscan" before and after taking and removing the snapshot?

Sascha
 