Shared storage, two node cluster, FS corruption. :(

rahman

Renowned Member
Nov 1, 2010
Hi,

I have two clustered v1.7 nodes (all VMs are KVM). The SAN storage is connected to both via FC. Multipath is configured on both machines, and the LUN is mounted on both machines at the same path (/virtual/4x2000).

For a few weeks everything was working OK. Live migration was working without any problems. Today I did some network configuration on the master and restarted it via the web interface. After the reboot, while it was mounting partitions, it started to give file system errors about the LUN partition. I couldn't note the exact FS error, but I needed to fsck the partition manually. I shut down the slave cluster node (as there are some VMs on the same shared LUN) and started an fsck on the master server. It took more than an hour to complete. I mounted the partition successfully and started the VMs on the master node. All VMs were OK except one. The Oracle Linux VM refused to boot, with FS errors. I started it with a live CD and ran fsck on the VM's sda1; it took a few hours to finish and reported a lot of errors. But the VM didn't come to life again.

Fortunately it was a fresh install with an Oracle DB and no data.

So, what was the problem with the shared storage? Is it not OK to reboot the master? PVE has its own locking system to prevent corruption on shared storage while the nodes are in sync. But what if the master decides to run fsck on boot while the slave is using the shared storage?

So how safe is the Proxmox locking system? Did I do something wrong that caused this huge FS corruption?

Thanks.
 
Ext4. The wiki says that if I check "shared" while creating the storage on the master, it should take care of locking, so there is no need to use a cluster file system. Did I get it wrong?
 

That applies only if you use LVM on the shared storage. It won't do any magic to make a filesystem "shared storage aware". You can't have a non-clustered file system on a shared block device, and Proxmox is no exception.
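
To illustrate why a non-clustered file system breaks on a shared block device, here is a toy model in Python. It is only a sketch and not how ext4 is actually implemented; the classes SharedDisk and Node are made up for the illustration. The assumption it demonstrates is that each node caches file system metadata (here, a free-block bitmap) at mount time and is never told about changes made by the other node.

```python
# Toy model (not real ext4 code): two nodes mount the same block device
# and each keeps its own in-memory copy of the free-block bitmap.

class SharedDisk:
    """The LUN: a free-block bitmap plus the data blocks themselves."""
    def __init__(self, n_blocks):
        self.bitmap = [False] * n_blocks   # False means the block is free
        self.blocks = [None] * n_blocks


class Node:
    """A host that mounted the disk with a non-cluster file system."""
    def __init__(self, name, disk):
        self.name = name
        self.disk = disk
        # Metadata is read once at mount time and cached locally;
        # nothing tells this node when the other node changes it.
        self.cached_bitmap = list(disk.bitmap)

    def write_file(self, data):
        # Pick the first block that looks free in the *local* cache.
        block = self.cached_bitmap.index(False)
        self.cached_bitmap[block] = True
        # Write data and metadata back to the shared disk.
        self.disk.blocks[block] = (self.name, data)
        self.disk.bitmap[block] = True
        return block


disk = SharedDisk(8)
master = Node("master", disk)   # both nodes mount before any write happens
slave = Node("slave", disk)

b1 = master.write_file("vm-101 disk data")
b2 = slave.write_file("vm-102 disk data")

print(b1, b2)           # 0 0 -> both nodes allocated the same block
print(disk.blocks[0])   # ('slave', 'vm-102 disk data') -> master's data is gone
```

A cluster file system avoids this by coordinating metadata changes between nodes; with plain LVM there is no shared file system metadata at all, since each VM disk is its own logical volume.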
 
Ext4 is not a shared filesystem! Never, never, never use that on shared storage!!

Instead simply use LVM (without a filesystem on top).
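
As a quick way to check whether a node is in the risky situation described in this thread, the following Python sketch scans text in /proc/mounts format for shared devices mounted with a non-cluster file system. It is only an illustration under assumptions: the function name risky_mounts and the device name /dev/mapper/mpath0 are hypothetical, and the set of cluster file systems (OCFS2, GFS2) is an example, not an exhaustive list.

```python
# Assumption: these are treated as cluster-aware file systems; extend as needed.
CLUSTER_FILESYSTEMS = {"ocfs2", "gfs2"}


def risky_mounts(mounts_text, shared_devices):
    """Return (device, mountpoint, fstype) for every shared device that is
    mounted with a file system not listed in CLUSTER_FILESYSTEMS."""
    risky = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype = fields[:3]
        if device in shared_devices and fstype not in CLUSTER_FILESYSTEMS:
            risky.append((device, mountpoint, fstype))
    return risky


# Sample input in /proc/mounts format; the multipath device name is hypothetical.
sample = """\
/dev/sda1 / ext3 rw 0 0
/dev/mapper/mpath0 /virtual/4x2000 ext4 rw 0 0
"""

found = risky_mounts(sample, {"/dev/mapper/mpath0"})
print(found)  # [('/dev/mapper/mpath0', '/virtual/4x2000', 'ext4')]
```

On a real node the text could be read from /proc/mounts and the set of shared devices filled in by hand. If the same device shows up in the result on more than one node, that is the setup to avoid.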

Then I really dodged a bullet; I am lucky I didn't lose anything important. It seems I need to study a bit more. Thanks for your hints.