Hi,
I have two clustered v1.7 nodes (all KVM VMs). The SAN storage is connected to both via FC. Multipath is configured on both machines, and the LUN is mounted on both machines at the same place (/virtual/4x2000).
For a few weeks everything was working fine, and live migration worked without any problems. Today I did some network reconfiguration on the master and restarted it via the web interface. After the reboot, while it was mounting partitions, it started giving file system errors about the LUN partition. I couldn't note the exact fs error, but I needed to manually fsck the partition. I shut down the slave cluster node (as there were some VMs on the same shared LUN) and started fsck on the master server. It took more than an hour to complete.

I then mounted the partition successfully and started the VMs on the master node. All VMs were OK except one: the Oracle Linux VM refused to boot due to FS errors. I booted it with a live CD and ran fsck on the VM's sda1; it took a few hours to finish, with a lot of errors. But the VM didn't come back to life.
Fortunately it was a fresh install with Oracle DB and no data.
So, what went wrong with the shared storage? Is it not OK to reboot the master? PVE has its own lock system to prevent corruption on shared storage while the nodes are in sync, but what happens if the master decides to run fsck on boot while the slave is still using the shared storage?
So how safe is the Proxmox locking system? Did I do something wrong that caused this huge FS corruption?
Thanks.