KVM disk corruption on glusterfs

jinjer

Renowned Member
Oct 4, 2010
Hi,

I'm having issues with disk corruption for glusterfs hosted kvm images.

Simply put: if a VM is running and I reboot one of the replicas holding the VM's disk image, the running VM starts throwing disk errors and eventually dies. When I then reboot the VM, I see disk corruption.

I'm using cache=writethrough and aio=native.

The back-end is gluster 3.4.2 with a distributed-replicated volume of 2xN bricks.
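
For reference, a volume of that shape is created by listing the replica pairs back to back; the hostnames, volume name and brick paths below are just placeholders, not my actual layout:

# gluster volume create vmstore replica 2 \
    gluster1:/export/brick1 gluster2:/export/brick1 \
    gluster1:/export/brick2 gluster2:/export/brick2
# gluster volume start vmstore

With replica 2, every pair of bricks listed consecutively forms one replica set, so consecutive bricks have to sit on different servers.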

Help needed :'(

Jinjer

# pveversion -v
proxmox-ve-2.6.32: 3.1-121 (running kernel: 2.6.32-27-pve)
pve-manager: 3.1-43 (running version: 3.1-43/1d4b0dfb)
pve-kernel-2.6.32-27-pve: 2.6.32-121
pve-kernel-2.6.32-26-pve: 2.6.32-114
pve-kernel-2.6.32-23-pve: 2.6.32-109
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.5-1
pve-cluster: 3.0-12
qemu-server: 3.1-15
pve-firmware: 1.1-2
libpve-common-perl: 3.0-13
libpve-access-control: 3.0-11
libpve-storage-perl: 3.0-19
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-6
vzctl: 4.0-1pve4
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.7-4
ksm-control-daemon: 1.1-1
glusterfs-client: 3.4.2-2
 
Hi Jinjer,
Not sure what is happening in your case, but the only corruptions I've had occurred when I was starting out with gluster and was very 'nervous'.

Rebooting a gluster node should not be a problem at all; I do it every week, just to test.
Maybe it becomes a problem if you don't give gluster enough time to rebuild the bricks after a node reboots...

Do you have tools to know/monitor how gluster is performing while it rebuilds the bricks?

If not, attached are two little scripts that I run on my two gluster servers, just to check that things are going OK:

I'm running watch -n5 ./testgluster.sh on one node
and ./verheal.sh on the other node.

They give you information about gluster status...
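
A minimal sketch of what such a check can look like (the attached files are the real thing; the volume name 'vmstore' here is only an example) would be:

#!/bin/sh
# print peer membership, volume/brick status and pending self-heal entries
gluster peer status
gluster volume status vmstore
gluster volume heal vmstore info

If 'heal ... info' still lists entries, the rebuild after a reboot is not finished yet.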

Hope it helps ;)
 

Attachments

  • GlusterTestFiles.zip
    1.2 KB
I think I may be seeing this same issue.

I have 3 proxmox nodes, all running glusterfs 3.4.2, with replica 2, quorum-type auto and server-quorum-type server (actually all the options set by 'group virt'). I use 2 bricks on each server to comply with the replica requirements, but I made sure that 2 bricks on the same node are NOT listed back to back, and verified that the VM files end up on bricks on different servers.
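
For reference, the quorum part of that configuration amounts to the following volume options (the volume name 'datastore' is only a placeholder); the rest came from applying the 'virt' group:

# gluster volume set datastore group virt
# gluster volume set datastore cluster.quorum-type auto
# gluster volume set datastore cluster.server-quorum-type server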

I create some basic CentOS 6.5 guests/VMs that do nothing (just test VMs, no active use), and shut down the nodes one by one, waiting for 'gluster volume heal info' to show that the heal is complete before restarting the next node (you can also tell by CPU usage on the machine). Then at some point I'll notice in the console of a VM that the root filesystem has been remounted read-only. When I reboot that VM, it won't come back up; a kernel panic occurs, presumably due to corruption. My VMs use virtio with cache disabled, and I've tested with both qcow2 and raw disk images.

I know this post is a couple of months old, just wondering if any resolution was found. I noticed that the post made by cesarpk indicated that aio=native may be bad and to use aio=threads instead, but I don't see any obvious way to do that with Proxmox.
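
The only thing I can think of is editing the VM config under /etc/pve/qemu-server/ and adding aio=threads to the disk line by hand; on newer qemu-server versions aio is a regular per-disk option, but I have no idea whether the 3.1 parser accepts it. Hypothetically the line would look like (storage name and VM ID are made up here):

virtio0: glusterstore:108/vm-108-disk-1.qcow2,cache=none,aio=threads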
 
It turns out this wasn't actually disk corruption; the kernel panic was due to / not being writable, i.e. the volume went into a read-only state on glusterfs. With replica 2 but 3 nodes, the loss of just one node while client-side quorum is enabled results in the VM image becoming read-only, because client-side quorum apparently operates at the "sub-volume" level. Once I disabled client-side quorum and relied only on server-side quorum, the issue appears to have gone away.
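
For anyone else hitting this, disabling client-side quorum while keeping server-side quorum is a single option change (volume name is a placeholder):

# gluster volume set datastore cluster.quorum-type none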
 
I'm observing similar corruption as well. I know it is corruption because the VM runs ZFS internally, and ZFS detects the errors via checksum failures. This has happened to me on Proxmox 4.1, and just now it happened again during the 4.1 => 4.2 upgrade. I have a three-node cluster as well, with a replication factor of 2 and only two data nodes; the third node is small and is there just for quorum.
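
Inside the guest a scrub is what surfaces it; assuming the pool is called 'tank' (just an example name):

# zpool scrub tank
# zpool status -v tank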

At first I thought it was related to migrating the VM from node2 to node1 and back. This time it happened on a shut-down VM.

The sequence went as follows. Both vm 101 and 103 ended up with disk corruption. Both were running on node2 at the start of the upgrade and use glusterfs as the storage.
  1. upgrade node0 to PVE 4.2; reboot
  2. zfs snapshot the disks inside vm 101; snapshot vm 101 in PVE (I was expecting corruption because of the migration)
  3. migrate vm101 to node1 (success. no corruption. i checked the disks)
  4. shutdown vm 103 (the only remaining VM on node2)
  5. upgrade node2 to PVE 4.2; reboot
  6. vm 103 boots up. Unfortunately I did not log in to check what was happening with it, but the console window reported no errors.
  7. migrate vm 101 back to node2
  8. check gluster volume status. It shows no tasks running, so I presume it is synced up.
  9. stop all VMs on node1 and upgrade node1 to PVE 4.2; reboot
  10. boom.
Both vm101 and 103 report ZFS corruption in their file systems.

On vm 101, doing a zfs rollback to the internal snapshot did not help; there was still corruption. Rolling back the qemu snapshot was successful, and ZFS internally reports no corruption on a scrub.
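
Concretely, the commands were roughly the following; pool, dataset and snapshot names are placeholders for what I actually used:

# zfs rollback tank@pre-upgrade      (inside vm 101 - did not help, a scrub still found errors)
# qm rollback 101 pre-upgrade        (on the PVE host - this is what worked)
# zpool scrub tank                   (inside vm 101 again - zpool status -v came back clean)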

On vm 103, I had to restore from a backup taken a few hours prior. Since restore to gluster is currently broken for a large disk (see Bug 932), I just restored to the local ZFS pool and will likely leave it that way. It is faster anyhow.
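
The restore itself was just qmrestore pointed at local storage instead of the gluster storage; the archive name and storage ID below are placeholders:

# qmrestore /var/lib/vz/dump/vzdump-qemu-103-<timestamp>.vma.lzo 103 --storage local-zfs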

I have no idea if the VMs on node1 are corrupted, but they seem to be running ok. These all have FreeBSD UFS file systems and one is running an embedded linux-based appliance.

My new procedure is to do a qm snapshot on all VMs prior to rebooting anything.
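
Something along these lines on each node before touching it, assuming all disks are on snapshot-capable storage (qcow2 in my case); the snapshot name is arbitrary:

# for id in $(qm list | awk 'NR>1 {print $1}'); do qm snapshot $id pre-reboot; done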
 
So the definitive answer I have gotten is that with 2 data nodes, if the "first" node goes down, the cluster stops working, and you get this corruption because the clients don't seem to stop trying to write. You'd think that the client would get a write error and panic or something, but it does not.
 
