GlusterFS Woes

JasonMHall

Hello all,

I have been going around and around trying to Google my way out of issues I've been having with Proxmox VE and GlusterFS. I get the same io-error and KVM lockups that many people here have described before, but no fix has worked for me yet. I have a 4-node HA configuration using distributed-replicated volumes. At random intervals, KVM instances on the nodes will lock up (as in, no further writes to the volume), and the only remedy is to stop the VM and start it again. All nodes are running GlusterFS 3.7.6.
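
For reference, these are the standard gluster health checks I assume people will ask for first; I can post the output from any of them if it would help:
Code:
# run on any node: peer connectivity, brick/volume status, and pending heals
gluster peer status
gluster volume status data
gluster volume heal data info
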
Here's pveversion -v:
Code:
proxmox-ve: 4.1-26 (running kernel: 4.2.6-1-pve)
pve-manager: 4.0-64 (running version: 4.0-64/fc76ac6c)
pve-kernel-4.2.6-1-pve: 4.2.6-26
pve-kernel-3.19.8-1-pve: 3.19.8-3
pve-kernel-4.2.3-2-pve: 4.2.3-22
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 0.17.2-1
pve-cluster: 4.0-29
qemu-server: 4.0-40
pve-firmware: 1.1-7
libpve-common-perl: 4.0-41
libpve-access-control: 4.0-10
libpve-storage-perl: 4.0-37
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-17
pve-container: 1.0-32
pve-firewall: 2.0-14
pve-ha-manager: 1.0-14
ksm-control-daemon: 1.2-1
glusterfs-client: 3.7.6-1
lxc-pve: 1.1.5-5
lxcfs: 0.13-pve1
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve6~jessie

storage.cfg (/etc/pve/storage.cfg)
Code:
dir: local
	path /var/lib/vz
	content images,iso,rootdir,vztmpl
	maxfiles 0

glusterfs: data
	volume data
	path /mnt/data
	server ppat-pdnhvs001
	maxfiles 1
	server2 ppat-pdnhvs002
	content images,iso,backup,rootdir,vztmpl

dir: security
	path /security
	nodes ppat-pdnhvs004
	content images
	maxfiles 1

/etc/pve/qemu-server/105.conf (VM 105)
Code:
bootdisk: virtio0
cores: 4
hotplug: disk,network,usb
memory: 6144
name: vpat-pdnpdc002
net0: virtio=AE:48:8B:87:83:AC,bridge=vmbr0
net1: virtio=E6:2B:FA:A7:1E:63,bridge=vmbr10
net2: virtio=36:78:8C:C9:C9:1A,bridge=vmbr20
net3: virtio=32:D9:12:A7:B4:17,bridge=vmbr100
numa: 0
ostype: l26
parent: happyNewYear
smbios1: uuid=f3108ae0-64f3-4066-abfd-337970daccff
sockets: 1
vga: qxl
virtio0: data:105/vm-105-disk-1.qcow2,cache=writeback,size=52G

df -h from node 2
Code:
Filesystem                 Size  Used Avail Use% Mounted on
udev                        10M     0   10M   0% /dev
tmpfs                      3.2G  9.1M  3.2G   1% /run
/dev/dm-0                   95G  2.8G   87G   4% /
tmpfs                      7.9G   63M  7.8G   1% /dev/shm
tmpfs                      5.0M     0  5.0M   0% /run/lock
tmpfs                      7.9G     0  7.9G   0% /sys/fs/cgroup
/dev/mapper/pve-data       197G  5.3G  182G   3% /var/lib/vz
/dev/mapper/pve-gluster    5.2T  739G  4.5T  14% /gluster
ppat-pdnhvs002:/flexshare   11T  1.4T  9.0T  13% /mnt/flexshare
ppat-pdnhvs002:/homes       11T  1.4T  9.0T  13% /mnt/homes
cgmfs                      100K     0  100K   0% /run/cgmanager/fs
tmpfs                      100K     0  100K   0% /run/lxcfs/controllers
/dev/fuse                   30M   56K   30M   1% /etc/pve
ppat-pdnhvs001:data         11T  1.4T  9.0T  13% /mnt/data

/etc/fstab from node 2
Code:
# <file system> <mount point> <type> <options> <dump> <pass>
/dev/pve/root / ext4 errors=remount-ro 0 1
/dev/pve/data /var/lib/vz ext4 defaults 0 1
/dev/pve/swap none swap sw 0 0
proc /proc proc defaults 0 0
/dev/mapper/pve-gluster     /gluster     xfs     defaults,allocsize=4096,inode64,logbsize=256K,logbufs=8,noatime     1 2
ppat-pdnhvs002:/data		/mnt/data	glusterfs	defaults,_netdev,acl,log-level=WARNING,log-file=/var/log/glusterfs/glusterfs.log,backup-volfile-servers=ppat-pdnhvs001:ppat-pdnhvs003:ppat-pdnhvs004		1 2
ppat-pdnhvs002:/flexshare	/mnt/flexshare	glusterfs	defaults,_netdev,acl,log-level=WARNING,log-file=/var/log/glusterfs/glusterfs.log,backup-volfile-servers=ppat-pdnhvs001:ppat-pdnhvs003:ppat-pdnhvs004		1 2
ppat-pdnhvs002:/homes		/mnt/homes	glusterfs	defaults,_netdev,acl,log-level=WARNING,log-file=/var/log/glusterfs/glusterfs.log,backup-volfile-servers=ppat-pdnhvs001:ppat-pdnhvs003:ppat-pdnhvs004		1 2
ppat-pdnhvs002:/spool		/mnt/spool	glusterfs	defaults,_netdev,acl,log-level=WARNING,log-file=/var/log/glusterfs/glusterfs.log,backup-volfile-servers=ppat-pdnhvs001:ppat-pdnhvs003:ppat-pdnhvs004		1 2

You'll notice that the "spool" volume listed in fstab isn't actually mounted (see the df output); that's left over from some earlier tinkering.

I've also attached the redirected output of "ip a" from all nodes, the gluster volume info, and the GlusterFS logs from the first node.

You will notice the common error of
Code:
All subvolumes are down. Going offline until atleast one of them comes back up.

At one point I saw topology-change errors in one of my switch logs at around the same time the above error appeared on the nodes, but I'm currently unable to get to my switches to grab those logs. People have talked about changing the ping-timeout, which is something I haven't tried yet.
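
If I do try the ping-timeout change, my understanding is that it's set per volume, something like the following (the 10-second value is just an example, not something I've tested):
Code:
# lower the client-side ping timeout on the "data" volume (the GlusterFS default is 42 seconds)
gluster volume set data network.ping-timeout 10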

I have also tried disabling the self-heal daemon on the gluster volume where the qcow2 disks live:
Code:
gluster volume set data cluster.self-heal-daemon off
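I believe the setting can be confirmed under "Options Reconfigured" in the volume info output:
Code:
# cluster.self-heal-daemon: off should appear under "Options Reconfigured"
gluster volume info data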

You'll likely notice that I've taken quite the sloppy approach to addressing the issue but, admittedly, I'm not an expert.
I've yet to find a definitive approach to handling the delicate balance between ProxmoxVE and GlusterFS, and I would really appreciate any guidance to narrow down and pinpoint the culprit behind this issue.

Thanks in advance!
 

Attachments

  • glustervolumeinfo.txt (2.6 KB)
  • ppat-pdnhvs001-ipa.txt (4 KB)
  • ppat-pdnhvs002-ipa.txt (4 KB)
  • ppat-pdnhvs003-ipa.txt (3.8 KB)
  • ppat-pdnhvs004-ipa.txt (4.7 KB)
Do all of the VMs you're running use .qcow2 disks? Have you tried using .raw instead?
If you must use .qcow2, then enable cache=writethrough for all the disk images and power-cycle the VMs. I believe you will see a difference.
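
If you want to do it from the CLI instead of the GUI, something like this should work (the disk spec below is just copied from your 105.conf with the cache option changed, so double-check it against your config):
Code:
# change the cache mode on the existing virtio disk, then power-cycle the VM
qm set 105 --virtio0 data:105/vm-105-disk-1.qcow2,cache=writethrough,size=52G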
 
It doesn't appear to have helped. I've attached the pertinent information from two log files.

With hardware RAID controllers, what should the caching be set to on the physical disks? All nodes use RAID 10.

pvdisplay
Code:
  --- Physical volume ---
  PV Name               /dev/sda3
  VG Name               pve
  PV Size               5.46 TiB / not usable 3.98 MiB
  Allocatable           yes (but full)
  PE Size               4.00 MiB
  Total PE              1430367
  Free PE               0
  Allocated PE          1430367
  PV UUID               TM1wZX-gkgQ-qkuw-OFxC-Q0bQ-n0Lv-6QbF05

vgdisplay
Code:
  --- Volume group ---
  VG Name               pve
  System ID           
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  10
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                4
  Open LV               4
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               5.46 TiB
  PE Size               4.00 MiB
  Total PE              1430367
  Alloc PE / Size       1430367 / 5.46 TiB
  Free  PE / Size       0 / 0 
  VG UUID               teIkR3-b3Ro-tWNu-p08z-kCNQ-hqH0-TlVUbd

lvdisplay
Code:
  --- Logical volume ---
  LV Path                /dev/pve/swap
  LV Name                swap
  VG Name                pve
  LV UUID                4YVSO1-eboQ-imYs-Im32-z0Cx-c6C3-0DjsfL
  LV Write Access        read/write
  LV Creation host, time proxmox, 2015-07-14 09:14:27 -0400
  LV Status              available
  # open                 2
  LV Size                15.00 GiB
  Current LE             3840
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           252:1
 
  --- Logical volume ---
  LV Path                /dev/pve/root
  LV Name                root
  VG Name                pve
  LV UUID                uCnOXN-5443-e9cI-0b2W-hwIL-dySX-NjJ478
  LV Write Access        read/write
  LV Creation host, time proxmox, 2015-07-14 09:14:27 -0400
  LV Status              available
  # open                 1
  LV Size                96.00 GiB
  Current LE             24576
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           252:0
 
  --- Logical volume ---
  LV Path                /dev/pve/data
  LV Name                data
  VG Name                pve
  LV UUID                kGeapR-Pu7A-o9cz-54Ox-1ilO-3Hnu-aHkJOf
  LV Write Access        read/write
  LV Creation host, time ppat-pdnhvs001, 2015-07-14 11:42:54 -0400
  LV Status              available
  # open                 1
  LV Size                200.00 GiB
  Current LE             51200
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           252:2
 
  --- Logical volume ---
  LV Path                /dev/pve/gluster
  LV Name                gluster
  VG Name                pve
  LV UUID                y9XfPT-36KN-FYQW-HaMs-MTyy-Ac8b-hzREkp
  LV Write Access        read/write
  LV Creation host, time ppat-pdnhvs001, 2015-07-14 11:52:56 -0400
  LV Status              available
  # open                 1
  LV Size                5.15 TiB
  Current LE             1350751
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           252:3
 

Attachments

  • ppat-pdnhvs002-syslog.txt (5.5 KB)
  • ppat-pdnhvs002-mnt-data.log.1.txt (9 KB)
In my setup I do not use hardware RAID. My replicated Gluster volume sits on top of a ZFS RAID-Z3 pool with 35 TB of usable space in each node. I personally avoid hardware RAID altogether; the RAID cards in the nodes are all set to JBOD.
If you have a battery-backed RAID controller, I believe writeback is the cache mode to use; if not, it should be writethrough to avoid data loss.

Did you try converting the .qcow2 images to .raw to check the performance?
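
If you do try the conversion, it would look roughly like this. The paths are just my guess based on your storage.cfg (Proxmox keeps the disks under images/<vmid>/ on the volume), so verify them first, do it with the VM shut down, and keep the qcow2 around until the raw copy is confirmed working:
Code:
# convert the qcow2 image to raw (VM 105 must be powered off)
qemu-img convert -p -O raw /mnt/data/images/105/vm-105-disk-1.qcow2 /mnt/data/images/105/vm-105-disk-1.raw
# point the VM at the new raw image
qm set 105 --virtio0 data:105/vm-105-disk-1.raw,cache=writethrough,size=52G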
 
I have not yet attempted to convert the images to .raw. All nodes have battery-backed RAID controllers. I'll have to wait until after hours to make changes, but I will first verify the caching on the physical disks and then attempt to convert the images to .raw.
 
