[SOLVED] Unable to Allocate Image on GlusterFS

JonathonFS

Member
Mar 8, 2022
Throwing this solution up in case someone else runs into the same issue. Scroll down to the bottom for the key Takeaways and dev recommendations.

Environment

  • 3x PVE 7.0-11 nodes clustered together
  • Every node has a ZFS pool with a GlusterFS brick on it
  • Glusterd version 9.2
  • Gluster is configured as a 3-node dispersed volume:
    • gluster volume create Dsp-Rz2-300GB disperse 3 redundancy 1 10.x.x.1:/pve1-300GB-Rz2/proxmox 10.x.x.2:/pve2-300GB-Rz2/proxmox 10.x.x.3:/pve3-300GB-Rz2/proxmox

The Problem

I was testing out some performance tuning settings on another volume in the Gluster pool. That volume was later deleted and the settings were reset. But shortly afterwards, I noticed I was unable to migrate a disk to the Gluster volume or create a new VM on it.

Task Output when migrating a disk:
create full clone of drive sata0 (pvexx-Volume:vm-105-disk-0)
Formatting 'gluster://10.x.x.1/Dsp-Rz2-300GB/images/105/vm-105-disk-0.raw', fmt=raw size=85899345920
TASK ERROR: storage migration failed: unable to create image: 'storage-Dsp-Rz2-300GB'-locked command timed out - aborting

Task Output when creating a VM:
Formatting 'gluster://10.x.x.1/Dsp-Rz2-300GB/images/109/vm-109-disk-0.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off preallocation=metadata compression_type=zlib size=85899345920 lazy_refcounts=off refcount_bits=16
TASK ERROR: unable to create VM 109 - unable to create image: 'storage-Dsp-Rz2-300GB'-locked command timed out - aborting

Troubleshooting

I ran pvesm status and Quorate was yes. I also ran these gluster commands, and everything showed active and connected. No issues.
# gluster pool list
# gluster peer status
# gluster volume info
# gluster volume status

The error output suggests there was an issue establishing the cluster lock on the storage or file. But as we'll see later, this was a red herring. Not knowing that, I proceeded to spend several hours trying to understand and fix this supposed locking issue. Partway into this troubleshooting effort, the gluster volume status command began to hang and report no information. It would give up after hitting the 120 second timeout for the Gluster CLI. The CLI log file (/var/log/glusterfs/cli.log) was not helpful either.

When I ran the migrate/create disk command, the cluster lock would appear in /etc/pve/priv/lock/. I could also see the directory being created at /mnt/pve/Dsp-Rz2-300GB/images/109. When the task failed, the directory (109) would be removed and the cluster lock cleared.

I then noticed that the task would run for exactly 60 seconds before failing. This was key to understanding that the problem wasn't necessarily related to locking. A staff member posted in another thread that commands run under a cluster lock are limited to 60 seconds, because they don't want files to stay locked indefinitely. In my case, the underlying command was pvesm alloc, and it was clearly hitting that 60 second limit. Running that command with even an extremely small disk size would generate the same errors as before:
# pvesm alloc Dsp-Rz2-300GB 1234 vm-1234-deleteme.raw 10M
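If you want to confirm the timing yourself, you can wrap the same allocation in time; it should abort right around the one minute mark with the same "-locked command timed out" error, rather than at some storage-specific timeout:
# time pvesm alloc Dsp-Rz2-300GB 1234 vm-1234-deleteme.raw 10M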

Since I could see the cluster lock being created and removed, I began to think this might not be a locking issue. To get more information, I dug around in the source code and found that for Gluster volumes, this command actually invokes qemu-img create. So I built a test command and ran it:
# mkdir /mnt/pve/Dsp-Rz2-300GB/images/1234
# /usr/bin/qemu-img create gluster://10.x.x.1/Dsp-Rz2-300GB/images/1234/vm-1234-deleteme.raw 10M
Formatting 'gluster://10.x.x.1/Dsp-Rz2-300GB/images/1234/vm-1234-deleteme.raw', fmt=raw size=10485760
[2022-03-07 22:29:55.640552 +0000] E [rpc-clnt.c:329:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x195)[0x7fc572cfdbc5] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x72fc)[0x7fc572c942fc] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x109)[0x7fc572c9c3d9] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x102c8)[0x7fc572c9d2c8] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x26)[0x7fc572c987e6] ))))) 0-gfapi: forced unwinding frame type(GlusterFS Handshake) op(GETSPEC(2)) called at 2022-03-07 22:28:52.661711 +0000 (xid=0x2)
[2022-03-07 22:29:55.640588 +0000] E [MSGID: 104021] [glfs-mgmt.c:725:glfs_mgmt_getspec_cbk] 0-glfs-mgmt: failed to get the volume file [{key=Dsp-Rz2-300GB}, {errno=22}, {error=Invalid argument}]
qemu-img: gluster://10.x.x.1/Dsp-Rz2-300GB/images/1234/vm-1234-deleteme.raw: Gluster connection for volume Dsp-Rz2-300GB, path images/1234/vm-1234-deleteme.raw failed to connect
hint: failed on host 10.x.x.1 and port 24007 Please refer to gluster logs for more info
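(For reference, that source-diving is easy to reproduce: the Gluster storage plugin is a Perl module shipped with PVE, and a quick grep shows the qemu-img invocation. The path assumes a stock PVE 7 install.)
# grep -n "qemu-img" /usr/share/perl5/PVE/Storage/GlusterfsPlugin.pm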

That error makes it seem like a networking issue. But if that's the case, why would gluster peer status and gluster volume info show connected/active? With no better lead, I decided to run the command against the second node in the cluster, and that actually worked:
# /usr/bin/qemu-img create gluster://10.x.x.2/Dsp-Rz2-300GB/images/1234/vm-1234-deleteme.raw 10M
Formatting 'gluster://10.x.x.2/Dsp-Rz2-300GB/images/1234/vm-1234-deleteme.raw', fmt=raw size=10485760
[2022-03-07 22:31:48.769216 +0000] I [io-stats.c:3706:ios_sample_buf_size_configure] 0-Dsp-Rz2-300GB: Configure ios_sample_buf size is 1024 because ios_sample_interval is 0
[2022-03-07 22:32:00.796115 +0000] I [io-stats.c:4038:fini] 0-Dsp-Rz2-300GB: io-stats translator unloaded
# ls /mnt/pve/Dsp-Rz2-300GB/images/1234
vm-1234-deleteme.raw
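(If you repeat this test, the leftover image and directory can be cleaned up afterwards; the paths match the test commands above.)
# rm /mnt/pve/Dsp-Rz2-300GB/images/1234/vm-1234-deleteme.raw
# rmdir /mnt/pve/Dsp-Rz2-300GB/images/1234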

Thinking this might just be a glitch in the gluster matrix, I got approval to shut down all the VMs and restart all the nodes. But after all the nodes came back up, none of the VMs could start. The Gluster volume was completely inaccessible, and the qemu-img create command now failed on every node. I ran pvesm status again, and Quorate was still yes. All the gluster commands showed the same as before the restart:
# gluster pool list (connected)
# gluster peer status (connected)
# gluster volume info (active)
# gluster volume status (hangs and exits after 120 sec with no output)
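A couple of generic checks that don't depend on the Gluster CLI are also worth running on each node at this point (standard Linux tools, nothing specific to this incident); they at least confirm whether glusterd and the brick processes are up and listening, even while gluster volume status hangs:
# systemctl status glusterd    (management daemon state)
# ss -ltnp | grep gluster      (glusterd on port 24007, plus the brick and self-heal processes)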

Checking the gluster logs didn't give me a smoking gun, but there were clearly some communication issues:

/var/log/glusterfs/glusterd.log
[2022-03-07 22:47:31.313372 +0000] I [MSGID: 106496] [glusterd-handshake.c:968:__server_getspec] 0-management: Received mount request for volume Dsp-Rz2-300GB
The message "I [MSGID: 106496] [glusterd-handshake.c:968:__server_getspec] 0-management: Received mount request for volume Dsp-Rz2-300GB" repeated 3 times between [2022-03-07 22:47:31.313372 +0000] and [2022-03-07 22:47:43.006851 +0000]
[2022-03-07 22:48:36.849309 +0000] I [MSGID: 106496] [glusterd-handshake.c:968:__server_getspec] 0-management: Received mount request for volume Dsp-Rz2-300GB
/var/log/glusterfs/cli.log had nothing useful

/var/log/glusterfs/bricks/pve1-300GB-Rz2-proxmox.log
[2022-03-07 22:47:29.235486 +0000] E [rpc-clnt.c:329:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x195)[0x7f19164b0bc5] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x72fc)[0x7f19164492fc] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x109)[0x7f19164513d9] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x102c8)[0x7f19164522c8] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x26)[0x7f191644d7e6] ))))) 0-glusterfs: forced unwinding frame type(GlusterFS Handshake) op(GETSPEC(2)) called at 2022-03-07 22:46:26.689942 +0000 (xid=0x3a)
[2022-03-07 22:47:29.235592 +0000] E [rpc-clnt.c:329:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x195)[0x7f19164b0bc5] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x72fc)[0x7f19164492fc] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x109)[0x7f19164513d9] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x102c8)[0x7f19164522c8] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x26)[0x7f191644d7e6] ))))) 0-glusterfs: forced unwinding frame type(Gluster Portmap) op(SIGNIN(4)) called at 2022-03-07 22:46:26.689960 +0000 (xid=0x3b)
[2022-03-07 22:48:14.291129 +0000] I [glusterfsd-mgmt.c:2661:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
/var/log/glusterfs/glustershd.log had one of these every 3 seconds
[2022-03-07 22:49:32.711221 +0000] E [socket.c:3534:socket_connect] 0-glusterfs: connection attempt on 127.0.0.1:24007 failed, (Invalid argument)
/var/log/glusterfs/mnt-pve-Dsp-Rz2-300GB.log
[2022-03-07 22:47:29.235646 +0000] E [rpc-clnt.c:329:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x195)[0x7f390e76ebc5] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x72fc)[0x7f390e7072fc] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x109)[0x7f390e70f3d9] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x102c8)[0x7f390e7102c8] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x26)[0x7f390e70b7e6] ))))) 0-glusterfs: forced unwinding frame type(GlusterFS Handshake) op(GETSPEC(2)) called at 2022-03-07 22:46:26.689918 +0000 (xid=0xe9)
[2022-03-07 22:48:34.771129 +0000] I [glusterfsd-mgmt.c:2661:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
[2022-03-07 22:48:34.771417 +0000] E [rpc-clnt.c:329:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x195)[0x7f390e76ebc5] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x72fc)[0x7f390e7072fc] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x109)[0x7f390e70f3d9] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x102c8)[0x7f390e7102c8] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x26)[0x7f390e70b7e6] ))))) 0-glusterfs: forced unwinding frame type(GlusterFS Handshake) op(GETSPEC(2)) called at 2022-03-07 22:47:32.236146 +0000 (xid=0xea)

The Solution

After racking my brain for a while, I remembered that about 4 days earlier I had changed the SAN NIC's MTU from 1500 to 6400 to try to boost performance. Changing the MTU back to 1500 and restarting the NICs brought Gluster back online instantly. Everything started working again. Praise God!
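For anyone hitting something similar, the quickest way to confirm an MTU mismatch on the storage network is a don't-fragment ping between the nodes. The interface name below is a placeholder, and on PVE the MTU is normally set in /etc/network/interfaces, so adjust for your setup:
# ip link show <san-nic>               (check the currently configured MTU)
# ping -M do -s 1472 -c 3 10.x.x.2     (1472 + 28 bytes of headers = 1500, should succeed on a standard path)
# ping -M do -s 6372 -c 3 10.x.x.2     (6372 + 28 = 6400, fails if the peer NIC or a switch in between doesn't accept the larger frames)
# ip link set dev <san-nic> mtu 1500   (revert; make it persistent in /etc/network/interfaces)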

Takeaways

If PVE is reporting a "lock" error and the task exits after 60 seconds (120 seconds if it's unresponsive), the problem might not be a cluster locking issue at all. It may be that the underlying command is simply exceeding the 60 second timeout applied to commands run under a cluster lock.

The PVE devs may want to consider adjusting the logic in Storage.pm so that output from qemu-img create can be displayed in the PVE task output window. Getting the error "failed to connect" would be a lot more helpful than getting the error "-locked". Starting on line 923, it seems the function that allocates the disk (alloc_image) is passed as a parameter to the function that creates the cluster lock (cluster_lock_storage). I'm not sure why it was done this way (maybe to enforce the 60 second timeout), but it would be nice if the Gluster plugin's alloc_image function could return an error message to the PVE task output window when it's unable to return a $volname.
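(If you want to look at the same spot, the storage layer is a Perl module shipped with PVE; the path below assumes a stock PVE 7 install, and the exact line number will drift between versions.)
# grep -n "cluster_lock_storage" /usr/share/perl5/PVE/Storage.pm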
 
Very nice troubleshooting case, thanks for taking the time to post it here.

@JonathonFS I wanted to ask you regarding the gluster volume configured as disperse 3 redundancy 1:

We're currently running a 2-node cluster in replica 3 arbiter 1 mode (2x data bricks + 1x arbiter), but we need to expand the Proxmox VE cluster to a 3-node cluster. The first (and most straightforward) thought was to convert the gluster volume live to replica 3 by adding the 3rd brick (the 3rd PVE cluster node) to gluster.

Doing some research, I found that a good option could also be the one you mentioned, disperse 3 redundancy 1. This way we would benefit in size from the additional brick and still maintain a resiliency of 1 node (we could lose 1 node and keep everything running).

Do you have any recommendation for one option vs. the other? Any caveats you faced while running the PVE cluster with a disperse 3 redundancy 1 gluster volume?

Thanks in advance.
 
