Proxmox with GlusterFS = Machines crashing randomly...

JuergenNl

New Member
Jul 3, 2015
Hello!

We set up a new Proxmox environment a few months ago. Everything ran fine for about 2 months, but now we are having problems with it: Linux machines crashing randomly. Sometimes their filesystems go read-only, or their load average climbs extremely fast to above 50 without any CPU usage. It looks like it is I/O-related.
But I can't find the cause.

This is our server configuration: a cluster of 3 Proxmox nodes, running GlusterFS across all 3 nodes with replica 2 (the 1st server replicates its brick to the 2nd one, the 2nd to the 3rd, and the 3rd to the 1st, so each node holds 2 bricks; a sketch of the create command is below).
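
For reference, the volume was created roughly like this (a sketch reconstructed from the brick order in the volume info below; in a replica 2 create command, each consecutive pair of bricks forms one replica set):

> gluster volume create gluster-volume replica 2 \
    vcl01:/var/lib/vz/brick1 vcl02:/var/lib/vz/brick1 \
    vcl02:/var/lib/vz/brick2 vcl03:/rpool/glusterfs/brick2 \
    vcl03:/rpool/glusterfs/brick1 vcl01:/var/lib/vz/brick2
> gluster volume start gluster-volume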

proxmox-ve-2.6.32: 3.4-150 (running kernel: 2.6.32-37-pve)
pve-manager: 3.4-3 (running version: 3.4-3/2fc72fee)
pve-kernel-2.6.32-20-pve: 2.6.32-100
pve-kernel-2.6.32-19-pve: 2.6.32-96
pve-kernel-2.6.32-16-pve: 2.6.32-82
pve-kernel-2.6.32-37-pve: 2.6.32-150
pve-kernel-2.6.32-22-pve: 2.6.32-107
pve-kernel-2.6.32-17-pve: 2.6.32-83
pve-kernel-2.6.32-26-pve: 2.6.32-114
pve-kernel-2.6.32-18-pve: 2.6.32-88
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-16
qemu-server: 3.4-3
pve-firmware: 1.1-4
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-32
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.2-8
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

Volume Name: gluster-volume
Type: Distributed-Replicate
Volume ID: 24e2888b-d540-4228-8e17-6e3e8c452335
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: vcl01:/var/lib/vz/brick1
Brick2: vcl02:/var/lib/vz/brick1
Brick3: vcl02:/var/lib/vz/brick2
Brick4: vcl03:/rpool/glusterfs/brick2
Brick5: vcl03:/rpool/glusterfs/brick1
Brick6: vcl01:/var/lib/vz/brick2
Options Reconfigured:
server.allow-insecure: on
performance.write-behind: off
cluster.quorum-type: none
network.ping-timeout: 2
performance.md-cache-timeout: 1
performance.cache-max-file-size: 2MB
performance.write-behind-window-size: 4MB
performance.read-ahead: off
performance.quick-read: off
performance.cache-size: 512MB
performance.io-thread-count: 64
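
Each of the reconfigured options above was applied with gluster volume set, e.g. (a sketch with two of them):

> gluster volume set gluster-volume network.ping-timeout 2
> gluster volume set gluster-volume performance.io-thread-count 64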

The machines' disk configuration looks like this:
virtio0: gluster-volume:103/vm-103-disk-1.qcow2,format=qcow2,size=100G
virtio1: gluster-volume:103/vm-103-disk-2.qcow2,format=qcow2,size=250G
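
The volume is attached as a GlusterFS storage in /etc/pve/storage.cfg, roughly like this (a sketch; the server entry pointing at the first node is an assumption):

glusterfs: gluster-volume
        server vcl01
        volume gluster-volume
        content images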

> gluster vol heal gluster-volume info
Brick vcl01:/var/lib/vz/brick1/
Number of entries: 0

Brick vcl02:/var/lib/vz/brick1/
Number of entries: 0

Brick vcl02:/var/lib/vz/brick2/
Number of entries: 0

Brick vcl03:/rpool/glusterfs/brick2/
Number of entries: 0

Brick vcl03:/rpool/glusterfs/brick1/
Number of entries: 0

Brick vcl01:/var/lib/vz/brick2/
Number of entries: 0

Everything looks OK.
I have tried everything I could find in related threads, but I can't get the system stable again :(
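
For completeness, these are the other status commands I know of on GlusterFS 3.5 that are worth checking (a sketch; output omitted):

> gluster volume status gluster-volume
> gluster volume heal gluster-volume info split-brain
> gluster peer status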

Which logs should I check?
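
So far I am watching the GlusterFS logs under /var/log/glusterfs/ on each node (a sketch; the mount and brick log file names are derived from the paths, so yours may differ):

> tail -f /var/log/glusterfs/cli.log
> tail -f /var/log/glusterfs/glustershd.log
> tail -f /var/log/glusterfs/mnt-pve-gluster-volume.log
> tail -f /var/log/glusterfs/bricks/var-lib-vz-brick1.log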

The cli.log shows this once a minute:
[2015-07-03 09:46:37.304631] I [socket.c:3561:socket_init] 0-glusterfs: SSL support is NOT enabled
[2015-07-03 09:46:37.304678] I [socket.c:3576:socket_init] 0-glusterfs: using system polling thread
[2015-07-03 09:46:37.376926] I [socket.c:3561:socket_init] 0-glusterfs: SSL support is NOT enabled
[2015-07-03 09:46:37.376940] I [socket.c:3576:socket_init] 0-glusterfs: using system polling thread
[2015-07-03 09:46:37.454291] I [socket.c:2238:socket_event_handler] 0-transport: disconnecting now
[2015-07-03 09:46:37.455901] I [cli-rpc-ops.c:518:gf_cli_get_volume_cbk] 0-cli: Received resp to get vol: 0
[2015-07-03 09:46:37.456049] I [cli-rpc-ops.c:779:gf_cli_get_volume_cbk] 0-cli: Returning: 0
[2015-07-03 09:46:37.456058] I [input.c:36:cli_batch] 0-: Exiting with: 0

But those messages have been appearing for a long time (how do I get rid of them?). They don't seem to be the cause of the problem...
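
A sketch of how to confirm what is invoking the gluster CLI once a minute (on Proxmox it is presumably pvestatd polling the storage status):

> while true; do ps -ef | grep '[g]luster v'; sleep 1; done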

glustershd.log:
[2015-07-03 07:32:24.883854] I [afr-self-heald.c:1690:afr_dir_exclusive_crawl] 0-gluster-volume-replicate-2: Another crawl is in progress for gluster-volume-client-5
[2015-07-03 09:05:42.055861] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2015-07-03 09:06:02.242792] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2015-07-03 09:06:02.243635] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2015-07-03 09:06:02.247140] I [glusterfsd-mgmt.c:1307:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2015-07-03 09:06:02.247181] I [glusterfsd-mgmt.c:1307:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2015-07-03 09:07:19.150924] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2015-07-03 09:07:35.981441] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2015-07-03 09:07:35.987741] I [glusterfsd-mgmt.c:1307:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2015-07-03 09:07:35.989568] I [glusterfsd-mgmt.c:1307:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2015-07-03 09:26:02.132316] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2015-07-03 09:26:08.649973] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2015-07-03 09:26:08.650823] I [glusterfsd-mgmt.c:1307:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2015-07-03 09:26:08.652070] I [glusterfsd-mgmt.c:1307:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing

Does anyone have an idea how to tune this to make it more stable? How can I track down the bottleneck/issues?
I can't find any tools that tell me what is going wrong in our setup... :(

Best regards,
Juergen
 
EDIT:
The mnt-pve-gluster-volume.log.1 tells me the following:
Looks like that's the reason... Is it configuration-based, or is the switch about to die??

[2015-07-03 01:40:52.892169] C [client-handshake.c:127:rpc_client_ping_timer_expired] 2-gluster-volume-client-2: server 10.1.1.52:49153 has not responded in the last 2 seconds, disconnecting.
[2015-07-03 01:40:52.992051] E [rpc-clnt.c:369:saved_frames_unwind] (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x168) [0x7fa46b65af08] (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x7fa46b6591c3] (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7fa46b6590de]))) 2-gluster-volume-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2015-07-03 01:40:50.630002 (xid=0x78d3)
[2015-07-03 01:40:52.992087] W [client-rpc-fops.c:2774:client3_3_lookup_cbk] 2-gluster-volume-client-2: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2015-07-03 01:40:52.992359] E [rpc-clnt.c:369:saved_frames_unwind] (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x168) [0x7fa46b65af08] (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x7fa46b6591c3] (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7fa46b6590de]))) 2-gluster-volume-client-2: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2015-07-03 01:40:50.630016 (xid=0x78d4)
[2015-07-03 01:40:53.155691] I [socket.c:3060:socket_submit_request] 2-gluster-volume-client-2: not connected (priv->connected = 0)
[2015-07-03 01:40:53.155751] W [rpc-clnt.c:1542:rpc_clnt_submit] 2-gluster-volume-client-2: failed to submit rpc-request (XID: 0x78d5 Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (gluster-volume-client-2)
[2015-07-03 01:40:53.206605] W [client-handshake.c:276:client_ping_cbk] 2-gluster-volume-client-2: timer must have expired
[2015-07-03 01:40:53.206611] W [client-rpc-fops.c:2774:client3_3_lookup_cbk] 2-gluster-volume-client-2: remote operation failed: Transport endpoint is not connected. Path: /images (037820a2-d531-4472-a4de-f0d2559721b5)
[2015-07-03 01:40:53.246978] I [client.c:2229:client_rpc_notify] 2-gluster-volume-client-2: disconnected from 10.1.1.52:49153. Client process will keep trying to connect to glusterd until brick's port is available
[2015-07-03 01:40:53.409051] I [rpc-clnt.c:1729:rpc_clnt_reconfig] 2-gluster-volume-client-2: changing port to 49153 (from 0)
[2015-07-03 01:40:53.422796] I [client-handshake.c:1677:select_server_supported_programs] 2-gluster-volume-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-07-03 01:40:53.423266] I [client-handshake.c:1462:client_setvolume_cbk] 2-gluster-volume-client-2: Connected to 10.1.1.52:49153, attached to remote volume '/var/lib/vz/brick2'.
[2015-07-03 01:40:53.423281] I [client-handshake.c:1474:client_setvolume_cbk] 2-gluster-volume-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2015-07-03 01:40:53.423649] I [client-handshake.c:450:client_set_lk_version_cbk] 2-gluster-volume-client-2: Server lk version = 1
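
That first line matches the network.ping-timeout: 2 from the volume options above: the client gives up on a brick after only 2 seconds without a response, so any short network or disk hiccup disconnects the VMs' storage. If it is configuration-based, raising the timeout back toward the GlusterFS default of 42 seconds would be the first thing to try (a sketch, not a tested fix):

> gluster volume set gluster-volume network.ping-timeout 42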