Proxmox with GlusterFS = Machines crashing randomly...

JuergenNl

Jul 3, 2015
Hello!

We set up a new Proxmox environment a few months ago. Everything ran fine for about two months, but now we are having problems with it: Linux VMs are crashing randomly. Sometimes their filesystems go read-only, or their load climbs extremely fast to above 50 without any CPU usage. It looks like it is I/O-related.
But I can't find the cause.

This is our server configuration: a cluster of 3 Proxmox nodes, running GlusterFS across all 3 nodes with replica 2 (the 1st server replicates its brick to the 2nd, the 2nd to the 3rd, and the 3rd to the 1st, so every node holds 2 bricks).
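
For reference, that pairing corresponds to a volume created roughly like this (reconstructed from the volume info further down; with replica 2 the bricks pair up in the order they are listed, so take this as a sketch rather than the exact command we ran):

> gluster volume create gluster-volume replica 2 \
    vcl01:/var/lib/vz/brick1 vcl02:/var/lib/vz/brick1 \
    vcl02:/var/lib/vz/brick2 vcl03:/rpool/glusterfs/brick2 \
    vcl03:/rpool/glusterfs/brick1 vcl01:/var/lib/vz/brick2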

proxmox-ve-2.6.32: 3.4-150 (running kernel: 2.6.32-37-pve)
pve-manager: 3.4-3 (running version: 3.4-3/2fc72fee)
pve-kernel-2.6.32-20-pve: 2.6.32-100
pve-kernel-2.6.32-19-pve: 2.6.32-96
pve-kernel-2.6.32-16-pve: 2.6.32-82
pve-kernel-2.6.32-37-pve: 2.6.32-150
pve-kernel-2.6.32-22-pve: 2.6.32-107
pve-kernel-2.6.32-17-pve: 2.6.32-83
pve-kernel-2.6.32-26-pve: 2.6.32-114
pve-kernel-2.6.32-18-pve: 2.6.32-88
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-16
qemu-server: 3.4-3
pve-firmware: 1.1-4
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-32
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.2-8
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

Volume Name: gluster-volume
Type: Distributed-Replicate
Volume ID: 24e2888b-d540-4228-8e17-6e3e8c452335
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: vcl01:/var/lib/vz/brick1
Brick2: vcl02:/var/lib/vz/brick1
Brick3: vcl02:/var/lib/vz/brick2
Brick4: vcl03:/rpool/glusterfs/brick2
Brick5: vcl03:/rpool/glusterfs/brick1
Brick6: vcl01:/var/lib/vz/brick2
Options Reconfigured:
server.allow-insecure: on
performance.write-behind: off
cluster.quorum-type: none
network.ping-timeout: 2
performance.md-cache-timeout: 1
performance.cache-max-file-size: 2MB
performance.write-behind-window-size: 4MB
performance.read-ahead: off
performance.quick-read: off
performance.cache-size: 512MB
performance.io-thread-count: 64

The VMs' disk configuration looks like this:
virtio0: gluster-volume:103/vm-103-disk-1.qcow2,format=qcow2,size=100G
virtio1: gluster-volume:103/vm-103-disk-2.qcow2,format=qcow2,size=250G
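
The storage itself is defined in /etc/pve/storage.cfg roughly like this (from memory, so treat it as a sketch; which node is listed as "server" is an assumption):

glusterfs: gluster-volume
        server vcl01
        volume gluster-volume
        content images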

> gluster vol heal gluster-volume info
Brick vcl01:/var/lib/vz/brick1/
Number of entries: 0

Brick vcl02:/var/lib/vz/brick1/
Number of entries: 0

Brick vcl02:/var/lib/vz/brick2/
Number of entries: 0

Brick vcl03:/rpool/glusterfs/brick2/
Number of entries: 0

Brick vcl03:/rpool/glusterfs/brick1/
Number of entries: 0

Brick vcl01:/var/lib/vz/brick2/
Number of entries: 0

Everything looks OK.
I have tried everything I could find in related threads, but I can't get the system back to a stable state. :(

Which logs could I check?
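
So far I have only been looking at the files under /var/log/glusterfs/ on the nodes (assuming that is still the default log location):

> ls /var/log/glusterfs/
bricks/  cli.log  glustershd.log  mnt-pve-gluster-volume.log  ...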

The cli.log shows this once a minute:
[2015-07-03 09:46:37.304631] I [socket.c:3561:socket_init] 0-glusterfs: SSL support is NOT enabled
[2015-07-03 09:46:37.304678] I [socket.c:3576:socket_init] 0-glusterfs: using system polling thread
[2015-07-03 09:46:37.376926] I [socket.c:3561:socket_init] 0-glusterfs: SSL support is NOT enabled
[2015-07-03 09:46:37.376940] I [socket.c:3576:socket_init] 0-glusterfs: using system polling thread
[2015-07-03 09:46:37.454291] I [socket.c:2238:socket_event_handler] 0-transport: disconnecting now
[2015-07-03 09:46:37.455901] I [cli-rpc-ops.c:518:gf_cli_get_volume_cbk] 0-cli: Received resp to get vol: 0
[2015-07-03 09:46:37.456049] I [cli-rpc-ops.c:779:gf_cli_get_volume_cbk] 0-cli: Returning: 0
[2015-07-03 09:46:37.456058] I [input.c:36:cli_batch] 0-: Exiting with: 0

But those messages have been appearing for a long time (how do I get rid of them?). They don't seem to be the cause of the problem...

glustershd.log:
[2015-07-03 07:32:24.883854] I [afr-self-heald.c:1690:afr_dir_exclusive_crawl] 0-gluster-volume-replicate-2: Another crawl is in progress for gluster-volume-client-5
[2015-07-03 09:05:42.055861] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2015-07-03 09:06:02.242792] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2015-07-03 09:06:02.243635] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2015-07-03 09:06:02.247140] I [glusterfsd-mgmt.c:1307:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2015-07-03 09:06:02.247181] I [glusterfsd-mgmt.c:1307:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2015-07-03 09:07:19.150924] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2015-07-03 09:07:35.981441] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2015-07-03 09:07:35.987741] I [glusterfsd-mgmt.c:1307:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2015-07-03 09:07:35.989568] I [glusterfsd-mgmt.c:1307:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2015-07-03 09:26:02.132316] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2015-07-03 09:26:08.649973] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2015-07-03 09:26:08.650823] I [glusterfsd-mgmt.c:1307:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2015-07-03 09:26:08.652070] I [glusterfsd-mgmt.c:1307:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing

Does anyone have an idea how to tune this to make it more stable, or how to track down the bottleneck/issues?
I can't find any tool that clearly tells me what is going wrong in our setup... :(
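The closest thing I have found is the built-in volume profiling, but I can't make much sense of its output yet (assuming I am even using it correctly):

> gluster volume profile gluster-volume start
> gluster volume profile gluster-volume info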

Best regards,
Juergen
 
Edit: The mnt-pve-gluster-volume.log.1 tells me the following:
Looks like that's the reason... Is it configuration-related, or is the switch about to die??

[2015-07-03 01:40:52.892169] C [client-handshake.c:127:rpc_client_ping_timer_expired] 2-gluster-volume-client-2: server 10.1.1.52:49153 has not responded in the last 2 seconds, disconnecting.
[2015-07-03 01:40:52.992051] E [rpc-clnt.c:369:saved_frames_unwind] (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x168) [0x7fa46b65af08] (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x7fa46b6591c3] (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7fa46b6590de]))) 2-gluster-volume-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2015-07-03 01:40:50.630002 (xid=0x78d3)
[2015-07-03 01:40:52.992087] W [client-rpc-fops.c:2774:client3_3_lookup_cbk] 2-gluster-volume-client-2: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2015-07-03 01:40:52.992359] E [rpc-clnt.c:369:saved_frames_unwind] (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x168) [0x7fa46b65af08] (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x7fa46b6591c3] (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7fa46b6590de]))) 2-gluster-volume-client-2: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2015-07-03 01:40:50.630016 (xid=0x78d4)
[2015-07-03 01:40:53.155691] I [socket.c:3060:socket_submit_request] 2-gluster-volume-client-2: not connected (priv->connected = 0)
[2015-07-03 01:40:53.155751] W [rpc-clnt.c:1542:rpc_clnt_submit] 2-gluster-volume-client-2: failed to submit rpc-request (XID: 0x78d5 Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (gluster-volume-client-2)
[2015-07-03 01:40:53.206605] W [client-handshake.c:276:client_ping_cbk] 2-gluster-volume-client-2: timer must have expired
[2015-07-03 01:40:53.206611] W [client-rpc-fops.c:2774:client3_3_lookup_cbk] 2-gluster-volume-client-2: remote operation failed: Transport endpoint is not connected. Path: /images (037820a2-d531-4472-a4de-f0d2559721b5)
[2015-07-03 01:40:53.246978] I [client.c:2229:client_rpc_notify] 2-gluster-volume-client-2: disconnected from 10.1.1.52:49153. Client process will keep trying to connect to glusterd until brick's port is available
[2015-07-03 01:40:53.409051] I [rpc-clnt.c:1729:rpc_clnt_reconfig] 2-gluster-volume-client-2: changing port to 49153 (from 0)
[2015-07-03 01:40:53.422796] I [client-handshake.c:1677:select_server_supported_programs] 2-gluster-volume-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-07-03 01:40:53.423266] I [client-handshake.c:1462:client_setvolume_cbk] 2-gluster-volume-client-2: Connected to 10.1.1.52:49153, attached to remote volume '/var/lib/vz/brick2'.
[2015-07-03 01:40:53.423281] I [client-handshake.c:1474:client_setvolume_cbk] 2-gluster-volume-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2015-07-03 01:40:53.423649] I [client-handshake.c:450:client_set_lk_version_cbk] 2-gluster-volume-client-2: Server lk version = 1
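
One thing I am starting to suspect after seeing this: I set network.ping-timeout to 2 seconds (the GlusterFS default is 42, as far as I know), so even a very short network hiccup would be enough for the client to drop the brick connection, which would fit the read-only filesystems. If that turns out to be the problem, raising it again should just be:

> gluster volume set gluster-volume network.ping-timeout 42

But I would still like to know whether the switch itself is acting up.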
 
