Hi Proxmox community,
Our problem probably has more to do with GlusterFS than with Proxmox, but we didn't want to miss the opportunity to ask this community for help as well.
We have a PVE cluster with two nodes. Each node has 4 HDDs, on top of which we run a GlusterFS volume so that we can live-migrate VMs.
A few days ago some disk files in the GlusterFS volume got into a split-brain condition. We were able to save the corresponding log files and resolve the split-brain condition, but we don't know how it happened. The GlusterFS log files are included below.
Maybe one of you can tell us what caused the problem.
Here is the network setup of the PVE cluster:
192.168.231.0/24 --> Server LAN (PVE GUI reachable on port 8006)
10.10.11.0/24 --> Cluster HA LAN
10.10.12.0/24 --> GlusterFS storage LAN
GlusterFS LAN:
.) PVEServer1 - 10.10.12.31
.) PVEServer2 - 10.10.12.32
What we've seen in the mnt-pve-GlusterVol01.log log file:
Server 1:
[2019-05-13 04:25:01.509716] I [MSGID: 100011] [glusterfsd.c:1396:reincarnate] 0-glusterfsd: Fetching the volume file from server...
[2019-05-13 09:47:48.277650] W [socket.c:590:__socket_rwv] 0-glusterfs: readv on 10.10.12.31:24007 failed (No data available)
[2019-05-13 09:47:48.277696] E [glusterfsd-mgmt.c:1908:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: 10.10.12.31 (No data available)
[2019-05-13 09:47:48.277704] I [glusterfsd-mgmt.c:1926:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
[2019-05-13 09:47:50.926948] W [glusterfsd.c:1327:cleanup_and_exit] (-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x7494) [0x7fe58a1eb494] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xf5) [0x55a8728115e5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x54) [0x55a872811444] ) 0-: received signum (15), shutting down
[2019-05-13 09:47:50.926977] I [fuse-bridge.c:5794:fini] 0-fuse: Unmounting '/mnt/pve/GlusterVol01'.
[2019-05-13 09:47:50.950381] I [fuse-bridge.c:5086:fuse_thread_proc] 0-fuse: unmounting /mnt/pve/GlusterVol01
[2019-05-13 09:49:43.823117] I [MSGID: 100030] [glusterfsd.c:2454:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.8.8 (args: /usr/sbin/glusterfs --volfile-server=10.10.12.31 --volfile-id=vol0 /mnt/pve/GlusterVol01)
[2019-05-13 09:49:43.828117] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-05-13 09:49:43.869885] W [MSGID: 108003] [afr.c:102:fix_quorum_options] 0-vol0-replicate-0: quorum-type none overriding quorum-count 1
[2019-05-13 09:49:43.871644] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2019-05-13 09:49:43.880208] I [MSGID: 114020] [client.c:2356:notify] 0-vol0-client-0: parent translators are ready, attempting connect on transport
[2019-05-13 09:49:43.880609] I [MSGID: 114020] [client.c:2356:notify] 0-vol0-client-1: parent translators are ready, attempting connect on transport
[2019-05-13 09:49:43.880816] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-vol0-client-0: changing port to 49155 (from 0)
Final graph:
+------------------------------------------------------------------------------+
1: volume vol0-client-0
2: type protocol/client
3: option ping-timeout 5
4: option remote-host pvetau01-storage
5: option remote-subvolume /var/lib/glusterfs/data01/brick1/vol0
6: option transport-type socket
7: option transport.address-family inet
8: option username 4ccc2234-fba7-40f9-b97b-26d3fa8ab401
9: option password cef1b5f5-b16c-4a3c-b49f-f814901a3252
10: option filter-O_DIRECT enable
11: option send-gids true
12: end-volume
13:
14: volume vol0-client-1
15: type protocol/client
16: option ping-timeout 5
17: option remote-host pvetau02-storage
18: option remote-subvolume /var/lib/glusterfs/data01/brick1/vol0
19: option transport-type socket
20: option transport.address-family inet
21: option username 4ccc2234-fba7-40f9-b97b-26d3fa8ab401
22: option password cef1b5f5-b16c-4a3c-b49f-f814901a3252
23: option filter-O_DIRECT enable
24: option send-gids true
25: end-volume
26:
27: volume vol0-replicate-0
28: type cluster/replicate
29: option eager-lock enable
30: option quorum-count 1
31: subvolumes vol0-client-0 vol0-client-1
32: end-volume
33:
34: volume vol0-dht
35: type cluster/distribute
36: option lock-migration off
37: subvolumes vol0-replicate-0
38: end-volume
39:
40: volume vol0-write-behind
41: type performance/write-behind
42: subvolumes vol0-dht
43: end-volume
44:
45: volume vol0-readdir-ahead
46: type performance/readdir-ahead
47: subvolumes vol0-write-behind
48: end-volume
49:
50: volume vol0-open-behind
51: type performance/open-behind
52: subvolumes vol0-readdir-ahead
53: end-volume
54:
55: volume vol0
56: type debug/io-stats
57: option log-level INFO
58: option latency-measurement off
59: option count-fop-hits off
60: subvolumes vol0-open-behind
61: end-volume
62:
63: volume meta-autoload
64: type meta
65: subvolumes vol0
66: end-volume
67:
+------------------------------------------------------------------------------+
[2019-05-13 09:49:43.881243] I [MSGID: 114057] [client-handshake.c:1447:select_server_supported_programs] 0-vol0-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2019-05-13 09:49:43.881434] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-vol0-client-1: changing port to 49154 (from 0)
[2019-05-13 09:49:43.881906] I [MSGID: 114057] [client-handshake.c:1447:select_server_supported_programs] 0-vol0-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2019-05-13 09:49:43.882213] I [MSGID: 114046] [client-handshake.c:1223:client_setvolume_cbk] 0-vol0-client-1: Connected to vol0-client-1, attached to remote volume '/var/lib/glusterfs/data01/brick1/vol0'.
[2019-05-13 09:49:43.882222] I [MSGID: 114047] [client-handshake.c:1234:client_setvolume_cbk] 0-vol0-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2019-05-13 09:49:43.882249] I [MSGID: 108005] [afr-common.c:4382:afr_notify] 0-vol0-replicate-0: Subvolume 'vol0-client-1' came back up; going online.
[2019-05-13 09:49:43.882360] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-vol0-client-1: Server lk version = 1
[2019-05-13 09:49:43.886625] I [MSGID: 114046] [client-handshake.c:1223:client_setvolume_cbk] 0-vol0-client-0: Connected to vol0-client-0, attached to remote volume '/var/lib/glusterfs/data01/brick1/vol0'.
[2019-05-13 09:49:43.886633] I [MSGID: 114047] [client-handshake.c:1234:client_setvolume_cbk] 0-vol0-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2019-05-13 09:49:43.890995] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-vol0-client-0: Server lk version = 1
[2019-05-13 09:49:43.891049] I [fuse-bridge.c:4153:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.26
[2019-05-13 09:49:43.891067] I [fuse-bridge.c:4838:fuse_graph_sync] 0-fuse: switched to graph 0
[2019-05-13 09:49:43.891625] I [MSGID: 108031] [afr-common.c:2152:afr_local_discovery_cbk] 0-vol0-replicate-0: selecting local read_child vol0-client-0
[2019-05-13 10:20:38.998246] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-vol0-client-1: server 10.10.12.32:49154 has not responded in the last 5 seconds, disconnecting.
[2019-05-13 10:20:38.998657] E [rpc-clnt.c:365:saved_frames_unwind] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x1a3)[0x7f69df41fe83] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_unwind+0x1d1)[0x7f69df1e7b61] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f69df1e7c7e] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x89)[0x7f69df1e92e9] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7f69df1e9bb4] ))))) 0-vol0-client-1: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2019-05-13 10:20:33.237111 (xid=0x492)
[2019-05-13 10:20:38.998681] W [MSGID: 114031] [client-rpc-fops.c:2933:client3_3_lookup_cbk] 0-vol0-client-1: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected]
[2019-05-13 10:20:38.998829] E [rpc-clnt.c:365:saved_frames_unwind] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x1a3)[0x7f69df41fe83] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_unwind+0x1d1)[0x7f69df1e7b61] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f69df1e7c7e] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x89)[0x7f69df1e92e9] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7f69df1e9bb4] ))))) 0-vol0-client-1: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2019-05-13 10:20:33.237115 (xid=0x493)
[2019-05-13 10:20:38.998843] W [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk] 0-vol0-client-1: socket disconnected
[2019-05-13 10:20:38.998854] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-vol0-client-1: disconnected from vol0-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2019-05-13 10:20:43.355917] I [MSGID: 108031] [afr-common.c:2152:afr_local_discovery_cbk] 0-vol0-replicate-0: selecting local read_child vol0-client-0
[2019-05-13 10:21:20.850030] E [socket.c:2309:socket_connect_finish] 0-vol0-client-1: connection to 10.10.12.32:24007 failed (No route to host)
[2019-05-13 10:22:07.026615] E [MSGID: 114058] [client-handshake.c:1534:client_query_portmap_cbk] 0-vol0-client-1: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2019-05-13 10:22:07.026663] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-vol0-client-1: disconnected from vol0-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2019-05-13 10:22:10.010421] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-vol0-client-1: changing port to 49154 (from 0)
[2019-05-13 10:22:10.011105] I [MSGID: 114057] [client-handshake.c:1447:select_server_supported_programs] 0-vol0-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2019-05-13 10:22:10.011558] I [MSGID: 114046] [client-handshake.c:1223:client_setvolume_cbk] 0-vol0-client-1: Connected to vol0-client-1, attached to remote volume '/var/lib/glusterfs/data01/brick1/vol0'.
[2019-05-13 10:22:10.011609] I [MSGID: 114047] [client-handshake.c:1234:client_setvolume_cbk] 0-vol0-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2019-05-13 10:22:10.011622] I [MSGID: 114042] [client-handshake.c:1054:client_post_handshake] 0-vol0-client-1: 2 fds open - Delaying child_up until they are re-opened
[2019-05-13 10:22:10.032258] I [MSGID: 114041] [client-handshake.c:676:client_child_up_reopen_done] 0-vol0-client-1: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2019-05-13 10:22:10.032492] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-vol0-client-1: Server lk version = 1
[2019-05-13 10:22:13.790586] I [MSGID: 108031] [afr-common.c:2152:afr_local_discovery_cbk] 0-vol0-replicate-0: selecting local read_child vol0-client-0
[2019-05-13 11:12:57.300347] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error]
[2019-05-13 11:12:57.305284] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 4 for gfid 5f9490a8-ec56-410e-9c70-653e0da77174. (Possible split-brain)
[2019-05-13 11:12:57.305712] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing FGETXATTR on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error]
[2019-05-13 11:12:57.306277] W [MSGID: 108027] [afr-common.c:2491:afr_discover_done] 0-vol0-replicate-0: no read subvols for (null)
[2019-05-13 11:12:57.306938] I [MSGID: 114024] [client-helpers.c:99:this_fd_set_ctx] 0-vol0-client-0: /images/103/vm-103-disk-0.qcow2 (5f9490a8-ec56-410e-9c70-653e0da77174): trying duplicate remote fd set.
[2019-05-13 11:12:57.306973] I [MSGID: 114024] [client-helpers.c:99:this_fd_set_ctx] 0-vol0-client-1: /images/103/vm-103-disk-0.qcow2 (5f9490a8-ec56-410e-9c70-653e0da77174): trying duplicate remote fd set.
[2019-05-13 11:12:57.310052] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 2698: READ => -1 gfid=5f9490a8-ec56-410e-9c70-653e0da77174 fd=0x7f69d1cba184 (Input/output error)
[2019-05-13 11:12:57.310137] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 2697: READ => -1 gfid=5f9490a8-ec56-410e-9c70-653e0da77174 fd=0x7f69d1cba184 (Input/output error)
[2019-05-13 11:12:57.311543] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 2699: READ => -1 gfid=5f9490a8-ec56-410e-9c70-653e0da77174 fd=0x7f69d1cba184 (Input/output error)
The message "E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing FGETXATTR on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error]" repeated 2 times between [2019-05-13 11:12:57.305712] and [2019-05-13 11:12:57.310816]
The message "W [MSGID: 108027] [afr-common.c:2491:afr_discover_done] 0-vol0-replicate-0: no read subvols for (null)" repeated 2 times between [2019-05-13 11:12:57.306277] and [2019-05-13 11:12:57.311184]
The message "W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 4 for gfid 5f9490a8-ec56-410e-9c70-653e0da77174. (Possible split-brain)" repeated 6 times between [2019-05-13 11:12:57.305284] and [2019-05-13 11:12:57.311274]
The message "E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error]" repeated 5 times between [2019-05-13 11:12:57.300347] and [2019-05-13 11:12:57.311531]
Server 2:
[2019-05-13 04:25:01.338790] I [MSGID: 100011] [glusterfsd.c:1396:reincarnate] 0-glusterfsd: Fetching the volume file from server...
[2019-05-13 09:47:59.443328] E [socket.c:2309:socket_connect_finish] 0-glusterfs: connection to 10.10.12.31:24007 failed (Connection refused)
[2019-05-13 09:48:17.426580] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-vol0-client-0: server 10.10.12.31:49155 has not responded in the last 5 seconds, disconnecting.
[2019-05-13 09:48:17.426872] E [rpc-clnt.c:365:saved_frames_unwind] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x1a3)[0x7efebd3f9e83] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_unwind+0x1d1)[0x7efebd1c1b61] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7efebd1c1c7e] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x89)[0x7efebd1c32e9] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7efebd1c3bb4] ))))) 0-vol0-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2019-05-13 09:48:12.180579 (xid=0x5663a4)
[2019-05-13 09:48:17.426899] W [MSGID: 114031] [client-rpc-fops.c:2933:client3_3_lookup_cbk] 0-vol0-client-0: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected]
[2019-05-13 09:48:17.427056] E [rpc-clnt.c:365:saved_frames_unwind] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x1a3)[0x7efebd3f9e83] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_unwind+0x1d1)[0x7efebd1c1b61] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7efebd1c1c7e] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x89)[0x7efebd1c32e9] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7efebd1c3bb4] ))))) 0-vol0-client-0: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2019-05-13 09:48:12.180591 (xid=0x5663a5)
[2019-05-13 09:48:17.427067] W [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk] 0-vol0-client-0: socket disconnected
[2019-05-13 09:48:17.427077] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-vol0-client-0: disconnected from vol0-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2019-05-13 09:48:21.479100] I [MSGID: 108031] [afr-common.c:2152:afr_local_discovery_cbk] 0-vol0-replicate-0: selecting local read_child vol0-client-1
[2019-05-13 09:48:59.219302] E [socket.c:2309:socket_connect_finish] 0-vol0-client-0: connection to 10.10.12.31:24007 failed (No route to host)
[2019-05-13 09:49:41.468469] I [glusterfsd-mgmt.c:1600:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2019-05-13 09:49:42.505174] E [MSGID: 114058] [client-handshake.c:1534:client_query_portmap_cbk] 0-vol0-client-0: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2019-05-13 09:49:42.505225] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-vol0-client-0: disconnected from vol0-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2019-05-13 09:49:45.442003] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-vol0-client-0: changing port to 49155 (from 0)
[2019-05-13 09:49:45.442523] I [MSGID: 114057] [client-handshake.c:1447:select_server_supported_programs] 0-vol0-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2019-05-13 09:49:45.442802] I [MSGID: 114046] [client-handshake.c:1223:client_setvolume_cbk] 0-vol0-client-0: Connected to vol0-client-0, attached to remote volume '/var/lib/glusterfs/data01/brick1/vol0'.
[2019-05-13 09:49:45.442812] I [MSGID: 114047] [client-handshake.c:1234:client_setvolume_cbk] 0-vol0-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2019-05-13 09:49:45.442820] I [MSGID: 114042] [client-handshake.c:1054:client_post_handshake] 0-vol0-client-0: 2 fds open - Delaying child_up until they are re-opened
[2019-05-13 09:49:45.443244] I [MSGID: 114041] [client-handshake.c:676:client_child_up_reopen_done] 0-vol0-client-0: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2019-05-13 09:49:45.443353] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-vol0-client-0: Server lk version = 1
[2019-05-13 09:49:49.622255] I [MSGID: 108031] [afr-common.c:2152:afr_local_discovery_cbk] 0-vol0-replicate-0: selecting local read_child vol0-client-1
[2019-05-13 10:20:06.060045] W [glusterfsd.c:1327:cleanup_and_exit] (-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x7494) [0x7efebc254494] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xf5) [0x55dba7a3b5e5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x54) [0x55dba7a3b444] ) 0-: received signum (15), shutting down
[2019-05-13 10:20:06.068969] I [fuse-bridge.c:5794:fini] 0-fuse: Unmounting '/mnt/pve/GlusterVol01'.
[2019-05-13 10:20:06.103235] I [fuse-bridge.c:5086:fuse_thread_proc] 0-fuse: unmounting /mnt/pve/GlusterVol01
[2019-05-13 10:22:08.842734] I [MSGID: 100030] [glusterfsd.c:2454:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.8.8 (args: /usr/sbin/glusterfs --volfile-server=10.10.12.31 --volfile-id=vol0 /mnt/pve/GlusterVol01)
[2019-05-13 10:22:08.853935] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-05-13 10:22:08.944855] W [MSGID: 108003] [afr.c:102:fix_quorum_options] 0-vol0-replicate-0: quorum-type none overriding quorum-count 1
[2019-05-13 10:22:08.946502] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2019-05-13 10:22:08.972020] I [MSGID: 114020] [client.c:2356:notify] 0-vol0-client-0: parent translators are ready, attempting connect on transport
[2019-05-13 10:22:08.972395] I [MSGID: 114020] [client.c:2356:notify] 0-vol0-client-1: parent translators are ready, attempting connect on transport
[2019-05-13 10:22:08.972832] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-vol0-client-0: changing port to 49155 (from 0)
[2019-05-13 10:22:08.973142] I [MSGID: 114057] [client-handshake.c:1447:select_server_supported_programs] 0-vol0-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2019-05-13 10:22:08.973231] I [MSGID: 114057] [client-handshake.c:1447:select_server_supported_programs] 0-vol0-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2019-05-13 10:22:08.973544] I [MSGID: 114046] [client-handshake.c:1223:client_setvolume_cbk] 0-vol0-client-1: Connected to vol0-client-1, attached to remote volume '/var/lib/glusterfs/data01/brick1/vol0'.
[2019-05-13 10:22:08.973544] I [MSGID: 114046] [client-handshake.c:1223:client_setvolume_cbk] 0-vol0-client-0: Connected to vol0-client-0, attached to remote volume '/var/lib/glusterfs/data01/brick1/vol0'.
[2019-05-13 10:22:08.973566] I [MSGID: 114047] [client-handshake.c:1234:client_setvolume_cbk] 0-vol0-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2019-05-13 10:22:08.973567] I [MSGID: 114047] [client-handshake.c:1234:client_setvolume_cbk] 0-vol0-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2019-05-13 10:22:08.973616] I [MSGID: 108005] [afr-common.c:4382:afr_notify] 0-vol0-replicate-0: Subvolume 'vol0-client-1' came back up; going online.
[2019-05-13 10:22:08.973639] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-vol0-client-1: Server lk version = 1
[2019-05-13 10:22:08.977940] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-vol0-client-0: Server lk version = 1
[2019-05-13 10:22:08.978055] I [fuse-bridge.c:4153:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.26
[2019-05-13 10:22:08.978075] I [fuse-bridge.c:4838:fuse_graph_sync] 0-fuse: switched to graph 0
[2019-05-13 10:22:08.978603] I [MSGID: 108031] [afr-common.c:2152:afr_local_discovery_cbk] 0-vol0-replicate-0: selecting local read_child vol0-client-1
[2019-05-13 10:53:46.573894] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error]
[2019-05-13 10:53:46.573992] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 5f9490a8-ec56-410e-9c70-653e0da77174. (Possible split-brain)
[2019-05-13 10:53:46.574253] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing FGETXATTR on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error]
[2019-05-13 10:53:46.574949] W [MSGID: 108027] [afr-common.c:2491:afr_discover_done] 0-vol0-replicate-0: no read subvols for (null)
[2019-05-13 10:53:46.575526] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 1380: READ => -1 gfid=5f9490a8-ec56-410e-9c70-653e0da77174 fd=0x7f649c00e06c (Input/output error)
[2019-05-13 10:53:46.577820] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 1381: READ => -1 gfid=5f9490a8-ec56-410e-9c70-653e0da77174 fd=0x7f649c00e06c (Input/output error)
[2019-05-13 10:53:46.596838] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4: split-brain observed. [Input/output error]
[2019-05-13 10:53:46.597759] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4. (Possible split-brain)
[2019-05-13 10:53:46.598916] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing FGETXATTR on gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4: split-brain observed. [Input/output error]
The message "W [MSGID: 108027] [afr-common.c:2491:afr_discover_done] 0-vol0-replicate-0: no read subvols for (null)" repeated 2 times between [2019-05-13 10:53:46.574949] and [2019-05-13 10:53:46.599257]
[2019-05-13 10:53:46.599525] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4. (Possible split-brain)
[2019-05-13 10:53:46.599797] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4: split-brain observed. [Input/output error]
[2019-05-13 10:53:46.599825] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 1389: READ => -1 gfid=609bb8be-3ae8-470d-9f88-2b65095fbed4 fd=0x7f649c00e06c (Input/output error)
[2019-05-13 10:53:46.599876] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4. (Possible split-brain)
[2019-05-13 10:53:46.600149] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4: split-brain observed. [Input/output error]
[2019-05-13 10:53:46.600193] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4. (Possible split-brain)
[2019-05-13 10:53:46.600417] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing FGETXATTR on gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4: split-brain observed. [Input/output error]
[2019-05-13 10:53:46.600775] W [MSGID: 108027] [afr-common.c:2491:afr_discover_done] 0-vol0-replicate-0: no read subvols for (null)
[2019-05-13 10:53:46.601071] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4. (Possible split-brain)
[2019-05-13 10:53:46.601537] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4: split-brain observed. [Input/output error]
[2019-05-13 10:53:46.601577] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 1390: READ => -1 gfid=609bb8be-3ae8-470d-9f88-2b65095fbed4 fd=0x7f649c00e06c (Input/output error)
[2019-05-13 10:53:46.619830] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 79423c92-0338-4dc9-bafc-091172e8d845: split-brain observed. [Input/output error]
[2019-05-13 10:53:46.620701] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 79423c92-0338-4dc9-bafc-091172e8d845. (Possible split-brain)
[2019-05-13 10:53:46.621098] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing FGETXATTR on gfid 79423c92-0338-4dc9-bafc-091172e8d845: split-brain observed. [Input/output error]
[2019-05-13 10:53:46.621455] W [MSGID: 108027] [afr-common.c:2491:afr_discover_done] 0-vol0-replicate-0: no read subvols for (null)
[2019-05-13 10:53:46.621732] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 79423c92-0338-4dc9-bafc-091172e8d845. (Possible split-brain)
[2019-05-13 10:53:46.623509] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 79423c92-0338-4dc9-bafc-091172e8d845: split-brain observed. [Input/output error]
[2019-05-13 10:53:46.624891] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing FGETXATTR on gfid 79423c92-0338-4dc9-bafc-091172e8d845: split-brain observed. [Input/output error]
[2019-05-13 10:53:46.625212] W [MSGID: 108027] [afr-common.c:2491:afr_discover_done] 0-vol0-replicate-0: no read subvols for (null)
[2019-05-13 10:53:46.625314] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 79423c92-0338-4dc9-bafc-091172e8d845. (Possible split-brain)
[2019-05-13 10:53:46.625721] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 79423c92-0338-4dc9-bafc-091172e8d845: split-brain observed. [Input/output error]
[2019-05-13 10:53:46.625754] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 1399: READ => -1 gfid=79423c92-0338-4dc9-bafc-091172e8d845 fd=0x7f649c00e06c (Input/output error)
[2019-05-13 10:53:46.576286] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing FGETXATTR on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error]
[2019-05-13 10:56:28.176786] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error]
[2019-05-13 10:56:28.177684] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 5f9490a8-ec56-410e-9c70-653e0da77174. (Possible split-brain)
[2019-05-13 10:56:28.178782] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing FGETXATTR on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error]
[2019-05-13 10:56:28.179128] W [MSGID: 108027] [afr-common.c:2491:afr_discover_done] 0-vol0-replicate-0: no read subvols for (null)
[2019-05-13 10:56:28.180634] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 1533: READ => -1 gfid=5f9490a8-ec56-410e-9c70-653e0da77174 fd=0x7f649c00e06c (Input/output error)
[2019-05-13 10:56:28.179439] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 5f9490a8-ec56-410e-9c70-653e0da77174. (Possible split-brain)
[2019-05-13 10:56:28.180620] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error]
[2019-05-13 10:59:25.278595] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error]
[2019-05-13 10:59:25.279517] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 5f9490a8-ec56-410e-9c70-653e0da77174. (Possible split-brain)
[2019-05-13 10:59:25.280605] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing FGETXATTR on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error]
[2019-05-13 10:59:25.281649] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 1685: READ => -1 gfid=5f9490a8-ec56-410e-9c70-653e0da77174 fd=0x7f649c00e06c (Input/output error)
[2019-05-13 10:59:25.281250] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 5f9490a8-ec56-410e-9c70-653e0da77174. (Possible split-brain)
-------------------------------------------------
What we can't explain is why server 1 does the following:
[2019-05-13 09:47:48.277650] W [socket.c:590:__socket_rwv] 0-glusterfs: readv on 10.10.12.31:24007 failed (No data available)
[2019-05-13 09:47:48.277696] E [glusterfsd-mgmt.c:1908:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: 10.10.12.31 (No data available)
[2019-05-13 09:47:48.277704] I [glusterfsd-mgmt.c:1926:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
After that, the volume is unmounted and then mounted again on another port.
Server 2 subsequently behaves in exactly the same way, which results in a split-brain condition for the disk files of the VMs.
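To compare the two logs side by side, the key events can be merged into a single timeline. The following is only a minimal sketch in Python: the function name `key_events`, the keyword list, and the sample lines (copied from the logs above, with the call-stack part of the shutdown lines shortened to `(...)`) are assumptions chosen for illustration.

```python
import re
from datetime import datetime

# Matches the start of a GlusterFS client log line: "[timestamp] LEVEL rest".
LINE_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6})\] ([A-Z]) (.*)$")

# Hypothetical keyword list: the substrings we consider "key events".
KEYWORDS = ("received signum", "has not responded", "split-brain observed")


def key_events(server, lines):
    """Return (timestamp, server, level, keyword) for every matching line."""
    events = []
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # e.g. volfile graph lines or "The message ... repeated" lines
        stamp, level, message = match.groups()
        for keyword in KEYWORDS:
            if keyword in message:
                when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")
                events.append((when, server, level, keyword))
                break
    return events


# Shortened sample lines taken from the two logs above.
server1 = [
    "[2019-05-13 09:47:50.926948] W [glusterfsd.c:1327:cleanup_and_exit] (...) 0-: received signum (15), shutting down",
    "[2019-05-13 10:20:38.998246] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-vol0-client-1: server 10.10.12.32:49154 has not responded in the last 5 seconds, disconnecting.",
]
server2 = [
    "[2019-05-13 09:48:17.426580] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-vol0-client-0: server 10.10.12.31:49155 has not responded in the last 5 seconds, disconnecting.",
    "[2019-05-13 10:20:06.060045] W [glusterfsd.c:1327:cleanup_and_exit] (...) 0-: received signum (15), shutting down",
]

# Merge both servers and sort by timestamp.
timeline = sorted(key_events("server1", server1) + key_events("server2", server2))

for when, server, level, keyword in timeline:
    print(when.isoformat(sep=" "), server, level, keyword)
```

With these four sample lines, the merged order is: shutdown on server 1, ping timeout seen by server 2, shutdown on server 2, ping timeout seen by server 1. Feeding in the complete log files instead of the sample lists would also place the split-brain errors on the same timeline.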
We would be glad if someone could explain this behavior to us.
BR
René
[2019-05-13 09:49:43.828117] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-05-13 09:49:43.869885] W [MSGID: 108003] [afr.c:102:fix_quorum_options] 0-vol0-replicate-0: quorum-type none overriding quorum-count 1
[2019-05-13 09:49:43.871644] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2019-05-13 09:49:43.880208] I [MSGID: 114020] [client.c:2356:notify] 0-vol0-client-0: parent translators are ready, attempting connect on transport
[2019-05-13 09:49:43.880609] I [MSGID: 114020] [client.c:2356:notify] 0-vol0-client-1: parent translators are ready, attempting connect on transport
[2019-05-13 09:49:43.880816] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-vol0-client-0: changing port to 49155 (from 0)
Final graph:
+------------------------------------------------------------------------------+
1: volume vol0-client-0
2: type protocol/client
3: option ping-timeout 5
4: option remote-host pvetau01-storage
5: option remote-subvolume /var/lib/glusterfs/data01/brick1/vol0
6: option transport-type socket
7: option transport.address-family inet
8: option username 4ccc2234-fba7-40f9-b97b-26d3fa8ab401
9: option password cef1b5f5-b16c-4a3c-b49f-f814901a3252
10: option filter-O_DIRECT enable
11: option send-gids true
12: end-volume
13:
14: volume vol0-client-1
15: type protocol/client
16: option ping-timeout 5
17: option remote-host pvetau02-storage
18: option remote-subvolume /var/lib/glusterfs/data01/brick1/vol0
19: option transport-type socket
20: option transport.address-family inet
21: option username 4ccc2234-fba7-40f9-b97b-26d3fa8ab401
22: option password cef1b5f5-b16c-4a3c-b49f-f814901a3252
23: option filter-O_DIRECT enable
24: option send-gids true
25: end-volume
26:
27: volume vol0-replicate-0
28: type cluster/replicate
29: option eager-lock enable
30: option quorum-count 1
31: subvolumes vol0-client-0 vol0-client-1
32: end-volume
33:
34: volume vol0-dht
35: type cluster/distribute
36: option lock-migration off
37: subvolumes vol0-replicate-0
38: end-volume
39:
40: volume vol0-write-behind
41: type performance/write-behind
42: subvolumes vol0-dht
43: end-volume
44:
45: volume vol0-readdir-ahead
46: type performance/readdir-ahead
47: subvolumes vol0-write-behind
48: end-volume
49:
50: volume vol0-open-behind
51: type performance/open-behind
52: subvolumes vol0-readdir-ahead
53: end-volume
54:
55: volume vol0
56: type debug/io-stats
57: option log-level INFO
58: option latency-measurement off
59: option count-fop-hits off
60: subvolumes vol0-open-behind
61: end-volume
62:
63: volume meta-autoload
64: type meta
65: subvolumes vol0
66: end-volume
67:
+------------------------------------------------------------------------------+
[2019-05-13 09:49:43.881243] I [MSGID: 114057] [client-handshake.c:1447:select_server_supported_programs] 0-vol0-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2019-05-13 09:49:43.881434] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-vol0-client-1: changing port to 49154 (from 0)
[2019-05-13 09:49:43.881906] I [MSGID: 114057] [client-handshake.c:1447:select_server_supported_programs] 0-vol0-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2019-05-13 09:49:43.882213] I [MSGID: 114046] [client-handshake.c:1223:client_setvolume_cbk] 0-vol0-client-1: Connected to vol0-client-1, attached to remote volume '/var/lib/glusterfs/data01/brick1/vol0'.
[2019-05-13 09:49:43.882222] I [MSGID: 114047] [client-handshake.c:1234:client_setvolume_cbk] 0-vol0-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2019-05-13 09:49:43.882249] I [MSGID: 108005] [afr-common.c:4382:afr_notify] 0-vol0-replicate-0: Subvolume 'vol0-client-1' came back up; going online.
[2019-05-13 09:49:43.882360] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-vol0-client-1: Server lk version = 1
[2019-05-13 09:49:43.886625] I [MSGID: 114046] [client-handshake.c:1223:client_setvolume_cbk] 0-vol0-client-0: Connected to vol0-client-0, attached to remote volume '/var/lib/glusterfs/data01/brick1/vol0'.
[2019-05-13 09:49:43.886633] I [MSGID: 114047] [client-handshake.c:1234:client_setvolume_cbk] 0-vol0-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2019-05-13 09:49:43.890995] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-vol0-client-0: Server lk version = 1
[2019-05-13 09:49:43.891049] I [fuse-bridge.c:4153:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.26
[2019-05-13 09:49:43.891067] I [fuse-bridge.c:4838:fuse_graph_sync] 0-fuse: switched to graph 0
[2019-05-13 09:49:43.891625] I [MSGID: 108031] [afr-common.c:2152:afr_local_discovery_cbk] 0-vol0-replicate-0: selecting local read_child vol0-client-0
[2019-05-13 10:20:38.998246] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-vol0-client-1: server 10.10.12.32:49154 has not responded in the last 5 seconds, disconnecting.
[2019-05-13 10:20:38.998657] E [rpc-clnt.c:365:saved_frames_unwind] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x1a3)[0x7f69df41fe83] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_unwind+0x1d1)[0x7f69df1e7b61] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f69df1e7c7e] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x89)[0x7f69df1e92e9] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7f69df1e9bb4] ))))) 0-vol0-client-1: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2019-05-13 10:20:33.237111 (xid=0x492)
[2019-05-13 10:20:38.998681] W [MSGID: 114031] [client-rpc-fops.c:2933:client3_3_lookup_cbk] 0-vol0-client-1: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected]
[2019-05-13 10:20:38.998829] E [rpc-clnt.c:365:saved_frames_unwind] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x1a3)[0x7f69df41fe83] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_unwind+0x1d1)[0x7f69df1e7b61] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f69df1e7c7e] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x89)[0x7f69df1e92e9] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7f69df1e9bb4] ))))) 0-vol0-client-1: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2019-05-13 10:20:33.237115 (xid=0x493)
[2019-05-13 10:20:38.998843] W [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk] 0-vol0-client-1: socket disconnected
[2019-05-13 10:20:38.998854] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-vol0-client-1: disconnected from vol0-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2019-05-13 10:20:43.355917] I [MSGID: 108031] [afr-common.c:2152:afr_local_discovery_cbk] 0-vol0-replicate-0: selecting local read_child vol0-client-0
[2019-05-13 10:21:20.850030] E [socket.c:2309:socket_connect_finish] 0-vol0-client-1: connection to 10.10.12.32:24007 failed (No route to host)
[2019-05-13 10:22:07.026615] E [MSGID: 114058] [client-handshake.c:1534:client_query_portmap_cbk] 0-vol0-client-1: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2019-05-13 10:22:07.026663] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-vol0-client-1: disconnected from vol0-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2019-05-13 10:22:10.010421] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-vol0-client-1: changing port to 49154 (from 0)
[2019-05-13 10:22:10.011105] I [MSGID: 114057] [client-handshake.c:1447:select_server_supported_programs] 0-vol0-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2019-05-13 10:22:10.011558] I [MSGID: 114046] [client-handshake.c:1223:client_setvolume_cbk] 0-vol0-client-1: Connected to vol0-client-1, attached to remote volume '/var/lib/glusterfs/data01/brick1/vol0'.
[2019-05-13 10:22:10.011609] I [MSGID: 114047] [client-handshake.c:1234:client_setvolume_cbk] 0-vol0-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2019-05-13 10:22:10.011622] I [MSGID: 114042] [client-handshake.c:1054:client_post_handshake] 0-vol0-client-1: 2 fds open - Delaying child_up until they are re-opened
[2019-05-13 10:22:10.032258] I [MSGID: 114041] [client-handshake.c:676:client_child_up_reopen_done] 0-vol0-client-1: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2019-05-13 10:22:10.032492] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-vol0-client-1: Server lk version = 1
[2019-05-13 10:22:13.790586] I [MSGID: 108031] [afr-common.c:2152:afr_local_discovery_cbk] 0-vol0-replicate-0: selecting local read_child vol0-client-0
[2019-05-13 11:12:57.300347] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error]
[2019-05-13 11:12:57.305284] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 4 for gfid 5f9490a8-ec56-410e-9c70-653e0da77174. (Possible split-brain)
[2019-05-13 11:12:57.305712] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing FGETXATTR on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error]
[2019-05-13 11:12:57.306277] W [MSGID: 108027] [afr-common.c:2491:afr_discover_done] 0-vol0-replicate-0: no read subvols for (null)
[2019-05-13 11:12:57.306938] I [MSGID: 114024] [client-helpers.c:99:this_fd_set_ctx] 0-vol0-client-0: /images/103/vm-103-disk-0.qcow2 (5f9490a8-ec56-410e-9c70-653e0da77174): trying duplicate remote fd set.
[2019-05-13 11:12:57.306973] I [MSGID: 114024] [client-helpers.c:99:this_fd_set_ctx] 0-vol0-client-1: /images/103/vm-103-disk-0.qcow2 (5f9490a8-ec56-410e-9c70-653e0da77174): trying duplicate remote fd set.
[2019-05-13 11:12:57.310052] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 2698: READ => -1 gfid=5f9490a8-ec56-410e-9c70-653e0da77174 fd=0x7f69d1cba184 (Input/output error)
[2019-05-13 11:12:57.310137] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 2697: READ => -1 gfid=5f9490a8-ec56-410e-9c70-653e0da77174 fd=0x7f69d1cba184 (Input/output error)
[2019-05-13 11:12:57.311543] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 2699: READ => -1 gfid=5f9490a8-ec56-410e-9c70-653e0da77174 fd=0x7f69d1cba184 (Input/output error)
The message "E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing FGETXATTR on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error]" repeated 2 times between [2019-05-13 11:12:57.305712] and [2019-05-13 11:12:57.310816]
The message "W [MSGID: 108027] [afr-common.c:2491:afr_discover_done] 0-vol0-replicate-0: no read subvols for (null)" repeated 2 times between [2019-05-13 11:12:57.306277] and [2019-05-13 11:12:57.311184]
The message "W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 4 for gfid 5f9490a8-ec56-410e-9c70-653e0da77174. (Possible split-brain)" repeated 6 times between [2019-05-13 11:12:57.305284] and [2019-05-13 11:12:57.311274]
The message "E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error]" repeated 5 times between [2019-05-13 11:12:57.300347] and [2019-05-13 11:12:57.311531]
Server 2:
[2019-05-13 04:25:01.338790] I [MSGID: 100011] [glusterfsd.c:1396:reincarnate] 0-glusterfsd: Fetching the volume file from server...
[2019-05-13 09:47:59.443328] E [socket.c:2309:socket_connect_finish] 0-glusterfs: connection to 10.10.12.31:24007 failed (Connection refused)
[2019-05-13 09:48:17.426580] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-vol0-client-0: server 10.10.12.31:49155 has not responded in the last 5 seconds, disconnecting.
[2019-05-13 09:48:17.426872] E [rpc-clnt.c:365:saved_frames_unwind] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x1a3)[0x7efebd3f9e83] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_unwind+0x1d1)[0x7efebd1c1b61] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7efebd1c1c7e] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x89)[0x7efebd1c32e9] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7efebd1c3bb4] ))))) 0-vol0-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2019-05-13 09:48:12.180579 (xid=0x5663a4)
[2019-05-13 09:48:17.426899] W [MSGID: 114031] [client-rpc-fops.c:2933:client3_3_lookup_cbk] 0-vol0-client-0: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected]
[2019-05-13 09:48:17.427056] E [rpc-clnt.c:365:saved_frames_unwind] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x1a3)[0x7efebd3f9e83] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_unwind+0x1d1)[0x7efebd1c1b61] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7efebd1c1c7e] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x89)[0x7efebd1c32e9] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7efebd1c3bb4] ))))) 0-vol0-client-0: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2019-05-13 09:48:12.180591 (xid=0x5663a5)
[2019-05-13 09:48:17.427067] W [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk] 0-vol0-client-0: socket disconnected
[2019-05-13 09:48:17.427077] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-vol0-client-0: disconnected from vol0-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2019-05-13 09:48:21.479100] I [MSGID: 108031] [afr-common.c:2152:afr_local_discovery_cbk] 0-vol0-replicate-0: selecting local read_child vol0-client-1
[2019-05-13 09:48:59.219302] E [socket.c:2309:socket_connect_finish] 0-vol0-client-0: connection to 10.10.12.31:24007 failed (No route to host)
[2019-05-13 09:49:41.468469] I [glusterfsd-mgmt.c:1600:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2019-05-13 09:49:42.505174] E [MSGID: 114058] [client-handshake.c:1534:client_query_portmap_cbk] 0-vol0-client-0: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2019-05-13 09:49:42.505225] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-vol0-client-0: disconnected from vol0-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2019-05-13 09:49:45.442003] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-vol0-client-0: changing port to 49155 (from 0)
[2019-05-13 09:49:45.442523] I [MSGID: 114057] [client-handshake.c:1447:select_server_supported_programs] 0-vol0-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2019-05-13 09:49:45.442802] I [MSGID: 114046] [client-handshake.c:1223:client_setvolume_cbk] 0-vol0-client-0: Connected to vol0-client-0, attached to remote volume '/var/lib/glusterfs/data01/brick1/vol0'.
[2019-05-13 09:49:45.442812] I [MSGID: 114047] [client-handshake.c:1234:client_setvolume_cbk] 0-vol0-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2019-05-13 09:49:45.442820] I [MSGID: 114042] [client-handshake.c:1054:client_post_handshake] 0-vol0-client-0: 2 fds open - Delaying child_up until they are re-opened
[2019-05-13 09:49:45.443244] I [MSGID: 114041] [client-handshake.c:676:client_child_up_reopen_done] 0-vol0-client-0: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2019-05-13 09:49:45.443353] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-vol0-client-0: Server lk version = 1
[2019-05-13 09:49:49.622255] I [MSGID: 108031] [afr-common.c:2152:afr_local_discovery_cbk] 0-vol0-replicate-0: selecting local read_child vol0-client-1
[2019-05-13 10:20:06.060045] W [glusterfsd.c:1327:cleanup_and_exit] (-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x7494) [0x7efebc254494] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xf5) [0x55dba7a3b5e5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x54) [0x55dba7a3b444] ) 0-: received signum (15), shutting down
[2019-05-13 10:20:06.068969] I [fuse-bridge.c:5794:fini] 0-fuse: Unmounting '/mnt/pve/GlusterVol01'.
[2019-05-13 10:20:06.103235] I [fuse-bridge.c:5086:fuse_thread_proc] 0-fuse: unmounting /mnt/pve/GlusterVol01
[2019-05-13 10:22:08.842734] I [MSGID: 100030] [glusterfsd.c:2454:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.8.8 (args: /usr/sbin/glusterfs --volfile-server=10.10.12.31 --volfile-id=vol0 /mnt/pve/GlusterVol01)
[2019-05-13 10:22:08.853935] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-05-13 10:22:08.944855] W [MSGID: 108003] [afr.c:102:fix_quorum_options] 0-vol0-replicate-0: quorum-type none overriding quorum-count 1
[2019-05-13 10:22:08.946502] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2019-05-13 10:22:08.972020] I [MSGID: 114020] [client.c:2356:notify] 0-vol0-client-0: parent translators are ready, attempting connect on transport
[2019-05-13 10:22:08.972395] I [MSGID: 114020] [client.c:2356:notify] 0-vol0-client-1: parent translators are ready, attempting connect on transport
[2019-05-13 10:22:08.972832] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-vol0-client-0: changing port to 49155 (from 0)
[2019-05-13 10:22:08.973142] I [MSGID: 114057] [client-handshake.c:1447:select_server_supported_programs] 0-vol0-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2019-05-13 10:22:08.973231] I [MSGID: 114057] [client-handshake.c:1447:select_server_supported_programs] 0-vol0-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2019-05-13 10:22:08.973544] I [MSGID: 114046] [client-handshake.c:1223:client_setvolume_cbk] 0-vol0-client-1: Connected to vol0-client-1, attached to remote volume '/var/lib/glusterfs/data01/brick1/vol0'.
[2019-05-13 10:22:08.973544] I [MSGID: 114046] [client-handshake.c:1223:client_setvolume_cbk] 0-vol0-client-0: Connected to vol0-client-0, attached to remote volume '/var/lib/glusterfs/data01/brick1/vol0'.
[2019-05-13 10:22:08.973566] I [MSGID: 114047] [client-handshake.c:1234:client_setvolume_cbk] 0-vol0-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2019-05-13 10:22:08.973567] I [MSGID: 114047] [client-handshake.c:1234:client_setvolume_cbk] 0-vol0-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2019-05-13 10:22:08.973616] I [MSGID: 108005] [afr-common.c:4382:afr_notify] 0-vol0-replicate-0: Subvolume 'vol0-client-1' came back up; going online.
[2019-05-13 10:22:08.973639] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-vol0-client-1: Server lk version = 1
[2019-05-13 10:22:08.977940] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-vol0-client-0: Server lk version = 1
[2019-05-13 10:22:08.978055] I [fuse-bridge.c:4153:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.26
[2019-05-13 10:22:08.978075] I [fuse-bridge.c:4838:fuse_graph_sync] 0-fuse: switched to graph 0
[2019-05-13 10:22:08.978603] I [MSGID: 108031] [afr-common.c:2152:afr_local_discovery_cbk] 0-vol0-replicate-0: selecting local read_child vol0-client-1
[2019-05-13 10:53:46.573894] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error]
[2019-05-13 10:53:46.573992] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 5f9490a8-ec56-410e-9c70-653e0da77174. (Possible split-brain)
[2019-05-13 10:53:46.574253] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing FGETXATTR on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error]
[2019-05-13 10:53:46.574949] W [MSGID: 108027] [afr-common.c:2491:afr_discover_done] 0-vol0-replicate-0: no read subvols for (null)
[2019-05-13 10:53:46.575526] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 1380: READ => -1 gfid=5f9490a8-ec56-410e-9c70-653e0da77174 fd=0x7f649c00e06c (Input/output error)
[2019-05-13 10:53:46.577820] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 1381: READ => -1 gfid=5f9490a8-ec56-410e-9c70-653e0da77174 fd=0x7f649c00e06c (Input/output error)
[2019-05-13 10:53:46.596838] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4: split-brain observed. [Input/output error]
[2019-05-13 10:53:46.597759] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4. (Possible split-brain)
[2019-05-13 10:53:46.598916] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing FGETXATTR on gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4: split-brain observed. [Input/output error]
The message "W [MSGID: 108027] [afr-common.c:2491:afr_discover_done] 0-vol0-replicate-0: no read subvols for (null)" repeated 2 times between [2019-05-13 10:53:46.574949] and [2019-05-13 10:53:46.599257]
[2019-05-13 10:53:46.599525] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4. (Possible split-brain)
[2019-05-13 10:53:46.599797] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4: split-brain observed. [Input/output error]
[2019-05-13 10:53:46.599825] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 1389: READ => -1 gfid=609bb8be-3ae8-470d-9f88-2b65095fbed4 fd=0x7f649c00e06c (Input/output error)
[2019-05-13 10:53:46.599876] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4. (Possible split-brain)
[2019-05-13 10:53:46.600149] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4: split-brain observed. [Input/output error]
[2019-05-13 10:53:46.600193] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4. (Possible split-brain)
[2019-05-13 10:53:46.600417] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing FGETXATTR on gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4: split-brain observed. [Input/output error]
[2019-05-13 10:53:46.600775] W [MSGID: 108027] [afr-common.c:2491:afr_discover_done] 0-vol0-replicate-0: no read subvols for (null)
[2019-05-13 10:53:46.601071] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4. (Possible split-brain)
[2019-05-13 10:53:46.601537] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4: split-brain observed. [Input/output error]
[2019-05-13 10:53:46.601577] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 1390: READ => -1 gfid=609bb8be-3ae8-470d-9f88-2b65095fbed4 fd=0x7f649c00e06c (Input/output error)
[2019-05-13 10:53:46.619830] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 79423c92-0338-4dc9-bafc-091172e8d845: split-brain observed. [Input/output error]
[2019-05-13 10:53:46.620701] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 79423c92-0338-4dc9-bafc-091172e8d845. (Possible split-brain)
[2019-05-13 10:53:46.621098] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing FGETXATTR on gfid 79423c92-0338-4dc9-bafc-091172e8d845: split-brain observed. [Input/output error]
[2019-05-13 10:53:46.621455] W [MSGID: 108027] [afr-common.c:2491:afr_discover_done] 0-vol0-replicate-0: no read subvols for (null)
[2019-05-13 10:53:46.621732] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 79423c92-0338-4dc9-bafc-091172e8d845. (Possible split-brain)
[2019-05-13 10:53:46.623509] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 79423c92-0338-4dc9-bafc-091172e8d845: split-brain observed. [Input/output error]
[2019-05-13 10:53:46.624891] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing FGETXATTR on gfid 79423c92-0338-4dc9-bafc-091172e8d845: split-brain observed. [Input/output error]
[2019-05-13 10:53:46.625212] W [MSGID: 108027] [afr-common.c:2491:afr_discover_done] 0-vol0-replicate-0: no read subvols for (null)
[2019-05-13 10:53:46.625314] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 79423c92-0338-4dc9-bafc-091172e8d845. (Possible split-brain)
[2019-05-13 10:53:46.625721] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 79423c92-0338-4dc9-bafc-091172e8d845: split-brain observed. [Input/output error]
[2019-05-13 10:53:46.625754] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 1399: READ => -1 gfid=79423c92-0338-4dc9-bafc-091172e8d845 fd=0x7f649c00e06c (Input/output error)
[2019-05-13 10:53:46.576286] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing FGETXATTR on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error]
[2019-05-13 10:56:28.176786] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error]
[2019-05-13 10:56:28.177684] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 5f9490a8-ec56-410e-9c70-653e0da77174. (Possible split-brain)
[2019-05-13 10:56:28.178782] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing FGETXATTR on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error]
[2019-05-13 10:56:28.179128] W [MSGID: 108027] [afr-common.c:2491:afr_discover_done] 0-vol0-replicate-0: no read subvols for (null)
[2019-05-13 10:56:28.180634] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 1533: READ => -1 gfid=5f9490a8-ec56-410e-9c70-653e0da77174 fd=0x7f649c00e06c (Input/output error)
[2019-05-13 10:56:28.179439] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 5f9490a8-ec56-410e-9c70-653e0da77174. (Possible split-brain)
[2019-05-13 10:56:28.180620] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error]
[2019-05-13 10:59:25.278595] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error]
[2019-05-13 10:59:25.279517] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 5f9490a8-ec56-410e-9c70-653e0da77174. (Possible split-brain)
[2019-05-13 10:59:25.280605] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing FGETXATTR on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error]
[2019-05-13 10:59:25.281649] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 1685: READ => -1 gfid=5f9490a8-ec56-410e-9c70-653e0da77174 fd=0x7f649c00e06c (Input/output error)
[2019-05-13 10:59:25.281250] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 5f9490a8-ec56-410e-9c70-653e0da77174. (Possible split-brain)
-------------------------------------------------
What we can't explain is why server 1 does the following:
[2019-05-13 09:47:48.277650] W [socket.c:590:__socket_rwv] 0-glusterfs: readv on 10.10.12.31:24007 failed (No data available)
[2019-05-13 09:47:48.277696] E [glusterfsd-mgmt.c:1908:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: 10.10.12.31 (No data available)
[2019-05-13 09:47:48.277704] I [glusterfsd-mgmt.c:1926:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
then the volume will be unmounted and re-mounted with another port again.
In further consequence server 2 behaves exactly like this which consequences in aa a split-brain condition of the disk files of the VMs.
we would be glad if someone could explain these behaviors to us.
BR
René