Hello!
We set up a new Proxmox environment a few months ago. Everything ran fine for about two months, but now we are having problems with it: Linux machines crash randomly. Sometimes their filesystems become read-only, or their load climbs extremely fast to above 50 without any CPU usage. It looks like an I/O-related issue.
But I can't find the cause.
This is our server configuration. It is a cluster of 3 Proxmox nodes, using GlusterFS across all 3 nodes with a replica count of 2 (the 1st server replicates its brick to the 2nd, the 2nd to the 3rd, and the 3rd to the 1st; so each node holds 2 bricks).
proxmox-ve-2.6.32: 3.4-150 (running kernel: 2.6.32-37-pve)
pve-manager: 3.4-3 (running version: 3.4-3/2fc72fee)
pve-kernel-2.6.32-20-pve: 2.6.32-100
pve-kernel-2.6.32-19-pve: 2.6.32-96
pve-kernel-2.6.32-16-pve: 2.6.32-82
pve-kernel-2.6.32-37-pve: 2.6.32-150
pve-kernel-2.6.32-22-pve: 2.6.32-107
pve-kernel-2.6.32-17-pve: 2.6.32-83
pve-kernel-2.6.32-26-pve: 2.6.32-114
pve-kernel-2.6.32-18-pve: 2.6.32-88
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-16
qemu-server: 3.4-3
pve-firmware: 1.1-4
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-32
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.2-8
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1
Volume Name: gluster-volume
Type: Distributed-Replicate
Volume ID: 24e2888b-d540-4228-8e17-6e3e8c452335
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: vcl01:/var/lib/vz/brick1
Brick2: vcl02:/var/lib/vz/brick1
Brick3: vcl02:/var/lib/vz/brick2
Brick4: vcl03:/rpool/glusterfs/brick2
Brick5: vcl03:/rpool/glusterfs/brick1
Brick6: vcl01:/var/lib/vz/brick2
Options Reconfigured:
server.allow-insecure: on
performance.write-behind: off
cluster.quorum-type: none
network.ping-timeout: 2
performance.md-cache-timeout: 1
performance.cache-max-file-size: 2MB
performance.write-behind-window-size: 4MB
performance.read-ahead: off
performance.quick-read: off
performance.cache-size: 512MB
performance.io-thread-count: 64
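If it helps, the brick order above should correspond to a create command roughly like this (reconstructed from the volume info; I no longer have the exact command I ran back then):
> gluster volume create gluster-volume replica 2 vcl01:/var/lib/vz/brick1 vcl02:/var/lib/vz/brick1 vcl02:/var/lib/vz/brick2 vcl03:/rpool/glusterfs/brick2 vcl03:/rpool/glusterfs/brick1 vcl01:/var/lib/vz/brick2
With replica 2, consecutive brick pairs form the replica sets, which gives the ring layout described above (vcl01+vcl02, vcl02+vcl03, vcl03+vcl01).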
The VMs' disk configuration looks like this:
virtio0: gluster-volume:103/vm-103-disk-1.qcow2,format=qcow2,size=100G
virtio1: gluster-volume:103/vm-103-disk-2.qcow2,format=qcow2,size=250G
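The volume is attached to Proxmox as a GlusterFS storage; the entry in /etc/pve/storage.cfg looks roughly like this (typed from memory, server name may differ on your side):
glusterfs: gluster-volume
        server vcl01
        volume gluster-volume
        content images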
> gluster vol heal gluster-volume info
Brick vcl01:/var/lib/vz/brick1/
Number of entries: 0
Brick vcl02:/var/lib/vz/brick1/
Number of entries: 0
Brick vcl02:/var/lib/vz/brick2/
Number of entries: 0
Brick vcl03:/rpool/glusterfs/brick2/
Number of entries: 0
Brick vcl03:/rpool/glusterfs/brick1/
Number of entries: 0
Brick vcl01:/var/lib/vz/brick2/
Number of entries: 0
Everything looks ok.
I have tried everything I could find in related threads, but I can't get the system back to a stable state.
Which logs could I check?
The cli.log shows this once a minute:
[2015-07-03 09:46:37.304631] I [socket.c:3561:socket_init] 0-glusterfs: SSL support is NOT enabled
[2015-07-03 09:46:37.304678] I [socket.c:3576:socket_init] 0-glusterfs: using system polling thread
[2015-07-03 09:46:37.376926] I [socket.c:3561:socket_init] 0-glusterfs: SSL support is NOT enabled
[2015-07-03 09:46:37.376940] I [socket.c:3576:socket_init] 0-glusterfs: using system polling thread
[2015-07-03 09:46:37.454291] I [socket.c:2238:socket_event_handler] 0-transport: disconnecting now
[2015-07-03 09:46:37.455901] I [cli-rpc-ops.c:518:gf_cli_get_volume_cbk] 0-cli: Received resp to get vol: 0
[2015-07-03 09:46:37.456049] I [cli-rpc-ops.c:779:gf_cli_get_volume_cbk] 0-cli: Returning: 0
[2015-07-03 09:46:37.456058] I [input.c:36:cli_batch] 0-: Exiting with: 0
But those messages have been appearing for a long time (how can I get rid of them?). They don't seem to be the cause of the problem...
glustershd.log:
[2015-07-03 07:32:24.883854] I [afr-self-heald.c:1690:afr_dir_exclusive_crawl] 0-gluster-volume-replicate-2: Another crawl is in progress for gluster-volume-client-5
[2015-07-03 09:05:42.055861] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2015-07-03 09:06:02.242792] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2015-07-03 09:06:02.243635] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2015-07-03 09:06:02.247140] I [glusterfsd-mgmt.c:1307:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2015-07-03 09:06:02.247181] I [glusterfsd-mgmt.c:1307:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2015-07-03 09:07:19.150924] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2015-07-03 09:07:35.981441] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2015-07-03 09:07:35.987741] I [glusterfsd-mgmt.c:1307:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2015-07-03 09:07:35.989568] I [glusterfsd-mgmt.c:1307:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2015-07-03 09:26:02.132316] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2015-07-03 09:26:08.649973] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2015-07-03 09:26:08.650823] I [glusterfsd-mgmt.c:1307:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2015-07-03 09:26:08.652070] I [glusterfsd-mgmt.c:1307:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
Does anyone have an idea how to tune this setup to make it more stable? How can I track down the bottleneck/issues?
I can't find any tools that would tell me what is going wrong in our setup...
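The only built-in thing I have found so far is the GlusterFS profiling/top commands; if I understand the docs correctly it would be something like this, but I'm not sure how to interpret the output:
> gluster volume profile gluster-volume start
> gluster volume profile gluster-volume info
> gluster volume top gluster-volume read-perf
Is that the right direction for finding per-brick latencies, or is there a better way?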
Best regards,
Juergen