I/O error in VM on GlusterFS when one replica is down

foxriver76

Mar 15, 2021
Hi all,

recently I changed my Proxmox setup to run an HA cluster based on GlusterFS. Now I noticed that whenever I take one of the GlusterFS hosts down (e.g. to restart it after upgrades), the VMs running on the other host get a corrupted filesystem. The corrupted VMs then show the following errors:

[Attachment: screenshot of the error messages (Bildschirmfoto von 2021-03-15 12-04-18.png)]

My GlusterFS configuration looks like this:
Code:
root@proxmox-nuc:~# gluster volume info

Volume Name: pve
Type: Replicate
Volume ID: f2e8a3f0-b73f-4354-adbe-21a87f24b981
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.178.141:/data/proxmox/gv0
Brick2: 192.168.178.130:/data/proxmox/gv0
Brick3: 192.168.178.28:/data/proxmox/gv0 (arbiter)
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
features.shard: on
cluster.self-heal-daemon: on

Does anyone have an idea why this happens?
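For completeness: the quorum-related options do not appear under "Options Reconfigured" above, so they should still be at their defaults. This is roughly how they can be read back (just a sketch, using the volume name pve from the output above):
Code:
# read back the quorum and timeout related options (sketch)
gluster volume get pve cluster.quorum-type
gluster volume get pve cluster.server-quorum-type
gluster volume get pve network.ping-timeout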

best regards

fox
 
Not sure how arbiters work in GlusterFS, but I think quorum/arbiter might be the problem.

When I have 2 of 3 Gluster servers online, I can use dd successfully on a VM that is stored on the corresponding GlusterFS volume.

Code:
root@glusterHci143:~# gluster volume info
 
Volume Name: gv0
Type: Replicate
Volume ID: b7c62cc6-1635-488a-bbc5-70aa70322926
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.25.172:/data/brick1/gv0
Brick2: 192.168.25.173:/data/brick1/gv0
Brick3: 192.168.25.174:/data/brick1/gv0
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet



root@glusterHci143:~# gluster peer status
Number of Peers: 2

Hostname: 192.168.25.174
Uuid: 0900b5ff-3674-46a2-8741-06fc8d66d048
State: Peer in Cluster (Connected)

Hostname: 192.168.25.172
Uuid: d88a20b8-349b-4a07-99c7-d5edded7f458
State: Peer in Cluster (Disconnected)

As soon as I shut down another Gluster node (so only 1 of 3 left), I get the same I/O errors when running dd.
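For reference, the write test I run inside the VM is along these lines (only a sketch; the target file and size are arbitrary, and oflag=direct is there so the writes actually hit the Gluster-backed disk instead of the page cache):
Code:
# write 1 GiB of zeros straight to disk, bypassing the page cache (sketch)
dd if=/dev/zero of=/root/ddtest.img bs=1M count=1024 oflag=direct status=progress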
 
Thanks for your reply. As far as I understand it, the arbiter should keep the filesystem alive by maintaining quorum even when one of the two data hosts is down.
Maybe someone with more GlusterFS knowledge, or someone who also runs a 2 + 1 setup, can tell me whether it is possible to keep the VMs alive.
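For what it's worth, GlusterFS ships a predefined "virt" option group that is usually recommended for volumes holding VM images; applying it (or at least the quorum options from it) would look roughly like this. This is only a sketch, and I have not yet verified that it solves the problem here:
Code:
# apply the predefined option group for VM image workloads (sketch)
gluster volume set pve group virt
# or set the central quorum options individually
gluster volume set pve cluster.quorum-type auto
gluster volume set pve cluster.server-quorum-type server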

What's really bad is that even when the second host is back up, the VMs stay corrupted until I manually reboot the affected ones.
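At least I can check whether the bricks have finished self-healing once the host is back, before touching the VMs; again just a sketch with the volume name pve:
Code:
# list entries that still need healing on each brick (sketch)
gluster volume heal pve info
# quick per-brick count of pending heals
gluster volume heal pve statistics heal-count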
 
