[SOLVED] GlusterFS problems after temporary network loss

n8ur

New Member
Oct 10, 2024
I just built a three-node Proxmox VE 8.2.7 cluster using GlusterFS on the RAIDZ1 storage of each node. While awaiting 10GbE network hardware, the nodes are communicating over a shared 1GbE network, but the current data is quite static and there is not much write activity. Total data usage is only about 35GB.
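
For reference, I created the volume as a 3-way replica, roughly like this (the volume name gv0 and the brick paths are placeholders, not my exact setup):

Code:
# On node 1, join the other nodes to the trusted pool
gluster peer probe node2
gluster peer probe node3

# Create a 3-way replicated volume on the ZFS-backed brick directories
gluster volume create gv0 replica 3 \
    node1:/rpool/gluster/brick \
    node2:/rpool/gluster/brick \
    node3:/rpool/gluster/brick
gluster volume start gv0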

Everything was working with data replicated on each node. But today something went wrong with the network infrastructure, causing all sorts of problems. After resolving that and rebooting the switches and the cluster, the LXCs and VMs wouldn't restart. The log showed errors saying the ".raw" disk images on the GlusterFS volume did not exist.
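
For anyone debugging something similar, these commands (gv0 is a placeholder for the volume name) show whether the peers are connected and the bricks online after a network disruption:

Code:
gluster peer status
gluster volume status gv0
gluster volume info gv0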

After the reboot the GlusterFS filesystem was mounted on all three nodes (at /mnt/glusterfs), but only node 1 has any actual data. On the other two nodes there is no container data, though they have the top-level directory structure. I was able to get the LXCs and VMs started again by migrating them all to node 1, and they are now running properly there. But I need to get the data replicated back onto the other nodes so I can migrate them back.
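
For anyone else in this state: assuming the peers can see each other again, GlusterFS's self-heal should copy the data back onto the empty bricks. Something like this (gv0 again a placeholder) lists what is pending and kicks off a full heal:

Code:
# List entries that still need healing on each brick
gluster volume heal gv0 info

# Trigger a full self-heal across all bricks
gluster volume heal gv0 full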

UPDATE -- looking at the GlusterFS status on nodes 2 and 3, I noticed that they report the volume size as 100GB, which is very wrong. Node 1 reports it as 2.81TB, which is correct.
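
That mismatch may be the real clue: if 100GB is the size of the local root filesystem, then /mnt/glusterfs on nodes 2 and 3 was probably a plain local directory rather than an active Gluster mount. A quick way to tell (path as in my setup):

Code:
# An active mount shows filesystem type fuse.glusterfs;
# a bare directory shows the underlying local filesystem instead
df -hT /mnt/glusterfs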

My questions are: why would this happen; how can I get the data from node 1 replicated back onto nodes 2 and 3; and how can I prevent this from happening again? (It was an unpleasant failure, as the cluster hosts my DNS, so things were very broken!)

Thanks!
 
Problem found, and hopefully resolved. For some reason GlusterFS did not mount on reboot, although the proper line was present in /etc/fstab. Mounting it manually on each node got everything working, and after another reboot everything came up OK. I can only guess that at the time of the first reboot the network was still acting up and the nodes weren't able to talk to each other.
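
For anyone finding this later, the usual hardening is to mark the mount as network-dependent, let systemd mount it on first access instead of failing at boot, and give the client fallback volfile servers. A sketch of an /etc/fstab line along those lines (hostnames and the volume name gv0 are placeholders):

Code:
node1:/gv0 /mnt/glusterfs glusterfs defaults,_netdev,x-systemd.automount,backup-volfile-servers=node2:node3 0 0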
 
