I just built a three-node Proxmox VE 8.2.7 cluster using GlusterFS on the RAIDZ1 storage on each node. While awaiting 10GbE network hardware, the nodes are communicating over a shared 1GbE network, but the current data is quite static and there is not much write activity. Total data usage is only about 35GB.
Everything was working, with data replicated on each node. But today something went wrong with the network infrastructure, causing all sorts of problems. After resolving that and rebooting the switches and the cluster, the LXCs and VMs wouldn't restart. The logs showed errors saying the ".raw" disk images on the GlusterFS volume did not exist.
After the reboot the GlusterFS filesystem was mounted on all three nodes (at /mnt/glusterfs), but only node 1 has any actual data. On the other two nodes there is no container data, though they do have the top-level directory structure. I was able to get the LXCs and VMs started again by migrating them all to node 1, and they are now running properly there. But I need to get the data replicated back onto the other nodes so I can migrate things back.
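For reference, this is roughly how I've been comparing the state of the nodes ("gv0" below is just a stand-in for my actual volume name):

    # Run on each node -- all peers should show "Peer in Cluster (Connected)"
    gluster peer status

    # Check that all three bricks are online and report the expected sizes
    gluster volume status gv0 detail

    # List files that Gluster knows still need to be healed onto other bricks
    gluster volume heal gv0 info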
UPDATE -- looking at the GlusterFS status on nodes 2 and 3, I noticed that they report the volume size as 100GB, which is very wrong. Node 1 reports it as 2.81TB, which is correct.
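My guess is that the ZFS dataset backing the bricks didn't mount at boot on nodes 2 and 3, so the brick path fell back onto the (much smaller) root filesystem, but I'd welcome correction. This is how I'd check on those nodes (the brick path shown is a guess; substitute the real one):

    # Confirm the brick directory is on the ZFS pool,
    # not silently sitting on the root filesystem
    df -h /data/brick1    # placeholder path -- use the real brick path

    # Verify the RAIDZ1 dataset is actually imported and mounted
    zfs list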
My questions are: why would this happen; how can I get the node 1 data replicated back onto nodes 2 and 3; and how can I prevent this from happening again? (It was an unpleasant failure, as the cluster hosts my DNS, so things were very broken!)
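In case it helps anyone answer, this is the rough sequence I'm considering for re-replicating and for guarding against a repeat (again with "gv0" as a placeholder), though I'd appreciate confirmation before I run it:

    # Trigger a full self-heal so node 1's data re-replicates to nodes 2 and 3
    gluster volume heal gv0 full

    # Watch progress; the entry count should drop to zero on all bricks
    gluster volume heal gv0 info

    # Check for split-brain entries that would need manual resolution
    gluster volume heal gv0 info split-brain

    # Quorum settings that should stop a partitioned node from diverging
    gluster volume set gv0 cluster.quorum-type auto
    gluster volume set gv0 cluster.server-quorum-type server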
Thanks!