Small homelab cluster

codedmind

New Member
Oct 19, 2020
Hello
I recently started with Proxmox. I have three small servers that I put together into a cluster, but since all of them are limited to 16 GB of RAM, Ceph performs very badly: all the VMs/CTs are laggy because after some time the memory isn't enough and swap gets used.

So I'm looking for other solutions instead of Ceph. I tried ZFS + GlusterFS, and the performance is much better, but with that I lose LXC support.


I saw some posts in the forum saying that with GlusterFS 4 it might become possible, but that thread is from 2018; GlusterFS is now at 5.5 and it is still not possible to run LXC on top of GlusterFS.


Are there other storage solutions that allow live migration like Ceph does, but without the resources that Ceph demands?

Thanks
 
The GlusterFS version is even higher if you use their official repositories, btw.

What you need for live migration is any form of shared storage; NFS, for example, would do.
What some people do is mount GlusterFS and create a PVE directory storage on it, as sketched below. Note that this is not really recommended: there might be small problems with, for example, mount dependencies during autostart on node reboot, IIRC. But it should at least give you LXC containers that start.
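A rough sketch of that workaround, assuming a Gluster volume named gv0 served by host pve1 (volume, host, path and storage name are only placeholders):
Code:
# on every node: mount the Gluster volume (needs glusterfs-client installed)
mkdir -p /mnt/glusterfs
mount -t glusterfs pve1:/gv0 /mnt/glusterfs

# register the mount point as a shared directory storage in PVE
pvesm add dir gluster-dir --path /mnt/glusterfs --content rootdir,images --shared 1
To survive reboots the mount also needs an fstab entry (with _netdev) or a systemd mount unit, which is exactly where the autostart ordering problems mentioned above can appear.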

Note #2: There is a bugzilla entry for this problem https://bugzilla.proxmox.com/show_bug.cgi?id=2690
Admittedly, this issue is moving forward rather slowly, possibly because for many people the preferred solution is to get more RAM and use Ceph.
 
@Dominic thanks for your reply. When I test the "performance" using dd, the numbers are higher when using GlusterFS on top of ZFS, and it at least appears to consume less memory compared with Ceph. Another thing is that with Ceph, I don't know why, but I always get alerts from netdata that the NICs/disks backlog is having issues.

And yes, I'm talking about shared storage so I can run containers on that storage and take snapshots. It is also important to be able to do a live migration, i.e. move the container to another cluster node. From the shared storage wiki I guess only Ceph or ZFS over iSCSI will allow that.


About getting more RAM: I understand that is what everyone does, but in my case it isn't possible, as my hardware only allows 16 GB per node.
 

Attachments

  • Screenshot 2020-11-02 at 08.31.55.png
When I test the "performance" using dd
Try fio
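For example, a small 4k random-write test against a file on the storage in question could look like this (path, size and runtime are arbitrary):
Code:
fio --name=randwrite --filename=/mnt/pve/test-storage/fio.test --size=1G \
    --rw=randwrite --bs=4k --ioengine=libaio --direct=1 --numjobs=1 \
    --runtime=60 --time_based --group_reporting
The interesting numbers are the reported IOPS and latency rather than raw throughput, which dd tends to overstate.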

I don't know why, but I always get alerts from netdata that the NICs/disks backlog is having issues.
If you get packet drops on your network, that will cause problems.
Try to find out if that happens only with Ceph.
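One way to check that, for instance, is to look at the drop counters on the Ceph NICs (interface name is a placeholder):
Code:
# kernel-level RX/TX errors and drops for the interface
ip -s link show eth1
# driver/NIC statistics, if the driver exposes them
ethtool -S eth1 | grep -i drop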

From the shared storage wiki I guess only Ceph or ZFS over iSCSI will allow that.
NFS with qcow2 images should, too.
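For a VM that could look roughly like this (storage name, server, export path and VMID are placeholders):
Code:
# add an NFS storage that may hold VM disk images
pvesm add nfs nfs_storage --server 192.168.1.10 --export /export/pve --content images
# move an existing disk there in qcow2 format, which supports snapshots
qm move_disk 100 scsi0 nfs_storage --format qcow2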
 
@Dominic it only happens with Ceph; maybe because of the full-mesh network using broadcast?

How can I create an LXC container with qcow2?
 
Sorry, that was confusing.

How can I create an LXC container with qcow2?

This doesn't work. But migration for containers cannot be real live migration either.

@Dominic it only happens with Ceph; maybe because of the full-mesh network using broadcast?
Do you have one single network with all the traffic?
 
@Dominic regarding live migration of containers: not really live, as the container will be shut down, but it should be possible to move the volume from one storage to another. When using NFS/GlusterFS I'm not able to move the root disk to either of those storages... I don't know if I'm making myself clear?

About the network: I have three 1 Gbit network cards in each server, two for Ceph and one for the Proxmox VMs.
 
1 Gbit Ceph is not ideal.

When using NFS/GlusterFS I'm not able to move the root disk to either of those storages... I don't know if I'm making myself clear?

I think I don't understand you correctly. When container 105 is off, I can move its rootfs to and from NFS:
Code:
pct move_volume 105 rootfs local
pct move_volume 105 rootfs nfs_storage
And I can also (restart-) migrate it to other nodes with
Code:
pct migrate 105 pveA --restart
 
Well, then something is wrong... when I want to create a new LXC container I cannot create the rootfs on my nfs_share :/ Also, in the Resources tab I cannot move the rootfs to NFS.
And snapshots will not be available either...
 
You have to make sure that the content type Container (named rootdir in the config) is allowed on your NFS storage. That means the file /etc/pve/storage.cfg should look like
Code:
nfs: storagename
         ...
         content rootdir,vztmpl
         ...
See also the wiki about NFS
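Alternatively, the same can be set on the command line, assuming the storage is called nfs_storage (example name only):
Code:
pvesm set nfs_storage --content images,rootdir,vztmpl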
 
Hello again, some time has passed and I have upgraded some things in the cluster. Right now I have 3 nodes like this:
MicroServer Gen8
1× 500 GB SSD, hdd1 (for Proxmox)
1× 400 GB SSD, hdd2 (for Ceph)
16 GB RAM
10 Gbit NIC for Ceph
1 Gbit NIC for Proxmox
Running Proxmox 7

I still get high IO delay and I cannot figure out what the problem is! The VM with the most disk writes is Home Assistant, and its graph looks like this:
[Screenshot: disk write graph of the Home Assistant VM]

I don't think that is a lot... but the other nodes have high IO delay because of it, I don't know why. The IO delay is around 15 to 20%, yet iostat for instance shows "low" values for writes... only a few kB.

[Screenshot: iostat output]


My pveperf results look like this... but I guess this is only for hdd1:

[Screenshot: pveperf output]

@Dominic any suggestion on what more I can do to understand whether something is misconfigured, or if it's simply the hardware, or something else?

Thanks
 
467 IOPS (fsyncs/second) is quite bad. I got 1857 with a pair of 100 GB SATA SSDs in SW RAID1, so maybe your storage can't handle the IOPS. But it would be more interesting to see how busy your Ceph SSD and NICs are. It looks like the Ceph OSD is causing the IO delay.
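For instance, the OSD latencies and per-disk utilisation could be checked with something like this:
Code:
# per-OSD commit/apply latency as seen by Ceph
ceph osd perf
# per-device utilisation and write latency, refreshed every 2 seconds
iostat -x 2
If %util on the Ceph SSD sits near 100% while the write rate stays in the kB range, the disk itself is probably the limit (consumer SSDs without power-loss protection are typically slow for the fsync-heavy writes Ceph does).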
 
@Dunuin thank you for your time and reply.

My SSDs are more like "home grade" (https://personal.kioxia.com/en-emea/ssd/exceria-sata-ssd.html) and I have only one disk in each node, so no RAID.
Can you please point me to how to retrieve that information? For instance, for the OSDs I get this:
[Screenshot: Ceph OSD latency overview]
Sometimes the latency on two of the nodes goes up to 100/100. The NIC is 10 Gbit and speed-wise I can see it goes up fine...
[Screenshot: Ceph NIC traffic graph]
Here, the point where the IO goes up to 30% is when I shut down the VM running on Ceph and moved it to local-lvm; then I started it on local-lvm and the I/O delay is completely different, as you can see. In the same time frame, the NIC (bond0, two 10 Gbit NICs, one to each node) looks like this:
[Screenshot: bond0 traffic and I/O delay in the same time frame]

For comparison, in the same time frame one of the other nodes shows this graph:
[Screenshot: the same time frame on one of the other nodes]

I understand the SSDs are not server grade... but this is a small lab. I know I only have 16 GB of RAM in each node, but I also only run Home Assistant, Pi-hole, MariaDB, Grafana and InfluxDB, nothing with high usage...
 

