Single-server VMware -> Proxmox with ZFS/Ceph?

DeepB

Hello,
we currently run a single-server VMware host for a handful of VMs, plus 3 NASes and Veeam for backups (100 Gb Ethernet locally, and one NAS off-site behind a wireless link, ~1 Gbit).
The workload is pretty light: a DC, a file server, an SQL Server for 50 users (light load), and a few other services. Our server currently has an EPYC 7543P and 256 GB of RAM, all flash, and is highly overpowered. Current datastore usage is a tad under 3 TB (though about 60-70% of that is cool or cold storage).

We want to migrate away from VMware, and I am thinking about Proxmox (I have been using it on my home server for a long time).

Requirements:
  • easy to maintain. I am the lone IT person in the company (Linux know-how is available)
  • no HA requirement. Downtime outside of working hours is permitted, and downtime during working hours is not catastrophic. A recovery time of 6-24 h after a catastrophic hardware failure is acceptable (lower is better, of course)

Nice to have:
  • easy failover to a backup server (either automatic, or doable by a technically minded person with a written guide or while being guided over the phone)
  • higher availability than a few hours of recovery time

So my thoughts were to try out one of these two options:
  1. Proxmox + ZFS, 2 additional (older) servers, one on-site, one off-site, and ZFS replication every x minutes for failover
  2. Proxmox + Ceph on-site (2 additional servers), because why not? All kidding aside, we would get (much) better recovery times and failover without admin intervention (good when I am not on site) for not much additional cost. However, I am unsure about the additional maintenance burden. Would 2 be harder to maintain than 1?
Goal would be that if I am on holidays for 3 weeks I would not have to come home if shit hits the fan.
Any suggestions/input?
Thanks
Daniel
 
So my thoughts were to try out one of these two options:
  1. Proxmox + ZFS, 2 additional (older) servers, one on-site, one off-site, and ZFS replication every x minutes for failover

This most likely won't work, since corosync needs low latency on the cluster network, which will probably be difficult to achieve over the off-site link.
You could, however, set up the remote server as a standalone node and use PVE-zsync together with qm remote-migrate to achieve something similar:
https://pve.proxmox.com/pve-docs/qm.1.html
https://pve.proxmox.com/wiki/PVE-zsync
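
Roughly, that combination could look like this on the command line (VM ID 100, the host name offsite-host, the dataset tank/replica and the API token are just placeholders, so please check the man pages linked above before relying on any of it):

# on the production node: keep a rolling copy of VM 100 on the standalone off-site node
pve-zsync create --source 100 --dest offsite-host:tank/replica --name vm100 --maxsnap 7 --verbose
# for a planned move, push the VM itself to the other node (this feature is still marked experimental)
qm remote-migrate 100 100 'host=offsite-host,apitoken=PVEAPIToken=root@pam!migrate=<secret>,fingerprint=<cert-fingerprint>' --target-bridge vmbr0 --target-storage local-zfs --online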

Please note that I have never tried this myself, but to my understanding it should work (in theory). And of course you would need a QDevice for your local two-node cluster. With this (two nodes plus QDevice) you would even have automatic failover. One benefit would be that you avoid the complexity of Ceph (which, given your light workload, seems a little bit like overkill to me, but one might beg to differ).
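
The QDevice part is pretty lightweight; a rough sketch (the external vote holder only needs corosync-qnetd and does not have to be a Proxmox machine, so any small always-on box could do; the setup step needs root SSH access to that host):

apt install corosync-qnetd        # on the external QDevice host
apt install corosync-qdevice      # on both cluster nodes
pvecm qdevice setup 192.168.1.50  # run once on one cluster node, with the IP of the QDevice host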
On the other hand: with Ceph you wouldn't have any data loss, while with ZFS you will always lose the data written since the last sync. This can be mitigated by reducing the sync schedule to one minute, but not completely avoided. In case of a planned migration or maintenance you won't have this problem, though. I don't know whether this might be a problem with SQL Server (I lack real-world experience here; hopefully somebody else can say something about it).
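
For the two on-site nodes, that one-minute schedule is just the built-in storage replication; a minimal sketch, assuming VM 100 and a second node called pve2 (the same job can also be created in the GUI under the VM's Replication tab):

pvesr create-local-job 100-0 pve2 --schedule "*/1" --rate 50   # replicate VM 100 to pve2 every minute, capped at 50 MB/s
pvesr status                                                   # show replication jobs and the last successful sync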
  2. Proxmox + Ceph on-site (2 additional servers), because why not? All kidding aside, we would get (much) better recovery times and failover without admin intervention (good when I am not on site) for not much additional cost. However, I am unsure about the additional maintenance burden. Would 2 be harder to maintain than 1?
Goal would be that if I am on holidays for 3 weeks I would not have to come home if shit hits the fan.

I fear that this goal is not achievable no matter which route you go. As soon as the cluster (be it based on Ceph or ZFS) is sufficiently broken, manual clean-up work will have to be carried out. If you are lucky you can do it remotely (via SSH or another solution), but even that might not be available in a worst-case scenario.

For backups I would suggest setting up a dedicated local server as a Proxmox Backup Server, so that if something bad (e.g. ransomware, data loss, etc.) happens to your cluster, your backups are still available. And of course you would need another server off-site, also for PBS. The remote PBS (a cheap vserver should be enough for just PBS) can then be configured to pull its data from the local one. The benefits: with the right permissions you get ransomware protection, and the actual backups (since they are done locally before getting synced to the off-site PBS) won't take very long:
https://pbs.proxmox.com/docs/storage.html#ransomware-protection-recovery
The permissions would be something like this: Proxmox VE is allowed to write backups to your local PBS and restore from it, but not to remove them.
The remote PBS is allowed to pull the backups from the local PBS, but not to delete anything on it. The remote PBS itself wouldn't allow anything by default, but in case of an emergency you would change the permissions so that your local servers can restore from the remote PBS.
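
On the CLI that could look roughly like the following (datastore names, users and the remote name are made up; the same can be configured in the PBS web UI under Remotes, Sync Jobs and Permissions):

# on the local PBS: the token PVE uses may write and restore backups, but not prune or remove them
proxmox-backup-manager acl update /datastore/local-store DatastoreBackup --auth-id 'pve@pbs!backup'
# on the remote PBS: register the local PBS and pull its datastore on a schedule
proxmox-backup-manager remote create local-pbs --host pbs.local.lan --auth-id 'sync@pbs' --password 'xxx' --fingerprint '<cert-fingerprint>'
proxmox-backup-manager sync-job create pull-local --store offsite-store --remote local-pbs --remote-store local-store --schedule hourly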
If your budget allows, you could even use a dedicated server at a cloud provider (e.g. Hetzner) with a PBS AND a PVE instance. Normally you wouldn't run anything on it, but in case your local cluster is broken you could restore the off-site backups from the remote PBS to the remote PVE and have an interim solution until your local infrastructure is ready again. A cheaper alternative might be to colocate one of your old used servers in a colocation data center. Restores should be fast then, since they would happen on the same host. It would still need manual intervention, though.

In any case: I would suggest that you get professional support, for the migration project as well as for ongoing operation. A service provider will not only advise you on which solution fits your company's use cases better, but will also be able to provide support when you are away or when something really bad happens. This is not only true for Proxmox but also for the rest of your infrastructure.
I'm aware that your company probably doesn't have a big budget (you being the only IT person is a strong indicator in that regard), but if something bad happens, the longer downtime caused by having only one person who can fix it will cost much more. Even if you are not on vacation, the day only has 24 hours.
 
I fear that this goal is not achievable no matter which route you go. As soon as the cluster (be it based on Ceph or ZFS) is sufficiently broken, manual clean-up work will have to be carried out. If you are lucky you can do it remotely (via SSH or another solution), but even that might not be available in a worst-case scenario.
Yes, that is clear. However, if it is achievable to make it hard to get "sufficiently broken" without breaking the bank and without introducing massive administration overhead, I am leaning that way. Whether that is possible is the question.

I think I need to stress that I do not need HA. A few hours of downtime is acceptable. In case of a massive hardware failure, 12 or even 24 h of downtime are acceptable too. However, if I can easily do better, I will.

So let me elaborate on the possibilities that I thought were viable (I was not clear above, and have also included your input and that of others):

1) 2 servers + QDevice or 3 servers in-house, ZFS replication. Additional server off-site (not in the cluster), ZFS replication.
Advantage: easier to manage (?)
Disadvantage: potential data loss of up to the ZFS replication interval (>= 1 min)

2) 3 servers in-house, Ceph. Additional backup server off-site, NOT in the cluster, no Ceph.
Advantage: automatic failover with no data loss.
Disadvantage: lower IOPS on writes (I would have a copy on each server; see the sketch after this list)

3) Leave it as is (one server), maybe migrate to Proxmox.
Advantage: cheapest (no additional hardware needed)
Disadvantage: looong recovery times after a hardware failure
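
For 2), the "copy on each server" part is just the pool's replica count, which is also where the write penalty comes from: a write is only acknowledged once enough replicas have it. A rough sketch of what that looks like on one of the Ceph nodes (the pool name is made up; 3/2 is the default Proxmox suggests):

pveceph pool create vm-pool --size 3 --min_size 2 --add_storages   # 3 copies, I/O continues as long as 2 are available
pveceph pool ls                                                    # verify size/min_size of the pools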


My idea for 1 & 2 would be to buy 2 cheaper servers (like a Lenovo SR665 with 16 cores, throw in 128-256 GB of ECC RAM and a ~4 TB read-intensive SSD with PLP).

If I understand correctly, with 1 & 2 I would get the resilience to lose one server for only a few thousand euros.

Thanks
Daniel
 
