Proxmox HA and Ceph

rbeard.js

Member
Aug 11, 2022
Hi there!
I'm trying to learn more about Ceph storage so we can use it in an upcoming installation.
We have a database running on Windows Server that most of the company relies upon. I was looking into getting a 4-blade server and running Proxmox VE on 3 of the blades and PBS on the last blade as a slim install to save on colo costs, as this server is offsite.

On site where I am, we have a Proxmox cluster running on ZFS with replication set up on our more critical VMs. Replication runs every 15 minutes to catch any changes, so if a VM goes down, we only lose 15 minutes of data.
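(For reference, a 15-minute schedule like this can be configured with the `pvesr` CLI; the VM ID, job number, and target node below are placeholders for our setup:)

```shell
# Sketch: replicate VM 100 to node "pve2" every 15 minutes.
# Job IDs use the format <vmid>-<job-number>.
pvesr create-local-job 100-0 pve2 --schedule "*/15"

# Check last sync time and state of the replication job
pvesr status --guest 100
```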

For this new build, the database and server are much more critical. I like that Ceph copies the data on write, so I'm not relying on replication.
However, if the network or a node goes down, connections still drop and it takes a few minutes for the VM to come back up on the next node, and I still lose everything that was in RAM.

My question is: is there a solution here that makes the HA instant, with no drop-off? If a node dies, I would like the VM to pick up on the other node instantly so users don't notice anything changed. I'm not sure if this is within the capabilities of Ceph or if there is another storage system I should be looking at.
I would also like to ask if there is a way to increase the write speeds when doing a migration. I only have 6 GB of RAM on my test setup and the speed is not amazing. I know Ceph has to write to each node on every write, but if users are in the system during a migration, that RAM information has changed a lot by the time the migration completes.

Any and all information is welcome and thank you
 
May want to post your question at /r/ceph since they can answer DB/VM questions.
 
If the node dies, I would like the VM to pick up on the other node instantly so users dont notice anything changed.
This is not possible.

(Without having a malfunction it is possible to live migrate a VM from one node to another - so users don't notice anything.)

A normal application runs in a VM on one node. When that node dies, the KVM process of the VM dies with it. It is gone. It cannot be resurrected on another node within a microsecond, complete with the current state of all user sessions.

The HA mechanism we have will start a fresh instance of that VM on another node. This VM then needs to re-instantiate the services of the now-dead node, i.e. those lost VMs.


Then there is the idea of an FT (Fault Tolerant) service. This would require two (or more) nodes to sync a KVM process every microsecond or so. With that approach, a user of the service would actually not notice a dead node.

As far as I know the current feature set of KVM / PVE does not include this.

I would be happy if someone could tell us that I am wrong...

Best regards
 
Gotcha, so I'll always potentially lose what's in RAM at most when using Ceph. Theoretically all writes are replicated across my servers, so if the node goes down, it will reboot on another node and hopefully only lose what was in RAM at that time.

I think I'll post this on /r/ceph too to see if something you mentioned would be a possibility but even this level of fault tolerance is a lot better than what we are doing now on our less critical systems.
 
I think I'll post this on /r/ceph too to see if something you mentioned would be a possibility
My question is if there is a solution here that makes the HA instant with no drop off?
These things are unrelated. The Ceph backend provides just the storage. What you ask for is not currently available with Proxmox; you would need VMware (vSphere Replication).

The Proxmox HA engine does not have hot-replica functionality at this time. I'm not sure if it's even on the roadmap.
 
Hello,

As explained in https://pve.proxmox.com/pve-docs/pve-admin-guide.html#chapter_pveceph, it is recommended to set up at least 3 monitors in your Ceph cluster. That way, if a node goes down, you enter a degraded state but the cluster is still functional and can recover itself once the node comes back up.
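(As a sketch, the monitors can be created via the `pveceph` tooling, running the command on each of the three nodes, and quorum can then be verified with the standard Ceph tools:)

```shell
# Run on each of three PVE nodes so monitor quorum survives one failure
pveceph mon create

# Verify overall cluster health and monitor quorum
ceph status
ceph quorum_status --format json-pretty
```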

Regarding downtime: if node A goes down, it will take about 2 minutes for the VMs to be recovered to another node when HA is enabled. That's because it is not possible to tell what happened to node A, or whether it will come back up quicker than the time required to recover the VMs.

If the HA timeout is too much for you, consider setting up HA at the application level. Then, if one node goes down, the second VM on another node can take over. You can use HA groups with the "restricted" option to make sure that the two VMs are never on the same physical node.
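(A sketch with `ha-manager`; group names, node names, and VM IDs are placeholders. One restricted group per node pins each application replica to its own host:)

```shell
# Two restricted HA groups, one node each; --restricted 1 means
# resources in the group never run outside the listed nodes
ha-manager groupadd grp-a --nodes pve1 --restricted 1
ha-manager groupadd grp-b --nodes pve2 --restricted 1

# Pin each application-level replica VM to its own group,
# so the two VMs can never end up on the same physical node
ha-manager add vm:101 --group grp-a
ha-manager add vm:102 --group grp-b
```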
 
Hello there,

I know this post is a year old, but I'm curious if things have changed. My situation is that the VM is running critical user interfaces with default states, and I don't want my users to lose their progress or configurations. This could lead to sudden increases in the amplifier, things booting up, and so on. The state of my VM is crucial. A downtime of 1 or 2 minutes is acceptable, but what about the state of the VM, especially the RAM? Until today, we still don't have a way to replicate the RAM state on Ceph for seamless recovery?
 
Hello there,

I know this post is a year old, but I'm curious if things have changed. My situation is that the VM is running critical user interfaces with default states, and I don't want my users to lose their progress or configurations. This could lead to sudden increases in the amplifier, things booting up, and so on. The state of my VM is crucial. A downtime of 1 or 2 minutes is acceptable, but what about the state of the VM, especially the RAM? Until today, we still don't have a way to replicate the RAM state on Ceph for seamless recovery?
No, HA still means 'restart VM on a surviving host in the cluster', meaning you will lose the contents of your RAM.

This is exactly the same way VMware's 'HA' or the Hyper-V Failover role works.

VMware offers an (Enterprise) feature called Fault Tolerance, which keeps a clone of the VM running with continuous syncing of the RAM, but I rarely see anyone use it, so I guess it comes with some downsides. As far as I know, there is a similar feature for KVM in the works, but as of now it can't be done in Proxmox.

There's also a physical constraint: RAM is way faster than any network connection, so you simply cannot synchronously mirror the RAM of a machine to another node. I guess you could replicate a lot of tiny RAM snapshots, but are those really consistent?

In the end, most VMs should be configured so that they can get back to a workable state after a crash, and critical applications that really need 100% uptime usually have software-side solutions for availability.
 
A downtime of 1 or 2 minutes is acceptable, but what about the state of the VM, especially the RAM? Until today, we still don't have a way to replicate the RAM state on Ceph for seamless recovery?
Doing this is difficult, both in terms of software orchestration and hardware requirements. If you need this level of uptime, consider deploying your software in a truly multi-headed fashion instead of a monolithic one; Kubernetes is one way to get there.
 
Hello there,

I know this post is a year old, but I'm curious if things have changed. My situation is that the VM is running critical user interfaces with default states, and I don't want my users to lose their progress or configurations. This could lead to sudden increases in the amplifier, things booting up, and so on. The state of my VM is crucial. A downtime of 1 or 2 minutes is acceptable, but what about the state of the VM, especially the RAM? Until today, we still don't have a way to replicate the RAM state on Ceph for seamless recovery?

If a host loses quorum, it is because it cannot communicate with the other hosts; in such a case it is not obvious that the host will even be able to reach any shared storage. Getting more uptime is not an easy task, considering the cluster cannot communicate with the affected node.
 
If a host loses quorum it is because it cannot communicate with the other hosts, in such a case it is not obvious that the host will be able to reach any shared storage.

That would be for the RAM part I guess.

Getting more uptime is not an easy task considering the cluster cannot communicate with the affected node.

The issue for some deployments is that 2 minutes actually is a long downtime. Getting more uptime means tweaking the fencing timeouts, which PVE does not support changing arbitrarily.
 
