Two-node setup: grey question marks everywhere

DDPF

New Member
Jul 29, 2023
Hi there

So I have two physical devices, both running the Proxmox Virtual Environment hypervisor, but only one node is up 99% of the time. I decided to have two installations of Proxmox VE, rather than Proxmox VE plus Proxmox Backup Server, because I wanted to be able to do maintenance work or try something new on one server without compromising my services. (I can just migrate them to the first server and do my thing on the second one.)

Now the problem is that my main server shows grey question marks on all the icons in the web UI approximately once a month. (Usually after my second server has turned on to receive the replication data of the LXCs and VMs and a ZFS syncoid sync; a restart of my main node, the one that is up 99% of the time, fixes the issue.) I personally only notice the weirdness because my services stop working properly, and that is a problem. To make quorum work I have configured my nodes like this. Might this be the cause of my issue?

Code:
    nodelist {
      node {
        name: node1
        nodeid: 1
        quorum_votes: 2
        ring0_addr: 192.168.0.20
      }
      node {
        name: node2
        nodeid: 2
        quorum_votes: 1
        ring0_addr: 192.168.0.21
      }
    }

And if so, do you have any suggestions on how to run the QDevice on an OpenWrt x86 machine? I would prefer not to add another physical device that runs 24/7.
Thanks for your feedback and time!
 
I do not think it is a problem that you gave your primary node 2 votes. I did this for a long time as well. Now I run a QDevice in a VM on my primary server. When I need to shut down my primary, I migrate the QDevice first, and I make sure the QDevice starts first when I reboot my nodes.

Your problem could be that the corosync state becomes stale. Maybe you could try this:

Code:
service pve-cluster stop
service corosync stop
service pvestatd stop
service pveproxy stop
service pvedaemon stop

Code:
service pve-cluster start
service corosync start
service pvestatd start
service pveproxy start
service pvedaemon start

I execute these to get the nodes synced again. If that does not work, then you may have a different problem. Another reason could be kernel version differences; maybe the secondary does not get updated as regularly as your primary?
 

Thanks for your response. I will try out your suggestion if it occurs again. For that, both nodes need to be online, right? If that doesn't work I will post my results here and try the QDevice-in-a-VM-or-LXC method. I usually update both nodes at the same time, so the kernel version is the same, currently Linux 6.5.11-8-pve. I will keep you updated. Have a nice day.
 

They should both be online, yes. And I would suggest a minimal installation for the VM so you can live migrate it and not have a quorum problem.
 
Now the problem is that my main server shows grey question marks on all the icons in the web UI approximately once a month. [...] To make quorum work I have configured my nodes like this. Might this be the cause of my issue?

This should be unrelated, nevertheless:

Code:
    nodelist {
      node {
        name: node1
        nodeid: 1
        quorum_votes: 2
        ring0_addr: 192.168.0.20
      }
      node {
        name: node2
        nodeid: 2
        quorum_votes: 1
        ring0_addr: 192.168.0.21
      }
    }

And if so, do you have any suggestions on how to run the QDevice on an OpenWrt x86 machine? I would prefer not to add another physical device that runs 24/7.

Do not bother (with the QD), use auto_tie_breaker, have a look here and at the man page:
https://forum.proxmox.com/threads/two-servers-on-cluster-making-second-server-a-master.142339/#post-638706
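
Roughly, and as a sketch only (pick for auto_tie_breaker_node the node that should stay quorate when the two nodes cannot reach each other; see votequorum(5) for the details), the quorum section would look something like:

Code:
quorum {
  provider: corosync_votequorum
  auto_tie_breaker: 1
  auto_tie_breaker_node: 1
}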

Now I run a QDevice in a VM on my primary server. When I need to shut down my primary, I migrate the QDevice first, and I make sure the QDevice starts first when I reboot my nodes.

His current design choice is less bad. A QDevice as a VM within the same cluster defeats the whole purpose of a QD.
 
I took the liberty of filing:

Bug 5273 - Add note to QDevice docs NOT TO run qnetd on any node of the same cluster
 
His current design choice is less bad. A QDevice as a VM within the same cluster defeats the whole purpose of a QD.

I would say it all depends on the use case, the resources and your plan for implementing it. Best practice is of course to use another device. But I do not want to use another device, and @DDPF also does not want to run another machine which consumes extra energy. I mean, it's just a homelab anyway. For a business or critical data, then yes, you should not use a QDevice as a VM. This setup meets my goal. Your goal might be different and that's okay. :) I mean, the worst case is that you have to change the expected votes because the machine, and with it its QDevice, is suddenly not available, and that is done in a few seconds.
 
To make quorum work I have configured my nodes like this. Might this be the cause of my issue?

Code:
    nodelist {
      node {
        name: node1
        nodeid: 1
        quorum_votes: 2
        ring0_addr: 192.168.0.20
      }
      node {
        name: node2
        nodeid: 2
        quorum_votes: 1
        ring0_addr: 192.168.0.21
      }
    }

Side question, do you have HA active on the "cluster"?
 
I would say it all depends on the use case, the resources and your plan for implementing it. Best practice is of course to use another device. But I do not want to use another device, and @DDPF also does not want to run another machine which consumes extra energy. I mean, it's just a homelab anyway. For a business or critical data, then yes, you should not use a QDevice as a VM. This setup meets my goal. Your goal might be different and that's okay. :) I mean, the worst case is that you have to change the expected votes because the machine, and with it its QDevice, is suddenly not available, and that is done in a few seconds.

I did not mean it in a bad way, I was essentially giving a similar reply just yesterday. I think people should be pointed more towards the tie breaker and two node options. Note that two_node implies wait_for_all, which is not what one wants most of the time. There's an elegant way to do two-node clusters with the tie breaker specifying the node that wins. If you have another device around already, a QD is even better; if not, the tie breaker is in fact much better.

EDIT: In my opinion this is a consequence of insufficient docs; they point out some requirements about not running more-than-one-vote nodes in a quorum and then provide only the QDevice as the alternative. You end up doing it by the docs, sort of, but you have another piece of virtual infrastructure to maintain and no benefit (as opposed to the votequorum-supported options for two-node clusters).
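
For comparison, a sketch of the two node variant mentioned above (again, see votequorum(5); two_node normally implies wait_for_all, so it has to be disabled explicitly if that behaviour is not wanted, and this assumes both nodes keep the default single vote rather than the custom quorum_votes: 2 above):

Code:
quorum {
  provider: corosync_votequorum
  two_node: 1
  wait_for_all: 0
}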
 
I did not mean it in a bad way, I was essentially giving a similar reply just yesterday. I think people should be pointed more towards the tie breaker and two node options. Note that two_node implies wait_for_all, which is not what one wants most of the time. There's an elegant way to do two-node clusters with the tie breaker specifying the node that wins. If you have another device around already, a QD is even better; if not, the tie breaker is in fact much better.

I know :) me too :) I would be pleased to learn about that. I mean, you know, when we run a homelab we are the master of architecture (more like master of disaster :p), thinking we are doing it all perfectly lol. You may be right that the tie breaker is better. I never used it, so my question is, can it achieve "HA"? Because one of my goals is: I have a master device running the QDevice, and when the other node comes online during the day, when my solar produces enough power, it moves the VMs to that machine, and vice versa when I need to shut down my master node for node dissection lol
 
I know :) me too :) I would be pleased to learn about that. I mean, you know, when we run a homelab we are the master of architecture (more like master of disaster :p), thinking we are doing it all perfectly lol. You may be right that the tie breaker is better. I never used it, so my question is, can it achieve "HA"?

That's big quotation marks right there. ;)

Because one of my goals is: I have a master device running the QDevice, and when the other node comes online during the day, when my solar produces enough power, it moves the VMs to that machine, and vice versa when I need to shut down my master node for node dissection lol

I think you need to clarify the "it moves" part. I think you move it; I don't think you have any sort of affinity set up so that it auto-moves, correct?

I think you are wondering about doing maintenance on any of the 2 machines at any given time without losing quorum.

1) First of all, why would you need that (VMs keep running while you are inquorate) and;
2) second it would be possible, not with the tie breaker (because that really resembles more the master/slave logic), but with two node option, this would be best used with wait for all (see the docs), but;
3) that would be an issue if you are starting up just 1 node (when there were none running prior).

For your desired scenario you might even consider two node option without wait for all, any disasters caused by that could be, after all, manually resolved afterwards. But see (1) above one more time.
 
I think you need to clarify the "it moves" part. I think you move it; I don't think you have any sort of affinity set up so that it auto-moves, correct?
What I meant by that is that I have HA set up and the VMs that I want to run on the secondary node are "attached" to it via priority. So when the secondary node is offline, the VMs are migrated to the primary, and when the secondary is back online the next day, they automatically go back there.
I think you are wondering about doing maintenance on any of the 2 machines at any given time without losing quorum.

1) First of all, why would you need that (VMs keep running while you are inquorate) and;
It is true that they are running. However, you cannot start or stop a VM/LXC when you are inquorate. This is the part I want to avoid. Sure, you can change the expected votes, but that is my last resort. This is why I created the QDevice, so I can just live migrate it when I need to.

2) second it would be possible, not with the tie breaker (because that really resembles more the master/slave logic), but with two node option, this would be best used with wait for all (see the docs), but;
3) that would be an issue if you are starting up just 1 node (when there were none running prior).
Yep exactly my point.
 
What I meant by that is that I have HA set up and the VMs that I want to run on the secondary node are "attached" to it via priority. So when the secondary node is offline, the VMs are migrated to the primary, and when the secondary is back online the next day, they automatically go back there.
I see, so you indeed were talking about VMs in general, I had originally thought you were referring to the QD. So two notes, I think you are aware of:

1) HA setup for VMs will continue to work as before, as long as the node is quorate;
2) The QD in a VM on that very node (your current setup) is the weakest point with HA active, because if your other node is down (which is most of the time, as I understood) and anything happens to that QD VM, you are going to experience a reboot on that inquorate node; upon reboot, since you start inquorate, that QD VM does not come up, so you will experience ... a reboot, upon reboot, upon ...

However, you cannot start or stop a VM/LXC when you are inquorate.

Correct.

This is the part I want to avoid.

I understand that, you want to be quorate, but having a QD on a node itself does not really give you anything more than "2 votes" for that node (which the OP had already), but see (2) above.

Sure, you can change the expected votes, but that is my last resort.

If you know the other (of the sole two) node is down, it is absolutely safe to do, or you can even just corosync-quorumtool -v 2
and then back, but for what you are doing (i.e. shuffling around which of the two nodes is "quorate on its own" anyhow), you could temporarily change auto_tie_breaker_node: 1 to auto_tie_breaker_node: 2 or vice versa (this relies on the auto_tie_breaker: 1 option being set). I do not think I would bother doing that for maintenance, but you could if it feels better. It's still better than shuffling around a VM with a QD.
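
As a rough sketch only, assuming you run it on the node that stays up and have checked that the other node really is down:

Code:
# temporarily give the local node a second vote while the other node is down
corosync-quorumtool -v 2
# ... do the maintenance on the other node ...
# and back to the original value afterwards
corosync-quorumtool -v 1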

This is why I created the QDevice, so I can just live migrate it when I need to.

How would you "live migrate" if you lost quorum though? ;)

Yep exactly my point.

If you are casually running only one node (at all times, except maintenance), I think your use case is for tie breaker. If you wanted to have it more resilient, the next better setup would be a QD off the cluster, it could be even in a VM, just not within the same cluster. NB QD can be even outside your network, e.g. running as an AWS VM.
 
I see, so you indeed were talking about VMs in general, I had originally thought you were referring to the QD. So two notes, I think you are aware of:

1) HA setup for VMs will continue to work as before, as long as the node is quorate;
2) The QD in a VM on that very node (your current setup) is the weakest point with HA active, because if your other node is down (which is most of the time, as I understood) and anything happens to that QD VM, you are going to experience a reboot on that inquorate node; upon reboot, since you start inquorate, that QD VM does not come up, so you will experience ... a reboot, upon reboot, upon ...
Yes I am aware of that. But it is good that you mentioned it anyways because there might be other users who are not aware.
I understand that, you want to be quorate, but having a QD on a node itself does not really give you anything more than "2 votes" for that node (which the OP had already), but see (2) above.
If something happens to the VM where the QDevice is running, then I just turn on both nodes via WOL. Then they are quorate again :) After that I restore my QDevice VM.
If you know the other (of the sole two) node is down, it is absolutely safe to do, or you can even just corosync-quorumtool -v 2
and then back, but for what you are doing (i.e. shuffling around which of the two nodes is "quorate on its own" anyhow), you could temporarily change auto_tie_breaker_node: 1 to auto_tie_breaker_node: 2 or vice versa (this relies on the auto_tie_breaker: 1 option being set). I do not think I would bother doing that for maintenance, but you could if it feels better. It's still better than shuffling around a VM with a QD.
I should try the tie breaker when I notice my setup is not ideal for me anymore :) I'll keep your suggestion in mind :)
How would you "live migrate" if you lost quorum though? ;)
I wake up the node via WOL. For me this is just a click on my phone, or I use Ansible to automate that wake-up process. I do that already.
If you are casually running only one node (at all times, except maintenance), I think your use case is for tie breaker. If you wanted to have it more resilient, the next better setup would be a QD off the cluster, it could be even in a VM, just not within the same cluster. NB QD can be even outside your network, e.g. running as an AWS VM.
I used to have my QDevice on a Raspberry Pi, but I did not want the extra device, so I removed it. And regarding AWS, I am not really a fan of involving another company, but it can be considered, yes.


I like this discussion. It helps me and hopefully the community to get some ideas. @DDPF sorry for kind of stealing your thread, but I think it is also relevant in your case.
 
@homelabenthusiast No worries, I also like this discussion, so please continue!

This should be unrelated, nevertheless:
What do you think it could be?

I think it could be because of the replication of my VMs and containers. The second node only starts up every other day to back up the primary node with sanoid for my main datasets, except the VM/LXC one. I back up the VMs and containers via the replication function. I decided to do it like this because, if I had done it via sanoid, Proxmox wouldn't recognize it, and if I migrated my services they would start copying the whole VM/container to my second server over the LAN. (I do not have 10 GbE, so it takes a while.)

So now one node is down 99% of the time and only starts up for backups every other day. To keep its uptime minimal, i.e. turn the second node off as soon as everything is backed up, I first implemented something like this in a bash script:

Bash:
# Start replication jobs from Proxmox

jobs=("100-0" "101-0" "102-0" "103-0" "104-0" "105-0" "1001-0")

for job in "${jobs[@]}"; do
    /usr/bin/pvesr run --id "$job"
    echo "Job: $job is done."
done

This didn't work reliably, so now I just have a timer (15 min). Not a good solution and maybe the cause of my problems...
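
If the goal is "replicate, then power the backup node off again", a next iteration could look something like this sketch (assumptions: the script runs on the primary node, passwordless SSH to the backup node is set up, and the backup node is node2 at 192.168.0.21 from the corosync config above; the job IDs are the ones from the disabled script):

Bash:
#!/bin/bash
set -e  # abort (and leave the backup node running) if a replication job fails

jobs=("100-0" "101-0" "102-0" "103-0" "104-0" "105-0" "1001-0")

for job in "${jobs[@]}"; do
    /usr/bin/pvesr run --id "$job"   # pvesr run blocks until the job has finished
    echo "Job: $job is done."
done

# everything replicated, shut the backup node down again
ssh root@192.168.0.21 poweroff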

This should be unrelated, nevertheless:



Do not bother (with the QD), use auto_tie_breaker, have a look here and at the man page:
https://forum.proxmox.com/threads/two-servers-on-cluster-making-second-server-a-master.142339/#post-638706



His current design choice is less bad. A QDevice as a VM within the same cluster defeats the whole purpose of a QD.
Thanks a lot, then I will use auto_tie_breaker.

Side question, do you have HA active on the "cluster"?
Can you please clarify "active"? If I do not have any VMs or containers configured as HA resources, is HA then not active?
 
If something happens to the VM where the QDevice is running, then I just turn on both nodes via WOL. Then they are quorate again :) After that I restore my QDevice VM.

I understand that manual intervention can fix anything, but compared to the scenario where the OP e.g. had a fixed 2 votes on that "preferred" node, the setup with a faux QD is relatively unstable (relative to 2 votes, the tie breaker, or even the two_node flag with or without wfa), because it requires that intervention.

I should try the tie breaker when I notice my setup is not ideal for me anymore :) I'll keep your suggestion in mind :)

I am a master of being blunt here, please do not take it as being confrontational, but strictly speaking the setup you are currently running is hackish and unnecessarily so. The unnecessarily part is important because it does not bring in anything beneficial but brings instability. (See also conclusion below.)

I wake up the node via WOL. For me this is just a click on my phone, or I use Ansible to automate that wake-up process. I do that already.

Other than this being yet another manual intervention, it is also not always possible (e.g. the other node might be dead). Just to be clear, I meant a situation where your node 1 is up and running, your QD VM was running on it, and your node 2 is off/dead. If your QD VM dies, you are inquorate; if your manual intervention does not work, you remain inquorate. Contrast that with the tie breaker, where you would have remained quorate on that node. And that's before HA is considered, where being inquorate essentially causes the "HA" to become "never available", even though one node could provide it.

I used to have my QDevice on a Raspberry Pi, but I did not want the extra device, so I removed it. And regarding AWS, I am not really a fan of involving another company, but it can be considered, yes.

So I think this is a good point (also for the OP) to consider, because generally speaking, having a QD anywhere is really not an issue, not any more than having your ISP route your internet traffic. After all, qnetd is essentially state-free. I mentioned this option because it is sometimes overlooked that there are no low-latency requirements on a QDevice, and lots of people already run something off-site (backups come to mind). A QD outside the network segment where the nodes are is basically even more resilient.
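
For anyone who does want the off-site QD route, the setup is roughly the following (as per the PVE docs; <QDEVICE-IP> being whatever external box or VM you pick):

Code:
# on the external machine
apt install corosync-qnetd

# on every cluster node
apt install corosync-qdevice

# then, on one cluster node
pvecm qdevice setup <QDEVICE-IP>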

With that said, however, in your setup, where there are just 2 nodes and, importantly, one of them is casually off, the QD becomes a source of instability in that any loss of connection to it for the sole remaining node disrupts quorum. This is essentially the same issue as running it in a VM on that very node. If this were providing services only to the outside world, you could consider it benign, but generally one does not want the cluster to become inquorate just because connectivity to a QD was lost. Therefore it would only start making sense from 3+ nodes.

Moreover, it is really defeating the purpose to have anything set up as HA in this setup (with just one node up most of the time), as this essentially makes your node prone to watchdog reboots when there is nowhere to HA-migrate anything in the first place. When manual intervention is taking place anyway, VM migration could just as well be achieved manually (or scripted) without any QD hocus pocus whatsoever, thus not compromising stability.

I like this discussion. It helps me and hopefully the community to get some ideas. @DDPF sorry for kind of stealing your thread, but I think it is also relevant in your case.

In the end, I would sum up that for the above-described setup (when only one of the two nodes is regularly on), the most elegant solution is to:
1) use the tie breaker; or
2) use the two_node flag with wfa disabled.

And sadly, that's about it. If both nodes were regularly on, then adding a QD outside the network segment would be even better. Also, from 3 nodes up, where they are normally quorate (even without the QD), a QD would be beneficial. Adding a QD (hackish or not, local or remote) to an essentially single node is really completely counterproductive.
 
@homelabenthusiast No worries, I also like this discussion, so please continue!

Well we do not want to miss the OP here either. :D

What do you think it could be?

I just do not see how potentially losing connection to the proxy would be related to how many votes you assigned to something. Just to be clear, even if you change the way you go about the corosync setup, this is unlikely to be related to you seeing the greyed-out items, so it is unlikely to resolve it. We are chasing a red herring here in relation to your symptoms.

I think it could be because of the replication of my VMs and containers. The second node only starts up every other day to back up the primary node with sanoid for my main datasets, except the VM/LXC one.

I know I still have not answered your question, so let's do this: can you provide the output of e.g. the following (adjust the timestamps to when your issue occurs):

Code:
journalctl -u pvedaemon -u pveproxy -u pmxcfs -u corosync -u pve-cluster -u pve-ha-crm -u pve-ha-lrm -u watchdog-mux -S "2024-02-17 12:15" -U "2024-02-18 18:45"

I back up the VMs and containers via the replication function. I decided to do it like this because, if I had done it via sanoid, Proxmox wouldn't recognize it, and if I migrated my services they would start copying the whole VM/container to my second server over the LAN. (I do not have 10 GbE, so it takes a while.)

This is more about agreeing on terminology, but I suppose you replicate to replicate, not to back up. You are experiencing these issues when? As per your replication schedule?

So now one node is down 99% of the time and only starts up for backups every other day. To keep its uptime minimal, i.e. turn the second node off as soon as everything is backed up, I first implemented something like this in a bash script:

Bash:
# Start replication jobs from Proxmox

jobs=("100-0" "101-0" "102-0" "103-0" "104-0" "105-0" "1001-0")

for job in "${jobs[@]}"; do
    /usr/bin/pvesr run --id "$job"
    echo "Job: $job is done."
done

This didn't work reliably, so now I just have a timer (15 min). Not a good solution and maybe the cause of my problems...

I am a little confused here, what does pvesr status show?

Thanks a lot, then I will use auto_tie_breaker.

Be sure to read the man page.

Can you please clarify "active"? If I do not have any VMs or containers configured as HA resources, is HA then not active?

Can you post ha-manager status too? :)
 
Well we do not want to miss the OP here either. :D
Neither do I.;)

I just do not see how potentially losing connection to the proxy would be related to how many votes you assigned to something. Just to be clear, even if you change the way you go about the corosync setup, this is unlikely to be related to you seeing the greyed-out items, so it is unlikely to resolve it. We are chasing a red herring here in relation to your symptoms.
Makes sense.
I know I still have not answered your question, so let's do this: can you provide the output of e.g. the following (adjust the timestamps to when your issue occurs):
How can I share logs here without exceeding 2^14 characters?

This is more about agreeing on terminology, but I suppose you replicate to replicate, not to back up. You are experiencing these issues when? As per your replication schedule?
Okay, you got me xD. I actually do not have any backups for the VMs or containers, only replications to my other node. For the VMs that is bad, because if something goes wrong my data there is gone. I need to fix that. The containers access my ZFS filesystem, which is backed up with sanoid, so I do not mind if I lose them. (Though backups would save time on rebuilding.) So I should also fix this. Would you just set up backups from the web UI?

And one more question, technically the replication also saves a copy of the VM or container on my second node, right? So technically in case my primary node dies, I should theoretically be able to bring the service back up on my second node or am I mistaken?

I am a little confused here, what does pvesr status show?
I need to try it again. Will post the output later on.

Be sure to read the man page.
Yes, I did, thanks.

Code:
...
quorum {
  provider: corosync_votequorum
  expected_votes: 2
  auto_tie_breaker: 1
  auto_tie_breaker_node: 2
}
...
Can you post ha-manager status too? :)
And last but not least:

Code:
quorum OK

Thanks for your help!
 
How can I share logs here without exceeding 2^14 characters?

I would suspect an attachment here can hold 16K. If it's massive, tweak the dates; I would only like to see one such time when you lost the connection.

Okay, you got me xD. I actually do not have any backups for the VMs or containers, only replications to my other node. For the VMs that is bad, because if something goes wrong my data there is gone. I need to fix that. The containers access my ZFS filesystem, which is backed up with sanoid, so I do not mind if I lose them.

You can't back up a zvol?
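
(Just to illustrate, a rough sketch with a made-up zvol name and target host; any snapshot, e.g. one taken by sanoid, can be sent just like a dataset:)

Code:
zfs snapshot rpool/data/vm-100-disk-0@manual-backup
zfs send rpool/data/vm-100-disk-0@manual-backup | ssh backuphost zfs receive backup/vm-100-disk-0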

(Though backups would save time on rebuilding.) So I should also fix this. Would you just set up backups from the web UI?

I don't know what to say, to be honest, because I fail to understand the point of e.g. PBS when, with ZFS, it cannot even take advantage of snapshots. Have a look here for some inspiration though:

https://blog.guillaumematheron.fr/2023/261/taking-advantage-of-zfs-for-smarter-proxmox-backups/

And one more question, technically the replication also saves a copy of the VM or container on my second node, right? So technically in case my primary node dies, I should theoretically be able to bring the service back up on my second node or am I mistaken?

Replication makes a replica. So, strictly answering your question: when it dies, you should be able to. The issue with a replica is that you replicate whatever is there, so some people would keep replicating e.g. their ransomware-encrypted Windows VM onto the other node, to have it there as well.

I need to try it again. Will post the output later on.

I basically did not quite understand how you had that script and what for, since there's the built-in replication feature. So I wanted to get a better grasp of what's going on there.

Yes, I did, thanks.

Code:
...
quorum {
  provider: corosync_votequorum
  expected_votes: 2
  auto_tie_breaker: 1
  auto_tie_breaker_node: 2
}
...

I think the expected_votes should be ditched. Did you test this?
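
I.e., something along the lines of this sketch, keeping the rest of what you posted (with a nodelist present, votequorum can calculate the expected votes itself):

Code:
quorum {
  provider: corosync_votequorum
  auto_tie_breaker: 1
  auto_tie_breaker_node: 2
}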

And last but not least:

Code:
quorum OK

So this means your HA is inactive indeed.
 
I would suspect an attachment here can hold 16K. If it's massive, tweak the dates; I would only like to see one such time when you lost the connection.
Okay perfect, cheers. How could I miss the button? xD
I don't know what to say, to be honest, because I fail to understand the point of e.g. PBS when, with ZFS, it cannot even take advantage of snapshots. Have a look here for some inspiration though:

https://blog.guillaumematheron.fr/2023/261/taking-advantage-of-zfs-for-smarter-proxmox-backups/
I know how to back up datasets. I do not use PBS and I am a fan of ZFS. To reframe my question: when I send my newest snapshot from node 1 to my backup node 2, the VM data now exists on both nodes. However, if I want to migrate the VM from node 1 to 2, Proxmox doesn't know that a replica of the VM already exists in its filesystem. What else do I need to do so that Proxmox is aware of the VM copy on node 2?

With replication from the web UI, Proxmox is aware of the copy and instantly migrates.

Your link is also a cool project, cheers, but I would prefer to stick to sanoid.

I think the expected_votes should be ditched. Did you test this?
Yes, it works perfectly, also without the expected_votes.
 

Attachments

  • nas.log (29.1 KB)
