Two-node setup: grey question marks everywhere

DDPF

New Member
Jul 29, 2023
Hi there

So I have two physical devices, both running the Proxmox Virtual Environment hypervisor, but only one node is up 99% of the time. I decided to have two installations of Proxmox VE, rather than Proxmox VE plus Proxmox Backup Server, because I wanted to be able to do maintenance work or try something new on one server without compromising my services. (I can just migrate them to the first server and do my thing on the second one.)

Now the problem is that my main server shows grey question marks on all the icons in the web UI approximately once a month. (Usually after my second server has turned on to receive the replication data of the LXCs and VMs and a ZFS syncoid sync; a restart of my main node, the one that is up 99% of the time, fixes the issue.) I personally only notice the weirdness because my services stop working properly, and that is a problem. To make quorum work I have configured my nodes like this. Might this be the cause of my issue?

Code:
    nodelist {
      node {
        name: node1
        nodeid: 1
        quorum_votes: 2
        ring0_addr: 192.168.0.20
      }
      node {
        name: node2
        nodeid: 2
        quorum_votes: 1
        ring0_addr: 192.168.0.21
      }
    }

And if so, do you have any suggestions on how to run the QDevice on an OpenWrt x86 machine? I would prefer not to add another physical device that runs 24/7.
Thanks for your feedback and time!
 
I do not think it is a problem that you gave your primary node 2 votes. I did this for a long time as well. Now I run a QDevice in a VM on my primary server. When I need to shut down my primary, I migrate the QDevice first, and I make sure the QDevice starts first when I reboot my nodes.

Your problem could be that the corosync state becomes stale. Maybe you could try this:

Code:
service pve-cluster stop
service corosync stop
service pvestatd stop
service pveproxy stop
service pvedaemon stop

Code:
service pve-cluster start
service corosync start
service pvestatd start
service pveproxy start
service pvedaemon start

I execute these to get the nodes synced again. If that does not work, then you may have a different problem. Another reason could be kernel version differences; maybe the secondary does not get updated as regularly as your primary?
 

Thanks for your response. I will try out your suggestion if it occurs again. For that, both nodes need to be online, right? If that doesn't work I will post my results here and try the QDevice-in-a-VM-or-LXC method. I usually update both nodes at the same time, so the kernel version is the same, currently Linux 6.5.11-8-pve. I will keep you updated. Have a nice day.
 

They should both be online, yes. And I would suggest a minimal installation for the VM so you can live migrate it and not have a quorum problem.
 
Now the problem is that my main server shows grey question marks on all the icons in the web UI approximately once a month. [...] To make quorum work I have configured my nodes like this. Might this be the cause of my issue?

This should be unrelated, nevertheless:

Code:
    nodelist {
      node {
        name: node1
        nodeid: 1
        quorum_votes: 2
        ring0_addr: 192.168.0.20
      }
      node {
        name: node2
        nodeid: 2
        quorum_votes: 1
        ring0_addr: 192.168.0.21
      }
    }

And if so, do you have any suggestions on how to run the QDevice on an OpenWrt x86 machine? I would prefer not to add another physical device that runs 24/7.

Do not bother (with the QD), use auto_tie_breaker, have a look here and at the man page:
https://forum.proxmox.com/threads/two-servers-on-cluster-making-second-server-a-master.142339/#post-638706
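
Roughly, and as a sketch only (pick for auto_tie_breaker_node the node that should stay quorate when the two nodes cannot reach each other; see votequorum(5) for the details), the quorum section would look something like:

Code:
quorum {
  provider: corosync_votequorum
  auto_tie_breaker: 1
  auto_tie_breaker_node: 1
}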

Now I run a QDevice in a VM on my primary server. When I need to shut down my primary, I migrate the QDevice first, and I make sure the QDevice starts first when I reboot my nodes.

His current design choice is less bad. A QDevice as a VM within the same cluster defeats the whole purpose of a QD.
 
I took the liberty of filing:

Bug 5273 - Add note to QDevice docs NOT TO run qnetd on any node of the same cluster
 
His current design choice is less bad. A QDevice as a VM within the same cluster defeats the whole purpose of a QD.

I would say it all depends on the use case, the resources and your plan for implementing it. Best practice is of course to use another device. But I do not want to use another device, and @DDPF also does not want to run another machine which consumes extra energy. I mean, it's just a homelab anyway. For a business or critical data, then yes, you should not use a QDevice as a VM. This setup meets my goal. Your goal might be different and that's okay. :) I mean, the worst case is that you have to change the expected votes because the machine, and with it its QDevice, is suddenly not available, and that is done in a few seconds.
 
To make quorum work I have configured my nodes like this. Might this be the cause of my issue?

Code:
    nodelist {
      node {
        name: node1
        nodeid: 1
        quorum_votes: 2
        ring0_addr: 192.168.0.20
      }
      node {
        name: node2
        nodeid: 2
        quorum_votes: 1
        ring0_addr: 192.168.0.21
      }
    }

Side question, do you have HA active on the "cluster"?
 
I would say it all depends on the use case, the resources and your plan for implementing it. Best practice is of course to use another device. But I do not want to use another device, and @DDPF also does not want to run another machine which consumes extra energy. I mean, it's just a homelab anyway. For a business or critical data, then yes, you should not use a QDevice as a VM. This setup meets my goal. Your goal might be different and that's okay. :) I mean, the worst case is that you have to change the expected votes because the machine, and with it its QDevice, is suddenly not available, and that is done in a few seconds.

I did not mean it in a bad way, I was essentially giving a similar reply just yesterday. I think people should be pointed more towards the tie breaker and two node options. Note that two_node implies wait_for_all, which is not what one wants most of the time. There's an elegant way to do two-node clusters with the tie breaker specifying the node that wins. If you have another device around already, a QD is even better; if not, the tie breaker is in fact much better.

EDIT: In my opinion this is a consequence of insufficient docs; they point out some requirements about not running more-than-one-vote nodes in a quorum and then provide only the QDevice as the alternative. You end up doing it by the docs, sort of, but you have another piece of virtual infrastructure to maintain and no benefit (as opposed to the votequorum-supported options for two-node clusters).
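
For comparison, a sketch of the two node variant mentioned above (again, see votequorum(5); two_node normally implies wait_for_all, so it has to be disabled explicitly if that behaviour is not wanted, and this assumes both nodes keep the default single vote rather than the custom quorum_votes: 2 above):

Code:
quorum {
  provider: corosync_votequorum
  two_node: 1
  wait_for_all: 0
}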
 
I did not mean it in a bad way, I was essentially giving a similar reply just yesterday. I think people should be pointed more towards the tie breaker and two node options. Note that two_node implies wait_for_all, which is not what one wants most of the time. There's an elegant way to do two-node clusters with the tie breaker specifying the node that wins. If you have another device around already, a QD is even better; if not, the tie breaker is in fact much better.

I know :) me too :) I would be pleased to learn about that. I mean, you know, when we run a homelab we are the master of architecture (more like master of disaster :p), thinking we are doing it all perfectly lol. You may be right that the tie breaker is better. I never used it, so my question is, can it achieve "HA"? Because one of my goals is: I have a master device running the QDevice, and when the other node comes online during the day, when my solar produces enough power, it moves the VMs to that machine, and vice versa when I need to shut down my master node for node dissection lol
 
I know :) me too :) I would be pleased to learn about that. I mean, you know, when we run a homelab we are the master of architecture (more like master of disaster :p), thinking we are doing it all perfectly lol. You may be right that the tie breaker is better. I never used it, so my question is, can it achieve "HA"?

That's big quotation marks right there. ;)

Because one of my goals is: I have a master device running the QDevice, and when the other node comes online during the day, when my solar produces enough power, it moves the VMs to that machine, and vice versa when I need to shut down my master node for node dissection lol

I think you need to clarify the "it moves" part. I think you move it; I don't think you have any sort of affinity set up so that it auto-moves, correct?

I think you are wondering about doing maintenance on any of the 2 machines at any given time without losing quorum.

1) First of all, why would you need that (VMs keep running while you are inquorate) and;
2) second it would be possible, not with the tie breaker (because that really resembles more the master/slave logic), but with two node option, this would be best used with wait for all (see the docs), but;
3) that would be an issue if you are starting up just 1 node (when there were none running prior).

For your desired scenario you might even consider two node option without wait for all, any disasters caused by that could be, after all, manually resolved afterwards. But see (1) above one more time.
 
I think you need to clarify the "it moves" part. I think you move it; I don't think you have any sort of affinity set up so that it auto-moves, correct?
What I meant by that is that I have HA set up and the VMs that I want to run on the secondary node are "attached" to it via priority. So when the secondary node is offline, the VMs are migrated to the primary, and when the secondary is back online the next day, they automatically go back there.
I think you are wondering about doing maintenance on any of the 2 machines at any given time without losing quorum.

1) First of all, why would you need that (VMs keep running while you are inquorate) and;
It is true that they are running. However, you cannot start or stop a VM/LXC when you are inquorate. This is the part I want to avoid. Sure, you can change the expected votes, but that is my last resort. This is why I created the QDevice, so I can just live migrate it when I need to.

2) second it would be possible, not with the tie breaker (because that really resembles more the master/slave logic), but with two node option, this would be best used with wait for all (see the docs), but;
3) that would be an issue if you are starting up just 1 node (when there were none running prior).
Yep exactly my point.
 
What I meant by that is that I have HA set up and the VMs that I want to run on the secondary node are "attached" to it via priority. So when the secondary node is offline, the VMs are migrated to the primary, and when the secondary is back online the next day, they automatically go back there.
I see, so you indeed were talking about VMs in general, I had originally thought you were referring to the QD. So two notes, I think you are aware of:

1) HA setup for VMs will continue to work as before, as long as the node is quorate;
2) The QD in a VM on that very node (your current setup) is the weakest point with HA active, because if your other node is down (which is most of the time, as I understood) and anything happens to that QD VM, you are going to experience a reboot on that inquorate node; upon reboot, since you start inquorate, that QD VM does not come up, so you will experience ... a reboot, upon reboot, upon ...

However, you cannot start or stop a VM/LXC when you are inquorate.

Correct.

This is the part I want to avoid.

I understand that, you want to be quorate, but having a QD on a node itself does not really give you anything more than "2 votes" for that node (which the OP had already), but see (2) above.

Sure, you can change the expected votes, but that is my last resort.

If you know the other (of the sole two) node is down, it is absolutely safe to do, or you can even just corosync-quorumtool -v 2
and then back, but for what you are doing (i.e. shuffling around which of the two nodes is "quorate on its own" anyhow), you could temporarily change auto_tie_breaker_node: 1 to auto_tie_breaker_node: 2 or vice versa (this relies on the auto_tie_breaker: 1 option being set). I do not think I would bother doing that for maintenance, but you could if it feels better. It's still better than shuffling around a VM with a QD.
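
As a rough sketch only, assuming you run it on the node that stays up and have checked that the other node really is down:

Code:
# temporarily give the local node a second vote while the other node is down
corosync-quorumtool -v 2
# ... do the maintenance on the other node ...
# and back to the original value afterwards
corosync-quorumtool -v 1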

This is why I created the QDevice, so I can just live migrate it when I need to.

How would you "live migrate" if you lost quorum though? ;)

Yep exactly my point.

If you are casually running only one node (at all times, except maintenance), I think your use case is for tie breaker. If you wanted to have it more resilient, the next better setup would be a QD off the cluster, it could be even in a VM, just not within the same cluster. NB QD can be even outside your network, e.g. running as an AWS VM.
 
I see, so you indeed were talking about VMs in general, I had originally thought you were referring to the QD. So two notes, I think you are aware of:

1) HA setup for VMs will continue to work as before, as long as the node is quorate;
2) The QD in a VM on that very node (your current setup) is the weakest point with HA active, because if your other node is down (which is most of the time, as I understood) and anything happens to that QD VM, you are going to experience a reboot on that inquorate node; upon reboot, since you start inquorate, that QD VM does not come up, so you will experience ... a reboot, upon reboot, upon ...
Yes I am aware of that. But it is good that you mentioned it anyways because there might be other users who are not aware.
I understand that, you want to be quorate, but having a QD on a node itself does not really give you anything more than "2 votes" for that node (which the OP had already), but see (2) above.
If something happens to the VM where the QDevice is running, then I just turn on both nodes via WOL. Then they are quorate again :) After that I restore my QDevice VM.
If you know the other (of the sole two) node is down, it is absolutely safe to do, or you can even just corosync-quorumtool -v 2
and then back, but for what you are doing (i.e. shuffling around which of the two nodes is "quorate on its own" anyhow), you could temporarily change auto_tie_breaker_node: 1 to auto_tie_breaker_node: 2 or vice versa (this relies on the auto_tie_breaker: 1 option being set). I do not think I would bother doing that for maintenance, but you could if it feels better. It's still better than shuffling around a VM with a QD.
I should try the tie breaker when I notice my setup is not ideal for me anymore :) I'll keep your suggestion in mind :)
How would you "live migrate" if you lost quorum though? ;)
I wake up the node via WOL. For me this is just a click on my phone, or I use Ansible to automate that wake-up process. I do that already.
If you are casually running only one node (at all times, except maintenance), I think your use case is for tie breaker. If you wanted to have it more resilient, the next better setup would be a QD off the cluster, it could be even in a VM, just not within the same cluster. NB QD can be even outside your network, e.g. running as an AWS VM.
I used to have my QDevice on a Raspberry Pi, but I did not want the extra device, so I removed it. And regarding AWS, I am not really a fan of involving another company, but it can be considered, yes.


I like this discussion. It helps me and hopefully the community to get some ideas. @DDPF sorry for kind of stealing your thread, but I think it is also relevant in your case.
 
@homelabenthusiast No worries, I also like this discussion, so please continue!

This should be unrelated, nevertheless:
What do you think it could be?

I think it could be because of the replication of my VMs and containers. The second node only starts up every other day to back up the primary node with sanoid for my main datasets, except the VM/LXC one. I back up the VMs and containers via the replication function. I decided to do it like this because, if I had done it via sanoid, Proxmox wouldn't recognize it, and if I migrated my services they would start copying the whole VM/container to my second server over the LAN. (I do not have 10 GbE, so it takes a while.)

So now one node is down 99% of the time and only starts up for backups every other day. To keep its uptime minimal, i.e. turn the second node off as soon as everything is backed up, I first implemented something like this in a bash script:

Bash:
# Start replication jobs from Proxmox

jobs=("100-0" "101-0" "102-0" "103-0" "104-0" "105-0" "1001-0")

for job in "${jobs[@]}"; do
    /usr/bin/pvesr run --id "$job"
    echo "Job: $job is done."
done

This didn't work reliably, so now I just have a timer (15 min). Not a good solution and maybe the cause of my problems...
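
If the goal is "replicate, then power the backup node off again", a next iteration could look something like this sketch (assumptions: the script runs on the primary node, passwordless SSH to the backup node is set up, and the backup node is node2 at 192.168.0.21 from the corosync config above; the job IDs are the ones from the disabled script):

Bash:
#!/bin/bash
set -e  # abort (and leave the backup node running) if a replication job fails

jobs=("100-0" "101-0" "102-0" "103-0" "104-0" "105-0" "1001-0")

for job in "${jobs[@]}"; do
    /usr/bin/pvesr run --id "$job"   # pvesr run blocks until the job has finished
    echo "Job: $job is done."
done

# everything replicated, shut the backup node down again
ssh root@192.168.0.21 poweroff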

This should be unrelated, nevertheless:



Do not bother (with the QD), use auto_tie_breaker, have a look here and at the man page:
https://forum.proxmox.com/threads/two-servers-on-cluster-making-second-server-a-master.142339/#post-638706



His current design choice is less bad. A QDevice as a VM within the same cluster defeats the whole purpose of a QD.
Thanks a lot, then I will use auto_tie_breaker.

Side question, do you have HA active on the "cluster"?
Can you please clarify "active"? If I do not have any VMs or containers configured as HA resources, is HA then not active?
 
If something happens to the VM where the QDevice is running, then I just turn on both nodes via WOL. Then they are quorate again :) After that I restore my QDevice VM.

I understand that manual intervention can fix anything, but compared to the scenario where the OP e.g. had a fixed 2 votes on that "preferred" node, the setup with a faux QD is relatively unstable (relative to 2 votes, the tie breaker, or even the two_node flag with or without wfa), because it requires that intervention.

I should try the tie breaker when I notice my setup is not ideal for me anymore :) I'll keep your suggestion in mind :)

I am a master of being blunt here, please do not take it as being confrontational, but strictly speaking the setup you are currently running is hackish and unnecessarily so. The unnecessarily part is important because it does not bring in anything beneficial but brings instability. (See also conclusion below.)

I wake up the node via WOL. For me this is just a click on my phone, or I use Ansible to automate that wake-up process. I do that already.

Other than this being yet another manual intervention, it is also not always possible (e.g. the other node might be dead). Just to be clear, I meant a situation where your node 1 is up and running, your QD VM was running on it, and your node 2 is off/dead. If your QD VM dies, you are inquorate; if your manual intervention does not work, you remain inquorate. Contrast that with the tie breaker, where you would have remained quorate on that node. And that's before HA is considered, where being inquorate essentially causes the "HA" to become "never available", even though one node could provide it.

I used to have my QDevice on a Raspberry Pi, but I did not want the extra device, so I removed it. And regarding AWS, I am not really a fan of involving another company, but it can be considered, yes.

So I think this is a good point (also for the OP) to consider, because generally speaking, having a QD anywhere is really not an issue, not any more than having your ISP route your internet traffic. After all, qnetd is essentially state-free. I mentioned this option because it is sometimes overlooked that there are no low-latency requirements on a QDevice, and lots of people already run something off-site (backups come to mind). A QD outside the network segment where the nodes are is basically even more resilient.
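
For anyone who does want the off-site QD route, the setup is roughly the following (as per the PVE docs; <QDEVICE-IP> being whatever external box or VM you pick):

Code:
# on the external machine
apt install corosync-qnetd

# on every cluster node
apt install corosync-qdevice

# then, on one cluster node
pvecm qdevice setup <QDEVICE-IP>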

With that said, however, in your setup, where there are just 2 nodes and, importantly, one of them is casually off, the QD becomes a source of instability in that any loss of connection to it for the sole remaining node disrupts quorum. This is essentially the same issue as running it in a VM on that very node. If this were providing services only to the outside world, you could consider it benign, but generally one does not want the cluster to become inquorate just because connectivity to a QD was lost. Therefore it would only start making sense from 3+ nodes.

Moreover, it is really defeating the purpose to have anything set up as HA in this setup (with just one node up most of the time), as this essentially makes your node prone to watchdog reboots when there is nowhere to HA-migrate anything in the first place. When manual intervention is taking place anyway, VM migration could just as well be achieved manually (or scripted) without any QD hocus pocus whatsoever, thus not compromising stability.

I like this discussion. It helps me and hopefully the community to get some ideas. @DDPF sorry for kind of stealing your thread, but I think it is also relevant in your case.

In the end, I would sum up that for the above-described setup (when only one of the two nodes is regularly on), the most elegant solution is to:
1) use the tie breaker; or
2) use the two_node flag with wfa disabled.

And sadly, that's about it. If both nodes were regularly on, then adding a QD outside the network segment would be even better. Also, from 3 nodes up, where they are normally quorate (even without the QD), a QD would be beneficial. Adding a QD (hackish or not, local or remote) to an essentially single node is really completely counterproductive.
 
@homelabenthusiast No worries, I also like this discussion, so please continue!

Well we do not want to miss the OP here either. :D

What do you think it could be?

I just do not see how potentially losing connection to the proxy would be related to how many votes you assigned to something. Just to be clear, even if you change the way you go about the corosync setup, this is unlikely to be related to you seeing the greyed-out items, so it is unlikely to resolve it. We are chasing a red herring here in relation to your symptoms.

I think it could be because of the replication of my VMs and containers. The second node only starts up every other day to back up the primary node with sanoid for my main datasets, except the VM/LXC one.

I know I still have not answered your question, so let's do this: can you provide the output of e.g. the following (adjust the timestamps to when your issue occurs):

Code:
journalctl -u pvedaemon -u pveproxy -u pmxcfs -u corosync -u pve-cluster -u pve-ha-crm -u pve-ha-lrm -u watchdog-mux -S "2024-02-17 12:15" -U "2024-02-18 18:45"

I back up the VMs and containers via the replication function. I decided to do it like this because, if I had done it via sanoid, Proxmox wouldn't recognize it, and if I migrated my services they would start copying the whole VM/container to my second server over the LAN. (I do not have 10 GbE, so it takes a while.)

This is more about agreeing on terminology, but I suppose you replicate to replicate, not to back up. You are experiencing these issues when? As per your replication schedule?

So now one node is down 99% of the time and only starts up for backups every other day. To keep its uptime minimal, i.e. turn the second node off as soon as everything is backed up, I first implemented something like this in a bash script:

Bash:
# Start replication jobs from Proxmox

jobs=("100-0" "101-0" "102-0" "103-0" "104-0" "105-0" "1001-0")

for job in "${jobs[@]}"; do
    /usr/bin/pvesr run --id "$job"
    echo "Job: $job is done."
done

This didn't work reliably, so now I just have a timer (15 min). Not a good solution and maybe the cause of my problems...

I am a little confused here, what does pvesr status show?

Thanks a lot, then I will use auto_tie_breaker.

Be sure to read the man page.

Can you please clarify "active"? If I do not have any VMs or containers configured as HA resources, is HA then not active?

Can you post ha-manager status too? :)
 
Well we do not want to miss the OP here either. :D
Neither do I.;)

I just do not see how potentially losing connection to the proxy would be related to how many votes you assigned to something. Just to be clear, even if you change the way you go about the corosync setup, this is unlikely to be related to you seeing the greyed-out items, so it is unlikely to resolve it. We are chasing a red herring here in relation to your symptoms.
Makes sense.
I know I still have not answered your question, so let's do this: can you provide the output of e.g. the following (adjust the timestamps to when your issue occurs):
How can I share logs here without exceeding 2^14 characters?

This is more about agreeing on terminology, but I suppose you replicate to replicate, not to back up. You are experiencing these issues when? As per your replication schedule?
Okay, you got me xD. I actually do not have any backups for the VMs or containers, only replications to my other node. For the VMs that is bad, because if something goes wrong my data there is gone. I need to fix that. The containers access my ZFS filesystem, which is backed up with sanoid, so I do not mind if I lose them. (Though backups would save time on rebuilding.) So I should also fix this. Would you just set up backups from the web UI?

And one more question, technically the replication also saves a copy of the VM or container on my second node, right? So technically in case my primary node dies, I should theoretically be able to bring the service back up on my second node or am I mistaken?

I am a little confused here, what does pvesr status show?
I need to try it again. Will post the output later on.

Be sure to read the man page.
Yes, I did, thanks.

Code:
...
quorum {
  provider: corosync_votequorum
  expected_votes: 2
  auto_tie_breaker: 1
  auto_tie_breaker_node: 2
}
...
Can you post ha-manager status too? :)
And last but not least:

Code:
quorum OK

Thanks for your help!
 
How can I share logs here without exceeding 2^14 characters?

I would suspect an attachment here can hold 16K. If it's massive, tweak the dates; I would only like to see one such time when you lost the connection.

Okay, you got me xD. I actually do not have any backups for the VMs or containers, only replications to my other node. For the VMs that is bad, because if something goes wrong my data there is gone. I need to fix that. The containers access my ZFS filesystem, which is backed up with sanoid, so I do not mind if I lose them.

You can't back up a zvol?
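
(Just to illustrate, a rough sketch with a made-up zvol name and target host; any snapshot, e.g. one taken by sanoid, can be sent just like a dataset:)

Code:
zfs snapshot rpool/data/vm-100-disk-0@manual-backup
zfs send rpool/data/vm-100-disk-0@manual-backup | ssh backuphost zfs receive backup/vm-100-disk-0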

(Though backups would save time on rebuilding.) So I should also fix this. Would you just set up backups from the web UI?

I don't know what to say, to be honest, because I fail to understand the point of e.g. PBS when, with ZFS, it cannot even take advantage of snapshots. Have a look here for some inspiration though:

https://blog.guillaumematheron.fr/2023/261/taking-advantage-of-zfs-for-smarter-proxmox-backups/

And one more question, technically the replication also saves a copy of the VM or container on my second node, right? So technically in case my primary node dies, I should theoretically be able to bring the service back up on my second node or am I mistaken?

Replication makes a replica. So, strictly answering your question: when it dies, you should be able to. The issue with a replica is that you replicate whatever is there, so some people would keep replicating e.g. their ransomware-encrypted Windows VM onto the other node, to have it there as well.

I need to try it again. Will post the output later on.

I basically did not quite understand how you had that script and what for, since there's the built-in replication feature. So I wanted to get a better grasp of what's going on there.

Yes, I did, thanks.

Code:
...
quorum {
  provider: corosync_votequorum
  expected_votes: 2
  auto_tie_breaker: 1
  auto_tie_breaker_node: 2
}
...

I think the expected_votes should be ditched. Did you test this?
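
I.e., something along the lines of this sketch, keeping the rest of what you posted (with a nodelist present, votequorum can calculate the expected votes itself):

Code:
quorum {
  provider: corosync_votequorum
  auto_tie_breaker: 1
  auto_tie_breaker_node: 2
}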

And last but not least:

Code:
quorum OK

So this means your HA is inactive indeed.
 
I would suspect an attachment here can hold 16K. If it's massive, tweak the dates; I would only like to see one such time when you lost the connection.
Okay perfect, cheers. How could I miss the button? xD
I don't know what to say, to be honest, because I fail to understand the point of e.g. PBS when, with ZFS, it cannot even take advantage of snapshots. Have a look here for some inspiration though:

https://blog.guillaumematheron.fr/2023/261/taking-advantage-of-zfs-for-smarter-proxmox-backups/
I know how to back up datasets. I do not use PBS and I am a fan of ZFS. To reframe my question: when I send my newest snapshot from node 1 to my backup node 2, the VM data now exists on both nodes. However, if I want to migrate the VM from node 1 to 2, Proxmox doesn't know that a replica of the VM already exists in its filesystem. What else do I need to do so that Proxmox is aware of the VM copy on node 2?

With replication from the web UI, Proxmox is aware of the copy and instantly migrates.

Your link is also a cool project, cheers, but I would prefer to stick to sanoid.

I think the expected_votes should be ditched. Did you test this?
Yes, it works perfectly, also without the expected_votes.
 

Attachments

  • nas.log (29.1 KB)
