Two-node setup: grey question marks everywhere

Okay perfect, cheers. How could I miss the button? xD

So this is interesting. I suppose this log is from the node which you have to take down to get the GUI working again, i.e. the same node that is on most of the time. I can see that e.g. on Feb 27 at 4:30 the other node was shut down in an orderly fashion and checked out just fine; past 9am you then got into the GUI and shut this node down manually. I assume this is because you were experiencing the said issue; however, there is nothing in the log that would suggest anything wrong with quorum to the point where it would be e.g. congesting the network. This is excellent, because there is nothing wrong there. :D

I think I should have asked this before, but how about when you SSH in? Can you start/stop VMs, etc.? What exactly are your symptoms other than the "grey GUI"? Do you get the same issues in the terminal? (Just to be clear, I would prefer you literally SSH in, not use the GUI pty.)
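
Next time it happens, something along these lines over SSH would already tell us a lot (the guest IDs are placeholders, pick one of your actual guests):

    qm list        # do the VMs still respond to the CLI?
    pct list       # same for containers
    qm start 100   # can you start/stop a guest?
    pct stop 101
    systemctl status pvestatd pveproxy pvedaemon pve-cluster   # grey question marks usually mean pvestatd stopped reporting
    pvecm status   # quorum view from that node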

I know how to back up datasets. I do not use PBS and I am a fan of ZFS. To reframe my question: when I send my newest snapshot from node 1 to my backup node 2, the VM data now exists on both nodes.

You do just zfs send | receive? I will be honest, I never tried to do this across nodes as a means of replication, only in and out (from/to another place). I had not considered it before, but ...

However, if I want to migrate the VM from node 1 to node 2, Proxmox doesn't know that a replica of the VM already exists in its filesystem.

What are the error messages when you attempt that? Maybe I do not know enough about the internals of that mechanism (yet), but ...
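
(Just so we are talking about the same attempt, I mean the point where you run something like the following, or hit Migrate in the GUI; the IDs and node name here are made up:)

    qm migrate 100 node2 --with-local-disks    # VM with local (ZFS) disks
    pct migrate 101 node2 --restart            # container, restart-mode migration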

Now what else do I need to do so that Proxmox is aware of the VM copy on node 2? With replication from the web UI, Proxmox is aware of the copy and instantly migrates.

I am not suggesting it's the only way, but why not use the built-in tool? It's not about "doing it from the GUI"; that is just a front-end for the API, as is pvesr [1] if you prefer the CLI.
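
For completeness, from the CLI it is just a couple of commands; a rough sketch, with a made-up job ID 100-0 and a target node called "node2":

    pvesr create-local-job 100-0 node2 --schedule '*/15'   # replicate guest 100 to node2 every 15 minutes
    pvesr status                                           # last sync, duration, failures
    pvesr schedule-now 100-0                               # kick off a run immediately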

When I look at the pvesr source [2], it's not that exciting (sorry :D). It uses PVE::API2::Replication [3], which gets to use PVE::Replication [4], which in the end just calls PVE::Storage::storage_migrate [5], which essentially does pvesm import [6] on the other end and feeds it with pvesm export [7], with some accounting for the snapshot situation. That's interesting. I might be a bit off now, but there's some hocus pocus with activate_volumes [8] there too.

[1] https://pve.proxmox.com/pve-docs/pvesr.1.html
[2] https://github.com/proxmox/pve-manager/blob/master/PVE/CLI/pvesr.pm
[3] https://github.com/proxmox/pve-mana...f0590ddeb58fab1ad/PVE/API2/Replication.pm#L39
[4] https://github.com/proxmox/pve-gues...9b3172296c38058d0/src/PVE/Replication.pm#L220
[5] https://github.com/proxmox/pve-stor...3e84d0a18f95de0322168/src/PVE/Storage.pm#L702
[6] https://github.com/proxmox/pve-stor...3e84d0a18f95de0322168/src/PVE/Storage.pm#L820
[7] https://github.com/proxmox/pve-stor...3e84d0a18f95de0322168/src/PVE/Storage.pm#L743
[8] https://github.com/proxmox/pve-stor...de0322168/src/PVE/Storage.pm#L1200C5-L1200C21
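
Just to make that chain concrete, on a ZFS-backed storage the export/import pair ends up being roughly a pipeline like this (only a sketch; the storage/volume names and exact flags are assumptions, not taken from your setup):

    pvesm export local-zfs:vm-100-disk-0 zfs - --with-snapshots 1 \
      | ssh root@node2 pvesm import local-zfs:vm-100-disk-0 zfs - --with-snapshots 1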

I think I have digressed quite a bit by now (truth is, I have to go for now), but it's not doing zfs send | receive for sure. I would look at the resulting datasets and the snapshot names/attributes. But still, it's interesting that this would have something to do with your greyed-out GUI on the node you replicate from...
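
If you want to compare what the two mechanisms leave behind, something like this should show the difference (the dataset name is a placeholder); if I remember right, pvesr creates snapshots named __replicate_<jobid>_<timestamp>__ while syncoid uses its own syncoid_* names:

    zfs list -t snapshot -o name,creation -r rpool/data/vm-100-disk-0   # snapshot names on the replicated volume
    zfs get guid,origin,creation rpool/data/vm-100-disk-0               # attributes of the dataset itself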

Your link is also a cool project, cheers, but I would prefer to stick to sanoid.

So you want to be replicating to the other node and then send|recv out, correct?

Yes, works perfectly, also without the expected votes set.

Alright, I was not sure off the top of my head. The expected votes value is implied from the nodes list unless explicitly specified, but it is possible the tie-breaker has overridden it for you. The result should be that even one node with one vote is quorate. You set your node votes back to 1 each, correct?
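
To double-check what corosync actually ended up with, this is where I would look (just a sketch of the relevant places):

    pvecm status                 # look at "Expected votes", "Total votes" and "Quorum"
    cat /etc/pve/corosync.conf   # the quorum { } section and the per-node quorum_votes in nodelist { }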
 
So this is interesting. I suppose this log is from the node which you have to take down to get the GUI working again, i.e. the same node that is on most of the time. I can see that e.g. on Feb 27 at 4:30 the other node was shut down in an orderly fashion and checked out just fine; past 9am you then got into the GUI and shut this node down manually. I assume this is because you were experiencing the said issue; however, there is nothing in the log that would suggest anything wrong with quorum to the point where it would be e.g. congesting the network. This is excellent, because there is nothing wrong there.
Yes, that's correct. It's the log of my primary node.
I think I should have asked this before, but how about when you SSH in? Can you start/stop VMs, etc.? What exactly are your symptoms other than the "grey GUI"? Do you get the same issues in the terminal? (Just to be clear, I would prefer you literally SSH in, not use the GUI pty.)
I wouldn't even have noticed if it was just the greyed-out icons. On the morning of the 27th of February I noticed that in my Home Assistant panel some sensor data was out of date and had last been updated two days ago. I thought that was odd; why isn't it updating? Then on the train I tried to connect to my node via WireGuard VPN, or at least I thought I was connected, but it wasn't working at all, no connection whatsoever. Then I used my other WireGuard VPN to look into where the problem was, logged into the web UI and saw the greyed-out question marks. I had this before and I know that a reboot fixes it, so I initiated a reboot. After about 10 minutes everything was back to normal again. Last time I had the same issue, I noticed that I just could not access my services. It seems like the network is not working properly.

I always SSH in and almost never use the GUI pty, but I cannot remember if starting and stopping VMs and containers worked. I would need to try that out if the issue comes back.

You do just zfs send | receive? I will be honest, I never tried to do this across nodes as a means of replication, only in and out (from/to another place). I had not considered it before, but ...
Yes, I use syncoid; under the hood I think it is more sophisticated than just zfs send and receive, but the result is the same.
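
For reference, a syncoid run and its rough raw-zfs equivalent look like this (dataset names are placeholders, not my actual layout):

    syncoid rpool/data/vm-100-disk-0 root@node2:rpool/data/vm-100-disk-0
    # roughly equivalent to an incremental send of everything since the last common snapshot:
    zfs send -I rpool/data/vm-100-disk-0@older rpool/data/vm-100-disk-0@newest \
      | ssh root@node2 zfs receive rpool/data/vm-100-disk-0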

What are the error messages when you attempt that? Maybe I do not know enough about the internals of that mechanism (yet), but ...
So there is no error message, but the last time I tried that it just started copying the whole VM/container over the network to my second node, ignoring the already existing replicas. I do not understand why.
I am not suggesting it's the only way, but why not use the built-in tool? It's not about "doing it from the GUI"; that is just a front-end for the API, as is pvesr [1] if you prefer the CLI.

When I look at the pvesr source [2], it's not that exciting (sorry :D). It uses PVE::API2::Replication [3], which gets to use PVE::Replication [4], which in the end just calls PVE::Storage::storage_migrate [5], which essentially does pvesm import [6] on the other end and feeds it with pvesm export [7], with some accounting for the snapshot situation. That's interesting. I might be a bit off now, but there's some hocus pocus with activate_volumes [8] there too.

I think I have digressed quite a bit by now (truth is, I have to go for now), but it's not doing zfs send | receive for sure. I would look at the resulting datasets and the snapshot names/attributes. But still, it's interesting that this would have something to do with your greyed-out GUI on the node you replicate from...
That's a good point, thanks for providing the sources, I will have a look around! :)
So you want to be replicating to the other node and then send|recv out, correct?
Yes, so that migration just works for the VMs and containers.

Alright, I was not sure off the top of my head. The expected votes value is implied from the nodes list unless explicitly specified, but it is possible the tie-breaker has overridden it for you. The result should be that even one node with one vote is quorate. You set your node votes back to 1 each, correct?
Yes, that's exactly what I did. But the expected votes value wasn't there before :)...
 
(screenshot attached)

The problem is back. What kind of logs can I share to settle this issue?
 

Attachments

  • nas.log (71.9 KB)
