Guys I am in a real mess. And I am not sure how I get in this situation. And I need to recover from this, please help.
Proxmox Setup Details:
Proxmox Cluster: Total 2 Nodes
Node 1 - Primary
Node 2 - Secondary
Versions:
Node 1: 8.4.17 (broke after this update)
Node 2: 8.4.14 (healthy, didn’t update)
How it all started:
Everything was working fine. I logged in to my cluster and saw that Node 1 was showing question marks on the node itself and on its workloads. (This issue is known to me, and I planned to fix it later by adding a QDevice. It happens because Node 1 hosts my firewall: after updates, if Node 2 starts up before the firewall has loaded, quorum is lost.)
Reading the forums and elsewhere, I found it's a quorum issue. I restarted the corosync service to fix that and verified that the cluster is now quorate. But the issue persisted.
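For reference, these are the commands I used for that step (on a standard Proxmox VE node; exact output will vary):

```shell
# Restart corosync on the affected node
systemctl restart corosync

# Check cluster membership and quorum; "Quorate: Yes" in the
# Votequorum section confirms the cluster is quorate again
pvecm status
```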
Then I found that the pvestatd, pvedaemon, and pve-cluster services should be restarted as well. Once I did, nothing changed; the issue persisted.
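Concretely, I restarted these units (the names below are the standard unit names on a default install):

```shell
# pve-cluster provides /etc/pve (pmxcfs), pvestatd feeds the status
# shown in the GUI, and pvedaemon is the local API worker
systemctl restart pve-cluster
systemctl restart pvestatd
systemctl restart pvedaemon
```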
Then I thought I would just restart the node and things would be fine.
I saw there were some updates pending on the node, so I decided to install the updates first and then restart cleanly.
This destroyed the node. The workloads are still running fine, but the node now appears as offline, and I have several issues on it.
During the update, several packages, particularly the zfs and pve-manager packages, started throwing the following errors:
Code:
Reload daemon failed: Transport endpoint is not connected
Failed to get unit file state for pvedaemon.service: Transport endpoint is not connected
Failed to get unit file state for pveproxy.service: Transport endpoint is not connected
Failed to get unit file state for spiceproxy.service: Transport endpoint is not connected
Failed to get unit file state for pvestatd.service: Transport endpoint is not connected
Failed to get unit file state for pvebanner.service: Transport endpoint is not connected
Failed to get unit file state for pvescheduler.service: Transport endpoint is not connected
Failed to get unit file state for pve-daily-update.timer: Transport endpoint is not connected
as well as some zfs errors.
Now everything is stuck here. What I have tried:
- dpkg --configure -a (first time it resolved the zfs errors, but it can't fix the pve-manager installation)
- systemctl restart pve-cluster (this returns the following error: Failed to get properties: Transport endpoint is not connected)
- apt clean
- apt upgrade (it tells me to run dpkg --configure -a again)
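The loop I am stuck in, as a shell session (error text copied from my terminal):

```shell
# Try to finish the interrupted package configuration
dpkg --configure -a
# -> fixed the zfs errors the first time, but pve-manager still fails to configure

systemctl restart pve-cluster
# -> Failed to get properties: Transport endpoint is not connected

apt clean
apt upgrade
# -> tells me to run "dpkg --configure -a" again
```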
So I am essentially stuck in a loop. I have not restarted the system and am wary of doing so. I can still SSH into the system, though the GUI is not usable. I can connect to the secondary node and everything works fine there.
Please let me know what to do. Please note that Node 1 is running very important workloads (OPNSense, HomeAssistant MQTT), so I really don't want to reinstall this node.
Any help will be appreciated.
Best Regards,
Muhammad Ayub.