[SOLVED] Upgrade stuck on pve-ha-lrm.service

ufoludek

Hello there, I've been running a two-node cluster (no HA) for the past 6 months. Yesterday I decided it was time to upgrade both nodes since a new version was available, so I started the upgrade on both of them at the same time. One node updated perfectly while the other was still updating, and in the meantime I rebooted the one that had finished. What I then discovered was that the first node had gotten stuck mid-update, which forced me to hard reboot the server. Today, after some struggle, I managed to boot the machine, but it still had 6 packages left to install. I ran a full-upgrade and it has been stuck for a long time on pve-ha-lrm.service. Any tips on how I can push this update through? I have physical access to the server and can reach it via SSH, but logging in via the web GUI is not possible, as it throws an error even though the password is correct. Any help would be very much appreciated.
Thanks in advance.
 
What is the current status of pve-ha-lrm on that node? (please adjust the since date, if necessary)

Code:
systemctl status pve-ha-lrm
journalctl -u pve-ha-lrm --since "2023-01-11"

Additionally, what is the current status of your cluster?

Code:
pvecm status


Running a two-node cluster without a QDevice [1] is highly discouraged: if one node fails, the cluster loses quorum, and you can no longer perform any actions that require quorum.

[1] https://pve.proxmox.com/wiki/Cluster_Manager#_corosync_external_vote_support
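
For reference, adding a QDevice to an existing two-node cluster is usually a small job. A minimal sketch, assuming a third machine that only provides the extra vote (any small Debian-based host reachable from both nodes; the 192.0.2.10 address below is a placeholder):

Code:
# on the external host: install the vote daemon
apt install corosync-qnetd

# on both cluster nodes: install the QDevice client
apt install corosync-qdevice

# on one cluster node: register the external host as the QDevice
pvecm qdevice setup 192.0.2.10

Afterwards, pvecm status should report three expected votes, so the cluster stays quorate while one node is down.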
 
pve-ha-lrm status

Bash:
pve-ha-lrm.service - PVE Local HA Resource Manager Daemon
     Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2023-01-12 10:16:59 CET; 39min ago
    Process: 961137 ExecStart=/usr/sbin/pve-ha-lrm start (code=exited, status=0/SUCCESS)
   Main PID: 961288 (pve-ha-lrm)
      Tasks: 1 (limit: 154506)
     Memory: 94.3M
        CPU: 1.154s
     CGroup: /system.slice/pve-ha-lrm.service
             └─961288 pve-ha-lrm

Jan 12 10:16:58 pve-1 systemd[1]: Starting PVE Local HA Resource Manager Daemon...
Jan 12 10:16:59 pve-1 pve-ha-lrm[961288]: starting server
Jan 12 10:16:59 pve-1 pve-ha-lrm[961288]: status change startup => wait_for_agent_lock
Jan 12 10:16:59 pve-1 systemd[1]: Started PVE Local HA Resource Manager Daemon.
Jan 12 10:45:57 pve-1 pve-ha-lrm[961288]: unable to write lrm status file - closing file '/etc/pve/nodes/pve-1/lrm_status.tmp.961288' failed - Permission denied
Jan 12 10:46:02 pve-1 pve-ha-lrm[961288]: loop take too long (1743 seconds)


Bash:
journalctl -u -pve-ha-lrm --since "2023-01-11"
-- Journal begins at Wed 2022-10-19 15:26:33 CEST, ends at Thu 2023-01-12 10:59:05 CET. --
-- No entries --
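
Side note: the unit name in that journalctl call picked up a stray leading hyphen (-pve-ha-lrm), which is why it returns no entries; the intended query would be:

Code:
journalctl -u pve-ha-lrm --since "2023-01-11"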

After a long time the update finally "pushed through", but unfortunately it finished with a bunch of errors. More specifically:
Bash:
Errors were encountered while processing:
pve-manager
proxmox-ve
pve-ha-manager


Bash:
pvecm status
Cluster information
-------------------
Name:             pve-cluster
Config Version:   2
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Jan 12 11:02:21 2023
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000002
Ring ID:          1.964
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.18.6.7
0x00000002          1 19.16.13.7 (local)

Yeah, I know about the issue of running without a QDevice; now I kinda regret :\ creating this cluster in the first place. Two individual hosts would've caused a lot fewer problems.
 
Can you provide me with the output of the whole syslog?
Code:
journalctl --since "2023-01-12" > output.txt
 
It looks like your cluster network (used by corosync) is experiencing serious issues. There are many messages in the syslog indicating that corosync is constantly losing connection to the other node:

Code:
Jan 12 13:32:07 pve-1 corosync[1563555]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 12 13:34:11 pve-1 corosync[1563555]:   [KNET  ] link: host: 1 link: 0 is down
Jan 12 13:34:11 pve-1 corosync[1563555]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 12 13:34:11 pve-1 corosync[1563555]:   [KNET  ] host: host: 1 has no active links
Jan 12 13:34:13 pve-1 corosync[1563555]:   [KNET  ] rx: host: 1 link: 0 is up
Jan 12 13:34:13 pve-1 corosync[1563555]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Jan 12 13:34:13 pve-1 corosync[1563555]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 12 13:34:21 pve-1 corosync[1563555]:   [KNET  ] link: host: 1 link: 0 is down
Jan 12 13:34:21 pve-1 corosync[1563555]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 12 13:34:21 pve-1 corosync[1563555]:   [KNET  ] host: host: 1 has no active links
Jan 12 13:34:23 pve-1 corosync[1563555]:   [KNET  ] rx: host: 1 link: 0 is up
Jan 12 13:34:23 pve-1 corosync[1563555]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Jan 12 13:34:23 pve-1 corosync[1563555]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 12 13:34:53 pve-1 corosync[1563555]:   [KNET  ] link: host: 1 link: 0 is down
Jan 12 13:34:53 pve-1 corosync[1563555]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 12 13:34:53 pve-1 corosync[1563555]:   [KNET  ] host: host: 1 has no active links
Jan 12 13:34:55 pve-1 corosync[1563555]:   [KNET  ] rx: host: 1 link: 0 is up
Jan 12 13:34:55 pve-1 corosync[1563555]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Jan 12 13:34:55 pve-1 corosync[1563555]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 12 13:35:03 pve-1 corosync[1563555]:   [KNET  ] link: host: 1 link: 0 is down
Jan 12 13:35:03 pve-1 corosync[1563555]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 12 13:35:03 pve-1 corosync[1563555]:   [KNET  ] host: host: 1 has no active links
Jan 12 13:35:04 pve-1 corosync[1563555]:   [KNET  ] rx: host: 1 link: 0 is up
Jan 12 13:35:04 pve-1 corosync[1563555]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Jan 12 13:35:04 pve-1 corosync[1563555]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 12 13:36:01 pve-1 corosync[1563555]:   [KNET  ] link: host: 1 link: 0 is down
Jan 12 13:36:01 pve-1 corosync[1563555]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 12 13:36:01 pve-1 corosync[1563555]:   [KNET  ] host: host: 1 has no active links
Jan 12 13:36:03 pve-1 corosync[1563555]:   [KNET  ] rx: host: 1 link: 0 is up
Jan 12 13:36:03 pve-1 corosync[1563555]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Jan 12 13:36:03 pve-1 corosync[1563555]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 12 13:36:37 pve-1 corosync[1563555]:   [KNET  ] link: host: 1 link: 0 is down
Jan 12 13:36:37 pve-1 corosync[1563555]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 12 13:36:37 pve-1 corosync[1563555]:   [KNET  ] host: host: 1 has no active links

This is probably the root cause of your issue: while the link is flapping, the node keeps losing quorum and the cluster file system (/etc/pve) becomes unavailable (read-only). That in turn produces the error message from pve-ha-lrm, because it cannot write its status file there.

What kind of network do you have between your nodes? And what is the latency on that connection (check with a simple ping)? Corosync is very sensitive to latency, and even a few milliseconds can cause serious issues.
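
For example, a quick check from pve-1 using the other node's cluster address from the pvecm output above (adjust the IP if corosync runs over a different network):

Code:
# latency and packet loss on the corosync link; more than a few ms or any loss is a problem
ping -c 20 10.18.6.7

# per-link status as corosync/knet sees it
corosync-cfgtool -s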
 
There is no shared storage between the nodes, though; each node is pretty much independent. The reason I put them together in a cluster in the first place was to have both hosts available in the same GUI, although now that I look at it, that wasn't the smartest idea, to be honest...

Can I delete the cluster altogether without reinstalling both hosts?
 
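Yes, that is possible without reinstalling; the Cluster Manager documentation linked above has a section "Separate a Node Without Reinstalling" that covers it. A rough sketch of those documented steps (the node name is a placeholder, and back up /etc/pve before touching anything):

Code:
# on the node you want to separate: stop the cluster services
systemctl stop pve-cluster corosync

# start the cluster file system in local mode and remove the corosync configuration
pmxcfs -l
rm /etc/corosync/*
rm /etc/pve/corosync.conf
killall pmxcfs
systemctl start pve-cluster

# on the remaining node: remove the separated node from the member list
pvecm delnode <nodename>

If the remaining node is not quorate at that point, pvecm expected 1 lets it operate on its own again.
 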
I'll try later today, thanks for the help. Don't close the thread yet; I'll mark it as solved when I get this thing running.
 
Ok, thanks a lot for your help. I was able to remove the cluster completely, and after removing the node from the cluster I did a dist-upgrade, which went through successfully. Have a nice day, and thanks for the help again :D
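
For anyone hitting the same thing: once the node is healthy (quorate or out of the cluster), the packages that errored out can usually be picked up again with something along these lines before re-running the upgrade:

Code:
# finish configuring any half-installed packages
dpkg --configure -a

# fix remaining dependency problems, then retry the upgrade
apt -f install
apt full-upgrade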
 