HA groups fail to migrate after upgrade from 8 to 9

sbarmen

New Member
Dec 28, 2023
Hello all, I have upgraded my cluster to Proxmox VE 9 (currently on 9.1.1) and found that my HA groups were not migrated during the upgrade. I get the following error when I try to make changes:

[attached screenshot: 1764409070956.png]

I believe this is caused by my removal of some old nodes from the cluster a while back. When the cluster was first installed I had two nodes, thinkserver and pvrserver. They are now gone and I only have px0-rv, px1-rv and px2-rv.

I think I followed the guide for removing the old nodes correctly, but they are still causing problems. Note that the current nodes run Ceph; the two old nodes were never part of the Ceph cluster. Any pointers for me?

Code:
root@px2-rv:~# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         4          1 px0-rv
         5          1 px1-rv
         6          1 px2-rv (local)

Code:
root@px2-rv:/etc/pve/nodes# ls
px0-rv  px1-rv  px2-rv
root@px2-rv:/etc/pve/nodes# pvecm delnode pvrserver
Node/IP: pvrserver is not a known host of the cluster.
root@px2-rv:/etc/pve/nodes# pvecm delnode thinkserver
Node/IP: thinkserver is not a known host of the cluster.
root@px2-rv:/etc/pve/nodes#
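As a quick sanity check, the comparison above can be sketched as a small POSIX shell helper (the function name is hypothetical; the member list comes from the `pvecm nodes` output):

```shell
# find_stale_node_dirs DIR "member1 member2 ..."
# Lists subdirectories of DIR whose names are not in the given member
# list, i.e. leftover config directories from removed nodes.
find_stale_node_dirs() {
    dir=$1; members=$2
    for d in "$dir"/*/; do
        [ -d "$d" ] || continue
        name=$(basename "$d")
        case " $members " in
            *" $name "*) ;;                       # still a cluster member
            *) echo "stale node dir: $name" ;;    # leftover directory
        esac
    done
}

# On the cluster this would be:
#   find_stale_node_dirs /etc/pve/nodes "px0-rv px1-rv px2-rv"
```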

I also see errors from the pve-ha-crm service in journalctl:

Code:
Nov 29 10:50:46 px2-rv pve-ha-crm[1874]: ha groups migration: node 'pvrserver' is in state 'gone'
Nov 29 10:50:46 px2-rv pve-ha-crm[1874]: abort ha groups migration: node 'pvrserver' is not online
Nov 29 10:50:46 px2-rv pve-ha-crm[1874]: ha groups migration failed
Nov 29 10:50:46 px2-rv pve-ha-crm[1874]: retry ha groups migration in 6 rounds (~ 60 seconds)
Nov 29 10:50:56 px2-rv pve-ha-crm[1874]: unable to read file '/etc/pve/nodes/pvrserver/lrm_status'
Nov 29 10:50:56 px2-rv pve-ha-crm[1874]: unable to read file '/etc/pve/nodes/thinkserver/lrm_status'
Nov 29 10:51:06 px2-rv pve-ha-crm[1874]: unable to read file '/etc/pve/nodes/pvrserver/lrm_status'
Nov 29 10:51:06 px2-rv pve-ha-crm[1874]: unable to read file '/etc/pve/nodes/thinkserver/lrm_status'
Nov 29 10:51:16 px2-rv pve-ha-crm[1874]: unable to read file '/etc/pve/nodes/pvrserver/lrm_status'
Nov 29 10:51:16 px2-rv pve-ha-crm[1874]: unable to read file '/etc/pve/nodes/thinkserver/lrm_status'


Bash:
root@px2-rv:~# ha-manager status -v
unable to read file '/etc/pve/nodes/pvrserver/lrm_status'
unable to read file '/etc/pve/nodes/thinkserver/lrm_status'
quorum OK
master px2-rv (active, Sat Nov 29 10:32:06 2025)
lrm pvrserver (unable to read lrm status)
lrm px0-rv (active, Sat Nov 29 10:32:06 2025)
lrm px1-rv (active, Sat Nov 29 10:32:05 2025)
lrm px2-rv (active, Sat Nov 29 10:32:11 2025)
lrm thinkserver (unable to read lrm status)
service ct:101 (px0-rv, started)
service ct:115 (px0-rv, started)
service vm:103 (px1-rv, started)
service vm:104 (px1-rv, started)
service vm:106 (px2-rv, started)
service vm:107 (px2-rv, started)
service vm:108 (px2-rv, started)
service vm:111 (px0-rv, started)
service vm:113 (px1-rv, started)
service vm:300 (px2-rv, started)
service vm:305 (px2-rv, disabled)
service vm:306 (px1-rv, started)
service vm:307 (px2-rv, started)
full cluster state:
unable to read file '/etc/pve/nodes/pvrserver/lrm_status'
unable to read file '/etc/pve/nodes/thinkserver/lrm_status'
{
   "lrm_status" : {
      "pvrserver" : {
         "mode" : "unknown"
      },
      "px0-rv" : {
         "mode" : "active",
         "results" : {
            "UID-REDACTED-001" : {
               "exit_code" : 0,
               "sid" : "ct:115",
               "state" : "started"
            },
            "UID-REDACTED-002" : {
               "exit_code" : 0,
               "sid" : "vm:111",
               "state" : "started"
            },
            "UID-REDACTED-003" : {
               "exit_code" : 0,
               "sid" : "ct:101",
               "state" : "started"
            }
         },
         "state" : "active",
         "timestamp" : 1764408726
      },
      "px1-rv" : {
         "mode" : "active",
         "results" : {
            "UID-REDACTED-004" : {
               "exit_code" : 0,
               "sid" : "vm:103",
               "state" : "started"
            },
            "UID-REDACTED-005" : {
               "exit_code" : 0,
               "sid" : "vm:104",
               "state" : "started"
            },
            "UID-REDACTED-006" : {
               "exit_code" : 0,
               "sid" : "vm:113",
               "state" : "started"
            },
            "UID-REDACTED-007" : {
               "exit_code" : 0,
               "sid" : "vm:306",
               "state" : "started"
            }
         },
         "state" : "active",
         "timestamp" : 1764408725
      },
      "px2-rv" : {
         "mode" : "active",
         "results" : {
            "UID-REDACTED-008" : {
               "exit_code" : 0,
               "sid" : "vm:307",
               "state" : "started"
            },
            "UID-REDACTED-009" : {
               "exit_code" : 0,
               "sid" : "vm:107",
               "state" : "started"
            },
            "UID-REDACTED-010" : {
               "exit_code" : 0,
               "sid" : "vm:305",
               "state" : "stopped"
            },
            "UID-REDACTED-011" : {
               "exit_code" : 0,
               "sid" : "vm:300",
               "state" : "started"
            },
            "UID-REDACTED-012" : {
               "exit_code" : 0,
               "sid" : "vm:108",
               "state" : "started"
            },
            "UID-REDACTED-013" : {
               "exit_code" : 0,
               "sid" : "vm:106",
               "state" : "started"
            }
         },
         "state" : "active",
         "timestamp" : 1764408731
      },
      "thinkserver" : {
         "mode" : "unknown"
      }
   },
   "manager_status" : {
      "master_node" : "px2-rv",
      "node_status" : {
         "pvrserver" : "gone",
         "px0-rv" : "online",
         "px1-rv" : "online",
         "px2-rv" : "online",
         "thinkserver" : "gone"
      },
      "service_status" : {
         "ct:101" : {
            "node" : "px0-rv",
            "running" : 1,
            "state" : "started",
            "uid" : "UID-REDACTED-003"
         },
         "ct:115" : {
            "node" : "px0-rv",
            "running" : 1,
            "state" : "started",
            "uid" : "UID-REDACTED-001"
         },
         "vm:103" : {
            "node" : "px1-rv",
            "running" : 1,
            "state" : "started",
            "uid" : "UID-REDACTED-004"
         },
         "vm:104" : {
            "node" : "px1-rv",
            "running" : 1,
            "state" : "started",
            "uid" : "UID-REDACTED-005"
         },
         "vm:106" : {
            "node" : "px2-rv",
            "running" : 1,
            "state" : "started",
            "uid" : "UID-REDACTED-013"
         },
         "vm:107" : {
            "node" : "px2-rv",
            "running" : 1,
            "state" : "started",
            "uid" : "UID-REDACTED-009"
         },
         "vm:108" : {
            "node" : "px2-rv",
            "running" : 1,
            "state" : "started",
            "uid" : "UID-REDACTED-012"
         },
         "vm:111" : {
            "node" : "px0-rv",
            "running" : 1,
            "state" : "started",
            "uid" : "UID-REDACTED-002"
         },
         "vm:113" : {
            "node" : "px1-rv",
            "running" : 1,
            "state" : "started",
            "uid" : "UID-REDACTED-006"
         },
         "vm:300" : {
            "node" : "px2-rv",
            "running" : 1,
            "state" : "started",
            "uid" : "UID-REDACTED-011"
         },
         "vm:305" : {
            "node" : "px2-rv",
            "state" : "stopped",
            "uid" : "UID-REDACTED-010"
         },
         "vm:306" : {
            "node" : "px1-rv",
            "running" : 1,
            "state" : "started",
            "uid" : "UID-REDACTED-007"
         },
         "vm:307" : {
            "node" : "px2-rv",
            "running" : 1,
            "state" : "started",
            "uid" : "UID-REDACTED-008"
         }
      },
      "timestamp" : 1764408726
   },
   "quorum" : {
      "node" : "px2-rv",
      "quorate" : "1"
   }
}
 
Hi!

The HA group to HA node affinity rules migration is done every 6 HA rounds, i.e. roughly once a minute, since an HA round lasts ~10 seconds. To be sure that everything is fine, it will only do the migration if the cluster is quorate, all nodes are online, the LRMs are active or idle, and all nodes have been upgraded to 9.0.0 or newer.
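The abort in your log matches the LRM precondition: pvrserver is 'gone', which is neither active nor idle. A minimal sketch of that check (the function is hypothetical, not the actual CRM code; the states are taken from `ha-manager status`):

```shell
# ha_migration_precheck "state1 state2 ..."
# Mimics the CRM rule that every node's LRM must be 'active' or 'idle'
# before the HA group migration is attempted; any other state aborts it.
ha_migration_precheck() {
    for state in $1; do
        case $state in
            active|idle) ;;
            *) echo "abort: lrm state '$state'"; return 1 ;;
        esac
    done
    echo "preconditions ok"
}

# With the states from this thread (pvrserver/thinkserver are 'gone'):
#   ha_migration_precheck "gone active active active gone"   # aborts
```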

The HA Manager will mark a node as 'gone' if its status is unknown (e.g. not online, or deleted) and removes it from the HA Manager status after it has been in the 'gone' state for about an hour. Has the HA Manager caught up to that yet? The log message should then be: deleting gone node '<nodename>', not a cluster member anymore.

Does either /etc/pve/nodes/pvrserver/ or /etc/pve/nodes/thinkserver/ still exist?
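If they do, removing the leftover directory is what typically unblocks the retry. A hedged sketch (the helper name is made up; double-check the node name before pointing this at /etc/pve/nodes):

```shell
# remove_stale_node DIR NAME
# Deletes the leftover config directory for a node that is no longer a
# cluster member. DIR would be /etc/pve/nodes on a real system.
remove_stale_node() {
    dir=$1; name=$2
    if [ -d "$dir/$name" ]; then
        rm -r "$dir/$name" && echo "removed $name"
    else
        echo "no such dir: $name"
    fi
}

# On the cluster:
#   remove_stale_node /etc/pve/nodes pvrserver
#   remove_stale_node /etc/pve/nodes thinkserver
```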
 
Hello @dakralex, the nodes thinkserver and pvrserver were deleted many months ago and were never part of the v9 upgrade. When I started troubleshooting this problem I found that /etc/pve/nodes did in fact still contain thinkserver and pvrserver folders, but I have since deleted them.

And now, while looking into your reply, it seems the cluster has fixed itself. It was probably remedied by deleting the folders and giving it some time.

Case closed, thanks for replying!
 