[SOLVED] Cluster of 3 nodes with status unknown only if 2 specific nodes are online

slyx

New Member
Jun 5, 2023
Hello :)
For the last 24 hours I've been banging my head against a problem that appeared by itself yesterday around 12pm, according to the logs.
Since then, I can't get my cluster to work properly:
I have 3 nodes A, B, C.
If I turn on nodes A and B or B and C, everything works fine.
As soon as I turn on A and C together (so either A and C, or A, B and C), commands start hanging: the GUI bugs out (hangs for a long time; I can't reach node C's interface from node A, or A's from C), the replication tabs of the VMs won't load, and cluster commands like pvesh get /nodes hang until I turn off one of the nodes.

Also, I have a Samba VM on node C, and if A and C are on at the same time, its read/write performance drops.
Looking at the nodes' I/O, they seem to be constantly exchanging a lot of data.

Weirdly, I have full quorum and not much trouble starting/stopping VMs.

The following data comes from a state where all three nodes are running.

The problem seems to come from pvescheduler (node C):
Code:
 pvescheduler.service - Proxmox VE scheduler
     Loaded: loaded (/lib/systemd/system/pvescheduler.service; enabled; preset: enabled)
     Active: active (running) since Mon 2025-03-31 18:10:37 CEST; 3h 51min ago
    Process: 10658 ExecStart=/usr/bin/pvescheduler start (code=exited, status=0/SUCCESS)
   Main PID: 10833 (pvescheduler)
      Tasks: 3 (limit: 38286)
     Memory: 118.4M
        CPU: 3.882s
     CGroup: /system.slice/pvescheduler.service
             ├─ 10833 pvescheduler
             ├─109087 pvescheduler
             └─109088 pvescheduler

Mar 31 18:28:09 pvestorage01 pvescheduler[24910]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Mar 31 18:28:09 pvestorage01 pvescheduler[24909]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Mar 31 21:10:35 pvestorage01 pvescheduler[94073]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Mar 31 21:10:35 pvestorage01 pvescheduler[94072]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Mar 31 21:31:44 pvestorage01 pvescheduler[98848]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Mar 31 21:31:44 pvestorage01 pvescheduler[98849]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Mar 31 21:39:32 pvestorage01 pvescheduler[100460]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Mar 31 21:39:32 pvestorage01 pvescheduler[100461]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Mar 31 22:00:21 pvestorage01 pvescheduler[104669]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Mar 31 22:00:21 pvestorage01 pvescheduler[102682]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
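(Aside for anyone hitting the same errors: the alternating "no quorum!" / "got lock request timeout" messages usually mean corosync membership is flapping even if pvecm status looks quorate at the instant you run it. A minimal, guarded sketch of the checks I'd run on each node; it is a harmless no-op on a machine without the PVE tooling.)

```shell
# Guarded sketch: only runs the PVE/corosync tools if they exist.
check_quorum() {
    if command -v pvecm >/dev/null 2>&1; then
        pvecm status            # quorum + membership as corosync sees it
        corosync-cfgtool -s     # per-link status: a link that is down only
                                # between A and C would match the symptom
    else
        echo "pvecm not found: not a PVE node"
    fi
}
check_quorum
```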

pve-ha-lrm (Node C):
Code:
pve-ha-lrm.service - PVE Local HA Resource Manager Daemon
     Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; enabled; preset: enabled)
     Active: active (running) since Mon 2025-03-31 17:56:41 CEST; 4h 5min ago
    Process: 3987 ExecStart=/usr/sbin/pve-ha-lrm start (code=exited, status=0/SUCCESS)
   Main PID: 3999 (pve-ha-lrm)
      Tasks: 1 (limit: 38286)
     Memory: 111.0M
        CPU: 1.774s
     CGroup: /system.slice/pve-ha-lrm.service
             └─3999 pve-ha-lrm

Mar 31 21:34:38 pvestorage01 pve-ha-lrm[3999]: loop take too long (172 seconds)
Mar 31 21:39:32 pvestorage01 pve-ha-lrm[3999]: unable to write lrm status file - close (rename) atomic file '/etc/pve/nodes/pvestorage01/lrm_status' failed: Permission denied
Mar 31 21:39:37 pvestorage01 pve-ha-lrm[3999]: loop take too long (299 seconds)
Mar 31 21:44:02 pvestorage01 pve-ha-lrm[3999]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pvestorage01/lrm_status.tmp.3999' - Permission denied
Mar 31 21:44:07 pvestorage01 pve-ha-lrm[3999]: loop take too long (270 seconds)
Mar 31 21:52:51 pvestorage01 pve-ha-lrm[3999]: unable to write lrm status file - unable to write '/etc/pve/nodes/pvestorage01/lrm_status.tmp.3999' - Permission denied
Mar 31 21:52:56 pvestorage01 pve-ha-lrm[3999]: loop take too long (529 seconds)
Mar 31 21:52:56 pvestorage01 pve-ha-lrm[3999]: unable to write lrm status file - unable to delete old temp file: Device or resource busy
Mar 31 22:00:21 pvestorage01 pve-ha-lrm[3999]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pvestorage01/lrm_status.tmp.3999' - Permission denied
Mar 31 22:00:26 pvestorage01 pve-ha-lrm[3999]: loop take too long (445 seconds)

pve-ha-lrm (Node A):
Code:
 pve-ha-lrm.service - PVE Local HA Resource Manager Daemon
     Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; enabled; preset: enabled)
     Active: active (running) since Mon 2025-03-31 21:09:34 CEST; 59min ago
    Process: 2140 ExecStart=/usr/sbin/pve-ha-lrm start (code=exited, status=0/SUCCESS)
   Main PID: 2144 (pve-ha-lrm)
      Tasks: 1 (limit: 76898)
     Memory: 111.0M
        CPU: 593ms
     CGroup: /system.slice/pve-ha-lrm.service
             └─2144 pve-ha-lrm

Mar 31 21:17:27 R340-01 pve-ha-lrm[2144]: loop take too long (69 seconds)
Mar 31 21:20:27 R340-01 pve-ha-lrm[2144]: loop take too long (175 seconds)
Mar 31 21:22:37 R340-01 pve-ha-lrm[2144]: loop take too long (110 seconds)
Mar 31 21:24:15 R340-01 pve-ha-lrm[2144]: loop take too long (78 seconds)
Mar 31 21:26:08 R340-01 pve-ha-lrm[2144]: loop take too long (93 seconds)
Mar 31 21:28:00 R340-01 pve-ha-lrm[2144]: loop take too long (92 seconds)
Mar 31 21:29:47 R340-01 pve-ha-lrm[2144]: loop take too long (92 seconds)
Mar 31 21:31:36 R340-01 pve-ha-lrm[2144]: loop take too long (89 seconds)
Mar 31 21:34:38 R340-01 pve-ha-lrm[2144]: loop take too long (172 seconds)
Mar 31 21:52:57 R340-01 pve-ha-lrm[2144]: loop take too long (1094 seconds)


pvestatd service (Node C)
Code:
 pvestatd.service - PVE Status Daemon
     Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled; preset: enabled)
     Active: active (running) since Mon 2025-03-31 17:54:11 CEST; 4h 8min ago
    Process: 2735 ExecStart=/usr/bin/pvestatd start (code=exited, status=0/SUCCESS)
   Main PID: 2774 (pvestatd)
      Tasks: 1 (limit: 38286)
     Memory: 164.6M
        CPU: 7min 6.894s
     CGroup: /system.slice/pvestatd.service
             └─2774 pvestatd

Mar 31 21:26:03 pvestorage01 pvestatd[2774]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Mar 31 21:26:03 pvestorage01 pvestatd[2774]: status update time (112.975 seconds)
Mar 31 21:31:44 pvestorage01 pvestatd[2774]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Mar 31 21:31:44 pvestorage01 pvestatd[2774]: status update time (340.638 seconds)
Mar 31 21:34:33 pvestorage01 pvestatd[2774]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Mar 31 21:34:33 pvestorage01 pvestatd[2774]: status update time (169.084 seconds)
Mar 31 21:39:32 pvestorage01 pvestatd[2774]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Mar 31 21:39:33 pvestorage01 pvestatd[2774]: status update time (299.533 seconds)
Mar 31 22:00:21 pvestorage01 pvestatd[2774]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Mar 31 22:00:22 pvestorage01 pvestatd[2774]: status update time (1249.205 seconds)


I updated all three nodes to the latest version.

pveversion -v (Node C):
Code:
proxmox-ve: 8.3.0 (running kernel: 6.8.12-9-pve)
pve-manager: 8.3.5 (running version: 8.3.5/dac3aa88bac3f300)
proxmox-kernel-helper: 8.1.1
proxmox-kernel-6.8: 6.8.12-9
proxmox-kernel-6.8.12-9-pve-signed: 6.8.12-9
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.2.0
libpve-network-perl: 0.10.1
libpve-rs-perl: 0.9.2
libpve-storage-perl: 8.3.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.3.4-1
proxmox-backup-file-restore: 3.3.4-1
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.1
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.7
pve-cluster: 8.0.10
pve-container: 5.2.4
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-3
pve-ha-manager: 4.0.6
pve-i18n: 3.4.1
pve-qemu-kvm: 9.2.0-2
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.8
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve2

Hope somebody can help. In any case, thanks for reading and scratching your head over this :)
 
Quick shot: triple check that there are no IP address conflicts. Make sure that /etc/hosts contains the same information on all three nodes.

"ping A" must work on A+B+C; "ping B" must ...; - by using their names, not just an IP address.
 
Hello
Thanks UdoB for your reply.
The hosts are properly defined in all three /etc/hosts files and the systems can ping each other :)
However, I think I finally found the root cause: node B had a directory storage (/mnt/pve/xxxxx) whose disk was full.
I moved some VM disks off it and everything seems to be working.
I'll let it run for 24 hours and mark the thread as solved if the fix sticks.
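(For anyone wanting to catch a full storage early: a plain df scan is enough, no PVE-specific tooling needed. The 90% threshold is arbitrary.)

```shell
# List any mounted filesystem at or above the usage threshold.
# df -P guarantees one line per filesystem; $5 is the Use% column,
# $6 the mount point.
THRESHOLD=90
df -P | awk -v max="$THRESHOLD" \
    'NR > 1 { gsub(/%/, "", $5); if ($5 + 0 >= max) print $6, $5 "%" }'
```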

Thanks again for reading and helping :)
 
Well, it didn't stick; the source of the problem seems to be elsewhere.
I think it has something to do with storage space, but I don't see anything full anymore :/
(The "Device or resource busy" and "unable to write lrm status" errors seem to point that way. Also, I can't update VM configs.)

Node A
Code:
pve-ha-lrm.service - PVE Local HA Resource Manager Daemon
     Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; enabled; preset: enabled)
     Active: active (running) since Tue 2025-04-01 08:57:35 CEST; 1 day 1h ago
    Process: 2140 ExecStart=/usr/sbin/pve-ha-lrm start (code=exited, status=0/SUCCESS)
   Main PID: 2144 (pve-ha-lrm)
      Tasks: 1 (limit: 76898)
     Memory: 4.2M
        CPU: 1.367s
     CGroup: /system.slice/pve-ha-lrm.service
             └─2144 pve-ha-lrm

Apr 02 03:13:43 R340-01 pve-ha-lrm[2144]: loop take too long (443 seconds)
Apr 02 04:11:50 R340-01 pve-ha-lrm[2144]: loop take too long (3487 seconds)
Apr 02 04:25:01 R340-01 pve-ha-lrm[2144]: loop take too long (791 seconds)
Apr 02 05:43:35 R340-01 pve-ha-lrm[2144]: loop take too long (4714 seconds)
Apr 02 07:12:24 R340-01 pve-ha-lrm[2144]: loop take too long (5329 seconds)
Apr 02 07:23:01 R340-01 pve-ha-lrm[2144]: loop take too long (637 seconds)
Apr 02 09:01:02 R340-01 pve-ha-lrm[2144]: loop take too long (5881 seconds)
Apr 02 09:45:51 R340-01 pve-ha-lrm[2144]: unable to write lrm status file - unable to write '/etc/pve/nodes/R340-01/lrm_status.tmp.2144' - Device or resource busy
Apr 02 09:45:56 R340-01 pve-ha-lrm[2144]: loop take too long (2694 seconds)
Apr 02 09:45:56 R340-01 pve-ha-lrm[2144]: unable to write lrm status file - unable to delete old temp file: Device or resource busy


pvescheduler.service - Proxmox VE scheduler
     Loaded: loaded (/lib/systemd/system/pvescheduler.service; enabled; preset: enabled)
     Active: active (running) since Tue 2025-04-01 08:58:44 CEST; 1 day 1h ago
    Process: 3437 ExecStart=/usr/bin/pvescheduler start (code=exited, status=0/SUCCESS)
   Main PID: 5030 (pvescheduler)
      Tasks: 3 (limit: 76898)
     Memory: 9.4M
        CPU: 8.738s
     CGroup: /system.slice/pvescheduler.service
             ├─  5030 pvescheduler
             ├─559786 pvescheduler
             └─559787 pvescheduler

Apr 02 07:12:19 R340-01 pvescheduler[494721]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Apr 02 07:12:19 R340-01 pvescheduler[494722]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Apr 02 07:22:56 R340-01 pvescheduler[517567]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Apr 02 07:22:56 R340-01 pvescheduler[517566]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Apr 02 07:42:57 R340-01 pvescheduler[520352]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Apr 02 07:49:05 R340-01 pvescheduler[525873]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Apr 02 09:00:57 R340-01 pvescheduler[520351]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Apr 02 09:00:57 R340-01 pvescheduler[527897]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Apr 02 09:45:51 R340-01 pvescheduler[547800]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Apr 02 09:45:51 R340-01 pvescheduler[547801]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout

Node B
Code:
 pvescheduler.service - Proxmox VE scheduler
     Loaded: loaded (/lib/systemd/system/pvescheduler.service; enabled; preset: enabled)
     Active: active (running) since Mon 2025-03-31 18:29:13 CEST; 1 day 15h ago
    Process: 2317 ExecStart=/usr/bin/pvescheduler start (code=exited, status=0/SUCCESS)
   Main PID: 2318 (pvescheduler)
      Tasks: 3 (limit: 309168)
     Memory: 118.5M
        CPU: 36.916s
     CGroup: /system.slice/pvescheduler.service
             ├─  2318 pvescheduler
             ├─787934 pvescheduler
             └─787935 pvescheduler

Apr 02 09:00:57 R730-01 pvescheduler[750806]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Apr 02 09:17:25 R730-01 pvescheduler[775125]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Apr 02 09:18:35 R730-01 pvescheduler[780539]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Apr 02 09:18:35 R730-01 pvescheduler[780540]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Apr 02 09:20:14 R730-01 pvescheduler[780880]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Apr 02 09:21:34 R730-01 pvescheduler[780881]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Apr 02 09:23:34 R730-01 pvescheduler[781559]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Apr 02 09:24:14 R730-01 pvescheduler[781898]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Apr 02 09:39:19 R730-01 pvescheduler[782576]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Apr 02 09:39:19 R730-01 pvescheduler[782918]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout


pve-ha-lrm.service - PVE Local HA Resource Manager Daemon
     Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; enabled; preset: enabled)
     Active: active (running) since Mon 2025-03-31 18:29:09 CEST; 1 day 15h ago
    Process: 2272 ExecStart=/usr/sbin/pve-ha-lrm start (code=exited, status=0/SUCCESS)
   Main PID: 2276 (pve-ha-lrm)
      Tasks: 1 (limit: 309168)
     Memory: 111.0M
        CPU: 10.576s
     CGroup: /system.slice/pve-ha-lrm.service
             └─2276 pve-ha-lrm

Apr 02 07:47:00 R730-01 pve-ha-lrm[2276]: loop take too long (120 seconds)
Apr 02 07:49:10 R730-01 pve-ha-lrm[2276]: loop take too long (130 seconds)
Apr 02 09:01:02 R730-01 pve-ha-lrm[2276]: loop take too long (4312 seconds)
Apr 02 09:17:30 R730-01 pve-ha-lrm[2276]: loop take too long (988 seconds)
Apr 02 09:19:29 R730-01 pve-ha-lrm[2276]: loop take too long (84 seconds)
Apr 02 09:21:30 R730-01 pve-ha-lrm[2276]: loop take too long (121 seconds)
Apr 02 09:23:31 R730-01 pve-ha-lrm[2276]: loop take too long (121 seconds)
Apr 02 09:39:19 R730-01 pve-ha-lrm[2276]: unable to write lrm status file - unable to write '/etc/pve/nodes/R730-01/lrm_status.tmp.2276' - Device or resource busy
Apr 02 09:39:24 R730-01 pve-ha-lrm[2276]: loop take too long (953 seconds)
Apr 02 09:39:24 R730-01 pve-ha-lrm[2276]: unable to write lrm status file - unable to delete old temp file: Device or resource busy

Node C
Code:
pve-ha-lrm.service - PVE Local HA Resource Manager Daemon
     Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; enabled; preset: enabled)
     Active: active (running) since Tue 2025-04-01 08:40:32 CEST; 1 day 1h ago
    Process: 2909 ExecStart=/usr/sbin/pve-ha-lrm start (code=exited, status=0/SUCCESS)
   Main PID: 2913 (pve-ha-lrm)
      Tasks: 1 (limit: 38286)
     Memory: 111.0M
        CPU: 1.158s
     CGroup: /system.slice/pve-ha-lrm.service
             └─2913 pve-ha-lrm

Apr 02 09:17:29 pvestorage01 pve-ha-lrm[2913]: loop take too long (579 seconds)
Apr 02 09:24:15 pvestorage01 pve-ha-lrm[2913]: unable to write lrm status file - unable to write '/etc/pve/nodes/pvestorage01/lrm_status.tmp.2913' - Permission denied
Apr 02 09:24:20 pvestorage01 pve-ha-lrm[2913]: loop take too long (411 seconds)
Apr 02 09:24:20 pvestorage01 pve-ha-lrm[2913]: unable to write lrm status file - unable to delete old temp file: Device or resource busy
Apr 02 09:31:25 pvestorage01 pve-ha-lrm[2913]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pvestorage01/lrm_status.tmp.2913' - Permission denied
Apr 02 09:31:30 pvestorage01 pve-ha-lrm[2913]: loop take too long (425 seconds)
Apr 02 09:45:50 pvestorage01 pve-ha-lrm[2913]: unable to write lrm status file - close (rename) atomic file '/etc/pve/nodes/pvestorage01/lrm_status' failed: Permission denied
Apr 02 09:45:55 pvestorage01 pve-ha-lrm[2913]: loop take too long (865 seconds)
Apr 02 09:53:48 pvestorage01 pve-ha-lrm[2913]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pvestorage01/lrm_status.tmp.2913' - Permission denied
Apr 02 09:53:53 pvestorage01 pve-ha-lrm[2913]: loop take too long (478 seconds)


pvescheduler.service - Proxmox VE scheduler
     Loaded: loaded (/lib/systemd/system/pvescheduler.service; enabled; preset: enabled)
     Active: active (running) since Tue 2025-04-01 08:43:53 CEST; 1 day 1h ago
    Process: 5766 ExecStart=/usr/bin/pvescheduler start (code=exited, status=0/SUCCESS)
   Main PID: 5767 (pvescheduler)
      Tasks: 3 (limit: 38286)
     Memory: 118.5M
        CPU: 3.910s
     CGroup: /system.slice/pvescheduler.service
             ├─  5767 pvescheduler
             ├─919625 pvescheduler
             └─919626 pvescheduler

Apr 02 06:33:27 pvestorage01 pvescheduler[864533]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Apr 02 06:33:27 pvestorage01 pvescheduler[864534]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Apr 02 06:40:40 pvestorage01 pvescheduler[870031]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Apr 02 06:40:40 pvestorage01 pvescheduler[870030]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Apr 02 07:05:48 pvestorage01 pvescheduler[871737]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Apr 02 07:49:05 pvestorage01 pvescheduler[886432]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Apr 02 09:07:45 pvestorage01 pvescheduler[901938]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Apr 02 09:07:45 pvestorage01 pvescheduler[903735]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Apr 02 09:45:50 pvestorage01 pvescheduler[910825]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Apr 02 09:45:50 pvestorage01 pvescheduler[910824]: replication: cfs-lock 'file-replication_cfg' error: no quorum!

Thanks for your help :)
 
Hello
So I ended up with the following solution: I renamed the node that seemed to be causing the corosync problems.

To do so, I followed the procedure described in this link:
https://pve.proxmox.com/wiki/Renaming_a_PVE_node
Specifically, the one given by @fnrfarid:
Code:
Stopped all VMs [also moved all important VMs to another node for my peace of mind]
From the terminal of the node I wanted to rename, I ran these and made the changes for new_hostname:
cp -r /etc/pve /root/pve_backup
nano /etc/hosts
nano /etc/hostname
nano /etc/mailname
nano /etc/postfix/main.cf
hostnamectl set-hostname new_hostname
nano /etc/pve/corosync.conf (increased config_version from 3 to 4; the version may differ from user to user)
pvecm updatecerts
systemctl restart pveproxy
systemctl restart pvedaemon
cd /etc/pve/nodes
cp -r /etc/pve/nodes/old_hostname/ /root/oldconfig
cp /etc/pve/nodes/old_hostname/lxc/* /etc/pve/nodes/new_hostname/lxc
cp /etc/pve/nodes/old_hostname/qemu-server/* /etc/pve/nodes/new_hostname/qemu-server
rm -rf /etc/pve/nodes/old_hostname
cd /var/lib/rrdcached/db/pve2-node/
cp -r old_hostname new_hostname
cd /var/lib/rrdcached/db/pve2-storage/
cp -r old_hostname new_hostname
reboot
After the restart I also adjusted my storage settings:
nano /etc/pve/storage.cfg

Also, because all the nodes went out of sync as soon as they were all online at the same time, I repeated the following on all nodes:
Code:
cp -r /etc/pve /root/pve_backup
nano /etc/hosts
nano /etc/pve/corosync.conf (increased config_version from 3 to 4; the version may differ from user to user)
pvecm updatecerts
systemctl restart pveproxy
systemctl restart pvedaemon
cd /etc/pve/nodes
cp -r /etc/pve/nodes/old_hostname/ /root/oldconfig
cp /etc/pve/nodes/old_hostname/lxc/* /etc/pve/nodes/new_hostname/lxc
cp /etc/pve/nodes/old_hostname/qemu-server/* /etc/pve/nodes/new_hostname/qemu-server
rm -rf /etc/pve/nodes/old_hostname
cd /var/lib/rrdcached/db/pve2-node/
cp -r old_hostname new_hostname
cd /var/lib/rrdcached/db/pve2-storage/
cp -r old_hostname new_hostname
pvecm updatecerts
reboot
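(For anyone repeating this: after the rename, the kernel hostname, /etc/pve/nodes/ and corosync.conf must all agree on the new name. A small sketch to verify that; verify_rename is a hypothetical helper, not a PVE command.)

```shell
# Check the three places a PVE node name lives. $1 = expected new name.
# The /etc/pve checks are skipped off a PVE node.
verify_rename() {
    new=$1
    [ "$(hostname)" = "$new" ] || { echo "hostname mismatch"; return 1; }
    if [ -d /etc/pve/nodes ]; then
        [ -d "/etc/pve/nodes/$new" ] || { echo "missing /etc/pve/nodes/$new"; return 1; }
        grep -q "name: $new" /etc/pve/corosync.conf \
            || { echo "corosync.conf does not list $new"; return 1; }
    fi
    echo "ok: $new looks consistent"
}
verify_rename "$(hostname)"
```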

Now I need to analyze the logs properly to find what caused so much trouble. (I suspect daylight saving time: it turned out the clocks of the three systems were not set up properly, with a different timezone on each -_-, but I'm not sure.)
Thanks for the help :).