Hello,
I am losing hope; my VMs won't start, and any help is really needed.
There are two nodes whose hardware did not restart and which are still running 4 VMs; I am afraid that if I reboot them, the VMs will never start again.
When I try to start a VM, the task is handed to HA, but it never succeeds.
If the VM does not have HA configured, I can try to start it, but it times out. I also can't open a console to it, as that times out too (using web noVNC from Proxmox VE).
Ceph is "apparently" showing everything clean, but I am not sure it really is, since I could not test the VMs.
Commands like qm list freeze, and I cannot kill the process, even with -9.
There are VMs on other nodes that I also cannot start.
Please help me get the VMs started, so I have more time to debug what is happening.
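For reference, this is roughly what I am running from the CLI (VMID 101 is just an example, not one of my real guests); both commands hang or time out instead of returning:
Code:
# one of the commands that freezes and cannot be killed
qm list

# starting a non-HA VM manually - this is where I get the timeout
qm start 101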
Scenario:
8 PVE nodes running
Linux 6.8.12-5-pve
pve-manager/8.3.2/
Ceph Squid (19)
Code:
# pveversion --verbose
proxmox-ve: 8.3.0 (running kernel: 6.8.12-5-pve)
pve-manager: 8.3.2 (running version: 8.3.2/3e76eec21c4a14a7)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.12-5
proxmox-kernel-6.8.12-5-pve-signed: 6.8.12-5
proxmox-kernel-6.8.12-1-pve-signed: 6.8.12-1
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph: 19.2.0-pve2
ceph-fuse: 19.2.0-pve2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.1.2
libpve-network-perl: 0.10.0
libpve-rs-perl: 0.9.1
libpve-storage-perl: 8.3.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.3.2-1
proxmox-backup-file-restore: 3.3.2-2
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.3
pve-cluster: 8.0.10
pve-container: 5.2.3
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-2
pve-ha-manager: 4.0.6
pve-i18n: 3.3.2
pve-qemu-kvm: 9.0.2-4
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.3
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.6-pve1
Of those 8 nodes, 1 had been offline for some months due to a motherboard issue.
I was booting that node with its new motherboard; Proxmox was already installed on it (the old OS that had already joined the cluster), so it would come back online. I don't know if that would disturb the cluster synchronization; I think the other nodes would vote and fix this node's config.
At the same time I was restoring a VM from Proxmox Backup Server (the VM didn't exist in this cluster yet).
Suddenly the hardware of 4 nodes rebooted. There is no indication of a power issue.
The servers are HP ProLiant DL380 Gen9, and one is also a Gen9, but 1U.
Maybe it is just a coincidence with restoring the VM and the old node coming online?
Recently I updated the PVE kernel to be able to upgrade Ceph from Reef to Squid.
But the cluster is relatively new, only a few months old, and it was a direct/fresh install of Proxmox VE 8.
I tried selecting kernel Linux 6.8.12-1-pve instead of Linux 6.8.12-5-pve, but there was no change in the dmesg error messages.
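I selected the older kernel from the boot menu; the equivalent way to pin it from the CLI would be something like this (I have not applied a permanent pin):
Code:
# list the kernels proxmox-boot-tool knows about
proxmox-boot-tool kernel list

# pin the older kernel so it is used on the next boots
proxmox-boot-tool kernel pin 6.8.12-1-pve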
Here is the dmesg -T output:
Code:
[Fri Jan 3 13:15:32 2025] INFO: task pve-ha-crm:3025 blocked for more than 122 seconds.
[Fri Jan 3 13:15:32 2025] Tainted: P O 6.8.12-5-pve #1
[Fri Jan 3 13:15:32 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jan 3 13:15:32 2025] task:pve-ha-crm state:D stack:0 pid:3025 tgid:3025 ppid:1 flags:0x00000002
[Fri Jan 3 13:15:32 2025] Call Trace:
[Fri Jan 3 13:15:32 2025] <TASK>
[Fri Jan 3 13:15:32 2025] __schedule+0x401/0x15e0
[Fri Jan 3 13:15:32 2025] ? try_to_unlazy+0x60/0xe0
[Fri Jan 3 13:15:32 2025] ? terminate_walk+0x65/0x100
[Fri Jan 3 13:15:32 2025] schedule+0x33/0x110
[Fri Jan 3 13:15:32 2025] schedule_preempt_disabled+0x15/0x30
[Fri Jan 3 13:15:32 2025] rwsem_down_write_slowpath+0x392/0x6a0
[Fri Jan 3 13:15:32 2025] down_write+0x5c/0x80
[Fri Jan 3 13:15:32 2025] filename_create+0xaf/0x1b0
[Fri Jan 3 13:15:32 2025] do_mkdirat+0x59/0x180
[Fri Jan 3 13:15:32 2025] __x64_sys_mkdir+0x4a/0x70
[Fri Jan 3 13:15:32 2025] x64_sys_call+0x2e3/0x2480
[Fri Jan 3 13:15:32 2025] do_syscall_64+0x81/0x170
[Fri Jan 3 13:15:32 2025] ? do_user_addr_fault+0x33e/0x660
[Fri Jan 3 13:15:32 2025] ? irqentry_exit_to_user_mode+0x7b/0x260
[Fri Jan 3 13:15:32 2025] ? irqentry_exit+0x43/0x50
[Fri Jan 3 13:15:32 2025] ? exc_page_fault+0x94/0x1b0
[Fri Jan 3 13:15:32 2025] entry_SYSCALL_64_after_hwframe+0x78/0x80
[Fri Jan 3 13:15:32 2025] RIP: 0033:0x731e04219ea7
[Fri Jan 3 13:15:32 2025] RSP: 002b:00007ffeef405e78 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[Fri Jan 3 13:15:32 2025] RAX: ffffffffffffffda RBX: 000064c124e6c2a0 RCX: 0000731e04219ea7
[Fri Jan 3 13:15:32 2025] RDX: 0000000000000021 RSI: 00000000000001ff RDI: 000064c12b8d1d90
[Fri Jan 3 13:15:32 2025] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
[Fri Jan 3 13:15:32 2025] R10: 0000000000000000 R11: 0000000000000246 R12: 000064c124e71c88
[Fri Jan 3 13:15:32 2025] R13: 000064c12b8d1d90 R14: 000064c12b69df88 R15: 00000000000001ff
[Fri Jan 3 13:15:32 2025] </TASK>
[Fri Jan 3 13:15:32 2025] INFO: task pvescheduler:16975 blocked for more than 122 seconds.
[Fri Jan 3 13:15:32 2025] Tainted: P O 6.8.12-5-pve #1
[Fri Jan 3 13:15:32 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jan 3 13:15:32 2025] task:pvescheduler state:D stack:0 pid:16975 tgid:16975 ppid:16974 flags:0x00000006
[Fri Jan 3 13:15:32 2025] Call Trace:
[Fri Jan 3 13:15:32 2025] <TASK>
[Fri Jan 3 13:15:32 2025] __schedule+0x401/0x15e0
[Fri Jan 3 13:15:32 2025] ? try_to_unlazy+0x60/0xe0
[Fri Jan 3 13:15:32 2025] ? terminate_walk+0x65/0x100
[Fri Jan 3 13:15:32 2025] schedule+0x33/0x110
[Fri Jan 3 13:15:32 2025] schedule_preempt_disabled+0x15/0x30
[Fri Jan 3 13:15:32 2025] rwsem_down_write_slowpath+0x392/0x6a0
[Fri Jan 3 13:15:32 2025] down_write+0x5c/0x80
[Fri Jan 3 13:15:32 2025] filename_create+0xaf/0x1b0
[Fri Jan 3 13:15:32 2025] do_mkdirat+0x59/0x180
[Fri Jan 3 13:15:32 2025] __x64_sys_mkdir+0x4a/0x70
[Fri Jan 3 13:15:32 2025] x64_sys_call+0x2e3/0x2480
[Fri Jan 3 13:15:32 2025] do_syscall_64+0x81/0x170
[Fri Jan 3 13:15:32 2025] ? do_mkdirat+0xe1/0x180
[Fri Jan 3 13:15:32 2025] ? timerqueue_add+0x69/0xd0
[Fri Jan 3 13:15:32 2025] ? enqueue_hrtimer+0x4d/0xc0
[Fri Jan 3 13:15:32 2025] ? hrtimer_start_range_ns+0x12f/0x3b0
[Fri Jan 3 13:15:32 2025] ? do_setitimer+0x1a4/0x230
[Fri Jan 3 13:15:32 2025] ? __x64_sys_alarm+0x76/0xd0
[Fri Jan 3 13:15:32 2025] ? syscall_exit_to_user_mode+0x86/0x260
[Fri Jan 3 13:15:32 2025] ? do_syscall_64+0x8d/0x170
[Fri Jan 3 13:15:32 2025] ? irqentry_exit+0x43/0x50
[Fri Jan 3 13:15:32 2025] ? exc_page_fault+0x94/0x1b0
[Fri Jan 3 13:15:32 2025] entry_SYSCALL_64_after_hwframe+0x78/0x80
[Fri Jan 3 13:15:32 2025] RIP: 0033:0x73911a227ea7
[Fri Jan 3 13:15:32 2025] RSP: 002b:00007ffd9254d298 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[Fri Jan 3 13:15:32 2025] RAX: ffffffffffffffda RBX: 000062febef9e2a0 RCX: 000073911a227ea7
[Fri Jan 3 13:15:32 2025] RDX: 0000000000000026 RSI: 00000000000001ff RDI: 000062fec5cc43e0
[Fri Jan 3 13:15:32 2025] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
[Fri Jan 3 13:15:32 2025] R10: 0000000000000000 R11: 0000000000000246 R12: 000062febefa3c88
[Fri Jan 3 13:15:32 2025] R13: 000062fec5cc43e0 R14: 000062febf240368 R15: 00000000000001ff
[Fri Jan 3 13:15:32 2025] </TASK>
...
[Fri Jan 3 13:23:43 2025] Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings
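The traces show pve-ha-crm and pvescheduler stuck in mkdir, which I assume (not confirmed) means they are blocking on the cluster filesystem mounted at /etc/pve. These are the checks I can still run without rebooting anything; I have not drawn any conclusions from them yet:
Code:
# state of the cluster filesystem (pmxcfs) and corosync services
systemctl status pve-cluster.service corosync.service

# recent pve-cluster log messages since boot
journalctl -b -u pve-cluster --no-pager | tail -n 50

# this may hang as well if /etc/pve is blocked
ls /etc/pve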
root@x3:~# ceph -s
Code:
  cluster:
    id:     b682a805-8831-480a-8a8a-c7d896bfecae
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum x1,x2,x3 (age 4h)
    mgr: x2(active, since 4h), standbys: x1, x3
    osd: 10 osds: 10 up (since 69m), 10 in (since 69m)

  data:
    pools:   4 pools, 385 pgs
    objects: 35.15k objects, 131 GiB
    usage:   367 GiB used, 35 TiB / 35 TiB avail
    pgs:     385 active+clean
pvecm status
Code:
root@x3:~# pvecm status
Cluster information
-------------------
Name: m[redacted]3
Config Version: 8
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Fri Jan 3 16:56:31 2025
Quorum provider: corosync_votequorum
Nodes: 7
Node ID: 0x00000005
Ring ID: 2.862
Quorate: Yes
Votequorum information
----------------------
Expected votes: 8
Highest expected: 8
Total votes: 7
Quorum: 5
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000002 1 100.80.83.121
0x00000003 1 100.80.83.124
0x00000004 1 100.80.83.125
0x00000005 1 100.80.83.126 (local)
0x00000006 1 100.80.83.127
0x00000007 1 100.80.83.129
0x00000008 1 100.80.83.130
When the hardware rebooted, it took almost an hour and a half to sync, and I had to reboot some of the nodes; after that, pvecm status showed quorate. I think that happened because those nodes got "fenced".
I had SSH access the whole time, and could ping any node from any node.
I had a problem accessing the web UI, but later it was available again.
As a precaution, I disconnected the node that was booting after months offline from the switch.
So the kernel errors in dmesg are from after that node went offline again, and they come from the nodes that are still online.
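To check the fencing/HA side, this is what I am looking at (I have not changed any HA configuration yet):
Code:
# current HA manager view: master, per-node LRM status, and service states
ha-manager status

# HA resources as configured
ha-manager config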