Proxmox restarted unexpectedly

manzke89

I have a Proxmox cluster with 4 nodes: 2 nodes are in one region and the other 2 in another. Two nodes in the same region restarted. It was not a hardware or power failure, as nothing is registered in iLO. The logs from just before the restart follow:

2024-02-18T20:21:14.806280-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retry 10
2024-02-18T20:21:15.807400-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retry 20
2024-02-18T20:21:16.808346-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retry 30
2024-02-18T20:21:17.061405-03:00 pve3 corosync[2542]: [TOTEM ] Token has not been received in 3225 ms
2024-02-18T20:21:17.630506-03:00 pve3 pmxcfs[2531]: [status] notice: cpg_send_message retry 10
2024-02-18T20:21:17.809365-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retry 40
2024-02-18T20:21:18.631434-03:00 pve3 pmxcfs[2531]: [status] notice: cpg_send_message retry 20
2024-02-18T20:21:18.810387-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retry 50
2024-02-18T20:21:19.185975-03:00 pve3 corosync[2542]: [TOTEM ] Retransmit List: 5
2024-02-18T20:21:19.632457-03:00 pve3 pmxcfs[2531]: [status] notice: cpg_send_message retry 30
2024-02-18T20:21:19.811325-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retry 60
2024-02-18T20:21:20.220501-03:00 pve3 corosync[2542]: [QUORUM] Sync members[4]: 1 2 3 4
2024-02-18T20:21:20.220670-03:00 pve3 corosync[2542]: [QUORUM] Sync joined[1]: 1
2024-02-18T20:21:20.220755-03:00 pve3 corosync[2542]: [QUORUM] Sync left[1]: 1
2024-02-18T20:21:20.220950-03:00 pve3 corosync[2542]: [TOTEM ] A new membership (1.15fc) was formed. Members joined: 1 left: 1
2024-02-18T20:21:20.221038-03:00 pve3 corosync[2542]: [TOTEM ] Failed to receive the leave message. failed: 1
2024-02-18T20:21:20.633636-03:00 pve3 pmxcfs[2531]: [status] notice: cpg_send_message retry 40
2024-02-18T20:21:20.812480-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retry 70
2024-02-18T20:21:21.634563-03:00 pve3 pmxcfs[2531]: [status] notice: cpg_send_message retry 50
2024-02-18T20:21:21.813482-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retry 80
2024-02-18T20:21:22.635479-03:00 pve3 pmxcfs[2531]: [status] notice: cpg_send_message retry 60
2024-02-18T20:21:22.814443-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retry 90
2024-02-18T20:21:23.445734-03:00 pve3 corosync[2542]: [TOTEM ] Token has not been received in 3225 ms
2024-02-18T20:21:23.636519-03:00 pve3 pmxcfs[2531]: [status] notice: cpg_send_message retry 70
2024-02-18T20:21:23.815602-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retry 100
2024-02-18T20:21:23.815799-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retried 100 times
2024-02-18T20:21:23.815893-03:00 pve3 pmxcfs[2531]: [dcdb] crit: cpg_send_message failed: 6
2024-02-18T20:21:24.224327-03:00 pve3 corosync[2542]: [KNET ] link: host: 1 link: 0 is down
2024-02-18T20:21:24.224559-03:00 pve3 corosync[2542]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
2024-02-18T20:21:24.224644-03:00 pve3 corosync[2542]: [KNET ] host: host: 1 has no active links
2024-02-18T20:21:24.314455-03:00 pve3 watchdog-mux[2048]: client watchdog expired - disable watchdog updates
2024-02-18T20:21:24.520837-03:00 pve3 corosync[2542]: [TOTEM ] A processor failed, forming new configuration: token timed out (4300ms), waiting 5160ms for consensus.
2024-02-18T20:21:24.637634-03:00 pve3 pmxcfs[2531]: [status] notice: cpg_send_message retry 80
2024-02-18T20:21:24.817047-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retry 10
2024-02-18T20:21:25.638516-03:00 pve3 pmxcfs[2531]: [status] notice: cpg_send_message retry 90
2024-02-18T20:21:25.817957-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retry 20
2024-02-18T20:21:26.639501-03:00 pve3 pmxcfs[2531]: [status] notice: cpg_send_message retry 100
2024-02-18T20:21:26.639816-03:00 pve3 pmxcfs[2531]: [status] notice: cpg_send_message retried 100 times
2024-02-18T20:21:26.639970-03:00 pve3 pmxcfs[2531]: [status] crit: cpg_send_message failed: 6
2024-02-18T20:21:26.819049-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retry 30
2024-02-18T20:21:27.227451-03:00 pve3 corosync[2542]: [KNET ] rx: host: 1 link: 0 is up
2024-02-18T20:21:27.227614-03:00 pve3 corosync[2542]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
2024-02-18T20:21:27.227665-03:00 pve3 corosync[2542]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
2024-02-18T20:21:27.322892-03:00 pve3 corosync[2542]: [KNET ] pmtud: Global data MTU changed to: 1397
2024-02-18T20:21:27.645674-03:00 pve3 pmxcfs[2531]: [status] notice: cpg_send_message retry 10
2024-02-18T20:21:27.819964-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retry 40
2024-02-18T20:21:28.646623-03:00 pve3 pmxcfs[2531]: [status] notice: cpg_send_message retry 20
2024-02-18T20:21:28.820924-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retry 50
 
I have a Proxmox cluster with 4 nodes: 2 nodes are in one region and the other 2 in another. Two nodes in the same region restarted.
Are these physical regions? How far apart are they? What's the latency between them?

Please have a look at Network Requirements for clusters. They need LAN-level connectivity and should have a maximum latency of around 5ms - the lower, the better.

In your case - on a hunch - it sounds like there was a network hiccup and two nodes got cut off; two out of four nodes do not have quorum, so fencing occurred.
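
For reference, a rough way to verify the inter-site latency and what corosync itself thinks of its links (the address 10.0.0.2 below is only a placeholder for a peer node's corosync ring address):

Code:
# Rough latency check against the other datacenter's corosync address:
ping -c 100 -q 10.0.0.2

# Corosync's own view of each knet link and whether it is connected:
corosync-cfgtool -s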
 
There are 2 datacenters connected by fiber, with L2 connectivity between the nodes; the average latency is 3 ms.
If there was an outage between the 2 datacenters, why did Proxmox restart?
 
2024-02-18T20:21:24.314455-03:00 pve3 watchdog-mux[2048]: client watchdog expired - disable watchdog updates

Because they self-fenced after a network outage that lasted too long. Are you using High Availability to begin with?
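
A rough sketch of the arithmetic behind that, assuming the 4-node setup described above: quorum needs floor(4/2) + 1 = 3 votes, so a clean 2/2 split leaves neither half quorate, and any node with an armed HA watchdog that can no longer be updated resets itself after roughly a minute. The local node's view of the votes can be checked with pvecm (the grep pattern is just a convenience):

Code:
# Expected/Total votes and the quorum threshold as corosync sees them:
pvecm status | grep -E 'Expected votes|Total votes|Quorum'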
 

Alright, could you post ... from the "master":

journalctl -b -u pve-ha-crm

... and from the node that had active "lrm"

journalctl -b -u pve-ha-lrm

I see you want to redact the node names, that's fine, but please substitute them, as it's important to know which one was which.

Also, which of those rebooted? I am going to refer to them (from the lrm list) as lines 1-4. Which one is the master and which ones rebooted?
 
Master node:
root@mdc-023:~# journalctl -b -u pve-ha-crm
Feb 18 20:24:04 mdc-023 systemd[1]: Starting pve-ha-crm.service - PVE Cluster HA Resource Manager Daemon...
Feb 18 20:24:05 mdc-023 pve-ha-crm[11895]: starting server
Feb 18 20:24:05 mdc-023 pve-ha-crm[11895]: status change startup => wait_for_quorum
Feb 18 20:24:05 mdc-023 systemd[1]: Started pve-ha-crm.service - PVE Cluster HA Resource Manager Daemon.
Feb 18 20:24:06 mdc-023 pve-ha-crm[11895]: successfully acquired lock 'ha_manager_lock'
Feb 18 20:24:06 mdc-023 pve-ha-crm[11895]: watchdog active
Feb 18 20:24:06 mdc-023 pve-ha-crm[11895]: status change wait_for_quorum => master
Feb 18 20:24:06 mdc-023 pve-ha-crm[11895]: node 'mdc-025': state changed from 'online' => 'unknown'
Feb 18 20:25:16 mdc-023 pve-ha-crm[11895]: node 'mdc-025': state changed from 'unknown' => 'online'
Feb 18 21:47:20 mdc-023 pve-ha-crm[11895]: loop take too long (46 seconds)
Feb 18 21:48:33 mdc-023 pve-ha-crm[11895]: loop take too long (63 seconds)

every node has lrm idle


root@mdc-022:~# ha-manager status
quorum OK
master mdc-023 (active, Mon Feb 19 09:51:15 2024)
lrm mdc-022 (idle, Mon Feb 19 09:51:14 2024)
lrm mdc-023 (idle, Mon Feb 19 09:51:13 2024)
lrm mdc-025 (active, Mon Feb 19 09:51:12 2024)


root@mdc-025:~# journalctl -b -u pve-ha-lrm
Feb 18 20:25:10 mdc-025 systemd[1]: Starting pve-ha-lrm.service - PVE Local HA Resource Manager Daemon...
Feb 18 20:25:11 mdc-025 pve-ha-lrm[12989]: starting server
Feb 18 20:25:11 mdc-025 pve-ha-lrm[12989]: status change startup => wait_for_agent_lock
Feb 18 20:25:11 mdc-025 systemd[1]: Started pve-ha-lrm.service - PVE Local HA Resource Manager Daemon.
Feb 18 20:25:17 mdc-025 pve-ha-lrm[12989]: successfully acquired lock 'ha_agent_mdc-025_lock'
Feb 18 20:25:17 mdc-025 pve-ha-lrm[12989]: watchdog active
Feb 18 20:25:17 mdc-025 pve-ha-lrm[12989]: status change wait_for_agent_lock => active
Feb 18 20:25:17 mdc-025 pve-ha-lrm[13227]: starting service vm:2501
Feb 18 20:25:17 mdc-025 pve-ha-lrm[13228]: start VM 2501: UPID:mdc-025:000033AC:0000110E:65D291DD:qmstart:2501:root@pam:
Feb 18 20:25:17 mdc-025 pve-ha-lrm[13227]: <root@pam> starting task UPID:mdc-025:000033AC:0000110E:65D291DD:qmstart:2501:root@pam:
Feb 18 20:25:19 mdc-025 pve-ha-lrm[13227]: <root@pam> end task UPID:mdc-025:000033AC:0000110E:65D291DD:qmstart:2501:root@pam: OK
Feb 18 20:25:19 mdc-025 pve-ha-lrm[13227]: service status vm:2501 started
Feb 18 21:47:12 mdc-025 pve-ha-lrm[12989]: unable to write lrm status file - unable to open file '/etc/pve/nodes/mdc-025/lrm_status.tmp.12989' - Device or resource busy
Feb 18 21:47:12 mdc-025 pve-ha-lrm[12989]: loop take too long (38 seconds)
Feb 18 21:48:33 mdc-025 pve-ha-lrm[12989]: loop take too long (63 seconds)
Feb 19 09:46:42 mdc-025 pve-ha-lrm[3788377]: missing resource configuration for 'vm:2501'
Feb 19 09:56:53 mdc-025 pve-ha-lrm[12989]: node had no service configured for 60 rounds, going idle.
Feb 19 09:56:53 mdc-025 pve-ha-lrm[12989]: watchdog closed (disabled)
Feb 19 09:56:53 mdc-025 pve-ha-lrm[12989]: status change active => wait_for_agent_lock


The nodes that restarted were 23 and 25.
 

I am a bit confused by this. You had 4 LRMs before; now there are 3, and 25 shows as active, but you mention all are idle?

Since you do not mind, and this reboot occurred not long ago (Feb 18 20:25-ish?), can you simply post/attach the following from EACH node:

Code:
journalctl -u corosync -u pve-cluster -u pve-ha-crm -u pve-ha-lrm -u watchdog-mux --since=-1week
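
If attaching these somewhere, one file per node keeps it manageable; a minimal sketch (the output path and filename are only examples):

Code:
# Dump the last week of cluster/HA logs into one compressed file per node:
journalctl -u corosync -u pve-cluster -u pve-ha-crm -u pve-ha-lrm -u watchdog-mux --since=-1week > /tmp/$(hostname)-cluster.log
gzip /tmp/$(hostname)-cluster.log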
 
Follow the link for the log files:

https://file.io/UmUbC0qolgOh
 
Alright, what's with VM 2501, and did you do any HA testing at all prior to this?
 
Can you provide the full previous-boot logs for 23 and 25?

Code:
journalctl -u corosync -u pve-cluster -u pve-ha-crm -u pve-ha-lrm -u watchdog-mux -b -1
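
If a node has been rebooted more than once since the incident, -b -1 may not be the boot you want; the boot list can be checked first, e.g.:

Code:
# Lists available boots with their IDs and time ranges; pick the offset
# that covers Feb 18 and pass it to -b:
journalctl --list-boots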

Yes, I had done tests with this VM in the past, but it no longer exists, and the one it was part of also no longer exists.

The one it was part of? Do you mean you had e.g. removed a node?
 
I am facing this problem too. I have 6 nodes; when 1 node suddenly has a connectivity problem, all the nodes reboot and all the VMs are dead. The cluster should still have quorum when only 1 node is down.
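
A minimal way to check whether this is the same self-fencing pattern, assuming the rebooted nodes still have their previous-boot journal:

Code:
# Look for 'client watchdog expired' and lost knet links in the boot before the reset:
journalctl -b -1 -u watchdog-mux -u corosync -u pve-ha-lrm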
 
