Proxmox restarted unexpectedly

manzke89

I have a Proxmox cluster with 4 nodes: 2 nodes are in one region and the other 2 in another. Two nodes in the same region restarted. It was not a hardware or power failure, as nothing is registered in iLO. The logs from just before the restart follow:

2024-02-18T20:21:14.806280-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retry 10
2024-02-18T20:21:15.807400-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retry 20
2024-02-18T20:21:16.808346-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retry 30
2024-02-18T20:21:17.061405-03:00 pve3 corosync[2542]: [TOTEM ] Token has not been received in 3225 ms
2024-02-18T20:21:17.630506-03:00 pve3 pmxcfs[2531]: [status] notice: cpg_send_message retry 10
2024-02-18T20:21:17.809365-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retry 40
2024-02-18T20:21:18.631434-03:00 pve3 pmxcfs[2531]: [status] notice: cpg_send_message retry 20
2024-02-18T20:21:18.810387-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retry 50
2024-02-18T20:21:19.185975-03:00 pve3 corosync[2542]: [TOTEM ] Retransmit List: 5
2024-02-18T20:21:19.632457-03:00 pve3 pmxcfs[2531]: [status] notice: cpg_send_message retry 30
2024-02-18T20:21:19.811325-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retry 60
2024-02-18T20:21:20.220501-03:00 pve3 corosync[2542]: [QUORUM] Sync members[4]: 1 2 3 4
2024-02-18T20:21:20.220670-03:00 pve3 corosync[2542]: [QUORUM] Sync joined[1]: 1
2024-02-18T20:21:20.220755-03:00 pve3 corosync[2542]: [QUORUM] Sync left[1]: 1
2024-02-18T20:21:20.220950-03:00 pve3 corosync[2542]: [TOTEM ] A new membership (1.15fc) was formed. Members joined: 1 left: 1
2024-02-18T20:21:20.221038-03:00 pve3 corosync[2542]: [TOTEM ] Failed to receive the leave message. failed: 1
2024-02-18T20:21:20.633636-03:00 pve3 pmxcfs[2531]: [status] notice: cpg_send_message retry 40
2024-02-18T20:21:20.812480-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retry 70
2024-02-18T20:21:21.634563-03:00 pve3 pmxcfs[2531]: [status] notice: cpg_send_message retry 50
2024-02-18T20:21:21.813482-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retry 80
2024-02-18T20:21:22.635479-03:00 pve3 pmxcfs[2531]: [status] notice: cpg_send_message retry 60
2024-02-18T20:21:22.814443-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retry 90
2024-02-18T20:21:23.445734-03:00 pve3 corosync[2542]: [TOTEM ] Token has not been received in 3225 ms
2024-02-18T20:21:23.636519-03:00 pve3 pmxcfs[2531]: [status] notice: cpg_send_message retry 70
2024-02-18T20:21:23.815602-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retry 100
2024-02-18T20:21:23.815799-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retried 100 times
2024-02-18T20:21:23.815893-03:00 pve3 pmxcfs[2531]: [dcdb] crit: cpg_send_message failed: 6
2024-02-18T20:21:24.224327-03:00 pve3 corosync[2542]: [KNET ] link: host: 1 link: 0 is down
2024-02-18T20:21:24.224559-03:00 pve3 corosync[2542]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
2024-02-18T20:21:24.224644-03:00 pve3 corosync[2542]: [KNET ] host: host: 1 has no active links
2024-02-18T20:21:24.314455-03:00 pve3 watchdog-mux[2048]: client watchdog expired - disable watchdog updates
2024-02-18T20:21:24.520837-03:00 pve3 corosync[2542]: [TOTEM ] A processor failed, forming new configuration: token timed out (4300ms), waiting 5160ms for consensus.
2024-02-18T20:21:24.637634-03:00 pve3 pmxcfs[2531]: [status] notice: cpg_send_message retry 80
2024-02-18T20:21:24.817047-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retry 10
2024-02-18T20:21:25.638516-03:00 pve3 pmxcfs[2531]: [status] notice: cpg_send_message retry 90
2024-02-18T20:21:25.817957-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retry 20
2024-02-18T20:21:26.639501-03:00 pve3 pmxcfs[2531]: [status] notice: cpg_send_message retry 100
2024-02-18T20:21:26.639816-03:00 pve3 pmxcfs[2531]: [status] notice: cpg_send_message retried 100 times
2024-02-18T20:21:26.639970-03:00 pve3 pmxcfs[2531]: [status] crit: cpg_send_message failed: 6
2024-02-18T20:21:26.819049-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retry 30
2024-02-18T20:21:27.227451-03:00 pve3 corosync[2542]: [KNET ] rx: host: 1 link: 0 is up
2024-02-18T20:21:27.227614-03:00 pve3 corosync[2542]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
2024-02-18T20:21:27.227665-03:00 pve3 corosync[2542]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
2024-02-18T20:21:27.322892-03:00 pve3 corosync[2542]: [KNET ] pmtud: Global data MTU changed to: 1397
2024-02-18T20:21:27.645674-03:00 pve3 pmxcfs[2531]: [status] notice: cpg_send_message retry 10
2024-02-18T20:21:27.819964-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retry 40
2024-02-18T20:21:28.646623-03:00 pve3 pmxcfs[2531]: [status] notice: cpg_send_message retry 20
2024-02-18T20:21:28.820924-03:00 pve3 pmxcfs[2531]: [dcdb] notice: cpg_send_message retry 50
 
I have a Proxmox cluster with 4 nodes: 2 nodes are in one region and the other 2 in another. Two nodes in the same region restarted.
Are these physical regions? How far apart are they? What's the latency between them?

Please have a look at Network Requirements for clusters. They need LAN-level connectivity and should have a maximum latency of around 5ms - the lower, the better.

In your case - on a hunch - it sounds like there was a network hiccup and two nodes got cut off; two out of four nodes do not have quorum, so fencing occurred.
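
For reference, a rough way to verify the inter-site latency and what corosync itself thinks of its links (the address 10.0.0.2 below is only a placeholder for a peer node's corosync ring address):

Code:
# Rough latency check against the other datacenter's corosync address:
ping -c 100 -q 10.0.0.2

# Corosync's own view of each knet link and whether it is connected:
corosync-cfgtool -s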
 
There are 2 datacenters connected by fiber, with L2 connectivity between the nodes; the average latency is 3 ms.
If there was an outage between the 2 datacenters, why did Proxmox restart?
 
2024-02-18T20:21:24.314455-03:00 pve3 watchdog-mux[2048]: client watchdog expired - disable watchdog updates

Because they self-fenced after a network outage that lasted too long. Are you using High Availability to begin with?
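
A rough sketch of the arithmetic behind that, assuming the 4-node setup described above: quorum needs floor(4/2) + 1 = 3 votes, so a clean 2/2 split leaves neither half quorate, and any node with an armed HA watchdog that can no longer be updated resets itself after roughly a minute. The local node's view of the votes can be checked with pvecm (the grep pattern is just a convenience):

Code:
# Expected/Total votes and the quorum threshold as corosync sees them:
pvecm status | grep -E 'Expected votes|Total votes|Quorum'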
 

Alright, could you post ... from the "master":

journalctl -b -u pve-ha-crm

... and from the node that had active "lrm"

journalctl -b -u pve-ha-lrm

I see you want to redact the node names, that's fine, but please substitute them, as it's important to know which one was which.

Also, which of those rebooted? I am going to refer to them (from the lrm list) as lines 1-4. Which one is the master and which ones rebooted?
 
Master node:
root@mdc-023:~# journalctl -b -u pve-ha-crm
Feb 18 20:24:04 mdc-023 systemd[1]: Starting pve-ha-crm.service - PVE Cluster HA Resource Manager Daemon...
Feb 18 20:24:05 mdc-023 pve-ha-crm[11895]: starting server
Feb 18 20:24:05 mdc-023 pve-ha-crm[11895]: status change startup => wait_for_quorum
Feb 18 20:24:05 mdc-023 systemd[1]: Started pve-ha-crm.service - PVE Cluster HA Resource Manager Daemon.
Feb 18 20:24:06 mdc-023 pve-ha-crm[11895]: successfully acquired lock 'ha_manager_lock'
Feb 18 20:24:06 mdc-023 pve-ha-crm[11895]: watchdog active
Feb 18 20:24:06 mdc-023 pve-ha-crm[11895]: status change wait_for_quorum => master
Feb 18 20:24:06 mdc-023 pve-ha-crm[11895]: node 'mdc-025': state changed from 'online' => 'unknown'
Feb 18 20:25:16 mdc-023 pve-ha-crm[11895]: node 'mdc-025': state changed from 'unknown' => 'online'
Feb 18 21:47:20 mdc-023 pve-ha-crm[11895]: loop take too long (46 seconds)
Feb 18 21:48:33 mdc-023 pve-ha-crm[11895]: loop take too long (63 seconds)

every node has lrm idle


root@mdc-022:~# ha-manager status
quorum OK
master mdc-023 (active, Mon Feb 19 09:51:15 2024)
lrm mdc-022 (idle, Mon Feb 19 09:51:14 2024)
lrm mdc-023 (idle, Mon Feb 19 09:51:13 2024)
lrm mdc-025 (active, Mon Feb 19 09:51:12 2024)


root@mdc-025:~# journalctl -b -u pve-ha-lrm
Feb 18 20:25:10 mdc-025 systemd[1]: Starting pve-ha-lrm.service - PVE Local HA Resource Manager Daemon...
Feb 18 20:25:11 mdc-025 pve-ha-lrm[12989]: starting server
Feb 18 20:25:11 mdc-025 pve-ha-lrm[12989]: status change startup => wait_for_agent_lock
Feb 18 20:25:11 mdc-025 systemd[1]: Started pve-ha-lrm.service - PVE Local HA Resource Manager Daemon.
Feb 18 20:25:17 mdc-025 pve-ha-lrm[12989]: successfully acquired lock 'ha_agent_mdc-025_lock'
Feb 18 20:25:17 mdc-025 pve-ha-lrm[12989]: watchdog active
Feb 18 20:25:17 mdc-025 pve-ha-lrm[12989]: status change wait_for_agent_lock => active
Feb 18 20:25:17 mdc-025 pve-ha-lrm[13227]: starting service vm:2501
Feb 18 20:25:17 mdc-025 pve-ha-lrm[13228]: start VM 2501: UPID:mdc-025:000033AC:0000110E:65D291DD:qmstart:2501:root@pam:
Feb 18 20:25:17 mdc-025 pve-ha-lrm[13227]: <root@pam> starting task UPID:mdc-025:000033AC:0000110E:65D291DD:qmstart:2501:root@pam:
Feb 18 20:25:19 mdc-025 pve-ha-lrm[13227]: <root@pam> end task UPID:mdc-025:000033AC:0000110E:65D291DD:qmstart:2501:root@pam: OK
Feb 18 20:25:19 mdc-025 pve-ha-lrm[13227]: service status vm:2501 started
Feb 18 21:47:12 mdc-025 pve-ha-lrm[12989]: unable to write lrm status file - unable to open file '/etc/pve/nodes/mdc-025/lrm_status.tmp.12989' - Device or resource busy
Feb 18 21:47:12 mdc-025 pve-ha-lrm[12989]: loop take too long (38 seconds)
Feb 18 21:48:33 mdc-025 pve-ha-lrm[12989]: loop take too long (63 seconds)
Feb 19 09:46:42 mdc-025 pve-ha-lrm[3788377]: missing resource configuration for 'vm:2501'
Feb 19 09:56:53 mdc-025 pve-ha-lrm[12989]: node had no service configured for 60 rounds, going idle.
Feb 19 09:56:53 mdc-025 pve-ha-lrm[12989]: watchdog closed (disabled)
Feb 19 09:56:53 mdc-025 pve-ha-lrm[12989]: status change active => wait_for_agent_lock


The nodes that restarted were 23 and 25.
 

I am a bit confused by this. You had 4 LRMs before; now there are 3, and 25 shows as active, but you mention all are idle?

Since you do not mind, and this reboot occurred not long ago (Feb 18 20:25-ish?), can you simply post/attach the following from EACH node:

Code:
journalctl -u corosync -u pve-cluster -u pve-ha-crm -u pve-ha-lrm -u watchdog-mux --since=-1week
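
If attaching these somewhere, one file per node keeps it manageable; a minimal sketch (the output path and filename are only examples):

Code:
# Dump the last week of cluster/HA logs into one compressed file per node:
journalctl -u corosync -u pve-cluster -u pve-ha-crm -u pve-ha-lrm -u watchdog-mux --since=-1week > /tmp/$(hostname)-cluster.log
gzip /tmp/$(hostname)-cluster.log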
 
Follow the link for the log files:

https://file.io/UmUbC0qolgOh
 
Alright, what's with VM 2501, and did you do any HA testing at all prior to this?
 
Can you provide the full previous-boot logs for 23 and 25?

Code:
journalctl -u corosync -u pve-cluster -u pve-ha-crm -u pve-ha-lrm -u watchdog-mux -b -1
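
If a node has been rebooted more than once since the incident, -b -1 may not be the boot you want; the boot list can be checked first, e.g.:

Code:
# Lists available boots with their IDs and time ranges; pick the offset
# that covers Feb 18 and pass it to -b:
journalctl --list-boots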

Yes, I had done tests with this VM in the past, but it no longer exists, and the one it was part of also no longer exists.

The one it was part of? Do you mean you had e.g. removed a node?
 
I am facing this problem too. I have 6 nodes; when 1 node suddenly has a connectivity problem, all the nodes reboot and all the VMs are dead. The cluster should still have quorum when only 1 node is down.
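
A minimal way to check whether this is the same self-fencing pattern, assuming the rebooted nodes still have their previous-boot journal:

Code:
# Look for 'client watchdog expired' and lost knet links in the boot before the reset:
journalctl -b -1 -u watchdog-mux -u corosync -u pve-ha-lrm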
 
