The two servers of the cluster randomly reboot without logs

Feb 24, 2020
Hello all,

We currently have a Proxmox cluster with two servers (at different providers, in different cities), plus another server at our company with an NFS share (for backups) and a qdevice.

Code:
(A) Proxmox Server A (Provider One) ---------------------- (B) Proxmox Server B (Provider Two)
                 |                                                          |
                 \----------------------------------------------------------/
                                               |
                  (C) Qdevice on Debian server (in the company) + NFS share

On Saturday, the two servers rebooted one hour apart, without any logs.

Both Proxmox servers run the same version with the latest updates (6.4-13).

We use only the ZFS file system. Server A is an Intel server and server B is an AMD server.

Here are the kernel logs from the reboot of server A:

...
Nov 14 00:42:45 ns399886 kernel: [394026.624237] perf: interrupt took too long (4924 > 4900), lowering kernel.perf_event_max_sample_rate to 40500
Nov 16 08:56:53 ns399886 kernel: [596475.328665] zd64: p1 p2 < p5 >
Nov 18 04:54:22 ns399886 kernel: [754724.101735] watchdog: watchdog0: watchdog did not stop!
Nov 18 04:55:52 ns399886 kernel: [ 0.000000] Linux version 5.4.143-1-pve (build@proxmox) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.143-1 (Tue, 28 Sep 2021 09:10:37 +0200) ()
Nov 18 04:55:52 ns399886 kernel: [ 0.000000] Command line: BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-5.4.143-1-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs rootdelay=10 vga=normal nomodeset rootdelay=15 noquiet nosplash
....

And 2 days later ...

Nov 20 15:50:26 ns399886 kernel: [ 180.022100] fwbr101i0: port 2(tap101i0) entered disabled state
Nov 20 15:50:26 ns399886 kernel: [ 180.022196] fwbr101i0: port 2(tap101i0) entered blocking state
Nov 20 15:50:26 ns399886 kernel: [ 180.022226] fwbr101i0: port 2(tap101i0) entered forwarding state
Nov 20 17:42:19 ns399886 kernel: [ 0.000000] Linux version 5.4.143-1-pve (build@proxmox) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.143-1 (Tue, 28 Sep 2021 09:10:37 +0200) ()
Nov 20 17:42:19 ns399886 kernel: [ 0.000000] Command line: BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-5.4.143-1-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs rootdelay=10 vga=normal nomodeset rootdelay=15 noquiet nosplash
Nov 20 17:42:19 ns399886 kernel: [ 0.000000] KERNEL supported cpus:
Nov 20 17:42:19 ns399886 kernel: [ 0.000000] Intel GenuineIntel



Here are the kernel logs from the reboot of server B:

Nov 15 17:11:41 server-hetzner kernel: [536283.455709] fwbr102i0: port 2(tap102i0) entered blocking state
Nov 15 17:11:41 server-hetzner kernel: [536283.455709] fwbr102i0: port 2(tap102i0) entered forwarding state
Nov 20 16:47:44 server-hetzner kernel: [ 0.000000] Linux version 5.4.143-1-pve (build@proxmox) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.143-1 (Tue, 28 Sep 2021 09:10:37 +0200) ()
Nov 20 16:47:44 server-hetzner kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.4.143-1-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet


Do you have any idea what causes these reboots? Currently, they occur almost once per week.

I installed the intel-microcode package on server A and amd64-microcode on server B. So far this seems to change nothing (in the kernel log, the microcode revision does not increase). I also installed the kdump package, and I am waiting for the next reboot.
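
To check whether the microcode and kdump changes actually take effect after the next boot, something like the following should work (kdump-config comes from Debian's kdump-tools package; this is just a quick sanity check, not a fix):

Code:
# Was the CPU microcode updated at early boot?
dmesg | grep -i microcode

# Microcode revision as currently reported by the kernel
grep -m1 microcode /proc/cpuinfo

# Is the crash kernel loaded and ready to capture a dump?
kdump-config show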

Meanwhile, do you have any idea about the origin of these crashes/reboots?

Best regards.
 
The server rebooted twenty minutes ago.

Here are the kernel logs from the reboot of server A:

Nov 21 05:20:30 ns399886 kernel: [41931.019689] perf: interrupt took too long (3133 > 3131), lowering kernel.perf_event_max_sample_rate to 63750
Nov 21 13:18:34 ns399886 kernel: [70615.117951] perf: interrupt took too long (3918 > 3916), lowering kernel.perf_event_max_sample_rate to 51000
Nov 22 13:06:19 ns399886 kernel: [ 0.000000] microcode: microcode updated early to revision 0x21, date = 2019-02-13
Nov 22 13:06:19 ns399886 kernel: [ 0.000000] Linux version 5.4.143-1-pve (build@proxmox) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.143-1 (Tue, 28 Sep 2021 09:10:37 +0200) ()
Nov 22 13:06:19 ns399886 kernel: [ 0.000000] Command line: BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-5.4.143-1-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs rootdelay=10 vga=normal nomodeset rootdelay=15 noquiet nosplash

As kdump had only just been installed, it was not yet active for this reboot, so no dump was captured.
 
We currently have a proxmox cluster with 2 servers (in different provider and different city)
How far apart are they? ping-wise, of course.

As far as I know, a cluster requires a stable, low-latency connection for corosync, a few milliseconds at most. (I believe I've read 2 ms.)

I am unsure whether this is a problem only when "HA" is activated, though. Did you activate High Availability?

But basically: if the reply from the other cluster members takes too long, a node will consider itself to have lost the connection. Then fencing occurs. Fencing results in rebooting...

See also https://pve.proxmox.com/pve-docs/pve-admin-guide.html#ha_manager_fencing
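
For reference, the timeout that governs this detection is corosync's token parameter in /etc/pve/corosync.conf. A minimal sketch of what raising it might look like (the cluster name and token value here are purely illustrative; note that the Proxmox documentation recommends a low-latency link rather than tuning this, and config_version must be incremented on every edit):

Code:
totem {
  cluster_name: mycluster   # illustrative name
  config_version: 4         # increment on every change
  token: 10000              # ms of silence before a node is declared lost (illustrative)
  version: 2
}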

Best regards
 
Thanks a lot, UdoB, for your explanation! It makes sense.

Yes, the HA functionality is activated.
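
For anyone checking their own setup, the HA-managed resources and quorum state can be listed like this (output will of course differ per cluster):

Code:
# list quorum/master state and all HA-managed resources
ha-manager status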

Indeed, each server is located in a different country:
- Server A: northern France
- Server B: Sweden
- Qdevice C: southern France

As the nodes don't share the same disk storage (we synchronize each virtual machine's disks between the nodes every 5 minutes via ZFS), how can we raise the delay before a server fences itself in our cluster?
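
For context, our 5-minute sync is essentially incremental ZFS replication along these lines (the dataset, snapshot, and host names are illustrative):

Code:
# on node A: snapshot the VM disk, then send only the increment to node B
zfs snapshot rpool/data/vm-101-disk-0@sync-new
zfs send -i @sync-prev rpool/data/vm-101-disk-0@sync-new | ssh nodeB zfs recv -F rpool/data/vm-101-disk-0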

Indeed, we don't need millisecond-level protection, just that a virtual machine starts on another node if the original node doesn't answer for 1 minute (for example).

Currently:

The ping between servers A and B averages 35 ms.
The ping between servers A and C averages 15 ms.
The ping between servers B and C averages 47 ms.

Thanks in advance for your help.

Best regards.
 
Hello all!

Since your message, UdoB, I have removed all servers from the HA group.
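
Concretely, that meant taking each HA resource out of management, along these lines (the resource ID is illustrative):

Code:
# remove a VM from HA management
ha-manager remove vm:101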

Since this modification, 5 days ago, there have been no resets.

Unfortunately, I still need some kind of availability for each virtual machine, just not High Availability: if a virtual server stops responding, I would like the second server to start the same virtual machine within 5 minutes of the outage, but without resetting the host.

Do you know a way to do that?

I have already set up a sync between the servers every 5 minutes; now I need an availability mechanism that tolerates high latency (on the order of seconds). The same thing as corosync, but where I can set the timeout between the servers myself. A rough sketch of what I mean follows.
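
Something like this watchdog, run every minute from cron on each node (the peer address, VM ID, and thresholds are illustrative, and this deliberately ignores split-brain handling):

Code:
#!/bin/sh
# Hypothetical failover watchdog: if the peer stays unreachable for
# 5 consecutive checks (~5 minutes), start the local replica of the VM.
PEER=192.0.2.10            # illustrative address of the other node
VMID=101                   # illustrative ID of the locally replicated VM
STATE=/run/peer-fail-count

if ping -c 3 -W 2 "$PEER" >/dev/null 2>&1; then
    echo 0 > "$STATE"      # peer is alive: reset the failure counter
    exit 0
fi

FAILS=$(( $(cat "$STATE" 2>/dev/null || echo 0) + 1 ))
echo "$FAILS" > "$STATE"

# after ~5 minutes of silence, start our copy if it is not already running
if [ "$FAILS" -ge 5 ] && ! qm status "$VMID" 2>/dev/null | grep -q running; then
    qm start "$VMID"
fi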

Thanks in advance for your help.

Best regards.
 
