Hello all,
We currently have a Proxmox cluster with two servers, hosted by different providers in different cities, plus a third server in our company that provides an NFS share (for backups) and the QDevice.
Code:
(A) Proxmox Server A (Provider One) ---------------------- (B) Proxmox Server B (Provider Two)
                 |                                                          |
                 \----------------------------------------------------------/
                                              |
                  (C) QDevice on Debian server (in the company) + NFS share
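With two nodes plus a QDevice, the cluster normally counts three votes. Here is a minimal Python sketch of that arithmetic. It assumes node A, node B and the QDevice each carry one vote, which is typical for a two-node cluster but is not stated above. It only counts votes and ignores how qnetd decides which side of a split receives the QDevice vote.

```python
"""Sketch of the vote arithmetic for a two-node cluster plus a QDevice.

Assumption (not from the original post): node A, node B and the QDevice (C)
each count as one vote. This only counts votes; in a real split, qnetd
decides which partition receives the QDevice vote.
"""
from typing import Iterable

# Hypothetical vote assignment for the layout in the diagram.
VOTES = {"A": 1, "B": 1, "C": 1}
EXPECTED_VOTES = sum(VOTES.values())  # 3


def quorum_threshold(expected_votes: int) -> int:
    """Strict majority of the expected votes."""
    return expected_votes // 2 + 1


def partition_votes(members: Iterable[str]) -> int:
    """Total votes held by the members on one side of a network split."""
    return sum(VOTES[m] for m in members)


def is_quorate(members: Iterable[str]) -> bool:
    """True if this group of members holds at least the quorum threshold."""
    return partition_votes(members) >= quorum_threshold(EXPECTED_VOTES)


if __name__ == "__main__":
    scenarios = {
        "all links up": ["A", "B", "C"],
        "A cut off from B, still reaches C": ["A", "C"],
        "A cut off from both B and C": ["A"],
    }
    for name, members in scenarios.items():
        print(f"{name}: {partition_votes(members)}/{EXPECTED_VOTES} votes, "
              f"quorate={is_quorate(members)}")
```

Under these assumptions, a node that loses contact with both the other node and the QDevice holds 1 of 3 votes, which is below the threshold of 2.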
On Saturday (20 November), both servers rebooted about one hour apart, and neither left any log messages explaining why.
Both Proxmox servers run the same version with the latest updates (6.4-13).
We use only the ZFS file system. Server A is an Intel machine and server B is an AMD machine.
Here are the kernel logs around the reboots of server A:
...
Nov 14 00:42:45 ns399886 kernel: [394026.624237] perf: interrupt took too long (4924 > 4900), lowering kernel.perf_event_max_sample_rate to 40500
Nov 16 08:56:53 ns399886 kernel: [596475.328665] zd64: p1 p2 < p5 >
Nov 18 04:54:22 ns399886 kernel: [754724.101735] watchdog: watchdog0: watchdog did not stop!
Nov 18 04:55:52 ns399886 kernel: [ 0.000000] Linux version 5.4.143-1-pve (build@proxmox) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.143-1 (Tue, 28 Sep 2021 09:10:37 +0200) ()
Nov 18 04:55:52 ns399886 kernel: [ 0.000000] Command line: BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-5.4.143-1-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs rootdelay=10 vga=normal nomodeset rootdelay=15 noquiet nosplash
....
And two days later:
Nov 20 15:50:26 ns399886 kernel: [ 180.022100] fwbr101i0: port 2(tap101i0) entered disabled state
Nov 20 15:50:26 ns399886 kernel: [ 180.022196] fwbr101i0: port 2(tap101i0) entered blocking state
Nov 20 15:50:26 ns399886 kernel: [ 180.022226] fwbr101i0: port 2(tap101i0) entered forwarding state
Nov 20 17:42:19 ns399886 kernel: [ 0.000000] Linux version 5.4.143-1-pve (build@proxmox) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.143-1 (Tue, 28 Sep 2021 09:10:37 +0200) ()
Nov 20 17:42:19 ns399886 kernel: [ 0.000000] Command line: BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-5.4.143-1-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs rootdelay=10 vga=normal nomodeset rootdelay=15 noquiet nosplash
Nov 20 17:42:19 ns399886 kernel: [ 0.000000] KERNEL supported cpus:
Nov 20 17:42:19 ns399886 kernel: [ 0.000000] Intel GenuineIntel
Here are the kernel logs around the reboot of server B:
Nov 15 17:11:41 server-hetzner kernel: [536283.455709] fwbr102i0: port 2(tap102i0) entered blocking state
Nov 15 17:11:41 server-hetzner kernel: [536283.455709] fwbr102i0: port 2(tap102i0) entered forwarding state
Nov 20 16:47:44 server-hetzner kernel: [ 0.000000] Linux version 5.4.143-1-pve (build@proxmox) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.143-1 (Tue, 28 Sep 2021 09:10:37 +0200) ()
Nov 20 16:47:44 server-hetzner kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.4.143-1-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet
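Both hosts print the kernel uptime in brackets, so a new boot shows up as a reset to 0.000000. A rough Python sketch like the one below could list every boot in a kernel log, together with the last line logged before it and the uptime at that moment. It assumes the syslog line format shown above with one message per line. The script and helper names are hypothetical, and because syslog omits the year, 2021 is assumed when converting timestamps.

```python
"""Hypothetical helper: list reboots in a kernel log and the last line before each.

Assumptions: syslog-style kernel lines such as
    Nov 18 04:55:52 ns399886 kernel: [    0.000000] Linux version ...
with one message per line. A new boot is detected when a host's bracketed
kernel uptime drops back to 0.000000; lines without that prefix are skipped.
"""
import re
import sys
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List

LINE_RE = re.compile(
    r"^(?P<stamp>\w{3} +\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) kernel: "
    r"\[\s*(?P<uptime>\d+\.\d+)\] (?P<msg>.*)$"
)


@dataclass
class Reboot:
    host: str
    boot_stamp: str       # syslog time of the first line of the new boot
    last_stamp: str       # syslog time of the last line logged before it
    last_uptime_s: float  # kernel uptime (seconds) of that last line
    last_msg: str


def find_reboots(lines: Iterable[str]) -> List[Reboot]:
    reboots: List[Reboot] = []
    previous: dict = {}  # host -> last parsed line (re.Match) for that host
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue
        host = m["host"]
        prev = previous.get(host)
        # Uptime back at zero after a non-zero value means the host rebooted.
        if prev is not None and float(m["uptime"]) == 0.0 and float(prev["uptime"]) > 0.0:
            reboots.append(Reboot(host, m["stamp"], prev["stamp"],
                                  float(prev["uptime"]), prev["msg"]))
        previous[host] = m
    return reboots


def to_datetime(stamp: str, year: int = 2021) -> datetime:
    """syslog omits the year; 2021 is an assumption for these excerpts."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


if __name__ == "__main__":
    # Usage (file name is only an example): python3 find_reboots.py /var/log/kern.log
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
            for r in find_reboots(fh):
                print(f"{r.host}: booted {r.boot_stamp}; last line {r.last_stamp} "
                      f"(uptime {r.last_uptime_s / 86400:.1f} d): {r.last_msg}")
```

On the (unwrapped) excerpts above, this would report the watchdog line, at an uptime of about 8.7 days, as the last message before the 18 November reboot of server A. It would also show a gap of 54 min 35 s between the 20 November reboots of server B (16:47:44) and server A (17:42:19).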
Do you have any idea what is causing these reboots? At the moment, they happen roughly once a week.
I installed the intel-microcode package on server A and the amd64-microcode package on server B. So far this seems to have made no difference: the microcode revision reported in the kernel log has not increased. I also installed the kdump package and am now waiting for the next reboot.
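To check the microcode and kdump side, a rough Python sketch could help. It is meant for Linux x86 only. The paths are standard procfs/sysfs files, but the helper names are hypothetical. It prints the microcode revisions currently loaded, whether `crashkernel=` is on the kernel command line, and whether a crash kernel is loaded. kdump can only capture a crash if memory is reserved with `crashkernel=`. None of the command lines in the logs above contains it, although those boots happened before kdump was installed.

```python
"""Sketch: check loaded CPU microcode and whether kdump is armed.

Assumptions: a Linux x86 host, where /proc/cpuinfo has one
"microcode : 0x..." line per logical CPU, /proc/cmdline holds the kernel
command line, and /sys/kernel/kexec_crash_loaded reads "1" once a crash
kernel has been loaded.
"""
import os
import re
from typing import Set

MICROCODE_RE = re.compile(r"^microcode\s*:\s*(0x[0-9a-fA-F]+)[ \t]*$", re.MULTILINE)


def microcode_revisions(cpuinfo_text: str) -> Set[int]:
    """Distinct microcode revisions reported across all CPUs."""
    return {int(rev, 16) for rev in MICROCODE_RE.findall(cpuinfo_text)}


def has_crashkernel(cmdline: str) -> bool:
    """kdump needs memory reserved with crashkernel= on the command line."""
    return any(arg.startswith("crashkernel=") for arg in cmdline.split())


def read_text(path: str) -> str:
    with open(path, encoding="utf-8") as fh:
        return fh.read()


if __name__ == "__main__":
    # Each check is skipped if the file does not exist (e.g. non-Linux host).
    if os.path.exists("/proc/cpuinfo"):
        revs = microcode_revisions(read_text("/proc/cpuinfo"))
        print("microcode revisions:", [hex(r) for r in sorted(revs)])
    if os.path.exists("/proc/cmdline"):
        print("crashkernel= present:", has_crashkernel(read_text("/proc/cmdline")))
    flag = "/sys/kernel/kexec_crash_loaded"
    if os.path.exists(flag):
        print("crash kernel loaded:", read_text(flag).strip() == "1")
```

Running it before and after the next reboot would show whether the microcode revision changed and whether kdump is actually ready to capture a crash.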
In the meantime, do you have any idea what could be causing these crashes/reboots?
Best regards.