Why is the server rebooting automatically?

cmonty14

Hi,

I'm operating a 7-node cluster with Ceph.

From time to time I notice that a node reboots on its own.
Now I want to analyse what is triggering this reboot.

The output of last -x | head | tac indicates that the last reboot was triggered by kernel 5.3.10-1-pve:
root@ld5506:~# last -x | head | tac
root pts/3 tmux(31958).%2 Tue Dec 10 11:04 - crash (1+17:25)
reboot system boot 5.3.10-1-pve Thu Dec 12 04:29 still running
runlevel (to lvl 5) 5.3.10-1-pve Thu Dec 12 04:33 still running
root pts/0 10.177.32.32 Thu Dec 12 08:47 still logged in
root pts/1 tmux(248633).%0 Thu Dec 12 08:47 still logged in
root pts/2 tmux(248633).%1 Thu Dec 12 08:47 still logged in
root pts/3 tmux(248633).%2 Thu Dec 12 08:51 still logged in
root pts/4 tmux(248633).%3 Thu Dec 12 10:01 still logged in
root pts/5 tmux(248633).%4 Fri Dec 13 10:19 still logged in
root pts/6 tmux(248633).%5 Fri Dec 13 10:23 still logged in


If this is true, my question is:
Why is the kernel triggering a reboot?

If not, the question is:
What should I check next in order to identify the root cause?

THX
 
Is it a normal reboot?
Anything in the journal, syslogs or kernel logs that can give a hint?
Do you have any HA resources defined?
Is it always the same node?
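
One way to approach the "normal reboot" question is to look at the `last -x` records themselves. A clean shutdown normally writes a `shutdown` pseudo-entry before the next `reboot` entry, and sessions cut off by an unclean restart are shown as ending in `crash`. The following is only a minimal sketch under that assumption; the function name `classify_reboots` is made up for illustration, and the input must be in chronological order (as produced by `last -x | tac`). Note that piping through `head` truncates the list, so a missing `shutdown` record in a short excerpt is a hint, not proof.

```python
def classify_reboots(lines):
    """Label each 'reboot' record from `last -x` as clean or unclean.

    `lines` must be oldest-first (e.g. the output of `last -x | tac`).
    A reboot counts as "clean" only if a 'shutdown' record was seen
    since the previous reboot (or since the start of the input).
    """
    results = []
    saw_shutdown = False
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "shutdown":
            saw_shutdown = True
        elif fields[0] == "reboot":
            label = "clean" if saw_shutdown else "unclean"
            results.append((line.strip(), label))
            saw_shutdown = False  # reset for the next boot cycle
    return results


# Lines taken from the output posted above (oldest first).
SAMPLE = [
    "root pts/3 tmux(31958).%2 Tue Dec 10 11:04 - crash (1+17:25)",
    "reboot system boot 5.3.10-1-pve Thu Dec 12 04:29 still running",
    "runlevel (to lvl 5) 5.3.10-1-pve Thu Dec 12 04:33 still running",
]

if __name__ == "__main__":
    for record, label in classify_reboots(SAMPLE):
        print(label, "|", record)
```

With the posted excerpt, the only `reboot` record has no preceding `shutdown` record, so the sketch labels it unclean. The kernel version in that record is the kernel that was booted, not necessarily what caused the restart.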
 
Hi,
I have this information in /var/log/messages:
[...]
Dec 12 04:15:06 ld5506 kernel: [148842.580673] usb 1-1.1: SerialNumber: 20171009
Dec 12 04:15:06 ld5506 kernel: [148842.602380] hidraw: raw HID events driver (C) Jiri Kosina
Dec 12 04:15:06 ld5506 kernel: [148842.608082] usbcore: registered new interface driver usbhid
Dec 12 04:15:06 ld5506 kernel: [148842.609055] usbhid: USB HID core driver
Dec 12 04:15:06 ld5506 kernel: [148842.610537] usbcore: registered new interface driver usbmouse
Dec 12 04:15:06 ld5506 kernel: [148842.611607] usbcore: registered new interface driver usbkbd
Dec 12 04:15:06 ld5506 kernel: [148842.616303] input: Avocent Keyboard/Mouse Function as /devices/pci0000:00/0000:00:14.0/usb1/1-1/1-1.1/1-1.1:1.0/0003:17EF:B000.0001/input/input2
Dec 12 04:15:06 ld5506 kernel: [148842.673610] hid-generic 0003:17EF:B000.0001: input,hidraw0: USB HID v1.00 Keyboard [Avocent Keyboard/Mouse Function] on usb-0000:00:14.0-1.1/input0
Dec 12 04:15:06 ld5506 kernel: [148842.677077] input: Avocent Keyboard/Mouse Function as /devices/pci0000:00/0000:00:14.0/usb1/1-1/1-1.1/1-1.1:1.1/0003:17EF:B000.0002/input/input3
Dec 12 04:15:06 ld5506 kernel: [148842.680817] hid-generic 0003:17EF:B000.0002: input,hidraw1: USB HID v1.00 Mouse [Avocent Keyboard/Mouse Function] on usb-0000:00:14.0-1.1/input1
Dec 12 04:15:06 ld5506 kernel: [148842.683197] input: Avocent Keyboard/Mouse Function as /devices/pci0000:00/0000:00:14.0/usb1/1-1/1-1.1/1-1.1:1.2/0003:17EF:B000.0003/input/input4
Dec 12 04:15:06 ld5506 kernel: [148842.685710] hid-generic 0003:17EF:B000.0003: input,hidraw2: USB HID v1.00 Mouse [Avocent Keyboard/Mouse Function] on usb-0000:00:14.0-1.1/input2
Dec 12 04:15:07 ld5506 kernel: [148843.761419] usb 1-1.6: new high-speed USB device number 6 using xhci_hcd
Dec 12 04:15:07 ld5506 kernel: [148843.862995] usb 1-1.6: New USB device found, idVendor=04b3, idProduct=4010, bcdDevice= 3.14
Dec 12 04:15:07 ld5506 kernel: [148843.864389] usb 1-1.6: New USB device strings: Mfr=1, Product=2, SerialNumber=0
Dec 12 04:15:07 ld5506 kernel: [148843.865800] usb 1-1.6: Product: XClarity Controller
Dec 12 04:15:07 ld5506 kernel: [148843.867158] usb 1-1.6: Manufacturer: IBM
Dec 12 04:15:07 ld5506 kernel: [148843.870961] cdc_ether 1-1.6:1.0 usb0: register 'cdc_ether' at usb-0000:00:14.0-1.6, CDC Ethernet Device, 7e:d3:0a:60:9b:1f
Dec 12 04:15:07 ld5506 kernel: [148843.878125] cdc_ether 1-1.6:1.0 enp0s20f0u1u6: renamed from usb0
Dec 12 04:16:12 ld5506 kernel: [148908.958402] power_meter ACPI000D:00: Found ACPI power meter.
Dec 12 04:16:12 ld5506 kernel: [148908.960504] power_meter ACPI000D:00: Ignoring unsafe software power cap!
Dec 12 04:16:23 ld5506 kernel: [148919.385924] usb 1-1.1: USB disconnect, device number 5
Dec 12 04:18:12 ld5506 kernel: [149028.186267] usb 1-1.6: USB disconnect, device number 6
Dec 12 04:18:12 ld5506 kernel: [149028.188432] cdc_ether 1-1.6:1.0 enp0s20f0u1u6: unregister 'cdc_ether' usb-0000:00:14.0-1.6, CDC Ethernet Device
Dec 12 04:18:14 ld5506 kernel: [149031.018326] usb 1-1.6: new high-speed USB device number 7 using xhci_hcd
Dec 12 04:18:15 ld5506 kernel: [149031.119969] usb 1-1.6: New USB device found, idVendor=04b3, idProduct=4010, bcdDevice= 3.14
Dec 12 04:18:15 ld5506 kernel: [149031.121993] usb 1-1.6: New USB device strings: Mfr=1, Product=2, SerialNumber=0
Dec 12 04:18:15 ld5506 kernel: [149031.124059] usb 1-1.6: Product: XClarity Controller
Dec 12 04:18:15 ld5506 kernel: [149031.126031] usb 1-1.6: Manufacturer: IBM
Dec 12 04:18:15 ld5506 kernel: [149031.130377] cdc_ether 1-1.6:1.0 usb0: register 'cdc_ether' at usb-0000:00:14.0-1.6, CDC Ethernet Device, 7e:d3:0a:60:9b:1f
Dec 12 04:18:15 ld5506 kernel: [149031.139547] cdc_ether 1-1.6:1.0 enp0s20f0u1u6: renamed from usb0
Dec 12 04:20:09 ld5506 kernel: [149145.442569] usb 1-1.6: USB disconnect, device number 7
Dec 12 04:20:09 ld5506 kernel: [149145.444702] cdc_ether 1-1.6:1.0 enp0s20f0u1u6: unregister 'cdc_ether' usb-0000:00:14.0-1.6, CDC Ethernet Device
Dec 12 04:20:11 ld5506 kernel: [149148.088385] usb 1-1.6: new high-speed USB device number 8 using xhci_hcd
Dec 12 04:20:12 ld5506 kernel: [149148.190414] usb 1-1.6: New USB device found, idVendor=04b3, idProduct=4010, bcdDevice= 3.14
Dec 12 04:20:12 ld5506 kernel: [149148.190419] usb 1-1.6: New USB device strings: Mfr=1, Product=2, SerialNumber=0
Dec 12 04:20:12 ld5506 kernel: [149148.190422] usb 1-1.6: Product: XClarity Controller
Dec 12 04:20:12 ld5506 kernel: [149148.190425] usb 1-1.6: Manufacturer: IBM
Dec 12 04:20:12 ld5506 kernel: [149148.198646] cdc_ether 1-1.6:1.0 usb0: register 'cdc_ether' at usb-0000:00:14.0-1.6, CDC Ethernet Device, 7e:d3:0a:60:9b:1f
Dec 12 04:20:12 ld5506 kernel: [149148.205269] cdc_ether 1-1.6:1.0 enp0s20f0u1u6: renamed from usb0
Dec 12 04:21:55 ld5506 kernel: [149251.690817] usb 1-1.6: USB disconnect, device number 8
Dec 12 04:21:55 ld5506 kernel: [149251.693148] cdc_ether 1-1.6:1.0 enp0s20f0u1u6: unregister 'cdc_ether' usb-0000:00:14.0-1.6, CDC Ethernet Device
Dec 12 04:21:57 ld5506 kernel: [149253.974636] usb 1-1.6: new high-speed USB device number 9 using xhci_hcd
Dec 12 04:21:57 ld5506 kernel: [149254.076376] usb 1-1.6: New USB device found, idVendor=04b3, idProduct=4010, bcdDevice= 3.14
Dec 12 04:21:57 ld5506 kernel: [149254.078671] usb 1-1.6: New USB device strings: Mfr=1, Product=2, SerialNumber=0
Dec 12 04:21:57 ld5506 kernel: [149254.080891] usb 1-1.6: Product: XClarity Controller
Dec 12 04:21:57 ld5506 kernel: [149254.083156] usb 1-1.6: Manufacturer: IBM
Dec 12 04:21:57 ld5506 kernel: [149254.087920] cdc_ether 1-1.6:1.0 usb0: register 'cdc_ether' at usb-0000:00:14.0-1.6, CDC Ethernet Device, 7e:d3:0a:60:9b:1f
Dec 12 04:21:57 ld5506 kernel: [149254.095332] cdc_ether 1-1.6:1.0 enp0s20f0u1u6: renamed from usb0
Dec 12 04:29:51 ld5506 rsyslogd: warning: ~ action is deprecated, consider using the 'stop' statement instead [v8.1901.0 try https://www.rsyslog.com/e/2307 ]
Dec 12 04:29:51 ld5506 kernel: [ 0.000000] Linux version 5.3.10-1-pve (build@pve) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.3.10-1 (Thu, 14 Nov 2019 10:43:13 +0100) ()
Dec 12 04:29:51 ld5506 kernel: [ 0.000000] Command line: BOOT_IMAGE=/@snapshots/26/snapshot/boot/vmlinuz-5.3.10-1-pve root=UUID=b99e7b73-1149-40e8-8f7f-10ab1e4e3d51 ro rootflags=subvol=@snapshots/26/snapshot
Dec 12 04:29:51 ld5506 kernel: [ 0.000000] KERNEL supported cpus:
Dec 12 04:29:51 ld5506 kernel: [ 0.000000] Intel GenuineIntel
Dec 12 04:29:51 ld5506 kernel: [ 0.000000] AMD AuthenticAMD
Dec 12 04:29:51 ld5506 kernel: [ 0.000000] Hygon HygonGenuine
Dec 12 04:29:51 ld5506 kernel: [ 0.000000] Centaur CentaurHauls
Dec 12 04:29:51 ld5506 kernel: [ 0.000000] zhaoxin Shanghai
Dec 12 04:29:51 ld5506 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Dec 12 04:29:51 ld5506 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Dec 12 04:29:51 ld5506 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Dec 12 04:29:51 ld5506 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
Dec 12 04:29:51 ld5506 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
Dec 12 04:29:51 ld5506 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'


You can see that the last entry before the reboot was logged at Dec 12 04:21:57, and the first entry of the new boot at Dec 12 04:29:51.
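
To locate such boundaries across a whole log file (or on the other nodes), one option is to watch the kernel uptime in square brackets: it only grows within a boot, so a drop marks a restart. This is a rough sketch, assuming the syslog line layout shown above; `find_boot_boundaries` and the dictionary keys are hypothetical names chosen for illustration.

```python
import re

# Matches e.g. "Dec 12 04:21:57 ld5506 kernel: [149254.095332] ..."
KERNEL_LINE = re.compile(
    r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:\s+\[\s*(\d+\.\d+)\]"
)


def find_boot_boundaries(lines):
    """Return one dict per detected restart.

    A restart is detected when the kernel uptime stamp of a kernel line
    is lower than that of the previous kernel line. Non-kernel lines
    (e.g. rsyslogd messages) are ignored.
    """
    boundaries = []
    prev = None  # (wall_clock_string, uptime_seconds) of previous kernel line
    for line in lines:
        match = KERNEL_LINE.match(line)
        if not match:
            continue
        current = (match.group(1), float(match.group(2)))
        if prev is not None and current[1] < prev[1]:
            boundaries.append(
                {
                    "last_seen": prev[0],     # last kernel message before the restart
                    "uptime_s": prev[1],      # uptime at that message
                    "next_boot": current[0],  # first kernel message of the new boot
                }
            )
        prev = current
    return boundaries


if __name__ == "__main__":
    # Path taken from the post above; adjust as needed.
    with open("/var/log/messages", errors="replace") as handle:
        for boundary in find_boot_boundaries(handle):
            print(boundary)
```

The gap between `last_seen` and `next_boot` only bounds the time of the restart; it does not say what caused it.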

I have activated a watchdog on the server side (Lenovo), and my assumption is that the renaming of interface usb0 is related to the reboot.
This is a USB/IP interface used by Lenovo's watchdog to verify if the OS is still available from IMM.
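
To check whether the XClarity Controller interface drops off in a regular pattern before each reboot, the disconnect messages can be extracted and the spacing between them measured. The sketch below assumes the log layout shown above and uses the kernel uptime stamps for the arithmetic; `disconnect_intervals` is a hypothetical helper name. A regular pattern would support the assumption but would not prove that the watchdog caused the reboot.

```python
import re
from collections import defaultdict

# Matches e.g.
# "Dec 12 04:18:12 ld5506 kernel: [149028.186267] usb 1-1.6: USB disconnect, device number 6"
DISCONNECT = re.compile(
    r"^\w{3}\s+\d+\s+\d\d:\d\d:\d\d\s+\S+\s+kernel:\s+"
    r"\[\s*(\d+\.\d+)\]\s+usb (\S+): USB disconnect"
)


def disconnect_intervals(lines):
    """Map each USB port to the seconds between consecutive disconnects.

    Intervals are computed from the kernel uptime stamps, so they are
    only meaningful within a single boot.
    """
    stamps = defaultdict(list)
    for line in lines:
        match = DISCONNECT.match(line)
        if match:
            uptime, port = float(match.group(1)), match.group(2)
            stamps[port].append(uptime)
    return {
        port: [round(later - earlier, 1) for earlier, later in zip(ts, ts[1:])]
        for port, ts in stamps.items()
    }


if __name__ == "__main__":
    # Path taken from the post above; adjust as needed.
    with open("/var/log/messages", errors="replace") as handle:
        for port, gaps in disconnect_intervals(handle).items():
            print(port, gaps)
```

In the excerpt above, port 1-1.6 disconnects three times within a few minutes (04:18:12, 04:20:09, 04:21:55), roughly two minutes apart.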

This is no HA setup.
And it's not always the same node that reboots.

I noticed that most of the reboots on other nodes happen over the weekend.

THX
 
I have activated a watchdog on the server side (Lenovo), and my assumption is that the renaming of interface usb0 is related to the reboot.
This is a USB/IP interface used by Lenovo's watchdog to verify if the OS is still available from IMM.
May I ask why you want to use this watchdog? As you are experiencing, they can be quite tricky to get right.

I cannot help you with specific advice because I don't have access to similar hardware.
 
