Why is the server rebooting automatically?

cmonty14

Hi,

I'm operating a 7-node cluster with Ceph.

From time to time I notice that a node reboots on its own.
Now I want to analyse what is triggering this reboot.

The output of last -x | head | tac indicates that the last reboot was triggered by kernel 5.3.10-1-pve:
root@ld5506:~# last -x | head | tac
root pts/3 tmux(31958).%2 Tue Dec 10 11:04 - crash (1+17:25)
reboot system boot 5.3.10-1-pve Thu Dec 12 04:29 still running
runlevel (to lvl 5) 5.3.10-1-pve Thu Dec 12 04:33 still running
root pts/0 10.177.32.32 Thu Dec 12 08:47 still logged in
root pts/1 tmux(248633).%0 Thu Dec 12 08:47 still logged in
root pts/2 tmux(248633).%1 Thu Dec 12 08:47 still logged in
root pts/3 tmux(248633).%2 Thu Dec 12 08:51 still logged in
root pts/4 tmux(248633).%3 Thu Dec 12 10:01 still logged in
root pts/5 tmux(248633).%4 Fri Dec 13 10:19 still logged in
root pts/6 tmux(248633).%5 Fri Dec 13 10:23 still logged in
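A rough way to read this output: `last -x` also prints `shutdown` pseudo-entries, so a `reboot` line with no `shutdown` line before it, plus a session ending in `crash`, suggests the node went down without a clean shutdown. The kernel version on the `reboot` line is just the kernel that was booted, not necessarily the cause. The sketch below is only an illustration; `classify_last` is a made-up helper name, and it should be fed only the lines around the reboot in question.

```shell
#!/bin/sh
# classify_last: hypothetical helper, reads `last -x` output on stdin.
# Prints "clean" if a shutdown record is present, "unclean" if a session
# ended in "crash" and no shutdown record was seen, otherwise "unknown".
classify_last() {
    awk '
        $1 == "shutdown"   { saw_shutdown = 1 }
        /-[ \t]+crash/     { saw_crash = 1 }
        END {
            if (saw_shutdown)   print "clean"
            else if (saw_crash) print "unclean"
            else                print "unknown"
        }'
}

# Usage on a live system (assumes wtmp still covers the reboot in question):
#   last -x | head | classify_last
```

If it prints "unclean", the reboot was most likely a hard reset or crash rather than an orderly `shutdown -r`.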


If this is true, my question is:
Why is the kernel triggering a reboot?

If not, the question is:
What should I check next in order to identify the root cause?

THX
 
Is it a normal reboot?
Anything in the journal, syslogs or kernel logs that can give a hint?
Do you have any HA resources defined?
Is it always the same node?
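For the log check, a minimal sketch of what one could grep for. The helper name `scan_log` and the pattern list are assumptions for illustration only; the patterns cover a few well-known kernel trouble messages and are by no means complete.

```shell
#!/bin/sh
# scan_log: hypothetical helper; prints (with line numbers) the lines of the
# given log file that match a few well-known kernel trouble markers.
# The pattern list is illustrative, not exhaustive.
scan_log() {
    grep -n -i -E 'kernel panic|oops|soft lockup|hard lockup|blocked for more than|out of memory|hardware error|watchdog' "$1"
}

# Usage (paths assume a Debian-style syslog setup):
#   scan_log /var/log/messages
#   scan_log /var/log/kern.log
# If the systemd journal is persistent, the kernel messages of the previous
# boot can also be read with:
#   journalctl -k -b -1
```

No match proves nothing, though: a hard reset from outside (BMC watchdog, power loss) usually leaves no trace in the OS logs at all.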
 
Hi,
I have this information in /var/log/messages:
[...]
Dec 12 04:15:06 ld5506 kernel: [148842.580673] usb 1-1.1: SerialNumber: 20171009
Dec 12 04:15:06 ld5506 kernel: [148842.602380] hidraw: raw HID events driver (C) Jiri Kosina
Dec 12 04:15:06 ld5506 kernel: [148842.608082] usbcore: registered new interface driver usbhid
Dec 12 04:15:06 ld5506 kernel: [148842.609055] usbhid: USB HID core driver
Dec 12 04:15:06 ld5506 kernel: [148842.610537] usbcore: registered new interface driver usbmouse
Dec 12 04:15:06 ld5506 kernel: [148842.611607] usbcore: registered new interface driver usbkbd
Dec 12 04:15:06 ld5506 kernel: [148842.616303] input: Avocent Keyboard/Mouse Function as /devices/pci0000:00/0000:00:14.0/usb1/1-1/1-1.1/1-1.1:1.0/0003:17EF:B000.0001/input/input2
Dec 12 04:15:06 ld5506 kernel: [148842.673610] hid-generic 0003:17EF:B000.0001: input,hidraw0: USB HID v1.00 Keyboard [Avocent Keyboard/Mouse Function] on usb-0000:00:14.0-1.1/input0
Dec 12 04:15:06 ld5506 kernel: [148842.677077] input: Avocent Keyboard/Mouse Function as /devices/pci0000:00/0000:00:14.0/usb1/1-1/1-1.1/1-1.1:1.1/0003:17EF:B000.0002/input/input3
Dec 12 04:15:06 ld5506 kernel: [148842.680817] hid-generic 0003:17EF:B000.0002: input,hidraw1: USB HID v1.00 Mouse [Avocent Keyboard/Mouse Function] on usb-0000:00:14.0-1.1/input1
Dec 12 04:15:06 ld5506 kernel: [148842.683197] input: Avocent Keyboard/Mouse Function as /devices/pci0000:00/0000:00:14.0/usb1/1-1/1-1.1/1-1.1:1.2/0003:17EF:B000.0003/input/input4
Dec 12 04:15:06 ld5506 kernel: [148842.685710] hid-generic 0003:17EF:B000.0003: input,hidraw2: USB HID v1.00 Mouse [Avocent Keyboard/Mouse Function] on usb-0000:00:14.0-1.1/input2
Dec 12 04:15:07 ld5506 kernel: [148843.761419] usb 1-1.6: new high-speed USB device number 6 using xhci_hcd
Dec 12 04:15:07 ld5506 kernel: [148843.862995] usb 1-1.6: New USB device found, idVendor=04b3, idProduct=4010, bcdDevice= 3.14
Dec 12 04:15:07 ld5506 kernel: [148843.864389] usb 1-1.6: New USB device strings: Mfr=1, Product=2, SerialNumber=0
Dec 12 04:15:07 ld5506 kernel: [148843.865800] usb 1-1.6: Product: XClarity Controller
Dec 12 04:15:07 ld5506 kernel: [148843.867158] usb 1-1.6: Manufacturer: IBM
Dec 12 04:15:07 ld5506 kernel: [148843.870961] cdc_ether 1-1.6:1.0 usb0: register 'cdc_ether' at usb-0000:00:14.0-1.6, CDC Ethernet Device, 7e:d3:0a:60:9b:1f
Dec 12 04:15:07 ld5506 kernel: [148843.878125] cdc_ether 1-1.6:1.0 enp0s20f0u1u6: renamed from usb0
Dec 12 04:16:12 ld5506 kernel: [148908.958402] power_meter ACPI000D:00: Found ACPI power meter.
Dec 12 04:16:12 ld5506 kernel: [148908.960504] power_meter ACPI000D:00: Ignoring unsafe software power cap!
Dec 12 04:16:23 ld5506 kernel: [148919.385924] usb 1-1.1: USB disconnect, device number 5
Dec 12 04:18:12 ld5506 kernel: [149028.186267] usb 1-1.6: USB disconnect, device number 6
Dec 12 04:18:12 ld5506 kernel: [149028.188432] cdc_ether 1-1.6:1.0 enp0s20f0u1u6: unregister 'cdc_ether' usb-0000:00:14.0-1.6, CDC Ethernet Device
Dec 12 04:18:14 ld5506 kernel: [149031.018326] usb 1-1.6: new high-speed USB device number 7 using xhci_hcd
Dec 12 04:18:15 ld5506 kernel: [149031.119969] usb 1-1.6: New USB device found, idVendor=04b3, idProduct=4010, bcdDevice= 3.14
Dec 12 04:18:15 ld5506 kernel: [149031.121993] usb 1-1.6: New USB device strings: Mfr=1, Product=2, SerialNumber=0
Dec 12 04:18:15 ld5506 kernel: [149031.124059] usb 1-1.6: Product: XClarity Controller
Dec 12 04:18:15 ld5506 kernel: [149031.126031] usb 1-1.6: Manufacturer: IBM
Dec 12 04:18:15 ld5506 kernel: [149031.130377] cdc_ether 1-1.6:1.0 usb0: register 'cdc_ether' at usb-0000:00:14.0-1.6, CDC Ethernet Device, 7e:d3:0a:60:9b:1f
Dec 12 04:18:15 ld5506 kernel: [149031.139547] cdc_ether 1-1.6:1.0 enp0s20f0u1u6: renamed from usb0
Dec 12 04:20:09 ld5506 kernel: [149145.442569] usb 1-1.6: USB disconnect, device number 7
Dec 12 04:20:09 ld5506 kernel: [149145.444702] cdc_ether 1-1.6:1.0 enp0s20f0u1u6: unregister 'cdc_ether' usb-0000:00:14.0-1.6, CDC Ethernet Device
Dec 12 04:20:11 ld5506 kernel: [149148.088385] usb 1-1.6: new high-speed USB device number 8 using xhci_hcd
Dec 12 04:20:12 ld5506 kernel: [149148.190414] usb 1-1.6: New USB device found, idVendor=04b3, idProduct=4010, bcdDevice= 3.14
Dec 12 04:20:12 ld5506 kernel: [149148.190419] usb 1-1.6: New USB device strings: Mfr=1, Product=2, SerialNumber=0
Dec 12 04:20:12 ld5506 kernel: [149148.190422] usb 1-1.6: Product: XClarity Controller
Dec 12 04:20:12 ld5506 kernel: [149148.190425] usb 1-1.6: Manufacturer: IBM
Dec 12 04:20:12 ld5506 kernel: [149148.198646] cdc_ether 1-1.6:1.0 usb0: register 'cdc_ether' at usb-0000:00:14.0-1.6, CDC Ethernet Device, 7e:d3:0a:60:9b:1f
Dec 12 04:20:12 ld5506 kernel: [149148.205269] cdc_ether 1-1.6:1.0 enp0s20f0u1u6: renamed from usb0
Dec 12 04:21:55 ld5506 kernel: [149251.690817] usb 1-1.6: USB disconnect, device number 8
Dec 12 04:21:55 ld5506 kernel: [149251.693148] cdc_ether 1-1.6:1.0 enp0s20f0u1u6: unregister 'cdc_ether' usb-0000:00:14.0-1.6, CDC Ethernet Device
Dec 12 04:21:57 ld5506 kernel: [149253.974636] usb 1-1.6: new high-speed USB device number 9 using xhci_hcd
Dec 12 04:21:57 ld5506 kernel: [149254.076376] usb 1-1.6: New USB device found, idVendor=04b3, idProduct=4010, bcdDevice= 3.14
Dec 12 04:21:57 ld5506 kernel: [149254.078671] usb 1-1.6: New USB device strings: Mfr=1, Product=2, SerialNumber=0
Dec 12 04:21:57 ld5506 kernel: [149254.080891] usb 1-1.6: Product: XClarity Controller
Dec 12 04:21:57 ld5506 kernel: [149254.083156] usb 1-1.6: Manufacturer: IBM
Dec 12 04:21:57 ld5506 kernel: [149254.087920] cdc_ether 1-1.6:1.0 usb0: register 'cdc_ether' at usb-0000:00:14.0-1.6, CDC Ethernet Device, 7e:d3:0a:60:9b:1f
Dec 12 04:21:57 ld5506 kernel: [149254.095332] cdc_ether 1-1.6:1.0 enp0s20f0u1u6: renamed from usb0
Dec 12 04:29:51 ld5506 rsyslogd: warning: ~ action is deprecated, consider using the 'stop' statement instead [v8.1901.0 try https://www.rsyslog.com/e/2307 ]
Dec 12 04:29:51 ld5506 kernel: [ 0.000000] Linux version 5.3.10-1-pve (build@pve) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.3.10-1 (Thu, 14 Nov 2019 10:43:13 +0100) ()
Dec 12 04:29:51 ld5506 kernel: [ 0.000000] Command line: BOOT_IMAGE=/@snapshots/26/snapshot/boot/vmlinuz-5.3.10-1-pve root=UUID=b99e7b73-1149-40e8-8f7f-10ab1e4e3d51 ro rootflags=subvol=@snapshots/26/snapshot
Dec 12 04:29:51 ld5506 kernel: [ 0.000000] KERNEL supported cpus:
Dec 12 04:29:51 ld5506 kernel: [ 0.000000] Intel GenuineIntel
Dec 12 04:29:51 ld5506 kernel: [ 0.000000] AMD AuthenticAMD
Dec 12 04:29:51 ld5506 kernel: [ 0.000000] Hygon HygonGenuine
Dec 12 04:29:51 ld5506 kernel: [ 0.000000] Centaur CentaurHauls
Dec 12 04:29:51 ld5506 kernel: [ 0.000000] zhaoxin Shanghai
Dec 12 04:29:51 ld5506 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Dec 12 04:29:51 ld5506 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Dec 12 04:29:51 ld5506 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Dec 12 04:29:51 ld5506 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
Dec 12 04:29:51 ld5506 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
Dec 12 04:29:51 ld5506 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'


You can see that the last entry before the reboot was logged at Dec 12 04:21:57; the new boot starts logging at 04:29:51.
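To pull that boundary out of a long log without scrolling, something like the following sketch might help. It assumes the syslog format shown above and treats the `Linux version` banner as the start of a boot; `last_before_boot` is a made-up name. Note that 04:21:57 is only the last kernel message that reached the disk; the reset itself happened somewhere between that line and the new boot messages at 04:29:51.

```shell
#!/bin/sh
# last_before_boot: hypothetical helper; for every kernel boot banner
# ("Linux version ...") in a syslog-style file, print the last kernel
# message that was logged before it. Non-kernel lines (e.g. rsyslogd)
# are ignored so that the new boot's own startup noise is skipped.
last_before_boot() {
    awk '
        / kernel: \[/ {
            if ($0 ~ /\] Linux version / && prev != "")
                print "last kernel line before boot: " prev
            prev = $0
        }
    ' "$1"
}

# Usage:
#   last_before_boot /var/log/messages
```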

I have activated a watchdog on server side (Lenovo) and my assumption is that renaming of interface usb0 is related to the reboot.
This is a USB/IP interface used by Lenovo's watchdog to verify if the OS is still available from IMM.
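To test this assumption, one could list when the XClarity USB device (port 1-1.6 in the log above) disconnects, and compare those times with the reboot times on each node. This is only a sketch: `usb_disconnects` is a made-up helper, and the port number may differ on other nodes.

```shell
#!/bin/sh
# usb_disconnects: hypothetical helper; prints the syslog timestamp of every
# "USB disconnect" kernel message for the given USB port (e.g. "1-1.6").
# Arguments: $1 = USB port id, $2 = syslog-style log file.
usb_disconnects() {
    awk -v port="$1" '
        index($0, "usb " port ": USB disconnect") { print $1, $2, $3 }
    ' "$2"
}

# Usage:
#   usb_disconnects 1-1.6 /var/log/messages
```

If the disconnects also show up at times when no reboot follows, the interface rename on its own is probably not what triggers the reset.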

This is not an HA setup.
And it is not always the same node that reboots.

I noticed that most of the reboots on other nodes happen over the weekend.

THX
 
I have activated a watchdog on server side (Lenovo) and my assumption is that renaming of interface usb0 is related to the reboot.
This is a USB/IP interface used by Lenovo's watchdog to verify if the OS is still available from IMM.
May I ask why you want to use this watchdog? As you are experiencing, they can be quite tricky to get right.

I cannot help you with specific advice because I don't have access to similar hardware.