We just purchased a SuperMicro SuperServer 6029p-tr and installed Proxmox on it. It ran fine for about a week, then rebooted on its own in the middle of the night. After that I made the following adjustments:
1. updated the BIOS
2. adjusted ZFS to reduce ARC_MAX to 8GB and made sure primarycache=metadata and secondarycache=metadata are set on rpool/swap
3. set vm.swappiness to 10
4. ran SuperDiag HW diag tool
5. double-checked the UPS and generator
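For reference, items 2 and 3 usually come down to something like the sketch below. This is the common way to apply these settings, not a transcript of the exact commands run on this box. The /etc/modprobe.d/zfs.conf path and the initramfs refresh are assumptions based on a standard ZFS-on-root install, and the commands that change system state are left as comments so the snippet is safe to paste.

```shell
# zfs_arc_max is given in bytes, so 8 GiB has to be converted first.
arc_max_bytes=$((8 * 1024 * 1024 * 1024))

# Line that would go into /etc/modprobe.d/zfs.conf (assumed location).
zfs_modprobe_line="options zfs zfs_arc_max=${arc_max_bytes}"
echo "$zfs_modprobe_line"

# The remaining steps modify the system, so they are shown as comments only:
#   update-initramfs -u                          # root is on ZFS, so refresh the initramfs
#   zfs set primarycache=metadata rpool/swap
#   zfs set secondarycache=metadata rpool/swap
#   sysctl -w vm.swappiness=10                   # add to /etc/sysctl.conf to persist
```

The current values can be read back with "zfs get primarycache,secondarycache rpool/swap" and "sysctl vm.swappiness".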
Yet it did it again about a week later. You can see in the last output below that the boot at 01:31 AM was definitely an ungraceful shutdown: there are two reboot entries in a row with no shutdown entry between them. I'm not seeing anything in the kernel or system logs. I mean nothing. Is there any way to know definitively whether this is a hardware problem or something configured incorrectly?
IDK what to do at this point. I'm kind of at a loss. Any help would be appreciated.
last command
root@pve:~# last -n10 -x shutdown reboot
reboot system boot 4.13.13-6-pve Fri Mar 16 12:55 still running
shutdown system down 4.13.13-6-pve Fri Mar 16 11:49 - 12:55 (01:06)
reboot system boot 4.13.13-6-pve Fri Mar 16 01:31 - 11:49 (10:18)
reboot system boot 4.13.13-6-pve Mon Mar 12 13:56 - 11:49 (3+21:53)
shutdown system down 4.13.13-6-pve Mon Mar 12 13:35 - 13:56 (00:21)
reboot system boot 4.13.13-6-pve Mon Mar 12 13:15 - 13:35 (00:19)
shutdown system down 4.13.13-6-pve Mon Mar 12 13:14 - 13:15 (00:01)
reboot system boot 4.13.13-6-pve Mon Mar 12 10:24 - 13:14 (02:49)
shutdown system down 4.13.13-6-pve Mon Mar 12 10:18 - 10:24 (00:06)
reboot system boot 4.13.13-6-pve Mon Mar 12 10:13 - 10:18 (00:05)
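To make the pattern above easier to spot, here is a minimal sketch of a filter for this kind of output. The function name find_unclean_boots is made up for illustration. It relies on last printing the newest entry first and only looks at the first column: a reboot line followed directly by an older reboot line means no shutdown was recorded before that boot.

```shell
# Print every "reboot" entry that is not preceded (in time) by a "shutdown"
# entry. Input is the output of `last -x shutdown reboot`, newest first.
find_unclean_boots() {
  awk '
    # The previous (newer) line was a boot and this (older) line is also a
    # boot: nothing was logged between them, so the newer boot was unclean.
    prev_was_boot && $1 == "reboot" { print prev_line }
    { prev_was_boot = ($1 == "reboot"); prev_line = $0 }
  '
}
```

It would be used as "last -x shutdown reboot | find_unclean_boots". On the listing above it flags only the Fri Mar 16 01:31 boot. The oldest line of a truncated listing is never flagged, since what came before it is unknown.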
kern.log
Mar 15 13:01:37 pve pvedaemon[43253]: <root@pam> successful auth for user 'root@pam'
Mar 15 13:16:38 pve pvedaemon[101049]: <root@pam> successful auth for user 'root@pam'
Mar 15 13:31:39 pve pvedaemon[43253]: <root@pam> successful auth for user 'root@pam'
Mar 16 01:31:30 pve kernel: [ 0.000000] random: get_random_bytes called from start_kernel+0x42/0x501 with crng_init=0
Mar 16 01:31:30 pve kernel: [ 0.000000] Linux version 4.13.13-6-pve (root@nora) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u0)) #1 SMP PVE 4.13.13-41 (Wed, 21 Feb 2018 10:07:54 +0100) ()
Mar 16 01:31:30 pve kernel: [ 0.000000] Command line: BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-4.13.13-6-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet
Mar 16 01:31:30 pve kernel: [ 0.000000] KERNEL supported cpus:
Mar 16 01:31:30 pve kernel: [ 0.000000] Intel GenuineIntel
syslog.1
Mar 16 01:27:00 pve systemd[1]: Starting Proxmox VE replication runner...
Mar 16 01:27:00 pve systemd[1]: Started Proxmox VE replication runner.
Mar 16 01:28:00 pve systemd[1]: Starting Proxmox VE replication runner...
Mar 16 01:28:00 pve systemd[1]: Started Proxmox VE replication runner.
Mar 16 01:29:00 pve systemd[1]: Starting Proxmox VE replication runner...
Mar 16 01:29:00 pve systemd[1]: Started Proxmox VE replication runner.
Mar 16 01:31:30 pve systemd-modules-load[2791]: Inserted module 'iscsi_tcp'
Mar 16 01:31:30 pve kernel: [ 0.000000] random: get_random_bytes called from start_kernel+0x42/0x501 with crng_init=0
Mar 16 01:31:30 pve systemd-modules-load[2791]: Inserted module 'ib_iser'
Mar 16 01:31:30 pve kernel: [ 0.000000] Linux version 4.13.13-6-pve (root@nora) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u0)) #1 SMP PVE 4.13.13-41 (Wed, 21 Feb 2018 10:07:54 +0100) ()
Mar 16 01:31:30 pve systemd-modules-load[2791]: Inserted module 'vhost_net'
Mar 16 01:31:30 pve kernel: [ 0.000000] Command line: BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-4.13.13-6-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet
Mar 16 01:31:30 pve kernel: [ 0.000000] KERNEL supported cpus:
Mar 16 01:31:30 pve kernel: [ 0.000000] Intel GenuineIntel
Mar 16 01:31:30 pve systemd-udevd[2889]: Process '/bin/mount -t fusectl fusectl /sys/fs/fuse/connections' failed with exit code 32.
Mar 16 01:31:30 pve kernel: [ 0.000000] AMD AuthenticAMD
Mar 16 01:31:30 pve kernel: [ 0.000000] Centaur CentaurHauls
Mar 16 01:31:30 pve systemd[1]: Starting Flush Journal to Persistent Storage...
SuperDiag Results
Copyright(c) 1993-2018 Super Micro Computer, Inc.
Execution Time : 16:00:24 03/16/2018
MB Name : X11DPi-NT
MB Serial Number: OM178S024019
[Component Detection]
Start Time: 16:01:01 03/16/2018
Result: Passed
Total Type Count: 11, Passed Count: 11, Failed Count: 0
[Component Diagnostics]
Start Time: 16:01:47 03/16/2018
Result: Passed
Total Type Count: 10, Passed Count: 10, Failed Count: 0
Overall Result: Passed
kernel info
root@pve:~# uname -a
Linux pve 4.13.13-6-pve #1 SMP PVE 4.13.13-41 (Wed, 21 Feb 2018 10:07:54 +0100) x86_64 GNU/Linux