Random Reboots PVE 4.1

techsolo · Feb 29, 2016

This is the first time that I used the installer instead of setting up a minimal debian. Now I was doing this to get a native ZFS installation in place but I'm already regretting that I started this.

My system was using vast amounts of RAM. To conquer this I upgraded from 16 to 32GB RAM but the system kept rebooting. I placed a limit to it and also changed some parameters on the ZFS pool (I use LZ4 compression)

zfs set primarycache=metadata rpool/swap
zfs set secondarycache=metadata rpool/swap
zfs set compression=off rpool/swap
zfs set sync=disabled rpool/swap
zfs set checksum=on rpool/swap

Swapiness doesn't seem to help and is not set at the moment. Set it to 10 as suggested in the wiki.

I also had to set a higher timeout on the snapshots. Which worked before the ZFS install without any issue.

I'm really fed up with this since the random reboots leave no traces in any log file... My backups fail (since they trigger the random reboots) and I'm really starting to fear to loose data (and yes I have a copy but that doesn't mean this should be possible).

Any suggestions excluding starting all over again and go back to ext4 on LVM?

LnxBil · Feb 29, 2016

Do not use swap on zfs. There are plenty of entries in this forum of people having reboots - me included - with swap on zfs. Disable it completely and use zram-config (ubunto package for zram) or use a dedicated non-zfs-based partition for swap. This resolved almost all problems of random reboots.

techsolo · Feb 29, 2016

LnxBil said:
Do not use swap on zfs. There are plenty of entries in this forum of people having reboots - me included - with swap on zfs. Disable it completely and use zram-config (ubunto package for zram) or use a dedicated non-zfs-based partition for swap. This resolved almost all problems of random reboots.

Yeah I had set-up a set of swap partitions on my cache SSD's but the machine didn't came back after reboot. Will see later today when I'm home what is stalling the reboot/start up. But it's a pity a 'production' system is setting up a swap that guarantees instability...

LnxBil · Feb 29, 2016

That's why I switched to zram-config for all my machines (real and virtual) and never had a problem with it.

techsolo · Feb 29, 2016

can I do a zpool destroy rpool/swap to free that unused space?

techsolo · Feb 29, 2016

To mark this as resolved:

I did ->
remove the swap partition on the zfs with "zfs destroy rpool/swap"
Uncommented the zfs swap in /etc/fstab
created 2 new partitions on my cache SSDs and made a striped swap
ram usage seems to be stable for now. Will update if I see the thing crash again.

LnxBil · Feb 29, 2016

Perfect! It solved really all my problems :-D

RobFantini · Feb 29, 2016

LnxBil said:
That's why I switched to zram-config for all my machines (real and virtual) and never had a problem with it.

can you post how you installed and configured zram ?

LnxBil · Feb 29, 2016

I installed the ubuntu package zram-config manually. It automatically divides half your ram on 2 * # cpu cores devices and loads it automatically. Really no hassle.

http://packages.ubuntu.com/wily/all/zram-config/download

techsolo · Mar 1, 2016

still a reboot and this is all I retrieve

Mar 1 00:00:02 solo-prox-01 vzdump[23140]: <root@pam> starting task UPID:solo-prox-01:00005A65:0023D34D:56D4CD72:vzdump::root@pam:
Mar 1 00:00:03 solo-prox-01 qm[23144]: <root@pam> update VM 103: -lock backup
Mar 1 00:05:41 solo-prox-01 pvedaemon[30443]: <root@pam> successful auth for user 'root@pam'
Mar 1 00:20:41 solo-prox-01 pvedaemon[32234]: <root@pam> successful auth for user 'root@pam'
Mar 1 00:35:41 solo-prox-01 pvedaemon[30443]: <root@pam> successful auth for user 'root@pam'
Mar 1 00:38:22 solo-prox-01 qm[27674]: <root@pam> update VM 105: -lock backup
Mar 1 00:50:41 solo-prox-01 pvedaemon[20032]: <root@pam> successful auth for user 'root@pam'
Mar 1 01:00:26 solo-prox-01 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="3188" x-info="http://www.rsyslog.com"] start
Mar 1 01:00:26 solo-prox-01 kernel: [ 0.000000] Initializing cgroup subsys cpuset
Mar 1 01:00:26 solo-prox-01 kernel: [ 0.000000] Initializing cgroup subsys cpu
Mar 1 01:00:26 solo-prox-01 kernel: [ 0.000000] Initializing cgroup subsys cpuacct
Mar 1 01:00:26 solo-prox-01 kernel: [ 0.000000] Linux version 4.2.8-1-pve (root@elsa) (gcc version 4.9.2 (Debian 4.9.2-10) ) #1 SMP Wed Feb 3 16:33:06 CET 2016 ()
Mar 1 01:00:26 solo-prox-01 kernel: [ 0.000000] Command line: BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-4.2.8-1-pve root=ZFS=/ROOT/pve-1 ro boot=zfs root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet
Mar 1 01:00:26 solo-prox-01 kernel: [ 0.000000] KERNEL supported cpus:

LnxBil · Mar 1, 2016

Please use some higher level kernel debugging like crashdump to analyse further.

do 1-4, 6, 7
http://superuser.com/a/946474

techsolo · Mar 2, 2016

Had a random reboot this night with the crash kernel configuration.

crash: cannot find booted kernel -- please enter namelist argument

Any suggestions?

LnxBil · Mar 2, 2016

Please check your configuration, it should look similar than this:

Code:

root@backup ~ > service kdump-tools status
● kdump-tools.service - Kernel crash dump capture service
   Loaded: loaded (/lib/systemd/system/kdump-tools.service; enabled)
   Active: active (exited) since Mi 2016-03-02 09:08:37 CET; 1s ago
  Process: 4389 ExecStop=/etc/init.d/kdump-tools stop (code=exited, status=0/SUCCESS)
  Process: 4407 ExecStart=/etc/init.d/kdump-tools start (code=exited, status=0/SUCCESS)
Main PID: 4407 (code=exited, status=0/SUCCESS)

Mär 02 09:08:37 backup kdump-tools[4407]: Starting kdump-tools: loaded kdump kernel.


root@backup ~ > dmesg | grep -i crash
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.2.6-1-pve root=UUID=6922ba43-52cc-4db9-9916-7bbf7e3fb5af ro pcie_aspm=off crashkernel=256M intel_idle.max_cstate=0 quiet
[    0.000000] Reserving 256MB of memory at 592MB for crashkernel (System RAM: 32757MB)
[    0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.2.6-1-pve root=UUID=6922ba43-52cc-4db9-9916-7bbf7e3fb5af ro pcie_aspm=off crashkernel=256M intel_idle.max_cstate=0 quiet

techsolo · Mar 2, 2016

root@solo-prox-01:~# service kdump-tools status
● kdump-tools.service - Kernel crash dump capture service
Loaded: loaded (/lib/systemd/system/kdump-tools.service; enabled)
Active: active (exited) since Wed 2016-03-02 07:48:16 CET; 1h 59min ago
Process: 3140 ExecStart=/etc/init.d/kdump-tools start (code=exited, status=0/SUCCESS)
Main PID: 3140 (code=exited, status=0/SUCCESS)
CGroup: /system.slice/kdump-tools.service

Mar 02 07:48:15 solo-prox-01 kdump-tools[3140]: Starting kdump-tools: Could not find an installed debug vmlinux image and
Mar 02 07:48:15 solo-prox-01 kdump-tools[3140]: DEBUG_KERNEL is not specified in /etc/default/kdump-tools
Mar 02 07:48:15 solo-prox-01 kdump-tools[3140]: makedumpfile may be limited to -d 1 ... (warning).
Mar 02 07:48:16 solo-prox-01 kdump-tools[3140]: loaded kdump kernel.
root@solo-prox-01:~# dmesg | grep -i crash
[ 0.000000] Command line: BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-4.2.8-1-pve root=ZFS=/ROOT/pve-1 ro boot=zfs root=ZFS=rpool/ROOT/pve-1 boot=zfs crashkernel=128M nmi_watchdog=1 quiet
[ 0.000000] Reserving 128MB of memory at 704MB for crashkernel (System RAM: 32741MB)
[ 0.000000] Kernel command line: BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-4.2.8-1-pve root=ZFS=/ROOT/pve-1 ro boot=zfs root=ZFS=rpool/ROOT/pve-1 boot=zfs crashkernel=128M nmi_watchdog=1 quiet
root@solo-prox-01:~#

lumen · Mar 2, 2016

I have a similar problem. But I not use zfs. My server rebooted. And a see then my virutal machine normaly shutdown.
I see the time server shutdown, ,y server normaly shutdown virutal machine:
daemon.log
Mar 2 00:28:49 vm3 zed[1662]: Exiting
Mar 2 00:28:49 vm3 zabbix-agent[22995]: zabbix_agentd stopping...done.
Mar 2 00:28:49 vm3 postfix[22996]: Stopping Postfix Mail Transport Agent: postfix.
Mar 2 00:28:50 vm3 hwclock[22986]: hwclock from util-linux 2.25.2
Mar 2 00:28:50 vm3 hwclock[22986]: Using the /dev interface to the clock.
Mar 2 00:28:50 vm3 hwclock[22986]: Last drift adjustment done at 1454418075 seconds after 1969
Mar 2 00:28:50 vm3 hwclock[22986]: Last calibration done at 1454418075 seconds after 1969
Mar 2 00:28:50 vm3 hwclock[22986]: Hardware clock is on UTC time
Mar 2 00:28:50 vm3 hwclock[22986]: Assuming hardware clock is kept in UTC time.
Mar 2 00:28:50 vm3 hwclock[22986]: Waiting for clock tick...
Mar 2 00:28:50 vm3 hwclock[22986]: ...got clock tick
Mar 2 00:28:50 vm3 hwclock[22986]: Time read from Hardware Clock: 2016/03/01 20:28:50
Mar 2 00:28:50 vm3 hwclock[22986]: Hw clock time : 2016/03/01 20:28:50 = 1456864130 seconds since 1969
Mar 2 00:28:50 vm3 hwclock[22986]: 1456864130.500000 is close enough to 1456864130.500000 (0.000000 < 0.001000)
Mar 2 00:28:50 vm3 hwclock[22986]: Set RTC to 1456864130 (1456864130 + 0; refsystime = 1456864130.000000)
Mar 2 00:28:50 vm3 hwclock[22986]: Setting Hardware Clock to 20:28:50 = 1456864130 seconds since 1969
Mar 2 00:28:50 vm3 hwclock[22986]: ioctl(RTC_SET_TIME) was successful.
Mar 2 00:28:50 vm3 hwclock[22986]: Clock drifted 0.0 seconds in the past 2446055 seconds in spite of a drift factor of 0.000106 seconds/day.
Mar 2 00:28:50 vm3 hwclock[22986]: Adjusting drift factor by 0.000036 seconds/day
Mar 2 00:28:51 vm3 pve-manager[23108]: shutdown VM 140006: UPID:vm3:00005A44:0E9429CD:56D5FB83:qmshutdown:140006:root@pam:
Mar 2 00:28:51 vm3 pve-manager[23110]: shutdown VM 140005: UPID:vm3:00005A46:0E9429CE:56D5FB83:qmshutdown:140005:root@pam:
Mar 2 00:28:51 vm3 pve-manager[23112]: shutdown VM 140004: UPID:vm3:00005A48:0E9429D3:56D5FB83:qmshutdown:140004:root@pam:
Mar 2 00:28:51 vm3 pve-manager[23113]: shutdown VM 303: UPID:vm3:00005A49:0E9429D3:56D5FB83:qmshutdown:303:root@pam:
Mar 2 00:28:51 vm3 pve-manager[23116]: shutdown VM 302: UPID:vm3:00005A4C:0E9429D5:56D5FB83:qmshutdown:302:root@pam:
Mar 2 00:28:51 vm3 hwclock[23072]: hwclock from util-linux 2.25.2
Mar 2 00:28:51 vm3 hwclock[23072]: Using the /dev interface to the clock.
Mar 2 00:28:51 vm3 hwclock[23072]: Last drift adjustment done at 1456864130 seconds after 1969
Mar 2 00:28:51 vm3 hwclock[23072]: Last calibration done at 1456864130 seconds after 1969
Mar 2 00:28:51 vm3 hwclock[23072]: Hardware clock is on UTC time
Mar 2 00:28:51 vm3 hwclock[23072]: Assuming hardware clock is kept in UTC time.
Mar 2 00:28:51 vm3 hwclock[23072]: Waiting for clock tick...
Mar 2 00:28:51 vm3 hwclock[23072]: ...got clock tick
Mar 2 00:28:51 vm3 hwclock[23072]: Time read from Hardware Clock: 2016/03/01 20:28:51
Mar 2 00:28:51 vm3 hwclock[23072]: Hw clock time : 2016/03/01 20:28:51 = 1456864131 seconds since 1969
Mar 2 00:28:51 vm3 hwclock[23072]: 1456864131.500000 is close enough to 1456864131.500000 (0.000000 < 0.001000)
Mar 2 00:28:51 vm3 hwclock[23072]: Set RTC to 1456864131 (1456864131 + 0; refsystime = 1456864131.000000)
Mar 2 00:28:51 vm3 hwclock[23072]: Setting Hardware Clock to 20:28:51 = 1456864131 seconds since 1969
Mar 2 00:28:51 vm3 hwclock[23072]: ioctl(RTC_SET_TIME) was successful.
Mar 2 00:28:51 vm3 hwclock[23072]: Not adjusting drift factor because it has been less than a day since the last calibration.
Mar 2 00:28:52 vm3 hwclock[23121]: hwclock from util-linux 2.25.2
Mar 2 00:28:52 vm3 hwclock[23121]: Using the /dev interface to the clock.
Mar 2 00:28:52 vm3 hwclock[23121]: Last drift adjustment done at 1456864131 seconds after 1969
Mar 2 00:28:52 vm3 hwclock[23121]: Last calibration done at 1456864131 seconds after 1969
Mar 2 00:28:52 vm3 hwclock[23121]: Hardware clock is on UTC time
Mar 2 00:28:52 vm3 hwclock[23121]: Assuming hardware clock is kept in UTC time.
Mar 2 00:28:52 vm3 hwclock[23121]: Waiting for clock tick...
Mar 2 00:28:52 vm3 hwclock[23121]: ...got clock tick
Mar 2 00:28:52 vm3 hwclock[23121]: Time read from Hardware Clock: 2016/03/01 20:28:52
Mar 2 00:28:52 vm3 hwclock[23121]: Hw clock time : 2016/03/01 20:28:52 = 1456864132 seconds since 1969
Mar 2 00:28:52 vm3 hwclock[23121]: 1456864132.500000 is close enough to 1456864132.500000 (0.000000 < 0.001000)
Mar 2 00:28:52 vm3 hwclock[23121]: Set RTC to 1456864132 (1456864132 + 0; refsystime = 1456864132.000000)
Mar 2 00:28:52 vm3 hwclock[23121]: Setting Hardware Clock to 20:28:52 = 1456864132 seconds since 1969
Mar 2 00:28:52 vm3 hwclock[23121]: ioctl(RTC_SET_TIME) was successful.
Mar 2 00:28:52 vm3 hwclock[23121]: Not adjusting drift factor because it has been less than a day since the last calibration.
Mar 2 00:28:53 vm3 hwclock[23153]: hwclock from util-linux 2.25.2
Mar 2 00:28:53 vm3 hwclock[23153]: Using the /dev interface to the clock.
Mar 2 00:28:53 vm3 hwclock[23153]: Last drift adjustment done at 1456864132 seconds after 1969
Mar 2 00:28:53 vm3 hwclock[23153]: Last calibration done at 1456864132 seconds after 1969
Mar 2 00:28:53 vm3 hwclock[23153]: Hardware clock is on UTC time
Mar 2 00:28:53 vm3 hwclock[23153]: Assuming hardware clock is kept in UTC time.
Mar 2 00:28:53 vm3 hwclock[23153]: Waiting for clock tick...
Mar 2 00:28:53 vm3 hwclock[23153]: ...got clock tick
Mar 2 00:28:53 vm3 hwclock[23153]: Time read from Hardware Clock: 2016/03/01 20:28:53
Mar 2 00:28:53 vm3 hwclock[23153]: Hw clock time : 2016/03/01 20:28:53 = 1456864133 seconds since 1969
Mar 2 00:28:53 vm3 hwclock[23153]: 1456864133.500000 is close enough to 1456864133.500000 (0.000000 < 0.001000)
Mar 2 00:28:53 vm3 hwclock[23153]: Set RTC to 1456864133 (1456864133 + 0; refsystime = 1456864133.000000)
Mar 2 00:28:53 vm3 hwclock[23153]: Setting Hardware Clock to 20:28:53 = 1456864133 seconds since 1969
Mar 2 00:28:53 vm3 hwclock[23153]: ioctl(RTC_SET_TIME) was successful.
Mar 2 00:28:53 vm3 hwclock[23153]: Not adjusting drift factor because it has been less than a day since the last calibration.
Mar 2 00:28:54 vm3 hwclock[23184]: hwclock from util-linux 2.25.2
Mar 2 00:28:54 vm3 hwclock[23184]: Using the /dev interface to the clock.
Mar 2 00:28:54 vm3 hwclock[23184]: Last drift adjustment done at 1456864133 seconds after 1969
Mar 2 00:28:54 vm3 hwclock[23184]: Last calibration done at 1456864133 seconds after 1969
Mar 2 00:28:54 vm3 hwclock[23184]: Hardware clock is on UTC time
Mar 2 00:28:54 vm3 hwclock[23184]: Assuming hardware clock is kept in UTC time.
Mar 2 00:28:54 vm3 hwclock[23184]: Waiting for clock tick...
Mar 2 00:28:54 vm3 hwclock[23184]: ...got clock tick
Mar 2 00:28:54 vm3 hwclock[23184]: Time read from Hardware Clock: 2016/03/01 20:28:54
Mar 2 00:28:54 vm3 hwclock[23184]: Hw clock time : 2016/03/01 20:28:54 = 1456864134 seconds since 1969
Mar 2 00:28:54 vm3 hwclock[23184]: 1456864134.500000 is close enough to 1456864134.500000 (0.000000 < 0.001000)
Mar 2 00:28:54 vm3 hwclock[23184]: Set RTC to 1456864134 (1456864134 + 0; refsystime = 1456864134.000000)
Mar 2 00:28:54 vm3 hwclock[23184]: Setting Hardware Clock to 20:28:54 = 1456864134 seconds since 1969
Mar 2 00:28:54 vm3 hwclock[23184]: ioctl(RTC_SET_TIME) was successful.
Mar 2 00:28:54 vm3 hwclock[23184]: Not adjusting drift factor because it has been less than a day since the last calibration.

auth.log
Mar 2 00:34:48 vm3 systemd-logind[1668]: Watching system buttons on /dev/input/event1 (Power Button)
Mar 2 00:34:48 vm3 systemd-logind[1668]: Watching system buttons on /dev/input/event0 (Power Button)
Mar 2 00:34:48 vm3 systemd-logind[1668]: New seat seat0.

Please help me!!

LnxBil · Mar 2, 2016

@lumen: Please create your own thread for the problem.

LnxBil · Mar 2, 2016

techsolo said:

Code:

root@solo-prox-01:~# service kdump-tools status
● kdump-tools.service - Kernel crash dump capture service
  Loaded: loaded (/lib/systemd/system/kdump-tools.service; enabled)
  Active: active (exited) since Wed 2016-03-02 07:48:16 CET; 1h 59min ago
  Process: 3140 ExecStart=/etc/init.d/kdump-tools start (code=exited, status=0/SUCCESS)
Main PID: 3140 (code=exited, status=0/SUCCESS)
  CGroup: /system.slice/kdump-tools.service

Mar 02 07:48:15 solo-prox-01 kdump-tools[3140]: Starting kdump-tools: Could not find an installed debug vmlinux image and
Mar 02 07:48:15 solo-prox-01 kdump-tools[3140]: DEBUG_KERNEL is not specified in /etc/default/kdump-tools
Mar 02 07:48:15 solo-prox-01 kdump-tools[3140]: makedumpfile may be limited to -d 1 ... (warning).
Mar 02 07:48:16 solo-prox-01 kdump-tools[3140]: loaded kdump kernel.


root@solo-prox-01:~# dmesg | grep -i crash
[  0.000000] Command line: BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-4.2.8-1-pve root=ZFS=/ROOT/pve-1 ro boot=zfs root=ZFS=rpool/ROOT/pve-1 boot=zfs crashkernel=128M nmi_watchdog=1 quiet
[  0.000000] Reserving 128MB of memory at 704MB for crashkernel (System RAM: 32741MB)
[  0.000000] Kernel command line: BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-4.2.8-1-pve root=ZFS=/ROOT/pve-1 ro boot=zfs root=ZFS=rpool/ROOT/pve-1 boot=zfs crashkernel=128M nmi_watchdog=1 quiet

Looks good.

Do you have any entries in /var/crash ?
Where did you get your message crash: cannot find booted kernel -- please enter namelist argument ?

techsolo · Mar 2, 2016

LnxBil said:
Looks good.

Do you have any entries in /var/crash ?
Where did you get your message crash: cannot find booted kernel -- please enter namelist argument ?

by simple running "crash"

techsolo · Mar 2, 2016

LnxBil said:
Looks good.

Do you have any entries in /var/crash ?
Where did you get your message crash: cannot find booted kernel -- please enter namelist argument ?

One file

root@solo-prox-01:~# cat /var/crash/kexec_cmd
/sbin/kexec -p --command-line="BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-4.2.8-1-pve root=ZFS=/ROOT/pve-1 ro boot=zfs root=ZFS=rpool/ROOT/pve-1 boot=zfs nmi_watchdog=1 quiet irqpoll maxcpus=1 nousb systemd.unit=kdump-tools.service" --initrd=/boot/initrd.img-4.2.8-1-pve /boot/vmlinuz-4.2.8-1-pve

LnxBil · Mar 2, 2016

It is normal, that crash cannot work on PVE kernels, Proxmox does not provide the necessary kernels with debug symbols.

Normally you get a directory with the number only date with time of the crash and therein the kernel buffer log (dmesg) with the crash information:

Code:

drwxr-xr-x 2 root root 4096 Feb 21 21:10 201602212105
lrwxrwxrwx 1 root root    8 Mär  2 09:08 kernel_link -> /vmlinuz
-rw-r--r-- 1 root root  284 Mär  2 09:08 kexec_cmd

It is often sufficient for analysis. So your crashkernel does not work correctly. Please try to configure it and test it by crashing your system on purpose.

I have these entries in my /etc/default/kdump-tools:

Code:

USE_KDUMP=1
KDUMP_COREDIR="/var/crash"
KDUMP_SYSCTL="kernel.panic_on_oops=1 kernel.panic_on_unrecovered_nmi=1"
DEBUG_KERNEL=/vmlinuz
MAKEDUMP_ARGS="-c --message-level 7 -d 11,31"

Random Reboots PVE 4.1

New Member

Distinguished Member

New Member

Distinguished Member

New Member

New Member

Distinguished Member

Famous Member

Distinguished Member

New Member

Distinguished Member

New Member

Distinguished Member

New Member

New Member

Distinguished Member

Distinguished Member

New Member

New Member

Distinguished Member