Random Restarting

moscore

Dec 11, 2016
For the last month or so Proxmox has been randomly restarting itself. It started when I began getting KVM errors after updating from the free PVE repo; I worked around those by changing the CPU type on my VMs from host to kvm64, but the host still restarts every few days and nothing looks too out of the ordinary in the logs. Here's the portion of the log from when it restarted at 20:17:37 two days ago. I suspect an update somehow got messed up and is causing it. I've kept the system up to date hoping a new update would fix it, but so far no good. Any ideas on how to fix this?
Code:
Dec  8 19:22:36 proxmox rrdcached[4083]: flushing old values
Dec  8 19:22:36 proxmox rrdcached[4083]: rotating journals
Dec  8 19:22:36 proxmox rrdcached[4083]: started new journal /var/lib/rrdcached/journal/rrd.journal.1481242956.890250
Dec  8 19:22:36 proxmox rrdcached[4083]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1481235756.890309
Dec  8 19:22:40 proxmox smartd[4025]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 107 to 106
Dec  8 19:22:40 proxmox smartd[4025]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 100 to 99
Dec  8 19:22:40 proxmox smartd[4025]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 148 to 144
Dec  8 19:22:41 proxmox smartd[4025]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 108 to 107
Dec  8 19:22:41 proxmox smartd[4025]: Device: /dev/sdd [SAT], SMART Usage Attribute: 7 Seek_Error_Rate changed from 200 to 100
Dec  8 19:22:41 proxmox smartd[4025]: Device: /dev/sde [SAT], SMART Usage Attribute: 190 Temperature_Case changed from 78 to 77
Dec  8 19:42:10 proxmox systemd-timesyncd[3641]: interval/delta/delay/jitter/drift 2048s/-0.000s/0.042s/0.018s/+21ppm
Dec  8 19:52:42 proxmox smartd[4025]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 99 to 100
Dec  8 19:52:43 proxmox smartd[4025]: Device: /dev/sdd [SAT], SMART Usage Attribute: 7 Seek_Error_Rate changed from 100 to 200
Dec  8 19:52:43 proxmox smartd[4025]: Device: /dev/sde [SAT], SMART Usage Attribute: 190 Temperature_Case changed from 77 to 78
Dec  8 19:54:31 proxmox pve-firewall[4282]: firewall update time (19.856 seconds)
Dec  8 19:54:36 proxmox pve-ha-crm[4302]: loop take too long (33 seconds)
Dec  8 19:54:36 proxmox pve-ha-lrm[4313]: loop take too long (33 seconds)
Dec  8 19:54:41 proxmox pvestatd[4283]: could not activate storage 'local-zfs', zfs error: use the form 'zpool import <pool | id> <newpool>' to give it a new name
Dec  8 19:54:41 proxmox pvestatd[4283]: status update time (39.576 seconds)
Dec  8 20:03:06 proxmox pvestatd[4283]: command 'zfs get -o value -Hp available,used storage/vm-storage' failed: got timeout
Dec  8 20:03:06 proxmox pvestatd[4283]: status update time (15.002 seconds)
Dec  8 20:13:24 proxmox pvestatd[4283]: status update time (8.328 seconds)
Dec  8 20:15:54 proxmox pvestatd[4283]: status update time (8.239 seconds)
Dec  8 20:16:18 proxmox systemd-timesyncd[3641]: interval/delta/delay/jitter/drift 2048s/+0.001s/0.039s/0.019s/+21ppm
Dec  8 20:17:37 proxmox rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="4095" x-info="http://www.rsyslog.com"] start
Dec  8 20:17:37 proxmox systemd-modules-load[2068]: Module 'fuse' is builtin
Dec  8 20:17:37 proxmox systemd-modules-load[2068]: Inserted module 'vhost_net'
Dec  8 20:17:37 proxmox kernel: [  0.000000] Initializing cgroup subsys cpuset
Dec  8 20:17:37 proxmox kernel: [  0.000000] Initializing cgroup subsys cpu
Dec  8 20:17:37 proxmox kernel: [  0.000000] Initializing cgroup subsys cpuacct
Dec  8 20:17:37 proxmox kernel: [  0.000000] Linux version 4.4.35-1-pve (root@nora) (gcc version 4.9.2 (Debian 4.9.2-10) ) #1 SMP Tue Dec 6 09:55:45 CET 2016 ()
Dec  8 20:17:37 proxmox kernel: [  0.000000] Command line: BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-4.4.35-1-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet elevator=noop
 
Yes indeed, we have seen the same behaviour with our Supermicro server for about six months.
Roughly every week (more or less) the Proxmox system does a reset (and the VMs without autostart are killed and stay offline).
The main problem for me was explaining it: I can't find a single suspicious fact, there are no errors in the log files, and no Linux test suite finds anything...
We have done a lot of work to track it down and tried everything: different setups, BIOS settings, power supply...
The server has now been back at hardware support for many weeks...
It is a Supermicro dual-CPU board; I'll have to look up the exact hardware and will post it later.

maxprox
 
I'm using a Dell PowerEdge T20 with a Xeon E3-1225 v3 and have a mirrored ZFS rpool and Raid-Z storage setup. I tried removing the l2arc and slog to see if that was causing it but that wasn't it. I updated to PVE 4.4 the day it came out and it's still randomly rebooting. At this point I might reinstall Proxmox but I really don't want to do that.
 
The last few times the system restarted itself, it was caused by something in ZFS. The pool holding the VM image HDDs was under huge load: the 1-minute load average jumped to 40+, ZFS blocked writes to that pool, and z_wr_iss used all the CPU.

Last log:
Code:
Dec 15 02:50:25 nmz-lt kernel: [703318.353176]       Tainted: P           O    4.4.35-1-pve #1
Dec 15 02:50:26 nmz-lt kernel: [703318.353185] txg_sync        D ffff880c06adbac8     0  5373      2 0x00000000
Dec 15 02:50:26 nmz-lt kernel: [703318.353193]  ffff880c06adc000 ffff880c3fd97180 7fffffffffffffff ffff8801afee87b8
Dec 15 02:50:26 nmz-lt kernel: [703318.353197] Call Trace:
Dec 15 02:50:26 nmz-lt kernel: [703318.353208]  [<ffffffff81856155>] schedule+0x35/0x80
Dec 15 02:50:26 nmz-lt kernel: [703318.353213]  [<ffffffff8102d736>] ? __switch_to+0x256/0x5c0
Dec 15 02:50:26 nmz-lt kernel: [703318.353219]  [<ffffffff8185564b>] io_schedule_timeout+0xbb/0x140
Dec 15 02:50:26 nmz-lt kernel: [703318.353231]  [<ffffffff810c4190>] ? wait_woken+0x90/0x90
Dec 15 02:50:26 nmz-lt kernel: [703318.353236]  [<ffffffffc00fbd88>] __cv_wait_io+0x18/0x20 [spl]
Dec 15 02:50:26 nmz-lt kernel: [703318.353278]  [<ffffffffc024499f>] zio_wait+0x10f/0x1f0 [zfs]

I suspect it started with the 4.4 kernel. I never had problems with previous kernels (I used the 4.2 kernel only briefly).
 
Hi, same here, running Proxmox on kernel 4.4.35-1-pve.

Does anyone have a workaround?

Dave
 
Have you limited the amount of memory ZFS can use for its cache?

Yes, my ARC size has been limited to 10G for a long time.

After the last crash I limited zfs_dirty_data_max to 512M (was ~5G) and zfs_txg_timeout to 2 seconds. Write performance dropped, but the system has run with no crash so far.
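For anyone wanting to try the same limits, here's a sketch (assuming ZFS on Linux, which exposes its tunables under /sys/module/zfs/parameters; the 512M figure is Nemesiz's own choice, not a general recommendation):

```shell
# Sketch: Nemesiz's limits as runtime ZFS module parameters.
# These reset on reboot; to persist them, add
#   options zfs zfs_dirty_data_max=536870912 zfs_txg_timeout=2
# to /etc/modprobe.d/zfs.conf and run update-initramfs -u.
dirty_max=$((512 * 1024 * 1024))   # 512M in bytes
echo "zfs_dirty_data_max=$dirty_max"
echo "zfs_txg_timeout=2"
# On the live system, as root:
#   echo "$dirty_max" > /sys/module/zfs/parameters/zfs_dirty_data_max
#   echo 2 > /sys/module/zfs/parameters/zfs_txg_timeout
```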
 
Interesting Nemesiz, I saw the exact same behaviour you did, but haven't seen any crashes since limiting the ARC. I still see significant pauses with large writes, but hadn't yet limited other settings as you have.

It would be helpful if this issue got some more attention.
 
I'm using a Dell PowerEdge T20 with a Xeon E3-1225 v3 and have a mirrored ZFS rpool and Raid-Z storage setup. I tried removing the l2arc and slog to see if that was causing it but that wasn't it. I updated to PVE 4.4 the day it came out and it's still randomly rebooting. At this point I might reinstall Proxmox but I really don't want to do that.

In our experience, the system resets when ZFS puts high memory pressure on a system that is already short on RAM. It has also been reported that swap on ZFS is connected to this. Since we set these variables upon installation...

Code:
zfs set primarycache=metadata rpool/swap
zfs set secondarycache=metadata rpool/swap
zfs set compression=off rpool/swap
zfs set sync=disabled rpool/swap
zfs set checksum=off rpool/swap

...and carefully ensure that the ARC never fills the system's memory...

Code:
nano /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=10737418240
update-initramfs -u

...we have managed to avoid the reboots. Don't blindly copy our zfs_arc_max value; set one that's appropriate for your system's RAM size and expected maximum memory pressure. arcstat.py is your friend!
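For reference, 10737418240 is 10 GiB. A quick sketch for deriving your own zfs_arc_max in bytes (the 10 here is just the example value from the post above):

```shell
# Sketch: turn a GiB budget into the byte value zfs_arc_max expects.
arc_gib=10                                  # choose for your RAM / workload
arc_max=$((arc_gib * 1024 * 1024 * 1024))
echo "options zfs zfs_arc_max=$arc_max"     # the line for /etc/modprobe.d/zfs.conf
```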
 
@gkovacs I already set primarycache=metadata and secondarycache=metadata on my swap. I've also seen people say weird things happen when you run low on RAM if you're running ZFS. I initially had my ARC limited to 8GB, but I just remembered the problems started when I upped it to 10GB. It could be just a coincidence, but I've lowered it back to 8GB and updated the system, so we'll see what happens now. If it still restarts I'll change the rest of my swap properties to match yours. I removed my l2arc since I don't really need it at this point.
I occasionally glance at arc_summary.py because it provides much more information than arcstat.py. I'm also seeing a higher than expected IO delay (25%-50%) even though I have an Intel DC S3500 over-provisioned to 8GB for my SLOG, which is only slightly lower than when I disable it. But I guess that's a different issue.
 
For me it would be very interesting to know whether this error also appears on systems WITHOUT ZFS (like mine).

Here is our hardware. We use the onboard hardware RAID controller, ext4 with LVM (nothing on this machine uses ZFS):

Supermicro server with an X10DRC-LN4+ mainboard (Intel C612 chipset);
onboard it has 10x SATA 6G and an LSI 3108 SAS chip with 8x SAS3 (12Gbps) ports
(like the LSI MegaRAID SAS3 9361-8i 12Gb/s RAID controller).
2x Intel Xeon E5-2609v3 6-core 1.9GHz, 15MB cache, 6.4GT/s; 64GB RAM
HDDs:
2x Intel SSD DC S3700, 100GB
4x HGST Ultrastar 7K4000 HUS724020ALS640 2TB 3.5" SAS-2 7200rpm 64MB

regards,
maxprox
 
I reduced my max ARC value back down to 8GB and I'm at 11 days of uptime, which is more than double what I had before. Looks like that might've fixed it.

Not sure what to say maxprox. Is there still nothing peculiar in your syslog right before it restarts?
 
Check your BIOS for the hardware watchdog and turn it off. I had that ghost-reboot nonsense happen to me due to the hardware watchdog on Supermicro X8s and X9s. The ZFS angle is interesting, though: since I run all SSDs (Samsung 840s), I turn my ARC down to 2GB, as I'm more concerned with having memory for VMs than with VM disk speed. One thing, though: if it were ZFS and the ARC, you should see an OOM error in dmesg/syslog/messages, not just a ghost reboot with no errors (which is why I suspect the hardware watchdog).

I'm interested to see the result, mostly because I'll be scaling out to 30+ nodes (already at 11), and I'd rather not deal with this kind of thing when running at full capacity.
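If you want to rule the hardware watchdog out from the OS side before touching the BIOS, here's a sketch using ipmitool (an assumption on my part: it presumes ipmitool is installed and the board has a working IPMI BMC, which Supermicro X8/X9 boards do):

```shell
# Sketch: inspect and stop the BMC watchdog timer with ipmitool.
# Run as root; requires the ipmi_devintf/ipmi_si kernel modules.
if command -v ipmitool >/dev/null 2>&1; then
    ipmitool mc watchdog get   # show the current watchdog timer state
    ipmitool mc watchdog off   # stop the timer so it can't reset the host
else
    echo "ipmitool not installed; check the watchdog in the BIOS instead"
fi
```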
 
I'm not sure if what happens to you is the same thing that happened to me some months ago. It was just after upgrading from 3.x to 4.x (I think 4.2). The machine got stalled and my hosting provider (OVH) just rebooted it.

It was happening every 4 or 5 days. I never found the root cause, but I saw in the network stats that I was receiving several attacks accounting for some MB per second. I installed fail2ban in all my VMs and on the Proxmox host, watching SSH login attempts, and the problem just disappeared.

Just give it a look.
 
I reduced my max ARC value back down to 8GB and I'm at 11 days of uptime, which is more than double what I had before. Looks like that might've fixed it.

Not sure what to say maxprox. Is there still nothing peculiar in your syslog right before it restarts?

Check your BIOS for the hardware watchdog and turn it off. I had that ghost-reboot nonsense happen to me due to the hardware watchdog on Supermicro X8s and X9s. The ZFS angle is interesting, though: since I run all SSDs (Samsung 840s), I turn my ARC down to 2GB, as I'm more concerned with having memory for VMs than with VM disk speed. One thing, though: if it were ZFS and the ARC, you should see an OOM error in dmesg/syslog/messages, not just a ghost reboot with no errors (which is why I suspect the hardware watchdog).

I'm interested to see the result, mostly because I'll be scaling out to 30+ nodes (already at 11), and I'd rather not deal with this kind of thing when running at full capacity.

Spontaneous reboots with nothing peculiar in syslog are most likely caused by ZFS during high memory pressure situations.

We have had many of these occur on wildly different hardware architectures, on both 3.x (kernel 2.6.32) and 4.x (kernel 4.4), where the only common factor was that both system and swap were on ZFS. In particular, if the ARC was unlimited, memory pressure from KVM guests was high, and the system started swapping out, it would eventually reboot without any logged reason.

We have employed several tactics to solve this, but we are not sure which of them are real solutions, as we probably only prevent the high memory pressure situations:

1. Limit the ARC size
2. Limit guest memory pressure (so together with arc_max they always fit in physical memory)
3. Prevent unnecessary swapping (vm.swappiness=0)
4. Disabling ARC re-caching of the swap zvol (zfs set primarycache=metadata rpool/swap, zfs set secondarycache=metadata rpool/swap)

We don't know whether swapping itself, or specifically swap on ZFS, is needed to induce the spontaneous reboot. We have had reboots when all four steps were in effect but memory still grew past the physical limit. Someone who has ZFS test nodes should probably try putting swap on a different disk (not ZFS), without the above four steps, and see whether the reboots happen when memory runs out.
 
Spontaneous reboots with nothing peculiar in syslog are most likely caused by ZFS during high memory pressure situations.
...
In particular, if the ARC was unlimited, memory pressure from KVM guests was high, and the system started swapping out, it would eventually reboot without any logged reason.
...
We don't know whether swapping itself, or specifically swap on ZFS, is needed to induce the spontaneous reboot. We have had reboots when all four steps were in effect but memory still grew past the physical limit.
Now I realize the ARC memory pressure issue was my problem. I've had swappiness=1 since before the issue started and it's been fine; even now my ZFS swap always has a few MBs in use and it's still fine.

The devs should really change the defaults to primarycache=metadata and secondarycache=metadata on rpool/swap, though.
 
Don't set vm.swappiness to 0; set it to 1 for safety's sake.

vm.swappiness = 0 The kernel will swap only to avoid an out of memory condition, when free memory will be below vm.min_free_kbytes limit. See the "VM Sysctl documentation".
vm.swappiness = 1 Kernel version 3.5 and over, as well as Red Hat kernel version 2.6.32-303 and over: Minimum amount of swapping without disabling it entirely.
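To make the vm.swappiness=1 recommendation persistent across reboots, a sysctl fragment along these lines should do (a sketch; the filename is arbitrary):

```
# /etc/sysctl.d/99-swappiness.conf  (apply immediately with "sysctl --system")
vm.swappiness = 1
```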

Also, think about setting a minimum limit on the ARC size. I've had more consistent write(!?!) performance with a 512MB minimum ARC than with the 32MB default.

-J

Spontaneous reboots with nothing peculiar in syslog are most likely caused by ZFS during high memory pressure situations.
...
1. Limit the ARC size
2. Limit guest memory pressure (so together with arc_max they always fit in physical memory)
3. Prevent unnecessary swapping (vm.swappiness=0)
4. Disabling ARC re-caching of the swap zvol (zfs set primarycache=metadata rpool/swap, zfs set secondarycache=metadata rpool/swap)
 
I'm sure all the tips and settings described here with "zfs set" and "vm.swappiness" are good to know and useful. But I think they are neither the cause of nor the solution to the real "random restarting" problem.
My affected server was a factory-new Supermicro server with two CPUs and 64 GB RAM, running only a small Windows Server 2008 test VM; there was almost no load...
Maybe the watchdog, maybe the BIOS, maybe hardware or the kernel: I do not know.
Our server came back today from the warranty workshop, with the statement that the problem was solved by BIOS adjustments.
I don't believe that yet. I will test again over the next days and report back if there is something to report.
 
