Seemingly random reboots (IO wait related?)

NxwJfm

Jul 6, 2016
Hello.

I've been using proxmox in production for some time now, and for the most part I'm very happy with it.

This morning, though, I had my second "random" reboot of a host server. The first was on a different server on August 23, but since that one was still running the older 4.1 release instead of the latest 4.2, I figured it wasn't really worth mentioning here.
But a few hours ago the server that rebooted was running the very latest 4.2 (enterprise repository) since I had upgraded it just last week. So I figured something else was going on.
I was awake enough this time to go and check the server load graph, and it looked like this:
http://imgur.com/a/JIt7z

So it looks like there was a lot of IO wait going on, and I guess the software watchdog decided that it should reboot.

One thing about these servers: they are Dells booting from Dell's dual SD card module, which is great when there is very little IO going on (as with ESXi) but very, very slow for typical Linux tasks such as apt-get update. The VMs themselves are all on iSCSI with multipathing, and I have 3 separate networks (LACP for Proxmox-related traffic, LACP for VMs, and 2 Ethernet ports for iSCSI), so regular day-to-day virtual machine activity shouldn't generate any local IO.

Looking at /etc/cron.d/pveupdate:
36 4 * * * root /usr/bin/pveupdate
Yep, server rebooted at 4:38...

I've looked at the other 7 servers, and they all show a small IO wait spike between 3 and 6 every morning, always at the time pveupdate is scheduled to run.
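To put a number on those spikes without relying on the graphs, the IO wait share can be computed from two samples of the aggregate "cpu" line in /proc/stat. This is only a minimal sketch: iowait_pct is a helper name made up for this post, and it assumes the usual field order on that line (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
#!/bin/sh
# iowait_pct "<earlier cpu line>" "<later cpu line>"
# Hypothetical helper: prints the share of CPU time spent in iowait between
# two samples of the aggregate "cpu" line from /proc/stat, as an integer
# percentage (truncated).
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i   # user .. steal
            if (NR == 1) { t0 = total; w0 = $6 }   # $6 is the iowait counter
            else         { t1 = total; w1 = $6 }
        }
        END {
            dt = t1 - t0
            if (dt <= 0) { print 0; exit }
            printf "%d\n", (w1 - w0) * 100 / dt
        }'
}

# Example use (not run here): sample 5 seconds apart while pveupdate runs.
#   a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
#   iowait_pct "$a" "$b"
```

Running this from a loop around the scheduled pveupdate time would show whether the spike on the rebooted host is really that much worse than on the other servers.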

Since this is production I need to find a solution quickly, even if it is quick and dirty (and temporary). I guess I'll go back to regular hard drives instead of the SDs, but in the meantime I can't really afford to have my hosts randomly reboot.
I see two ways to do that:
- Make the watchdog less sensitive (the softdog documentation is very lacking, so I'm not so sure how to do that)
- Disable all the stuff that requires local IO, like pveupdate. I could simply remove /etc/cron.d/pveupdate, but I don't know if the file is going to be generated again by some other process, so I'd very much like to know if there is an official way to disable this.
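For the second option, a quick and dirty alternative to deleting the file is to comment out the entry, so the original schedule stays visible and is easy to restore. The sketch below is not an official Proxmox mechanism: comment_out_cron is a made-up helper, and I don't know whether a package upgrade rewrites the file, so it would need checking after each upgrade.

```shell
#!/bin/sh
# comment_out_cron FILE NEEDLE
# Hypothetical helper: prefixes every line of FILE that contains the fixed
# string NEEDLE with "# ", unless the line is already a comment.
# The file is rewritten in place so its owner and permissions are kept.
comment_out_cron() {
    file=$1
    needle=$2
    tmp=$(mktemp) || return 1
    awk -v needle="$needle" '
        index($0, needle) && $0 !~ /^[[:space:]]*#/ { print "# " $0; next }
        { print }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}

# Example use (as root, not run here):
#   comment_out_cron /etc/cron.d/pveupdate /usr/bin/pveupdate
```

Running it twice is harmless, since lines that are already commented out are left alone.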

Does anyone else have experience with this? If there are better solutions I'm all ears :)

Thanks for your help!
 
Why don't you just boot Proxmox from iSCSI as well (if your hardware supports it)?

You need to provide more information about the crashes, e.g. dmesg output from the time of the crash or a real crash dump (you'll want to write this to the network, not to the SD cards). Without any real logging information, the cause could be anything.
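One way to get kernel messages off the box without touching the SD cards is the kernel's netconsole module, which sends them over UDP to another machine. The sketch below only builds the module parameter string; netconsole_param is a made-up helper, and the addresses, interface name and port in the example are placeholders, not values from this setup.

```shell
#!/bin/sh
# netconsole_param SRC_IP DEV TGT_IP [TGT_MAC] [PORT]
# Hypothetical helper: prints a value for the netconsole module parameter in
# the form  src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac
# If TGT_MAC is left empty, the kernel falls back to the broadcast address.
netconsole_param() {
    src_ip=$1
    dev=$2
    tgt_ip=$3
    tgt_mac=${4:-}
    port=${5:-6666}
    printf '%s@%s/%s,%s@%s/%s\n' \
        "$port" "$src_ip" "$dev" "$port" "$tgt_ip" "$tgt_mac"
}

# Example use (as root, placeholder values, not run here):
#   modprobe netconsole netconsole="$(netconsole_param 192.0.2.10 eth0 192.0.2.20)"
```

Something on the receiving machine has to listen on that UDP port and write the messages to a file (netcat or a syslog daemon, for example), so the last lines before a reboot end up somewhere that survives it.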

Concerning the SD card setup, it's interesting to see that it is actually used, but I can totally understand why it is dead slow. Using an RPi always makes me feel like I'm back in 1990 in terms of disk speed. Deleting the cron job will help, but you still need better storage. Normally the job also updates the Debian package lists, so it could be a good idea to move those files to a ramdisk to speed things up. Maybe you can also apply the tactics used on the RPi to reduce IO load to Proxmox in the same manner. But you're right, Proxmox VE is not made with SD cards in mind.
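As a rough sketch of the ramdisk idea: a tmpfs mount over the apt list directory keeps the package list updates in RAM. The helper name, the default size and the mount options below are my own assumptions, and the lists are gone after every reboot, so the next apt-get update has to download them again.

```shell
#!/bin/sh
# tmpfs_fstab_line MOUNTPOINT [SIZE]
# Hypothetical helper: prints an /etc/fstab entry that mounts a tmpfs
# (RAM-backed filesystem) at MOUNTPOINT. SIZE defaults to 256m, an arbitrary
# value chosen for this example.
tmpfs_fstab_line() {
    mountpoint=$1
    size=${2:-256m}
    printf 'tmpfs %s tmpfs rw,nosuid,nodev,size=%s 0 0\n' "$mountpoint" "$size"
}

# Example use (as root, not run here):
#   tmpfs_fstab_line /var/lib/apt/lists >> /etc/fstab
#   mount /var/lib/apt/lists
```

Depending on the apt version, the partial subdirectory under /var/lib/apt/lists may have to be recreated after mounting.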