Proxmox Crash (hung_task_timeout_secs for corosync)

Alex66955

New Member
Mar 3, 2016
Hey,

I have to evaluate Proxmox VE for my company, so I built a small test environment on our production server.

The setup:
Because of resource limitations I built a nested test environment: the Proxmox VE server runs as a VM on a QEMU host hypervisor. The Proxmox VE VM is given the host CPU with all flags.
I built a small cluster with an additional laptop for test cases.
About 4 LXC containers and 2 VMs are running.

The Problem:
Proxmox crashes every night (the laptop node is not affected).
  • The virtual machines do not respond
  • The Proxmox VE web interface does not respond
  • The SSH connection to the Proxmox VE server still works

More details:
  • Kernel Output
Code:
Apr  8 03:51:32 vp-proxmoxS2 systemd-timesyncd[1634]: interval/delta/delay/jitter/drift 2048s/+0.012s/0.056s/0.017s/+18ppm
Apr  8 03:55:01 vp-proxmoxS2 CRON[23033]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Apr  8 04:05:01 vp-proxmoxS2 CRON[23908]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Apr  8 04:06:39 vp-proxmoxS2 kernel: [71608.447233] INFO: task corosync:2281 blocked for more than 600 seconds.
Apr  8 04:06:39 vp-proxmoxS2 kernel: [71608.449373]       Tainted: P           O    4.2.8-1-pve #1
Apr  8 04:06:39 vp-proxmoxS2 kernel: [71608.449549] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr  8 04:06:39 vp-proxmoxS2 kernel: [71608.449743] corosync        D ffff8800babf0000     0  2281      1 0x00000000
Apr  8 04:06:39 vp-proxmoxS2 kernel: [71608.449751]  ffff8800ba647e78 0000000000000086 ffff880232b59b80 ffff8800babf0000
Apr  8 04:06:39 vp-proxmoxS2 kernel: [71608.449757]  ffff8800babf0000 ffff8800ba648000 ffff8800ba647ee8 ffffffff821051e0
Apr  8 04:06:39 vp-proxmoxS2 kernel: [71608.449761]  00000000000004e0 000055a3289c19d0 ffff8800ba647e98 ffffffff818069f7
Apr  8 04:06:39 vp-proxmoxS2 kernel: [71608.449766] Call Trace:
Apr  8 04:06:39 vp-proxmoxS2 kernel: [71608.451187]  [<ffffffff818069f7>] schedule+0x37/0x80
Apr  8 04:06:39 vp-proxmoxS2 kernel: [71608.452381]  [<ffffffff8105f976>] kvm_async_pf_task_wait+0x1a6/0x230
Apr  8 04:06:39 vp-proxmoxS2 kernel: [71608.452747]  [<ffffffff810a66f0>] ? wake_up_q+0x70/0x70
  • Several "backup" cron jobs run on the host machine at this time. Could they cause this issue?
  • Host sar output around the crash time 04:05:01 (%idle at 0%, kbdirty falling, and %guest at 99,60% at the crash time)
Code:
00:00:01        CPU      %usr     %nice      %sys   %iowait    %steal      %irq     %soft    %guest    %gnice     %idle
03:55:01         15      0,35      0,00      1,35      2,10      0,00      0,00      0,01      1,29      0,00     94,91
04:05:01        all      0,55      0,00      2,85      1,52      0,00      0,00      0,02     11,64      0,00     83,42
04:05:01          0      0,27      0,00      4,78      2,59      0,00      0,00      0,06      0,36      0,00     91,94
04:05:01          1      2,02      0,00      3,65      1,79      0,00      0,00      0,03     16,05      0,00     76,46
04:05:01          2      1,32      0,00      4,07      2,21      0,00      0,00      0,03      4,65      0,00     87,72
04:05:01          3      0,00      0,00      0,49      0,00      0,00      0,00      0,02     99,49      0,00      0,00
04:05:01          4      1,06      0,00      5,79      2,52      0,00      0,00      0,04      0,53      0,00     90,06
04:05:01          5      0,94      0,00      5,70      2,22      0,00      0,00      0,03      2,91      0,00     88,21
04:05:01          6      0,18      0,00      2,52      1,78      0,00      0,00      0,02     21,93      0,00     73,55
04:05:01          7      0,56      0,00      4,93      3,09      0,00      0,00      0,03      4,90      0,00     86,50

00:00:01    kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
03:55:01       181380  32724272     99,45   4877488  20447132  30088364     73,90  11686132  16496504    738576
04:05:01       179052  32726600     99,46   4974528  20621728  30086092     73,89  11779152  16406220   1074908
04:15:01       175808  32729844     99,47   5070740  20698664  30084200     73,89   9519196  18671356    808932
04:25:01       175564  32730088     99,47   5156956  20689968  30117036     73,97  11588312  16584712    755108
04:35:01       178180  32727472     99,46   5378360  20408392  30085616     73,89  11836064  16293196     63884
04:45:01       172920  32732732     99,47   5527116  20143312  30116620     73,97   9976676  17980968     96348
04:55:01      2139548  30766104     93,50   5556120  18294352  30080700     73,88   9869980  16258724       224
05:05:01      2031256  30874396     93,83   5575984  18394112  30059260     73,83   9962708  16273540       352
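
To make the sar CPU table above easier to scan, here is a minimal Python sketch that parses rows with comma decimals and flags any core that is almost fully busy with guest time. The three sample rows are copied from the output above. The function names and the 90% / 5% thresholds are assumptions chosen for illustration, not sysstat defaults.

```python
# Column names as printed by sar in the output above.
SAR_HEADER = "CPU %usr %nice %sys %iowait %steal %irq %soft %guest %gnice %idle".split()

# Sample rows copied from the host sar output (comma as decimal separator).
SAR_ROWS = """\
04:05:01        all      0,55      0,00      2,85      1,52      0,00      0,00      0,02     11,64      0,00     83,42
04:05:01          3      0,00      0,00      0,49      0,00      0,00      0,00      0,02     99,49      0,00      0,00
04:05:01          6      0,18      0,00      2,52      1,78      0,00      0,00      0,02     21,93      0,00     73,55
"""


def parse_sar_cpu(text):
    """Turn sar CPU lines into dicts; header lines and malformed lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(SAR_HEADER) + 1:  # timestamp + one value per column
            continue
        timestamp, cpu, *values = fields
        try:
            numbers = [float(v.replace(",", ".")) for v in values]
        except ValueError:
            continue  # e.g. the header row with "%usr", "%nice", ...
        row = dict(zip(SAR_HEADER[1:], numbers))
        row["time"] = timestamp
        row["CPU"] = cpu
        rows.append(row)
    return rows


def pegged_cpus(rows, guest_min=90.0, idle_max=5.0):
    """Return CPU ids whose %guest is high and %idle is low (thresholds are hypothetical)."""
    return [
        r["CPU"]
        for r in rows
        if r["CPU"] != "all" and r["%guest"] >= guest_min and r["%idle"] <= idle_max
    ]


rows = parse_sar_cpu(SAR_ROWS)
print(pegged_cpus(rows))
```

With the sample rows this singles out CPU 3, the core that shows 0,00 %idle at 04:05:01.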

  • Package versions (pveversion -v)
Code:
proxmox-ve: 4.1-41 (running kernel: 4.2.8-1-pve)
pve-manager: 4.1-22 (running version: 4.1-22/aca130cf)
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.2.8-1-pve: 4.2.8-41
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-36
qemu-server: 4.0-64
pve-firmware: 1.1-7
libpve-common-perl: 4.0-54
libpve-access-control: 4.0-13
libpve-storage-perl: 4.0-45
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-9
pve-container: 1.0-52
pve-firewall: 2.0-22
pve-ha-manager: 1.0-25
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve7~jessie
openvswitch-switch: 2.3.2-2
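
To line the hung-task message in the kernel output up with the backup cron jobs, the following Python sketch pulls the task name, PID and timeout out of the log line and estimates the latest time at which the task can have become blocked. It assumes the message is only printed once the task has been blocked for the whole timeout, which is how the text "blocked for more than 600 seconds" reads. The regular expression and function name are illustrative, not part of any Proxmox tool.

```python
import re
from datetime import datetime, timedelta

# Matches the syslog time of day and the kernel hung-task message.
HUNG_TASK_RE = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2}) .*"
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)

# Line copied from the kernel output in this post.
LOG_LINE = (
    "Apr  8 04:06:39 vp-proxmoxS2 kernel: [71608.447233] "
    "INFO: task corosync:2281 blocked for more than 600 seconds."
)


def parse_hung_task(line):
    """Return details of a hung-task report, or None if the line does not match."""
    m = HUNG_TASK_RE.search(line)
    if m is None:
        return None
    reported = datetime.strptime(m["time"], "%H:%M:%S")
    secs = int(m["secs"])
    # Assumption: the task was already blocked for at least `secs` when reported.
    latest_start = reported - timedelta(seconds=secs)
    return {
        "comm": m["comm"],
        "pid": int(m["pid"]),
        "timeout_secs": secs,
        "reported_at": m["time"],
        "blocked_since_at_latest": latest_start.strftime("%H:%M:%S"),
    }


info = parse_hung_task(LOG_LINE)
print(info)
```

Under that assumption, corosync stopped making progress at 03:56:39 or earlier, which is the time window to compare against the host's backup cron jobs.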

What I have tried so far:
  • vm.dirty_background_ratio = 5
  • vm.dirty_ratio = 10
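
As a rough illustration of what these two values mean in kilobytes, here is a small Python sketch. The kernel applies both ratios as a percentage of available memory (free plus reclaimable pages), so using total RAM as done here gives only an upper bound. The memory figures are taken from the 03:55:01 row of the host sar output. It is an assumption that the sysctls were set on that same machine, since the post does not say where they were applied.

```python
def dirty_threshold_kb(available_kb, ratio_percent):
    """Size in kB of `ratio_percent` of the given memory amount (integer arithmetic)."""
    return available_kb * ratio_percent // 100


# kbmemfree + kbmemused from the 03:55:01 sar row; total RAM is used as a
# stand-in for "available memory", so the results are upper bounds.
total_kb = 181380 + 32724272

background_kb = dirty_threshold_kb(total_kb, 5)   # vm.dirty_background_ratio = 5
blocking_kb = dirty_threshold_kb(total_kb, 10)    # vm.dirty_ratio = 10

# Highest kbdirty value in the sar output of this post (04:05:01).
peak_dirty_kb = 1074908

print(f"background writeback starts below about {background_kb} kB")
print(f"writers are throttled below about {blocking_kb} kB")
print(f"peak kbdirty seen: {peak_dirty_kb} kB")
```

Because the real thresholds are lower than these upper bounds, the comparison with the observed kbdirty peak is only a hint and does not show on its own whether the settings took effect.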
 
