Hi,
I need your advice, because we have a problem for which we haven't found any solution.
For an unknown reason, some VMs become unavailable: the load on the VM grows to the point where it's impossible to do anything. The only way to unblock the situation is to reboot the VM.
As all our VMs are monitored with Centreon, I have some graphs.
I have already found some threads describing a similar problem, but no solution:
https://forum.proxmox.com/threads/12982-qemu-nbd-bug?highlight=kworker
https://forum.proxmox.com/threads/21354-Why-are-my-VMs-dying-with-quot-hung_task_timeout_secs-quot
There are four servers, configured as a cluster. Each server has the same configuration.
Code:
$ pveversion -v
proxmox-ve-2.6.32: 3.4-156 (running kernel: 3.10.0-5-pve)
pve-manager: 3.4-6 (running version: 3.4-6/102d4547)
pve-kernel-2.6.32-39-pve: 2.6.32-156
pve-kernel-3.10.0-5-pve: 3.10.0-19
pve-kernel-2.6.32-34-pve: 2.6.32-140
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-17
qemu-server: 3.4-6
pve-firmware: 1.1-4
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-33
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.2-10
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1
There is no load on the host when the problem occurs on the VM.
Code:
top - 10:58:04 up 25 days, 22:35, 1 user, load average: 0.19, 0.21, 0.23
Tasks: 616 total, 1 running, 615 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.3 us, 0.1 sy, 0.0 ni, 99.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem: 257927 total, 31034 used, 226893 free, 0 buffers
MiB Swap: 243 total, 0 used, 243 free, 3618 cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4846 root 20 0 32.5g 2.0g 4024 S 3 0.8 866:49.94 kvm
4557 root 20 0 8737m 2.8g 4008 S 3 1.1 827:31.35 kvm
4963 root 20 0 32.6g 1.6g 4024 S 2 0.6 951:55.92 kvm
53389 root 20 0 4645m 1.3g 4024 S 2 0.5 860:33.95 kvm
4897 root 20 0 32.5g 1.6g 4148 S 2 0.6 917:29.60 kvm
31560 root 20 0 287m 61m 6232 S 2 0.0 0:02.64 pvedaemon worke
4704 root 20 0 4616m 1.2g 3988 S 2 0.5 991:31.37 kvm
4313 root 20 0 4616m 1.5g 3984 S 1 0.6 534:31.11 kvm
4462 root 20 0 4616m 1.3g 4108 S 1 0.5 949:10.53 kvm
4607 root 20 0 4617m 1.4g 3984 S 1 0.5 500:35.04 kvm
4732 root 20 0 4637m 1.2g 4000 S 1 0.5 724:27.02 kvm
29185 www-data 20 0 285m 61m 5160 S 1 0.0 0:03.53 pveproxy worker
3103 root 20 0 353m 54m 33m S 1 0.0 24:57.06 pmxcfs
3905 root 20 0 4616m 1.8g 3988 S 1 0.7 737:33.64 kvm
3 root 20 0 0 0 0 S 0 0.0 17:21.56 ksoftirqd/0
2183 root 20 0 0 0 0 S 0 0.0 12:47.38 xfsaild/dm-7
2982 root 20 0 371m 2328 888 S 0 0.0 18:25.42 rrdcached
3308 root 0 -20 201m 66m 42m S 0 0.0 64:04.29 corosync
4594 root 20 0 0 0 0 S 0 0.0 6:12.00 vhost-4557
5264 root 20 0 4630m 1.4g 3996 S 0 0.6 959:41.76 kvm
9205 root 20 0 4546m 4.1g 4148 S 0 1.6 28:06.42 kvm
16144 root 20 0 0 0 0 S 0 0.0 0:44.51 kworker/6:1
20152 root 20 0 4617m 1.1g 3984 S 0 0.4 298:02.00 kvm
Code:
$ sudo pveperf
CPU BOGOMIPS: 175688.20
REGEX/SECOND: 989918
HD SIZE: 0.95 GB (/dev/mapper/vg01-root)
BUFFERED READS: 325.33 MB/sec
AVERAGE SEEK TIME: 0.04 ms
FSYNCS/SECOND: 13047.64
DNS EXT: 29.53 ms
Code:
$ sudo pveperf /srv/vms/
CPU BOGOMIPS: 175688.20
REGEX/SECOND: 1046214
HD SIZE: 199.90 GB (/dev/mapper/vg01-vms)
BUFFERED READS: 445.96 MB/sec
AVERAGE SEEK TIME: 0.19 ms
FSYNCS/SECOND: 103.74
DNS EXT: 31.61 ms
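For what it's worth, the FSYNCS/SECOND on /srv/vms (~104) is two orders of magnitude below the root volume (~13047). A crude way to re-check that independently of pveperf is the sketch below (a hypothetical helper script, not a Proxmox tool; the target path argument is an assumption):

```shell
#!/bin/sh
# Crude fsync-rate probe: write 200 x 4 KiB blocks, syncing each one
# (oflag=dsync), and let dd report the resulting throughput.
# Usage: ./fsync-probe.sh /srv/vms   (defaults to /tmp)
TARGET="${1:-/tmp}"
TESTFILE="$TARGET/fsync-probe.$$"
dd if=/dev/zero of="$TESTFILE" bs=4k count=200 oflag=dsync 2>&1
rm -f "$TESTFILE"
```

Running it against both /srv/vms and the root filesystem should reproduce the gap pveperf shows.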
Code:
$ sudo qm list | grep -v stopped
VMID NAME STATUS MEM(MB) BOOTDISK(GB) PID
102 server-102 running 4096 10.00 3905
103 server-103 running 4096 10.00 4313
104 server-104 running 4096 10.00 20152
105 server-105 running 4096 10.00 4462
106 server-106 running 4096 10.00 9205
107 server-107 running 8192 10.00 4557
108 server-108 running 4096 10.00 4607
109 server-109 running 4096 10.00 4704
110 server-110 running 4096 10.00 4732
111 server-111 running 4096 10.00 53389
112 server-112 running 32768 10.00 4846
113 server-113 running 32768 10.00 4897
114 server-114 running 32768 10.00 4963
137 server-137 running 4096 10.00 5264
Code:
$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Slave Interface: eth0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: a0:d3:c1:fc:c3:50
Slave queue ID: 0
Slave Interface: eth1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: a0:d3:c1:fc:c3:51
Slave queue ID: 0
Code:
$ cat /etc/pve/qemu-server/106.conf
bootdisk: virtio0
cores: 1
ide2: none,media=cdrom
memory: 4096
name: server-106
net0: virtio=72:33:AA:5F:BB:11,bridge=vmbr0,tag=226
onboot: 1
ostype: l26
smbios1: uuid=071915f2-544c-49ba-a3c3-f3c52f5188d4
sockets: 1
virtio0: vms:106/vm-106-disk-1.qcow2,format=qcow2,size=10G
Some information about a VM just after the problem occurred:
Code:
$ uname -a
Linux ######### 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u1 x86_64 GNU/Linux
I found some kernel errors in the syslog of the VM:
Code:
Jun 11 21:51:49 ######### kernel: [1762080.352106] INFO: task kworker/0:3:6362 blocked for more than 120 seconds.
Jun 11 21:51:49 ######### kernel: [1762080.352807] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 11 21:51:49 ######### kernel: [1762080.353202] kworker/0:3 D ffff88013fc13780 0 6362 2 0x00000000
Jun 11 21:51:49 ######### kernel: [1762080.353205] ffff880137a147c0 0000000000000046 0000000100000008 ffff880136fd80c0
Jun 11 21:51:49 ######### kernel: [1762080.353208] 0000000000013780 ffff880120f31fd8 ffff880120f31fd8 ffff880137a147c0
Jun 11 21:51:49 ######### kernel: [1762080.353210] ffff88013a340200 ffffffff8107116d 0000000000000202 ffff880139e0edc0
Jun 11 21:51:49 ######### kernel: [1762080.353213] Call Trace:
Jun 11 21:51:49 ######### kernel: [1762080.353219] [<ffffffff8107116d>] ? arch_local_irq_save+0x11/0x17
Jun 11 21:51:49 ######### kernel: [1762080.353255] [<ffffffffa0150a8f>] ? xlog_wait+0x51/0x67 [xfs]
Jun 11 21:51:49 ######### kernel: [1762080.353258] [<ffffffff8103f6e2>] ? try_to_wake_up+0x197/0x197
Jun 11 21:51:49 ######### kernel: [1762080.353268] [<ffffffffa015322c>] ? _xfs_log_force_lsn+0x1cd/0x205 [xfs]
Jun 11 21:51:49 ######### kernel: [1762080.353277] [<ffffffffa0150502>] ? xfs_trans_commit+0x10a/0x205 [xfs]
Jun 11 21:51:49 ######### kernel: [1762080.353285] [<ffffffffa011d7d4>] ? xfs_sync_worker+0x3a/0x6a [xfs]
Jun 11 21:51:49 ######### kernel: [1762080.353288] [<ffffffff8105b5f7>] ? process_one_work+0x161/0x269
Jun 11 21:51:49 ######### kernel: [1762080.353290] [<ffffffff8105aba3>] ? cwq_activate_delayed_work+0x3c/0x48
Jun 11 21:51:49 ######### kernel: [1762080.353292] [<ffffffff8105c5c0>] ? worker_thread+0xc2/0x145
Jun 11 21:51:49 ######### kernel: [1762080.353294] [<ffffffff8105c4fe>] ? manage_workers.isra.25+0x15b/0x15b
Jun 11 21:51:49 ######### kernel: [1762080.353296] [<ffffffff8105f701>] ? kthread+0x76/0x7e
Jun 11 21:51:49 ######### kernel: [1762080.353299] [<ffffffff813575b4>] ? kernel_thread_helper+0x4/0x10
Jun 11 21:51:49 ######### kernel: [1762080.353302] [<ffffffff8105f68b>] ? kthread_worker_fn+0x139/0x139
Jun 11 21:51:49 ######### kernel: [1762080.353304] [<ffffffff813575b0>] ? gs_change+0x13/0x13
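In case it's useful to anyone reproducing this, the stacks of all blocked tasks can also be requested on demand via the standard Linux SysRq facility, rather than waiting for the hung_task watchdog to fire (sketch; needs root on the VM):

```shell
# Ask the kernel to log the stacks of all blocked (D-state) tasks; this is
# the same information the hung_task watchdog prints above, but on demand.
if echo w > /proc/sysrq-trigger 2>/dev/null; then
    dmesg | tail -n 40
else
    echo "need root (and kernel.sysrq enabled) to trigger SysRq-w"
fi
```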
Some processes are blocked:
Code:
ps auxf | grep -E ' [DR]'
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 2787 0.0 0.0 0 0 ? D May22 0:07 \_ [flush-253:4]
root 6362 0.0 0.0 0 0 ? D Jun11 0:00 \_ [kworker/0:3]
root 7257 0.0 0.0 0 0 ? D Jun11 0:00 \_ [kworker/0:2]
root 8036 0.0 0.0 0 0 ? D Jun11 0:00 \_ [kworker/0:1]
root 8262 0.0 0.0 0 0 ? D Jun11 0:00 \_ [kworker/0:4]
root 8263 0.0 0.0 0 0 ? D Jun11 0:00 \_ [kworker/0:5]
root 32191 0.0 0.0 0 0 ? D Jun12 0:12 \_ [kworker/0:8]
root 26983 0.0 0.0 0 0 ? D Jun12 1:24 \_ [kworker/0:9]
(… and some other processes …)
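To see where those D-state tasks are actually stuck, a rough approach is to dump each one's kernel stack from /proc (sketch; run as root on the affected VM while it is hung, and note /proc/<pid>/stack is only present on kernels built with stack tracing):

```shell
# List uninterruptible (D-state) tasks and dump their kernel stacks.
dpids=$(ps -eo pid,stat --no-headers | awk '$2 ~ /^D/ {print $1}')
echo "blocked tasks: $(echo $dpids | wc -w)"
for pid in $dpids; do
    echo "== PID $pid ($(cat /proc/$pid/comm 2>/dev/null))"
    cat "/proc/$pid/stack" 2>/dev/null || echo "   (stack unavailable; need root)"
done
```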
There is no I/O wait:
Code:
vmstat 1 10
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 3078348 2708 462720 0 0 1 89 15 18 0 0 99 0
0 0 0 3078340 2708 462720 0 0 0 0 289 718 0 0 100 0
0 0 0 3078340 2708 462720 0 0 0 0 289 722 0 0 100 0
0 0 0 3078340 2708 462720 0 0 0 0 291 730 0 0 100 0
Thanks for your advice.