Hi,
I have a server with 4.4-22 installed (I updated since I read somewhere that it should help, but it didn't) and sometimes get extremely slow writes in the guests. Sometimes everything works, but most of the time when a larger file (50 MB+) is written, the guest goes into a soft lockup.
When I run the following in the guest:
[root@ada ~]# dd if=/dev/zero of=/root/testfile bs=500M count=1 oflag=direct
1+0 records in
1+0 records out
524288000 bytes (524 MB) copied, 1.60407 s, 327 MB/s
[root@ada ~]# rm /root/testfile
rm: remove regular file ‘/root/testfile’? y
[root@ada ~]# dd if=/dev/zero of=/root/testfile bs=500M count=1 oflag=direct
Then the SSH connection drops, and in the guest's log I see:
Feb 7 14:28:41 ada kernel: CPU: 11 PID: 0 Comm: swapper/11 ve: 0 Tainted: G OEL ------------ 3.10.0-427.18.2.lve1.4.27.el7.x86_64 #1 15.2
Feb 7 14:28:41 ada kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
Feb 7 14:28:41 ada kernel: task: ffff88042924b2c0 ti: ffff880429260000 task.ti: ffff880429260000
Feb 7 14:28:41 ada kernel: RIP: 0010:[<ffffffff81058ea6>] [<ffffffff81058ea6>] native_safe_halt+0x6/0x10
Feb 7 14:28:41 ada kernel: RSP: 0018:ffff880429263e98 EFLAGS: 00000286
Feb 7 14:28:41 ada kernel: RAX: 00000000ffffffed RBX: ffff880429260000 RCX: 0100000000000000
Feb 7 14:28:41 ada kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000046
Feb 7 14:28:41 ada kernel: RBP: ffff880429263e98 R08: 0000000000000000 R09: 0000000000000000
Feb 7 14:28:41 ada kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000000b
Feb 7 14:28:41 ada kernel: R13: ffff880429260000 R14: ffff880429260000 R15: 0000000000000000
Feb 7 14:28:41 ada kernel: FS: 0000000000000000(0000) GS:ffff88042f4c0000(0000) knlGS:0000000000000000
Feb 7 14:28:41 ada kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb 7 14:28:41 ada kernel: CR2: 00007fe33f2ba000 CR3: 00000004275ed000 CR4: 00000000000006e0
Feb 7 14:28:41 ada kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 7 14:28:41 ada kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Feb 7 14:28:41 ada kernel: Stack:
Feb 7 14:28:41 ada kernel: ffff880429263eb8 ffffffff8101e12f ffff880429260000 ffffffff81a6a600
...
The underlying storage is ZFS, with partitions of one SSD used as log and cache devices:
zpool status
  pool: zfs-pool
 state: ONLINE
  scan: scrub in progress since Wed Feb  7 10:22:17 2018
        197G scanned out of 896G at 13.3M/s, 14h53m to go
        0 repaired, 22.04% done
config:

        NAME        STATE     READ WRITE CKSUM
        zfs-pool    ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdh     ONLINE       0     0     0
        logs
          sda4      ONLINE       0     0     0
        cache
          sda5      ONLINE       0     0     0

errors: No known data errors
I have tried a scrub, but so far it hasn't found any issues with the physical hard drives. I have also checked them with smartctl and nothing unusual is reported.
Sometimes there is nothing in the host's kern.log when this happens in the guest, but in some cases I have found entries in the host's log files at the very same time the guest is stuck:
Feb 6 20:10:07 prog7 kernel: [17761.712491] INFO: task kvm:3374 blocked for more than 120 seconds.
Feb 6 20:10:07 prog7 kernel: [17761.714306] Tainted: P IO 4.4.98-5-pve #1
Feb 6 20:10:07 prog7 kernel: [17761.716228] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 6 20:10:07 prog7 kernel: [17761.718219] kvm D ffff8813afa1f680 0 3374 1 0x00000000
Feb 6 20:10:07 prog7 kernel: [17761.718223] ffff8813afa1f680 0000000000000000 ffff8813f16f6600 ffff8813ecad6600
Feb 6 20:10:07 prog7 kernel: [17761.718226] ffff8813afa20000 ffff8813f1326a10 ffff8813f1326a38 ffff8813f1326b58
Feb 6 20:10:07 prog7 kernel: [17761.718229] 0000000000000000 ffff8813afa1f698 ffffffff818651b5 ffff8813f1326b50
Feb 6 20:10:07 prog7 kernel: [17761.718232] Call Trace:
Feb 6 20:10:07 prog7 kernel: [17761.718236] [<ffffffff818651b5>] schedule+0x35/0x80
Feb 6 20:10:07 prog7 kernel: [17761.718245] [<ffffffffc0414cf4>] cv_wait_common+0xf4/0x130 [spl]
Feb 6 20:10:07 prog7 kernel: [17761.718248] [<ffffffff810c67b0>] ? wait_woken+0x90/0x90
Feb 6 20:10:07 prog7 kernel: [17761.718257] [<ffffffffc0414d45>] __cv_wait+0x15/0x20 [spl]
Feb 6 20:10:07 prog7 kernel: [17761.718303] [<ffffffffc0acadfa>] txg_wait_open+0xaa/0xf0 [zfs]
Feb 6 20:10:07 prog7 kernel: [17761.718337] [<ffffffffc0a88990>] dmu_tx_wait+0x330/0x340 [zfs]
Feb 6 20:10:07 prog7 kernel: [17761.718376] [<ffffffffc0a88a27>] dmu_tx_assign+0x87/0x490 [zfs]
Feb 6 20:10:07 prog7 kernel: [17761.718424] [<ffffffffc0b1eac6>] zvol_request+0x2e6/0x670 [zfs]
Feb 6 20:10:07 prog7 kernel: [17761.718428] [<ffffffff813cd913>] ? generic_make_request_checks+0x243/0x4f0
I have tried setting the following on the host:
vm.min_free_kbytes = 524288
vm.dirty_background_ratio = 5
vm.swappiness = 2
but the behaviour is still the same.
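For reference, the equivalent persistent configuration would be something like this (assuming a standard sysctl.d layout; the filename is only an example):

```
# /etc/sysctl.d/90-zfs-kvm-tuning.conf  -- example filename, any name ending in .conf works
vm.min_free_kbytes = 524288
vm.dirty_background_ratio = 5
vm.swappiness = 2
```

reloaded with `sysctl --system`.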
I have also increased the minimum ARC size to 1 GB (it was set to 32 MB before) with:
echo 1073741824 >> /sys/module/zfs/parameters/zfs_arc_min
but the situation is unchanged.
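For what it's worth, to keep that ARC minimum across reboots the usual approach with ZFS on Linux is a module-option entry in modprobe.d, roughly like this (the filename is conventional, not mandatory):

```
# /etc/modprobe.d/zfs.conf -- option applied when the zfs module is loaded
options zfs zfs_arc_min=1073741824
```

On systems that load ZFS from the initramfs (e.g. ZFS root), `update-initramfs -u` may be needed afterwards so the option takes effect at boot.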
I have been pulling my hair out for a few days now, and any help would be greatly appreciated.
Thanks, Matej