Web interface not responding, kernel errors during backup to NFS?

JustaGuy

I see this in syslog just before the web interface froze:
Code:
Aug 19 09:23:00 bascule kernel: INFO: task kswapd0:84 blocked for more than 120 seconds.
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: kswapd0       D 0000000000000000     0    84      2 0x00000000
kernel: ffff88085b0a1750 0000000000000046 0000000000000000 0000000000000000
kernel: 0000000000000004 ffff8808596be400 ffff880859124740 0000000000000010
kernel: ffff88085b0a16e0 000000000000fb08 ffff88085b0a1fd8 ffff88085b0aade0
kernel: Call Trace:
kernel: [<ffffffff810c9d2d>] ? cpu_quiet_msk+0x7d/0x130
kernel: [<ffffffff8156de02>] io_schedule+0x52/0x70
kernel: [<ffffffffa02d349e>] nfs_wait_bit_uninterruptible+0xe/0x20 [nfs]
kernel: [<ffffffff8156e662>] __wait_on_bit+0x62/0x90
kernel: [<ffffffffa02d3490>] ? nfs_wait_bit_uninterruptible+0x0/0x20 [nfs]
kernel: [<ffffffffa02d3490>] ? nfs_wait_bit_uninterruptible+0x0/0x20 [nfs]
kernel: [<ffffffff8156e709>] out_of_line_wait_on_bit+0x79/0x90
kernel: [<ffffffff81085be0>] ? wake_bit_function+0x0/0x50
kernel: [<ffffffffa02d347f>] nfs_wait_on_request+0x2f/0x40 [nfs]
kernel: [<ffffffffa02d8a93>] nfs_sync_mapping_wait+0x113/0x260 [nfs]
kernel: [<ffffffffa02d8c6b>] nfs_wb_page+0x8b/0xf0 [nfs]
kernel: [<ffffffffa02c7b00>] nfs_release_page+0x60/0x80 [nfs]
kernel: [<ffffffff810f3592>] try_to_release_page+0x32/0x60
kernel: [<ffffffff81101b4d>] shrink_page_list+0x57d/0x840
kernel: [<ffffffff8113d2d3>] ? mem_cgroup_del_lru_list+0x23/0xb0
kernel: [<ffffffff8113d3d9>] ? mem_cgroup_del_lru+0x39/0x40
kernel: [<ffffffff811010a8>] ? isolate_pages_global+0x198/0x290
kernel: [<ffffffff8110247b>] shrink_list+0x2fb/0x8d0
kernel: [<ffffffff810fd087>] ? get_dirty_limits+0x27/0x2d0
kernel: [<ffffffff81102dfa>] shrink_zone+0x3aa/0x550
kernel: [<ffffffff81103cbd>] kswapd+0x70d/0x800
kernel: [<ffffffff81100f10>] ? isolate_pages_global+0x0/0x290
kernel: [<ffffffff81085ba0>] ? autoremove_wake_function+0x0/0x40
kernel: [<ffffffff811035b0>] ? kswapd+0x0/0x800
kernel: [<ffffffff811035b0>] ? kswapd+0x0/0x800
kernel: [<ffffffff810857f6>] kthread+0x96/0xb0
kernel: [<ffffffff8101422a>] child_rip+0xa/0x20
kernel: [<ffffffff81085760>] ? kthread+0x0/0xb0
Aug 19 09:23:00 bascule kernel: [<ffffffff81014220>] ? child_rip+0x0/0x20
Aug 19 09:23:10 bascule proxwww[21639]: Starting new child 21639
Aug 19 09:23:34 bascule proxwww[21682]: Starting new child 21682
Aug 19 09:23:37 bascule proxwww[21686]: Starting new child 21686
Aug 19 09:23:44 bascule proxwww[21698]: Starting new child 21698
Aug 19 09:25:00 bascule kernel: INFO: task kswapd0:84 blocked for more than 120 seconds.
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: kswapd0       D 0000000000000000     0    84      2 0x00000000
kernel: ffff88085b0a1750 0000000000000046 0000000000000000 0000000000000000
kernel: 0000000000000004 ffff8808596be400 ffff880859124740 0000000000000010
kernel: ffff88085b0a16e0 000000000000fb08 ffff88085b0a1fd8 ffff88085b0aade0
kernel: Call Trace:
kernel: [<ffffffff810c9d2d>] ? cpu_quiet_msk+0x7d/0x130
kernel: [<ffffffff8156de02>] io_schedule+0x52/0x70
kernel: [<ffffffffa02d349e>] nfs_wait_bit_uninterruptible+0xe/0x20 [nfs]
kernel: [<ffffffff8156e662>] __wait_on_bit+0x62/0x90
kernel: [<ffffffffa02d3490>] ? nfs_wait_bit_uninterruptible+0x0/0x20 [nfs]
kernel: [<ffffffffa02d3490>] ? nfs_wait_bit_uninterruptible+0x0/0x20 [nfs]
kernel: [<ffffffff8156e709>] out_of_line_wait_on_bit+0x79/0x90
kernel: [<ffffffff81085be0>] ? wake_bit_function+0x0/0x50
kernel: [<ffffffffa02d347f>] nfs_wait_on_request+0x2f/0x40 [nfs]
kernel: [<ffffffffa02d8a93>] nfs_sync_mapping_wait+0x113/0x260 [nfs]
kernel: [<ffffffffa02d8c6b>] nfs_wb_page+0x8b/0xf0 [nfs]
kernel: [<ffffffffa02c7b00>] nfs_release_page+0x60/0x80 [nfs]
kernel: [<ffffffff810f3592>] try_to_release_page+0x32/0x60
kernel: [<ffffffff81101b4d>] shrink_page_list+0x57d/0x840
kernel: [<ffffffff8113d2d3>] ? mem_cgroup_del_lru_list+0x23/0xb0
kernel: [<ffffffff8113d3d9>] ? mem_cgroup_del_lru+0x39/0x40
kernel: [<ffffffff811010a8>] ? isolate_pages_global+0x198/0x290
kernel: [<ffffffff8110247b>] shrink_list+0x2fb/0x8d0
kernel: [<ffffffff810fd087>] ? get_dirty_limits+0x27/0x2d0
kernel: [<ffffffff81102dfa>] shrink_zone+0x3aa/0x550
kernel: [<ffffffff81103cbd>] kswapd+0x70d/0x800
kernel: [<ffffffff81100f10>] ? isolate_pages_global+0x0/0x290
kernel: [<ffffffff81085ba0>] ? autoremove_wake_function+0x0/0x40
kernel: [<ffffffff811035b0>] ? kswapd+0x0/0x800
kernel: [<ffffffff811035b0>] ? kswapd+0x0/0x800
kernel: [<ffffffff810857f6>] kthread+0x96/0xb0
kernel: [<ffffffff8101422a>] child_rip+0xa/0x20
kernel: [<ffffffff81085760>] ? kthread+0x0/0xb0
kernel: [<ffffffff81014220>] ? child_rip+0x0/0x20
kernel: INFO: task cstream:20710 blocked for more than 120 seconds.
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: cstream       D ffff88002830fc60     0 20710  20708 0x00000000
kernel: ffff880027837148 0000000000000082 ffff880027837118 ffff880027837114
kernel: 0000000000000000 0000000000000000 ffff8800278370e8 0000000000000097
kernel: 0000000000000000 000000000000fb08 ffff880027837fd8 ffff880858405bc0
kernel: Call Trace:
kernel: [<ffffffff8156de02>] io_schedule+0x52/0x70
kernel: [<ffffffffa02d349e>] nfs_wait_bit_uninterruptible+0xe/0x20 [nfs]
kernel: [<ffffffff8156e662>] __wait_on_bit+0x62/0x90
kernel: [<ffffffffa02d3490>] ? nfs_wait_bit_uninterruptible+0x0/0x20 [nfs]
kernel: [<ffffffffa02d3490>] ? nfs_wait_bit_uninterruptible+0x0/0x20 [nfs]
kernel: [<ffffffff8156e709>] out_of_line_wait_on_bit+0x79/0x90
kernel: [<ffffffff81085be0>] ? wake_bit_function+0x0/0x50
kernel: [<ffffffffa02d347f>] nfs_wait_on_request+0x2f/0x40 [nfs]
kernel: [<ffffffffa02d8a93>] nfs_sync_mapping_wait+0x113/0x260 [nfs]
kernel: [<ffffffffa02d8c6b>] nfs_wb_page+0x8b/0xf0 [nfs]
kernel: [<ffffffffa02c7b00>] nfs_release_page+0x60/0x80 [nfs]
kernel: [<ffffffff810f3592>] try_to_release_page+0x32/0x60
kernel: [<ffffffff81101b4d>] shrink_page_list+0x57d/0x840
kernel: [<ffffffff8113d2d3>] ? mem_cgroup_del_lru_list+0x23/0xb0
kernel: [<ffffffff8113d3d9>] ? mem_cgroup_del_lru+0x39/0x40
kernel: [<ffffffff811010a8>] ? isolate_pages_global+0x198/0x290
kernel: [<ffffffff8110247b>] shrink_list+0x2fb/0x8d0
kernel: [<ffffffff8113d3d9>] ? mem_cgroup_del_lru+0x39/0x40
kernel: [<ffffffff81101001>] ? isolate_pages_global+0xf1/0x290
kernel: [<ffffffff81102dfa>] shrink_zone+0x3aa/0x550
kernel: [<ffffffff810f99e7>] ? get_page_from_freelist+0x157/0x850
kernel: [<ffffffff81104222>] do_try_to_free_pages+0xc2/0x3c0
kernel: [<ffffffff81104636>] try_to_free_pages+0x76/0x80
kernel: [<ffffffff81100f10>] ? isolate_pages_global+0x0/0x290
kernel: [<ffffffff810fa601>] __alloc_pages_nodemask+0x3f1/0x700
kernel: [<ffffffff8112b8fc>] alloc_pages_current+0x8c/0xe0
kernel: [<ffffffff811336f7>] new_slab+0x247/0x300
kernel: [<ffffffff81135d37>] __slab_alloc+0x137/0x480
kernel: [<ffffffff812b96eb>] ? radix_tree_preload+0x3b/0xb0
kernel: [<ffffffff812b96eb>] ? radix_tree_preload+0x3b/0xb0
kernel: [<ffffffff811362fa>] kmem_cache_alloc+0x12a/0x140
kernel: [<ffffffff812b96eb>] radix_tree_preload+0x3b/0xb0
kernel: [<ffffffff810f46fa>] add_to_page_cache_locked+0x7a/0x160
kernel: [<ffffffff810f480e>] add_to_page_cache_lru+0x2e/0x90
kernel: [<ffffffff810f5b89>] grab_cache_page_write_begin+0x99/0xc0
kernel: [<ffffffffa02d9158>] ? nfs_updatepage+0x1f8/0x570 [nfs]
kernel: [<ffffffffa02c7bfc>] nfs_write_begin+0x7c/0x1f0 [nfs]
kernel: [<ffffffff810f4cd6>] generic_file_buffered_write+0x116/0x290
kernel: [<ffffffff810f6299>] __generic_file_aio_write+0x259/0x470
kernel: [<ffffffff810f6512>] generic_file_aio_write+0x62/0xd0
kernel: [<ffffffffa02c89d6>] nfs_file_write+0x136/0x210 [nfs]
kernel: [<ffffffff811443c9>] do_sync_write+0xf9/0x140
kernel: [<ffffffff81085ba0>] ? autoremove_wake_function+0x0/0x40
kernel: [<ffffffff8156d728>] ? thread_return+0x51/0x6d9
kernel: [<ffffffff81253526>] ? security_file_permission+0x16/0x20
kernel: [<ffffffff81144a3b>] vfs_write+0xcb/0x1a0
kernel: [<ffffffff81144c05>] sys_write+0x55/0x90
Aug 19 09:25:00 bascule kernel: [<ffffffff810131f2>] system_call_fastpath+0x16/0x1b
Note: The above output is from syslog; I removed the repeating timestamps to stay within the 10k character forum post limit.
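For anyone wading through a longer syslog, here is a minimal Python sketch that pulls out just the hung-task report lines and counts them per task. The function names (`hung_tasks`, `summarize`) are made up for illustration, and the pattern is only based on the message format visible in the log above; other kernels may word it differently.

```python
import re
from collections import Counter

# Matches the hung-task report line as it appears in the log above, e.g.
# "kernel: INFO: task kswapd0:84 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)

def hung_tasks(lines):
    """Yield (task_name, pid, seconds) for each hung-task report found."""
    for line in lines:
        m = HUNG_TASK.search(line)
        if m:
            yield m.group("name"), int(m.group("pid")), int(m.group("secs"))

def summarize(lines):
    """Count how often each (task_name, pid) was reported as blocked."""
    return Counter((name, pid) for name, pid, _ in hung_tasks(lines))

# The three report lines from the log above, used as sample input.
sample = [
    "Aug 19 09:23:00 bascule kernel: INFO: task kswapd0:84 blocked for more than 120 seconds.",
    "Aug 19 09:25:00 bascule kernel: INFO: task kswapd0:84 blocked for more than 120 seconds.",
    "kernel: INFO: task cstream:20710 blocked for more than 120 seconds.",
]

for (name, pid), count in summarize(sample).most_common():
    print(f"{name} (pid {pid}): reported blocked {count} time(s)")
```

To run it against a real log, pass an open file object instead of `sample`, for example `summarize(open("/var/log/syslog", errors="replace"))`; the path is an assumption and depends on the distribution.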

Both htop and iotop respond normally, and the system load on PVE is only 4.05.
There's no iotop on the NFS box, but its htop also responds and shows a load of 10.30.
I restarted pvedaemon & it didn't help.

I've been having I/O errors
(see this thread: http://forum.proxmox.com/threads/4385-ext3-I-O-error)
preventing backups from completing, so I'm doing them manually now in preparation for a PVE re-install.
Would someone kindly explain what this is & if it's related?
 
This is happening repeatedly. I can't even shut down from the physical terminal afterward; I have to hold the power button down and reboot.

This is with the 2.6.32 kernel variant.
When I tried 2.6.18, my fileserver VM wouldn't start and I couldn't work on backing up.

What is this?!
 
I had a similar experience and, whilst I didn't spend too much time on it, I could reproduce the behaviour by flooding the NICs. In my case my storage is on a SAN, and if I set off 10 'dd if=/dev/zero' jobs on the host I could kiss goodbye to the server.

Might be a red herring, but are you maxing out your NICs?
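One rough way to check is to sample the interface byte counters in /proc/net/dev twice and work out the rate. The sketch below is only an illustration: the function names are hypothetical, and it assumes the usual Linux layout of that file (two header lines, then one line per interface with received bytes as the first counter and transmitted bytes as the ninth).

```python
import time

def parse_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters

def rates_mbit(before, after, seconds):
    """Per-interface (rx, tx) throughput in Mbit/s between two samples."""
    rates = {}
    for name, (rx1, tx1) in after.items():
        if name in before:
            rx0, tx0 = before[name]
            rates[name] = (
                (rx1 - rx0) * 8 / seconds / 1e6,
                (tx1 - tx0) * 8 / seconds / 1e6,
            )
    return rates

def measure(interval=5.0, path="/proc/net/dev"):
    """Read the counters twice, `interval` seconds apart, and return rates."""
    with open(path) as fh:
        before = parse_net_dev(fh.read())
    time.sleep(interval)
    with open(path) as fh:
        after = parse_net_dev(fh.read())
    return rates_mbit(before, after, interval)
```

Calling `measure()` while the backup runs and comparing the transmit figure for the bridge or NIC with the link speed should show whether the link is saturated. It says nothing about why the NFS writes stall, only whether the wire is full.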
 
Maybe, but I didn't think I could max a NIC so easily.

Summary of activity:

1. Both times this happened today I had a backup job going from local to NFS, with a bandwidth limit of 200000.
2. The first freeze happened sooner in the process, and there was the overhead of a dozen or so running VMs with drives mapped to one, sometimes two, VM fileservers, plus half a dozen mapped to an SMB share on a desktop, all using vmbr1.
3. There was a copy operation moving 'old VM fileserver' content to an SMB share on a FireWire-connected external drive.
4. There was also another copy operation from 'new VM fileserver' to an SMB share on an internal SATA III drive.

The second time I was still backing up local to NFS, but all VMs were off, vzdump was the only thing running, and the bwlimit was still at 200000. The freeze seemed to happen later than on the first try.

The third try was set up the same as the second and didn't freeze. It would have completed if it weren't for a disk space error.
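On the question of whether the NIC can be maxed: a quick back-of-the-envelope check of what a bwlimit of 200000 allows, sketched in Python. Two assumptions here are not from the thread: that vzdump's bwlimit is given in KB per second (as I understand its documentation), and that the link is 1 Gbit/s, since the NIC speed was never stated.

```python
def bwlimit_to_mbit(bwlimit_kb_per_s, bytes_per_kb=1024):
    """Convert a limit in KB/s to Mbit/s (bytes_per_kb is an assumption)."""
    return bwlimit_kb_per_s * bytes_per_kb * 8 / 1e6

ASSUMED_LINK_MBIT = 1000  # assumed gigabit link; not stated in the thread

limit = bwlimit_to_mbit(200000)
print(f"bwlimit 200000 allows up to about {limit:.0f} Mbit/s")
if limit > ASSUMED_LINK_MBIT:
    print("The limit is above the assumed link speed, so it would not throttle the network.")
else:
    print("The limit is below the assumed link speed.")
```

If both assumptions hold, the limit works out to roughly 1.6 Gbit/s, which is above what a gigabit link can carry, so the backup could fill the link on its own without the bwlimit ever kicking in.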
 
