Issue when backing up SBS 2011 Standard KVM using snapshot mode

Thread starter: jesterfett (Guest)
Physical Environment:
  • Motherboard -- Gigabyte 880GA-UD3H
  • Processor -- Phenom II X6 1100T
  • RAM -- 16GB (4x4GB) Corsair XMS3 DDR3 1333 (Model# CMX8GX3M2A1333C9)
  • Proxmox v1.8-6070-5
Virtual Environment:
  • 8GB disk / 512MB RAM OpenVZ container running Fedora 14
  • 400GB disk / 8GB RAM KVM running SBS 2011 Standard
  • SBS 2011 Standard seems to run pretty well as a KVM; I only have 1 core assigned to this VM.
Issue:
  • When running an automated backup of both the Linux OpenVZ container and the SBS 2011 KVM via the Proxmox scheduled backup utility in snapshot mode (roughly the vzdump job sketched below), I receive the following messages in the syslog. I believe the first log excerpt relates to the Linux OpenVZ container and the second to the SBS 2011 KVM.
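For context, this is roughly the kind of snapshot-mode vzdump job the scheduler kicks off on PVE 1.x; the VMIDs, dump directory and mail address below are placeholders, not values taken from this post.
Code:
# Sketch only -- VMIDs, dump directory and mail address are hypothetical.
# vzdump 1.x uses the --snapshot flag for LVM snapshot mode backups.
vzdump --snapshot --compress --dumpdir /backup --mailto admin@example.com 101 102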
OpenVZ:
Code:
Aug 30 21:35:23 server101 kernel: INFO: task umount:3851 blocked for more than 120 seconds.
Aug 30 21:35:23 server101 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 30 21:35:23 server101 kernel: umount        D ffff88040a49b000     0  3851   3787 0x00000000
Aug 30 21:35:23 server101 kernel: ffff88040d0f0000 0000000000000086 0000000000000000 0000000000000000
Aug 30 21:35:23 server101 kernel: ffff880013416940 0000000000000086 000000000000fa40 ffff8801efa6dfd8
Aug 30 21:35:23 server101 kernel: 0000000000016940 0000000000016940 ffff88040a49b000 ffff88040a49b2f8
Aug 30 21:35:23 server101 kernel: Call Trace:
Aug 30 21:35:23 server101 kernel: [<ffffffff810b6bca>] ? find_get_pages_tag+0x46/0xdd
Aug 30 21:35:23 server101 kernel: [<ffffffff8110ba07>] ? bdi_sched_wait+0x0/0xe
Aug 30 21:35:23 server101 kernel: [<ffffffff8110ba10>] ? bdi_sched_wait+0x9/0xe
Aug 30 21:35:23 server101 kernel: [<ffffffff81314cb7>] ? __wait_on_bit+0x41/0x70
Aug 30 21:35:23 server101 kernel: [<ffffffff8110ba07>] ? bdi_sched_wait+0x0/0xe
Aug 30 21:35:23 server101 kernel: [<ffffffff81314d51>] ? out_of_line_wait_on_bit+0x6b/0x77
Aug 30 21:35:23 server101 kernel: [<ffffffff81066a44>] ? wake_bit_function+0x0/0x23
Aug 30 21:35:23 server101 kernel: [<ffffffff8110ba88>] ? sync_inodes_sb+0x73/0x12a
Aug 30 21:35:23 server101 kernel: [<ffffffff8110f718>] ? __sync_filesystem+0x4c/0x72
Aug 30 21:35:23 server101 kernel: [<ffffffff810f39a5>] ? generic_shutdown_super+0x25/0x11f
Aug 30 21:35:23 server101 kernel: [<ffffffff810f3ac1>] ? kill_block_super+0x22/0x3a
Aug 30 21:35:23 server101 kernel: [<ffffffff810f3f78>] ? deactivate_super+0x60/0x78
Aug 30 21:35:23 server101 kernel: [<ffffffff8110744c>] ? sys_umount+0x2bb/0x2e6
Aug 30 21:35:23 server101 kernel: [<ffffffff81010c12>] ? system_call_fastpath+0x16/0x1b
Aug 30 21:35:29 server101 proxwww[3860]: Starting new child 3860
Aug 30 21:36:06 server101 proxwww[3861]: Starting new child 3861
Aug 30 21:36:44 server101 proxwww[3865]: Starting new child 3865
Aug 30 21:37:22 server101 proxwww[3866]: Starting new child 3866
Aug 30 21:37:23 server101 kernel: INFO: task umount:3851 blocked for more than 120 seconds.
Aug 30 21:37:23 server101 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 30 21:37:23 server101 kernel: umount        D ffff88040a49b000     0  3851   3787 0x00000000
Aug 30 21:37:23 server101 kernel: ffff88040d0f0000 0000000000000086 0000000000000000 0000000000000000
Aug 30 21:37:23 server101 kernel: ffff880013416940 0000000000000086 000000000000fa40 ffff8801efa6dfd8
Aug 30 21:37:23 server101 kernel: 0000000000016940 0000000000016940 ffff88040a49b000 ffff88040a49b2f8
Aug 30 21:37:23 server101 kernel: Call Trace:
Aug 30 21:37:23 server101 kernel: [<ffffffff810b6bca>] ? find_get_pages_tag+0x46/0xdd
Aug 30 21:37:23 server101 kernel: [<ffffffff8110ba07>] ? bdi_sched_wait+0x0/0xe
Aug 30 21:37:23 server101 kernel: [<ffffffff8110ba10>] ? bdi_sched_wait+0x9/0xe
Aug 30 21:37:23 server101 kernel: [<ffffffff81314cb7>] ? __wait_on_bit+0x41/0x70
Aug 30 21:37:23 server101 kernel: [<ffffffff8110ba07>] ? bdi_sched_wait+0x0/0xe
Aug 30 21:37:23 server101 kernel: [<ffffffff81314d51>] ? out_of_line_wait_on_bit+0x6b/0x77
Aug 30 21:37:23 server101 kernel: [<ffffffff81066a44>] ? wake_bit_function+0x0/0x23
Aug 30 21:37:23 server101 kernel: [<ffffffff8110ba88>] ? sync_inodes_sb+0x73/0x12a
Aug 30 21:37:23 server101 kernel: [<ffffffff8110f718>] ? __sync_filesystem+0x4c/0x72
Aug 30 21:37:23 server101 kernel: [<ffffffff810f39a5>] ? generic_shutdown_super+0x25/0x11f
Aug 30 21:37:23 server101 kernel: [<ffffffff810f3ac1>] ? kill_block_super+0x22/0x3a
Aug 30 21:37:23 server101 kernel: [<ffffffff810f3f78>] ? deactivate_super+0x60/0x78
Aug 30 21:37:23 server101 kernel: [<ffffffff8110744c>] ? sys_umount+0x2bb/0x2e6
Aug 30 21:37:23 server101 kernel: [<ffffffff81010c12>] ? system_call_fastpath+0x16/0x1b
Aug 30 21:38:00 server101 proxwww[3867]: Starting new child 3867



SBS2011:
Code:
Aug 30 22:10:00 server101 proxwww[4096]: Starting new child 4096
Aug 30 22:10:01 server101 /USR/SBIN/CRON[4098]: (root) CMD (test -x /usr/lib/atsar/atsa1 && /usr/lib/atsar/atsa1)
Aug 30 22:10:23 server101 kernel: BUG: Bad page state in process vmtar  pfn:9c53f
Aug 30 22:10:23 server101 kernel: page:ffffea0002bf79b8 flags:0100000000000000 count:0 mapcount:-134217728 mapping:(null) index:5890741
Aug 30 22:10:23 server101 kernel: Pid: 3954, comm: vmtar Not tainted 2.6.32-4-pve #1
Aug 30 22:10:23 server101 kernel: Call Trace:
Aug 30 22:10:23 server101 kernel: [<ffffffff810ba9b4>] ? bad_page+0x116/0x129
Aug 30 22:10:23 server101 kernel: [<ffffffff810bca67>] ? get_page_from_freelist+0x481/0x68b
Aug 30 22:10:23 server101 kernel: [<ffffffff81118625>] ? do_mpage_readpage+0x410/0x421
Aug 30 22:10:23 server101 kernel: [<ffffffff8103a31c>] ? enqueue_task+0x5f/0x68
Aug 30 22:10:23 server101 kernel: [<ffffffff810bcfe4>] ? __alloc_pages_nodemask+0x128/0x6a8
Aug 30 22:10:23 server101 kernel: [<ffffffffa012bdf9>] ? raid1_congested+0x1b/0x8e [raid1]
Aug 30 22:10:23 server101 kernel: [<ffffffff810be9fd>] ? __do_page_cache_readahead+0x9b/0x1b4
Aug 30 22:10:23 server101 kernel: [<ffffffff810beb32>] ? ra_submit+0x1c/0x20
Aug 30 22:10:23 server101 kernel: [<ffffffff810bee21>] ? page_cache_async_readahead+0x75/0xad
Aug 30 22:10:23 server101 kernel: [<ffffffff810b83e4>] ? generic_file_aio_read+0x23a/0x538
Aug 30 22:10:23 server101 kernel: [<ffffffff810b81b8>] ? generic_file_aio_read+0xe/0x538
Aug 30 22:10:23 server101 kernel: [<ffffffff810f15fd>] ? do_sync_read+0xce/0x113
Aug 30 22:10:23 server101 kernel: [<ffffffff81078071>] ? set_held_pages+0x11/0x1a
Aug 30 22:10:23 server101 kernel: [<ffffffff81066a16>] ? autoremove_wake_function+0x0/0x2e
Aug 30 22:10:23 server101 kernel: [<ffffffff8101172e>] ? apic_timer_interrupt+0xe/0x20
Aug 30 22:10:23 server101 kernel: [<ffffffff810f214e>] ? vfs_read+0xa6/0xff
Aug 30 22:10:23 server101 kernel: [<ffffffff810f2518>] ? fget_light+0x22/0x7e
Aug 30 22:10:23 server101 kernel: [<ffffffff810f22c1>] ? sys_read+0x49/0xc4
Aug 30 22:10:23 server101 kernel: [<ffffffff8101172e>] ? apic_timer_interrupt+0xe/0x20
Aug 30 22:10:23 server101 kernel: [<ffffffff81010c12>] ? system_call_fastpath+0x16/0x1b
Aug 30 22:10:23 server101 kernel: Disabling lock debugging due to kernel taint
Aug 30 22:10:39 server101 proxwww[4104]: Starting new child 4104



Note:
  • I have also monitored the KVM via the console during the backup; performance does not appear to be impacted and the VM remains responsive throughout.
  • The Proxmox host itself seems fine and there is no kernel panic, but should I be concerned about these messages, or are they benign? (One way to watch for them is sketched below.)
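A quick sketch of one way to watch the host logs for these two message types (log paths assume a stock Debian/PVE install):
Code:
# Hypothetical check, not from the original post: scan the host logs for
# the hung-task and bad-page messages shown above.
grep -E 'blocked for more than 120 seconds|Bad page state' /var/log/syslog /var/log/kern.log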
Any assistance would be greatly appreciated.

Thanks in advance.
 
pveversion -v output:
pve-manager: 1.8-18 (pve-manager/1.8/6070)
running kernel: 2.6.32-4-pve
proxmox-ve-2.6.32: 1.8-33
pve-kernel-2.6.32-4-pve: 2.6.32-33
qemu-server: 1.1-30
pve-firmware: 1.0-11
libpve-storage-perl: 1.0-17
vncterm: 0.9-2
vzctl: 3.0.28-1pve1
vzdump: 1.2-14
vzprocps: 2.0.11-2
vzquota: 3.0.11-1
pve-qemu-kvm: 0.14.1-1
ksm-control-daemon: 1.0-6
 
Update: I am still seeing similar syslog entries when backing up via snapshot. Also of interest: every few days the SBS 2011 Standard VM randomly BSODs, which is incredibly frustrating. What is weird is that I have been running an exact copy of this VM (i.e. a copy of the .raw file) on older AMD hardware (a dual core, I believe) and it has no problems whatsoever, other than being quite a bit slower.

I have already tried to eliminate the memory and motherboard by swapping in different but identical hardware, and I see the same issue. I just find it weird that it runs fine (i.e. is stable) on older AMD hardware, but the new hardware (the AMD Phenom II X6 1100T) causes it to crash every few days.

Any suggestions from the group would be greatly appreciated, as I am losing hair quickly.
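Since hardware is the prime suspect, one host-side check worth doing is looking for machine check exceptions; a rough sketch (mcelog is an assumption here and may need to be installed first):
Code:
# Hypothetical diagnostic, not mentioned in the thread: machine check
# exceptions in the kernel log would point at CPU/RAM/board trouble.
dmesg | grep -i 'machine check'
# Decode anything queued in /dev/mcelog (requires the mcelog package):
mcelog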
 
The first thought I had after reading your posts is that you have some bad RAM.
That would explain the BSODs and the strange kernel messages.

Try running memtest and see what it says.
At a minimum, let it run one full pass of the default tests.
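As a complement to a memtest boot (not a replacement for it), the memory can also be exercised from inside the running system with memtester; this is only a sketch, and the size and iteration count are arbitrary:
Code:
# Hypothetical complementary check: lock and test 4 GB of RAM for 2 passes.
memtester 4G 2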

If you want stability, I highly recommend ECC.
I would much rather get an EDAC message telling me to replace a DIMM than deal with random issues.
Many ASUS boards support ECC; just the other day in another thread we were discussing boards, and the consensus seemed to be that no Gigabyte boards support ECC.
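For reference, a sketch of how EDAC error counters can be read on a board that actually supports ECC (this assumes the appropriate edac kernel module is loaded; edac-util comes from the edac-utils package):
Code:
# Hypothetical example for an ECC-capable setup with EDAC enabled.
# Corrected (ce) and uncorrected (ue) error counts per memory controller:
grep -H . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count
# Or, with edac-utils installed:
edac-util -v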
 
e100,

Thank you for the suggestion. Unfortunately, I have already done that. I originally had 16GB of G.Skill RAM installed, ran memtest, and got all kinds of errors. I then opted for the Corsair memory and a new motherboard, and memtest completed two passes of the default tests without any errors.

This thing is a real head-scratcher, as I have several Proxmox boxes with identical hardware and have not had the KVM issues I see with the SBS 2011 box. However, this is my first install of SBS 2011; all the previous KVMs have been Server 2008 R2 or Server 2003 R2.

I will take your suggestion, though, and opt for ECC RAM in the future.


Any other suggestions?
 
I hate to say the same thing again, but I really think your problem is a hardware issue related to your RAM.

"BUG: Bad page state in process" sounds like a bit flip, search google to see what others think.

Just because memtest did not detect an error does not mean no errors exist.
memtest tests the RAM, not the whole system.

This issue happens while you are running a backup process that is memory, CPU and disk IO intensive.
Maybe your power supply is not able to keep up and that is causing the problem.
Maybe your motherboard has flaky voltage regulators and is not keeping up under load.
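One way to test that theory without waiting for the next backup is to put the box under a similar combined load and watch voltages and temperatures while it runs; a rough sketch using the generic stress and lm-sensors tools (both are assumptions, not tools mentioned in this thread):
Code:
# Hypothetical load test: combined CPU, memory and disk I/O pressure,
# loosely mimicking a snapshot backup, for 10 minutes.
stress --cpu 6 --vm 4 --vm-bytes 1G --hdd 2 --timeout 600
# In another shell, watch temperatures and voltages while it runs:
watch -n 5 sensors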

You mentioned using different but identical hardware and having the same issue.
Maybe both sets of hardware are bad, or there is a compatibility issue with the hardware you have.
Maybe this is a BIOS bug that the high IO and RAM usage triggers.
The Windows VM runs flawlessly on completely different, older hardware, so it sounds like the new hardware is to blame.