vzdump using lvm snapshot - kills the box

hk@

Active Member
Feb 10, 2010
221
5
38
Vienna
kapper.net
Hi
we're running into problems when PVE 2.0 does some automated full openvz-backups.
(post split into several posts as the post-limit is 10,000 chars)


pveversion -v
pve-manager: 2.0-57 (pve-manager/2.0/ff6cd700)
running kernel: 2.6.32-11-pve
proxmox-ve-2.6.32: 2.0-65
pve-kernel-2.6.32-10-pve: 2.6.32-63
pve-kernel-2.6.32-11-pve: 2.6.32-65
pve-kernel-2.6.32-7-pve: 2.6.32-60
lvm2: 2.02.88-2pve2
clvm: 2.02.88-2pve2
corosync-pve: 1.4.1-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.8-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.7-2
pve-cluster: 1.0-26
qemu-server: 2.0-36
pve-firmware: 1.0-15
libpve-common-perl: 1.0-25
libpve-access-control: 1.0-17
libpve-storage-perl: 2.0-17
vncterm: 1.0-2
vzctl: 3.0.30-2pve2
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-9
ksm-control-daemon: 1.1-1


it seems the previous backup is finished just fine, but then the LVM for the next snapshot runs into problems with the cow-device:
Apr 7 06:40:51 k77 vzdump[819692]: INFO: Finished Backup of VM 77002 (02:32:35)
Apr 7 06:40:51 k77 vzdump[819692]: INFO: Starting Backup of VM 77003 (openvz)
Apr 7 06:40:52 k77 kernel: EXT3-fs: barriers disabled
Apr 7 06:40:52 k77 kernel: kjournald starting. Commit interval 5 seconds
Apr 7 06:40:52 k77 kernel: EXT3-fs (dm-4): using internal journal
Apr 7 06:40:52 k77 kernel: ext3_orphan_cleanup: deleting unreferenced inode 127787244
Apr 7 06:40:52 k77 kernel: ext3_orphan_cleanup: deleting unreferenced inode 121443158
Apr 7 06:40:52 k77 kernel: ext3_orphan_cleanup: deleting unreferenced inode 121443132
Apr 7 06:40:52 k77 kernel: ext3_orphan_cleanup: deleting unreferenced inode 121443072
Apr 7 06:40:52 k77 kernel: ext3_orphan_cleanup: deleting unreferenced inode 121442188
Apr 7 06:40:52 k77 kernel: ext3_orphan_cleanup: deleting unreferenced inode 121439638
Apr 7 06:40:52 k77 kernel: ext3_orphan_cleanup: deleting unreferenced inode 121439614
Apr 7 06:40:52 k77 kernel: ext3_orphan_cleanup: deleting unreferenced inode 106040480
Apr 7 06:40:52 k77 kernel: ext3_orphan_cleanup: deleting unreferenced inode 106040479
Apr 7 06:40:52 k77 kernel: ext3_orphan_cleanup: deleting unreferenced inode 106040478
Apr 7 06:40:52 k77 kernel: ext3_orphan_cleanup: deleting unreferenced inode 106039848
Apr 7 06:40:52 k77 kernel: ext3_orphan_cleanup: deleting unreferenced inode 106038228
Apr 7 06:40:52 k77 kernel: EXT3-fs (dm-4): 12 orphan inodes deleted
Apr 7 06:40:52 k77 kernel: EXT3-fs (dm-4): recovery complete
Apr 7 06:40:52 k77 kernel: EXT3-fs (dm-4): mounted filesystem with ordered data mode
Apr 7 06:41:38 k77 kernel: __ratelimit: 208 callbacks suppressed
Apr 7 06:41:38 k77 kernel: device-mapper: ioctl: unable to remove open device pve-vzsnap--k77--0-cow
Apr 7 06:45:07 k77 kernel: INFO: task flush-253:2:1180 blocked for more than 120 seconds.


after that it only gets worse and within five minutes the whole system is unresponsive (no reboot works, only a power-reset via IPMI saves the day):


Apr 7 06:45:07 k77 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 7 06:45:07 k77 kernel: flush-253:2 D ffff88083987a480 0 1180 2 0 0x00000000
Apr 7 06:45:07 k77 kernel: ffff88083b889800 0000000000000046 0000000000000000 ffffffff81404e6c
Apr 7 06:45:07 k77 kernel: ffff880dbc12ccc0 ffff880839aa2078 0000000000000008 00000000f3a48918
Apr 7 06:45:07 k77 kernel: ffff88083b889860 ffff88083987aa20 ffff88083b889fd8 ffff88083b889fd8
Apr 7 06:45:07 k77 kernel: Call Trace:
Apr 7 06:45:07 k77 kernel: [<ffffffff81404e6c>] ? dm_table_unplug_all+0x5c/0x100
Apr 7 06:45:07 k77 kernel: [<ffffffff811c0000>] ? sync_buffer+0x0/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff8150adb3>] io_schedule+0x73/0xc0
Apr 7 06:45:07 k77 kernel: [<ffffffff811c0045>] sync_buffer+0x45/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff8150b62a>] __wait_on_bit_lock+0x5a/0xc0
Apr 7 06:45:07 k77 kernel: [<ffffffff811c0000>] ? sync_buffer+0x0/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff811c0400>] ? end_buffer_async_write+0x0/0x180
Apr 7 06:45:07 k77 kernel: [<ffffffff8150b708>] out_of_line_wait_on_bit_lock+0x78/0x90
Apr 7 06:45:07 k77 kernel: [<ffffffff810942c0>] ? wake_bit_function+0x0/0x40
Apr 7 06:45:07 k77 kernel: [<ffffffff811c01c6>] __lock_buffer+0x36/0x40
Apr 7 06:45:07 k77 kernel: [<ffffffff811c0c54>] __block_write_full_page+0x374/0x3a0
Apr 7 06:45:07 k77 kernel: [<ffffffff811c0400>] ? end_buffer_async_write+0x0/0x180
Apr 7 06:45:07 k77 kernel: [<ffffffff811c16be>] block_write_full_page_endio+0xce/0x100
Apr 7 06:45:07 k77 kernel: [<ffffffff811c0400>] ? end_buffer_async_write+0x0/0x180
Apr 7 06:45:07 k77 kernel: [<ffffffff811c1705>] block_write_full_page+0x15/0x20
Apr 7 06:45:07 k77 kernel: [<ffffffffa00bbfcd>] ext3_ordered_writepage+0x1ed/0x240 [ext3]
Apr 7 06:45:07 k77 kernel: [<ffffffff81135ef7>] __writepage+0x17/0x40
Apr 7 06:45:07 k77 kernel: [<ffffffff8113749b>] write_cache_pages+0x1cb/0x480
Apr 7 06:45:07 k77 kernel: [<ffffffff81135ee0>] ? __writepage+0x0/0x40
Apr 7 06:45:07 k77 kernel: [<ffffffff8125e179>] ? cpumask_next_and+0x29/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff81137774>] generic_writepages+0x24/0x30
Apr 7 06:45:07 k77 kernel: [<ffffffff811377b5>] do_writepages+0x35/0x40
Apr 7 06:45:07 k77 kernel: [<ffffffff811b612d>] __writeback_single_inode+0xdd/0x2c0
Apr 7 06:45:07 k77 kernel: [<ffffffff811b6393>] writeback_single_inode+0x83/0xc0
Apr 7 06:45:07 k77 kernel: [<ffffffff811a5d50>] ? iput+0x30/0x70
Apr 7 06:45:07 k77 kernel: [<ffffffff811b664f>] writeback_sb_inodes+0xff/0x1e0
Apr 7 06:45:07 k77 kernel: [<ffffffff811b6880>] writeback_inodes_wb+0x150/0x1a0
Apr 7 06:45:07 k77 kernel: [<ffffffff811b6b7b>] wb_writeback+0x2ab/0x400
Apr 7 06:45:07 k77 kernel: [<ffffffff8150a5be>] ? thread_return+0x4e/0x7d0
Apr 7 06:45:07 k77 kernel: [<ffffffff811b6e79>] wb_do_writeback+0x1a9/0x250
Apr 7 06:45:07 k77 kernel: [<ffffffff8107cfa0>] ? process_timeout+0x0/0x10
Apr 7 06:45:07 k77 kernel: [<ffffffff811b6f83>] bdi_writeback_task+0x63/0x1b0
Apr 7 06:45:07 k77 kernel: [<ffffffff81094157>] ? bit_waitqueue+0x17/0xc0
Apr 7 06:45:07 k77 kernel: [<ffffffff8114a790>] ? bdi_start_fn+0x0/0x100
Apr 7 06:45:07 k77 kernel: [<ffffffff8114a816>] bdi_start_fn+0x86/0x100
Apr 7 06:45:07 k77 kernel: [<ffffffff8114a790>] ? bdi_start_fn+0x0/0x100
Apr 7 06:45:07 k77 kernel: [<ffffffff81093ca6>] kthread+0x96/0xa0
Apr 7 06:45:07 k77 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Apr 7 06:45:07 k77 kernel: [<ffffffff81093c10>] ? kthread+0x0/0xa0
Apr 7 06:45:07 k77 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
Apr 7 06:45:07 k77 kernel: INFO: task kjournald:1181 blocked for more than 120 seconds.
Apr 7 06:45:07 k77 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 7 06:45:07 k77 kernel: kjournald D ffff88083987af40 0 1181 2 0 0x00000000
Apr 7 06:45:07 k77 kernel: ffff88083a667c50 0000000000000046 ffff88083a667c10 ffffffff81404e6c
Apr 7 06:45:07 k77 kernel: ffff88083a667bc0 ffffffff81012af9 ffff88083a667c00 ffffffff8109ecf9
Apr 7 06:45:07 k77 kernel: ffff88083a667cb0 ffff88083987b4e0 ffff88083a667fd8 ffff88083a667fd8
Apr 7 06:45:07 k77 kernel: Call Trace:
Apr 7 06:45:07 k77 kernel: [<ffffffff81404e6c>] ? dm_table_unplug_all+0x5c/0x100
Apr 7 06:45:07 k77 kernel: [<ffffffff81012af9>] ? read_tsc+0x9/0x20
Apr 7 06:45:07 k77 kernel: [<ffffffff8109ecf9>] ? ktime_get_ts+0xa9/0xe0
Apr 7 06:45:07 k77 kernel: [<ffffffff811c0000>] ? sync_buffer+0x0/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff8150adb3>] io_schedule+0x73/0xc0
Apr 7 06:45:07 k77 kernel: [<ffffffff811c0045>] sync_buffer+0x45/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff8150b77f>] __wait_on_bit+0x5f/0x90
Apr 7 06:45:07 k77 kernel: [<ffffffff811c0000>] ? sync_buffer+0x0/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff8150b828>] out_of_line_wait_on_bit+0x78/0x90
Apr 7 06:45:07 k77 kernel: [<ffffffff810942c0>] ? wake_bit_function+0x0/0x40
Apr 7 06:45:07 k77 kernel: [<ffffffff811bfff6>] __wait_on_buffer+0x26/0x30
Apr 7 06:45:07 k77 kernel: [<ffffffffa009af1e>] journal_commit_transaction+0x9fe/0x12f0 [jbd]
Apr 7 06:45:07 k77 kernel: [<ffffffff8107ce8c>] ? lock_timer_base+0x3c/0x70
Apr 7 06:45:07 k77 kernel: [<ffffffff8107dacb>] ? try_to_del_timer_sync+0x7b/0xe0
Apr 7 06:45:07 k77 kernel: [<ffffffffa00a0288>] kjournald+0xe8/0x250 [jbd]
Apr 7 06:45:07 k77 kernel: [<ffffffff81094280>] ? autoremove_wake_function+0x0/0x40
Apr 7 06:45:07 k77 kernel: [<ffffffffa00a01a0>] ? kjournald+0x0/0x250 [jbd]
Apr 7 06:45:07 k77 kernel: [<ffffffff81093ca6>] kthread+0x96/0xa0
Apr 7 06:45:07 k77 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Apr 7 06:45:07 k77 kernel: [<ffffffff81093c10>] ? kthread+0x0/0xa0
Apr 7 06:45:07 k77 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
Apr 7 06:45:07 k77 kernel: INFO: task mysqld:879440 blocked for more than 120 seconds.
 

hk@

Active Member
Feb 10, 2010
221
5
38
Vienna
kapper.net
Apr 7 06:45:07 k77 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 7 06:45:07 k77 kernel: mysqld D ffff88103a078640 0 879440 5155 77002 0x00000000
Apr 7 06:45:07 k77 kernel: ffff8808303c7938 0000000000000086 0000000000000000 ffffffff81404e6c
Apr 7 06:45:07 k77 kernel: ffff8808303c78e8 ffffffff81053365 ffff8808303c7a48 0000000000000000
Apr 7 06:45:07 k77 kernel: ffff8808303c79e8 ffff88103a078be0 ffff8808303c7fd8 ffff8808303c7fd8
Apr 7 06:45:07 k77 kernel: Call Trace:
Apr 7 06:45:07 k77 kernel: [<ffffffff81404e6c>] ? dm_table_unplug_all+0x5c/0x100
Apr 7 06:45:07 k77 kernel: [<ffffffff81053365>] ? select_idle_sibling+0x95/0x150
Apr 7 06:45:07 k77 kernel: [<ffffffff811c0000>] ? sync_buffer+0x0/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff8150adb3>] io_schedule+0x73/0xc0
Apr 7 06:45:07 k77 kernel: [<ffffffff811c0045>] sync_buffer+0x45/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff8150b62a>] __wait_on_bit_lock+0x5a/0xc0
Apr 7 06:45:07 k77 kernel: [<ffffffff81052776>] ? enqueue_task+0x66/0x80
Apr 7 06:45:07 k77 kernel: [<ffffffff811c0000>] ? sync_buffer+0x0/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff8150b708>] out_of_line_wait_on_bit_lock+0x78/0x90
Apr 7 06:45:07 k77 kernel: [<ffffffff810942c0>] ? wake_bit_function+0x0/0x40
Apr 7 06:45:07 k77 kernel: [<ffffffff811c01c6>] __lock_buffer+0x36/0x40
Apr 7 06:45:07 k77 kernel: [<ffffffff811c10f8>] __sync_dirty_buffer+0xc8/0xf0
Apr 7 06:45:07 k77 kernel: [<ffffffff811c1133>] sync_dirty_buffer+0x13/0x20
Apr 7 06:45:07 k77 kernel: [<ffffffffa0099bbe>] journal_dirty_data+0x1de/0x270 [jbd]
Apr 7 06:45:07 k77 kernel: [<ffffffffa00bc470>] ext3_journal_dirty_data+0x20/0x50 [ext3]
Apr 7 06:45:07 k77 kernel: [<ffffffffa00bc4c5>] journal_dirty_data_fn+0x25/0x30 [ext3]
Apr 7 06:45:07 k77 kernel: [<ffffffffa00bb3b7>] walk_page_buffers+0x87/0xc0 [ext3]
Apr 7 06:45:07 k77 kernel: [<ffffffffa00bc4a0>] ? journal_dirty_data_fn+0x0/0x30 [ext3]
Apr 7 06:45:07 k77 kernel: [<ffffffffa00bff0b>] ext3_ordered_write_end+0xab/0x1c0 [ext3]
Apr 7 06:45:07 k77 kernel: [<ffffffff81010b5e>] ? __copy_from_user_inatomic+0xe/0x20
Apr 7 06:45:07 k77 kernel: [<ffffffff81120fd4>] generic_file_buffered_write+0x194/0x2c0
Apr 7 06:45:07 k77 kernel: [<ffffffff81122b20>] __generic_file_aio_write+0x240/0x470
Apr 7 06:45:07 k77 kernel: [<ffffffff81122dbf>] generic_file_aio_write+0x6f/0xe0
Apr 7 06:45:07 k77 kernel: [<ffffffff8118ad7a>] do_sync_write+0xfa/0x140
Apr 7 06:45:07 k77 kernel: [<ffffffff81094280>] ? autoremove_wake_function+0x0/0x40
Apr 7 06:45:07 k77 kernel: [<ffffffff8118b058>] vfs_write+0xb8/0x1a0
Apr 7 06:45:07 k77 kernel: [<ffffffff8118bb22>] sys_pwrite64+0x82/0xa0
Apr 7 06:45:07 k77 kernel: [<ffffffff8100b182>] system_call_fastpath+0x16/0x1b
Apr 7 06:45:07 k77 kernel: INFO: task mysqld:885509 blocked for more than 120 seconds.
Apr 7 06:45:07 k77 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 7 06:45:07 k77 kernel: mysqld D ffff88083bb3e740 0 885509 5155 77002 0x00000000
Apr 7 06:45:07 k77 kernel: ffff88073a695cc8 0000000000000086 0000000000000000 ffff880000066630
Apr 7 06:45:07 k77 kernel: 0000000000000001 ffff88073a695d58 ffff88073a695dd0 0000000000000282
Apr 7 06:45:07 k77 kernel: ffff88073a695c78 ffff88083bb3ece0 ffff88073a695fd8 ffff88073a695fd8
Apr 7 06:45:07 k77 kernel: Call Trace:
Apr 7 06:45:07 k77 kernel: [<ffffffff81120480>] ? sync_page+0x0/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff8150adb3>] io_schedule+0x73/0xc0
Apr 7 06:45:07 k77 kernel: [<ffffffff811204bd>] sync_page+0x3d/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff8150b77f>] __wait_on_bit+0x5f/0x90
Apr 7 06:45:07 k77 kernel: [<ffffffff81120673>] wait_on_page_bit+0x73/0x80
Apr 7 06:45:07 k77 kernel: [<ffffffff810942c0>] ? wake_bit_function+0x0/0x40
Apr 7 06:45:07 k77 kernel: [<ffffffff81138735>] ? pagevec_lookup_tag+0x25/0x40
Apr 7 06:45:07 k77 kernel: [<ffffffff81120c5b>] wait_on_page_writeback_range+0xfb/0x190
Apr 7 06:45:07 k77 kernel: [<ffffffff81137774>] ? generic_writepages+0x24/0x30
Apr 7 06:45:07 k77 kernel: [<ffffffff811377b5>] ? do_writepages+0x35/0x40
Apr 7 06:45:07 k77 kernel: [<ffffffff81120dab>] ? __filemap_fdatawrite_range+0x5b/0x60
Apr 7 06:45:07 k77 kernel: [<ffffffff81120e28>] filemap_write_and_wait_range+0x78/0x90
Apr 7 06:45:07 k77 kernel: [<ffffffff811bbaaa>] vfs_fsync_range+0xba/0x190
Apr 7 06:45:07 k77 kernel: [<ffffffff811bbbed>] vfs_fsync+0x1d/0x20
Apr 7 06:45:07 k77 kernel: [<ffffffff811bbc50>] do_fsync+0x60/0xa0
Apr 7 06:45:07 k77 kernel: [<ffffffff811bbcc0>] sys_fsync+0x10/0x20
Apr 7 06:45:07 k77 kernel: [<ffffffff8100b182>] system_call_fastpath+0x16/0x1b
Apr 7 06:45:07 k77 kernel: INFO: task mysqld:885558 blocked for more than 120 seconds.
Apr 7 06:45:07 k77 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 7 06:45:07 k77 kernel: mysqld D ffff88083b1144c0 0 885558 5155 77002 0x00000000
Apr 7 06:45:07 k77 kernel: ffff88052d87d938 0000000000000086 0000000000000000 ffffffff81404e6c
Apr 7 06:45:07 k77 kernel: 0000000000000002 ffff880000054240 0000000000000000 0000000000000050
Apr 7 06:45:07 k77 kernel: ffff88052d87d9c8 ffff88083b114a60 ffff88052d87dfd8 ffff88052d87dfd8
Apr 7 06:45:07 k77 kernel: Call Trace:
Apr 7 06:45:07 k77 kernel: [<ffffffff81404e6c>] ? dm_table_unplug_all+0x5c/0x100
Apr 7 06:45:07 k77 kernel: [<ffffffff811c0000>] ? sync_buffer+0x0/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff8150adb3>] io_schedule+0x73/0xc0
Apr 7 06:45:07 k77 kernel: [<ffffffff811c0045>] sync_buffer+0x45/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff8150b62a>] __wait_on_bit_lock+0x5a/0xc0
Apr 7 06:45:07 k77 kernel: [<ffffffff81052776>] ? enqueue_task+0x66/0x80
Apr 7 06:45:07 k77 kernel: [<ffffffff811c0000>] ? sync_buffer+0x0/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff8150b708>] out_of_line_wait_on_bit_lock+0x78/0x90
Apr 7 06:45:07 k77 kernel: [<ffffffff810942c0>] ? wake_bit_function+0x0/0x40
Apr 7 06:45:07 k77 kernel: [<ffffffff811c01c6>] __lock_buffer+0x36/0x40
Apr 7 06:45:07 k77 kernel: [<ffffffff811c10f8>] __sync_dirty_buffer+0xc8/0xf0
Apr 7 06:45:07 k77 kernel: [<ffffffff811c1133>] sync_dirty_buffer+0x13/0x20
Apr 7 06:45:07 k77 kernel: [<ffffffffa0099bbe>] journal_dirty_data+0x1de/0x270 [jbd]
Apr 7 06:45:07 k77 kernel: [<ffffffffa00bc470>] ext3_journal_dirty_data+0x20/0x50 [ext3]
Apr 7 06:45:07 k77 kernel: [<ffffffffa00bc4c5>] journal_dirty_data_fn+0x25/0x30 [ext3]
Apr 7 06:45:07 k77 kernel: [<ffffffffa00bb3b7>] walk_page_buffers+0x87/0xc0 [ext3]
Apr 7 06:45:07 k77 kernel: [<ffffffffa00bc4a0>] ? journal_dirty_data_fn+0x0/0x30 [ext3]
Apr 7 06:45:07 k77 kernel: [<ffffffffa00bff0b>] ext3_ordered_write_end+0xab/0x1c0 [ext3]
Apr 7 06:45:07 k77 kernel: [<ffffffff81010b5e>] ? __copy_from_user_inatomic+0xe/0x20
Apr 7 06:45:07 k77 kernel: [<ffffffff81120fd4>] generic_file_buffered_write+0x194/0x2c0
Apr 7 06:45:07 k77 kernel: [<ffffffff81122b20>] __generic_file_aio_write+0x240/0x470
Apr 7 06:45:07 k77 kernel: [<ffffffff81122dbf>] generic_file_aio_write+0x6f/0xe0
Apr 7 06:45:07 k77 kernel: [<ffffffff8118ad7a>] do_sync_write+0xfa/0x140
Apr 7 06:45:07 k77 kernel: [<ffffffff81094280>] ? autoremove_wake_function+0x0/0x40
Apr 7 06:45:07 k77 kernel: [<ffffffff8118b058>] vfs_write+0xb8/0x1a0
Apr 7 06:45:07 k77 kernel: [<ffffffff8118bb22>] sys_pwrite64+0x82/0xa0
Apr 7 06:45:07 k77 kernel: [<ffffffff8100b182>] system_call_fastpath+0x16/0x1b
Apr 7 06:45:07 k77 kernel: INFO: task dccd:5750 blocked for more than 120 seconds.
Apr 7 06:45:07 k77 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 7 06:45:07 k77 kernel: dccd D ffff88103af1cd00 0 5750 2186 77002 0x00000000
Apr 7 06:45:07 k77 kernel: ffff88102ca1fac8 0000000000000086 0000000000000000 ffffffff811c8b97
Apr 7 06:45:07 k77 kernel: ffff88102ca1fb18 ffffffff811c950d ffffffffa00be850 0000000000000282
Apr 7 06:45:07 k77 kernel: ffff88102ca1fa78 ffff88103af1d2a0 ffff88102ca1ffd8 ffff88102ca1ffd8
Apr 7 06:45:07 k77 kernel: Call Trace:
Apr 7 06:45:07 k77 kernel: [<ffffffff811c8b97>] ? mpage_bio_submit+0x27/0x30
Apr 7 06:45:07 k77 kernel: [<ffffffff811c950d>] ? mpage_readpages+0x11d/0x140
Apr 7 06:45:07 k77 kernel: [<ffffffffa00be850>] ? ext3_get_block+0x0/0x120 [ext3]
Apr 7 06:45:07 k77 kernel: [<ffffffff81120480>] ? sync_page+0x0/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff8150adb3>] io_schedule+0x73/0xc0
Apr 7 06:45:07 k77 kernel: [<ffffffff811204bd>] sync_page+0x3d/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff8150b77f>] __wait_on_bit+0x5f/0x90
Apr 7 06:45:07 k77 kernel: [<ffffffff81120673>] wait_on_page_bit+0x73/0x80
Apr 7 06:45:07 k77 kernel: [<ffffffff810942c0>] ? wake_bit_function+0x0/0x40
Apr 7 06:45:07 k77 kernel: [<ffffffff8112071a>] __lock_page_or_retry+0x3a/0x60
Apr 7 06:45:07 k77 kernel: [<ffffffff81121847>] filemap_fault+0x2c7/0x510
Apr 7 06:45:07 k77 kernel: [<ffffffff81151d0e>] __do_fault+0x7e/0x610
Apr 7 06:45:07 k77 kernel: [<ffffffff811a3c57>] ? __d_lookup+0xa7/0x150
Apr 7 06:45:07 k77 kernel: [<ffffffff81152399>] handle_pte_fault+0xf9/0xd10
Apr 7 06:45:07 k77 kernel: [<ffffffff81198cd5>] ? path_to_nameidata+0x25/0x60
Apr 7 06:45:07 k77 kernel: [<ffffffff8119a89e>] ? __link_path_walk+0x88e/0x1050
Apr 7 06:45:07 k77 kernel: [<ffffffff811ab730>] ? mntput_no_expire+0x30/0x110
Apr 7 06:45:07 k77 kernel: [<ffffffff81153194>] handle_mm_fault+0x1e4/0x2b0
Apr 7 06:45:07 k77 kernel: [<ffffffff811ab730>] ? mntput_no_expire+0x30/0x110
Apr 7 06:45:07 k77 kernel: [<ffffffff81042a79>] __do_page_fault+0x139/0x490
Apr 7 06:45:07 k77 kernel: [<ffffffff8119c102>] ? user_path_at+0x62/0xa0
Apr 7 06:45:07 k77 kernel: [<ffffffff810ec685>] ? call_rcu_sched+0x15/0x20
Apr 7 06:45:07 k77 kernel: [<ffffffff815100fe>] do_page_fault+0x3e/0xa0
Apr 7 06:45:07 k77 kernel: [<ffffffff8150d465>] page_fault+0x25/0x30
Apr 7 06:45:07 k77 kernel: INFO: task apache2:885550 blocked for more than 120 seconds.
 

hk@

Active Member
Feb 10, 2010
221
5
38
Vienna
kapper.net
Apr 7 06:45:07 k77 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 7 06:45:07 k77 kernel: apache2 D ffff88103b76d4c0 0 885550 885218 77002 0x00000000
Apr 7 06:45:07 k77 kernel: ffff88083b8bba78 0000000000000086 ffff88083b8bba38 ffffffff81404e6c
Apr 7 06:45:07 k77 kernel: ffff8807f96ff2c0 ffff88083999f808 ffff88083b8bba18 ffffffff8140414e
Apr 7 06:45:07 k77 kernel: ffff8807f96ff2c0 ffff88103b76da60 ffff88083b8bbfd8 ffff88083b8bbfd8
Apr 7 06:45:07 k77 kernel: Call Trace:
Apr 7 06:45:07 k77 kernel: [<ffffffff81404e6c>] ? dm_table_unplug_all+0x5c/0x100
Apr 7 06:45:07 k77 kernel: [<ffffffff8140414e>] ? dm_request+0x17e/0x1b0
Apr 7 06:45:07 k77 kernel: [<ffffffff811c0000>] ? sync_buffer+0x0/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff8150adb3>] io_schedule+0x73/0xc0
Apr 7 06:45:07 k77 kernel: [<ffffffff811c0045>] sync_buffer+0x45/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff8150b77f>] __wait_on_bit+0x5f/0x90
Apr 7 06:45:07 k77 kernel: [<ffffffff811c0000>] ? sync_buffer+0x0/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff8150b828>] out_of_line_wait_on_bit+0x78/0x90
Apr 7 06:45:07 k77 kernel: [<ffffffff810942c0>] ? wake_bit_function+0x0/0x40
Apr 7 06:45:07 k77 kernel: [<ffffffff811bfff6>] __wait_on_buffer+0x26/0x30
Apr 7 06:45:07 k77 kernel: [<ffffffffa00bb7da>] wait_on_buffer+0x2a/0x40 [ext3]
Apr 7 06:45:07 k77 kernel: [<ffffffffa00bcbb9>] __ext3_get_inode_loc+0x1a9/0x350 [ext3]
Apr 7 06:45:07 k77 kernel: [<ffffffffa00bcde5>] ext3_iget+0x85/0x480 [ext3]
Apr 7 06:45:07 k77 kernel: [<ffffffffa00c3ff8>] ext3_lookup+0xa8/0x130 [ext3]
Apr 7 06:45:07 k77 kernel: [<ffffffff81199d5c>] do_lookup+0x1ac/0x210
Apr 7 06:45:07 k77 kernel: [<ffffffff8119a869>] __link_path_walk+0x859/0x1050
Apr 7 06:45:07 k77 kernel: [<ffffffff8119b29a>] path_walk+0x6a/0xe0
Apr 7 06:45:07 k77 kernel: [<ffffffff8119b46b>] do_path_lookup+0x5b/0xa0
Apr 7 06:45:07 k77 kernel: [<ffffffff8119c0f7>] user_path_at+0x57/0xa0
Apr 7 06:45:07 k77 kernel: [<ffffffff8150d13b>] ? _spin_unlock_bh+0x1b/0x20
Apr 7 06:45:07 k77 kernel: [<ffffffff811896c0>] sys_faccessat+0xd0/0x1d0
Apr 7 06:45:07 k77 kernel: [<ffffffff811897d8>] sys_access+0x18/0x20
Apr 7 06:45:07 k77 kernel: [<ffffffff8100b182>] system_call_fastpath+0x16/0x1b
Apr 7 06:45:07 k77 kernel: INFO: task apache2:885705 blocked for more than 120 seconds.
Apr 7 06:45:07 k77 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 7 06:45:07 k77 kernel: apache2 D ffff88103a7c8a00 0 885705 882933 77001 0x00020000
Apr 7 06:45:07 k77 kernel: ffff88052d95fcc8 0000000000000082 0000000000000000 ffff880839104998
Apr 7 06:45:07 k77 kernel: 0000000000000001 ffff88052d95fd58 ffff88052d95fdd0 0000000000000282
Apr 7 06:45:07 k77 kernel: ffff88052d95fc78 ffff88103a7c8fa0 ffff88052d95ffd8 ffff88052d95ffd8
Apr 7 06:45:07 k77 kernel: Call Trace:
Apr 7 06:45:07 k77 kernel: [<ffffffff81120480>] ? sync_page+0x0/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff8150adb3>] io_schedule+0x73/0xc0
Apr 7 06:45:07 k77 kernel: [<ffffffff811204bd>] sync_page+0x3d/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff8150b77f>] __wait_on_bit+0x5f/0x90
Apr 7 06:45:07 k77 kernel: [<ffffffff81120673>] wait_on_page_bit+0x73/0x80
Apr 7 06:45:07 k77 kernel: [<ffffffff810942c0>] ? wake_bit_function+0x0/0x40
Apr 7 06:45:07 k77 kernel: [<ffffffff81138735>] ? pagevec_lookup_tag+0x25/0x40
Apr 7 06:45:07 k77 kernel: [<ffffffff81120c5b>] wait_on_page_writeback_range+0xfb/0x190
Apr 7 06:45:07 k77 kernel: [<ffffffff81137774>] ? generic_writepages+0x24/0x30
Apr 7 06:45:07 k77 kernel: [<ffffffff811377b5>] ? do_writepages+0x35/0x40
Apr 7 06:45:07 k77 kernel: [<ffffffff81120dab>] ? __filemap_fdatawrite_range+0x5b/0x60
Apr 7 06:45:07 k77 kernel: [<ffffffff81120e28>] filemap_write_and_wait_range+0x78/0x90
Apr 7 06:45:07 k77 kernel: [<ffffffff811bbaaa>] vfs_fsync_range+0xba/0x190
Apr 7 06:45:07 k77 kernel: [<ffffffff811bbbed>] vfs_fsync+0x1d/0x20
Apr 7 06:45:07 k77 kernel: [<ffffffff811bbc50>] do_fsync+0x60/0xa0
Apr 7 06:45:07 k77 kernel: [<ffffffff811bbcc0>] sys_fsync+0x10/0x20
Apr 7 06:45:07 k77 kernel: [<ffffffff81049393>] ia32_sysret+0x0/0x5
Apr 7 06:45:07 k77 kernel: INFO: task apache2:885855 blocked for more than 120 seconds.
Apr 7 06:45:07 k77 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 7 06:45:07 k77 kernel: apache2 D ffff88103b96a280 0 885855 882933 77001 0x00020000
Apr 7 06:45:07 k77 kernel: ffff880068f6f938 0000000000000086 0000000000000000 ffffffff81404e6c
Apr 7 06:45:07 k77 kernel: 0000000000000002 ffff880000054240 0000000000000000 0000000000000050
Apr 7 06:45:07 k77 kernel: ffff880068f6f9c8 ffff88103b96a820 ffff880068f6ffd8 ffff880068f6ffd8
Apr 7 06:45:07 k77 kernel: Call Trace:
Apr 7 06:45:07 k77 kernel: [<ffffffff81404e6c>] ? dm_table_unplug_all+0x5c/0x100
Apr 7 06:45:07 k77 kernel: [<ffffffff811c0000>] ? sync_buffer+0x0/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff8150adb3>] io_schedule+0x73/0xc0
Apr 7 06:45:07 k77 kernel: [<ffffffff811c0045>] sync_buffer+0x45/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff8150b62a>] __wait_on_bit_lock+0x5a/0xc0
Apr 7 06:45:07 k77 kernel: [<ffffffff811c0000>] ? sync_buffer+0x0/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff8150b708>] out_of_line_wait_on_bit_lock+0x78/0x90
Apr 7 06:45:07 k77 kernel: [<ffffffff810942c0>] ? wake_bit_function+0x0/0x40
Apr 7 06:45:07 k77 kernel: [<ffffffff811c01c6>] __lock_buffer+0x36/0x40
Apr 7 06:45:07 k77 kernel: [<ffffffff811c10f8>] __sync_dirty_buffer+0xc8/0xf0
Apr 7 06:45:07 k77 kernel: [<ffffffff811c1133>] sync_dirty_buffer+0x13/0x20
Apr 7 06:45:07 k77 kernel: [<ffffffffa0099bbe>] journal_dirty_data+0x1de/0x270 [jbd]
Apr 7 06:45:07 k77 kernel: [<ffffffffa00bc470>] ext3_journal_dirty_data+0x20/0x50 [ext3]
Apr 7 06:45:07 k77 kernel: [<ffffffffa00bc4c5>] journal_dirty_data_fn+0x25/0x30 [ext3]
Apr 7 06:45:07 k77 kernel: [<ffffffffa00bb3b7>] walk_page_buffers+0x87/0xc0 [ext3]
Apr 7 06:45:07 k77 kernel: [<ffffffffa00bc4a0>] ? journal_dirty_data_fn+0x0/0x30 [ext3]
Apr 7 06:45:07 k77 kernel: [<ffffffffa00bff0b>] ext3_ordered_write_end+0xab/0x1c0 [ext3]
Apr 7 06:45:07 k77 kernel: [<ffffffff81010b5e>] ? __copy_from_user_inatomic+0xe/0x20
Apr 7 06:45:07 k77 kernel: [<ffffffff81120fd4>] generic_file_buffered_write+0x194/0x2c0
Apr 7 06:45:07 k77 kernel: [<ffffffff81122b20>] __generic_file_aio_write+0x240/0x470
Apr 7 06:45:07 k77 kernel: [<ffffffff8118ac3b>] ? do_sync_readv_writev+0xfb/0x140
Apr 7 06:45:07 k77 kernel: [<ffffffff81122dbf>] generic_file_aio_write+0x6f/0xe0
Apr 7 06:45:07 k77 kernel: [<ffffffff8118ad7a>] do_sync_write+0xfa/0x140
Apr 7 06:45:07 k77 kernel: [<ffffffff81094280>] ? autoremove_wake_function+0x0/0x40
Apr 7 06:45:07 k77 kernel: [<ffffffff8118b058>] vfs_write+0xb8/0x1a0
Apr 7 06:45:07 k77 kernel: [<ffffffff8118ba61>] sys_write+0x51/0x90
Apr 7 06:45:07 k77 kernel: [<ffffffff810c7e4a>] ? compat_sys_gettimeofday+0x9a/0xa0
Apr 7 06:45:07 k77 kernel: [<ffffffff81049393>] ia32_sysret+0x0/0x5
Apr 7 06:45:07 k77 kernel: INFO: task apache2:885980 blocked for more than 120 seconds.
Apr 7 06:45:07 k77 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 7 06:45:07 k77 kernel: apache2 D ffff8809788e47c0 0 885980 882933 77001 0x00020000
Apr 7 06:45:07 k77 kernel: ffff880862c65918 0000000000000086 0000000000000000 ffffffff81404e6c
Apr 7 06:45:07 k77 kernel: ffff8809788e4fe8 ffff8809788e47c0 ffff8809788e47c0 0000000000000008
Apr 7 06:45:07 k77 kernel: ffff88083c62a4f0 ffff8809788e4d60 ffff880862c65fd8 ffff880862c65fd8
Apr 7 06:45:07 k77 kernel: Call Trace:
Apr 7 06:45:07 k77 kernel: [<ffffffff81404e6c>] ? dm_table_unplug_all+0x5c/0x100
Apr 7 06:45:07 k77 kernel: [<ffffffff811c0000>] ? sync_buffer+0x0/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff8150adb3>] io_schedule+0x73/0xc0
Apr 7 06:45:07 k77 kernel: [<ffffffff811c0045>] sync_buffer+0x45/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff8150b77f>] __wait_on_bit+0x5f/0x90
Apr 7 06:45:07 k77 kernel: [<ffffffff811c0000>] ? sync_buffer+0x0/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff8150b828>] out_of_line_wait_on_bit+0x78/0x90
Apr 7 06:45:07 k77 kernel: [<ffffffff810942c0>] ? wake_bit_function+0x0/0x40
Apr 7 06:45:07 k77 kernel: [<ffffffff811bfff6>] __wait_on_buffer+0x26/0x30
Apr 7 06:45:07 k77 kernel: [<ffffffffa00becdc>] ext3_bread+0x6c/0x90 [ext3]
Apr 7 06:45:07 k77 kernel: [<ffffffffa00c3954>] ext3_dx_find_entry+0xc4/0x2a0 [ext3]
Apr 7 06:45:07 k77 kernel: [<ffffffff8109463c>] ? remove_wait_queue+0x3c/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff811a0696>] ? free_poll_entry+0x26/0x30
Apr 7 06:45:07 k77 kernel: [<ffffffffa00c3ded>] ext3_find_entry+0x2bd/0x420 [ext3]
Apr 7 06:45:07 k77 kernel: [<ffffffff8148a7c1>] ? tcp_recvmsg+0x841/0xf60
Apr 7 06:45:07 k77 kernel: [<ffffffff811a0e00>] ? __pollwait+0x0/0xf0
Apr 7 06:45:07 k77 kernel: [<ffffffffa00c3f9d>] ext3_lookup+0x4d/0x130 [ext3]
Apr 7 06:45:07 k77 kernel: [<ffffffff81199d5c>] do_lookup+0x1ac/0x210
Apr 7 06:45:07 k77 kernel: [<ffffffff8119a869>] __link_path_walk+0x859/0x1050
Apr 7 06:45:07 k77 kernel: [<ffffffff81431ef9>] ? sock_common_recvmsg+0x39/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff8119b29a>] path_walk+0x6a/0xe0
Apr 7 06:45:07 k77 kernel: [<ffffffff8119b46b>] do_path_lookup+0x5b/0xa0
Apr 7 06:45:07 k77 kernel: [<ffffffff8119c0f7>] user_path_at+0x57/0xa0
Apr 7 06:45:07 k77 kernel: [<ffffffff81094280>] ? autoremove_wake_function+0x0/0x40
Apr 7 06:45:07 k77 kernel: [<ffffffff81190726>] vfs_fstatat+0x46/0x80
Apr 7 06:45:07 k77 kernel: [<ffffffff8119079b>] vfs_stat+0x1b/0x20
Apr 7 06:45:07 k77 kernel: [<ffffffff8104a264>] sys32_stat64+0x24/0x40
Apr 7 06:45:07 k77 kernel: [<ffffffff8109ea5a>] ? do_gettimeofday+0x1a/0x50
Apr 7 06:45:07 k77 kernel: [<ffffffff810c7e4a>] ? compat_sys_gettimeofday+0x9a/0xa0
Apr 7 06:45:07 k77 kernel: [<ffffffff81049393>] ia32_sysret+0x0/0x5
 

hk@

Active Member
Feb 10, 2010
221
5
38
Vienna
kapper.net
this is it - calling "lvdisplay" to see the problem in detail doesn't respond anything (task hangs), lvremove for the presumed snapshot doesn't work either. this is really messy and any input on how to resolve this is greatly appreciated.


regards
hk
 

dietmar

Proxmox Staff Member
Staff member
Apr 28, 2005
17,110
509
133
Austria
www.proxmox.com
this is it - calling "lvdisplay" to see the problem in detail doesn't respond anything (task hangs), lvremove for the presumed snapshot doesn't work either. this is really messy and any input on how to resolve this is greatly appreciated.

How often does that happen - is it reproducible? If so, can you reproduce it on another server/hardware?
 

hk@

Active Member
Feb 10, 2010
221
5
38
Vienna
kapper.net
hi
definitely reproduceable - it seems if we do manual backup using snapshots this does virtually not happen, but if vzdump is doing all backups it seems that lvm is much more unstable (maybe it needs some time to recover after a snapshot has been removed?) and this seems reproducable on at least now two machines (that are also running some load, those with no load seem to have no trouble here).

regards
hk
 

hk@

Active Member
Feb 10, 2010
221
5
38
Vienna
kapper.net
Yes of course it's hardware-raid.

load since last (forced) powercycle: uptime
01:37:03 up 1 day, 14:50, 1 user, load average: 0.32, 0.35, 0.36


now we have seen the very same behaviour on another system (totally different: cpu, ram, raid-controller, ... all different).

different syslog though:

Apr 8 22:00:01 k81 vzdump[3090]: <root@pam> starting task UPID:k81:00000C13:035A66CB:4F81EE41:vzdump::root@pam:
Apr 8 22:00:01 k81 vzdump[3091]: INFO: starting new backup job: vzdump --quiet 1 --mode snapshot --compress lzo --storage backup --node k81 --all 1
Apr 8 22:00:01 k81 vzdump[3091]: INFO: Starting Backup of VM 11102 (openvz)
Apr 8 22:00:02 k81 kernel: VZDQ: Tried to clean orphans on qmblk with 1 state
Apr 8 22:00:02 k81 kernel: BUG: Quota files for 11102 are broken: no quota engine running
Apr 8 22:00:02 k81 kernel: EXT4-fs (dm-3): ext4_orphan_cleanup: deleting unreferenced inode 9179645
Apr 8 22:00:02 k81 kernel: VZDQ: Tried to clean orphans on qmblk with 1 state
Apr 8 22:00:02 k81 kernel: BUG: Quota files for 11104 are broken: no quota engine running
Apr 8 22:00:03 k81 kernel: EXT4-fs (dm-3): ext4_orphan_cleanup: deleting unreferenced inode 5813065
Apr 8 22:00:03 k81 kernel: VZDQ: Tried to clean orphans on qmblk with 1 state
Apr 8 22:00:03 k81 kernel: BUG: Quota files for 11104 are broken: no quota engine running
Apr 8 22:00:03 k81 kernel: EXT4-fs (dm-3): ext4_orphan_cleanup: deleting unreferenced inode 5813064
Apr 8 22:00:03 k81 kernel: VZDQ: Tried to clean orphans on qmblk with 1 state
Apr 8 22:00:03 k81 kernel: BUG: Quota files for 11104 are broken: no quota engine running
Apr 8 22:00:03 k81 kernel: EXT4-fs (dm-3): ext4_orphan_cleanup: deleting unreferenced inode 5769142
Apr 8 22:00:03 k81 kernel: VZDQ: Tried to clean orphans on qmblk with 1 state
Apr 8 22:00:03 k81 kernel: BUG: Quota files for 11104 are broken: no quota engine running
Apr 8 22:00:03 k81 kernel: EXT4-fs (dm-3): ext4_orphan_cleanup: deleting unreferenced inode 10499222
Apr 8 22:00:03 k81 kernel: VZDQ: Tried to clean orphans on qmblk with 1 state
Apr 8 22:00:03 k81 kernel: BUG: Quota files for 11104 are broken: no quota engine running
Apr 8 22:00:03 k81 kernel: EXT4-fs (dm-3): ext4_orphan_cleanup: deleting unreferenced inode 5807376
Apr 8 22:00:03 k81 kernel: VZDQ: Tried to clean orphans on qmblk with 1 state
Apr 8 22:00:03 k81 kernel: BUG: Quota files for 11104 are broken: no quota engine running
Apr 8 22:00:03 k81 kernel: EXT4-fs (dm-3): ext4_orphan_cleanup: deleting unreferenced inode 5807375
Apr 8 22:00:03 k81 kernel: VZDQ: Tried to clean orphans on qmblk with 1 state
Apr 8 22:00:03 k81 kernel: BUG: Quota files for 11104 are broken: no quota engine running
Apr 8 22:00:03 k81 kernel: EXT4-fs (dm-3): ext4_orphan_cleanup: deleting unreferenced inode 5807374
Apr 8 22:00:03 k81 kernel: VZDQ: Tried to clean orphans on qmblk with 1 state
Apr 8 22:00:03 k81 kernel: BUG: Quota files for 11104 are broken: no quota engine running
Apr 8 22:00:03 k81 kernel: EXT4-fs (dm-3): ext4_orphan_cleanup: deleting unreferenced inode 5807371
Apr 8 22:00:03 k81 kernel: VZDQ: Tried to clean orphans on qmblk with 1 state
Apr 8 22:00:03 k81 kernel: BUG: Quota files for 11104 are broken: no quota engine running
Apr 8 22:00:03 k81 kernel: EXT4-fs (dm-3): ext4_orphan_cleanup: deleting unreferenced inode 5768640
Apr 8 22:00:03 k81 kernel: VZDQ: Tried to clean orphans on qmblk with 1 state
Apr 8 22:00:03 k81 kernel: BUG: Quota files for 11103 are broken: no quota engine running
Apr 8 22:00:03 k81 kernel: EXT4-fs (dm-3): ext4_orphan_cleanup: deleting unreferenced inode 10616849
Apr 8 22:00:03 k81 kernel: VZDQ: Tried to clean orphans on qmblk with 1 state
Apr 8 22:00:03 k81 kernel: BUG: Quota files for 11103 are broken: no quota engine running
Apr 8 22:00:03 k81 kernel: EXT4-fs (dm-3): ext4_orphan_cleanup: deleting unreferenced inode 10616848
Apr 8 22:00:03 k81 kernel: VZDQ: Tried to clean orphans on qmblk with 1 state
Apr 8 22:00:03 k81 kernel: BUG: Quota files for 11103 are broken: no quota engine running
Apr 8 22:00:03 k81 kernel: EXT4-fs (dm-3): ext4_orphan_cleanup: deleting unreferenced inode 10616847
Apr 8 22:00:03 k81 kernel: VZDQ: Tried to clean orphans on qmblk with 1 state
Apr 8 22:00:03 k81 kernel: BUG: Quota files for 11103 are broken: no quota engine running
Apr 8 22:00:04 k81 kernel: EXT4-fs (dm-3): ext4_orphan_cleanup: deleting unreferenced inode 10616846
Apr 8 22:00:04 k81 kernel: VZDQ: Tried to clean orphans on qmblk with 1 state
Apr 8 22:00:04 k81 kernel: BUG: Quota files for 11103 are broken: no quota engine running
Apr 8 22:00:04 k81 kernel: EXT4-fs (dm-3): ext4_orphan_cleanup: deleting unreferenced inode 10616845
Apr 8 22:00:04 k81 kernel: VZDQ: Tried to clean orphans on qmblk with 1 state
Apr 8 22:00:04 k81 kernel: BUG: Quota files for 11103 are broken: no quota engine running
Apr 8 22:00:04 k81 kernel: EXT4-fs (dm-3): ext4_orphan_cleanup: deleting unreferenced inode 11403478
Apr 8 22:00:04 k81 kernel: VZDQ: Tried to clean orphans on qmblk with 1 state
Apr 8 22:00:04 k81 kernel: BUG: Quota files for 11102 are broken: no quota engine running
Apr 8 22:00:04 k81 kernel: EXT4-fs (dm-3): ext4_orphan_cleanup: deleting unreferenced inode 6300169
Apr 8 22:00:04 k81 kernel: VZDQ: Tried to clean orphans on qmblk with 1 state
Apr 8 22:00:04 k81 kernel: BUG: Quota files for 11102 are broken: no quota engine running
Apr 8 22:00:04 k81 kernel: EXT4-fs (dm-3): ext4_orphan_cleanup: deleting unreferenced inode 6300168
Apr 8 22:00:04 k81 kernel: VZDQ: Tried to clean orphans on qmblk with 1 state
Apr 8 22:00:04 k81 kernel: BUG: Quota files for 11102 are broken: no quota engine running
Apr 8 22:00:04 k81 kernel: EXT4-fs (dm-3): ext4_orphan_cleanup: deleting unreferenced inode 6300167
Apr 8 22:00:04 k81 kernel: VZDQ: Tried to clean orphans on qmblk with 1 state
Apr 8 22:00:04 k81 kernel: BUG: Quota files for 11102 are broken: no quota engine running
Apr 8 22:00:04 k81 kernel: EXT4-fs (dm-3): ext4_orphan_cleanup: deleting unreferenced inode 6300165
Apr 8 22:00:04 k81 kernel: VZDQ: Tried to clean orphans on qmblk with 1 state
Apr 8 22:00:04 k81 kernel: BUG: Quota files for 11102 are broken: no quota engine running
Apr 8 22:00:04 k81 kernel: EXT4-fs (dm-3): ext4_orphan_cleanup: deleting unreferenced inode 6300164
Apr 8 22:00:04 k81 kernel: EXT4-fs (dm-3): 21 orphan inodes deleted
Apr 8 22:00:04 k81 kernel: EXT4-fs (dm-3): recovery complete
Apr 8 22:00:04 k81 kernel: EXT4-fs (dm-3): mounted filesystem with ordered data mode. Opts:
Apr 8 22:08:03 k81 kernel: INFO: task jbd2/dm-2-8:920 blocked for more than 120 seconds.
Apr 8 22:08:03 k81 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 8 22:08:03 k81 kernel: jbd2/dm-2-8 D ffff8806b7aa8140 0 920 2 0 0x00000000
Apr 8 22:08:03 k81 kernel: ffff8806bb629c10 0000000000000046 ffff8806bb629bd0 ffffffff81404e6c
Apr 8 22:08:03 k81 kernel: ffff8806bb629b80 ffffffff81012af9 ffff8806bb629bc0 ffffffff8109ecf9
Apr 8 22:08:03 k81 kernel: ffff8806bb629c70 ffff8806b7aa86e0 ffff8806bb629fd8 ffff8806bb629fd8
Apr 8 22:08:03 k81 kernel: Call Trace:

... and so on....

regards
hk
 

hk@

Active Member
Feb 10, 2010
221
5
38
Vienna
kapper.net
we have seen the problem on ext3 (the first three posts reported exactly that) and on ext4 (the second report I shortened and therefore got pointed to the bugzilla-entry, but this bug 2085 doesn't seem to send the lvm on this machine to nirvana).we have had (as mentioned) the issue on totally different hardware (even the raid controller is once a ServeRAID 8k and once a LSI MegaRAID 8708EM2).I can only point my finger to lvm that seems to have trouble here.We also do have smaller issues on old PVE systems (1.9) with lvm, but only once in a few months did we get a total kill, usually only some vzdump-jobs can't be done successfully.For what its worth: in comparison we are running an openvz-only-host on 2.6.18-274.7.1.el5.028stab095.1 (and now 2.6.18-308.el5.028stab099.3) - no problems doing any snapshots since many many months.If we can debug anything further please let me know.Regardshk
 

e100

Renowned Member
Nov 6, 2010
1,248
33
68
Columbus, Ohio
ulbuilder.wordpress.com
On your first post there is one line that sticks out as the largest clue:
device-mapper: ioctl: unable to remove open device pve-vzsnap--k77--0-cow

So the snapshot is trying to be removed, but the system thinks that snapshot is still in use "open"
That is the deadlock right there.


Is your backup storage on the same LVM group as what you are backing up?
After restarting when this happens what is the output of lvdisplay?
 

hk@

Active Member
Feb 10, 2010
221
5
38
Vienna
kapper.net
Hi,
I agree on the snapshot problem, but "lvdisplay" simply hangs in this situation and reports nothing back.

the backup-storage is a mounted nfs-target in one situation and a separate volumegroup mounted as /backup/ in the other situation.

Question here would be: Could vzdump handle such a situation more gracefully - if not imho it should simply throw an error, but as the lvm call hangs I suppose it doesn't get so far as to error-handling for the script.

Probably suspend or shutdown would be the way to go for the backup, but I'd much prefer some sort of online-backup, but I guess we will have to go for in-VE-backup as the snapshotting isn't as stable for us as it should be.

regards
hk
 

hk@

Active Member
Feb 10, 2010
221
5
38
Vienna
kapper.net
Now I get you - yes both servers had the snapshot still there (and active) - we had to lvremove it after the reboot.

regards
hk
 

e100

Renowned Member
Nov 6, 2010
1,248
33
68
Columbus, Ohio
ulbuilder.wordpress.com
When it hangs, can you run the mount command and see what things are still mounted?

Maybe the filesystem is not getting dismounted before the LVM is trying to be removed.
This is something different when backing up OpenVZ machines from backing up KVM machines(which is mostly all I run).
I have not seen this issue backing up KVM VMs.
 

dietmar

Proxmox Staff Member
Staff member
Apr 28, 2005
17,110
509
133
Austria
www.proxmox.com
Question here would be: Could vzdump handle such a situation more gracefully - if not imho it should simply throw an error, but as the lvm call hangs I suppose it doesn't get so far as to error-handling for the script.

We already try to handle that gracefully, but we can't do anything if lvm hangs (the process script is simply stopped by the kernel in that case).
 

dietmar

Proxmox Staff Member
Staff member
Apr 28, 2005
17,110
509
133
Austria
www.proxmox.com

hk@

Active Member
Feb 10, 2010
221
5
38
Vienna
kapper.net
Hiwe tested the vzdump for both machines last night and it seems to work without any trouble.I'm still not totally confident, that we might not run into the same problem again, but I hope it was something related to the previous 2.6.32-10 kernel.we're now running:pveversion -vpve-manager: 2.0-57 (pve-manager/2.0/ff6cd700)running kernel: 2.6.32-11-pveproxmox-ve-2.6.32: 2.0-65pve-kernel-2.6.32-11-pve: 2.6.32-65pve-kernel-2.6.32-7-pve: 2.6.32-60lvm2: 2.02.88-2pve2clvm: 2.02.88-2pve2corosync-pve: 1.4.1-1openais-pve: 1.1.4-2libqb: 0.10.1-2redhat-cluster-pve: 3.1.8-3resource-agents-pve: 3.9.2-3fence-agents-pve: 3.1.7-2pve-cluster: 1.0-26qemu-server: 2.0-37pve-firmware: 1.0-15libpve-common-perl: 1.0-25libpve-access-control: 1.0-17libpve-storage-perl: 2.0-17vncterm: 1.0-2vzctl: 3.0.30-2pve2vzprocps: 2.0.11-2vzquota: 3.0.12-3pve-qemu-kvm: 1.0-9ksm-control-daemon: 1.1-1and the hardware is at the latest firmware for the raid-controllers:ServeRAID 8k: Controller Version Information -------------------------------------------------------- BIOS : 5.2-0 (16002) Firmware : 5.2-0 (16002) Driver : 1.1-7 (28000) Boot Flash : 5.1-0 (15411)LSI MegaRAID:System Operating System: Linux version 2.6.32-11-pve Driver Version: 00.00.05.40-rh2 CLI Version: 8.02.16Hardware Controller ProductName : MegaRAID SAS 8708EM2(Bus 0, Dev 0) SAS Address : 500605b004497d70 FW Package Version: 11.0.1-0042 Status : Optimalthe driver (Driver Version: 00.00.05.40-rh2) included with proxmox for LSI might be a problem according to http://www.anchor.com.au/blog/2012/...rnel-megaraid_sas-driver-from-crash-to-patch/ - but I have to admit I'm not an expert for this.btw. we're getting our arcconf and megacli from this repo and simply added this to our apt sources.list:# HW-Raiddeb http://hwraid.le-vert.net/debian squeeze mainand I agree with the forum member marotori that it would be great to get the potentially fixed megaraid-driver into the kernel :)regardshk
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get your own in 60 seconds.

Buy now!