i/o problem

udi

hi,
today's backup task doesn't want to finish.
the node's io delay is shown at around 50% and the vm icons are black, but the vms are still running.
here's the syslog:
Code:
Dec 11 23:28:09 genya vzdump[279919]: INFO: Starting Backup of VM 109 (qemu)
Dec 11 23:28:09 genya qm[282317]: <root@pam> update VM 109: -lock backup
Dec 11 23:28:11 genya qm[282331]: <root@pam> starting task UPID:genya:00044EDC:010C0F34:4EE52E7B:qmsuspend:109:root@pam:
Dec 11 23:28:11 genya qm[282332]: suspend VM 109: UPID:genya:00044EDC:010C0F34:4EE52E7B:qmsuspend:109:root@pam:
Dec 11 23:28:11 genya qm[282331]: <root@pam> end task UPID:genya:00044EDC:010C0F34:4EE52E7B:qmsuspend:109:root@pam: OK
Dec 11 23:28:12 genya qm[282391]: <root@pam> starting task UPID:genya:00044F18:010C0F95:4EE52E7C:qmresume:109:root@pam:
Dec 11 23:28:12 genya qm[282392]: resume VM 109: UPID:genya:00044F18:010C0F95:4EE52E7C:qmresume:109:root@pam:
Dec 11 23:28:12 genya qm[282391]: <root@pam> end task UPID:genya:00044F18:010C0F95:4EE52E7C:qmresume:109:root@pam: OK
Dec 11 23:28:25 genya pvestatd[2146]: status update time (5.661 seconds)
Dec 11 23:28:35 genya pvestatd[2146]: status update time (6.015 seconds)
Dec 11 23:32:17 genya pvestatd[2146]: status update time (7.153 seconds)
Dec 11 23:32:25 genya pvestatd[2146]: status update time (6.422 seconds)
Dec 11 23:32:35 genya pvestatd[2146]: status update time (5.765 seconds)
Dec 11 23:32:45 genya pvestatd[2146]: status update time (5.923 seconds)
Dec 11 23:32:54 genya pvestatd[2146]: status update time (5.521 seconds)
Dec 11 23:34:35 genya pvestatd[2146]: status update time (5.484 seconds)
Dec 11 23:34:45 genya pvestatd[2146]: status update time (6.319 seconds)
Dec 11 23:34:58 genya pvestatd[2146]: status update time (9.222 seconds)
Dec 11 23:35:04 genya pvestatd[2146]: status update time (5.222 seconds)
Dec 11 23:35:14 genya pvestatd[2146]: status update time (5.201 seconds)
Dec 11 23:35:25 genya pvestatd[2146]: status update time (5.814 seconds)
Dec 11 23:35:36 genya pvestatd[2146]: status update time (7.018 seconds)
Dec 11 23:37:25 genya pvedaemon[280295]: <root@pam> successful auth for user 'root@pam'
Dec 11 23:38:16 genya kernel: INFO: task lvremove:283008 blocked for more than 120 seconds.
Dec 11 23:38:16 genya kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 11 23:38:16 genya kernel: lvremove      D ffff8802bb8fd3e0     0 283008 279919    0 0x00000000
Dec 11 23:38:16 genya kernel: ffff8804488a3b78 0000000000000086 0000000000000000 ffff8805294cd500
Dec 11 23:38:16 genya kernel: ffff8804488a3b48 ffffffff813f473c 0000000000000008 000000010a7aef82
Dec 11 23:38:16 genya kernel: ffff8802bb8fd9a8 ffff8804488a3fd8 000000000000f788 ffff8802bb8fd9a8
Dec 11 23:38:16 genya kernel: Call Trace:
Dec 11 23:38:16 genya kernel: [<ffffffff813f473c>] ? dm_table_unplug_all+0x5c/0xd0
Dec 11 23:38:16 genya kernel: [<ffffffff814e7a93>] io_schedule+0xa3/0x110
Dec 11 23:38:16 genya kernel: [<ffffffff811c328e>] __blockdev_direct_IO+0x6fe/0xc20
Dec 11 23:38:16 genya kernel: [<ffffffff81242dcd>] ? get_disk+0x7d/0xf0
Dec 11 23:38:16 genya kernel: [<ffffffff811c0e97>] blkdev_direct_IO+0x57/0x60
Dec 11 23:38:16 genya kernel: [<ffffffff811c0060>] ? blkdev_get_blocks+0x0/0xc0
Dec 11 23:38:16 genya kernel: [<ffffffff8111f3ab>] generic_file_aio_read+0x70b/0x780
Dec 11 23:38:16 genya kernel: [<ffffffff811c1971>] ? blkdev_open+0x71/0xc0
Dec 11 23:38:16 genya kernel: [<ffffffff81184753>] ? __dentry_open+0x113/0x330
Dec 11 23:38:16 genya kernel: [<ffffffff8121ece8>] ? devcgroup_inode_permission+0x48/0x50
Dec 11 23:38:16 genya kernel: [<ffffffff811870da>] do_sync_read+0xfa/0x140
Dec 11 23:38:16 genya kernel: [<ffffffff81198252>] ? user_path_at+0x62/0xa0
Dec 11 23:38:16 genya kernel: [<ffffffff810922d0>] ? autoremove_wake_function+0x0/0x40
Dec 11 23:38:16 genya kernel: [<ffffffff811c042c>] ? block_ioctl+0x3c/0x40
Dec 11 23:38:16 genya kernel: [<ffffffff8119a862>] ? vfs_ioctl+0x22/0xa0
Dec 11 23:38:16 genya kernel: [<ffffffff8119aa0a>] ? do_vfs_ioctl+0x8a/0x5d0
Dec 11 23:38:16 genya kernel: [<ffffffff81187ae5>] vfs_read+0xb5/0x1a0
Dec 11 23:38:16 genya kernel: [<ffffffff81187c21>] sys_read+0x51/0x90
Dec 11 23:38:16 genya kernel: [<ffffffff8100b242>] system_call_fastpath+0x16/0x1b
Dec 11 23:38:16 genya kernel: INFO: task vgs:283018 blocked for more than 120 seconds.
Dec 11 23:38:16 genya kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 11 23:38:16 genya kernel: vgs           D ffff8805390a0ee0     0 283018   2146    0 0x00000000
Dec 11 23:38:16 genya kernel: ffff88033f04bb78 0000000000000082 ffff88036c7c1f80 ffff8805294cd500
Dec 11 23:38:16 genya kernel: ffff88033f04bb48 ffffffff813f473c 0000000000000008 0000000000001000
Dec 11 23:38:16 genya kernel: ffff8805390a14a8 ffff88033f04bfd8 000000000000f788 ffff8805390a14a8
Dec 11 23:38:16 genya kernel: Call Trace:
Dec 11 23:38:16 genya kernel: [<ffffffff813f473c>] ? dm_table_unplug_all+0x5c/0xd0
Dec 11 23:38:16 genya kernel: [<ffffffff8109cc89>] ? ktime_get_ts+0xa9/0xe0
Dec 11 23:38:16 genya kernel: [<ffffffff814e7a93>] io_schedule+0xa3/0x110
Dec 11 23:38:16 genya kernel: [<ffffffff811c328e>] __blockdev_direct_IO+0x6fe/0xc20
Dec 11 23:38:16 genya kernel: [<ffffffff81242dcd>] ? get_disk+0x7d/0xf0
Dec 11 23:38:16 genya kernel: [<ffffffff811c0e97>] blkdev_direct_IO+0x57/0x60
Dec 11 23:38:16 genya kernel: [<ffffffff811c0060>] ? blkdev_get_blocks+0x0/0xc0
Dec 11 23:38:16 genya kernel: [<ffffffff8111f3ab>] generic_file_aio_read+0x70b/0x780
Dec 11 23:38:16 genya kernel: [<ffffffff811c1971>] ? blkdev_open+0x71/0xc0
Dec 11 23:38:16 genya kernel: [<ffffffff81184753>] ? __dentry_open+0x113/0x330
Dec 11 23:38:16 genya kernel: [<ffffffff8121ece8>] ? devcgroup_inode_permission+0x48/0x50
Dec 11 23:38:16 genya kernel: [<ffffffff811870da>] do_sync_read+0xfa/0x140
Dec 11 23:38:16 genya kernel: [<ffffffff81198252>] ? user_path_at+0x62/0xa0
Dec 11 23:38:16 genya kernel: [<ffffffff810922d0>] ? autoremove_wake_function+0x0/0x40
Dec 11 23:38:16 genya kernel: [<ffffffff811c042c>] ? block_ioctl+0x3c/0x40
Dec 11 23:38:16 genya kernel: [<ffffffff8119a862>] ? vfs_ioctl+0x22/0xa0
Dec 11 23:38:16 genya kernel: [<ffffffff8119aa0a>] ? do_vfs_ioctl+0x8a/0x5d0
Dec 11 23:38:16 genya kernel: [<ffffffff81187ae5>] vfs_read+0xb5/0x1a0
Dec 11 23:38:16 genya kernel: [<ffffffff81187c21>] sys_read+0x51/0x90
Dec 11 23:38:16 genya kernel: [<ffffffff8100b242>] system_call_fastpath+0x16/0x1b
what should i do? i haven't rebooted yet.
 
things got worse: now i cannot log in to the gui (Login failed. Please try again).

lvdisplay and other lvm-related commands hang the ssh console.

pveperf on the raid array is normal, but on the backup hdd it's ~80 MB/s. yesterday's backup files are all there.

the vms are still running.

# uname -a
Linux genya 2.6.32-6-pve #1 SMP Tue Dec 6 11:06:22 CET 2011 x86_64 GNU/Linux

# pveversion -v
pve-manager: 2.0-14 (pve-manager/2.0/6a150142)
running kernel: 2.6.32-6-pve
proxmox-ve-2.6.32: 2.0-54
pve-kernel-2.6.32-6-pve: 2.6.32-54
lvm2: 2.02.86-1pve2
clvm: 2.02.86-1pve2
corosync-pve: 1.4.1-1
openais-pve: 1.1.4-1
libqb: 0.6.0-1
redhat-cluster-pve: 3.1.7-1
pve-cluster: 1.0-12
qemu-server: 2.0-11
pve-firmware: 1.0-13
libpve-common-perl: 1.0-10
libpve-access-control: 1.0-3
libpve-storage-perl: 2.0-9
vncterm: 1.0-2
vzctl: 3.0.29-3pve7
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-1
ksm-control-daemon: 1.1-1
 
hi,
today i could not wait any longer and had to reset the server (it would not simply reboot).
i was afraid of what would happen, but everything came back and all is fine now.

i would be glad if someone could help me figure out whether this was a hardware or a software issue.
the machine is an x3650 with a serveraid 8k and 1tb wd re4 disks, with the latest bios and firmware.

thank you
u.
 
backup causes this

This has been randomly occurring to us.

When we use snapshot backups, the containers cannot be entered or stopped. This has happened since the 1st beta on one system, and I thought it was due to hardware issues since it had older disks.

But after today's backup it happened on another system. Here is the last part of dmesg:
Code:
CT: 8002: stopped
Ub 8002 helds 2312 in tcpsndbuf on put
UB: leaked beancounter 8002 (ffff88063becc8c0)
CT: 8002: started
device veth8002.0 entered promiscuous mode
vmbr0: port 2(veth8002.0) entering forwarding state
veth8002.0: no IPv6 routers present
eth0: no IPv6 routers present
EXT3-fs: barriers disabled
kjournald starting.  Commit interval 5 seconds
EXT3-fs (dm-4): using internal journal
ext3_orphan_cleanup: deleting unreferenced inode 4310159
ext3_orphan_cleanup: deleting unreferenced inode 4310158
ext3_orphan_cleanup: deleting unreferenced inode 4310157
ext3_orphan_cleanup: deleting unreferenced inode 4310156
ext3_orphan_cleanup: deleting unreferenced inode 4310155
ext3_orphan_cleanup: deleting unreferenced inode 3473643
ext3_orphan_cleanup: deleting unreferenced inode 2959601
ext3_orphan_cleanup: deleting unreferenced inode 2811924
ext3_orphan_cleanup: deleting unreferenced inode 2811923
ext3_orphan_cleanup: deleting unreferenced inode 2811922
ext3_orphan_cleanup: deleting unreferenced inode 2811921
ext3_orphan_cleanup: deleting unreferenced inode 2811920
ext3_orphan_cleanup: deleting unreferenced inode 4179611
ext3_orphan_cleanup: deleting unreferenced inode 4179609
ext3_orphan_cleanup: deleting unreferenced inode 4179604
ext3_orphan_cleanup: deleting unreferenced inode 4179602
ext3_orphan_cleanup: deleting unreferenced inode 4179578
EXT3-fs (dm-4): 17 orphan inodes deleted
EXT3-fs (dm-4): recovery complete
EXT3-fs (dm-4): mounted filesystem with ordered data mode
INFO: task kjournald:1080 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kjournald     D ffff88033b1ba980     0  1080      2    0 0x00000000
 ffff880337f53c30 0000000000000046 0000000000000000 ffff88063be9b300
 ffff880337f53c00 ffffffff813f473c ffff880337f53be0 0000000112e2395e
 ffff88033b1baf48 ffff880337f53fd8 000000000000f788 ffff88033b1baf48
Call Trace:
 [<ffffffff813f473c>] ? dm_table_unplug_all+0x5c/0xd0
 [<ffffffff814e7a93>] io_schedule+0xa3/0x110
 [<ffffffff811bae80>] ? sync_buffer+0x0/0x50
 [<ffffffff811baec5>] sync_buffer+0x45/0x50
 [<ffffffff814e832f>] __wait_on_bit+0x5f/0x90
 [<ffffffff811bae80>] ? sync_buffer+0x0/0x50
 [<ffffffff814e83d8>] out_of_line_wait_on_bit+0x78/0x90
 [<ffffffff81092310>] ? wake_bit_function+0x0/0x40
 [<ffffffff811bae76>] __wait_on_buffer+0x26/0x30
 [<ffffffffa0075eee>] journal_commit_transaction+0x9ce/0x1130 [jbd]
 [<ffffffff8107afbc>] ? lock_timer_base+0x3c/0x70
 [<ffffffff8107bc1b>] ? try_to_del_timer_sync+0x7b/0xe0
 [<ffffffffa0078fc8>] kjournald+0xe8/0x250 [jbd]
 [<ffffffff810922d0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0078ee0>] ? kjournald+0x0/0x250 [jbd]
 [<ffffffff81091cf6>] kthread+0x96/0xa0
 [<ffffffff8100c2ca>] child_rip+0xa/0x20
 [<ffffffff81091c60>] ? kthread+0x0/0xa0
 [<ffffffff8100c2c0>] ? child_rip+0x0/0x20
INFO: task flush-253:2:1452 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-253:2   D ffff88033abaae60     0  1452      2    0 0x00000000
 ffff880337963940 0000000000000046 0000000000000000 ffffffff813f473c
 ffff88029942ef00 0000000000000001 ffff8802b5091080 0000000112e27fe5
 ffff88033abab428 ffff880337963fd8 000000000000f788 ffff88033abab428
Call Trace:
 [<ffffffff813f473c>] ? dm_table_unplug_all+0x5c/0xd0
 [<ffffffff8111d4b0>] ? sync_page+0x0/0x50
 [<ffffffff814e7a93>] io_schedule+0xa3/0x110
 [<ffffffff8111d4ed>] sync_page+0x3d/0x50
 [<ffffffff814e81da>] __wait_on_bit_lock+0x5a/0xc0
 [<ffffffff8111d487>] __lock_page+0x67/0x70
 [<ffffffff81092310>] ? wake_bit_function+0x0/0x40
 [<ffffffff81133e3a>] write_cache_pages+0x36a/0x480
 [<ffffffff811326e0>] ? __writepage+0x0/0x40
 [<ffffffff81133f74>] generic_writepages+0x24/0x30
 [<ffffffff81133fb5>] do_writepages+0x35/0x40
 [<ffffffff811b1f6d>] __writeback_single_inode+0xdd/0x2c0
 [<ffffffff811b21d3>] writeback_single_inode+0x83/0xc0
 [<ffffffff811a1e30>] ? iput+0x30/0x70
 [<ffffffff811b2436>] writeback_sb_inodes+0xe6/0x1a0
 [<ffffffff811b259b>] writeback_inodes_wb+0xab/0x1b0
 [<ffffffff811b294b>] wb_writeback+0x2ab/0x400
 [<ffffffff814e71da>] ? thread_return+0x4e/0x864
 [<ffffffff811b2c49>] wb_do_writeback+0x1a9/0x250
 [<ffffffff8107b0d0>] ? process_timeout+0x0/0x10
 [<ffffffff811b2d53>] bdi_writeback_task+0x63/0x1b0
 [<ffffffff810921a7>] ? bit_waitqueue+0x17/0xc0
 [<ffffffff81146bc0>] ? bdi_start_fn+0x0/0x100
 [<ffffffff81146c46>] bdi_start_fn+0x86/0x100
 [<ffffffff81146bc0>] ? bdi_start_fn+0x0/0x100
 [<ffffffff81091cf6>] kthread+0x96/0xa0
 [<ffffffff8100c2ca>] child_rip+0xa/0x20
 [<ffffffff81091c60>] ? kthread+0x0/0xa0
 [<ffffffff8100c2c0>] ? child_rip+0x0/0x20
INFO: task rs:main Q:Reg:460298 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rs:main Q:Reg D ffff88033b15a280     0 460298 460192 8002 0x00000000
 ffff88015557b938 0000000000000082 0000000000000000 ffff88063be9b300
 ffff88015557b908 ffffffff813f473c 0000000000000000 0000000112e2399a
 ffff88033b15a848 ffff88015557bfd8 000000000000f788 ffff88033b15a848
Call Trace:
 [<ffffffff813f473c>] ? dm_table_unplug_all+0x5c/0xd0
 [<ffffffff811bae80>] ? sync_buffer+0x0/0x50
 [<ffffffff814e7a93>] io_schedule+0xa3/0x110
 [<ffffffff811baec5>] sync_buffer+0x45/0x50
 [<ffffffff814e81da>] __wait_on_bit_lock+0x5a/0xc0
 [<ffffffff811bae80>] ? sync_buffer+0x0/0x50
 [<ffffffff814e82b8>] out_of_line_wait_on_bit_lock+0x78/0x90
 [<ffffffff81092310>] ? wake_bit_function+0x0/0x40
 [<ffffffff811bb046>] __lock_buffer+0x36/0x40
 [<ffffffff811bbf68>] __sync_dirty_buffer+0xc8/0xf0
 [<ffffffff811bbfa3>] sync_dirty_buffer+0x13/0x20
 [<ffffffffa0074bbe>] journal_dirty_data+0x1de/0x270 [jbd]
 [<ffffffffa008efb0>] ext3_journal_dirty_data+0x20/0x50 [ext3]
 [<ffffffffa008f005>] journal_dirty_data_fn+0x25/0x30 [ext3]
 [<ffffffffa008e097>] walk_page_buffers+0x87/0xc0 [ext3]
 [<ffffffffa008efe0>] ? journal_dirty_data_fn+0x0/0x30 [ext3]
 [<ffffffffa00925b4>] ext3_ordered_write_end+0x84/0x170 [ext3]
 [<ffffffff8111e004>] generic_file_buffered_write+0x194/0x2c0
 [<ffffffff8111faf0>] __generic_file_aio_write+0x240/0x470
 [<ffffffff8111fd8f>] generic_file_aio_write+0x6f/0xe0
 [<ffffffff81186f9a>] do_sync_write+0xfa/0x140
 [<ffffffff810922d0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8119e988>] ? d_free+0x58/0x60
 [<ffffffff811a7660>] ? mntput_no_expire+0x30/0x110
 [<ffffffff81187278>] vfs_write+0xb8/0x1a0
 [<ffffffff81187cb1>] sys_write+0x51/0x90
 [<ffffffff8100b242>] system_call_fastpath+0x16/0x1b
INFO: task lvremove:825718 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lvremove      D ffff88043aec6c20     0 825718 816722    0 0x00000000
 ffff8804458e1b78 0000000000000086 0000000000000000 ffff88063be9b300
 ffff8804458e1b48 ffffffff813f473c 0000000000000008 0000000112e224a0
 ffff88043aec71e8 ffff8804458e1fd8 000000000000f788 ffff88043aec71e8
...
 
Re: backup causes this

root@fbc186 /etc # pveversion -v
pve-manager: 2.0-14 (pve-manager/2.0/6a150142)
running kernel: 2.6.32-6-pve
proxmox-ve-2.6.32: 2.0-54
pve-kernel-2.6.32-6-pve: 2.6.32-54
lvm2: 2.02.86-1pve2
clvm: 2.02.86-1pve2
corosync-pve: 1.4.1-1
openais-pve: 1.1.4-1
libqb: 0.6.0-1
redhat-cluster-pve: 3.1.7-1
pve-cluster: 1.0-12
qemu-server: 2.0-11
pve-firmware: 1.0-13
libpve-common-perl: 1.0-10
libpve-access-control: 1.0-3
libpve-storage-perl: 2.0-9
vncterm: 1.0-2
vzctl: 3.0.29-3pve7
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-1
ksm-control-daemon: 1.1-1
 
Re: backup causes this

pve backup log:

INFO: starting new backup job: vzdump --quiet 1 --mailto xxxx --mode snapshot --node fbc186 --all 1 --compress 1 --maxfiles 1 --storage fbc186storage
INFO: Starting Backup of VM 10001 (openvz)
INFO: CTID 10001 exist mounted running
INFO: status = running
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating lvm snapshot of /dev/mapper/pve-data ('/dev/pve/vzsnap-fbc186-0')
INFO: Logical volume "vzsnap-fbc186-0" created
INFO: creating archive '/data/fbc186storage/dump/vzdump-openvz-10001-2011_12_13-06_00_02.tgz'
INFO: Total bytes written: 5996288000 (5.6GiB, 8.8MiB/s)
INFO: archive file size: 5.17GB
INFO: delete old backup '/data/fbc186storage/dump/vzdump-openvz-10001-2011_12_02-11_12_00.tgz'
INFO: delete old backup '/data/fbc186storage/dump/vzdump-openvz-10001-2011_12_05-07_00_03.tgz'
 
Re: backup causes this

None of the other CTs are accessible.

I cannot use vzctl or ssh to enter them.

Those other CTs have not had a backup done.

The system got stuck during the 1st backup.
 
Re: backup causes this

The syslogs do not help; here is today's vzctl.log:

2011-12-12T21:01:21-0500 vzctl : CT 8002 : Configure veth devices: veth8002.0
2011-12-12T21:01:22-0500 vzctl : CT 8002 : Container start in progress...
2011-12-13T07:09:29-0500 vzctl : CT 8002 : Stopping container ...
2011-12-13T07:12:29-0500 vzctl : CT 8002 : Unable to stop container: operation timed out
2011-12-13T07:12:53-0500 vzctl : CT 8002 : Stopping container ...
2011-12-13T07:15:53-0500 vzctl : CT 8002 : Unable to stop container: operation timed out
2011-12-13T07:30:12-0500 vzctl : CT 8002 : Stopping container ...
2011-12-13T07:33:12-0500 vzctl : CT 8002 : Unable to stop container: operation timed out
 
Try changing your backup type to suspend, as I do not think it uses an LVM snapshot.

Otherwise use a stop-type backup.
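
For a quick one-off test from the CLI it would be something like this (VMID and storage name are just copied from your backup log, so adjust them to your setup):
Code:
# back up a single container with suspend mode instead of snapshot
vzdump 10001 --mode suspend --compress 1 --storage fbc186storage

# or, if suspend still misbehaves, stop the container for the backup
vzdump 10001 --mode stop --compress 1 --storage fbc186storage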
 
Recently, after upgrading to 1.9, I had the same issues with backups. One of them wrote an enormous amount of error text to the system logs, about 90 GB of log files (3 x 30 GB). That caused the I/O problems; I had to change the backup to "stop" mode because suspend didn't help either. After that change, this week's backup went quite smoothly.

And because /mnt/pve-root was full, I could not use most commands on Proxmox (ssh etc.), as there was no space left for disk operations.
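
In case it helps anyone else, spotting a full root filesystem / runaway logs is just a matter of something like:
Code:
# how full is the root filesystem?
df -h /

# which logs are eating the space?
du -sh /var/log/* | sort -h | tail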
 
Try changing your backup type to suspend, as I do not think it uses an LVM snapshot.

Otherwise use a stop-type backup.
suspend or stop is not an option for me :(

anyway, i changed the backup job to compress files and it no longer produces the io errors.

except for one thing: a stopped vm generates an error:
Code:
INFO: Starting Backup of VM 111 (qemu)
INFO: status = stopped
INFO: backup mode: stop
INFO: bandwidth limit: 1024000 KB/s
INFO: ionice priority: 7
ERROR: Backup of VM 111 failed - no such volume 'vgvirt:vm-111-disk-1'
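
for reference, the obvious things to check here are probably the vm config and whether that lv really exists on the storage, e.g. (just a guess at where the mismatch is):
Code:
# which disks does the VM config reference?
qm config 111

# does the storage actually contain that volume? (pvesm should be available on 2.0)
pvesm list vgvirt

# or look at the logical volumes directly
lvs vgvirt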
 
try setting this in

/etc/vzdump.conf

size: 4000

-------------------------
I think the max size is the VFree shown in the output of 'vgs'. Here is one node here:
Code:
root@fbc1 /etc # vgs
  VG   #PV #LV #SN Attr   VSize   VFree 
  fbc1   1   1   0 wz--n- 831.30g 31.30g
  pve    1   3   0 wz--n-  99.50g  4.00g

So check the VFree of the volume group you are using and set 'size' close to that.
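
If it's not obvious where that number should come from: as far as I know 'size' is the LVM snapshot size vzdump requests, in MB, so a quick way to see what each volume group can afford is:
Code:
# free space per volume group, in MB
vgs --units m -o vg_name,vg_size,vg_free

# then in /etc/vzdump.conf set 'size' somewhat below the VFree of the VG
# that holds your guest data, e.g.
#   size: 4000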
 
yesterday it happened again, the same as in the first post.

the similarity i noticed is that the backup hung at a windows machine which has 2 disks.
the difference is that i tried to make compressed backups.
the node had the 2.6.32-54 kernel.

i could use some help from the staff :)

thank you
u.
 
Exact same problems here. It occurs on different systems and compression was on. The cluster is running on an IMS and backing up through an NFS share to a Synology DS810+.
 
We have completely disabled the backups for now so the cluster will function as normal. The strange thing is that it occurs on a different node every time...
 
Seems like this is a problem with LVM.

bread-baker's and udi's output both report that lvremove is hanging.

A Google search for lvremove hangs leads me here:
https://www.multifake.net/2011/12/debian-squeeze-lvm-udev-etc-buggy/

Interesting that this guy is using Squeeze with Xen and having very similar problems.

If you can reproduce the problem, can you run other LVM commands like lvs, lvdisplay, or lvscan while the backup hangs? (Something like the sketch at the end of this post.)

EDIT:
Found this bug:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=618016 that links to:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=549691 which might have a workaround
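
To be concrete, this is the kind of data that would help next time it hangs (the sysrq dump needs kernel.sysrq enabled, and the udevadm part is just a hunch given the udev angle in those bug reports):
Code:
# do the LVM tools themselves hang?
lvs
lvscan
lvdisplay

# dump all blocked (D-state) tasks to the kernel log
echo w > /proc/sysrq-trigger
dmesg | tail -n 200

# watch udev events while the backup / lvremove runs
udevadm monitor --environment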
 
"If you can reproduce the problem can you run other lvm commands like lvs lvdisplay, lvscan while the backup hangs?

As far as I can remember , no lvm commands worked .

and the only way to get the system working again was to do a hardware power reset .

after reading https://www.multifake.net/2011/12/debian-squeeze-lvm-udev-etc-buggy/ , I wonder if the issue is related to having a usb drive attached... I often have a 500gb drive , formatted ext3 attached to proxmox 2.0 test servers. [we use to transport pve dumps off site ].
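
I guess an easy way to test that theory is to detach the USB disk before the next backup window, or at least to check the kernel log for USB resets / disk errors around the time of the hang, e.g.:
Code:
# look for usb-storage resets or disk errors near the hang
dmesg | grep -iE 'usb|reset|i/o error' | tail -n 50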
 