Hi guys,
I've been struggling with a problem for a few days now and I can't find a solution.
When restoring a VM dump from a USB drive, the InfiniBand interface becomes unresponsive at the IPoIB layer a few minutes into the restore process. I can bring it back with:
Code:
service networking restart
I'm not sure whether this scenario is the only thing that triggers the issue, as I have had an IB port react in the same way when putting an average load onto the Ceph storage cluster.
Here is the kern.log from the time I mount the USB drive until the IB interface stops working.
Code:
Nov 13 15:03:14 jhb-tc-pve-a kernel: [ 1714.145101] EXT4-fs (sdh1): recovery complete
Nov 13 15:03:14 jhb-tc-pve-a kernel: [ 1714.145281] EXT4-fs (sdh1): mounted filesystem with ordered data mode. Opts: (null)
Nov 13 15:03:26 jhb-tc-pve-a pvedaemon[2325]: <root@pam> starting task UPID:jhb-tc-pve-a:000037FD:0002A255:5645DF9E:qmrestore:104:root@pam:
Nov 13 15:03:30 jhb-tc-pve-a kernel: [ 1729.707159] Key type ceph registered
Nov 13 15:03:30 jhb-tc-pve-a kernel: [ 1729.707470] libceph: loaded (mon/osd proto 15/24)
Nov 13 15:03:30 jhb-tc-pve-a kernel: [ 1729.709242] rbd: loaded (major 251)
Nov 13 15:03:30 jhb-tc-pve-a kernel: [ 1729.712515] libceph: client744804 fsid acaae5c1-7e7d-4482-8993-bcbbb04e7870
Nov 13 15:03:30 jhb-tc-pve-a kernel: [ 1729.713698] libceph: mon0 10.10.10.10:6789 session established
Nov 13 15:03:30 jhb-tc-pve-a kernel: [ 1729.741190] rbd: rbd0: added with size 0x3200000000
Nov 13 15:03:30 jhb-tc-pve-a kernel: [ 1730.329258] rbd: rbd1: added with size 0x7d00000000
Nov 13 15:04:06 jhb-tc-pve-a kernel: [ 1766.569633] libceph: mon0 10.10.10.10:6789 socket closed (con state OPEN)
Nov 13 15:04:06 jhb-tc-pve-a kernel: [ 1766.569660] libceph: mon0 10.10.10.10:6789 session lost, hunting for new mon
Nov 13 15:04:28 jhb-tc-pve-a kernel: [ 1787.948248] libceph: mon1 10.10.10.11:6789 socket closed (con state CONNECTING)
Nov 13 15:04:36 jhb-tc-pve-a kernel: [ 1796.574077] libceph: mon0 10.10.10.10:6789 socket closed (con state OPEN)
Nov 13 15:04:42 jhb-tc-pve-a kernel: [ 1802.553924] libceph: mon2 10.10.10.12:6789 socket closed (con state CONNECTING)
Nov 13 15:04:52 jhb-tc-pve-a kernel: [ 1812.519390] libceph: mon1 10.10.10.11:6789 socket closed (con state CONNECTING)
Nov 13 15:05:01 jhb-tc-pve-a kernel: [ 1821.520016] libceph: mon1 10.10.10.11:6789 socket closed (con state CONNECTING)
Nov 13 15:05:18 jhb-tc-pve-a kernel: [ 1837.774144] libceph: mon1 10.10.10.11:6789 socket closed (con state CONNECTING)
Nov 13 15:05:22 jhb-tc-pve-a kernel: [ 1841.762128] libceph: mon2 10.10.10.12:6789 socket closed (con state CONNECTING)
Nov 13 15:05:36 jhb-tc-pve-a kernel: [ 1856.582368] libceph: mon0 10.10.10.10:6789 socket closed (con state OPEN)
Nov 13 15:05:48 jhb-tc-pve-a kernel: [ 1867.824753] libceph: mon2 10.10.10.12:6789 socket closed (con state CONNECTING)
Nov 13 15:05:51 jhb-tc-pve-a kernel: [ 1871.536953] libceph: mon1 10.10.10.11:6789 socket closed (con state CONNECTING)
Nov 13 15:06:16 jhb-tc-pve-a kernel: [ 1896.071937] libceph: mon2 10.10.10.12:6789 socket closed (con state CONNECTING)
Nov 13 15:06:26 jhb-tc-pve-a kernel: [ 1906.589293] libceph: mon0 10.10.10.10:6789 socket closed (con state OPEN)
Nov 13 15:06:31 jhb-tc-pve-a kernel: [ 1911.073091] libceph: mon2 10.10.10.12:6789 socket closed (con state CONNECTING)
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.074129] libceph: mon2 10.10.10.12:6789 socket closed (con state CONNECTING)
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.529877] INFO: task kworker/u26:2:291 blocked for more than 120 seconds.
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.529921] Tainted: P O 4.2.2-1-pve #1
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.529945] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.529980] kworker/u26:2 D 0000000000000006 0 291 2 0x00000000
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.529992] Workqueue: writeback wb_workfn (flush-251:0)
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.529995] ffff880869e274f8 0000000000000046 ffff88086b6aee00 ffff88086a44ee00
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.529998] 0000000000000000 ffff880869e28000 ffff88087fc16a00 7fffffffffffffff
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530000] ffff880869e27718 ffffe8ffffc0b900 ffff880869e27518 ffffffff817cc077
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530003] Call Trace:
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530014] [<ffffffff817cc077>] schedule+0x37/0x80
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530018] [<ffffffff817ceeb1>] schedule_timeout+0x201/0x2a0
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530023] [<ffffffff81380337>] ? blk_flush_plug_list+0xc7/0x220
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530030] [<ffffffff8101cc99>] ? read_tsc+0x9/0x10
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530033] [<ffffffff817cb66b>] io_schedule_timeout+0xbb/0x140
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530037] [<ffffffff8138ae69>] bt_get+0x129/0x1b0
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530044] [<ffffffff810b76e0>] ? wait_woken+0x90/0x90
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530047] [<ffffffff8138b247>] blk_mq_get_tag+0x97/0xc0
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530050] [<ffffffff81384cb2>] ? ll_back_merge_fn+0x132/0x190
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530053] [<ffffffff81386d1b>] __blk_mq_alloc_request+0x1b/0x1f0
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530055] [<ffffffff81388a4b>] blk_mq_map_request+0x17b/0x1c0
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530061] [<ffffffff811790f5>] ? mempool_alloc_slab+0x15/0x20
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530063] [<ffffffff81389953>] blk_sq_make_request+0x73/0x330
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530067] [<ffffffff8137bb0c>] ? generic_make_request_checks+0x1dc/0x3a0
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530070] [<ffffffff8137bd9c>] generic_make_request+0xcc/0x110
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530073] [<ffffffff8137be56>] submit_bio+0x76/0x180
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530079] [<ffffffff813d39c6>] ? __percpu_counter_add+0x56/0x70
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530083] [<ffffffff81223b4c>] submit_bh_wbc+0x14c/0x180
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530088] [<ffffffff81225bbd>] __block_write_full_page.constprop.37+0x11d/0x3c0
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530091] [<ffffffff81388602>] ? __blk_mq_run_hw_queue+0x1d2/0x360
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530093] [<ffffffff812261d0>] ? I_BDEV+0x20/0x20
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530096] [<ffffffff812261d0>] ? I_BDEV+0x20/0x20
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530098] [<ffffffff81225fab>] block_write_full_page+0x14b/0x170
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530102] [<ffffffff81226b78>] blkdev_writepage+0x18/0x20
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530104] [<ffffffff811815a7>] __writepage+0x17/0x40
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530107] [<ffffffff811837a5>] write_cache_pages+0x215/0x480
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530109] [<ffffffff81181590>] ? wb_position_ratio+0x1f0/0x1f0
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530112] [<ffffffff81183a50>] generic_writepages+0x40/0x60
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530114] [<ffffffff8118478e>] do_writepages+0x1e/0x30
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530117] [<ffffffff8121a645>] __writeback_single_inode+0x45/0x290
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530120] [<ffffffff8121ad68>] writeback_sb_inodes+0x228/0x480
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530123] [<ffffffff8121b049>] __writeback_inodes_wb+0x89/0xc0
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530126] [<ffffffff8121b268>] wb_writeback+0x1e8/0x280
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530128] [<ffffffff8121badd>] wb_workfn+0x2fd/0x470
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530135] [<ffffffff8108f807>] process_one_work+0x157/0x3f0
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530138] [<ffffffff81090229>] worker_thread+0x69/0x480
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530141] [<ffffffff810901c0>] ? rescuer_thread+0x310/0x310
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530145] [<ffffffff810957db>] kthread+0xdb/0x100
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530147] [<ffffffff81095700>] ? kthread_create_on_node+0x1c0/0x1c0
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530151] [<ffffffff817d019f>] ret_from_fork+0x3f/0x70
Nov 13 15:06:40 jhb-tc-pve-a kernel: [ 1920.530153] [<ffffffff81095700>] ? kthread_create_on_node+0x1c0/0x1c0
Nov 13 15:06:56 jhb-tc-pve-a kernel: [ 1936.593369] libceph: mon0 10.10.10.10:6789 socket closed (con state OPEN)
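For anyone who would rather skim the log than read it line by line, here is a rough Python sketch that tallies the libceph "socket closed" events per monitor and picks out the hung-task warnings. The function name `summarize` and both regular expressions are my own and only assume the line format shown in the log above; this is not a Ceph or Proxmox tool.

```python
import re
from collections import Counter

# Assumed line format (taken from the kern.log excerpt above):
#   "... kernel: [ 1766.569633] libceph: mon0 10.10.10.10:6789 socket closed (con state OPEN)"
MON_CLOSED = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+libceph: (?P<mon>mon\d+) \S+ "
    r"socket closed \(con state (?P<state>\w+)\)"
)
#   "... kernel: [ 1920.529877] INFO: task kworker/u26:2:291 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+INFO: task (?P<task>\S+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def summarize(lines):
    """Return (closed_counts, first_closed_uptime, hung_tasks) for an iterable of log lines.

    closed_counts       Counter keyed by (monitor, connection state)
    first_closed_uptime kernel uptime in seconds of the first "socket closed", or None
    hung_tasks          list of (uptime, task name, timeout in seconds)
    """
    closed = Counter()
    first_closed = None
    hung = []
    for line in lines:
        m = MON_CLOSED.search(line)
        if m:
            closed[(m["mon"], m["state"])] += 1
            if first_closed is None:
                first_closed = float(m["uptime"])
            continue
        m = HUNG_TASK.search(line)
        if m:
            hung.append((float(m["uptime"]), m["task"], int(m["secs"])))
    return closed, first_closed, hung
```

To use it, pass any iterable of lines, for example `summarize(open("/var/log/kern.log"))` (the path is an assumption; adjust it to wherever the log lives), and compare the uptime of the first closed socket with the uptime of the first hung-task entry.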
Any help will be greatly appreciated, even a pointer in the right direction.
Thanks
Shaun