Proxmox 2.2 crash: unable to connect to VM socket - timeout after 31 retries

sebastien

Feb 14, 2013
Hi there,

I've been using Proxmox for a couple of years without any major issues. I've just replaced my Proxmox server with a brand-new machine (Intel Xeon E31220 CPU, 8 GB RAM). Since then I've experienced multiple crashes. Here is the message displayed on the console:

Code:
Feb 14 10:06:24 proxmox-plateforme pvestatd[1915]: WARNING: unable to connect to VM 102 socket - timeout after 31 retries
Feb 14 10:06:27 proxmox-plateforme pvestatd[1915]: WARNING: unable to connect to VM 101 socket - timeout after 31 retries
Feb 14 10:06:30 proxmox-plateforme pvestatd[1915]: WARNING: unable to connect to VM 100 socket - timeout after 31 retries
Feb 14 10:06:30 proxmox-plateforme pvestatd[1915]: status update time (12.053 seconds)
Feb 14 10:06:33 proxmox-plateforme pvestatd[1915]: WARNING: unable to connect to VM 103 socket - timeout after 31 retries
Feb 14 10:06:36 proxmox-plateforme pvestatd[1915]: WARNING: unable to connect to VM 102 socket - timeout after 31 retries
Feb 14 10:06:39 proxmox-plateforme pvestatd[1915]: WARNING: unable to connect to VM 101 socket - timeout after 31 retries
Feb 14 10:06:42 proxmox-plateforme pvestatd[1915]: WARNING: unable to connect to VM 100 socket - timeout after 31 retries
Feb 14 10:06:42 proxmox-plateforme pvestatd[1915]: status update time (12.052 seconds)
Feb 14 10:06:45 proxmox-plateforme pvestatd[1915]: WARNING: unable to connect to VM 103 socket - timeout after 31 retries
Feb 14 10:06:48 proxmox-plateforme kernel: INFO: task scsi_eh_1:305 blocked for more than 120 seconds.
Feb 14 10:06:48 proxmox-plateforme kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 14 10:06:48 proxmox-plateforme kernel: scsi_eh_1     D ffff88020a24afd0     0   305      2    0 0x00000000
Feb 14 10:06:48 proxmox-plateforme kernel: ffff8802095b1bb0 0000000000000046 ffffffffa000a456 ffff88000001a680
Feb 14 10:06:48 proxmox-plateforme kernel: ffff8802095b1cb0 0000000000000000 0000000000000001 ffff880000033a00
Feb 14 10:06:48 proxmox-plateforme kernel: ffff88020e0b6090 ffff88020a24b580 ffff8802095b1fd8 ffff8802095b1fd8
Feb 14 10:06:48 proxmox-plateforme kernel: Call Trace:
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff8109cacf>] ? up+0x2f/0x50
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff81528ce5>] schedule_timeout+0x215/0x2e0
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff81528953>] wait_for_common+0x123/0x190
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff81059ed0>] ? default_wake_function+0x0/0x20
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffffa0001b4b>] ? enqueue_cmd_and_start_io+0x11b/0x180 [hpsa]
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff81528a7d>] wait_for_completion+0x1d/0x20
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffffa0004016>] hpsa_eh_device_reset_handler+0x116/0x3c0 [hpsa]
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff8136f5cc>] scsi_eh_ready_devs+0x23c/0x860
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff813702e3>] scsi_error_handler+0x4f3/0x6d0
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff8136fdf0>] ? scsi_error_handler+0x0/0x6d0
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff810964f6>] kthread+0x96/0xa0
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff81096460>] ? kthread+0x0/0xa0
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
Feb 14 10:06:48 proxmox-plateforme kernel: INFO: task kjournald:386 blocked for more than 120 seconds.
Feb 14 10:06:48 proxmox-plateforme kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 14 10:06:48 proxmox-plateforme kernel: kjournald     D ffff880209aa0200     0   386      2    0 0x00000000
Feb 14 10:06:48 proxmox-plateforme kernel: ffff880209a99c50 0000000000000046 ffff880209a99c10 ffffffff8141e08c
Feb 14 10:06:48 proxmox-plateforme kernel: ffff880209a99bc0 ffffffff81012b79 ffff880209a99c00 ffffffff810a1959
Feb 14 10:06:48 proxmox-plateforme kernel: 0000000003d3d0e8 ffff880209aa07b0 ffff880209a99fd8 ffff880209a99fd8
Feb 14 10:06:48 proxmox-plateforme kernel: Call Trace:
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff8141e08c>] ? dm_table_unplug_all+0x5c/0x100
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff81012b79>] ? read_tsc+0x9/0x20
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff810a1959>] ? ktime_get_ts+0xa9/0xe0
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff811caab0>] ? sync_buffer+0x0/0x50
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff815285f3>] io_schedule+0x73/0xc0
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff811caaf5>] sync_buffer+0x45/0x50
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff81528fbf>] __wait_on_bit+0x5f/0x90
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff811caab0>] ? sync_buffer+0x0/0x50
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff81529068>] out_of_line_wait_on_bit+0x78/0x90
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff81096b10>] ? wake_bit_function+0x0/0x40
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff811cbce6>] __wait_on_buffer+0x26/0x30
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffffa00b1f1e>] journal_commit_transaction+0x9fe/0x12f0 [jbd]
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff8107fbfc>] ? lock_timer_base+0x3c/0x70
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff8108085b>] ? try_to_del_timer_sync+0x7b/0xe0
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffffa00b7358>] kjournald+0xe8/0x250 [jbd]
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff81096ad0>] ? autoremove_wake_function+0x0/0x40
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffffa00b7270>] ? kjournald+0x0/0x250 [jbd]
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff810964f6>] kthread+0x96/0xa0
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff81096460>] ? kthread+0x0/0xa0
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
Feb 14 10:06:48 proxmox-plateforme kernel: INFO: task flush-253:0:920 blocked for more than 120 seconds.
Feb 14 10:06:48 proxmox-plateforme kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 14 10:06:48 proxmox-plateforme kernel: flush-253:0   D ffff88020cc46d10     0   920      2    0 0x00000000
Feb 14 10:06:48 proxmox-plateforme kernel: ffff88020d2d77b0 0000000000000046 0000000000000000 ffffffff8141e08c
Feb 14 10:06:48 proxmox-plateforme kernel: ffff8802094e6338 0000000000000008 0000000003d2ba98 0000000000800000
Feb 14 10:06:48 proxmox-plateforme kernel: ffff88020d2d7800 ffff88020cc472c0 ffff88020d2d7fd8 ffff88020d2d7fd8
Feb 14 10:06:48 proxmox-plateforme kernel: Call Trace:
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff8141e08c>] ? dm_table_unplug_all+0x5c/0x100
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff811caab0>] ? sync_buffer+0x0/0x50
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff815285f3>] io_schedule+0x73/0xc0
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff811caaf5>] sync_buffer+0x45/0x50
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff81528e6a>] __wait_on_bit_lock+0x5a/0xc0
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff811caab0>] ? sync_buffer+0x0/0x50
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff81528f48>] out_of_line_wait_on_bit_lock+0x78/0x90
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff81096b10>] ? wake_bit_function+0x0/0x40
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff8113df1c>] ? test_clear_page_writeback+0x8c/0x180
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff811cc0a0>] ? end_buffer_async_write+0x0/0x180
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff811cbe66>] __lock_buffer+0x36/0x40
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff811cca04>] __block_write_full_page+0x484/0x4b0
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff81125f34>] ? end_page_writeback+0x44/0x60
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff811cc0a0>] ? end_buffer_async_write+0x0/0x180
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff811caa40>] ? generic_submit_bh_handler+0x0/0x10
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff811caa40>] ? generic_submit_bh_handler+0x0/0x10
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff811caa40>] ? generic_submit_bh_handler+0x0/0x10
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff811cd4d7>] generic_block_write_full_page+0x137/0x140
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff811cd4f8>] block_write_full_page_endio+0x18/0x20
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff811cd515>] block_write_full_page+0x15/0x20
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffffa00d37ad>] ext3_ordered_writepage+0x1ed/0x240 [ext3]
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff8113be47>] __writepage+0x17/0x40
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff8113cd4b>] write_cache_pages+0x1cb/0x480
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff8113be30>] ? __writepage+0x0/0x40
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff8113d024>] generic_writepages+0x24/0x30
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff8113d065>] do_writepages+0x35/0x40
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff811c195d>] __writeback_single_inode+0xdd/0x290
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff811c1b4a>] writeback_single_inode+0x3a/0xc0
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff811c1e41>] writeback_sb_inodes+0xf1/0x210
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff811c20b0>] writeback_inodes_wb+0x150/0x1a0
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff811c23db>] wb_writeback+0x2db/0x430
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff81527e54>] ? thread_return+0xba/0x7e6
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff811c26d9>] wb_do_writeback+0x1a9/0x250
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff8107fd10>] ? process_timeout+0x0/0x10
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff811c27e3>] bdi_writeback_task+0x63/0x1b0
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff810969a7>] ? bit_waitqueue+0x17/0xc0
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff811511f0>] ? bdi_start_fn+0x0/0x110
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff81151285>] bdi_start_fn+0x95/0x110
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff811511f0>] ? bdi_start_fn+0x0/0x110
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff810964f6>] kthread+0x96/0xa0
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff81096460>] ? kthread+0x0/0xa0
Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
Feb 14 10:06:48 proxmox-plateforme pvestatd[1915]: WARNING: unable to connect to VM 102 socket - timeout after 31 retries
Feb 14 10:06:51 proxmox-plateforme pvestatd[1915]: WARNING: unable to connect to VM 101 socket - timeout after 31 retries
Feb 14 10:06:54 proxmox-plateforme pvestatd[1915]: WARNING: unable to connect to VM 100 socket - timeout after 31 retries
Feb 14 10:06:54 proxmox-plateforme pvestatd[1915]: status update time (12.053 seconds)
Feb 14 10:06:57 proxmox-plateforme pvestatd[1915]: WARNING: unable to connect to VM 103 socket - timeout after 31 retries
Feb 14 10:07:00 proxmox-plateforme pvestatd[1915]: WARNING: unable to connect to VM 102 socket - timeout after 31 retries

Could the "writeback*" entries suggest some kind of disk controller issue? The machine has a Smart Array G6 controller.
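One way to check that hunch is to look at which kernel modules appear in the call traces of the blocked tasks. In the log above, the scsi_eh_1 trace passes through hpsa_eh_device_reset_handler [hpsa], and hpsa is the Linux driver for HP Smart Array controllers, so the storage path looks like a reasonable suspect. Below is a minimal Python sketch that summarizes this from syslog lines. The function name summarize_hung_tasks and both regular expressions are my own assumptions, written against the log format shown above; it is not a Proxmox tool.

```python
import re

# Hung-task header as printed by the kernel, e.g.
# "INFO: task scsi_eh_1:305 blocked for more than 120 seconds."
HUNG_RE = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")

# Call-trace frame, optionally followed by the owning module in brackets, e.g.
# "[<ffffffffa0004016>] hpsa_eh_device_reset_handler+0x116/0x3c0 [hpsa]"
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?(\S+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def summarize_hung_tasks(lines):
    """Map each blocked task name to the sorted kernel modules seen in its trace.

    Hypothetical helper: it assumes every trace frame that follows a
    hung-task header belongs to that task until the next header appears.
    """
    tasks = {}
    current = None
    for line in lines:
        header = HUNG_RE.search(line)
        if header:
            current = header.group(1)
            tasks.setdefault(current, set())
            continue
        frame = FRAME_RE.search(line)
        # Only frames tagged with a module name (e.g. [hpsa], [jbd]) are recorded.
        if frame and current is not None and frame.group(2):
            tasks[current].add(frame.group(2))
    return {name: sorted(modules) for name, modules in tasks.items()}


if __name__ == "__main__":
    sample = [
        "Feb 14 10:06:48 proxmox-plateforme kernel: INFO: task scsi_eh_1:305 blocked for more than 120 seconds.",
        "Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffffa0004016>] hpsa_eh_device_reset_handler+0x116/0x3c0 [hpsa]",
        "Feb 14 10:06:48 proxmox-plateforme kernel: INFO: task kjournald:386 blocked for more than 120 seconds.",
        "Feb 14 10:06:48 proxmox-plateforme kernel: [<ffffffffa00b1f1e>] journal_commit_transaction+0x9fe/0x12f0 [jbd]",
    ]
    for task, modules in summarize_hung_tasks(sample).items():
        print(task, modules)
```

If the SCSI error-handler thread is stuck inside the controller driver while the journal and flush threads wait on I/O, that would be consistent with the disk subsystem stalling first and pvestatd timing out as a consequence, though it does not prove a hardware fault.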

Any help appreciated...
 
Did you find a solution? I'm now experiencing something similar with the latest bits. Twice in the last 24 hours, a number of our machines (30 VMs out of 86) have stopped responding, and when I look in the daemon.log file I find lots of entries of the form:

Code:
pvestatd[568060]: WARNING: unable to connect to VM 149 socket - timeout after 31 retries

The non-responsive VMs are a mix of KVM and OpenVZ machines.
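To see which VMs are affected and how often, the warnings can be tallied per VM ID. Here is a rough Python sketch, assuming the message format shown above; the function name count_timeouts and the regular expression are my own, not part of any Proxmox tooling.

```python
import re
from collections import Counter

# Matches the pvestatd warning quoted above and captures the VM ID.
TIMEOUT_RE = re.compile(
    r"pvestatd\[\d+\]: WARNING: unable to connect to VM (\d+) socket - timeout"
)


def count_timeouts(lines):
    """Count pvestatd socket-timeout warnings per VM ID (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "pvestatd[568060]: WARNING: unable to connect to VM 149 socket - timeout after 31 retries",
        "pvestatd[568060]: WARNING: unable to connect to VM 149 socket - timeout after 31 retries",
        "pvestatd[568060]: status update time (12.053 seconds)",
    ]
    # Most frequently affected VMs first.
    for vmid, hits in count_timeouts(sample).most_common():
        print(vmid, hits)
```

To run it against a real log, pass the opened file instead of the sample list, for example count_timeouts(open(path)) with path pointing at your daemon.log (on a stock Debian-based install that is usually /var/log/daemon.log, but that location is an assumption here).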

Our system is a 6-node PVE cluster. All nodes have the latest bits installed and all have been rebooted within the last two weeks. Here is the pveversion -v output from one of the nodes:

Code:
pve-manager: 2.3-13 (pve-manager/2.3/7946f1f1)
running kernel: 2.6.32-18-pve
proxmox-ve-2.6.32: 2.3-93
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-19-pve: 2.6.32-93
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.4-4
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.93-2
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.9-1
pve-cluster: 1.0-36
qemu-server: 2.3-18
pve-firmware: 1.0-21
libpve-common-perl: 1.0-49
libpve-access-control: 1.0-26
libpve-storage-perl: 2.3-6
vncterm: 1.0-3
vzctl: 4.0-1pve2
vzprocps: 2.0.11-2
vzquota: 3.1-1
pve-qemu-kvm: 1.4-8
ksm-control-daemon: 1.1-1

Do we have something misconfigured? Is there a bug?
 
