kernel: BUG: soft lockup - CPU#6 stuck for 67s!

Alexis

Hi,
I saw a few posts from people with a similar problem, but they seem to hit it inside their VMs; in my case it happens on the PVE host itself. I get this when multiple containers use a lot of CPU at the same time. I decided to test it by running the stress command in several containers. It usually takes about 12 hours before Proxmox crashes and these errors appear all over the screen.
kernel: BUG: soft lockup - CPU#6 stuck for 67s! [flush-0:27:265503]
It's always CPU#6. Is it a hardware problem?
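For reference, the stress run in each container was along these lines (a sketch from memory; the exact flags may have differed):
Code:
# keep several workers spinning on the CPU for 12 hours
stress --cpu 4 --timeout 12h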
This is the call trace from syslog:
Code:
Aug 20 08:01:13 safecloud101 kernel: BUG: soft lockup - CPU#6 stuck for 67s! [flush-0:27:265503]
Aug 20 08:01:13 safecloud101 kernel: Modules linked in: vzethdev vznetdev pio_nfs pio_direct pfmt_raw pfmt_ploop1 ploop simfs vzrst nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 vzcpt nf_conntrack vzdquota vzmon vzdev ip6t_REJECT ip6table_mangle ip6table_filter ip6_tables xt_length xt_hl xt_tcpmss xt_TCPMSS iptable_mangle xt_limit xt_dscp ipt_REJECT vhost_net tun macvtap macvlan kvm_intel kvm xt_multiport iptable_filter ip_tables dlm configfs openvswitch vxlan vzevent ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc bonding 8021q garp ipv6 fuse iTCO_wdt iTCO_vendor_support snd_pcsp snd_pcm snd_page_alloc snd_timer snd soundcore sb_edac edac_core lpc_ich mfd_core shpchp i2c_i801 ioatdma wmi acpi_pad ext3 mbcache jbd sg usb_storage isci mpt2sas raid_class ahci libsas ixgbe scsi_transport_sas igb i2c_algo_bit i2c_core dca [last unloaded: scsi_wait_scan]
Aug 20 08:01:13 safecloud101 kernel: CPU 6
Aug 20 08:01:13 safecloud101 kernel: Modules linked in: vzethdev vznetdev pio_nfs pio_direct pfmt_raw pfmt_ploop1 ploop simfs vzrst nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 vzcpt nf_conntrack vzdquota vzmon vzdev ip6t_REJECT ip6table_mangle ip6table_filter ip6_tables xt_length xt_hl xt_tcpmss xt_TCPMSS iptable_mangle xt_limit xt_dscp ipt_REJECT vhost_net tun macvtap macvlan kvm_intel kvm xt_multiport iptable_filter ip_tables dlm configfs openvswitch vxlan vzevent ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc bonding 8021q garp ipv6 fuse iTCO_wdt iTCO_vendor_support snd_pcsp snd_pcm snd_page_alloc snd_timer snd soundcore sb_edac edac_core lpc_ich mfd_core shpchp i2c_i801 ioatdma wmi acpi_pad ext3 mbcache jbd sg usb_storage isci mpt2sas raid_class ahci libsas ixgbe scsi_transport_sas igb i2c_algo_bit i2c_core dca [last unloaded: scsi_wait_scan]
Aug 20 08:01:13 safecloud101 kernel:
Aug 20 08:01:13 safecloud101 kernel: Pid: 265503, comm: flush-0:27 veid: 0 Not tainted 2.6.32-29-pve #1 042stab088_4 Supermicro B9DRP/B9DRP
Aug 20 08:01:13 safecloud101 kernel: RIP: 0010:[<ffffffff8155edde>]  [<ffffffff8155edde>] _spin_lock+0x1e/0x30
Aug 20 08:01:13 safecloud101 kernel: RSP: 0018:ffff88083f251c50  EFLAGS: 00000297
Aug 20 08:01:13 safecloud101 kernel: RAX: 000000000000c63f RBX: ffff88083f251c50 RCX: 0000000000000000
Aug 20 08:01:13 safecloud101 kernel: RDX: 000000000000c63e RSI: ffff88083f251cf0 RDI: ffffffff823ec660
Aug 20 08:01:13 safecloud101 kernel: RBP: ffffffff8100bc4e R08: ffff88107a4169c0 R09: 7fffffffffffffff
Aug 20 08:01:13 safecloud101 kernel: R10: ffff880e13bb36c0 R11: ffff880e18da0e00 R12: ffff88083f251c50
Aug 20 08:01:13 safecloud101 kernel: R13: 0000000000000000 R14: 0000000000000046 R15: ffff880866fbf198
Aug 20 08:01:13 safecloud101 kernel: FS:  0000000000000000(0000) GS:ffff8808a0700000(0000) knlGS:0000000000000000
Aug 20 08:01:13 safecloud101 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Aug 20 08:01:13 safecloud101 kernel: CR2: 00007f13f8151010 CR3: 0000000001a85000 CR4: 00000000000407e0
Aug 20 08:01:13 safecloud101 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 20 08:01:13 safecloud101 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Aug 20 08:01:13 safecloud101 kernel: Process flush-0:27 (pid: 265503, veid: 0, threadinfo ffff88083f250000, task ffff880853f0cf30)
Aug 20 08:01:13 safecloud101 kernel: Stack:
Aug 20 08:01:13 safecloud101 kernel: ffff88083f251ca0 ffffffff811db51f ffff88083f251c90 ffffffff81192728
Aug 20 08:01:13 safecloud101 kernel: <d> ffff88087fc00380 ffff880deb0eec80 ffff880866fbf168 0000000000000000
Aug 20 08:01:13 safecloud101 kernel: <d> ffffffff823ec660 ffff88083f251cf0 ffff88083f251db0 ffffffff811db993
Aug 20 08:01:13 safecloud101 kernel: Call Trace:
Aug 20 08:01:13 safecloud101 kernel: [<ffffffff811db51f>] ? writeback_inodes_wb+0x3f/0x170
Aug 20 08:01:13 safecloud101 kernel: [<ffffffff81192728>] ? kmem_freepages+0xd8/0x120
Aug 20 08:01:13 safecloud101 kernel: [<ffffffff811db993>] ? wb_writeback+0x343/0x460
Aug 20 08:01:13 safecloud101 kernel: [<ffffffff8119395c>] ? free_block+0x14c/0x170
Aug 20 08:01:13 safecloud101 kernel: [<ffffffff811dbb6f>] ? wb_do_writeback+0xbf/0x260
Aug 20 08:01:13 safecloud101 kernel: [<ffffffff811dbda8>] ? bdi_writeback_task+0x98/0x1e0
Aug 20 08:01:13 safecloud101 kernel: [<ffffffff81161c10>] ? bdi_start_fn+0x0/0x110
Aug 20 08:01:13 safecloud101 kernel: [<ffffffff81161c10>] ? bdi_start_fn+0x0/0x110
Aug 20 08:01:13 safecloud101 kernel: [<ffffffff81161ca5>] ? bdi_start_fn+0x95/0x110
Aug 20 08:01:13 safecloud101 kernel: [<ffffffff81161c10>] ? bdi_start_fn+0x0/0x110
Aug 20 08:01:13 safecloud101 kernel: [<ffffffff810a2106>] ? kthread+0x96/0xa0
Aug 20 08:01:13 safecloud101 kernel: [<ffffffff8100c34a>] ? child_rip+0xa/0x20
Aug 20 08:01:13 safecloud101 kernel: [<ffffffff810a2070>] ? kthread+0x0/0xa0
Aug 20 08:01:13 safecloud101 kernel: [<ffffffff8100c340>] ? child_rip+0x0/0x20
Aug 20 08:01:13 safecloud101 kernel: Code: 00 00 00 01 74 05 e8 c2 94 d3 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 <0f> b7 17 eb f5 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89
Aug 20 08:01:13 safecloud101 kernel: Call Trace:
Aug 20 08:01:13 safecloud101 kernel: [<ffffffff811db51f>] ? writeback_inodes_wb+0x3f/0x170
Aug 20 08:01:13 safecloud101 kernel: [<ffffffff81192728>] ? kmem_freepages+0xd8/0x120
Aug 20 08:01:13 safecloud101 kernel: [<ffffffff811db993>] ? wb_writeback+0x343/0x460
Aug 20 08:01:13 safecloud101 kernel: [<ffffffff8119395c>] ? free_block+0x14c/0x170
Aug 20 08:01:13 safecloud101 kernel: [<ffffffff811dbb6f>] ? wb_do_writeback+0xbf/0x260
Aug 20 08:01:13 safecloud101 kernel: [<ffffffff811dbda8>] ? bdi_writeback_task+0x98/0x1e0
Aug 20 08:01:13 safecloud101 kernel: [<ffffffff81161c10>] ? bdi_start_fn+0x0/0x110
Aug 20 08:01:13 safecloud101 kernel: [<ffffffff81161c10>] ? bdi_start_fn+0x0/0x110
Aug 20 08:01:13 safecloud101 kernel: [<ffffffff81161ca5>] ? bdi_start_fn+0x95/0x110
Aug 20 08:01:13 safecloud101 kernel: [<ffffffff81161c10>] ? bdi_start_fn+0x0/0x110
Aug 20 08:01:13 safecloud101 kernel: [<ffffffff810a2106>] ? kthread+0x96/0xa0
Aug 20 08:01:13 safecloud101 kernel: [<ffffffff8100c34a>] ? child_rip+0xa/0x20
Aug 20 08:01:13 safecloud101 kernel: [<ffffffff810a2070>] ? kthread+0x0/0xa0
Aug 20 08:01:13 safecloud101 kernel: [<ffffffff8100c340>] ? child_rip+0x0/0x20
 
I also realized that the hard drive on the server is full; maybe the two issues are linked. I'm not sure why the drive is full: I use an NFS share for the VMs and the backups, and I just checked that no VMs are on the local drive. I've used every trick I know to find big files on the local disk and can't seem to find any. I deleted the large log files (even though they weren't big enough to account for a full disk), but the disk stayed at 100% full. Any idea what may be causing this?
 
If a process is using a file, the file will stay on disk until the process releases the file handle. Think of a file like an instance of a singleton object: every time the object is requested, its reference counter is incremented, and every time a reference is dropped, the counter is decremented. Only when no references exist is the object flushed from memory. It works the same way with files in Unix file systems: a deleted file's space is only freed once the last open handle is closed. Your tool for finding which processes hold references to your files is lsof.
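As a concrete starting point, lsof can list files that have already been unlinked but are still held open, which is exactly the kind of file that eats disk space invisibly (a minimal example):
Code:
# +L1 selects open files with a link count below 1,
# i.e. deleted files whose space has not yet been freed
lsof +L1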
 
Thanks for the quick reply and the tips; I had forgotten about lsof.
Since I couldn't spot any single large file, I went through the whole output, but even then I couldn't find an open file larger than 20 MB. Maybe I missed it. I found a little bash one-liner that is supposed to show only the ten largest open files via lsof, but I can't seem to make it work:
Code:
root@safecloud101:/var/log# lsof |awk ‘{if($7 > 1048576) print $7/1048576 “MB” ” ” $9 }’ | sort -n -u | tail
-bash: syntax error near unexpected token `('
I'm really lost here. It's the first time I've had a full disk where I can't find the cause.
 
Also, in case it helps, this is what I see in the syslog tab of the web interface:
Code:
Aug 20 17:07:18 safecloud101 pveproxy[3702]: worker 10819 started
Aug 20 17:07:18 safecloud101 pveproxy[3702]: worker 10820 started
Aug 20 17:07:18 safecloud101 pveproxy[10819]: error writing access log
Aug 20 17:07:18 safecloud101 pveproxy[10818]: error writing access log
Aug 20 17:07:18 safecloud101 pveproxy[10820]: error writing access log
Aug 20 17:07:18 safecloud101 pveproxy[3702]: worker 10819 finished
Aug 20 17:07:18 safecloud101 pveproxy[3702]: starting 1 worker(s)
Aug 20 17:07:18 safecloud101 pveproxy[3702]: worker 10830 started
Aug 20 17:07:18 safecloud101 pveproxy[3702]: worker 10818 finished
Aug 20 17:07:18 safecloud101 pveproxy[3702]: starting 1 worker(s)
Aug 20 17:07:18 safecloud101 pveproxy[3702]: worker 10831 started
Aug 20 17:07:18 safecloud101 pveproxy[10831]: error writing access log
Aug 20 17:07:19 safecloud101 pveproxy[3702]: worker 10831 finished
Aug 20 17:07:19 safecloud101 pveproxy[3702]: starting 1 worker(s)
Aug 20 17:07:19 safecloud101 pveproxy[3702]: worker 10832 started
Aug 20 17:07:19 safecloud101 pveproxy[10830]: error writing access log
Aug 20 17:07:19 safecloud101 pveproxy[3702]: worker 10830 finished
Aug 20 17:07:19 safecloud101 pveproxy[3702]: starting 1 worker(s)
Aug 20 17:07:19 safecloud101 pveproxy[3702]: worker 10833 started
Aug 20 17:07:19 safecloud101 pveproxy[10832]: error writing access log
Aug 20 17:07:19 safecloud101 pveproxy[10833]: error writing access log
Aug 20 17:07:19 safecloud101 pveproxy[3702]: worker 10820 finished
Aug 20 17:07:19 safecloud101 pveproxy[3702]: starting 1 worker(s)
This goes on in a loop...
 
Your copy/paste has picked up the wrong characters: the typographic quotes (‘ ’ and “ ”) are not valid shell quoting, hence the syntax error. Try this:
Code:
lsof |awk '{if($7 > 1048576) print $7/1048576 "MB" " " $9 }' | sort -n -u | tail
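If lsof still turns up nothing, du confined to the local filesystem can narrow down where the space went (a sketch, assuming GNU du and sort):
Code:
# -x stays on one filesystem, so the NFS mounts are ignored
du -xh --max-depth=2 / 2>/dev/null | sort -h | tail -n 15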
 
Thanks, that worked. But once again, no big files:
Code:
root@safecloud101:~# lsof |awk '{if($7 > 1048576) print $7/1048576 "MB" " " $9 }' | sort -n -u | tail
1.52544MB /lib/x86_64-linux-gnu/libc-2.13.so
1.95358MB /usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
2.75134MB /usr/bin/python2.7
3.98476MB /var/log/messages
4.29027MB /var/log/kern.log
5.66513MB /usr/lib/librados.so.2.0.0
8.79496MB /var/log/mail.warn
9.35678MB /var/log/syslog
14.6567MB /var/log/mail.log
50.0762MB /var/log/daemon.log
 
I found the cause of the full disk. At some point the NFS share for the backups wasn't mounted on this server, and instead of throwing an error, Proxmox simply wrote the backups to the local drive, into the mount-point directory. When I rebooted, the share was mounted over that same folder again, which hid the local files underneath it, so I could no longer see what was sitting in that folder on the local drive.
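For anyone who hits the same thing: you can look underneath a mount point by bind-mounting the root filesystem somewhere else; the bind view shows the underlying directory contents instead of the NFS share mounted on top (a sketch; /mnt/pve/backup is just an example path, adjust it to your storage):
Code:
mkdir -p /mnt/rootfs
mount --bind / /mnt/rootfs
# inspect the directory that the NFS share normally covers
du -xsh /mnt/rootfs/mnt/pve/backup
umount /mnt/rootfs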
 
