Hello,
I have a 2-node cluster with quorum and NFS storage for HA machines.
Node Falcon became unavailable, with this error message repeating on the screen (I saw it over KVM; the servers are from OVH):
BUG: soft lockup - CPU#1 stuck for 67s! [pvestatd:4968]
Falcon looked down: I couldn't connect to it by SSH or the web interface, and I couldn't even log in over KVM. All machines kept working, though, until I saw this message:
BUG: soft lockup - CPU#2 stuck for 67s! [ps:340245]
After this message, everything hung. The time between the service freeze and the complete hang was about 1h 30m.
After a reboot, everything is working again.
Messages from syslog:
Code:
Nov 14 14:43:10 falcon corosync[4326]: [CMAN ] lost contact with quorum device
Nov 14 14:43:10 falcon corosync[4326]: [QUORUM] Members[2]: 1 2
Nov 14 14:43:35 falcon kernel: BUG: soft lockup - CPU#1 stuck for 67s! [pvestatd:4968]
Nov 14 14:43:35 falcon kernel: Modules linked in: vzethdev vznetdev pio_nfs pio_direct pfmt_raw pfmt_ploop1 ip_set ploop simfs vzrst vzcpt vzdquota vzmon vzdev ip6t_REJECT ip6table_mangle ip6table_filter ip6_tables nf_nat_ftp nf_conntrack_ftp xt_state xt_length xt_hl xt_tcpmss xt_TCPMSS xt_limit ipt_LOG xt_DSCP xt_dscp vhost_net ipt_REDIRECT tun macvtap macvlan xt_owner nfnetlink_log kvm_intel kvm xt_recent nfnetlink ipt_REJECT dlm configfs xt_multiport vzevent ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_filter iptable_mangle ip_tables dummy ipv6 ext4 jbd2 fuse snd_pcsp iTCO_wdt iTCO_vendor_support snd_pcm snd_page_alloc snd_timer lpc_ich i2c_i801 serio_raw snd soundcore mfd_core shpchp acpi_pad wmi ext3 mbcache jbd btrfs(T) lzo_compress lzo_decompress zlib_deflate raid1 sg isci libsas scsi_transport_sas ahci ig
Nov 14 14:43:35 falcon kernel: b i2c_algo_bit i2c_core dca [last unloaded: scsi_wait_scan]
Nov 14 14:43:35 falcon kernel: CPU 1
Nov 14 14:43:35 falcon kernel: Modules linked in: vzethdev vznetdev pio_nfs pio_direct pfmt_raw pfmt_ploop1 ip_set ploop simfs vzrst vzcpt vzdquota vzmon vzdev ip6t_REJECT ip6table_mangle ip6table_filter ip6_tables nf_nat_ftp nf_conntrack_ftp xt_state xt_length xt_hl xt_tcpmss xt_TCPMSS xt_limit ipt_LOG xt_DSCP xt_dscp vhost_net ipt_REDIRECT tun macvtap macvlan xt_owner nfnetlink_log kvm_intel kvm xt_recent nfnetlink ipt_REJECT dlm configfs xt_multiport vzevent ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_filter iptable_mangle ip_tables dummy ipv6 ext4 jbd2 fuse snd_pcsp iTCO_wdt iTCO_vendor_support snd_pcm snd_page_alloc snd_timer lpc_ich i2c_i801 serio_raw snd soundcore mfd_core shpchp acpi_pad wmi ext3 mbcache jbd btrfs(T) lzo_compress lzo_decompress zlib_deflate raid1 sg isci libsas scsi_transport_sas ahci ig
Nov 14 14:43:35 falcon kernel: b i2c_algo_bit i2c_core dca [last unloaded: scsi_wait_scan]
Nov 14 14:43:35 falcon kernel:
Nov 14 14:43:35 falcon kernel: Pid: 4968, comm: pvestatd veid: 0 Tainted: G --------------- T 2.6.32-34-pve #1 042stab094_7 Supermicro X9SRE/X9SRE-3F/X9SRi/X9SRi-3F/X9SRE/X9SRE-3F/X9SRi/X9SRi-3F
Nov 14 14:43:35 falcon kernel: RIP: 0010:[<ffffffff81099da0>] [<ffffffff81099da0>] find_pid_ns+0x70/0xa0
Nov 14 14:43:35 falcon kernel: RSP: 0018:ffff8808787edd58 EFLAGS: 00000202
Nov 14 14:43:35 falcon kernel: RAX: ffff88083e0aa808 RBX: ffff8808787edd58 RCX: 0000000000000034
Nov 14 14:43:35 falcon kernel: RDX: ffff88002804ec00 RSI: ffffffff81aa5b20 RDI: 0000000000005992
Nov 14 14:43:35 falcon kernel: RBP: ffffffff8100bcce R08: 4000000000000000 R09: a590000000000000
Nov 14 14:43:35 falcon kernel: R10: fffe06aad2c80000 R11: 0001f954aee2b4b2 R12: ffff8808787edf40
Nov 14 14:43:35 falcon kernel: R13: ffff8808787edd88 R14: 00000000037fdfe8 R15: 0000000000000020
Nov 14 14:43:35 falcon kernel: FS: 00007ffb8e0f4700(0000) GS:ffff880028240000(0000) knlGS:0000000000000000
Nov 14 14:43:35 falcon kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 14 14:43:35 falcon kernel: CR2: 00000000037fe007 CR3: 0000000876611000 CR4: 00000000001427e0
Nov 14 14:43:35 falcon kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 14 14:43:35 falcon kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Nov 14 14:43:35 falcon kernel: Process pvestatd (pid: 4968, veid: 0, threadinfo ffff8808787ec000, task ffff8808789de380)
Nov 14 14:43:35 falcon kernel: Stack:
Nov 14 14:43:35 falcon kernel: ffff8808787edd88 ffffffff8109af5a ffff88087c801af8 0000000000004ea2
Nov 14 14:43:35 falcon kernel: <d> ffff88087c4fbb40 ffffffff81aa5b20 ffff8808787eddf8 ffffffff81218c5a
Nov 14 14:43:35 falcon kernel: <d> ffff8808787eddf8 000000000000fe99 ffffffff811bde00 ffff8808787edf40
Nov 14 14:43:35 falcon kernel: Call Trace:
Nov 14 14:43:35 falcon kernel: [<ffffffff8109af5a>] ? find_ge_pid+0x2a/0x50
Nov 14 14:43:35 falcon kernel: [<ffffffff81218c5a>] ? next_tgid+0x6a/0xc0
Nov 14 14:43:35 falcon kernel: [<ffffffff811bde00>] ? filldir+0x0/0xf0
Nov 14 14:43:35 falcon kernel: [<ffffffff811bde63>] ? filldir+0x63/0xf0
Nov 14 14:43:35 falcon kernel: [<ffffffff8121bb59>] ? proc_pid_readdir+0x149/0x220
Nov 14 14:43:35 falcon kernel: [<ffffffff811bde00>] ? filldir+0x0/0xf0
Nov 14 14:43:35 falcon kernel: [<ffffffff811bde00>] ? filldir+0x0/0xf0
Nov 14 14:43:35 falcon kernel: [<ffffffff8121674a>] ? proc_root_readdir+0x4a/0x60
Nov 14 14:43:35 falcon kernel: [<ffffffff811bde00>] ? filldir+0x0/0xf0
Nov 14 14:43:35 falcon kernel: [<ffffffff811be070>] ? vfs_readdir+0xa0/0xd0
Nov 14 14:43:35 falcon kernel: [<ffffffff811be19a>] ? sys_getdents+0x8a/0x100
Nov 14 14:43:35 falcon kernel: [<ffffffff8100b182>] ? system_call_fastpath+0x16/0x1b
Nov 14 14:43:35 falcon kernel: Code: 48 29 ca b9 40 00 00 00 2b 0d 95 c6 a0 00 48 01 d0 48 8b 15 fb 60 e5 00 48 d3 e8 48 8d 04 c2 48 8b 00 48 85 c0 75 0c eb 26 66 90 <48> 8b 00 48 85 c0 74 1c 39 78 f0 75 f3 48 39 70 f8 75 ed 8b 96
Nov 14 14:43:35 falcon kernel: Call Trace:
Nov 14 14:43:35 falcon kernel: [<ffffffff8109af5a>] ? find_ge_pid+0x2a/0x50
Nov 14 14:43:35 falcon kernel: [<ffffffff81218c5a>] ? next_tgid+0x6a/0xc0
Nov 14 14:43:35 falcon kernel: [<ffffffff811bde00>] ? filldir+0x0/0xf0
Nov 14 14:43:35 falcon kernel: [<ffffffff811bde63>] ? filldir+0x63/0xf0
Nov 14 14:43:35 falcon kernel: [<ffffffff8121bb59>] ? proc_pid_readdir+0x149/0x220
Nov 14 14:43:35 falcon kernel: [<ffffffff811bde00>] ? filldir+0x0/0xf0
Nov 14 14:43:35 falcon kernel: [<ffffffff811bde00>] ? filldir+0x0/0xf0
Nov 14 14:43:35 falcon kernel: [<ffffffff8121674a>] ? proc_root_readdir+0x4a/0x60
Nov 14 14:43:35 falcon kernel: [<ffffffff811bde00>] ? filldir+0x0/0xf0
Nov 14 14:43:35 falcon kernel: [<ffffffff811be070>] ? vfs_readdir+0xa0/0xd0
Nov 14 14:43:35 falcon kernel: [<ffffffff811be19a>] ? sys_getdents+0x8a/0x100
Nov 14 14:43:35 falcon kernel: [<ffffffff8100b182>] ? system_call_fastpath+0x16/0x1b
Code:
Nov 14 16:23:35 falcon kernel: BUG: soft lockup - CPU#2 stuck for 67s! [ps:340245]
Nov 14 16:23:35 falcon kernel: Modules linked in: vzethdev vznetdev pio_nfs pio_direct pfmt_raw pfmt_ploop1 ip_set ploop simfs vzrst vzcpt vzdquota vzmon vzdev ip6t_REJECT ip6table_mangle ip6table_filter ip6_tables nf_nat_ftp nf_conntrack_ftp xt_state xt_length xt_hl xt_tcpmss xt_TCPMSS xt_limit ipt_LOG xt_DSCP xt_dscp vhost_net ipt_REDIRECT tun macvtap macvlan xt_owner nfnetlink_log kvm_intel kvm xt_recent nfnetlink ipt_REJECT dlm configfs xt_multiport vzevent ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_filter iptable_mangle ip_tables dummy ipv6 ext4 jbd2 fuse snd_pcsp iTCO_wdt iTCO_vendor_support snd_pcm snd_page_alloc snd_timer lpc_ich i2c_i801 serio_raw snd soundcore mfd_core shpchp acpi_pad wmi ext3 mbcache jbd btrfs(T) lzo_compress lzo_decompress zlib_deflate raid1 sg isci libsas scsi_transport_sas ahci ig
Nov 14 16:23:35 falcon kernel: b i2c_algo_bit i2c_core dca [last unloaded: scsi_wait_scan]
Nov 14 16:23:35 falcon kernel: CPU 2
Nov 14 16:23:35 falcon kernel: Modules linked in: vzethdev vznetdev pio_nfs pio_direct pfmt_raw pfmt_ploop1 ip_set ploop simfs vzrst vzcpt vzdquota vzmon vzdev ip6t_REJECT ip6table_mangle ip6table_filter ip6_tables nf_nat_ftp nf_conntrack_ftp xt_state xt_length xt_hl xt_tcpmss xt_TCPMSS xt_limit ipt_LOG xt_DSCP xt_dscp vhost_net ipt_REDIRECT tun macvtap macvlan xt_owner nfnetlink_log kvm_intel kvm xt_recent nfnetlink ipt_REJECT dlm configfs xt_multiport vzevent ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_filter iptable_mangle ip_tables dummy ipv6 ext4 jbd2 fuse snd_pcsp iTCO_wdt iTCO_vendor_support snd_pcm snd_page_alloc snd_timer lpc_ich i2c_i801 serio_raw snd soundcore mfd_core shpchp acpi_pad wmi ext3 mbcache jbd btrfs(T) lzo_compress lzo_decompress zlib_deflate raid1 sg isci libsas scsi_transport_sas ahci ig
Nov 14 16:23:35 falcon kernel: b i2c_algo_bit i2c_core dca [last unloaded: scsi_wait_scan]
Nov 14 16:23:35 falcon kernel:
Nov 14 16:23:35 falcon kernel: Pid: 340245, comm: ps veid: 200 Tainted: G --------------- T 2.6.32-34-pve #1 042stab094_7 Supermicro X9SRE/X9SRE-3F/X9SRi/X9SRi-3F/X9SRE/X9SRE-3F/X9SRi/X9SRi-3F
Nov 14 16:23:35 falcon kernel: RIP: 0010:[<ffffffff81099da0>] [<ffffffff81099da0>] find_pid_ns+0x70/0xa0
Nov 14 16:23:35 falcon kernel: RSP: 0018:ffff8801eb0f3d58 EFLAGS: 00000206
Nov 14 16:23:35 falcon kernel: RAX: ffff8804576f8f28 RBX: ffff8801eb0f3d58 RCX: 0000000000000034
Nov 14 16:23:35 falcon kernel: RDX: ffff88002804ec00 RSI: ffff8804e0cf6940 RDI: 0000000000000157
Nov 14 16:23:35 falcon kernel: RBP: ffffffff8100bcce R08: e000000000000000 R09: 54b8000000000000
Nov 14 16:23:35 falcon kernel: R10: 2013833daa5c0000 R11: dfec04c736736a97 R12: ffff88083e130c00
Nov 14 16:23:35 falcon kernel: R13: ffff88083e130c00 R14: ffff8801afb4fc18 R15: ffff88087fc38380
Nov 14 16:23:35 falcon kernel: FS: 00007f5b89172700(0000) GS:ffff880028280000(0000) knlGS:0000000000000000
Nov 14 16:23:35 falcon kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov 14 16:23:35 falcon kernel: CR2: 00007f5b88d3edb0 CR3: 00000004cec88000 CR4: 00000000001427e0
Nov 14 16:23:35 falcon kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 14 16:23:35 falcon kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Nov 14 16:23:35 falcon kernel: Process ps (pid: 340245, veid: 200, threadinfo ffff8801eb0f2000, task ffff880791c5e200)
Nov 14 16:23:35 falcon kernel: Stack:
Nov 14 16:23:35 falcon kernel: ffff8801eb0f3d88 ffffffff8109af5a ffff880800b27238 0000000000000157
Nov 14 16:23:35 falcon kernel: <d> ffff880100000156 ffff8804e0cf6940 ffff8801eb0f3df8 ffffffff81218c5a
Nov 14 16:23:35 falcon kernel: <d> ffff8801eb0f3df8 0000000000d92261 ffffffff811bde00 ffff8801eb0f3f40
Nov 14 16:23:35 falcon kernel: Call Trace:
Nov 14 16:23:35 falcon kernel: [<ffffffff8109af5a>] ? find_ge_pid+0x2a/0x50
Nov 14 16:23:35 falcon kernel: [<ffffffff81218c5a>] ? next_tgid+0x6a/0xc0
Nov 14 16:23:35 falcon kernel: [<ffffffff811bde00>] ? filldir+0x0/0xf0
Nov 14 16:23:35 falcon kernel: [<ffffffff8121bb59>] ? proc_pid_readdir+0x149/0x220
Nov 14 16:23:35 falcon kernel: [<ffffffff811bde00>] ? filldir+0x0/0xf0
Nov 14 16:23:35 falcon kernel: [<ffffffff811bde00>] ? filldir+0x0/0xf0
Nov 14 16:23:35 falcon kernel: [<ffffffff8121674a>] ? proc_root_readdir+0x4a/0x60
Nov 14 16:23:35 falcon kernel: [<ffffffff811bde00>] ? filldir+0x0/0xf0
Nov 14 16:23:35 falcon kernel: [<ffffffff811be070>] ? vfs_readdir+0xa0/0xd0
Nov 14 16:23:35 falcon kernel: [<ffffffff811be19a>] ? sys_getdents+0x8a/0x100
Nov 14 16:23:35 falcon kernel: [<ffffffff8100b182>] ? system_call_fastpath+0x16/0x1b
Nov 14 16:23:35 falcon kernel: Code: 48 29 ca b9 40 00 00 00 2b 0d 95 c6 a0 00 48 01 d0 48 8b 15 fb 60 e5 00 48 d3 e8 48 8d 04 c2 48 8b 00 48 85 c0 75 0c eb 26 66 90 <48> 8b 00 48 85 c0 74 1c 39 78 f0 75 f3 48 39 70 f8 75 ed 8b 96
Nov 14 16:23:35 falcon kernel: Call Trace:
Nov 14 16:23:35 falcon kernel: [<ffffffff8109af5a>] ? find_ge_pid+0x2a/0x50
Nov 14 16:23:35 falcon kernel: [<ffffffff81218c5a>] ? next_tgid+0x6a/0xc0
Nov 14 16:23:35 falcon kernel: [<ffffffff811bde00>] ? filldir+0x0/0xf0
Nov 14 16:23:35 falcon kernel: [<ffffffff8121bb59>] ? proc_pid_readdir+0x149/0x220
Nov 14 16:23:35 falcon kernel: [<ffffffff811bde00>] ? filldir+0x0/0xf0
Nov 14 16:23:35 falcon kernel: [<ffffffff811bde00>] ? filldir+0x0/0xf0
Nov 14 16:23:35 falcon kernel: [<ffffffff8121674a>] ? proc_root_readdir+0x4a/0x60
Nov 14 16:23:35 falcon kernel: [<ffffffff811bde00>] ? filldir+0x0/0xf0
Nov 14 16:23:35 falcon kernel: [<ffffffff811be070>] ? vfs_readdir+0xa0/0xd0
Nov 14 16:23:35 falcon kernel: [<ffffffff811be19a>] ? sys_getdents+0x8a/0x100
Nov 14 16:23:35 falcon kernel: [<ffffffff8100b182>] ? system_call_fastpath+0x16/0x1b
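To pull the soft lockup events out of a syslog like the excerpts above and measure the time between them, something like the following minimal sketch could be used. It assumes the line format shown in these excerpts; the function name `find_lockups` and the `year` parameter are illustrative assumptions (classic syslog timestamps carry no year, so 2014 is taken from the upgrade date in this post), and month-name parsing assumes an English locale.

```python
import re
from datetime import datetime

# Matches kernel lines of the form seen in this post, e.g.
# "Nov 14 14:43:35 falcon kernel: BUG: soft lockup - CPU#1 stuck for 67s! [pvestatd:4968]"
LOCKUP_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def find_lockups(lines, year=2014):
    """Return (timestamp, cpu, command, pid) for each soft-lockup line."""
    events = []
    for line in lines:
        m = LOCKUP_RE.match(line)
        if m:
            # The year is supplied by the caller because syslog omits it.
            ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
            events.append((ts, int(m["cpu"]), m["comm"], int(m["pid"])))
    return events


# Three lines copied from the excerpts above; only two are lockup reports.
sample = [
    "Nov 14 14:43:10 falcon corosync[4326]: [CMAN ] lost contact with quorum device",
    "Nov 14 14:43:35 falcon kernel: BUG: soft lockup - CPU#1 stuck for 67s! [pvestatd:4968]",
    "Nov 14 16:23:35 falcon kernel: BUG: soft lockup - CPU#2 stuck for 67s! [ps:340245]",
]

events = find_lockups(sample)
for ts, cpu, comm, pid in events:
    print(ts, f"CPU#{cpu}", comm, pid)

# Interval between the first and the last lockup report.
print("gap:", events[-1][0] - events[0][0])
```

On a real node the same function could be fed the lines of the syslog file instead of the `sample` list.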
pveversion -v
proxmox-ve-2.6.32: 3.3-139 (running kernel: 2.6.32-34-pve)
pve-manager: 3.3-5 (running version: 3.3-5/bfebec03)
pve-kernel-2.6.32-30-pve: 2.6.32-130
pve-kernel-2.6.32-29-pve: 2.6.32-126
pve-kernel-2.6.32-34-pve: 2.6.32-139
pve-kernel-2.6.32-31-pve: 2.6.32-132
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-1
pve-cluster: 3.0-15
qemu-server: 3.3-3
pve-firmware: 1.1-3
libpve-common-perl: 3.0-19
libpve-access-control: 3.0-15
libpve-storage-perl: 3.0-25
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.1-10
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1
On 12.11.2014, I upgraded from version 3.2 to 3.3.
Any ideas on how to prevent this issue?