can't lock file ... got timeout

From time to time I have to restart the servers because I lose control of them.

Sometimes I see messages like these in the log:

Code:
Feb 17 02:38:29 kfn1-node4 pvedaemon[38846]: can't lock file '/var/lock/qemu-server/lock-141.conf' - got timeout
Feb 17 02:38:29 kfn1-node4 pvedaemon[8817]: <root@pam> end task UPID:kfn1-node4:000097BE:08EF85DA:5E49EE8B:qmstop:141:root@pam: can't lock file '/var/lock/qemu-server/lock-141.conf' - got timeout
From this moment on I can't start or stop virtual servers.



Before this problem appears, I always see messages like these in /var/log/syslog:

Code:
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.520706] CPU: 52 PID: 1649 Comm: z_wr_int_6 Tainted: P           O     4.15.18-21-pve #1
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.520736] Hardware name: Supermicro SYS-1029U-TR4T/X11DPU, BIOS 3.1a 07/19/2019
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.520811] RIP: 0010:buf_hash_insert+0xbd/0x180 [zfs]
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.520828] RSP: 0018:ffffb07b60eebcc0 EFLAGS: 00010206
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.520846] RAX: 1b0210fa010154fc RBX: ffff95e1569c91f0 RCX: 0000000000000080
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.520866] RDX: 0000000000000001 RSI: ffff96324bf587b0 RDI: ffffb07b7736d558
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.520886] RBP: ffffb07b60eebcd8 R08: ffff95e1569c9200 R09: 0000000000baf367
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.520913] R10: ffffb07b60eebce0 R11: 0000000000000000 R12: 00000000028256ab
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.520934] R13: 000000000005aac0 R14: ffff96335e5df690 R15: ffff9633b1dcbb50
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.520958] FS:  0000000000000000(0000) GS:ffff9635bf300000(0000) knlGS:0000000000000000
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.520983] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.521000] CR2: 00007f79ef135000 CR3: 0000002adea0a003 CR4: 00000000007626e0
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.521020] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.521040] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.521060] PKRU: 55555554
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.521070] Call Trace:
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.521123]  arc_write_done+0x125/0x3f0 [zfs]
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.521164]  zio_done+0x2d0/0xe60 [zfs]
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.521179]  ? kfree+0x165/0x180
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.521197]  ? spl_kmem_free+0x33/0x40 [spl]
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.521237]  zio_execute+0x95/0xf0 [zfs]
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.521253]  taskq_thread+0x2ae/0x4d0 [spl]
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.521268]  ? wake_up_q+0x80/0x80
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.522089]  ? zio_reexecute+0x390/0x390 [zfs]
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.523021]  kthread+0x105/0x140
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.523759]  ? task_done+0xb0/0xb0 [spl]
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.524503]  ? kthread_create_worker_on_cpu+0x70/0x70
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.525215]  ret_from_fork+0x1f/0x40
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.525909] Code: 05 31 7c 18 00 4a 8d 3c e0 48 8b 37 48 85 f6 0f 84 c2 00 00 00 48 8b 0b 48 89 f0 31 d2 eb 0c 48 8b 40 20 83 c2 01 48 85 c0 74 2f <48> 39 08 75 ef 4c 8b 53 08 4c 39 50 08 75 e5 4c 8b 5b 10 4c 39
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.527427] RIP: buf_hash_insert+0xbd/0x180 [zfs] RSP: ffffb07b60eebcc0
Feb 17 01:49:32 kfn1-node4 kernel: [1496221.528218] ---[ end trace 1ec84add9901e42b ]---


Threads on this forum describe problems with similar symptoms, but this is something else.
Only a full server reboot helps.

Does anyone have any idea how to diagnose such a problem?

Code:
pveversion -v
proxmox-ve: 5.4-2 (running kernel: 4.15.18-21-pve)
pve-manager: 5.4-13 (running version: 5.4-13/aee6f0ec)
pve-kernel-4.15.18-21-pve: 4.15.18-48
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: not correctly installed
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-12
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-56
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-14
libpve-storage-perl: 5.0-44
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-7
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
openvswitch-switch: 2.7.0-3
proxmox-widget-toolkit: 1.0-28
pve-cluster: 5.0-38
pve-container: 2.0-41
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-22
pve-firmware: 2.0-7
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-4
pve-xtermjs: 3.12.0-1
pve-zsync: 1.7-4
qemu-server: 5.0-55
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2
 
Hi,
Please update to the current kernel version.
 
Hi,
Run rm /var/lock/qemu-server/lock-141.conf, wait about 5 minutes until all stuck tasks have errored out, then run the start or stop command again.
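Before removing it, you can check whether a process still holds the lock. As far as I know, qemu-server takes an flock on this file, so a live holder should show up with fuser or lsof; a minimal sketch:

Code:
# See whether any process still holds the lock file open:
fuser -v /var/lock/qemu-server/lock-141.conf
lsof /var/lock/qemu-server/lock-141.conf

# Only if nothing holds it is removing the stale lock safe:
rm /var/lock/qemu-server/lock-141.conf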
 
Hi,
Run rm /var/lock/qemu-server/lock-141.conf, wait about 5 minutes until all stuck tasks have errored out, then run the start or stop command again.

It doesn't help.
It seems the ZFS module error breaks other kernel functions as well, but I don't know how to determine what exactly.
It would probably help to find a way to recover without rebooting.

I hope the kernel update solves the problem.
 
It would probably help to find a way to recover without rebooting.

I hope the kernel update solves the problem.
You'll need to reboot after upgrading the kernel.
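For reference, a rough sketch of the usual upgrade path (assuming the standard apt workflow on PVE 5.x; verify the running kernel afterwards):

Code:
apt update && apt dist-upgrade   # pulls in the newer pve-kernel package
reboot                           # the new kernel only takes effect after the reboot
uname -r                         # confirm the running kernel version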
 
The system upgrade didn't solve the problem.
Yesterday it happened again with the latest Proxmox 6.1-7, after a full reinstall.

Symptoms of the malfunction:
1. Problems begin with this message:

Code:
Feb 22 15:43:53 brk2-node2 kernel: [349758.004743] general protection fault: 0000 [#1] SMP NOPTI
Feb 22 15:43:53 brk2-node2 kernel: [349758.005349] CPU: 18 PID: 43772 Comm: z_wr_int Tainted: P           O      5.3.18-1-pve #1
Feb 22 15:43:53 brk2-node2 kernel: [349758.005887] Hardware name: Supermicro SYS-1029U-TR4T/X11DPU, BIOS 3.1a 07/19/2019
Feb 22 15:43:53 brk2-node2 kernel: [349758.006439] RIP: 0010:buf_hash_insert+0x93/0x160 [zfs]
Feb 22 15:43:53 brk2-node2 kernel: [349758.006910] Code: 4b e6 1d 00 4a 8d 3c e0 48 8b 37 48 85 f6 0f 84 c0 00 00 00 48 8b 0b 48 89 f0 31 d2 eb 0c 48 8b 40 20 83 c2 01 48 85 c0 74 2c <48> 39 08 75 ef 4c 8b 43 08 4c 39 40 08 75 e5 4c 8b 4b 10 4c 39 48
Feb 22 15:43:53 brk2-node2 kernel: [349758.008025] RSP: 0018:ffffb4ddb34f7cd8 EFLAGS: 00010202
Feb 22 15:43:53 brk2-node2 kernel: [349758.008538] RAX: 200050c7000dfbad RBX: ffff96124cafc520 RCX: 0000000000000080
Feb 22 15:43:53 brk2-node2 kernel: [349758.009019] RDX: 0000000000000001 RSI: ffff95692fe00a40 RDI: ffffb4dd3a712fa8
Feb 22 15:43:53 brk2-node2 kernel: [349758.009629] RBP: ffffb4ddb34f7cf8 R08: 50b0e0f50851038f R09: 9ae16a3b2f90404f
Feb 22 15:43:53 brk2-node2 kernel: [349758.010496] R10: ffff96128db74800 R11: 0000000000000001 R12: 00000000014e23f5
Feb 22 15:43:53 brk2-node2 kernel: [349758.011364] R13: 000000000000fd40 R14: ffffb4ddb34f7d08 R15: 0000000000000000
Feb 22 15:43:53 brk2-node2 kernel: [349758.012131] FS:  0000000000000000(0000) GS:ffff96133f480000(0000) knlGS:0000000000000000
Feb 22 15:43:53 brk2-node2 kernel: [349758.012639] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 22 15:43:53 brk2-node2 kernel: [349758.013153] CR2: 00000012d974efc8 CR3: 000000508300a006 CR4: 00000000007626e0
Feb 22 15:43:53 brk2-node2 kernel: [349758.013664] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 22 15:43:53 brk2-node2 kernel: [349758.014167] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Feb 22 15:43:53 brk2-node2 kernel: [349758.014684] PKRU: 55555554
Feb 22 15:43:53 brk2-node2 kernel: [349758.015333] Call Trace:
Feb 22 15:43:53 brk2-node2 kernel: [349758.015989]  arc_write_done+0x12a/0x410 [zfs]
Feb 22 15:43:53 brk2-node2 kernel: [349758.016604]  zio_done+0x440/0x1030 [zfs]
Feb 22 15:43:53 brk2-node2 kernel: [349758.017171]  zio_execute+0x99/0xf0 [zfs]
Feb 22 15:43:53 brk2-node2 kernel: [349758.017680]  taskq_thread+0x2ec/0x4d0 [spl]
Feb 22 15:43:53 brk2-node2 kernel: [349758.018193]  ? wake_up_q+0x80/0x80
Feb 22 15:43:53 brk2-node2 kernel: [349758.018737]  ? zio_taskq_member.isra.12.constprop.17+0x70/0x70 [zfs]
Feb 22 15:43:53 brk2-node2 kernel: [349758.019256]  kthread+0x120/0x140
Feb 22 15:43:53 brk2-node2 kernel: [349758.019796]  ? task_done+0xb0/0xb0 [spl]
Feb 22 15:43:53 brk2-node2 kernel: [349758.020390]  ? __kthread_parkme+0x70/0x70
Feb 22 15:43:53 brk2-node2 kernel: [349758.020938]  ret_from_fork+0x1f/0x40
Feb 22 15:43:53 brk2-node2 kernel: [349758.021481] Modules linked in: tcp_diag inet_diag veth xt_mac ip_set_hash_ip nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache ebtable_filter ebtables ip6table_raw ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables iptable_raw ipt_REJECT nf_reject_ipv4 xt_mark xt_set xt_physdev xt_addrtype xt_comment xt_multiport xt_conntrack xt_tcpudp ip_set_hash_net ip_set sctp iptable_filter bpfilter bonding openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 softdog nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common isst_if_common skx_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 crypto_simd ast cryptd glue_helper drm_vram_helper ttm intel_cstate drm_kms_helper intel_rapl_perf pcspkr drm joydev i2c_algo_bit input_leds fb_sys_fops syscopyarea sysfillrect sysimgblt mei_me mei ioatdma ipmi_si ipmi_devintf ipmi_msghandler acpi_pad
Feb 22 15:43:53 brk2-node2 kernel: [349758.021530]  acpi_power_meter mac_hid tcp_bbr sch_fq vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zlua(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs xor zstd_compress raid6_pq libcrc32c hid_generic usbmouse usbkbd usbhid hid ixgbe xfrm_algo dca mdio i2c_i801 lpc_ich ahci libahci wmi
Feb 22 15:43:53 brk2-node2 kernel: [349758.030584] ---[ end trace 4e4874ffac3ce925 ]---
Feb 22 15:43:53 brk2-node2 kernel: [349758.094098] RIP: 0010:buf_hash_insert+0x93/0x160 [zfs]
Feb 22 15:43:53 brk2-node2 kernel: [349758.094985] Code: 4b e6 1d 00 4a 8d 3c e0 48 8b 37 48 85 f6 0f 84 c0 00 00 00 48 8b 0b 48 89 f0 31 d2 eb 0c 48 8b 40 20 83 c2 01 48 85 c0 74 2c <48> 39 08 75 ef 4c 8b 43 08 4c 39 40 08 75 e5 4c 8b 4b 10 4c 39 48
Feb 22 15:43:53 brk2-node2 kernel: [349758.096560] RSP: 0018:ffffb4ddb34f7cd8 EFLAGS: 00010202
Feb 22 15:43:53 brk2-node2 kernel: [349758.097296] RAX: 200050c7000dfbad RBX: ffff96124cafc520 RCX: 0000000000000080
Feb 22 15:43:53 brk2-node2 kernel: [349758.098033] RDX: 0000000000000001 RSI: ffff95692fe00a40 RDI: ffffb4dd3a712fa8
Feb 22 15:43:53 brk2-node2 kernel: [349758.098765] RBP: ffffb4ddb34f7cf8 R08: 50b0e0f50851038f R09: 9ae16a3b2f90404f
Feb 22 15:43:53 brk2-node2 kernel: [349758.099500] R10: ffff96128db74800 R11: 0000000000000001 R12: 00000000014e23f5
Feb 22 15:43:53 brk2-node2 kernel: [349758.100246] R13: 000000000000fd40 R14: ffffb4ddb34f7d08 R15: 0000000000000000
Feb 22 15:43:53 brk2-node2 kernel: [349758.100985] FS:  0000000000000000(0000) GS:ffff96133f480000(0000) knlGS:0000000000000000
Feb 22 15:43:53 brk2-node2 kernel: [349758.101788] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 22 15:43:53 brk2-node2 kernel: [349758.102492] CR2: 00000012d974efc8 CR3: 000000508300a006 CR4: 00000000007626e0
Feb 22 15:43:53 brk2-node2 kernel: [349758.103192] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 22 15:43:53 brk2-node2 kernel: [349758.103887] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Feb 22 15:43:53 brk2-node2 kernel: [349758.104604] PKRU: 55555554
2. The load average rises sharply.
(screenshot: Selection_647.png)
3. I see these messages in daemon.log:
Code:
Feb 22 16:34:53 brk2-node2 pvedaemon[41457]: <root@pam> end task UPID:brk2-node2:00003AF4:021A5804:5E515823:qmstart:141:root@pam: can't lock file '/var/lock/qemu-server/lock-141.conf' - got timeout
4. Restarting services and deleting lock files had no effect.
5. Many commands simply stop executing. For example, I couldn't get a system report from Proxmox, and when I tried to remove the cache device with zpool remove MainPool nvme0n1, the command hung (see the sketch below).
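For what it's worth, a minimal sketch of how such hanging commands can at least be kept from taking the shell with them (pool and device names from this thread; if the process is stuck in uninterruptible D state, timeout only frees your terminal, it cannot actually kill the process):

Code:
# Wrap potentially hanging admin commands in timeout:
timeout -s KILL 60 zpool remove MainPool nvme0n1 \
    || echo "zpool remove did not finish within 60s"

# List processes stuck in uninterruptible sleep (D state),
# a sign of blocked kernel I/O:
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'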

I have a suspicion that this may be due to hardware.
But how can I find out which component exactly is causing the problem?
Judging by the message RIP: 0010:buf_hash_insert+0x93/0x160 [zfs],
the problem may be related to the ZFS module.
However, the ZFS pool continues to work normally even while the system is in this emergency state.
And I can't see how a file system problem could be related to specific hardware.
Moreover, this problem somehow disrupts kernel functions well beyond ZFS.
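One way to start narrowing this down might be the following (a hedged sketch using standard tools, nothing Proxmox-specific; rasdaemon may need to be installed first):

Code:
# Look for machine check exceptions and hardware error reports in the kernel log:
dmesg -T | grep -iE 'mce|machine check|hardware error|edac'

# If rasdaemon is running, list the recorded memory/CPU error events:
ras-mc-ctl --errors

# Check SMART health of the NVMe cache device (repeat for the pool members):
smartctl -H /dev/nvme0n1

# Ask ZFS itself whether it has seen checksum or I/O errors:
zpool status -v MainPool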

I have a Proxmox subscription for this system.
Does anyone have any ideas on how to solve this problem?
 
I have a suspicion that this may be due to hardware.
In this stack trace, only the memory, CPU, and disk controller are involved.
I guess if this is the only problem, you can cut the CPU.
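If you want to exercise those components directly, something like this could be a starting point (a hedged sketch; stress-ng and memtester are standard Debian packages, and an offline memtest86+ boot is far more thorough for RAM than any userspace test):

Code:
apt install stress-ng memtester

# Load all CPU cores for 10 minutes and verify computation results:
stress-ng --cpu 0 --cpu-method all --verify --timeout 10m

# Test 4 GiB of RAM from userspace, 3 passes:
memtester 4096M 3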
I have a Proxmox subscription for this system.
If you have a subscription at the Basic level or higher, you can open a ticket at my.proxmox.com.
 
I guess if this is the only problem, you can cut the CPU.
I'm not sure I understood you correctly.
Are you proposing to remove the CPU from the list of suspects, or is the CPU the main suspect?

Could you recommend a way to test this hypothesis, please?

Do you suspect the CPU purely on intuition, or is there something else behind it?

I'm looking for a way to artificially reproduce this state of the system.
I looked for an event that the onset of the emergency state could be correlated with, but found nothing suspicious.

The problem happens sporadically, and there is no time to investigate the system while it is in the emergency state.
I was thinking about a script that periodically checks the load average and, if a threshold is exceeded, runs a set of diagnostic commands and logs their output (see the sketch below).
Not all diagnostic commands will manage to execute; something will certainly hang.
However, the commands don't have to be executed sequentially.
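A minimal sketch of what I have in mind (hypothetical threshold and command list; each command runs in the background under its own timeout, so one hung command cannot prevent the others from starting):

Code:
#!/bin/bash
# Hypothetical load-average watchdog: when the 1-minute load exceeds
# THRESHOLD, run diagnostic commands in parallel, each under a timeout,
# and append their output (possibly interleaved) to a log.
THRESHOLD=50
LOG=/var/log/load-watchdog.log

while sleep 60; do
    load=$(awk '{print int($1)}' /proc/loadavg)
    [ "$load" -lt "$THRESHOLD" ] && continue

    echo "=== $(date) load=$load ===" >> "$LOG"
    # Run each command concurrently; kill it if it blocks for more than 30s.
    # Note: a command stuck in uninterruptible D state may survive even SIGKILL.
    for cmd in 'zpool status -v' 'ps -eo pid,stat,wchan:32,cmd' \
               'dmesg -T | tail -n 100' 'top -b -n 1 | head -n 20'; do
        ( timeout -s KILL 30 bash -c "$cmd" >> "$LOG" 2>&1 ) &
    done
    wait
done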
 