Hello,
I have been slowly migrating my containers from the old-style OpenVZ containers to LXC, and while things got off to a rocky start, they had been working better lately - until today.
This morning I went to look at a node in the cluster, ran "pct list", and it completely froze the terminal - nothing would interrupt or suspend it. To my dismay, I discovered that only a single cluster member could still run pct or qm, or serve its web interface! That machine seems to believe everything is perfectly normal and happy, but it cannot, of course, communicate with the other machines in the cluster (I get a "Connection Refused (595)" error when I try).
After some quick looking around, I found that on each of the hung nodes I cannot read the contents of /etc/pve/nodes/[local-node-name]. I can see the correct information for all the other nodes, but I cannot restart the pve-cluster daemon (it hangs indefinitely, just like pct and qm).
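For reference, this is roughly what I was trying on the hung nodes (standard systemd units; I didn't save the exact output, so take this as a sketch of what hung):
Code:
# on one of the hung nodes - all of these either hung or never returned
ls /etc/pve/nodes/[local-node-name]   # hangs on the local node's own directory
systemctl restart pve-cluster         # hangs indefinitely, same as pct/qm
pct list                              # hangs, cannot be interrupted or suspended
qm list                               # same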
I can see from /var/log/daemon.log that it thinks a node failed (the one still working):
Code:
Feb 3 03:09:39 node-b corosync[3917]: [TOTEM ] A processor failed, forming new configuration.
Feb 3 03:09:45 node-b corosync[3917]: [TOTEM ] A new membership (10.0.0.8:188) was formed. Members left: 6
Feb 3 03:09:45 node-b corosync[3917]: [TOTEM ] Failed to receive the leave message. failed: 6
At the same time, on the node that everyone else thought had failed, I see this:
Code:
Feb 3 03:09:49 node-f corosync[16522]: [MAIN ] Corosync main process was not scheduled for 14964.5771 ms (threshold is 3400.0000 ms). Consider token timeout increase.
Feb 3 03:10:21 node-f corosync[16522]: [TOTEM ] A processor failed, forming new configuration.
Feb 3 03:10:21 node-f corosync[16522]: [MAIN ] Corosync main process was not scheduled for 32056.3379 ms (threshold is 3400.0000 ms). Consider token timeout increase.
Feb 3 03:10:21 node-f pvestatd[3961]: status update time (37.547 seconds)
Feb 3 03:10:21 node-f pve-firewall[3958]: firewall update time (43.401 seconds)
Feb 3 03:10:21 node-f corosync[16522]: [TOTEM ] A new membership (10.0.0.8:192) was formed. Members joined: 7 5 1 4 3 2 left: 7 5 1 4 3 2
Feb 3 03:10:21 node-f corosync[16522]: [TOTEM ] Failed to receive the leave message. failed: 7 5 1 4 3 2
Feb 3 03:10:21 node-f corosync[16522]: [QUORUM] Members[7]: 7 6 5 1 4 3 2
Feb 3 03:10:21 node-f corosync[16522]: [MAIN ] Completed service synchronization, ready to provide service.
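Since corosync itself suggests "Consider token timeout increase", one thing I'm considering is raising the totem token timeout in /etc/pve/corosync.conf (editing on a quorate node and bumping config_version). The 10000 ms below is just a number I picked, not something I've tested:
Code:
# /etc/pve/corosync.conf - only the relevant part; keep everything else as-is
totem {
  # ... existing settings (version, cluster_name, interface, etc.) ...
  token: 10000    # token timeout in ms; example value only
}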
I eventually rebooted one of the cluster nodes (one that wasn't responding), and when it finally went down (it hung for ~45 minutes on several PVE-related messages, which I thought I had captured but had not), the other machines could all run pct & qm again and /etc/pve/nodes/ is back to normal; however, their web interfaces still do not work.
The web interfaces on the machine that stayed up and on the freshly rebooted machine both work; however, they both get "Connection Refused (595)" messages when trying to talk to any other nodes.
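Now that pmxcfs is responsive again, my plan is to try restarting the proxy/daemon services on the nodes whose web interface is still dead, rather than rebooting them; something like this (assuming the standard PVE 4.x service names):
Code:
# on each node whose web UI still refuses connections
systemctl restart pvedaemon
systemctl restart pveproxy
systemctl restart spiceproxy
Though I'm not sure whether that will clear the stuck pveproxy/spiceproxy start and stop processes visible in the process list below.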
I have production machines on all the other nodes, so while I can move them around and reboot them, that obviously isn't a long-term solution.
I did capture the list of stuck processes before the reboot snapped everything back into a mostly working state:
Code:
node-e# ps auxf | grep -E ' [DR]'
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 28507 0.0 0.1 240680 60296 pts/0 D+ 12:16 0:00 | \_ /usr/bin/perl -T /usr/sbin/pct list
root 31204 0.0 0.1 240668 60296 pts/2 DN 12:21 0:00 \_ /usr/bin/perl -T /usr/sbin/pct
root 8506 0.0 0.0 19760 3008 pts/2 R+ 13:42 0:00 \_ ps auxf
root 2523 0.0 0.1 239812 63544 ? Ds 06:25 0:00 /usr/bin/perl -T /usr/bin/pveproxy stop
root 7400 0.0 0.1 239788 63628 ? Ds 06:32 0:00 /usr/bin/perl -T /usr/bin/pveproxy start
root 9872 0.0 0.1 239232 63240 ? Ds 06:37 0:00 /usr/bin/perl -T /usr/bin/spiceproxy stop
root 12982 0.0 0.1 239784 63780 ? Ds 06:44 0:00 /usr/bin/perl -T /usr/bin/pveproxy start
root 15525 0.0 0.1 239244 63084 ? Ds 06:49 0:00 /usr/bin/perl -T /usr/bin/spiceproxy start
root 6113 0.1 0.1 224316 54808 ? D 13:38 0:00 /usr/bin/perl /usr/sbin/qm list
FWIW, the cluster is a mix of
pve-manager/4.4-5/c43015a5 (running kernel: 4.4.35-1-pve)
and
pve-manager/4.4-5/c43015a5 (running kernel: 4.4.35-2-pve)
but several of the machines that were not responding were running the newer kernel.
Any thoughts on what went wrong, or suggestions for where to go from here? I'd like to get the web interface responding on all the nodes, and of course avoid this situation in the future!
Thanks in advance!