Service pve-cluster stops regularly

adoII

Renowned Member
Jan 28, 2010
Hi there,

we have the newest PVE packages from the pve-no-subscription repository installed and the newest kernel booted.

Since the last update of the Proxmox packages a few days ago, we have noticed that the pve-cluster daemon stops working on different machines in our cluster.

In the syslogs we then find the following messages:

Code:
syslog:Oct 12 13:19:50 machine03 systemd[1]: pve-cluster.service: main process exited, code=killed, status=6/ABRT
syslog:Oct 12 13:19:50 machine03 systemd[1]: Unit pve-cluster.service entered failed state.
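The same messages can also be pulled from the systemd journal; something like this should work:

Code:
journalctl -u pve-cluster.service --since today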

Our software versions (output of pveversion -v) are:
Code:
proxmox-ve: 4.3-66 (running kernel: 4.4.19-1-pve)
pve-manager: 4.3-3 (running version: 4.3-3/557191d3)
pve-kernel-4.4.13-1-pve: 4.4.13-56
pve-kernel-4.4.13-2-pve: 4.4.13-58
pve-kernel-4.4.15-1-pve: 4.4.15-60
pve-kernel-4.4.16-1-pve: 4.4.16-64
pve-kernel-4.4.19-1-pve: 4.4.19-66
pve-kernel-4.4.10-1-pve: 4.4.10-54
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-46
qemu-server: 4.0-91
pve-firmware: 1.1-9
libpve-common-perl: 4.0-75
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-66
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-qemu-kvm: 2.6.2-2
pve-container: 1.0-78
pve-firewall: 2.0-31
pve-ha-manager: 1.0-35
ksm-control-daemon: not correctly installed
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.5-1
lxcfs: 2.0.4-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
ceph: 10.2.3-1~bpo80+1

Stopping and starting the pve-cluster daemon then fixes the problem for a few minutes, before it occurs again.
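For completeness, the workaround is just the standard systemd restart (unit name as in the syslog above):

Code:
systemctl stop pve-cluster
systemctl start pve-cluster   # equivalent to: systemctl restart pve-cluster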

Any idea what is happening and what I can do to debug the problem?

Thanks
 
We still have the issue.
The daemon pmxcfs stops running randomly on our machines.

We found the following additional debug information (from systemctl status pve-cluster):

Code:
pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
   Active: failed (Result: signal) since Mo 2016-10-17 13:01:01 CEST; 2h 8min ago
  Process: 3788 ExecStartPost=/usr/bin/pvecm updatecerts --silent (code=exited, status=0/SUCCESS)
  Process: 3737 ExecStart=/usr/bin/pmxcfs $DAEMON_OPTS (code=exited, status=0/SUCCESS)
Main PID: 3784 (code=killed, signal=ABRT)

Okt 17 12:52:58 san05 pmxcfs[3784]: [status] notice: received sync request (epoch 1/2230/00000027)
Okt 17 12:52:58 san05 pmxcfs[3784]: [dcdb] notice: received all states
Okt 17 12:52:58 san05 pmxcfs[3784]: [dcdb] notice: leader is 1/2230
Okt 17 12:52:58 san05 pmxcfs[3784]: [dcdb] notice: synced members: 1/2230, 2/2416, 3/26511, 4/3784, 5/24979, 6/3629, 7/20736, 9/31488, 10/4122, 11/9838
Okt 17 12:52:58 san05 pmxcfs[3784]: [dcdb] notice: all data is up to date
Okt 17 12:52:58 san05 pmxcfs[3784]: [status] notice: received all states
Okt 17 12:52:58 san05 pmxcfs[3784]: [status] notice: all data is up to date
Okt 17 13:00:37 san05 pmxcfs[3784]: [status] notice: received log
Okt 17 13:01:01 san05 systemd[1]: pve-cluster.service: main process exited, code=killed, status=6/ABRT
Okt 17 13:01:01 san05 systemd[1]: Unit pve-cluster.service entered failed state.
 
Okay, now I was able to capture a trace of the pmxcfs process as it was dying.

I think the reason for the crash is a corrupted cluster database?

Code:
[pid 25704] writev(2, [{"*** Error in `", 14}, {"/usr/bin/pmxcfs", 15}, {"': ", 3}, {"corrupted double-linked list", 28}, {": 0x", 4}, {"00007f8fac023a50", 16}, {" ***\n", 5}], 7 <unfinished ...>

I do not remember us doing anything unusual. We did not add or remove hosts from the cluster. The problem has existed since the update from Proxmox 4.2 to Proxmox 4.3.
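For anyone wanting to capture the same thing: a trace like this can be taken by attaching strace to the running daemon, roughly like this (flags and output path are just one way to do it):

Code:
# follow all threads of the running pmxcfs and log each syscall with timestamps
strace -f -tt -o /tmp/pmxcfs.trace -p $(pidof pmxcfs)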

A more verbose trace is here:

Code:
[pid 25704] <... read resumed> "4\0\0\0\1\0\0\0\335\272\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 135168) = 52
[pid 14291] read(7, <unfinished ...>
[pid 25704] open("/dev/tty", O_RDWR|O_NOCTTY|O_NONBLOCK) = -1 ENXIO (No such device or address)
[pid 25704] writev(2, [{"*** Error in `", 14}, {"/usr/bin/pmxcfs", 15}, {"': ", 3}, {"corrupted double-linked list", 28}, {": 0x", 4}, {"00007f8fac023a50", 16}, {" ***\n", 5}], 7 <unfinished ...>
[pid 25684] writev(7, [{"\220\0\0\0\0\0\0\0\334\272\0\0\0\0\0\0", 16}, {"\361\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0"..., 128}], 2 <unfinished ...>
[pid 25704] <... writev resumed> ) = 85
[pid 25704] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
[pid 25684] <... writev resumed> ) = 144
[pid 25707] <... read resumed> "4\0\0\0\1\0\0\0\336\272\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 135168) = 52
[pid 25704] <... mmap resumed> ) = 0x7f8fe2e4c000
[pid 25707] writev(7, [{"\220\0\0\0\0\0\0\0\336\272\0\0\0\0\0\0", 16}, {"\362\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0"..., 128}], 2 <unfinished ...>
[pid 25704] rt_sigprocmask(SIG_UNBLOCK, [ABRT], <unfinished ...>
[pid 25707] <... writev resumed> ) = 144
[pid 25704] <... rt_sigprocmask resumed> NULL, 8) = 0
[pid 25707] read(7, <unfinished ...>
[pid 25706] <... read resumed> "1\0\0\0\1\0\0\0\337\272\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 135168) = 49
[pid 25704] tgkill(14283, 25704, SIGABRT <unfinished ...>
[pid 25684] read(7, <unfinished ...>
[pid 25704] <... tgkill resumed> ) = 0
[pid 25704] --- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=14283, si_uid=0} ---
[pid 25706] writev(7, [{"\220\0\0\0\0\0\0\0\337\272\0\0\0\0\0\0", 16}, {"\363\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0"..., 128}], 2 <ptrace(SYSCALL):No such process>
[pid 17040] <... read resumed> ) = ? <unavailable>
[pid 14330] <... read resumed> ) = ? <unavailable>
[pid 25705] +++ killed by SIGABRT +++
[pid 25684] +++ killed by SIGABRT +++
[pid 25704] +++ killed by SIGABRT +++
[pid 25707] +++ killed by SIGABRT +++
[pid 25706] +++ killed by SIGABRT +++
[pid 17040] +++ killed by SIGABRT +++
[pid 14330] +++ killed by SIGABRT +++
 
Hi,

we just faced the same issue two days ago, without any apparent trigger, on PVE 4.4-15/7599e35a:

Code:
Aug 24 12:59:54 node systemd[1]: pve-cluster.service: main process exited, code=killed, status=6/ABRT
Aug 24 12:59:54 node systemd[1]: Unit pve-cluster.service entered failed state.

We have no idea what the problem was. Could this be a bug in pve-cluster/pmxcfs?

How did you get your trace? Our node was fenced after this, which is expected when pve-cluster is gone.

Is there any chance to get more info on this, @ Proxmox staff? Thanks!
 
I've also been having issues lately with pmxcfs, although in my case it causes a soft lockup on the host and requires a physical reboot.
 
The issue triggered again here :(

It seems like this happens when reading files from /etc/pve (we use collectd to read stats from /etc/pve/.rrd every minute).
Is it possible that there is some kind of read lock or read race condition on /etc/pve, or is it generally not a good idea to read from there periodically, @ Proxmox staff?
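If it helps to reproduce, our access pattern boils down to a periodic read of that virtual file; a crude stand-in for the collectd job would be:

Code:
# rough equivalent of our collectd plugin: read the pmxcfs RRD dump once a minute
while true; do
    cat /etc/pve/.rrd > /dev/null
    sleep 60
done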
 
You could try installing and enabling systemd-coredump and see if it collects a core dump when pmxcfs crashes. If you install the pve-cluster-dbg package, you should then be able to show a backtrace using "coredumpctl info pmxcfs".
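A minimal sketch of those steps (systemd-coredump comes from the standard Debian repositories, pve-cluster-dbg from the Proxmox ones):

Code:
apt-get install systemd-coredump pve-cluster-dbg   # coredump collector plus debug symbols
# after the next crash:
coredumpctl list pmxcfs   # verify that a dump was captured
coredumpctl info pmxcfs   # print metadata and a backtrace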
 
