Service pve-cluster stops regularly

adoII

Renowned Member
Jan 28, 2010
Hi there,

we have the newest PVE packages from the pve-no-subscription repository installed and the newest kernel booted.

Since the last update of the Proxmox packages a few days ago, we have noticed that the pve-cluster daemon stops working on different machines in our cluster.

In the syslogs we then find the following messages:

Code:
syslog:Oct 12 13:19:50 machine03 systemd[1]: pve-cluster.service: main process exited, code=killed, status=6/ABRT
syslog:Oct 12 13:19:50 machine03 systemd[1]: Unit pve-cluster.service entered failed state.
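The same messages can also be pulled from the systemd journal; something like this should work:

Code:
journalctl -u pve-cluster.service --since today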

Our software versions (output of pveversion -v) are:
Code:
proxmox-ve: 4.3-66 (running kernel: 4.4.19-1-pve)
pve-manager: 4.3-3 (running version: 4.3-3/557191d3)
pve-kernel-4.4.13-1-pve: 4.4.13-56
pve-kernel-4.4.13-2-pve: 4.4.13-58
pve-kernel-4.4.15-1-pve: 4.4.15-60
pve-kernel-4.4.16-1-pve: 4.4.16-64
pve-kernel-4.4.19-1-pve: 4.4.19-66
pve-kernel-4.4.10-1-pve: 4.4.10-54
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-46
qemu-server: 4.0-91
pve-firmware: 1.1-9
libpve-common-perl: 4.0-75
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-66
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-qemu-kvm: 2.6.2-2
pve-container: 1.0-78
pve-firewall: 2.0-31
pve-ha-manager: 1.0-35
ksm-control-daemon: not correctly installed
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.5-1
lxcfs: 2.0.4-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
ceph: 10.2.3-1~bpo80+1

Stopping and starting the pve-cluster daemon then fixes the problem for a few minutes, before it occurs again.
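For completeness, the workaround is just the standard systemd restart (unit name as in the syslog above):

Code:
systemctl stop pve-cluster
systemctl start pve-cluster   # equivalent to: systemctl restart pve-cluster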

Any idea what is happening and what I can do to debug the problem?

Thanks
 
We still have the issue.
The daemon pmxcfs stops running randomly on our machines.

We found the following additional debug information (from systemctl status pve-cluster):

Code:
pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
   Active: failed (Result: signal) since Mo 2016-10-17 13:01:01 CEST; 2h 8min ago
  Process: 3788 ExecStartPost=/usr/bin/pvecm updatecerts --silent (code=exited, status=0/SUCCESS)
  Process: 3737 ExecStart=/usr/bin/pmxcfs $DAEMON_OPTS (code=exited, status=0/SUCCESS)
Main PID: 3784 (code=killed, signal=ABRT)

Okt 17 12:52:58 san05 pmxcfs[3784]: [status] notice: received sync request (epoch 1/2230/00000027)
Okt 17 12:52:58 san05 pmxcfs[3784]: [dcdb] notice: received all states
Okt 17 12:52:58 san05 pmxcfs[3784]: [dcdb] notice: leader is 1/2230
Okt 17 12:52:58 san05 pmxcfs[3784]: [dcdb] notice: synced members: 1/2230, 2/2416, 3/26511, 4/3784, 5/24979, 6/3629, 7/20736, 9/31488, 10/4122, 11/9838
Okt 17 12:52:58 san05 pmxcfs[3784]: [dcdb] notice: all data is up to date
Okt 17 12:52:58 san05 pmxcfs[3784]: [status] notice: received all states
Okt 17 12:52:58 san05 pmxcfs[3784]: [status] notice: all data is up to date
Okt 17 13:00:37 san05 pmxcfs[3784]: [status] notice: received log
Okt 17 13:01:01 san05 systemd[1]: pve-cluster.service: main process exited, code=killed, status=6/ABRT
Okt 17 13:01:01 san05 systemd[1]: Unit pve-cluster.service entered failed state.
 
Okay, now I was able to capture a trace of the pmxcfs process as it was dying.

I think the reason for the crash is a corrupted cluster database?

Code:
[pid 25704] writev(2, [{"*** Error in `", 14}, {"/usr/bin/pmxcfs", 15}, {"': ", 3}, {"corrupted double-linked list", 28}, {": 0x", 4}, {"00007f8fac023a50", 16}, {" ***\n", 5}], 7 <unfinished ...>

I do not remember us doing anything unusual. We did not add or remove hosts from the cluster. The problem has existed since the update from Proxmox 4.2 to Proxmox 4.3.
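For anyone wanting to capture the same thing: a trace like this can be taken by attaching strace to the running daemon, roughly like this (flags and output path are just one way to do it):

Code:
# follow all threads of the running pmxcfs and log each syscall with timestamps
strace -f -tt -o /tmp/pmxcfs.trace -p $(pidof pmxcfs)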

A more verbose trace is here:

Code:
[pid 25704] <... read resumed> "4\0\0\0\1\0\0\0\335\272\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 135168) = 52
[pid 14291] read(7, <unfinished ...>
[pid 25704] open("/dev/tty", O_RDWR|O_NOCTTY|O_NONBLOCK) = -1 ENXIO (No such device or address)
[pid 25704] writev(2, [{"*** Error in `", 14}, {"/usr/bin/pmxcfs", 15}, {"': ", 3}, {"corrupted double-linked list", 28}, {": 0x", 4}, {"00007f8fac023a50", 16}, {" ***\n", 5}], 7 <unfinished ...>
[pid 25684] writev(7, [{"\220\0\0\0\0\0\0\0\334\272\0\0\0\0\0\0", 16}, {"\361\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0"..., 128}], 2 <unfinished ...>
[pid 25704] <... writev resumed> ) = 85
[pid 25704] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
[pid 25684] <... writev resumed> ) = 144
[pid 25707] <... read resumed> "4\0\0\0\1\0\0\0\336\272\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 135168) = 52
[pid 25704] <... mmap resumed> ) = 0x7f8fe2e4c000
[pid 25707] writev(7, [{"\220\0\0\0\0\0\0\0\336\272\0\0\0\0\0\0", 16}, {"\362\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0"..., 128}], 2 <unfinished ...>
[pid 25704] rt_sigprocmask(SIG_UNBLOCK, [ABRT], <unfinished ...>
[pid 25707] <... writev resumed> ) = 144
[pid 25704] <... rt_sigprocmask resumed> NULL, 8) = 0
[pid 25707] read(7, <unfinished ...>
[pid 25706] <... read resumed> "1\0\0\0\1\0\0\0\337\272\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 135168) = 49
[pid 25704] tgkill(14283, 25704, SIGABRT <unfinished ...>
[pid 25684] read(7, <unfinished ...>
[pid 25704] <... tgkill resumed> ) = 0
[pid 25704] --- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=14283, si_uid=0} ---
[pid 25706] writev(7, [{"\220\0\0\0\0\0\0\0\337\272\0\0\0\0\0\0", 16}, {"\363\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0"..., 128}], 2 <ptrace(SYSCALL):No such process>
[pid 17040] <... read resumed> ) = ? <unavailable>
[pid 14330] <... read resumed> ) = ? <unavailable>
[pid 25705] +++ killed by SIGABRT +++
[pid 25684] +++ killed by SIGABRT +++
[pid 25704] +++ killed by SIGABRT +++
[pid 25707] +++ killed by SIGABRT +++
[pid 25706] +++ killed by SIGABRT +++
[pid 17040] +++ killed by SIGABRT +++
[pid 14330] +++ killed by SIGABRT +++
 
Hi,

we just faced the same issue two days ago, without any apparent trigger, on PVE 4.4-15/7599e35a:

Code:
Aug 24 12:59:54 node systemd[1]: pve-cluster.service: main process exited, code=killed, status=6/ABRT
Aug 24 12:59:54 node systemd[1]: Unit pve-cluster.service entered failed state.

We have no idea what the problem was. Could this be a bug in pve-cluster/pmxcfs?

How did you get your trace? Our node was fenced after this, which is expected when pve-cluster is gone.

Is there any chance to get more info on this, @ Proxmox staff? Thanks!
 
I've also been having issues lately with pmxcfs, although in my case it causes a soft lockup on the host and requires a physical reboot.
 
The issue triggered again here :(

It seems like this happens when reading files from /etc/pve (we use collectd to read stats from /etc/pve/.rrd every minute).
Is it possible that there is some kind of read lock or read race condition on /etc/pve, or is it generally not a good idea to read from there periodically, @ Proxmox staff?
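If it helps to reproduce, our access pattern boils down to a periodic read of that virtual file; a crude stand-in for the collectd job would be:

Code:
# rough equivalent of our collectd plugin: read the pmxcfs RRD dump once a minute
while true; do
    cat /etc/pve/.rrd > /dev/null
    sleep 60
done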
 
You could try installing and enabling systemd-coredump and see if it collects a core dump when pmxcfs crashes. If you install the pve-cluster-dbg package, you should then be able to show a backtrace using "coredumpctl info pmxcfs".
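A minimal sketch of those steps (systemd-coredump comes from the standard Debian repositories, pve-cluster-dbg from the Proxmox ones):

Code:
apt-get install systemd-coredump pve-cluster-dbg   # coredump collector plus debug symbols
# after the next crash:
coredumpctl list pmxcfs   # verify that a dump was captured
coredumpctl info pmxcfs   # print metadata and a backtrace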
 
