pmxcfs crash?

kev1904

Hello,

yesterday evening several nodes in our cluster rebooted: 4 of the 5 compute nodes, to be precise. The one node that did not reboot then failed to rejoin the cluster automatically after the others had come back up; I had to help it along with killall corosync and a restart of pve-cluster.
All nodes, including the ones that did not reboot, show a segfault in pmxcfs:


Feb 25 21:43:08 prox03 pmxcfs[7152]: [dcdb] notice: members: 1/23948, 3/7152, 4/30371, 8/20357, 9/3026, 10/3219060, 11/2665251
Feb 25 21:43:08 prox03 pmxcfs[7152]: [dcdb] notice: starting data syncronisation
Feb 25 21:43:08 prox03 pmxcfs[7152]: [status] notice: members: 1/23948, 3/7152, 4/30371, 8/20357, 9/3026, 10/3219060, 11/2665251
Feb 25 21:43:08 prox03 pmxcfs[7152]: [status] notice: starting data syncronisation
Feb 25 21:43:08 prox03 pmxcfs[7152]: [dcdb] notice: received sync request (epoch 1/23948/0000009D)
Feb 25 21:43:08 prox03 pmxcfs[7152]: [status] notice: received sync request (epoch 1/23948/0000008B)
Feb 25 21:43:08 prox03 kernel: [8765900.825555] cfs_loop[7153]: segfault at 7fba8a7360f1 ip 000056080bd0a820 sp 00007fba1d789318 error 4 in pmxcfs[56080bcf1000+1b000]
Feb 25 21:43:08 prox03 kernel: [8765900.825567] Code: 10 48 89 c6 48 89 ef 48 89 10 48 8b 53 08 48 89 50 08 48 89 c2 e8 e0 73 fe ff b8 01 00 00 00 e9 4a ff ff ff 66 0f 1f 44 00 00 <8b> 47 0c 8b 56 0c 39 d0 75 0d 48 8b 47 10 48 8b 56 10 48 39 d0 74
Feb 25 21:43:08 prox03 systemd[1]: Created slice system-systemd\x2dcoredump.slice.
Feb 25 21:43:08 prox03 systemd[1]: Started Process Core Dump (PID 14451/UID 0).
[run of NUL bytes (^@) in the log file]
Feb 25 21:48:17 prox03 systemd[1]: Starting Flush Journal to Persistent Storage...
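
For readers who have not decoded a kernel segfault line before, here is a minimal Python sketch that pulls the fields out of the prox03 message above and interprets the error value. It assumes the usual x86 page-fault error-code bits (bit 0: protection violation or page not present, bit 1: write or read, bit 2: user or kernel mode). The names `parse_segfault` and `decode_error` are made up for this illustration and are not part of any Proxmox tooling.

```python
import re

# Kernel message copied from the prox03 log above (timestamp prefix removed).
LINE = (
    "cfs_loop[7153]: segfault at 7fba8a7360f1 ip 000056080bd0a820 "
    "sp 00007fba1d789318 error 4 in pmxcfs[56080bcf1000+1b000]"
)

# Field layout of the kernel's segfault message; all numbers are hex
# except the PID.
PATTERN = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_error(code):
    """Interpret the low three bits of an x86 page-fault error code."""
    return {
        "cause": "protection violation" if code & 1 else "page not present",
        "access": "write" if code & 2 else "read",
        "mode": "user" if code & 4 else "kernel",
    }


def parse_segfault(line):
    """Return the thread name, PID, faulting address and decoded error."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a kernel segfault line")
    return {
        "comm": match["comm"],
        "pid": int(match["pid"]),
        "fault_address": int(match["addr"], 16),
        "error": decode_error(int(match["err"], 16)),
    }


parsed = parse_segfault(LINE)
print(parsed["comm"], parsed["pid"], hex(parsed["fault_address"]))
print(parsed["error"])
```

Under that assumption, `error 4` reads as a user-mode read from a page that is not mapped: the `cfs_loop` thread of pmxcfs read through a pointer that did not point at valid memory.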
 
Which version is installed on the node? (pveversion -v)
 
And if the core dump still exists, it would be great if you could send it to us (coredumpctl list).
 
Unfortunately the core dump is no longer present on any of the systems, for whatever reason.

proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.78-1-pve: 5.4.78-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.55-1-pve: 5.4.55-1
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-4.10.17-2-pve: 4.10.17-20
ceph-fuse: 12.2.13-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-4
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-4
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.6-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-2
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-8
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-3
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1
 
Okay, thanks! Were the versions and log messages the same on all nodes? Is there anything that makes this cluster "special"?

We have had isolated reports of similar crashes, but unfortunately have not yet managed to get hold of a core dump or to reproduce the problem ourselves.
 
The versions are all the same.
There is nothing special about it; the cluster consists of 5 compute nodes and 3 Ceph nodes. Every node shows the same segfault message. However, none of the 3 storage nodes rebooted.

Feb 25 21:43:08 proxstore11 pmxcfs[3026]: [dcdb] notice: members: 1/23948, 3/7152, 4/30371, 8/20357, 9/3026, 10/3219060, 11/2665251
Feb 25 21:43:08 proxstore11 pmxcfs[3026]: [dcdb] notice: starting data syncronisation
Feb 25 21:43:08 proxstore11 pmxcfs[3026]: [status] notice: members: 1/23948, 3/7152, 4/30371, 8/20357, 9/3026, 10/3219060, 11/2665251
Feb 25 21:43:08 proxstore11 pmxcfs[3026]: [status] notice: starting data syncronisation
Feb 25 21:43:08 proxstore11 pmxcfs[3026]: [dcdb] notice: received sync request (epoch 1/23948/0000009D)
Feb 25 21:43:08 proxstore11 pmxcfs[3026]: [status] notice: received sync request (epoch 1/23948/0000008B)
Feb 25 21:43:08 proxstore11 kernel: [7627826.291291] cfs_loop[3027]: segfault at 7efcae8e3d51 ip 00005617ce921820 sp 00007efc4109e318 error 4 in pmxcfs[5617ce908000+1b000]
Feb 25 21:43:08 proxstore11 kernel: [7627826.291301] Code: 10 48 89 c6 48 89 ef 48 89 10 48 8b 53 08 48 89 50 08 48 89 c2 e8 e0 73 fe ff b8 01 00 00 00 e9 4a ff ff ff 66 0f 1f 44 00 00 <8b> 47 0c 8b 56 0c 39 d0 75 0d 48 8b 47 10 48 8b 56 10 48 39 d0 74
Feb 25 21:43:08 proxstore11 systemd[1]: pve-cluster.service: Main process exited, code=killed, status=11/SEGV
Feb 25 21:43:08 proxstore11 systemd[1]: pve-cluster.service: Failed with result 'signal'.
Feb 25 21:43:08 proxstore11 systemd[1]: pve-cluster.service: Service RestartSec=100ms expired, scheduling restart.
Feb 25 21:43:08 proxstore11 systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 1.
Feb 25 21:43:08 proxstore11 systemd[1]: Stopped The Proxmox VE cluster filesystem.
Feb 25 21:43:08 proxstore11 systemd[1]: Starting The Proxmox VE cluster filesystem...
Feb 25 21:43:08 proxstore11 pveproxy[1962001]: ipcc_send_rec[1] failed: Connection refused
Feb 25 21:43:08 proxstore11 pveproxy[1962001]: ipcc_send_rec[2] failed: Connection refused
Feb 25 21:43:08 proxstore11 pveproxy[1962001]: ipcc_send_rec[3] failed: Connection refused
Feb 25 21:43:08 proxstore11 pmxcfs[1966680]: fuse: failed to access mountpoint /etc/pve: Transport endpoint is not connected
Feb 25 21:43:08 proxstore11 pmxcfs[1966680]: [main] crit: fuse_mount error: Transport endpoint is not connected
Feb 25 21:43:08 proxstore11 pmxcfs[1966680]: [main] crit: fuse_mount error: Transport endpoint is not connected
Feb 25 21:43:08 proxstore11 pmxcfs[1966680]: [main] notice: exit proxmox configuration filesystem (-1)
Feb 25 21:43:08 proxstore11 pmxcfs[1966680]: [main] notice: exit proxmox configuration filesystem (-1)
Feb 25 21:43:08 proxstore11 systemd[1]: pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION
Feb 25 21:43:08 proxstore11 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Feb 25 21:43:08 proxstore11 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Feb 25 21:43:09 proxstore11 pveproxy[1962001]: ipcc_send_rec[1] failed: Connection refused
Feb 25 21:43:09 proxstore11 pveproxy[1962001]: ipcc_send_rec[2] failed: Connection refused
Feb 25 21:43:09 proxstore11 pveproxy[1962001]: ipcc_send_rec[3] failed: Connection refused
Feb 25 21:43:09 proxstore11 systemd[1]: pve-cluster.service: Service RestartSec=100ms expired, scheduling restart.
Feb 25 21:43:09 proxstore11 systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 2.
Feb 25 21:43:09 proxstore11 systemd[1]: Stopped The Proxmox VE cluster filesystem.
Feb 25 21:43:09 proxstore11 systemd[1]: Starting The Proxmox VE cluster filesystem...
Feb 25 21:43:09 proxstore11 pmxcfs[1966799]: fuse: failed to access mountpoint /etc/pve: Transport endpoint is not connected
Feb 25 21:43:09 proxstore11 pmxcfs[1966799]: [main] crit: fuse_mount error: Transport endpoint is not connected
Feb 25 21:43:09 proxstore11 pmxcfs[1966799]: [main] crit: fuse_mount error: Transport endpoint is not connected
Feb 25 21:43:09 proxstore11 pmxcfs[1966799]: [main] notice: exit proxmox configuration filesystem (-1)
Feb 25 21:43:09 proxstore11 pmxcfs[1966799]: [main] notice: exit proxmox configuration filesystem (-1)
Feb 25 21:43:09 proxstore11 systemd[1]: pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION
Feb 25 21:43:09 proxstore11 systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Feb 25 21:43:09 proxstore11 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Feb 25 21:43:09 proxstore11 systemd[1]: pve-cluster.service: Service RestartSec=100ms expired, scheduling restart.
Feb 25 21:43:09 proxstore11 systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 3.
Feb 25 21:43:09 proxstore11 systemd[1]: Stopped The Proxmox VE cluster filesystem.
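
Since kernel lines from two nodes are in this thread, one quick cross-check is whether the crash happened at the same place in the binary on both. The Python sketch below subtracts the mapping start that the kernel prints in `pmxcfs[start+size]` from the instruction pointer. The helper name `crash_offset` is hypothetical, and the comparison assumes both nodes run the same pmxcfs build, which matches the statement that all versions are identical.

```python
import re

# Kernel messages copied from the two logs in this thread
# (timestamp prefixes removed).
KERNEL_LINES = {
    "prox03": (
        "cfs_loop[7153]: segfault at 7fba8a7360f1 ip 000056080bd0a820 "
        "sp 00007fba1d789318 error 4 in pmxcfs[56080bcf1000+1b000]"
    ),
    "proxstore11": (
        "cfs_loop[3027]: segfault at 7efcae8e3d51 ip 00005617ce921820 "
        "sp 00007efc4109e318 error 4 in pmxcfs[5617ce908000+1b000]"
    ),
}

# Instruction pointer plus the "object[start+size]" mapping at the end.
MAPPING = re.compile(
    r"ip (?P<ip>[0-9a-f]+) .* in (?P<obj>[^\[\s]+)"
    r"\[(?P<start>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def crash_offset(line):
    """Return (object name, offset of ip inside the reported mapping)."""
    match = MAPPING.search(line)
    if match is None:
        raise ValueError("no mapping information in line")
    offset = int(match["ip"], 16) - int(match["start"], 16)
    if not 0 <= offset < int(match["size"], 16):
        raise ValueError("instruction pointer lies outside the mapping")
    return match["obj"], offset


offsets = {node: crash_offset(line) for node, line in KERNEL_LINES.items()}
for node, (obj, offset) in offsets.items():
    print(f"{node}: {obj}+{offset:#x}")
```

Both lines give the same offset, 0x19820, and the `Code:` bytes in the two logs match as well. That is consistent with the same instruction faulting on both nodes, but on its own it says nothing about the cause.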
 
