pvestatd filling up /tmp with ceph-client.admin.$pid.$random.asok socket files

biggles

Member
Feb 18, 2021
Hi,

My name is Erik and I am a longtime lurker, but since this is the first time I could not resolve an issue using the documentation or this forum, I thought I'd register and ask.

I have a hyperconverged 3-node cluster with Ceph that was recently upgraded from PVE 4.x with Ceph Hammer to PVE 6.3 with Ceph Nautilus, following the PVE upgrade guides. I have to say that PVE has been nothing but impressive when it comes to upgrades. Compliments to the software and to the docs!

The snag that has hit me this time is that, while everything still works OK, the /tmp directory has started to fill up with Ceph admin socket files that seem to belong to processes that no longer exist. I caught on to the problem because monitoring alerted that / was running out of inodes.

After some digging, it seems it is the pvestatd process that forks a subprocess (renamed to "pverados") which creates these files and does bind() and listen() on the socket, but does not delete them when the child process exits. Strace shows no attempt to unlink() them as far as I can tell.
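For reference, this is roughly how I watched for it; the pgrep lookup is just an assumption on my part that it finds the daemon by name, so adjust the PID as needed:

Bash:
# follow forks and show only the socket-related syscalls plus any unlink attempts
strace -f -e trace=bind,listen,unlink,unlinkat -p "$(pgrep -o pvestatd)" 2>&1 | grep asok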

One such file is created every two seconds or so, and scripted cleanup of the files is a bit risky, as there are legitimate Ceph admin sockets with the same naming structure, such as the ones used by running VMs.
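If it does come down to cleanup, my tentative idea (an untested sketch, so treat it as such) would be to only delete sockets whose embedded PID no longer exists:

Bash:
# untested sketch: remove only /tmp sockets whose owning process is gone
# naming is ceph-client.admin.<pid>.<cctid>.asok, so the PID is the third dot-separated field
for sock in /tmp/ceph-client.admin.*.asok; do
    pid=$(basename "$sock" | cut -d. -f3)
    kill -0 "$pid" 2>/dev/null || rm -f -- "$sock"
done

PID reuse could leave the odd stale file behind, but it should never remove a socket that is actually in use.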

Versions are:
pve-manager/6.3-3/eee5f901 (running kernel: 5.4.78-2-pve)
ceph 14.2.16-pve1

Does anyone have an idea what might be wrong?

Best regards,

/Erik

root@pve2:~# ls -ltr /tmp/ | grep ceph-client | tail -10
srwxr-xr-x 1 root root 0 Feb 18 18:03 ceph-client.admin.1299363.93951616497296.asok
srwxr-xr-x 1 root root 0 Feb 18 18:03 ceph-client.admin.1299383.93951616497296.asok
srwxr-xr-x 1 root root 0 Feb 18 18:03 ceph-client.admin.1299410.93951687626256.asok
srwxr-xr-x 1 root root 0 Feb 18 18:03 ceph-client.admin.1299440.93932732736024.asok
srwxr-xr-x 1 root root 0 Feb 18 18:04 ceph-client.admin.1299474.93951687628944.asok
srwxr-xr-x 1 root root 0 Feb 18 18:04 ceph-client.admin.1299518.93951687626256.asok
srwxr-xr-x 1 root root 0 Feb 18 18:04 ceph-client.admin.1299538.93951688944176.asok
srwxr-xr-x 1 root root 0 Feb 18 18:04 ceph-client.admin.1299570.93932732730872.asok
srwxr-xr-x 1 root root 0 Feb 18 18:04 ceph-client.admin.1299597.93951616503840.asok
srwxr-xr-x 1 root root 0 Feb 18 18:04 ceph-client.admin.1299618.93951616503600.asok
root@pve2:~# ls -ltr /tmp/ | grep ceph-client | tail -10
srwxr-xr-x 1 root root 0 Feb 18 18:04 ceph-client.admin.1299518.93951687626256.asok
srwxr-xr-x 1 root root 0 Feb 18 18:04 ceph-client.admin.1299538.93951688944176.asok
srwxr-xr-x 1 root root 0 Feb 18 18:04 ceph-client.admin.1299570.93932732730872.asok
srwxr-xr-x 1 root root 0 Feb 18 18:04 ceph-client.admin.1299597.93951616503840.asok
srwxr-xr-x 1 root root 0 Feb 18 18:04 ceph-client.admin.1299618.93951616503600.asok
srwxr-xr-x 1 root root 0 Feb 18 18:04 ceph-client.admin.1299637.93951687626784.asok
srwxr-xr-x 1 root root 0 Feb 18 18:04 ceph-client.admin.1299671.93932732737768.asok
srwxr-xr-x 1 root root 0 Feb 18 18:04 ceph-client.admin.1299696.93951616503120.asok
srwxr-xr-x 1 root root 0 Feb 18 18:04 ceph-client.admin.1299717.93951616503088.asok
srwxr-xr-x 1 root root 0 Feb 18 18:04 ceph-client.admin.1299736.93951688951504.asok
root@pve2:~# ls -ltr /tmp/ | grep ceph-client | wc -l
349152
 
Hi!

This is really a bit weird. And the ceph cluster works OK? Anything in the journal (or syslog)?

Can you please post your ceph.conf:
/etc/ceph/ceph.conf
 
Hi Thomas,

Thanks for getting back!

The cluster works perfectly as far as I can tell; ceph.conf is below. I notice there is an admin socket line in the [client] section which is not present on another cluster of mine, where that option is not set. This cluster is also the oldest one I have and could have started out on Proxmox 3.

/Erik

[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
auth supported = cephx
cluster network = 10.33.2.0/24
filestore xattr use omap = true
fsid = 6056726b-53ae-4487-a354-c4d0c8c12dc4
osd journal size = 5120
osd pool default min size = 1
public network = 10.33.2.0/24
mon_host = 10.33.2.232 10.33.2.233 10.33.2.231
mon allow pool delete = true

[osd]
# keyring = /var/lib/ceph/osd/ceph-$id/keyring
osd_crush_initial_weight = 0

[mon.1]
host = lkp2sx0032
mon addr = 10.33.2.232:6789

[mon.2]
host = lkp2sx0033
mon addr = 10.33.2.233:6789

[mon.0]
host = lkp2sx0031
mon addr = 10.33.2.231:6789

[client]
admin socket = /tmp/$cluster-$type.$id.$pid.$cctid.asok
keyring = /etc/pve/priv/$cluster.$name.keyring
 
I notice there is an admin socket line in the [client] section which is not present on another cluster of mine, where that option is not set.
Yeah, that seems out of place, especially as the default admin socket path is in /run (or /var/run, pretty much the same nowadays), which is guaranteed to be a tmpfs (a memory-backed ephemeral fs).
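If you want to double-check that on your nodes, something along these lines should show /run as tmpfs:

Bash:
findmnt -o TARGET,FSTYPE,SIZE /run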

Just to confirm, that file is still a link to /etc/pve/ceph.conf?
Bash:
ls -l /etc/ceph/ceph.conf
lrwxrwxrwx 1 root root 18 Jan 24  2020 /etc/ceph/ceph.conf -> /etc/pve/ceph.conf

Also, what does ceph's run directory look like:
Bash:
ls -la /var/run/ceph/
 
Yes, the conf files are all symlinked to /etc/pve. /var/run/ceph looks like:

root@lkp2sx0031:~# ls -la /var/run/ceph/
total 0
drwxrwx--- 2 ceph ceph 120 Feb 8 16:20 .
drwxr-xr-x 36 root root 1580 Feb 19 13:00 ..
srwxr-xr-x 1 ceph ceph 0 Feb 8 16:20 ceph-mgr.lkp2sx0031.asok
srwxr-xr-x 1 ceph ceph 0 Feb 8 16:19 ceph-mon.0.asok
srwxr-xr-x 1 ceph ceph 0 Feb 8 16:20 ceph-osd.3.asok
srwxr-xr-x 1 ceph ceph 0 Feb 8 16:20 ceph-osd.6.asok

It does indeed seem to point to the lingering admin socket directive being the issue. Could this be a leftover default option that was set by Proxmox 3.2 with its bundled Ceph, whatever version that was - Firefly perhaps?

The question now is how to safely get rid of it, preferably without downtime. Aside from pvestatd, running VMs use the socket location. Which Proxmox services need to be restarted to re-read that config (if any)? I guess changing the Ceph config and live-migrating all VMs should take care of it?
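My rough plan, completely untested, would be along these lines (the VM ID and target node are placeholders):

Bash:
# edit /etc/pve/ceph.conf and remove the "admin socket = ..." line from [client]
systemctl restart pvestatd        # newly forked pverados workers should then use the default /run path
qm migrate 100 pve2 --online      # repeat per VM so each QEMU process re-reads ceph.conf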

/Erik
 
