Some client(s) are unable to authenticate to Ceph, but which ones?

ikogan

Well-Known Member
Apr 8, 2017
This past weekend I finished migrating my config management to Ansible and updated all PVE packages, rebooting each node as I did so. Afterwards, 4 of my 5 nodes started logging errors that some Ceph client(s) using the client.admin identity are unable to authenticate:

Code:
2025-03-10T12:21:05.216-0400 772cb2a006c0  0 cephx server client.admin: handle_request failed to decode CephXAuthenticate: End of buffer [buffer:2]
2025-03-10T12:21:05.216-0400 79df80e006c0  0 cephx server client.admin: handle_request failed to decode CephXAuthenticate: End of buffer [buffer:2]
2025-03-10T12:20:37.002-0400 759b9b6006c0  0 cephx server client.admin: handle_request failed to decode CephXAuthenticate: End of buffer [buffer:2]
2025-03-10T12:20:37.002-0400 7fc8c4e006c0  0 cephx server client.admin: handle_request failed to decode CephXAuthenticate: End of buffer [buffer:2]
2025-03-10T12:20:37.001-0400 79df80e006c0  0 cephx server client.admin: handle_request failed to decode CephXAuthenticate: End of buffer [buffer:2]
2025-03-10T12:20:36.682-0400 772cb2a006c0  0 cephx server client.admin: handle_request failed to decode CephXAuthenticate: End of buffer [buffer:2]
2025-03-10T12:20:36.682-0400 79df80e006c0  0 cephx server client.admin: handle_request failed to decode CephXAuthenticate: End of buffer [buffer:2]
2025-03-10T12:20:36.682-0400 7fc8c4e006c0  0 cephx server client.admin: handle_request failed to decode CephXAuthenticate: End of buffer [buffer:2]
2025-03-10T12:20:19.893-0400 772cb2a006c0  0 cephx server client.admin: handle_request failed to decode CephXAuthenticate: End of buffer [buffer:2]
2025-03-10T12:20:19.892-0400 7fc8c4e006c0  0 cephx server client.admin: handle_request failed to decode CephXAuthenticate: End of buffer [buffer:2]
2025-03-10T12:20:19.892-0400 759b9b6006c0  0 cephx server client.admin: handle_request failed to decode CephXAuthenticate: End of buffer [buffer:2]
2025-03-10T12:19:36.317-0400 7fc8c4e006c0  0 cephx server client.admin: handle_request failed to decode CephXAuthenticate: End of buffer [buffer:2]
2025-03-10T12:19:36.316-0400 79df886006c0  0 cephx server client.admin: handle_request failed to decode CephXAuthenticate: End of buffer [buffer:2]

As you can see, it happens about once a minute. The problem is... everything still seems to work. The PVE web UI works, VMs can access their storage, the Kubernetes cluster can access its storage, autofs mounts CephFS fine, and I have Ceph data in both Prometheus and InfluxDB. I don't share my client.admin key outside of the PVE hypervisors themselves, so the failing client has to be on the nodes. Interestingly, the 5th node, the one that doesn't log this error, is the node that had a bootloader problem I had to fix, but I suspect that's unrelated. I've been tailing the entire system journal (I've configured Ceph to log to the journal, not to files) and I don't see anything in there about any kind of failure.
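
My next idea is to temporarily bump the monitors' messenger and auth debug levels, since at the default level 0 the cephx failure is logged without the peer address of the client. Something like this sketch, assuming the centralized config database is in use (the exact debug levels needed are a guess on my part):

Code:
# Temporarily raise mon debug levels; the cephx failures should then
# show up alongside messenger lines that include the peer address.
ceph config set mon debug_ms 1/1
ceph config set mon debug_auth 20/20
# ...wait for the errors to recur, then revert to the defaults:
ceph config rm mon debug_ms
ceph config rm mon debug_auth
# Also worth a look: which client versions/feature sets are connected.
ceph features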

Is it possible to get more information about what's failing? I'd rather not discover sometime down the road that a piece that's been slightly broken all along is now causing massive failures. Here are the log lines from all nodes around the time of those errors:

Code:
2025-03-10T12:21:09.614-0400 772cada006c0  0 mon.svantevit@1(peon) e28 handle_command mon_command({"prefix":"df","format":"json"} v 0) v1
2025-03-10T12:21:08.641-0400 7f26e46006c0  1 mon.triglav@2(peon).osd e1255924 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 335544320 full_alloc: 364904448 kv_alloc: 314572800
2025-03-10T12:21:08.513-0400 772cb0c006c0  1 mon.svantevit@1(peon).osd e1255924 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 339738624 full_alloc: 360710144 kv_alloc: 314572800
2025-03-10T12:21:08.221-0400 79df836006c0  0 log_channel(audit) log [DBG] : from='mgr.1116134312 10.10.0.20:0/2110' entity='mgr.zorya' cmd=[{"prefix": "osd blocklist ls", "format": "json"}]: dispatch
2025-03-10T12:21:08.221-0400 79df836006c0  0 mon.lada@3(peon) e28 handle_command mon_command({"prefix": "osd blocklist ls", "format": "json"} v 0) v1
2025-03-10T12:21:08.184-0400 7f26e14006c0  0 log_channel(audit) log [DBG] : from='client.? 10.10.0.3:0/4232461125' entity='client.admin' cmd=[{"prefix":"df","format":"json"}]: dispatch
2025-03-10T12:21:08.184-0400 7f26e14006c0  0 mon.triglav@2(peon) e28 handle_command mon_command({"prefix":"df","format":"json"} v 0) v1
2025-03-10T12:21:08.135-0400 759b9d4006c0  0 log_channel(audit) log [DBG] : from='client.? 10.10.0.1:0/2536031223' entity='client.admin' cmd=[{"prefix":"df","format":"json"}]: dispatch
2025-03-10T12:21:08.135-0400 759b9d4006c0  0 mon.perun@0(leader) e28 handle_command mon_command({"prefix":"df","format":"json"} v 0) v1
2025-03-10T12:21:08.087-0400 759b9d4006c0  0 log_channel(audit) log [DBG] : from='client.? 10.10.0.1:0/463361600' entity='client.admin' cmd=[{"prefix":"df","format":"json"}]: dispatch
2025-03-10T12:21:08.087-0400 759b9d4006c0  0 mon.perun@0(leader) e28 handle_command mon_command({"prefix":"df","format":"json"} v 0) v1
2025-03-10T12:21:07.959-0400 7f26e14006c0  0 log_channel(audit) log [DBG] : from='client.? 10.10.0.3:0/1399881575' entity='client.admin' cmd=[{"format":"json","prefix":"df"}]: dispatch
2025-03-10T12:21:07.959-0400 7f26e14006c0  0 mon.triglav@2(peon) e28 handle_command mon_command({"format":"json","prefix":"df"} v 0) v1
2025-03-10T12:21:07.939-0400 759b9d4006c0  0 log_channel(audit) log [DBG] : from='client.? 10.10.0.2:0/2505811715' entity='client.admin' cmd=[{"prefix":"df","format":"json"}]: dispatch
2025-03-10T12:21:07.939-0400 759b9d4006c0  0 mon.perun@0(leader) e28 handle_command mon_command({"prefix":"df","format":"json"} v 0) v1
2025-03-10T12:21:07.737-0400 772cada006c0  0 log_channel(audit) log [DBG] : from='client.? 10.10.0.2:0/2805278002' entity='client.admin' cmd=[{"format":"json","prefix":"df"}]: dispatch
2025-03-10T12:21:07.737-0400 772cada006c0  0 mon.svantevit@1(peon) e28 handle_command mon_command({"format":"json","prefix":"df"} v 0) v1
2025-03-10T12:21:07.589-0400 759ba06006c0  1 mon.perun@0(leader).osd e1255924 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 335544320 full_alloc: 369098752 kv_alloc: 310378496
2025-03-10T12:21:07.348-0400 772ca9e006c0  4 rocksdb: [db/db_impl/db_impl.cc:1111]
** DB Stats **
Uptime(secs): 71400.6 total, 600.0 interval
Cumulative writes: 217K writes, 1324K keys, 217K commit groups, 1.0 writes per commit group, ingest: 2.57 GB, 0.04 MB/s
Cumulative WAL: 217K writes, 217K syncs, 1.00 writes per sync, written: 2.57 GB, 0.04 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 1960 writes, 11K keys, 1960 commit groups, 1.0 writes per commit group, ingest: 23.55 MB, 0.04 MB/s
Interval WAL: 1960 writes, 1960 syncs, 1.00 writes per sync, written: 0.02 GB, 0.04 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent

** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      0/0    0.00 KB   0.0      0.0     0.0      0.0       1.6      1.6       0.0   1.0      0.0     17.3     95.19              5.02       749    0.127       0      0       0.0       0.0
  L6      3/0   145.53 MB   0.0    106.6     1.6    105.0     105.0      0.0       0.0  65.3     31.3     30.8   3484.69            442.00       748    4.659    297M   788K       0.0       0.0
 Sum      3/0   145.53 MB   0.0    106.6     1.6    105.0     106.6      1.6       0.0  66.3     30.5     30.5   3579.89            447.01      1497    2.391    297M   788K       0.0       0.0
 Int      0/0    0.00 KB   0.0      0.8     0.0      0.8       0.8      0.0       0.0  62.7     25.4     25.5     34.09              3.48        12    2.841   2388K   6180       0.0       0.0

** Compaction Stats [default] **
Priority    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Low      0/0    0.00 KB   0.0    106.6     1.6    105.0     105.0      0.0       0.0   0.0     31.3     30.8   3484.69            442.00       748    4.659    297M   788K       0.0       0.0
High      0/0    0.00 KB   0.0      0.0     0.0      0.0       1.6      1.6       0.0   0.0      0.0     17.3     95.03              5.02       748    0.127       0      0       0.0       0.0
User      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0     31.7      0.16              0.00         1    0.164       0      0       0.0       0.0

Blob file count: 0, total size: 0.0 GB, garbage size: 0.0 GB, space amp: 0.0

Uptime(secs): 71400.6 total, 600.0 interval
Flush(GB): cumulative 1.608, interval 0.014
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 106.58 GB write, 1.53 MB/s write, 106.58 GB read, 1.53 MB/s read, 3579.9 seconds
Interval compaction: 0.85 GB write, 1.45 MB/s write, 0.85 GB read, 1.44 MB/s read, 34.1 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
Block cache BinnedLRUCache@0x56f3aac031f0#2659 capacity: 300.00 MB usage: 297.11 MB table_size: 0 occupancy: 18446744073709551615 collections: 120 last_copies: 0 last_secs: 0.001504 secs_since: 0
Block cache entry stats(count,size,portion): DataBlock(8242,159.64 MB,53.212%) FilterBlock(408,98.65 MB,32.8823%) IndexBlock(409,38.83 MB,12.942%) Misc(1,0.00 KB,0%)

** File Read Latency Histogram By Level [default] **
2025-03-10T12:21:07.348-0400 772ca9e006c0  4 rocksdb: [db/db_impl/db_impl.cc:1109] ------- DUMPING STATS -------
2025-03-10T12:21:06.935-0400 79df868006c0  1 mon.lada@3(peon).osd e1255924 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 335544320 full_alloc: 369098752 kv_alloc: 310378496
2025-03-10T12:21:05.217-0400 7fc8c4e006c0  0 cephx server client.admin: handle_request failed to decode CephXAuthenticate: End of buffer [buffer:2]
2025-03-10T12:21:05.216-0400 772cb2a006c0  0 cephx server client.admin: handle_request failed to decode CephXAuthenticate: End of buffer [buffer:2]
2025-03-10T12:21:05.216-0400 79df80e006c0  0 cephx server client.admin: handle_request failed to decode CephXAuthenticate: End of buffer [buffer:2]
2025-03-10T12:21:04.771-0400 7fc8ca8006c0  1 mon.zorya@4(peon).osd e1255924 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 331350016 full_alloc: 369098752 kv_alloc: 310378496
2025-03-10T12:21:03.641-0400 7f26e46006c0  1 mon.triglav@2(peon).osd e1255924 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 335544320 full_alloc: 364904448 kv_alloc: 314572800
2025-03-10T12:21:03.513-0400 772cb0c006c0  1 mon.svantevit@1(peon).osd e1255924 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 339738624 full_alloc: 360710144 kv_alloc: 314572800
2025-03-10T12:21:02.588-0400 759ba06006c0  1 mon.perun@0(leader).osd e1255924 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 335544320 full_alloc: 369098752 kv_alloc: 310378496
2025-03-10T12:21:01.934-0400 79df868006c0  1 mon.lada@3(peon).osd e1255924 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 335544320 full_alloc: 369098752 kv_alloc: 310378496
2025-03-10T12:21:01.777-0400 772cada006c0  0 log_channel(audit) log [DBG] : from='client.? 10.10.0.20:0/3445141623' entity='client.admin' cmd=[{"format":"json","prefix":"df"}]: dispatch
2025-03-10T12:21:01.777-0400 772cada006c0  0 mon.svantevit@1(peon) e28 handle_command mon_command({"format":"json","prefix":"df"} v 0) v1
2025-03-10T12:21:01.709-0400 759b9d4006c0  0 log_channel(audit) log [DBG] : from='client.? 10.10.0.20:0/522833880' entity='client.admin' cmd=[{"format":"json","prefix":"df"}]: dispatch
2025-03-10T12:21:01.709-0400 759b9d4006c0  0 mon.perun@0(leader) e28 handle_command mon_command({"format":"json","prefix":"df"} v 0) v1
2025-03-10T12:20:59.770-0400 7fc8ca8006c0  1 mon.zorya@4(peon).osd e1255924 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 331350016 full_alloc: 369098752 kv_alloc: 310378496
2025-03-10T12:20:58.682-0400 79df836006c0  0 log_channel(audit) log [DBG] : from='client.? 10.10.0.32:0/1762198447' entity='client.admin' cmd=[{"prefix":"df","format":"json"}]: dispatch
2025-03-10T12:20:58.682-0400 79df836006c0  0 mon.lada@3(peon) e28 handle_command mon_command({"prefix":"df","format":"json"} v 0) v1
2025-03-10T12:20:58.613-0400 7f26e46006c0  1 mon.triglav@2(peon).osd e1255924 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 335544320 full_alloc: 364904448 kv_alloc: 314572800
2025-03-10T12:20:58.512-0400 772cb0c006c0  1 mon.svantevit@1(peon).osd e1255924 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 339738624 full_alloc: 360710144 kv_alloc: 314572800
2025-03-10T12:20:58.448-0400 759b9d4006c0  0 log_channel(audit) log [DBG] : from='client.? 10.10.0.1:0/1664074803' entity='client.admin' cmd=[{"prefix":"df","format":"json"}]: dispatch
2025-03-10T12:20:58.448-0400 759b9d4006c0  0 mon.perun@0(leader) e28 handle_command mon_command({"prefix":"df","format":"json"} v 0) v1
2025-03-10T12:20:58.407-0400 772cada006c0  0 log_channel(audit) log [DBG] : from='client.? 10.10.0.32:0/1459805445' entity='client.admin' cmd=[{"format":"json","prefix":"df"}]: dispatch
2025-03-10T12:20:58.407-0400 772cada006c0  0 mon.svantevit@1(peon) e28 handle_command mon_command({"format":"json","prefix":"df"} v 0) v1
2025-03-10T12:20:58.295-0400 759b9d4006c0  0 log_channel(audit) log [DBG] : from='client.? 10.10.0.1:0/377243753' entity='client.admin' cmd=[{"prefix":"df","format":"json"}]: dispatch
2025-03-10T12:20:58.295-0400 759b9d4006c0  0 mon.perun@0(leader) e28 handle_command mon_command({"prefix":"df","format":"json"} v 0) v1
2025-03-10T12:20:58.221-0400 759b9d4006c0  0 log_channel(audit) log [DBG] : from='client.? 10.10.0.2:0/1383024204' entity='client.admin' cmd=[{"prefix":"df","format":"json"}]: dispatch
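
Failing that, I suppose I can walk each node and list which processes actually hold connections to the mons, then match them against the source addresses in the audit log above. A rough sketch of what I have in mind:

Code:
# On each node: list processes with TCP connections to the mon ports
# (6789 for the msgr v1 protocol, 3300 for v2).
ss -tnp | grep -E ':(6789|3300)'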