get nodes/$node/storage showing 0 bytes for ceph pool

alexskysilk

I have an intermittent problem with storage returning 0 values for a specific rbd pool. It's only happening on one cluster, and there doesn't seem to be a correlation to which node context is being called:

Code:
{"CODE":"OK","ERRORS":"","proxmoxRes":{"active":0,"avail":0,"content":"rootdir,images","enabled":1,"shared":1,"total":0,"type":"rbd","used":0,"data":{"active":0,"content":"rootdir,images","avail":0,"shared":1,"used":0,"total":0,"enabled":1,"type":"rbd"},"errors":null,"status":null,"success":1,"message":null},"request":null}

If I run the query in pvesh, I get a timeout before the 0 response:

Code:
pvesh get nodes/sky11/storage/vdisk-3pg/status
got timeout
200 OK
{
   "active" : 0,
   "avail" : 0,
   "content" : "rootdir,images",
   "enabled" : 1,
   "shared" : 1,
   "total" : 0,
   "type" : "rbd",
   "used" : 0
}

Why is it timing out? None of the nodes are overloaded, and pveproxy isn't showing any issues.

Code:
# pveversion -v
proxmox-ve: 5.2-2 (running kernel: 4.15.17-3-pve)
pve-manager: 5.2-3 (running version: 5.2-3/785ba980)
pve-kernel-4.15: 5.2-3
pve-kernel-4.15.17-3-pve: 4.15.17-13
pve-kernel-4.15.17-1-pve: 4.15.17-9
pve-kernel-4.15.3-1-pve: 4.15.3-1
ceph: 12.2.5-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-4
libpve-common-perl: 5.0-34
libpve-guest-common-perl: 2.0-17
libpve-http-server-perl: 2.0-9
libpve-storage-perl: 5.0-23
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-3
lxcfs: 3.0.0-1
novnc-pve: 1.0.0-1
proxmox-widget-toolkit: 1.0-19
pve-cluster: 5.0-27
pve-container: 2.0-23
pve-docs: 5.2-4
pve-firewall: 3.0-12
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.1-5
pve-xtermjs: 1.0-5
qemu-server: 5.0-29
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.9-pve1~bpo9
 
Is the storage accessible through the rbd command line? If it is an external ceph cluster, is the keyring file at /etc/pve/priv/ceph/?
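A quick way to check both from the affected node (a sketch only; it assumes the Ceph pool name matches the storage ID vdisk-3pg from the pvesh call above, adjust to whatever pool your storage.cfg actually points at):

Code:
# does the pool answer on the rbd level at all?
rbd -p vdisk-3pg ls

# for an external cluster, PVE expects the keyring for this storage
# to be present in this directory
ls -l /etc/pve/priv/ceph/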
 
Are all MONs accessible from the PVE node? The timeout could come from one MON not being reachable while the rest are.
 
Is the port of every MON accessible (telnet/netcat)? Maybe a firewall/routing issue?
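For example (a sketch; the three MON addresses below are placeholders, and 6789 is the default messenger port for Luminous MONs, adjust if yours listen elsewhere):

Code:
# test the MON port from the PVE node; a MON that does not answer here
# points at a routing/firewall problem rather than at Ceph itself
for mon in 10.10.10.1 10.10.10.2 10.10.10.3; do
    nc -zv -w 5 "$mon" 6789
done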
 
Does a 'ceph -m monhost mon_status' against each of the MONs work?
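Something along these lines (placeholder MON addresses again; the timeout just keeps a hanging MON from blocking the loop):

Code:
# ask every MON for its status directly; the one that hangs or times
# out is the likely source of the intermittent zero/empty results
for mon in 10.10.10.1 10.10.10.2 10.10.10.3; do
    echo "== $mon =="
    timeout 10 ceph -m "$mon" mon_status || echo "no reply from $mon"
done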

For the moment, I believe not all MONs are (equally?) reachable; I have seen such behavior produce these "sometimes empty" results in the past.

Not sure how/why it would be firewall related; there is no firewall (software or hardware) enabled on that subnet, it's dedicated to ceph traffic.
Just going through the usual questions; with remote diagnosis you never know what is and what isn't. ;)
 
For the moment, I believe not all MONs are (equally?) reachable; I have seen such behavior produce these "sometimes empty" results in the past.

That seems logical. I tried calling the monitors at random, and in at least one instance it just hung without replying. I will move the defective monitor, but how do I troubleshoot why it's not responding?
 
The logs on the MON may give some clues; if it is a network issue, you may see dropped packets.
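For example, on the suspect MON host (a sketch; it assumes the MON ID is the short hostname, as on a default PVE-managed MON, and eth0 stands in for whatever interface carries the ceph traffic):

Code:
# recent entries from the MON log (elections, slow ops, resets)
tail -n 200 /var/log/ceph/ceph-mon.$(hostname -s).log

# RX/TX error and drop counters on the ceph-facing interface
ip -s link show dev eth0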