get nodes/$node/storage showing 0 bytes for ceph pool

alexskysilk

I have this intermittent problem with storage returning 0 values for a specific rbd pool. It's only happening on one cluster, and there doesn't seem to be a correlation to which node context is being called:

Code:
{"CODE":"OK","ERRORS":"","proxmoxRes":{"active":0,"avail":0,"content":"rootdir,images","enabled":1,"shared":1,"total":0,"type":"rbd","used":0,"data":{"active":0,"content":"rootdir,images","avail":0,"shared":1,"used":0,"total":0,"enabled":1,"type":"rbd"},"errors":null,"status":null,"success":1,"message":null},"request":null}

If I run the query in pvesh, I get a timeout before the 0 response:

Code:
pvesh get nodes/sky11/storage/vdisk-3pg/status
got timeout
200 OK
{
   "active" : 0,
   "avail" : 0,
   "content" : "rootdir,images",
   "enabled" : 1,
   "shared" : 1,
   "total" : 0,
   "type" : "rbd",
   "used" : 0
}

Why is it timing out? None of the nodes are overloaded, and pveproxy isn't showing any issues.

Code:
# pveversion -v
proxmox-ve: 5.2-2 (running kernel: 4.15.17-3-pve)
pve-manager: 5.2-3 (running version: 5.2-3/785ba980)
pve-kernel-4.15: 5.2-3
pve-kernel-4.15.17-3-pve: 4.15.17-13
pve-kernel-4.15.17-1-pve: 4.15.17-9
pve-kernel-4.15.3-1-pve: 4.15.3-1
ceph: 12.2.5-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-4
libpve-common-perl: 5.0-34
libpve-guest-common-perl: 2.0-17
libpve-http-server-perl: 2.0-9
libpve-storage-perl: 5.0-23
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-3
lxcfs: 3.0.0-1
novnc-pve: 1.0.0-1
proxmox-widget-toolkit: 1.0-19
pve-cluster: 5.0-27
pve-container: 2.0-23
pve-docs: 5.2-4
pve-firewall: 3.0-12
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.1-5
pve-xtermjs: 1.0-5
qemu-server: 5.0-29
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.9-pve1~bpo9
 
Is the storage accessible through the rbd command line? If it is an external Ceph cluster, is the keyring file at /etc/pve/priv/ceph/?
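
For example, a quick sketch (the storage ID 'vdisk-3pg' is taken from the pvesh call above; the pool name, client name and monitor IP are assumptions, adjust to your setup):

Code:
# hyperconverged setup, with /etc/ceph/ceph.conf present on the node:
rbd ls -p vdisk-3pg
# external cluster: point at a MON and the keyring PVE stores for this storage
rbd ls -p vdisk-3pg -m 10.10.10.1 -n client.admin --keyring /etc/pve/priv/ceph/vdisk-3pg.keyring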
 
Are all MONs accessible from the PVE node? The timeout could come from one MON not being reachable while the rest are.
 
Is the port of every MON accessible (telnet/netcat)? Maybe a firewall/routing issue?
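
Something like this (MON IPs are placeholders; 6789 is the default MON port):

Code:
# check that each MON's TCP port answers, with a 2 second timeout
for ip in 10.10.10.1 10.10.10.2 10.10.10.3; do
    nc -vz -w 2 "$ip" 6789
done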
 
Does 'ceph -m <monhost> mon_status' work against each of the MONs?
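
Spelled out, something like this (addresses are placeholders):

Code:
# query each MON directly, bypassing the others
ceph -m 10.10.10.1:6789 mon_status
ceph -m 10.10.10.2:6789 mon_status
ceph -m 10.10.10.3:6789 mon_status

A MON that hangs here instead of returning its status JSON is a good suspect.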

For the moment, I believe not all MONs are (equally?) reachable; in the past I have seen "sometimes empty" results from exactly such behavior.

Not sure how/why it would be firewall related; there is no firewall (software or hardware) enabled on that subnet, it's dedicated to Ceph traffic.
Just going through the usual questions; with remote diagnosis you never know what is and what isn't. ;)
 
For the moment, I believe not all MONs are (equally?) reachable; in the past I have seen "sometimes empty" results from exactly such behavior.

That seems logical. I tried calling the monitors at random, and in at least one instance one just hung without replying. I will move the defective monitor, but how do I troubleshoot why it's not responding?
 
The logs on the MON may give some clues; if it is a network issue, you may see dropped packets.
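
For example (a sketch; the MON ID and interface name are placeholders):

Code:
# on the affected MON host: tail the monitor's log
tail -n 100 /var/log/ceph/ceph-mon.<mon-id>.log
# or via the systemd unit
journalctl -u ceph-mon@<mon-id> --since "1 hour ago"
# and check the Ceph-facing interface for drops/errors
ip -s link show <iface>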
 
