I added several dozen SSDs to a Ceph cluster and found that Proxmox reports an Apply latency of 200 to 500 ms for them. I checked with iostat - zero activity. What could be wrong with them?
P.S. No migration process is happening - they are linked to a separate 'root' container.
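For reference, the kind of checks I mean (device names are examples):

# iostat -x 1 /dev/sdc /dev/sdd
# ceph osd perf

ceph osd perf prints fs_commit_latency/fs_apply_latency per OSD, which as far as I know is where the GUI takes its Apply latency figure from.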
1. Resizing from the Proxmox UI failed with the message 'VM 102 qmp command failed - VM 102 qmp command 'block_resize' failed - Could not resize: Invalid argument'
2. Successfully resized the image on Ceph with 'qemu-img resize -f rbd rbd:rbd/vm-102-disk-1 48G'
3. But the Proxmox UI and the VM...
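In case it helps: after the image has been resized outside of Proxmox, the size stored in the VM config can be refreshed so the UI picks it up - assuming qm rescan behaves here the way I expect:

# qm rescan --vmid 102

The guest may still need a rescan or reboot to see the new capacity.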
I found a simple way to make all CTs/VMs unusable. Simply delete the Ceph pool in Ceph->Pools->Remove. No warnings or locks, even if it is in use.
Running pve-manager/4.2-15/6669ad2c (running kernel: 4.4.10-1-pve)
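For comparison, the Ceph CLI at least forces an explicit confirmation before dropping a pool:

# ceph osd pool delete rbd rbd --yes-i-really-really-mean-it

(the pool name has to be typed twice plus that flag), so a warning in the GUI would not be unprecedented.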
For those who are interested in creating an OSD journal on a separate partition (not a whole disk), here are the steps:
(assume that sda3 and sdb3 are 5GB partitions for the journals, and sdc and sdd are the disks for OSD data)
1. Create a partition of the correct size. If using fdisk, the partition's size should be 10483712 sectors...
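The same step with sgdisk instead of fdisk would look roughly like this (assuming the standard 'ceph journal' GPT type GUID; the partition number is an example):

# sgdisk --new=3:0:+5G --typecode=3:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --change-name=3:'ceph journal' /dev/sda

The typecode and name are what the Ceph udev rules and tooling key on to recognize the partition as a journal.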
I use Proxmox 4.2 and wonder if things are OK:
# pveceph createosd /dev/sdc -journal_dev /dev/sda3
create OSD on /dev/sdc (xfs)
using device '/dev/sda3' for journal
Caution: invalid backup GPT header, but valid main header; regenerating
backup header from main header...
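As far as I understand, that 'Caution' just means sgdisk found a stale backup GPT header at the end of the disk and rewrote it from the valid main header, so it should be harmless. To double-check:

# sgdisk --verify /dev/sdc
# ceph osd tree

sgdisk --verify reports any remaining GPT inconsistencies, and ceph osd tree shows whether the new OSD joined the cluster.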
It was definitely a problem with the Ceph installation. Using separate network segments via the public/cluster network parameters leads to a problem where the OSD daemons can't report to the MON daemon (still checking why). Using a single network segment is OK.
I have successfully resolved the problem. The root of the issue was creating the Ceph cluster with
pveceph init --network b.b.b.b/mask
where the b.b.b.b/mask network is the second (private) segment devoted to Ceph intra-cluster communication. So /etc/pve/ceph.conf had the following lines:
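Presumably - judging by what pveceph init --network writes - the relevant lines were both network settings pointing at the private segment:

cluster network = b.b.b.b/mask
public network = b.b.b.b/mask

which lines up with the earlier observation that the OSDs could not report to the MONs over the segment everything else uses.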
If you want to use partitions as OSD disks then you have to create/initialize them manually, not with the pveceph createosd command (you will need to set the correct partition type GUID and name them 'ceph data').
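A minimal sketch of that manual initialization with sgdisk (assuming the standard 'ceph data' type GUID that ceph-disk uses; device and partition number are examples):

# sgdisk --typecode=1:4fbd7e29-9d25-41b8-afd0-062c0ceff05d --change-name=1:'ceph data' /dev/sdc

After that, ceph-disk prepare can - if I read its manpage right - be pointed at the partition directly instead of at the whole disk.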
I have PVE 4.2 with Ceph installed with the help of pveceph. The problem is that when I try to create a CT/VM on an RBD volume it ends with the error:
TASK ERROR: rbd error: rbd: couldn't connect to the cluster!
I was able to trace the calls to the rbd binary, which is executed with "--auth_supported none" in...
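For anyone hitting the same thing: the '--auth_supported none' seems to be the hint. As far as I can tell, the PVE storage code only switches to cephx when it finds a keyring named after the storage ID under /etc/pve/priv/ceph/. A sketch, assuming the RBD storage in /etc/pve/storage.cfg is named 'rbd':

# mkdir -p /etc/pve/priv/ceph
# cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/rbd.keyring

After that, the rbd calls should be made with cephx auth instead of none.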
I found what's happening. The browser (web interface) makes calls to https://hv01:8006 like
GET /api2/json/nodes/hv02/storage/local/status HTTP/1.1" 500 -
which are proxied to node hv02 and then locally proxied to tcp/85, where pvedaemon listens. That's where the timeout happens.
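To confirm the last hop, one can check that pvedaemon is actually listening there (port 85 on localhost, if I remember the default correctly):

# ss -tlnp | grep ':85'

If nothing answers on that port, restarting pvedaemon on hv02 would be the next thing to try:

# systemctl restart pvedaemon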
I have even switched from ntpd back to systemd-timesyncd - still no luck. What looks interesting: the Ceph cluster on these nodes doesn't complain about time synchronization problems.
P.S. Btw, why does Proxmox start systemd-timesyncd even if ntpd is up and running?
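To rule out the two daemons fighting over the clock, I would keep only one of them running and then check what systemd thinks about synchronization:

# systemctl stop systemd-timesyncd
# systemctl disable systemd-timesyncd
# timedatectl status

timedatectl reports an 'NTP synchronized' flag, which should agree with what ntpq -p shows for ntpd.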
Installed Proxmox 4.2 on 3 nodes and formed a cluster. CLI tools like pvecm status/nodes show that the cluster is OK. But the GUI randomly 'grays out' nodes and randomly brings them back. No errors, quorum is established. I changed to udpu and back to multicast - no difference. What should I check...
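For reference, two things that seem worth checking: whether pvestatd is alive on every node (a hung pvestatd is a known cause of grayed-out nodes), and a longer multicast test with omping (assuming it is installed on all nodes; hostnames are examples):

# systemctl status pvestatd
# omping -c 600 -i 1 -q node1 node2 node3

omping has to be started on all nodes at about the same time; sustained packet loss there would point at the network rather than at Proxmox.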