Ceph sync makes Proxmox VMs unusable

infinityM

Hey Guys,

Ok so I have 3 nodes in my cluster with about 50TB of storage and 20TB used.
When my Ceph starts recovering, like now when we have a failed drive, all the VMs' load jumps into the hundreds and they become close to unusable.


I have configured my Proxmox as follows:

1 IP range for live traffic (own NIC)
1 IP range for internal traffic (own NIC) - stuff like backups
1 IP range for Corosync (own NIC)
1 IP range for the Ceph cluster network

The last item is the newest change, and we hadn't had a drive fail since making it, so I'm assuming that's what Ceph isn't liking?
Does anyone have any advice on what might be causing these issues? Or where to start looking?
 
Oddly enough, I just noticed that if I set my cluster replication to 1/2 instead of 1/1, then all the PGs become unavailable.
I had to set it to 1/1 since it was the only way to get the cluster usable...

Any idea of what might be causing this behaviour?
I'm very worried about the data's safety at the moment...
 
I had to set it to 1/1 since it was the only way to get the cluster usable...
Any OSD failure will result in data loss! The minimum should be at least 2/2. Better 3/2, since the pool will continue to handle writes when only two copies are available.
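
For reference, the size/min_size of an existing pool can be changed at runtime; a minimal sketch, with <poolname> as a placeholder for the actual pool name:

# ceph osd pool set <poolname> size 3
# ceph osd pool set <poolname> min_size 2

size is the number of copies Ceph keeps, min_size the number of copies that must be available before the pool accepts I/O.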

The last item is the newest change, and we hadn't had a drive fail since making it, so I'm assuming that's what Ceph isn't liking?
Yes and no. There is not much information to go on.

Does anyone have any advice on what might be causing these issues? Or where to start looking?
Describe your setup in more detail (hardware, configuration, performance). Check the logs and run benchmarks.
 
Any OSD failure will result in data loss! The minimum should be at least 2/2. Better 3/2, since the pool will continue to handle writes when only two copies are available.


Yes and no. There is not much information to go on.


Describe your setup in more detail (hardware, configuration, performance). Check the logs and run benchmarks.
Will send the details through as soon as I'm at the office.
In the meantime though, I do have 1 question.

When I try and change the cluster to 2/1 (to incrementally increase it and eventually end up at 3/2), the entire cluster immediately stops being able to read and write...
It just says all PGs are unavailable... Is this expected? I left it for 20 minutes but no PGs came back up, so I had to go back...
 
When I try and change the cluster to 2/1 (to incrementally increase it and eventually end up at 3/2), the entire cluster immediately stops being able to read and write...
As long as the pool has min_size=1, the data is at risk. 2/2 is the safe minimum. The data will be redistributed, and this applies to all PGs.

It just says all PGs are unavailable... Is this expected? I left it for 20 minutes but no PGs came back up, so I had to go back...
The cluster had issues before; this is just another symptom. All PGs will need to peer and redistribute.

What config options would you like to know? I'll send them through :)
Just come up with those that you think are important. And then we will see what we might still need.
 
As long as the pool has min_size=1, the data is at risk. 2/2 is the safe minimum. The data will be redistributed, and this applies to all PGs.
I know... Hence my fear....

The cluster had issues before; this is just another symptom. All PGs will need to peer and redistribute.
The problem is that they all go offline and take way too long to come back online. A couple of minutes is one thing, but hours is a problem, since clients need to be able to access their data. Which is why I'm looking for a workaround that keeps the data accessible.

Just come up with those that you think are important. And then we will see what we might still need.
Below are a few details; hope it's what you need :).

# cat /etc/pve/ceph.conf
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.161.1.1/24
fsid = 248fab2c-bd08-43fb-a562-08144c019785
mon_allow_pool_delete = true
mon_host = 129.232.156.119 129.232.156.116 129.232.156.120
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 129.232.156.112/28

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.c4]
public_addr = 129.232.156.119

[mon.c5]
public_addr = 129.232.156.120

[mon.c6]
public_addr = 129.232.156.116



# cat /etc/pve/corosync.conf
logging {
debug: off
to_syslog: yes
}

nodelist {
node {
name: c4
nodeid: 5
quorum_votes: 1
ring0_addr: 10.161.2.101
}
node {
name: c5
nodeid: 3
quorum_votes: 1
ring0_addr: 10.161.2.102
}
node {
name: c6
nodeid: 2
quorum_votes: 1
ring0_addr: 10.161.2.103
}
}

quorum {
provider: corosync_votequorum
}

totem {
cluster_name: cluster01
config_version: 19
interface {
linknumber: 0
}
ip_version: ipv4-6
secauth: on
version: 2
}


# cat /etc/network/interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

auto enp3s0f0
iface enp3s0f0 inet manual

auto enp3s0f1
iface enp3s0f1 inet manual

auto enp4s0f0
iface enp4s0f0 inet manual

auto enp4s0f1
iface enp4s0f1 inet manual

auto vmbr0
iface vmbr0 inet static
address 129.232.156.116/28
gateway 129.232.156.114
bridge-ports enp3s0f0
bridge-stp off
bridge-fd 0

auto vmbr1
iface vmbr1 inet static
address 10.161.0.101/24
bridge-ports enp3s0f1
bridge-stp off
bridge-fd 0

auto vmbr2
iface vmbr2 inet static
address 10.161.1.103/24
bridge-ports enp4s0f0
bridge-stp off
bridge-fd 0
#Ceph Cluster

auto vmbr3
iface vmbr3 inet static
address 10.161.2.103/24
bridge-ports enp4s0f1
bridge-stp off
bridge-fd 0
#Corosync
 
The problem is that they all go offline and take way too long to come back online. A couple of minutes is one thing, but hours is a problem, since clients need to be able to access their data. Which is why I'm looking for a workaround that keeps the data accessible.
Use a different shared storage for the interim.
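
As an aside on keeping clients responsive while recovery runs (values purely illustrative, and this does not replace fixing the underlying problem), backfill can be throttled on the fly:

# ceph tell 'osd.*' injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'

Settings injected this way apply to all running OSDs immediately and revert on OSD restart.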

address 10.161.1.103/24
bridge-ports enp4s0f0
Ceph traffic only partially goes through that interface; the public_network (129.232.156.112/28) is still used. Proxmox VE is client and server at the same time. Changing the public_network is a bigger effort: both the current and the new network need to be routed.
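
To verify which networks Ceph is actually using, two read-only checks (nothing here changes the cluster):

# ceph mon dump
# grep -E 'public_network|cluster_network' /etc/pve/ceph.conf

mon dump shows the address each MON is bound to, which in this case will be in the 129.232.156.112/28 range.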

Things to know would be the status of Ceph (ceph -s) and how the OSDs and PGs are distributed (ceph osd df tree).

And further, you will definitely need to get an understanding of Ceph, especially for these situations. ;)
https://docs.ceph.com/en/nautilus/
https://pve.proxmox.com/pve-docs/chapter-pveceph.html
 
Use a different shared storage for the interim.
I have removed 3 of the OSDs to start migrating data over, and we're adding 6 more this afternoon so we have enough room to start moving the data.

Things to know would be the status of Ceph (ceph -s) and how the OSDs and PGs are distributed (ceph osd df tree).
ceph -s
cluster:
id: 248fab2c-bd08-43fb-a562-08144c019785
health: HEALTH_WARN
1 pool(s) have no replicas configured
6 daemons have recently crashed

services:
mon: 3 daemons, quorum c4,c6,c5 (age 18h)
mgr: c6(active, since 18h), standbys: c4
osd: 35 osds: 35 up (since 12h), 32 in (since 12h); 145 remapped pgs

data:
pools: 1 pools, 1024 pgs
objects: 3.48M objects, 13 TiB
usage: 14 TiB used, 39 TiB / 54 TiB avail
pgs: 420711/3479255 objects misplaced (12.092%)
878 active+clean
135 active+remapped+backfill_wait
10 active+remapped+backfilling
1 active+clean+scrubbing+deep

io:
client: 42 MiB/s rd, 4.9 MiB/s wr, 124 op/s rd, 204 op/s wr
recovery: 37 MiB/s, 9 objects/s

progress:
Rebalancing after osd.8 marked out
[===================...........]
Rebalancing after osd.7 marked out
[========......................]
Rebalancing after osd.30 marked out
[===================...........]

ceph osd df tree
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME
-1 53.84706 - 48 TiB 14 TiB 14 TiB 9.7 MiB 41 GiB 35 TiB 28.55 1.00 - root default
-16 19.27872 - 17 TiB 5.8 TiB 5.8 TiB 3.1 MiB 17 GiB 12 TiB 33.34 1.17 - host c4
1 hdd 1.81940 1.00000 1.8 TiB 824 GiB 822 GiB 257 KiB 2.7 GiB 1.0 TiB 44.25 1.55 58 up osd.1
4 hdd 1.81940 1.00000 1.8 TiB 679 GiB 678 GiB 315 KiB 1.5 GiB 1.2 TiB 36.46 1.28 51 up osd.4
5 hdd 1.81940 1.00000 1.8 TiB 507 GiB 506 GiB 371 KiB 1.3 GiB 1.3 TiB 27.23 0.95 35 up osd.5
6 hdd 1.81940 1.00000 1.8 TiB 732 GiB 731 GiB 274 KiB 1.5 GiB 1.1 TiB 39.30 1.38 56 up osd.6
7 hdd 1.81940 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 38 up osd.7
9 hdd 1.81940 1.00000 1.8 TiB 584 GiB 582 GiB 303 KiB 1.4 GiB 1.2 TiB 31.33 1.10 41 up osd.9
11 hdd 1.09079 1.00000 1.1 TiB 299 GiB 298 GiB 245 KiB 1.0 GiB 818 GiB 26.80 0.94 22 up osd.11
14 hdd 1.81940 1.00000 1.8 TiB 567 GiB 566 GiB 251 KiB 1.3 GiB 1.3 TiB 30.46 1.07 43 up osd.14
15 hdd 0.90819 1.00000 930 GiB 289 GiB 288 GiB 121 KiB 1024 MiB 641 GiB 31.08 1.09 22 up osd.15
16 hdd 0.90819 1.00000 930 GiB 223 GiB 222 GiB 276 KiB 1024 MiB 707 GiB 23.97 0.84 17 up osd.16
21 hdd 0.90819 1.00000 930 GiB 224 GiB 223 GiB 229 KiB 1024 MiB 706 GiB 24.13 0.84 17 up osd.21
22 hdd 0.90819 1.00000 930 GiB 431 GiB 430 GiB 163 KiB 1.6 GiB 499 GiB 46.37 1.62 33 up osd.22
26 hdd 1.81940 1.00000 1.8 TiB 600 GiB 599 GiB 330 KiB 1.5 GiB 1.2 TiB 32.22 1.13 46 up osd.26
-7 18.19397 - 16 TiB 4.8 TiB 4.8 TiB 2.0 MiB 13 GiB 12 TiB 29.40 1.03 - host c5
0 hdd 0.90970 1.00000 932 GiB 257 GiB 256 GiB 129 KiB 1024 MiB 674 GiB 27.64 0.97 15 up osd.0
3 hdd 0.90970 1.00000 932 GiB 350 GiB 349 GiB 136 KiB 1.2 GiB 581 GiB 37.59 1.32 18 up osd.3
12 hdd 1.81940 1.00000 1.8 TiB 583 GiB 581 GiB 242 KiB 1.6 GiB 1.3 TiB 31.29 1.10 41 up osd.12
13 hdd 1.81940 1.00000 1.8 TiB 513 GiB 511 GiB 255 KiB 1.4 GiB 1.3 TiB 27.52 0.96 38 up osd.13
17 hdd 1.81940 1.00000 1.8 TiB 666 GiB 665 GiB 258 KiB 1.6 GiB 1.2 TiB 35.77 1.25 47 up osd.17
18 hdd 1.81940 1.00000 1.8 TiB 718 GiB 717 GiB 221 KiB 1.5 GiB 1.1 TiB 38.55 1.35 48 up osd.18
23 hdd 1.81940 1.00000 1.8 TiB 308 GiB 307 GiB 162 KiB 1024 MiB 1.5 TiB 16.53 0.58 22 up osd.23
27 hdd 1.81940 1.00000 1.8 TiB 548 GiB 546 GiB 207 KiB 1.5 GiB 1.3 TiB 29.40 1.03 37 up osd.27
28 hdd 1.81940 1.00000 1.8 TiB 725 GiB 724 GiB 302 KiB 1.5 GiB 1.1 TiB 38.93 1.36 50 up osd.28
29 hdd 1.81940 1.00000 1.8 TiB 261 GiB 260 GiB 114 KiB 1024 MiB 1.6 TiB 14.03 0.49 18 up osd.29
30 hdd 1.81940 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 1 up osd.30
-5 16.37437 - 15 TiB 3.2 TiB 3.2 TiB 4.7 MiB 11 GiB 11 TiB 21.85 0.77 - host c6
2 hdd 1.09160 1.00000 1.1 TiB 275 GiB 274 GiB 469 KiB 1.0 GiB 842 GiB 24.63 0.86 13 up osd.2
8 hdd 1.81940 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 1 up osd.8
10 hdd 1.81940 1.00000 1.8 TiB 385 GiB 384 GiB 554 KiB 1.0 GiB 1.4 TiB 20.66 0.72 27 up osd.10
19 hdd 1.09160 1.00000 1.1 TiB 236 GiB 235 GiB 109 KiB 1024 MiB 882 GiB 21.12 0.74 14 up osd.19
20 hdd 1.09160 1.00000 1.1 TiB 391 GiB 390 GiB 272 KiB 1.2 GiB 727 GiB 34.97 1.22 23 up osd.20
24 hdd 1.09160 1.00000 1.1 TiB 365 GiB 363 GiB 288 KiB 1.3 GiB 753 GiB 32.62 1.14 22 up osd.24
25 hdd 1.09160 1.00000 1.1 TiB 357 GiB 356 GiB 200 KiB 1.0 GiB 761 GiB 31.91 1.12 20 up osd.25
31 hdd 1.81940 1.00000 1.8 TiB 282 GiB 281 GiB 674 KiB 1023 MiB 1.5 TiB 15.16 0.53 20 up osd.31
32 hdd 1.81940 1.00000 1.8 TiB 335 GiB 333 GiB 133 KiB 1.2 GiB 1.5 TiB 17.96 0.63 25 up osd.32
33 hdd 1.81940 1.00000 1.8 TiB 282 GiB 281 GiB 967 KiB 1023 MiB 1.5 TiB 15.13 0.53 21 up osd.33
34 hdd 1.81940 1.00000 1.8 TiB 349 GiB 348 GiB 1.1 MiB 1023 MiB 1.5 TiB 18.75 0.66 24 up osd.34
TOTAL 54 TiB 14 TiB 14 TiB 10 MiB 44 GiB 39 TiB 28.55
MIN/MAX VAR: 0.49/1.62 STDDEV: 8.47

And further, you will definitely need to get an understanding of Ceph, especially for these situations. ;)
https://docs.ceph.com/en/nautilus/
https://pve.proxmox.com/pve-docs/chapter-pveceph.html
Ok so how do I correct it?
The problem for me is that most guides don't go into the smaller details of the network config...
So let's use this as a bit of a learning exercise if possible...

What would we need to change in this scenario?
And more or less how can I go about doing so? Even just rough details will help. I've been breaking my head trying to figure that part out D:
 
Please post command output in CODE tags (</>); it preserves the formatting.

osd: 35 osds: 35 up (since 12h), 32 in (since 12h); 145 remapped pgs
It seems that OSDs have crashed and that 3 of them are not 'in' the cluster, so data is being rebalanced. I suppose these are the disks that you want to use for local storage.
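
The 'recently crashed' part of HEALTH_WARN can be inspected and, once reviewed, acknowledged with the crash module; <id> below is a placeholder taken from the list output:

# ceph crash ls
# ceph crash info <id>
# ceph crash archive-all

archive-all only acknowledges the reports so the warning clears; it does not fix whatever made the daemons crash.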

client: 42 MiB/s rd, 4.9 MiB/s wr, 124 op/s rd, 204 op/s wr
recovery: 37 MiB/s, 9 objects/s
I assume the OSDs are spinners; at least the low speed would suggest that.

Ok so how do I correct it?
Depends on what the underlying issue is. That is currently unknown. You have not yet told us how the hardware of the cluster is set up.
 
Please post command output in CODE tags (</>); it preserves the formatting.


It seems that OSDs have crashed and that 3 of them are not 'in' the cluster, so data is being rebalanced. I suppose these are the disks that you want to use for local storage.


I assume the OSDs are spinners; at least the low speed would suggest that.


Depends on what the underlying issue is. That is currently unknown. You have not yet told us how the hardware of the cluster is set up.
It is spinners for the moment, yes. SSDs this side are CRAZY expensive.
The 3 are intended to be used to start a new Ceph array, but I'll need to move them bit by bit since there is also the issue of space; we're a bit limited at the moment.

The HW is set up as 3 server nodes, each running an EMC2 machine for the drives.
About 150-200GB RAM per server and between 32-64 cores per server too.

What I meant by how do I fix it was: how can we correct the network config for Ceph?
You said it's partially sharing the network with Proxmox, so how do we correct that part and align it with the recommended way of running it?
 
The HW is set up as 3 server nodes, each running an EMC2 machine for the drives.
External storage boxes aren't recommended either. And that could very well introduce issues.

You said it's partially sharing the network with Proxmox, so how do we correct that part and align it with the recommended way of running it?
You can run them as is; it's just that the network with the bigger bandwidth is preferred for both the public and the cluster network.
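
Purely as an illustration of that end state (the subnet is a placeholder, not a recommendation for this specific cluster), the [global] section would then carry both networks on the faster links:

[global]
# placeholder: both networks on the high-bandwidth links
public_network = 10.161.1.0/24
cluster_network = 10.161.1.0/24

They can also be two separate subnets, as long as both sit on the higher-bandwidth links.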

What I meant by how do I fix it was: how can we correct the network config for Ceph?
Both the new and the current Ceph public network need to be reachable by all MONs. Then one by one the MONs can be created on the new network.
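
A rough per-node sequence on Proxmox VE, assuming the new range is already reachable from all nodes and has been set as public_network in /etc/pve/ceph.conf (the node name c4 is only an example; move one MON at a time and wait for quorum to recover before touching the next):

# pveceph mon destroy c4
# pveceph mon create

ceph mon dump afterwards should show the recreated MON on the new address.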
 
Both the new and the current Ceph public network need to be reachable by all MONs. Then one by one the MONs can be created on the new network.
On which range should the MONs be created, though?
The public or the cluster range?
 
They may but they don't need to.
Thank you. Ok so I created a new Ceph pool, and I'm busy migrating everything to the other pool instead of rebuilding & resyncing.
Only problem now (potential problem) is that in Proxmox I am only seeing the first Ceph pool under Node->Ceph->OSDs. Is that normal?
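
In case a CLI cross-check helps (read-only): the Ceph -> OSD panel lists OSDs (physical disks), not pools, so a second pool will not appear there. Pools can be listed with:

# ceph osd lspools
# ceph df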
 
Ok guys, so first of all thanks for all the info. I used a lot of it to correct our cluster's config...
Turns out the major issue was one of our network ports reporting as a 100Mb connection. That slowed everything down to a crawl!

Luckily we found it. Thank you very much for the advice & info :)
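
For anyone hitting the same symptom: the negotiated link speed can be checked per interface with ethtool (the interface name here is just one from the configs above):

# ethtool enp4s0f0 | grep -i speed

Anything reporting 100Mb/s on a NIC carrying Ceph traffic will cripple both recovery and client I/O.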
 
