Ceph VM Errors

ermanishchawla

I have a Ceph configuration as follows:

1. Number of nodes: 4
2. OSDs per server: 3
3. Total OSDs: 12
4. Capacity per OSD: 900 GB

Ceph pool: VMPOOL
pg_num: 256

I have VMs on all the servers and the VMs use Ceph storage.
Now whenever I reboot any server, the VMs on the other servers hang and are not able to write.
Is there any issue in the design?
 
Please post a ceph osd dump.
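(For reference, a minimal sketch of how to take it, assuming it is run on a node with the Ceph admin keyring:)

# full OSD map, including pool settings and per-OSD state
ceph osd dump

# optional: just the pools with size, min_size and pg_num
ceph osd pool ls detail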
 
epoch 261
fsid a5a22a8b-6956-4320-9d0b-7fec9a96b48d
created 2020-04-27 18:36:40.453688
modified 2020-05-14 09:21:21.783225
flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
crush_version 25
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client jewel
min_compat_client jewel
require_osd_release nautilus
pool 6 'DataStore' replicated size 2 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 146 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
removed_snaps [1~3]
max_osd 12
osd.0 up in weight 1 up_from 211 up_thru 260 down_at 206 last_clean_interval [118,204) [v2:172.19.2.17:6802/2051,v1:172.19.2.17:6803/2051] [v2:172.19.2.17:6804/2051,v1:172.19.2.17:6805/2051] exists,up 1b7bb119-9471-4d62-9975-43abc85bdc65
osd.1 up in weight 1 up_from 211 up_thru 260 down_at 206 last_clean_interval [117,204) [v2:172.19.2.17:6810/2053,v1:172.19.2.17:6811/2053] [v2:172.19.2.17:6812/2053,v1:172.19.2.17:6813/2053] exists,up d38663db-ed83-4b36-bf05-ca0275ce843c
osd.2 up in weight 1 up_from 210 up_thru 260 down_at 206 last_clean_interval [118,204) [v2:172.19.2.17:6818/2052,v1:172.19.2.17:6819/2052] [v2:172.19.2.17:6820/2052,v1:172.19.2.17:6821/2052] exists,up 5915e646-6af7-44b5-89d3-2ace57d8a9f2
osd.3 up in weight 1 up_from 228 up_thru 260 down_at 225 last_clean_interval [123,223) [v2:172.19.2.16:6818/1810,v1:172.19.2.16:6819/1810] [v2:172.19.2.16:6820/1810,v1:172.19.2.16:6821/1810] exists,up ff98eccd-0ab0-4fdb-be70-eddca2cc702d
osd.4 up in weight 1 up_from 229 up_thru 260 down_at 225 last_clean_interval [123,223) [v2:172.19.2.16:6802/1808,v1:172.19.2.16:6804/1808] [v2:172.19.2.16:6806/1808,v1:172.19.2.16:6807/1808] exists,up 6468168c-f91f-419a-8804-a0cd7cda46f7
osd.5 up in weight 1 up_from 228 up_thru 260 down_at 225 last_clean_interval [123,223) [v2:172.19.2.16:6803/1809,v1:172.19.2.16:6805/1809] [v2:172.19.2.16:6808/1809,v1:172.19.2.16:6810/1809] exists,up 64171370-4b0e-49fc-a6b5-7bd01decf14f
osd.6 up in weight 1 up_from 216 up_thru 260 down_at 214 last_clean_interval [202,212) [v2:172.19.2.19:6803/2008,v1:172.19.2.19:6804/2008] [v2:172.19.2.19:6806/2008,v1:172.19.2.19:6808/2008] exists,up 73cb008e-18e0-47d2-9385-86c89e316429
osd.7 up in weight 1 up_from 217 up_thru 260 down_at 214 last_clean_interval [202,212) [v2:172.19.2.19:6802/2009,v1:172.19.2.19:6805/2009] [v2:172.19.2.19:6807/2009,v1:172.19.2.19:6809/2009] exists,up ab5eb9ad-4696-42f8-b5a0-4497952ca345
osd.8 up in weight 1 up_from 217 up_thru 260 down_at 214 last_clean_interval [203,212) [v2:172.19.2.19:6818/2007,v1:172.19.2.19:6819/2007] [v2:172.19.2.19:6820/2007,v1:172.19.2.19:6821/2007] exists,up e7dc1c81-0ccc-429e-8ed9-26b2a1a89f26
osd.9 up in weight 1 up_from 260 up_thru 260 down_at 258 last_clean_interval [256,257) [v2:172.19.2.18:6808/1844,v1:172.19.2.18:6809/1844] [v2:172.19.2.18:6810/1844,v1:172.19.2.18:6811/1844] exists,up 15222442-c794-4564-94c2-ec53aa0e3685
osd.10 up in weight 1 up_from 260 up_thru 260 down_at 258 last_clean_interval [255,257) [v2:172.19.2.18:6800/1850,v1:172.19.2.18:6801/1850] [v2:172.19.2.18:6802/1850,v1:172.19.2.18:6803/1850] exists,up 3318faca-e323-4b8b-bbe2-e49d07504fe3
osd.11 up in weight 1 up_from 260 up_thru 260 down_at 258 last_clean_interval [254,257) [v2:172.19.2.18:6816/1845,v1:172.19.2.18:6817/1845] [v2:172.19.2.18:6818/1845,v1:172.19.2.18:6819/1845] exists,up 11d107c7-461d-4ff3-9d46-052a9c83d882
 
Pool size is 2

replicated size 2 min_size 2
That explains it: there are just two copies of each object in the cluster. With size 2 and min_size 2, as soon as one node goes down, the affected objects have only one available copy, which is below min_size, so Ceph blocks writes to them (to prevent a split brain).
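
You can confirm this while one node is rebooting; a minimal sketch, with the pool name DataStore taken from the osd dump above:

# overall health; look for reduced data availability / inactive PGs
ceph health detail

# PGs stuck inactive (no I/O possible) while the node is down
ceph pg dump_stuck inactive

# current replication settings of the pool
ceph osd pool get DataStore size
ceph osd pool get DataStore min_size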

That's why we recommend a size/min_size of 3/2; that solves this issue.
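
As a rough sketch of the change (pool name DataStore taken from the osd dump above; substitute your pool, e.g. VMPOOL, if it differs). Note that raising the size triggers a rebalance and increases raw space usage from 2x to 3x the stored data:

# keep three copies of each object
ceph osd pool set DataStore size 3

# allow I/O as long as at least two copies are available
ceph osd pool set DataStore min_size 2

With 3/2, a single node reboot still leaves two available copies, which meets min_size, so the VMs on the other nodes can keep writing.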