Cannot start any KVM / LXC

cmonty14

Hi,

here I describe one of the two major issues I'm currently facing in my 8-node Ceph cluster (2x MDS, 6x OSD).

The issue is that I cannot start any virtual machine (KVM) or container (LXC); the boot process just hangs after a few seconds.
All these KVMs and LXCs have in common that their virtual disks reside in the same pool: hdd.

This pool hdd is relatively small compared to the largest pool, hdb_backup:
root@ld3955:~# rados df
POOL_NAME                USED   OBJECTS  CLONES     COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED     RD_OPS        RD     WR_OPS       WR  USED COMPR  UNDER COMPR
backup                    0 B         0       0          0                   0        0         0          0       0 B          0      0 B         0 B          0 B
hdb_backup            589 TiB  51262212       0  153786636                   0        0    124895   12266095   4.3 TiB  247132863  463 TiB         0 B          0 B
hdd                   3.2 TiB    281884    6568     845652                   0        0      1658  275277357    16 TiB  208213922   10 TiB         0 B          0 B
pve_cephfs_data       955 GiB     91832       0     275496                   0        0      3038       2103  1021 MiB     102170  318 GiB         0 B          0 B
pve_cephfs_metadata   486 MiB        62       0        186                   0        0         7        860   1.4 GiB      12393  166 MiB         0 B          0 B

total_objects    51635990
total_used       597 TiB
total_avail      522 TiB
total_space      1.1 PiB
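
Side note: to double-check which CRUSH rule (and therefore which set of OSDs) backs each of these pools, roughly the following commands can be used; nothing cluster-specific is assumed here beyond the pool name:

# list all pools with their CRUSH rule, size and pg_num
ceph osd pool ls detail
# show the rule the hdd pool uses
ceph osd pool get hdd crush_rule
# dump the rules to see which device class / CRUSH root each one selects
ceph osd crush rule dump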


This is the current health status of the ceph cluster:
root@ld3955:~# ceph -s
  cluster:
    id:     6b1b5117-6e08-4843-93d6-2da3cf8a6bae
    health: HEALTH_ERR
            1 MDSs report slow metadata IOs
            78 nearfull osd(s)
            1 pool(s) nearfull
            Reduced data availability: 2 pgs inactive, 2 pgs peering
            Degraded data redundancy: 304136/153251136 objects degraded (0.198%), 57 pgs degraded, 57 pgs undersized
            Degraded data redundancy (low space): 265 pgs backfill_toofull
            3 pools have too many placement groups
            75 slow requests are blocked > 32 sec
            78 stuck requests are blocked > 4096 sec

  services:
    mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 94m)
    mgr: ld5505(active, since 3d), standbys: ld5506, ld5507
    mds: pve_cephfs:1 {0=ld3976=up:active} 1 up:standby
    osd: 368 osds: 368 up, 367 in; 303 remapped pgs

  data:
    pools:   5 pools, 8868 pgs
    objects: 51.08M objects, 195 TiB
    usage:   590 TiB used, 563 TiB / 1.1 PiB avail
    pgs:     0.023% pgs not active
             304136/153251136 objects degraded (0.198%)
             1673548/153251136 objects misplaced (1.092%)
             8563 active+clean
             195  active+remapped+backfill_toofull
             57   active+undersized+degraded+remapped+backfill_toofull
             36   active+remapped+backfill_wait
             13   active+remapped+backfill_wait+backfill_toofull
             2    active+remapped+backfilling
             2    peering

  io:
    client:   264 KiB/s wr, 0 op/s rd, 0 op/s wr
    recovery: 18 MiB/s, 4 objects/s
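
As a side note, the detail behind this HEALTH_ERR (which OSDs are nearfull, which PGs are stuck in backfill_toofull or peering) can be pulled with commands along these lines:

# full list of warnings/errors, including the affected OSD and PG IDs
ceph health detail
# per-OSD utilization laid out along the CRUSH tree - shows the nearfull OSDs
ceph osd df tree
# PGs that are stuck (inactive, unclean, undersized, degraded)
ceph pg dump_stuck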



I believe the cluster is busy with rebalancing pool hdb_backup.
I set the balancer mode to upmap recently, after the 589 TiB of data had been written.
root@ld3955:~# ceph balancer status
{
    "active": true,
    "plans": [],
    "mode": "upmap"
}
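
One option I'm considering (not done yet, so please correct me if this is a bad idea): pausing the balancer so it doesn't queue additional remapped PGs while the backfill of hdb_backup catches up, e.g.:

# stop the balancer from creating new upmap plans
ceph balancer off
# verify the state
ceph balancer status
# re-enable once backfill has caught up
ceph balancer on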


In order to resolve the issue with pool hdd I started some investigation.
The first step was to install the drivers for the NIC provided by Mellanox.
Then I configured some kernel parameters recommended by Mellanox:
<https://community.mellanox.com/s/article/linux-sysctl-tuning>
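
For reference, the settings from that article are of this kind; the values below are only illustrative (and the file name is my own choice), the exact recommendations are in the linked page:

# example /etc/sysctl.d/99-mellanox-tuning.conf - larger socket buffers for fast NICs
net.core.rmem_max = 4194304
net.core.wmem_max = 4194304
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_wmem = 4096 65536 4194304
net.core.netdev_max_backlog = 250000

Applied with "sysctl --system".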

However, this didn't fix the issue.
In my opinion I must get rid of all the "slow requests are blocked" warnings.

When I check the output of ceph health detail, every OSD listed under REQUEST_SLOW belongs to pool hdd.
This means none of the disks belonging to pool hdb_backup is showing comparable behavior.
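
The check itself is straightforward, because PG IDs are prefixed with the pool ID: for every OSD named under REQUEST_SLOW one can list its PGs and read off the pool (osd.8 below is just one example from my nodes):

# pool IDs and names
ceph osd lspools
# PGs hosted on this OSD - the number before the dot in each PG ID is the pool ID
ceph pg ls-by-osd 8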

Then I checked the running processes on the different OSD nodes; I'm using the tool "glances" for this.
There I can see individual processes that have been running for hours and are consuming a lot of CPU, e.g.
63.9 0.5 4.95G 3.81G 14894 ceph 6h14:22 58 0 S 16M 293K /usr/bin/ceph-osd -f --cluster ceph --id 8 --setuser ceph --setgroup ceph

Similar processes are running on 4 OSD nodes.
All of these processes have in common that the relevant OSD belongs to pool hdd.
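
To see what such a busy ceph-osd process is actually working on, its admin socket can be queried on the node where it runs (osd.8 taken from the glances line above):

# operations currently in flight on this OSD
ceph daemon osd.8 dump_ops_in_flight
# recently completed operations with their per-step timings
ceph daemon osd.8 dump_historic_ops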

What can / should I do now?
Kill the long-running processes?
Stop the relevant OSDs?
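
From what I understand, if an OSD really has to be bounced, the cleaner way than killing the process would be to set noout and restart the daemon, e.g. (osd.8 again only as an example); I'd appreciate confirmation whether this is safe in my current situation:

# prevent the cluster from marking restarting OSDs out (avoids extra rebalancing)
ceph osd set noout
# restart the affected OSD daemon on its node
systemctl restart ceph-osd@8
# clear the flag again afterwards
ceph osd unset noout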

Please advise.

THX
Thomas
 
Update:
I think that this issue is related to other issues reported here and here.

Furthermore I found out that I cannot copy data from the affected pool to local disk.
I started copying an LXC dump file, and the transfer hangs after only about 12 MiB (of roughly 775 MiB):
Source
root@ld3955:~# ls -l /mnt/pve/pve_cephfs/dump/
total 139313896
-rw-r--r-- 1 root root 654 Sep 10 14:34 vzdump-lxc-200-2019_09_10-14_33_41.log
-rw-r--r-- 1 root root 812809056 Sep 10 14:34 vzdump-lxc-200-2019_09_10-14_33_41.tar.lzo

Target
root@ld3955:~# ls -l /var/lib/vz/dump/
total 12292
-rw-r--r-- 1 root root 654 Sep 23 12:20 vzdump-lxc-200-2019_09_10-14_33_41.log
-rw-r--r-- 1 root root 12582912 Sep 23 12:20 vzdump-lxc-200-2019_09_10-14_33_41.tar.lzo


In addition I found out that I can write new data to pool hdd.

Bottom line of my findings:
Reading old data from the pool fails, while writing new data to the pool works.
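
To narrow down whether the read hang is in the RBD/CephFS layer or already at the RADOS level, my next test would be to read a few objects from the pool directly; OBJECT_NAME below is a placeholder to be taken from the listing:

# list a handful of objects in the pool
rados -p hdd ls | head
# try to read one of them to a local file
rados -p hdd get OBJECT_NAME /tmp/test.obj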

Another remark I want to make:
Whenever I try to read old data in pool hdd, I see a critical increase in CPU iowait.
I'm not sure if this is only impacting my servers or if other users notice the same.
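
For the record, the per-disk view of this iowait can be checked with sysstat on the OSD nodes, e.g.:

# extended per-device statistics (utilization, await, queue size), refreshed every 5 s
iostat -x 5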
 