[SOLVED] OSD high cpu usage

GPLExpert

Renowned Member
Jul 18, 2016
France
gplexpert.com
Hello,

We run Ceph Hammer on Proxmox 4.4 and are preparing to upgrade to the latest version.
The cluster has 5 nodes with 4 OSDs (4 TB SATA disks).

Yesterday, we replaced the NVMe journal disk with a higher-performance Intel SSD on 2 nodes.

The second node, without any VMs, shows an average CPU usage of 25-30% and a delay of 4.
If I move a VM to this node, the VM's CPU usage becomes 30% (it was 6% on the other node).

I don't understand why.

Do you have any idea that could help me, please?

Nico
 
Please elaborate on this, as I do not quite understand what happened on which node, or what your general system looks like.
 
Is it the same CPU model on both nodes?

If yes, are you sure that the CPU is running at its maximum frequency? (Check the performance/powersave mode in your BIOS.)
 
Thanks for the help.

@spirit: yes, all nodes are the same:

Node1 : ok => 50 VMs
Node2 : average CPU usage 30% => only 2 VMs
Node3 : ok => 20 VMs
Node4 : ok => 10 VMs
Node5 : ok => 10 VMs

If I move a VM from another node to node 2, the VM's CPU usage becomes 30%.

We have a 10G storage network with bonding (2 network cards per node).

Nico
 
I tried deleting the OSDs and recreating them on node 2.

No improvement.

iostat shows no difference compared with a working node.

On node 2 with hdparm =>
root@GPL-HV3302:/var/log/ceph# hdparm -t -T /dev/nvme0n1

/dev/nvme0n1:
Timing cached reads: 2472 MB in 1.99 seconds = 1240.35 MB/sec
Timing buffered disk reads: 1412 MB in 3.00 seconds = 470.04 MB/sec

On a working node =>
root@GPL-HV3305:~# hdparm -t -T /dev/nvme0n1

/dev/nvme0n1:
Timing cached reads: 17236 MB in 1.99 seconds = 8643.32 MB/sec
Timing buffered disk reads: 5452 MB in 3.00 seconds = 1816.41 MB/sec
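
To put a number on the gap between the two hdparm runs above, here is a minimal shell sketch. The function names buffered_mbps and throughput_percent are made up for illustration, and the parsing assumes the output format shown above, where the MB/sec figure is the second-to-last field of the line.

```shell
# Read hdparm -t output on stdin and print the buffered-read
# throughput in MB/sec (second-to-last field of the matching line).
buffered_mbps() {
    awk '/Timing buffered disk reads/ { print $(NF-1) }'
}

# Print suspect/reference as a whole percentage (truncated).
# $1 = throughput of the suspect node, $2 = throughput of a working node
throughput_percent() {
    awk -v s="$1" -v r="$2" 'BEGIN { if (r > 0) printf "%d\n", (s / r) * 100 }'
}

# Usage (as root), comparing against a figure taken on a working node:
#   slow=$(hdparm -t /dev/nvme0n1 | buffered_mbps)
#   throughput_percent "$slow" 1816.41
```

With the figures posted above (470.04 against 1816.41 MB/sec), throughput_percent prints 25, so node 2 reads at roughly a quarter of the working node's speed.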
 
Yes

I uploaded 2 files.

I can see that the CPU usage of a VM is 2% in htop
but 30% in the Proxmox GUI...

Very strange

node2: cpu MHz : 415.781
working node: cpu MHz : 2493.906
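
Readings like the two above can be taken on any node from /proc/cpuinfo. A small sketch, assuming the usual "cpu MHz : value" line format; the function name min_cpu_mhz is hypothetical, and the sysfs governor path in the comment only exists when a cpufreq driver is loaded.

```shell
# Read /proc/cpuinfo-style text on stdin and print the lowest
# "cpu MHz" value found, exactly as it appears in the input.
min_cpu_mhz() {
    awk -F: '
        /^cpu MHz/ {
            gsub(/[ \t]/, "", $2)                      # strip padding around the value
            if (min == "" || $2 + 0 < min + 0) min = $2
        }
        END { if (min != "") print min }
    '
}

# Usage:
#   min_cpu_mhz < /proc/cpuinfo
# Current governor of the first core (if cpufreq is available):
#   cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
```

A value far below the nominal clock on an otherwise busy node, as on node2 here, suggests the cores are being held at a low frequency.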

Check your CPU governor configuration in the BIOS,
or try adding

GRUB_CMDLINE_LINUX_DEFAULT="intel_pstate=disable intel_idle.max_cstate=0 processor.max_cstate=1"

to /etc/default/grub,

then run

# update-grub

and reboot.
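
After the reboot, it may be worth confirming that the options actually reached the kernel by looking at /proc/cmdline. A minimal sketch; has_kernel_param is a hypothetical helper name, not an existing tool.

```shell
# Return 0 if the exact parameter $1 appears as a whole word
# in the kernel command line text $2, otherwise return 1.
has_kernel_param() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Usage:
#   cmdline=$(cat /proc/cmdline)
#   for p in intel_pstate=disable intel_idle.max_cstate=0 processor.max_cstate=1; do
#       has_kernel_param "$p" "$cmdline" || echo "missing: $p"
#   done
```

The match is on whole space-separated words, so processor.max_cstate=1 does not match a command line that only contains processor.max_cstate=10.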
 
Thanks for your help.

In the end, the problem was the SSD. We replaced it and everything is OK now.

We tested it with
# hdparm -t -T /dev/nvme0n1

The result was less than half that of a working node.

Thanks for your help.