CPU (by HMC) and CPU OS graphs differ badly

DenisM · November 2021

Hi,

I have two graphs, one from HMC data and another from the AIX OS agent. As you can see, they differ badly.
The LPAR operates in dedicated mode.
Why does the HMC-based graph show 100% CPU load?

Image: https://forum.xorux.com/uploads/editor/xk/gfp7a1782ch5.png

Image: https://forum.xorux.com/uploads/editor/bi/cpuw09mzrlls.png

Pavel · November 2021

Hi,

what is your lpar2rrd version?

do you use HMC CLI (ssh) or HMC REST API?

DenisM · November 2021

7.30
REST API

Pavel · November 2021

What is the Processor Sharing setting of the LPAR? Is it one of these?
Allow when partition is inactive.

Allow when partition is active.

Is the HMC version 8 or 9?

DenisM · November 2021

Processor Sharing is Allow when partition is inactive.
HMC v9

Pavel · November 2021

when you check graphs on the HMC directly, do you see normal utilisation or a line like in lpar2rrd?

DenisM · November 2021

HMC displays an adequate graph.

Image: https://forum.xorux.com/uploads/editor/jt/tdpw2o9tfjed.png

mcaroca · November 2021

Hello: Maybe it's similar to what happened to me, see:

https://forum.xorux.com/discussion/975/100-bad-lpar-cpu-graph-after-running-resource-affinity-process-on-power8-server-in-hmc#latest

jan_dvorak · November 2021

Hi, this is probably related to the issue you mentioned. Can you verify the idle cycles and utilized cycles?

lslparutil -r lpar -m SERVER_NAME --filter "event_types=sample,lpar_names=LPAR_NAME"

DenisM · November 2021

Hi,

time=11/12/2021 10:23:00,event_type=sample,resource_type=lpar,
sys_time=11/12/2021 10:28:47,time_cycles=16791092410980179,
lpar_name=srvkcpdi10,lpar_id=6,curr_proc_mode=ded,curr_procs=6,
curr_sharing_mode=share_idle_procs,curr_5250_cpw_percent=0.0,mem_mode=ded,
curr_mem=153600,entitled_cycles=84053164516128798,
capped_cycles=84053164516128798,uncapped_cycles=0,shared_cycles_while_active=0,
idle_cycles=110182386681225976,total_instructions=104679027062737524,
total_instructions_execution_time=116866157431121269
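
For anyone who wants to compare the counters in a record like this without reading the raw digits, here is a minimal Python sketch. The function name `parse_lslparutil_record` is hypothetical, and the sketch assumes the record is plain comma-separated `key=value` pairs with no quoted values or embedded commas, as in the sample above.

```python
def parse_lslparutil_record(record: str) -> dict:
    """Split one comma-separated key=value record into a dict.

    Assumption: no value contains a comma or quotes. Purely numeric
    values are converted to int; everything else stays a string.
    """
    fields = {}
    for item in record.replace("\n", "").split(","):
        item = item.strip()
        if not item:
            continue  # skip empty pieces left by trailing commas
        key, _, value = item.partition("=")
        fields[key] = int(value) if value.isdigit() else value
    return fields


# The sample record from this thread, joined into one string.
sample = (
    "time=11/12/2021 10:23:00,event_type=sample,resource_type=lpar,"
    "sys_time=11/12/2021 10:28:47,time_cycles=16791092410980179,"
    "lpar_name=srvkcpdi10,lpar_id=6,curr_proc_mode=ded,curr_procs=6,"
    "curr_sharing_mode=share_idle_procs,curr_5250_cpw_percent=0.0,mem_mode=ded,"
    "curr_mem=153600,entitled_cycles=84053164516128798,"
    "capped_cycles=84053164516128798,uncapped_cycles=0,shared_cycles_while_active=0,"
    "idle_cycles=110182386681225976,total_instructions=104679027062737524,"
    "total_instructions_execution_time=116866157431121269"
)

record = parse_lslparutil_record(sample)
for name in ("entitled_cycles", "capped_cycles", "idle_cycles"):
    print(f"{name:>16} = {record[name]}")
```

With the counters as integers, `record["idle_cycles"]` and `record["capped_cycles"]` can be compared directly.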

jan_dvorak · November 2021

Another user solved this problem with IBM, who admitted there is a flaw when using LPM on some firmware versions.

These are the possible solutions, as reported there:

IBM Support notes that there is a recently discovered flaw related to partition LPM and the counters for dedicated processors. A fix is already being worked on, but there is no certainty about when it will be released.

At the moment, the alternatives to correct the counters are:

1.- The least-impact option, if the LPAR has LPM capability: migrate the LPAR while it is not active, then activate it on the destination server. Afterwards (if you prefer) you can return it to the original server.

2.- The most drastic ways would be to turn the server off and on, or to delete and recreate the partitions.

I personally managed to solve it with option 1.

Can you try option 1?

DenisM · November 2021

I'm going to, but haven't yet.
I wonder whether LPAR2RRD could work around that bug the way the HMC does. The flaw doesn't make the HMC plot corrupted graphs.
The proposed solution effectively turns the LPM (Live Partition Mobility) feature into Dead Partition Mobility. Not a great outcome.

mcaroca · November 2021

Hello:
The problem is with the counters of the affected LPAR on the HMC itself. The idle_cycles counter cannot be greater than capped_cycles (highlighted in the red box). I solved it with alternative 1, already described above.

Image: https://forum.xorux.com/uploads/editor/cb/4j3b7n8htia4.png
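
The counter check described in this post can be sketched in a few lines of Python. The helper name `idle_exceeds_capped` is hypothetical, and the values are the ones from the sample DenisM posted earlier in this thread. The check only encodes the rule stated here (idle_cycles should not exceed capped_cycles on a dedicated-processor LPAR); it is not an official IBM diagnostic.

```python
def idle_exceeds_capped(capped_cycles: int, idle_cycles: int) -> bool:
    """Return True when the idle counter is larger than the capped counter,
    which this thread treats as a sign of corrupted HMC counters."""
    return idle_cycles > capped_cycles


# Counter values copied from the lslparutil sample posted in this thread.
capped = 84053164516128798
idle = 110182386681225976

if idle_exceeds_capped(capped, idle):
    print(f"Suspicious: idle_cycles exceeds capped_cycles by {idle - capped}")
else:
    print("Counters pass this particular check")
```

A record that passes this check is not necessarily healthy; it only rules out this one symptom.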

"1.- The least impact and if it has the LPM capacity. You can migrate the non-active Lpar. Then Activate the destination server. Then (if you prefer) you can return it to the original server."

mcaroca · November 2021

If you prefer, you can open an IBM support case to confirm the proposed solution or to have them indicate any additional alternatives.

DenisM · November 2021

Hi mcaroca,

That isn't a solution, that's a workaround. A solution that ruins all the benefits of LPM isn't really a solution, IMHO.

Pavel · November 2021

Hi,

Basically it is a bug in the HMC: idle cycles are higher than capped cycles, which should not happen. Ask IBM for a solution.

BTW try to switch to HMC REST API access, it might work properly

DenisM · November 2021

Pavel,

My point is that this bug doesn't prevent the HMC from drawing correct graphs, so why can't LPAR2RRD?
I have been using the REST API from the very beginning.

Marek_H · April 2024

Hello,

Do you have any further information on this topic, or are the proposed solutions still the only options? I have the same problem (CPU at 100%, although the load looks fine when checked directly on the LPAR). It also has dedicated cores (but in my case the capped_cycles counter is greater than idle_cycles).

It actually happened after HMC upgrade (current version: HMC V10R3 M1051)

I'm using version 7.30...

Thank you.

Pavel · April 2024

Upgrade to the latest v7.80-1 and let us know. Do you use HMC REST API or CLI (ssh)?

Marek_H · April 2024

Ok, I will try it.

CLI (ssh)

Thanks.

Pavel · April 2024

CLI does not help

Marek_H · June 2024

Hello,

sorry for the delay - switching to the HMC REST API actually helped!

Thank you very much,
