Improved CPU utilization graphing for Hitachi arrays

Hello,

Because of a recent issue with CPU utilization on our Hitachi arrays, I took a deep dive into how stor2rrd shows it and found that it needs improvement in several ways.

Let's discuss high-end Hitachi (VSP, VSP G1000, G1500, F1500, HPE XP) first.

Stor2rrd shows an average utilization graph followed by a breakdown for each object it retrieves. That is not very useful, because an array has 2 types of CPUs for different tasks, and over-utilization of one CPU type gets averaged out by the other. If there's an issue, we'll never notice it from this graph. Okay, here goes:

1. MPB (Microprocessor Blade) - general I/O processing. A single MPB is basically an Intel CPU with 4 (Hitachi VSP) or 8 (Hitachi VSP G1000 and later) cores. An object string looks like this: MPB-1MA.MP00-1MB, where MPB-1MA is a CPU and MP00-1MB is an individual core.

2. DRR (Data Recovery and Reconstruction) - RAID parity calculations. A typical object string looks like this: CHA-xxx.DRR-xxxx or DKA-xxx.DRR-xxxx. CHA (Channel Adapter) processes front-end I/O, DKA (Disk Adapter) processes back-end I/O.
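To make the averaging problem concrete, here is a minimal Python sketch that sorts object strings into the two CPU types by their naming pattern and compares the per-type averages with the overall average. The object names only follow the patterns described above, the utilization numbers are made up, and `classify` is a hypothetical helper, not something taken from stor2rrd.

```python
from collections import defaultdict
from statistics import mean


def classify(obj: str) -> str:
    """Map an object string to its CPU type, using only the naming
    patterns described in this post (an assumption, not stor2rrd code)."""
    unit, _, sub = obj.partition(".")
    if unit.startswith("MPB-"):
        return "MPB"
    if sub.startswith("DRR-") and unit.startswith("CHA-"):
        return "DRR-CHA"
    if sub.startswith("DRR-") and unit.startswith("DKA-"):
        return "DRR-DKA"
    return "unknown"


# Made-up sample data: utilization in percent per object.
samples = {
    "MPB-1MA.MP00-1MB": 95.0,
    "MPB-1MA.MP01-1MB": 93.0,
    "CHA-1PC.DRR-0001": 10.0,
    "CHA-1PC.DRR-0002": 12.0,
    "DKA-1PA.DRR-0001": 5.0,
    "DKA-1PA.DRR-0002": 7.0,
}

by_type = defaultdict(list)
for name, util in samples.items():
    by_type[classify(name)].append(util)

overall = mean(samples.values())
per_type = {cpu_type: mean(values) for cpu_type, values in by_type.items()}

print(f"overall average: {overall:.1f}%")
for cpu_type, avg in sorted(per_type.items()):
    print(f"{cpu_type}: {avg:.1f}%")
```

With these made-up numbers the overall average is 37%, while the MPBs alone sit at 94%. That is exactly the kind of overload the single averaged graph hides.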

So what are my suggestions? I'd like to see the following in the CPU section:

- Total - Should be displayed by subsystem, without aggregating across subsystems (one line each for MPB averaged, DRR-CHA averaged, DRR-DKA averaged)

- MPB - Aggregated load per CPU (plot every MPB-xxx on the graph but have all cores averaged out). Plotting every core on the graph will probably give 64+ objects for large systems and is therefore not necessary. Individual CPU cores typically have the same utilization so averaging is fine.

- DRR - Aggregated per adapter (CHA+DKA), aggregated per CHA (CHA-xxx.DRR-xxxx), aggregated per DKA (DKA-xxx.DRR-xxxx). If the number of items is too large, average out the items under every CHA-xxx and DKA-xxx - that will cut the number of plot items in half.
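The three views suggested above could be computed roughly like this. It is only a sketch in Python under my own assumptions: the function names (`average_by`, `unit_of`, `subsystem_of`) are hypothetical, the object names and numbers are made up, and the input is assumed to contain only MPB, CHA and DKA objects.

```python
from collections import defaultdict
from statistics import mean


def average_by(samples: dict, key) -> dict:
    """Average utilization values grouped by key(object_name)."""
    groups = defaultdict(list)
    for name, util in samples.items():
        groups[key(name)].append(util)
    return {group: mean(values) for group, values in groups.items()}


def unit_of(name: str) -> str:
    """Part before the dot, e.g. 'MPB-1MA', 'CHA-1PC' or 'DKA-1PA'."""
    return name.split(".", 1)[0]


def subsystem_of(name: str) -> str:
    """Coarse group for the Total graph: MPB, DRR-CHA or DRR-DKA.
    Assumes the input contains only MPB, CHA and DKA objects."""
    kind = unit_of(name).split("-", 1)[0]
    return "MPB" if kind == "MPB" else f"DRR-{kind}"


# Made-up sample data: utilization in percent per object.
samples = {
    "MPB-1MA.MP00-1MB": 40.0,
    "MPB-1MA.MP01-1MB": 44.0,
    "MPB-2MA.MP00-2MB": 80.0,
    "MPB-2MA.MP01-2MB": 84.0,
    "CHA-1PC.DRR-0001": 20.0,
    "CHA-1PC.DRR-0002": 30.0,
    "DKA-1PA.DRR-0001": 50.0,
    "DKA-1PA.DRR-0002": 70.0,
}

mpb_only = {n: u for n, u in samples.items() if n.startswith("MPB-")}
drr_only = {n: u for n, u in samples.items() if not n.startswith("MPB-")}

total = average_by(samples, subsystem_of)  # Total graph: one line per subsystem
mpb = average_by(mpb_only, unit_of)        # MPB graph: one line per blade
drr = average_by(drr_only, unit_of)        # DRR graph: one line per adapter

print(total)
print(mpb)
print(drr)
```

Averaging the cores under each MPB keeps the graph at one line per blade, which is what makes an overloaded blade stand out.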

Mid-range Hitachis are less complicated because they have no CPUs dedicated to specific tasks. Their CPU units are named MPU-xx.MPxx-xx, where MPU-xx is an individual CPU and MPxx-xx is a CPU core. I'd be happy to see a utilization graph broken down by MPU, so that cases of bad system load balancing become obvious. Just like above, the items under each MPU-xx can be averaged out to make the graphs less complex.
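For the mid-range case, a per-MPU breakdown plus a simple spread value would already show bad balancing. Again just a sketch in Python: the MPU names follow the MPU-xx.MPxx-xx pattern but are made up, the numbers are invented, and `per_mpu` and `imbalance` are hypothetical helpers.

```python
from collections import defaultdict
from statistics import mean


def per_mpu(samples: dict) -> dict:
    """Average the cores under each MPU-xx."""
    cores = defaultdict(list)
    for name, util in samples.items():
        mpu = name.split(".", 1)[0]  # 'MPU-10.MP00-00' -> 'MPU-10'
        cores[mpu].append(util)
    return {mpu: mean(values) for mpu, values in cores.items()}


def imbalance(averages: dict) -> float:
    """Spread between the busiest and the least busy MPU,
    in percentage points."""
    return max(averages.values()) - min(averages.values())


# Made-up sample data: utilization in percent per core.
samples = {
    "MPU-10.MP00-00": 85.0,
    "MPU-10.MP01-00": 89.0,
    "MPU-11.MP00-01": 21.0,
    "MPU-11.MP01-01": 25.0,
}

averages = per_mpu(samples)
spread = imbalance(averages)

for mpu, avg in sorted(averages.items()):
    print(f"{mpu}: {avg:.1f}%")
print(f"spread between MPUs: {spread:.1f} points")
```

Here one MPU averages 87% and the other 23%, a 64-point spread that a single array-wide average of 55% would not reveal.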

Hope you can help me out and we'll make stor2rrd even better than it is now.

Comments

  • Hi Alex,

    thanks for analyzing that!
    We have known about that for years; you are not the first one to point it out.
    However, no one has ever explained all of that to us in such detail.

    This is definitely something we must do.
    I am placing it into our TODO list.

  • The patch did a good job. Now the graphs really make sense.
