POWER8 server statistics disappeared

Hello,

We upgraded to 6.02 and switched to the REST API, and statistics from one of our servers disappeared. We have 4 servers connected to 2 HMCs.
3 of them, POWER7 systems, work fine, but the 9119-MHE server disappeared along with all LPARs on it. I don't see any issues in the logs, but there is no sign of it in the web interface.




download data  : hmc2-sh:Server-9119-MHE-SN last 140  minute(s) (3 hours)
fetching HMC   : hmc2-sh:Server-9119-MHE-SN lpar data
fetching HMC   : hmc2-sh:Server-9119-MHE-SN pool data
last rec 3     : hmc2-sh:Server-9119-MHE-SN min:21144 , hour:353, 4/29/2019 16:59 : last-pool.txt - source:pool
fetching HMC   : hmc2-sh:Server-9119-MHE-SN mem data
last rec 3     : hmc2-sh:Server-9119-MHE-SN min:21144 , hour:353, 4/29/2019 16:59 : last-mem.txt - source:mem
fetching HMC   : hmc2-sh:Server-9119-MHE-SN CoD data
last rec 3     : hmc2-sh:Server-9119-MHE-SN min:2880 , hour:48, 4/29/2019 16:59 : init - last-cod.txt - source:cod

except

ERROR:     Server-9119-MHE-SN, hmc-sh data load in /home/lpar2rrd/lpar2rrd/bin/lpar2rrd.pl at line 851
ERROR: /home/lpar2rrd/lpar2rrd/data/Server-9119-MHE-SN/hmc-sh/rbo-db-sh.rrm: not a simple unsigned integer: '-32812763677588725' at /home/lpar2rrd/lpar2rrd/bin/LoadDataModule.pm line 420.

date load      : hmc-sh:Server-9119-MHE-SN Tue May 14 09:18:30 2019

The timestamp looks odd: it seems to be stuck at 4/29/2019 16:59.



We have 2 more HMCs with 5 servers connected, including a similar 9119-MHE server, and everything works well there.

Comments

  • Hi,


    use this as a fix:


    -rwxrwxr-x 1 lpar2rrd lpar2rrd 315607 May 10 09:04 bin/LoadDataModule.pm

    Gunzip it and copy it to /home/lpar2rrd/lpar2rrd/bin (mode 755, owner lpar2rrd).

    If your web browser gunzips it automatically, just rename it: mv LoadDataModule.pm.gz LoadDataModule.pm

    Make sure the file size is the same as in the listing above.
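
A minimal sketch of the install steps described above, assuming a POSIX shell and that it is run as the lpar2rrd user (so the copied file ends up owned by lpar2rrd). The function name `install_fix` and its arguments are hypothetical and not part of LPAR2RRD.

```shell
#!/bin/sh
# install_fix FILE DEST_DIR EXPECTED_SIZE
# Hypothetical helper: unpack a downloaded fix, verify its size, install it with mode 755.
install_fix() {
    fix_src=$1
    fix_dest=$2
    fix_expected=$3

    case "$fix_src" in
        *.gz)
            if gzip -t "$fix_src" 2>/dev/null; then
                # Really compressed: unpack in place (drops the .gz suffix).
                gunzip -f "$fix_src" || return 1
            else
                # The browser already decompressed it: only the name is wrong.
                mv "$fix_src" "${fix_src%.gz}" || return 1
            fi
            fix_src=${fix_src%.gz}
            ;;
    esac

    # Compare the byte count with the size shown in the ls -l listing.
    fix_size=$(wc -c < "$fix_src" | tr -d ' ')
    if [ "$fix_size" -ne "$fix_expected" ]; then
        echo "size mismatch: got $fix_size, expected $fix_expected" >&2
        return 1
    fi

    cp "$fix_src" "$fix_dest/" && chmod 755 "$fix_dest/$(basename "$fix_src")"
}
```

For the file in this comment, the call might look like `install_fix LoadDataModule.pm.gz /home/lpar2rrd/lpar2rrd/bin 315607`, run from the directory the file was downloaded to.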


  • edited May 15
    Hello,

    Thank you for your help. The server appeared in the web interface, and now I can see statistics from the LPAR agents. However, we have a new issue.

    Since I replaced the file, most of the HMC statistics are missing for all servers.





    but the CPU pool is still there.



    There are messages about invalid JSON files in the logs:

    last rec 3     : hmc-sh:Server-9117-MMD-SN067B767 min:2880 , hour:48, 5/15/2019 8:40 : init - last-cod.txt - source:cod
    Rest API       2019-05-15T09:00:45        : inserting hmc2-sh Server-9117-MMD-SN064B6F7 HMC_hmc2-sh_lpars_perf_20190515_0900.json to rrd files
    HMC_hmc2-kt_lpars_perf_20190515_0900.json is not valid : HASH(0x2101188)
    no content in /home/lpar2rrd/lpar2rrd/data/Server-9117-MMC-SN062D397/hmc2-kt/iostat/HMC_hmc2-kt_lpars_perf_20190515_0900.json
    HMC_hmc-kt_lpars_perf_20190515_0900.json is not valid : HASH(0x3237188)
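
A quick way to see how widespread the "no content" messages are is to list zero-byte performance files. This is only a sketch: the function name `list_empty_perf_json` is hypothetical, the file name pattern and the `iostat` directory are taken from the log lines above, and it catches empty files only. The "is not valid" messages may have a different cause that it will not detect.

```shell
#!/bin/sh
# list_empty_perf_json DATA_DIR
# Hypothetical helper: print every zero-byte HMC perf JSON file under DATA_DIR.
list_empty_perf_json() {
    find "$1" -type f -path '*/iostat/*' -name 'HMC_*_lpars_perf_*.json' -size 0
}

# Example (path taken from the log above):
# list_empty_perf_json /home/lpar2rrd/lpar2rrd/data
```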
  • Hi, 

    here is a fix that should resolve the issue.

    Gunzip it and copy it to /home/lpar2rrd/lpar2rrd/bin (mode 755, owner lpar2rrd).

    -rwxrwxr-x 1 lpar2rrd lpar2rrd 241786 May 15 08:53 bin/hmc_rest_api.pl

    If your web browser gunzips it automatically, just rename it: mv hmc_rest_api.pl.gz hmc_rest_api.pl

    Make sure the file size is the same as in the listing above.
  • This helped, issue solved. Thank you very much.
  • Hello, 
    Here I am again. We've got a new issue.
    Now we have gaps in the HMC graphs.


    Logs look like this

    LPARSUTIL2 : tst-rep-ah-72
    LPARSUTIL3 : tst-rep-ah-72 ts for 05/20/2019 09:39:30 is OK?
    LPARSUTIL2 : tst_rep_ah
    LPARSUTIL3 : tst_rep_ah ts for 05/20/2019 09:39:30 is OK?
    LPARSUTIL2 : nes-t1a-app6
    LPARSUTIL3 : nes-t1a-app6 ts for 05/20/2019 09:39:30 is OK?
    Rest API       2019-05-20 09:40:39        : hmc-ah Server-9117-MMB-SN Perffiles OK
    Rest API       2019-05-20 09:40:39        : hmc-ah Server-9117-MMB-SN PID end : 0

    And then

    [lpar2rrd@xorux-mon lpar2rrd]$ cat load_hmc_rest_api.out 
    Mon May 20 10:00:01 +06 2019: There is already running another copy of load_hmc_rest_api.sh, exiting ...

  • Hi,

    please send us the logs. It looks like your load_hmc_rest_api.sh runs for quite a long time. It should take only a few minutes.

    Mon May 20 10:00:01 +06 2019: There is already running another copy of load_hmc_rest_api.sh, exiting ...
    This indicates that a run needs more than 20 minutes.

    cd /home/lpar2rrd/lpar2rrd # or wherever your LPAR2RRD working directory is
    grep -v password etc/web_config/hosts.json > tmp/hosts.txt
    ls -l data/*/*/cpu.cfg > tmp/o.txt
    ls -l data/*/*/pool.rrm >> tmp/o.txt
    tar cvhf logs.tar logs tmp/restapi/* tmp/*.txt
    gzip -9 logs.tar
    Then upload logs.tar.gz at https://upload.lpar2rrd.com/
  • I figured it out. One of our HMCs hangs. Sorry to bother you.

  • We have 2 HMCs connected to the same servers, so if one of them takes too long to answer, we get the issue described above.
  • Hi,

    basically yes, this happens when the connection to one HMC hangs.
    You are the first one to report it; we have never seen it until now.
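
To check by hand whether an HMC answers in time, a query can be wrapped in a deadline. The sketch below assumes the `timeout` utility from GNU coreutils is installed. The function name `with_deadline` is hypothetical, and this is not how LPAR2RRD itself handles a hanging HMC.

```shell
#!/bin/sh
# with_deadline SECONDS COMMAND [ARGS...]
# Hypothetical helper: run COMMAND, but give up after SECONDS.
# Returns 124 when the deadline was hit, otherwise the command's own exit status.
with_deadline() {
    deadline=$1
    shift
    timeout "$deadline" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        # GNU timeout exits with 124 when it had to stop the command.
        echo "no answer within ${deadline}s: $*" >&2
    fi
    return "$status"
}

# Example (HMC_URL is a placeholder you set to your HMC's REST endpoint):
# with_deadline 30 curl -k -s -o /dev/null "$HMC_URL"
```

Running the same check against each of the two HMCs shows which one is slow, since a hung HMC ends in status 124 while a healthy one returns the command's normal status.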
  • E332FFFF   

    Explanation

    This error occurs when the HMC receives notification that a particular Java code string is corrupted.

    Problem determination



    This is the reason the HMC hung.
  • What was the resolution? An HMC reboot?

    Where did this error appear? In any particular HMC log?


    thanks

  • Yeah, I just rebooted the HMC. This error appeared in the HMC Serviceable Events Overview
    and was repeated 6 times over the past 2 days.
    And there was one more alert:

    E212E151   

    Explanation

    Licensed Internal Code failure on the Hardware Management Console (HMC).

    Response

    CPU Alert: The SE HMC overall was way too busy for too long. Error reason = percent in use scaled by 10.

    I will try to get more logs.  


  • OK, thanks, I think that is enough for identification.
  • I opened a case with IBM support. I will let you know the result.