VIOS showing up under Linux now
I have been playing with Xormon NG (1.5) for a couple of weeks now. Everything was working great! Today, while working on another issue, I noticed that the VIOSes that normally appear alongside all the other AIX LPARs in the menu were no longer in the Power tab, but in the Linux tab. On top of that, the OS agent data was no longer being uploaded for those VIOSes, even under the Linux tab.
My initial thought after looking through the logs was that the HMC was not sending the information required to map the VIOSes correctly, as I was getting the following in the OS agent error log (/home/xormon/xormon-ng/server-nest/logs/linux-error.log):
"<Date>: 252 No mapping for <SN>, vios_<name>".
It is gathering all information for the other LPARs correctly, and displaying all the HMC-gathered and OS agent information as expected.
I have since tried updating Xormon from 1.5 to 1.6.10, with the same issue. I also tried restarting the primary HMC and pointing Xormon at the secondary HMC, thinking someone may have changed a configuration. I am at a loss: nothing has been changed on the VIOSes since implementation, and yet this seemingly just happened. I have a feeling it's HMC related, but I have no way to validate that through Xormon.
What other ways are there to validate that Xormon NG is getting the data it needs from the HMC APIs for the mappings to take place? I am almost to the point of wiping this Xormon NG server to see if that resolves the issue, but I'd love to find a root cause.
Please let me know if there are any logs or screenshots that would be helpful in figuring this out.
Comments
-
Hi,
this looks like the VIOS issue we have seen many times already; try the steps below and let us know if you get nowhere.
VIOS data is not being collected through the REST API because the VIOS is not communicating with the HMC; we have seen that many times, check below.
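If you want to confirm the HMC side yourself, you can query the HMC REST API directly with curl. This is a minimal sketch: "myhmc", the user ID, and the password are placeholders, and 12443 is the standard HMC REST port.
# 1) log on; the session token is returned in the LogonResponse body
curl -k -X PUT \
  -H "Content-Type: application/vnd.ibm.powervm.web+xml; type=LogonRequest" \
  -d '<LogonRequest xmlns="http://www.ibm.com/xmlns/systems/power/firmware/web/mc/2012_10/" schemaVersion="V1_0"><UserID>hscroot</UserID><Password>yourpassword</Password></LogonRequest>' \
  https://myhmc:12443/rest/api/web/Logon
# 2) list every VIOS the HMC can query
curl -k -H "X-API-Session: <token from the Logon response>" \
  https://myhmc:12443/rest/api/uom/VirtualIOServer
If that VirtualIOServer query errors out or is missing a VIOS, the problem is on the HMC/VIOS side, not in Xormon.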
I would start with point 4, then go to point 3; both have helped a few users recently.
1. To resolve the issue: http://www-01.ibm.com/support/docview.wss?uid=isg3T1024482
Restarting the VIOS daemon should also work.
2.
There might be a problem with vio_daemon being stuck or not communicating with the HMC:
https://forum.xorux.com/discussion/comment/3450#Comment_3450
https://www.ibm.com/support/pages/node/629995
It can also be related to resolving the VIOS hostnames (DNS).
This might also help: follow point 2 from the IBM Tech Note linked above to stop and start the VIO daemon; basically, starting or restarting vio_daemon should be enough.
Under root:
ls -l /usr/ios/db/bin/solid*                  <-- check that the Solid DB binaries are present
ps -ef | egrep "vio_daemon|solid|db"          <-- note which daemon/DB processes are running
lssrc -ls vio_daemon
stopsrc -s vio_daemon
startsrc -s vio_daemon
sleep 10
ps -ef | egrep "vio_daemon|solid|db"          <-- confirm the processes came back
lssrc -ls vio_daemon
3. When you go to UI --> IBM Power --> RMC check, is that VIOS listed there?
One user reported that after running the script cleanup_cmdb_with_logging.sh and then load.sh, it worked correctly again.
--> It helped another user as well; he saw an error on the HMC under Virtual Networks: "Error occurred while quering for SharedEthernetAdapter from VIOS ...."
--> (internal link!) https://support.xorux.com/conversation/7350?folder_id=14
4. Here is what resolved our internal issue on our P10 machine: proper DNS setup and a vio_daemon restart.
# tail -1 /etc/hosts
10.x.x.x p10-vios p10-vios.int.xorux.com
# tail -1 /etc/netsvc.conf
hosts=local,bind4
# cat /etc/resolv.conf
domain int.xorux.com
nameserver 10.x.x.x
nameserver 1.1.1.1
--> make sure DNS is working properly (nslookup/ping), for example:
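The hostname and address below are just the placeholders from the /etc/hosts sample above; substitute your own and check that the forward and reverse lookups agree with /etc/hosts:
# nslookup p10-vios.int.xorux.com
# nslookup 10.x.x.x
# ping -c 3 p10-vios.int.xorux.com
Then restart the daemon: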
# stopsrc -s vio_daemon
# startsrc -s vio_daemon
5. Can you test the same solution as described at the end of this thread?
https://forum.xorux.com/discussion/comment/5744#Comment_5744
6. You were indeed right, the problem was with the VIO servers. At first we couldn't see any problems with the VIO functions from the HMC, but digging some more, certain "work with virtual networks" functions would throw an error on the HMC.
The solution was to force the VIO to IPv4 name resolution; we received this from IBM:
# vi /etc/netsvc.conf
hosts=local4,bind4 <===== change "hosts=local,bind" to this
Then
/usr/bin/stopsrc -s vio_daemon
Wait 300 seconds or until vio_daemon has stopped.
/usr/sbin/slibclean
rm -rf /home/ios/CM
/usr/bin/startsrc -s vio_daemon -a '-d 4'
ps -ef |grep vio_chgmgt |grep -v grep |awk -F ' ' '{print $2}'
kill -1 <PID_of_vio_chgmgt>
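Equivalently, the last two steps as a single command (a convenience sketch assuming exactly one vio_chgmgt process is running; the [v] keeps grep from matching itself):
kill -1 $(ps -ef | grep '[v]io_chgmgt' | awk '{print $2}')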
7. IBM support:
The VIOS version that you are using has an issue with log collection, so I'm missing the copy of the CMDB, that is the DB running on the VIOS that the HMC queries.
I do not see any clear error in the logs, but I can't check the DB to see if it's properly populated.
The good news is that we can recreate this DB without any impact to the running LPARs with the following steps:
$ oem_setup_env
# stopsrc -s vio_daemon
# /usr/sbin/slibclean
# rm -rf /home/ios/CM
# rm /home/ios/logs/viod_bkps/*
# startsrc -s vio_daemon -a '-d 3'
# kill -1 <PID_of_vio_daemon>
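To find that PID, lssrc shows it in the PID column:
# lssrc -s vio_daemon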
Then wait a few minutes and retry the operation that was failing.
-
I wanted to post an update. While the symptoms were similar, it wasn't in fact an issue with the CMDB that resides on the VIO. It was an issue with the subnet mask applied to the network interface used by the VIO (not the SEA). It was a typo of "255.0255.255.0", which caused the HMC API query for all the VIO devices to fail. Once this was corrected, on the next cycle Xormon NG picked everything up and all the VIOs managed by the HMC were visible again.
I would have expected a failure from one VIO, but not the entire query to fail. I opened a case with IBM about this, through which we found the issue, and they are investigating A) why AIX even allows a 4-digit value in any of the octets, even with leading zeros, and B) why an issue on one VIO causes the entire query to fail for all VIOs.
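For anyone checking for the same problem, here is a quick way to audit the netmask actually set on a VIOS interface (en0 is just an example device name; run it from oem_setup_env):
# lsattr -El en0 -a netmask
# ifconfig en0
ifconfig prints the mask in hex, so a sane /24 shows up as 0xffffff00; anything else deserves a second look.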
-
ok, thanks for letting us know