Veritas Cluster Debugging Tips
Initial Notes
Veritas Cluster Server (VCS) provides high availability: when a server fails, its processes switch to another
server. All database processes run under the cluster, so it needs to run smoothly. Note that the oracle process
should actually be running only on the server which is active.
On monitoring tools, the procs light for whichever box is secondary should be yellow, because oracle is not
running there. The cluster itself, however, runs on both systems.
Cluster Not Up -- HELP
The normal debugging steps are: check status, restart if there are no faults, check licenses, clear faults
if needed, and check logs.
To find out Current Status:
/opt/VRTSvcs/bin/hastatus -summary
This will give the general status of each machine and processes
/opt/VRTSvcs/bin/hares -display
This gives much more detail - down to the resource level.
If hastatus fails on both machines (it reports that the cluster is not up, or returns nothing), try
to start the cluster:
/opt/VRTSvcs/bin/hastart
/opt/VRTSvcs/bin/hastatus -summary
hastatus -summary will tell you whether processes started properly. hastart will NOT start
processes on a FAULTED system.
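The status check can be scripted. The following is a minimal sketch, not part of the original procedure. It assumes the hastatus -summary layout shown in the sample output later in this document (system lines start with "A", followed by the system name and its state). The function name vcs_systems_not_running is hypothetical.

```shell
# Hypothetical helper: reads `hastatus -summary` output on stdin and prints
# every system whose state is not RUNNING.
# Assumes system lines look like: "A gedb001 RUNNING 0".
# Returns non-zero when no system lines were seen at all (for example when
# hastatus printed nothing because the cluster is not up).
vcs_systems_not_running() {
    awk '$1 == "A" { seen = 1; if ($3 != "RUNNING") print $2 }
         END { exit (seen ? 0 : 1) }'
}
```

Typical use would be: /opt/VRTSvcs/bin/hastatus -summary | vcs_systems_not_running. Empty output with a zero exit status means every listed system is RUNNING.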
Starting Single System NOT Faulted
If the system is NOT FAULTED and only one system is up, the cluster probably needs to have gabconfig
manually started. Do this by running:
/sbin/gabconfig -c -x
/opt/VRTSvcs/bin/hastart
/opt/VRTSvcs/bin/hastatus -summary
If the system is faulted, check licenses and clear the faults as described next.
To check licenses:
vxlicense -p
Make sure all licenses are current and NOT expired! If they are expired, that is your problem. Call
VERITAS to get temporary licenses.
There is a BUG with Veritas licenses. Veritas will not run if there are ANY expired licenses, even if you
have the valid ones you need. To get Veritas to run, you will need to MOVE the expired licenses. [Note: as I
understand it, at a minimum the VXFS, VxVM and RAID licenses must NOT be expired.]
vxlicense -p
Note the NUMBER after the license name (e.g. Feature name: DATABASE_EDITION [100])
cd /etc/vx/elm
mkdir old
mv lic.number old [do this for all expired licenses]
vxlicense -p [Make sure there are no expired licenses AND your good licenses are there]
hastart
If it still fails, call Veritas for temporary licenses. Otherwise, be certain to do the same on your second
machine.
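Moving the expired license files can be wrapped in a small function. This is a sketch under stated assumptions: the license directory (/etc/vx/elm in the text) and the lic.NUMBER file naming come from the steps above, while the function name archive_licenses and its argument order are made up for illustration. Deciding which numbers are expired is still done by reading vxlicense -p yourself.

```shell
# Hypothetical helper: move the given license files into an "old"
# subdirectory so that they are no longer picked up.
# Usage: archive_licenses LICENSE_DIR NUMBER [NUMBER...]
# LICENSE_DIR is /etc/vx/elm in the text; files are named lic.NUMBER.
archive_licenses() {
    dir=$1
    shift
    mkdir -p "$dir/old" || return 1
    for num in "$@"; do
        if [ -f "$dir/lic.$num" ]; then
            mv "$dir/lic.$num" "$dir/old/" || return 1
        else
            # Report but keep going, so one wrong number does not stop the rest.
            echo "no such license file: $dir/lic.$num" >&2
        fi
    done
}
```

For example, archive_licenses /etc/vx/elm 100 would move lic.100 into /etc/vx/elm/old. Run vxlicense -p afterwards to confirm that only valid licenses remain.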
To clear FAULTS:
hares -display
For each resource that is faulted run:
hares -clear resource-name -sys faulted-system
Once they clear, run hastatus -summary to confirm that the faults are gone. If some don't
clear, you MAY be able to clear them at the group level. Only do this as a last resort:
hagrp -disableresources groupname
hagrp -flush groupname -sys sysname
hagrp -enableresources groupname
To get a group to go online:
hagrp -online group -sys desired-system
If it did NOT clear, did you check licenses?
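When many resources are faulted, the per-resource hares -clear commands can be generated from the hares -display output. This is only a sketch: it assumes State lines of the form "resource State system VALUE" (as in the sample output in this document) and that a faulted resource has FAULTED somewhere in VALUE. The function name vcs_clear_commands is hypothetical. It prints the commands and executes nothing, so you can review them first.

```shell
# Hypothetical helper: reads `hares -display` output on stdin and prints one
# `hares -clear` command per faulted resource. Nothing is executed.
# Assumption: the State value of a faulted resource contains "FAULTED".
vcs_clear_commands() {
    awk '$2 == "State" && $4 ~ /FAULTED/ {
        print "hares -clear " $1 " -sys " $3
    }'
}
```

Typical use would be: hares -display | vcs_clear_commands, then run the printed lines by hand once they look right.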
System has the following EXACT status:
gedb002# hastatus -summary
-- SYSTEM STATE
-- System     State      Frozen
A  gedb001    RUNNING    0
A  gedb002    RUNNING    0
-- GROUP STATE
-- Group      System     Probed   AutoDisabled   State
B  oragrp     gedb001    Y        N              OFFLINE
B  oragrp     gedb002    Y        N              OFFLINE
gedb002# hares -display | grep ONLINE
nic-qfe3   State   gedb001   ONLINE
nic-qfe3   State   gedb002   ONLINE
gedb002# vxdg list
NAME     STATE     ID
rootdg   enabled   957265489.1025.gedb002
gedb001# vxdg list
NAME     STATE     ID
rootdg   enabled   957266358.1025.gedb001
Recovery Commands:
hastop -all
On one machine: hastart
Wait a few minutes
On the other machine: hastart
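To recognise this state from a script, one can look for groups that are OFFLINE on every system. The sketch below assumes the group lines of hastatus -summary look like the sample above ("B", group, system, Probed, AutoDisabled, State). The function name vcs_groups_offline_everywhere is hypothetical.

```shell
# Hypothetical helper: reads `hastatus -summary` output on stdin and prints
# each group whose state is exactly OFFLINE on every system that lists it.
# Assumes group lines look like: "B oragrp gedb001 Y N OFFLINE".
vcs_groups_offline_everywhere() {
    awk '$1 == "B" {
             seen[$2] = 1
             if ($6 != "OFFLINE") other[$2] = 1
         }
         END {
             for (g in seen) if (!(g in other)) print g
         }' | sort
}
```

With the sample status above, hastatus -summary | vcs_groups_offline_everywhere would print oragrp. A group that is ONLINE or faulted on any system is not reported.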
Reviewing Log Files
If you are still having trouble, look at the logs in /var/VRTSvcs/log. Look at the most recent ones for
debugging purposes (ls -ltr). Here is a short description of the logs in /var/VRTSvcs/log:
hashadow-log_A: hashadow checks to see if the ha cluster daemon (had) is up and restarts it if needed. This
is the log of that process.
engine.log_A: primary log, usually what you will be reading for debugging
Oracle_A: oracle process log (related to cluster only)
Sqlnet_A: sqlnet process log (related to cluster only)
IP_A: related to shared IP
Volume_A: related to Volume manager
Mount_A: related to mounting the actual filesystems
DiskGroup_A: related to Volume Manager/Cluster Server
NIC_A: related to actual network device
By looking at the most recent logs, you can tell what failed last (or most recently). You can also tell what did
NOT run, which may be just as much of a clue. Of course, if none of this helps, open a call with Veritas tech
support.
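Finding the most recently written logs is easy to script. A minimal sketch, where the function name newest_logs and the default of five files are assumptions made for illustration:

```shell
# Hypothetical helper: print the N most recently modified files in a log
# directory, newest first.
# Usage: newest_logs DIR [N]     (N defaults to 5)
newest_logs() {
    ls -t "$1" | head -n "${2:-5}"
}
```

For example, newest_logs /var/VRTSvcs/log 3 lists the three logs that were written to last, which is usually where to start reading.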
Calling Tech Support:
If you have tried the previously described debugging methods, call Veritas tech support: 800-634-4747. Your
company needs to have a Veritas support contract.
Restarting Services:
If a system is shut down gracefully while it is running oracle or other high availability services, the
cluster will NOT transfer them. It only transfers services when the system crashes or has an error.
hastart
hastatus -summary
hastatus -summary will tell you whether processes started properly. hastart will NOT start processes on a
FAULTED system. If the system is faulted, clear the faults as described above.
Doing Maintenance on DBs:
BEFORE working on DB
Run hastop -all -force
AFTER working on DBs:
You MUST bring up oracle on the same machine
Once oracle is up, run:
hastart on the same machine you started the work on (the first system, with oracle running)
wait 3-5 minutes
then run hastart on the other system
If you need the instance to run on the other system, you can run: hagrp -switch oragrp -to othersystem
Shutting down db machines:
If you shut down the machine that is running the Veritas cluster services, they will NOT start on the other
machine. The cluster only fails over if the machine crashes. You need to switch the services manually before you
shut down the machine. To switch processes:
Find out groups to transfer over
hagrp -display
Switch over each group
hagrp -switch group-to-move -to new-system
Then shut down the machine as desired. When it is rebooted, it will start the cluster daemon automatically.
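The switch commands for the steps above can be generated rather than typed one by one. This sketch reads hastatus -summary instead of hagrp -display, because this document shows the summary layout ("B", group, system, Probed, AutoDisabled, State). The function name vcs_switch_commands is hypothetical, and the commands are only printed, not run.

```shell
# Hypothetical helper: reads `hastatus -summary` output on stdin and prints a
# `hagrp -switch` command for each group that is ONLINE on FROM_SYSTEM.
# Usage: hastatus -summary | vcs_switch_commands FROM_SYSTEM TO_SYSTEM
# Assumes group lines look like: "B oragrp gedb001 Y N ONLINE".
vcs_switch_commands() {
    awk -v from="$1" -v to="$2" '
        $1 == "B" && $3 == from && $6 == "ONLINE" {
            print "hagrp -switch " $2 " -to " to
        }'
}
```

For example, hastatus -summary | vcs_switch_commands gedb001 gedb002 prints one hagrp -switch line per group currently online on gedb001. Review the lines, then run them before shutting the machine down.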
Doing Maintenance on Admin Network:
If the admin network that the Veritas cluster uses is brought down, Veritas WILL fault both machines AND bring
down oracle (nicely). You will need to do the following to recover:
hastop -all
On ONE machine: hastart
wait 5 minutes
On other machine: hastart
Manual start/stop WITHOUT veritas cluster:
THIS IS ONLY USED WHEN THERE ARE DB FAILURES
If possible, use the section on DB Maintenance instead. Only use this if the system fails on coming up AND you
KNOW that it is due to a db configuration error. If you manually start up filesystems/oracle, manually shut them
down and restart using hastart when done.
To startup:
Make sure ONLY the rootdg disk group is active on BOTH nodes. This is EXTREMELY important: if the oracle disk
group is active on both nodes, corruption occurs. [i.e. oradg or xxoradg must NOT be present]
vxdg list
hastatus (stop on both, as you are faulted on both machines)
hastop -all (if either was active, make sure you are truly shut down!)
Once you have confirmed that the oracle disk group is not active, on ONE machine do the following:
vxdg import oradg [this may be xxoradg, where xx is the two-character client code]
vxvol -g oradg startall
mount -F vxfs /dev/vx/dsk/oradg/name /mountpoint
[Find volumes and mount points in /etc/VRTSvcs/conf/config/main.cf]
Let DBAs do their stuff
To shut down:
umount /mountpoint [for each mountpoint]
vxvol -g oradg stopall [stop the volumes BEFORE deporting the disk group]
vxdg deport oradg
clear faults; start cluster as described above
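The rootdg-only precondition for a manual start can be checked mechanically on each node. A minimal sketch, assuming the vxdg list layout shown earlier in this document (a NAME STATE ID header followed by one line per imported disk group). The function name only_rootdg_imported is hypothetical.

```shell
# Hypothetical helper: reads `vxdg list` output on stdin, prints any imported
# disk group other than rootdg, and returns non-zero if one was found.
# Assumes a "NAME STATE ID" header line followed by one line per disk group.
only_rootdg_imported() {
    awk 'NF > 0 && $1 != "NAME" && $1 != "rootdg" { print $1; bad = 1 }
         END { exit (bad ? 1 : 0) }'
}
```

Run vxdg list | only_rootdg_imported on BOTH nodes. Only go on to vxdg import when it prints nothing and returns zero on both.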