Testing Veritas Clusters
Actual commands are in black.
0. Check Veritas Licenses - for FileSystem, Volume Manager AND Cluster
vxlicense -p
If any license is invalid or expired, get it FIXED before continuing! Every license should say "No
expiration". If ANY license shows an actual expiration date, the test has failed: permanent licenses do NOT have an
expiration date. Non-essential licenses may be moved, but a senior admin should do this.
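The check above can be scripted roughly as follows. This is a minimal sketch, not a definitive tool: the exact output format of vxlicense -p is not shown in this document, so the helper (a hypothetical name, find_expiring_licenses) only assumes that each relevant line contains some form of the word "expiration" and that permanent licenses read "No expiration", as quoted above. It assumes a POSIX shell and awk (nawk on Solaris).

```shell
# Sketch: flag license lines that carry an actual expiration date.
# Assumption: permanent licenses read "No expiration"; any other line
# mentioning "expir..." (expired, expiration date) is treated as suspect.
find_expiring_licenses() {
    # Reads a license report on stdin and prints the suspect lines.
    # Exit status: 0 = nothing suspect found, 1 = at least one suspect line.
    awk '
        tolower($0) ~ /expir/ && tolower($0) !~ /no expiration/ {
            print
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    '
}
```

On a cluster node this might be used as: vxlicense -p | find_expiring_licenses. Any line it prints should be reviewed by hand before continuing.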
1. Hand check SystemList & AutoStartList
On either machine:
grep SystemList /etc/VRTSvcs/conf/config/main.cf
You should get:
SystemList = { system1, system2 }
grep AutoStartList /etc/VRTSvcs/conf/config/main.cf
You should get:
AutoStartList = { system1, system2 }
Each list should contain both machines. If not, many of the next tests will fail.
If your lists do NOT contain both systems, you will probably need to modify them with the commands that follow.
more /etc/VRTSvcs/conf/config/main.cf (check that it looks reasonable; it is likely that the systems aren't fully set up)
haconf -makerw (this lets you write the conf file)
hagrp -modify oragrp SystemList system1 0 system2 1
hagrp -modify oragrp AutoStartList system1 system2
haconf -dump -makero (this makes conf file read only again)
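The hand check of SystemList and AutoStartList can be sketched as a small read-only helper. The function name list_has_systems is hypothetical, and the sketch assumes a grep that supports -w (whole-word match). If main.cf defines several groups, every matching line is pooled together, so inspect each group by hand in that case.

```shell
# Sketch: confirm that every named system appears in a main.cf attribute list.
# list_has_systems is a hypothetical helper; it only reads the file.
list_has_systems() {
    attr=$1
    file=$2
    shift 2
    line=$(grep "$attr" "$file") || return 1   # attribute line is missing
    for sys in "$@"; do
        # Match the system name as a whole word inside the braces.
        printf '%s\n' "$line" | grep -w -- "$sys" >/dev/null || return 1
    done
    return 0
}
```

Example use: list_has_systems SystemList /etc/VRTSvcs/conf/config/main.cf system1 system2 && echo "SystemList OK". Repeat with AutoStartList.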
2. Verify Cluster is Running
First verify that veritas is up & running:
hastatus -summary
If this command could NOT be found, add the following to root's path in /.profile:
vi /.profile
add /opt/VRTSvcs/bin to your PATH variable
If /.profile does not already exist, use this one:
PATH=/usr/bin:/usr/sbin:/usr/ucb:/usr/local/bin:/opt/VRTSvcs/bin:/sbin:$PATH
export PATH
. /.profile
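If /.profile is sourced more than once, a plain PATH= line keeps growing the variable. A minimal sketch of a guard, using a hypothetical helper name path_append (not part of Solaris or Veritas), adds the directory only when it is not already listed:

```shell
# Sketch: append a directory to PATH only if it is not already present.
# path_append is a hypothetical helper name.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already listed, nothing to do
        *) PATH="$PATH:$1" ;;       # append at the end
    esac
    export PATH
}
```

In /.profile this would be called as: path_append /opt/VRTSvcs/bin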
Re-verify command now runs if you changed /.profile:
hastatus -summary
Here is the expected result (your SYSTEMs/GROUPs may vary):
One system should be ONLINE and the other OFFLINE, for example:
# hastatus -summary
-- SYSTEM STATE
-- System State Frozen
A e4500a RUNNING 0
A e4500b RUNNING 0
-- GROUP STATE
-- Group System Probed AutoDisabled State
B oragrp e4500a Y N ONLINE
B oragrp e4500b Y N OFFLINE
If your systems do not show the above status, try these debugging steps:
- If NO systems are up, run hastart on both systems and run
hastatus -summary again.
- If only one system is shown, start the other system with hastart. Note: one
system should ALWAYS be OFFLINE for the way we configure systems here. (If we ran Oracle Parallel Server, this
could change, but currently we run standard Oracle server.)
- If both systems are up but both are OFFLINE, hastart did NOT correct the problem, and the oracle filesystems are
not mounted on either system, the cluster needs to be reset. (This happens under strange network situations
with GE Access.) In other words, you ran hastart and that wasn't enough to get the full cluster to work.
Verify that the systems have the following EXACT status (though the machine names will vary for
other customers):
gedb002# hastatus -summary
-- SYSTEM STATE
-- System State Frozen
A gedb001 RUNNING 0
A gedb002 RUNNING 0
-- GROUP STATE
-- Group System Probed AutoDisabled State
B oragrp gedb001 Y N OFFLINE
B oragrp gedb002 Y N OFFLINE
gedb002# hares -display | grep ONLINE
nic-qfe3 State gedb001 ONLINE
nic-qfe3 State gedb002 ONLINE
gedb002# vxdg list
NAME STATE ID
rootdg enabled 957265489.1025.gedb002
gedb001# vxdg list
NAME STATE ID
rootdg enabled 957266358.1025.gedb001
Recovery Commands:
hastop -all
hastart (on one machine)
(wait a few minutes)
hastart (on the other machine)
hastatus -summary (make sure one is OFFLINE and one is ONLINE)
If none of these steps resolved the situation, contact Lorraine or Luke (possibly Russ Button or Jen Redman
if they made it to Veritas Cluster class) or a Veritas Consultant.
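The "one ONLINE, one OFFLINE" condition checked throughout this step can be tested mechanically. The following is a sketch under assumptions: it relies on the layout of the sample hastatus -summary output earlier in this step (GROUP STATE rows start with "B", group name in column 2, state in the last column), and check_group_state is a hypothetical helper name. It needs an awk that supports -v (nawk on Solaris).

```shell
# Sketch: verify a group is ONLINE on exactly one system and OFFLINE on
# exactly one other, reading `hastatus -summary` output from stdin.
# Assumes GROUP STATE rows look like: B oragrp e4500a Y N ONLINE
check_group_state() {
    awk -v grp="$1" '
        $1 == "B" && $2 == grp {
            if ($NF == "ONLINE")  online++
            if ($NF == "OFFLINE") offline++
        }
        END {
            printf "online=%d offline=%d\n", online, offline
            exit ((online == 1 && offline == 1) ? 0 : 1)
        }
    '
}
```

Example use: hastatus -summary | check_group_state oragrp. A non-zero exit status means the cluster is not in the expected state and the debugging steps above apply.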
3. Verify Services Can Switch Between Systems
Once hastatus -summary works, note the GROUP name used. Usually it will be "oragrp", but the installer can use
any name, so please determine its name.
First check that the group can switch back and forth. On the system where the group is ONLINE (system1), switch it to the other
system (system2):
hagrp -switch groupname -to system2 [e.g.: hagrp -switch oragrp -to e4500b]
Watch the failover with hastatus -summary. Once it has failed over, switch it back:
hagrp -switch groupname -to system1
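Instead of re-running hastatus -summary by hand while the switch completes, the wait can be sketched as a polling loop. Everything named here (wait_for_online, the SUMMARY_CMD variable, the default of 30 tries at 10-second intervals) is an assumption of this sketch, and it depends on the GROUP STATE row layout shown in the sample output in step 2.

```shell
# Sketch: poll until a group reports ONLINE on the target system.
# SUMMARY_CMD exists only so the status source can be swapped out;
# by default it runs the real `hastatus -summary`.
SUMMARY_CMD=${SUMMARY_CMD:-"hastatus -summary"}

wait_for_online() {
    grp=$1
    sys=$2
    tries=${3:-30}      # number of polls before giving up
    delay=${4:-10}      # seconds between polls
    while [ "$tries" -gt 0 ]; do
        # Pick the state column of the row for this group and system.
        state=$($SUMMARY_CMD | awk -v g="$grp" -v s="$sys" \
            '$1 == "B" && $2 == g && $3 == s { print $NF }')
        [ "$state" = "ONLINE" ] && return 0
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1            # gave up; inspect hastatus -summary by hand
}
```

Example use after a switch: wait_for_online oragrp e4500b || echo "group did not come ONLINE on e4500b".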
4. Verify OTHER System Can Go Up & Down Smoothly For Maintenance
On the system that is OFFLINE (should be system2 at this point), reboot the computer:
ssh system2
/usr/sbin/shutdown -i6 -g0 -y
Make sure the system rejoins the cluster once the reboot is finished: hastatus should again list the
second system, with the group OFFLINE on it.
hastatus -summary
Once this is done, switch the group to system2 and repeat the reboot for the other system:
hagrp -switch groupname -to system2
ssh system1
/usr/sbin/shutdown -i6 -g0 -y
Verify that system1 is in cluster once rebooted
hastatus -summary
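Whether a rebooted node is back in the cluster can be read from the SYSTEM STATE rows of hastatus -summary. A minimal sketch, assuming those rows start with "A" and carry the state in column 3 as in the sample output in step 2 (systems_running is a hypothetical helper name):

```shell
# Sketch: succeed only if at least `want` systems are listed in the
# SYSTEM STATE section and every listed system is RUNNING.
# Reads `hastatus -summary` output from stdin.
systems_running() {
    awk -v want="${1:-2}" '
        $1 == "A" {
            total++
            if ($3 == "RUNNING") up++
        }
        END { exit ((total >= want + 0 && up == total) ? 0 : 1) }
    '
}
```

Example use: hastatus -summary | systems_running 2 && echo "both nodes are in the cluster".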
5. Test Actual Failover For System 2 (and pray db is okay)
To do this, we will kill off the listener process, which should force a failover. This test SHOULD be okay for
the db (that is why we choose LISTENER), but there is a very small chance things will go wrong, hence the "pray"
part :).
On the system that is online (should be system2), kill off the ORACLE LISTENER process:
ps -ef | grep LISTENER
Output should be like:
root 1415 600 0 20:43:58 pts/0 0:00 grep LISTENER
oracle 831 1 0 20:27:06 ? 0:00 /apps/oracle/product/8.1.5/bin/tnslsnr LISTENER -inherit
kill -9 process-id (the PID is the first number on the tnslsnr line, NOT the grep line - in this case 831)
Failover will take a few minutes
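Because the grep command itself shows up in the ps output, it is easy to pick the wrong PID. A small sketch (listener_pids is a hypothetical helper name) that extracts only the tnslsnr PID from ps -ef output, assuming the PID is in column 2 as in the sample above:

```shell
# Sketch: print the PID (column 2 of `ps -ef`) of every tnslsnr process,
# skipping the grep command that may appear in the same listing.
listener_pids() {
    awk '/tnslsnr/ && !/grep/ { print $2 }'
}
```

Example use: ps -ef | listener_pids. Check the printed PID against the ps listing before passing it to kill -9.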
You will note that system 2 is faulted -- and system 1 is now online
You need to CLEAR the fault before trying to fail back over.
hares -display | grep FAULT (look for the resource that has failed, in this case LISTENER)
Clear the fault:
hares -clear resource-name -sys faulted-system [e.g.: hares -clear LISTENER -sys e4500b]
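The lookup and the clear command can be tied together in a dry-run helper that only prints the hares -clear lines to review, without running them. This sketch assumes hares -display prints four columns (resource, attribute, system, value), as in the nic-qfe3 sample in step 2, and that a faulted resource has FAULT somewhere in the value of its State row. fault_clear_commands is a hypothetical name.

```shell
# Sketch: turn faulted State rows from `hares -display` (read on stdin)
# into the matching `hares -clear` commands. Nothing is executed here;
# the commands are only printed for review.
fault_clear_commands() {
    awk '$2 == "State" && $4 ~ /FAULT/ {
        printf "hares -clear %s -sys %s\n", $1, $3
    }'
}
```

Example use: hares -display | fault_clear_commands. Run the printed command by hand once it looks right.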
6. Test Actual Failover For System 1 (and pray db is okay)
Now we do the same thing for the other system. First verify that the other system is NOT faulted:
hastatus -summary
Now repeat the test on this system: again we kill off the listener process, which should force a
failover.
On the system that is online (should now be system1), kill off the ORACLE LISTENER process:
ps -ef | grep LISTENER
Output should be like:
oracle 987 1 0 20:49:19 ? 0:00 /apps/oracle/product/8.1.5/bin/tnslsnr LISTENER -inherit
root 1330 631 0 20:58:29 pts/0 0:00 grep LISTENER
kill -9 process-id (the PID is the first number on the tnslsnr line, NOT the grep line - in this case 987)
Failover will take a few minutes
You will note that system 1 is faulted -- and system 2 is now online
You need to CLEAR the fault before trying to fail back over.
hares -display | grep FAULT (look for the resource that has failed, in this case LISTENER)
Clear the fault:
hares -clear resource-name -sys faulted-system [e.g.: hares -clear LISTENER -sys e4500a]
Run:
hastatus -summary
to make sure everything is okay.
An excellent reference book for Veritas Clusters is: