Using Shell Scripts for Unix System Administration
Reader Please Note: As far as I know, the information in this document is
accurate. If you find any errors, or have comments, additions, or
questions, please feel free to contact me at billetter@networktechnologist.com
Introduction
Part of the job of a Unix system administrator is to look constantly
for problems that are developing or have already occurred. Shell scripts
are a perfect way to automate this process. One of the first things
that I do when I work on a new system is develop a shell script that will
go out and look for items that are critical. Critical items include situations
that would concern any Unix administrator, such as a file system that is
full or an NFS mount that has been dropped. But there are also items that
are unique to each system. For example, a production system may require
certain processes or certain critical databases to be running.
By using a shell programming language it is possible to quickly write
a simple script that will check for critical situations and give you a heads
up. Very often, by running these scripts when I arrive at work first
thing in the morning, I can identify problems before they become
serious.
What Situations do we check for?
Every Unix administrator will have their own punch list of items that they
consider critical. My purpose here is to give you some ideas, so this
list is not meant to be comprehensive. It is just
a sample of some items which may be good to check!
- How long has the system been running (in other words, has it rebooted
recently!).
- How many users are on the system?
- What is the CPU utilization?
- Are any file systems full (or nearly full)?
- Is the XXXX process running (you replace XXXX with the process name
you are interested in)?
- Are the NFS mounts up?
These items are a pretty good start, and the techniques used to script their
solution can easily be used for the additional tasks that you will inevitably
want to add.
How to write the script
For this task I have used the POSIX shell. This shell works on many
different Unix systems. As far as I know these commands will also work
with the Korn and Bourne shells. The commands are not very complicated.
The first step is to create a header for the file.
This is critical so that the script can be maintained. I usually
use one similar to the following:
#!/usr/bin/sh
#
#######################################################################
#
# sareport
#
# The purpose of this script is to check for critical conditions
# that require system administrator intervention.
#
########################################################################
#
# Date: 8/26/02
#
# Author: Bill Etter
#
########################################################################
#
# Modification History:
#
#
#########################################################################
With a script header such as this, there is no question about what
the script does, when it was written, by whom, and what its modification history
is. Of course the very first line of the script is the familiar "shebang"
that indicates which shell to use and its location.
After the header has been written, it is proper to initialize some variables.
I use something like this:
HOST=`hostname`
DAY=`date '+%b %d'`
TIME=`date '+%H:%M'`
This gives me a variable $HOST that holds the result of the hostname
command. The ` (back quote) characters tell the shell to execute the command
between them and substitute its output. The variables $DAY and $TIME both execute the
date command. However, the addition of '+%b %d' gives me just the abbreviated month name
and day of the month, not the entire date string, and '+%H:%M' gives me
just the hour and minute. I will use these
variables later in the script.
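The same initialization can also be written with the $(...) form of command substitution, which POSIX shells accept and which nests more readily than back quotes. This is only a sketch: the original Bourne shell does not understand $(...), and uname -n is used here in place of hostname on the assumption that both report the same name on your system.

```shell
# Same three variables as above, using $(...) instead of back quotes.
HOST=$(uname -n)          # assumption: same value that hostname prints
DAY=$(date '+%b %d')      # abbreviated month name and day of month
TIME=$(date '+%H:%M')     # 24-hour clock, hours and minutes

printf 'Report for %s on %s at %s\n' "$HOST" "$DAY" "$TIME"
```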
Now we can check to see how long this particular machine has been up. This
is done with the following commands:
UP=`uptime | cut -c13-26`
echo "\nAs of $DAY at $TIME, $HOST has been up for $UP hours."
uptime is the Unix command that tells us how long the system has been up
and running. The uptime command prints additional information, so by piping
the results of uptime to cut, it is possible to remove the extraneous information
and store the result in the variable $UP. The character positions 13-26 match
the uptime output on my system; the layout differs between platforms, so
adjust them for yours. The echo command tells us
the date, time, hostname and how long the system has been up.
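Because cut relies on fixed character positions, a pattern-based extraction may travel better between systems. The sketch below assumes the uptime line has the common shape "TIME up DURATION, N users, load average: ..."; uptime_field is a hypothetical helper name, and the pattern may need adjusting for your platform.

```shell
# uptime_field: read one line of uptime output on stdin and print only
# the part between "up" and the user count.
# Assumes the layout "<time> up <duration>, <n> users, load average: ..."
uptime_field() {
    sed -e 's/^.* up *//' -e 's/, *[0-9]* users*.*$//'
}

# Try it on a sample line (made-up values, not real output).
sample=' 10:15am  up 12 days,  3:42,  4 users,  load average: 0.10, 0.20, 0.30'
printf '%s\n' "$sample" | uptime_field

# Typical use inside the report: UP=$(uptime | uptime_field)
```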
The next task is to determine the number of users on the system. This
is done by using the command who:
USERS=`who -u|wc -l`
echo "\nCurrently $USERS users are using the system."
By piping the results of the who -u command to wc -l (word count) we can
count the number of lines reported. Since each user is listed on a line,
the result of this command is the number of users. Then we simply echo
the results to the screen.
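Note that who lists one line per login session, so a user with several terminals open is counted several times. If you would rather report distinct login names, a sketch along these lines may help; count_users is a hypothetical helper name that reads who-style output on standard input.

```shell
# count_users: read who-style output on stdin and print the number of
# distinct login names.
count_users() {
    # The first field of each line is the login name; sort -u removes repeats.
    # tr strips the leading blanks that some wc implementations print.
    awk '{ print $1 }' | sort -u | wc -l | tr -d ' '
}

# Typical use: USERS=$(who | count_users)
```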
Now we need to check CPU Utilization. This is done by using the sar
(system activity reporter) command:
echo "Checking CPU utilization . . . \n"
sar -M 1
This command may need to be adjusted for your system. On an HP-UX
system this command will list the cpu statistics for each cpu in the system.
The 1 tells sar to only display a single reporting sample. Normally
sar is used to take periodic samples so that you can see trends develop.
Rather than storing the results in a variable, it is easier just to
send the results of sar directly to the screen.
Now we are concerned about file systems that are full. With file systems
there are 2 situations. First the file system may be full - that means
it is at 100% utilization. But, we are also concerned about file systems
that are nearly full. For example, you may decide that any file system
that is 90% or more full is worthy of knowing about. Here is how this
is done:
FULL=`bdf | grep '9[0-9]%' | wc -l`
if [ $FULL -gt 0 ]
then
echo "Checking for filesystems that are nearly full . . ."
echo "\n ***** WARNING ***** $FULL filesystems are at 90% or more capacity."
fi
This part of the script uses a simple conditional if statement. First
the bdf command is run. Since this script was taken from an HP-UX system,
the bdf command was used. The bdf command is not available on all systems
(Red Hat Linux, for example, does not have it). On other systems the df command
can be used instead. The output of bdf is piped to grep, which keeps
only the lines that contain a value from 90% to 99%. The results are
then piped to wc -l, so that the number of filesystems that are at 90% or
more can be counted. The if statement will only print out a warning
if the value of $FULL is greater than 0. If no filesystems are at 90%
or more, the if statement will not print anything.
Now we need to look for filesystems that are 100% full. This would
correctly be identified as a critical situation, not just a warning. The
logic is similar to the previous. Here is what it looks like:
FULL=`bdf | grep '100%' | wc -l`
if [ $FULL -gt 0 ]
then
echo "Checking for filesystems that are full . . . "
echo "\n ***** CRITICAL ***** These file systems are 100% full: \n"
bdf | grep '100%' | cut -c52-132
fi
Rather than just count the filesystems that are 100% full, we actually
print them out by running the bdf command again and sending its output to the screen.
However, the bdf output has some information that we don't need to
display, so the cut command is again used to remove extraneous information.
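On systems without bdf, both checks can be expressed with one helper that compares the usage column numerically instead of matching text. The sketch below assumes the portable output format of df -P (six columns, capacity in the fifth, mount point in the sixth) and mount points without blanks; fs_over is a hypothetical helper name.

```shell
# fs_over LIMIT: read "df -P" style output on stdin and print the mount
# point and usage of every filesystem at or above LIMIT percent.
# Assumes mount points contain no blanks (field 6 is the mount point).
fs_over() {
    awk -v limit="$1" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)              # "95%" -> "95"
        if (pct + 0 >= limit + 0)
            print $6, $5
    }'
}

# Typical use:
#   df -P | fs_over 90     # warning level
#   df -P | fs_over 100    # critical level
```

Unlike the grep for 9[0-9]%, a limit of 90 here also reports filesystems that are at 100%.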
Now let's look for a particular process running on the computer. Very
often in my work I find myself in an environment that uses Oracle databases.
When Oracle is running under Unix, there will be a process called pmon
running. So, if I wanted to know how many Oracle databases were up and
running, here is how I would do it:
DB=`ps -ef | grep pmon | grep -v grep | wc -l`
echo "Checking to see if any databases are running . . . "
if [ $DB -gt 0 ]
then
echo "\n$DB databases are currently running on $HOST.\n"
ps -ef | grep -v grep | grep pmon | cut -c58-132
else
echo "\n***** CRITICAL ***** No databases are running on $HOST"
fi
To check for processes that are running, we use the ps -ef command. Its
output is piped to grep to look for any process whose entry contains pmon.
A second grep with the -v option then removes the grep command itself from
the list. If this were not done there would be an extra
process in the count, since grep would match its own command line. The results are then piped to
wc -l so that a count of the databases can be made.
Very often Oracle environments will be running several databases, each
of which must be running.
If the count is greater than zero, that means that some Oracle databases
are running. Next we run the ps -ef command again. This time the
results are printed on the screen. The result is that not only do we
know how many databases are running, but we can see a list of each database
that is running. This if statement uses the else option, so that if
no databases are running the appropriate message is printed out.
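A common alternative to the extra grep -v grep stage is to put one character of the pattern in brackets. The pattern [p]mon still matches the text pmon, but the command line that carries the pattern contains the text "[p]mon", which the pattern does not match. The sketch below applies this with awk so the count is produced in one step; count_pmon is a hypothetical helper name.

```shell
# count_pmon: read "ps -ef" style output on stdin and print how many
# lines mention pmon. The [p] keeps the pattern from matching the
# command line that contains the pattern itself.
count_pmon() {
    awk '/[p]mon/ { n++ } END { print n + 0 }'
}

# Typical use: DB=$(ps -ef | count_pmon)
```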
Now, in some environments a specific process needs to be running. The
above logic can easily be used to test if a specific process is running. Here
is how to test to see if an Oracle database called BILL is running:
if [ "$HOST" = linux ]
then
echo "Checking to see if BILL is running . . . "
BILLUP=`ps -ef | grep -v grep | grep pmon | grep BILL | wc -l`
if [ $BILLUP -eq 0 ]
then
echo "\n***** CRITICAL ***** Database BILL is DOWN on $HOST ! ! ! !"
else
echo "\nDatabase BILL is RUNNING on $HOST."
fi
fi
This series of commands uses a conditional to verify the hostname. Often
a production
environment will have certain databases running on each machine. This
conditional allows you to set up the logic to check for the proper databases
on the proper machine. We then further refine the grep logic to look not only
for pmon, but for pmon together with the name of the database that we
want to see running, in this case BILL. The inner if statement
prints an alarm message if the database is not running, and a normal status
message if it is.
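When several machines each need their own set of databases, the hostname test can grow into a small lookup table. The following is a minimal sketch, not a finished tool: required_dbs, db_running and check_dbs are hypothetical helper names, and the host and database names in the case statement are placeholders to edit for your site.

```shell
# required_dbs HOST: print the database names expected on HOST.
# The host and database names are placeholders; edit the table for your site.
required_dbs() {
    case "$1" in
        linux) echo "BILL" ;;
        *)     echo "" ;;
    esac
}

# db_running NAME: read "ps -ef" style output on stdin; succeed when a
# pmon line that also mentions NAME is present.
db_running() {
    awk -v db="$1" '/[p]mon/ && index($0, db) { found = 1 } END { exit !found }'
}

# check_dbs HOST: read "ps -ef" style output on stdin and report on each
# database required on HOST.
check_dbs() {
    snapshot=$(cat)
    for db in $(required_dbs "$1"); do
        if printf '%s\n' "$snapshot" | db_running "$db"; then
            printf 'Database %s is RUNNING on %s.\n' "$db" "$1"
        else
            printf '***** CRITICAL ***** Database %s is DOWN on %s\n' "$db" "$1"
        fi
    done
}

# Typical use: ps -ef | check_dbs "$HOST"
```

As with the grep version, the name is matched as a substring, so BILL would also match a database called BILLING.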
Our final task is to check for NFS mounts. NFS is a useful method
of accessing things on other machines. It allows Unix systems to mount
filesystems from other Unix machines, from Linux machines and even from other systems
like Novell and Windows. However, NFS is not a perfect system, and it often
causes problems. One problem that we need to check for is whether the
mount is present or not. This can be done with the bdf command.
NFS=`bdf | grep : | wc -l`
echo "Checking for NFS mounts . . ."
if [ $NFS -gt 0 ]
then
echo "\nThere are $NFS NFS mounts on $HOST: \n"
bdf | grep : | cut -d " " -f1
else
echo "\n***** CRITICAL ***** No NFS mounts currently exist on $HOST ! ! !"
fi
This time we look for the : character in the listing. This character
appears in each line that describes an NFS mount, because the remote
filesystem is listed as host:/path. Each NFS mount is counted by
piping every line that contains : to wc -l. The if statement then checks
to see if the number of NFS mounts is greater than 0. If not, then
a critical message is printed. Otherwise a count of the mounts is displayed
along with a listing of each NFS mount.
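Counting mounts tells you that some NFS mount exists, but not that the one you depend on is there. A sketch of a more specific check follows, assuming df -P style output (device in the first column, mount point in the sixth); nfs_mounts and nfs_present are hypothetical helper names.

```shell
# nfs_mounts: read "df -P" style output on stdin and print
# "remote on mountpoint" for each line whose device field contains a colon.
nfs_mounts() {
    awk 'NR > 1 && $1 ~ /:/ { print $1 " on " $6 }'
}

# nfs_present MOUNTPOINT: read "df -P" style output on stdin; succeed
# when MOUNTPOINT is listed with a host:/path device.
nfs_present() {
    awk -v mp="$1" 'NR > 1 && $1 ~ /:/ && $6 == mp { found = 1 } END { exit !found }'
}

# Typical use:
#   df -P | nfs_mounts
#   df -P | nfs_present /home || echo "***** CRITICAL ***** /home is not mounted"
```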
A final refinement to this script is a method of separating the displays
of each command. One method of doing this is to print a line across
the screen after the results of each test. Here is a simple
function that you can put in the script and call to print a separator
line:
p_line ()
{
echo "\n------------------------------------------------------------------------------------------------------\n"
return
}
By placing this function at the beginning of the script, it can be called
anytime you need a line by using the statement p_line.
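How echo treats \n differs between shells: some print a newline, others print the two characters literally. If the script has to run under more than one shell, printf is the safer choice. A sketch of the same function written that way, with an arbitrary line width of 72 characters:

```shell
# p_line: print a blank line, a row of dashes, and another blank line.
# printf interprets \n in its format string in every POSIX shell.
p_line() {
    printf '\n%s\n\n' "------------------------------------------------------------------------"
}

p_line
```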
Summary
As you can see, it is not very difficult to write a shell script that goes
out and looks for certain conditions on your Unix system that are important.
It will be very easy for you to customize these ideas to check for exactly
what you want on your systems. Often I will put this script on
each system that I am responsible for and run it remotely using a remsh
or rsh command. Then from one location I can get an overview of all
of my systems.
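A minimal sketch of such a remote run is shown below. The install path /usr/local/bin/sareport and the host names are assumptions for illustration, run_reports is a hypothetical helper name, and REMOTE_CMD should be set to whichever of remsh or rsh your system provides.

```shell
# Command used to reach the other machines: remsh on HP-UX, rsh elsewhere.
REMOTE_CMD=remsh
# Assumed location of the report script on every host (hypothetical path).
REPORT=/usr/local/bin/sareport

# run_reports HOST...: run the report on each named host in turn.
run_reports() {
    for host in "$@"; do
        printf '===== %s =====\n' "$host"
        "$REMOTE_CMD" "$host" "$REPORT" ||
            printf 'Could not run the report on %s\n' "$host"
    done
}

# Typical use (placeholder host names): run_reports hosta hostb hostc
```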
A further refinement to this process is to use Perl rather than shell script.
Perl gives me a richer programming environment, so
I can easily create more powerful scripts. Using either Perl or shell
scripts I can also employ CGI programming techniques so that these results
can be available via a web interface. Please contact me directly or
check my web page at www.networktechnologist.com if you are interested in
finding out more on how to use Perl or CGI programming.
Copyright Bill Etter 2002 all rights reserved
Last Revised August 26, 2002
For more information, contact billetter@NetworkTechnologist.com
http://www.networktechnologist.com/sysadmin/adminshell.htm