Using Shell Scripts for Unix System Administration



Reader, please note: As far as I know, the information in this document is accurate.  If you find any errors, or have comments, additions, or questions, please feel free to contact me at billetter@networktechnologist.com

Introduction

Part of the job of a Unix system administrator is to constantly look for problems that are developing or have already occurred.  Shell scripts are a perfect way to automate this process.  One of the first things that I do when I work on a new system is develop a shell script that will go out and look for items that are critical.  Critical items include situations that would cause any Unix administrator concern, such as a file system that is full or an NFS mount that has been dropped.  But there are also items that are unique to each system.  For example, a production system may require certain processes or certain critical databases to be running.  By using a shell programming language it is possible to quickly write a simple script that will check for critical situations and give you a heads up.  Very often, by running these scripts when I arrive at work first thing in the morning, I can identify problems before they become serious.

What Situations do we check for?

Every Unix administrator will have their own punch list of items that they consider critical.  My purpose here is to give you some ideas, so this list is not meant to be comprehensive - just a sample of some items which may be good to check!


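The items checked in this article (uptime, logged-in users, CPU utilization, nearly full and full file systems, running databases, and NFS mounts) are a pretty good start, and the techniques used to script them can easily be reused for the additional tasks that you will inevitably want to add.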

How to write the script

For this task I have used the POSIX shell.  This shell works on many different Unix systems.  As far as I know these commands will also work with the Korn and Bourne shells.  The commands are not very complicated.

The first item that needs to be done is to create a header for the file.  This is critical so that the script can be maintained.  I usually use one similar to the following:

#!/usr/bin/sh
#
#######################################################################
#
#      sareport
#
#      The purpose of this script is to check for critical conditions
#      that require system administrator intervention.
#
########################################################################
#
#    Date:  8/26/02
#
#    Author:  Bill Etter
#
########################################################################
#
#     Modification History:
#
#
#########################################################################


With a script header such as this, there is no question about what the script does, when it was written, by whom, and what its modification history is.  Of course the very first line of the script is the familiar "shebang" that indicates which shell to use and its location.

After the header has been written, it is proper to initialize some variables.  I use something like this:

HOST=`hostname`
DAY=`date '+%b %d'`
TIME=`date '+%H:%M'`

This gives me a variable $HOST that holds the result of the hostname command.  The ` (back quote) characters tell the shell to execute the enclosed command, here hostname, and substitute its output.  The variables $DAY and $TIME both execute the date command.  However, the addition of '+%b %d' gives me just the abbreviated month name and day of the month, not the entire date string.  '+%H:%M' gives me just the hour and minute parts of the date string.  I will use these variables later in the script.
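
The same initialization can also be written with the $( ) form of command substitution, which the POSIX and Korn shells accept (the original Bourne shell does not) and which nests more easily than back quotes.  This is a minimal sketch, assuming such a shell.  It uses uname -n in place of hostname only because uname is specified by POSIX; that is my substitution and not part of the original script:

```shell
# $(command) runs the command and substitutes its output, like back quotes.
HOST=$(uname -n)          # assumption: uname -n stands in for hostname
DAY=$(date '+%b %d')      # abbreviated month name and day of the month
TIME=$(date '+%H:%M')     # hour and minute on a 24-hour clock

echo "Report for $HOST on $DAY at $TIME"
```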

Now we can check to see how long this particular machine has been up.  This is done with the following commands:

UP=`uptime | cut -c13-26`
echo "\nAs of $DAY at $TIME, $HOST has been up for $UP hours."

uptime is the Unix command that tells us how long the system has been up and running.  The uptime command prints additional information, so by piping the results of uptime to cut, it is possible to remove the extraneous information and store the result in the variable $UP.  The character positions given to cut match the uptime output on my system and may need to be adjusted on yours.  The echo command tells us the date, time, hostname and how long the system has been up.
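
Because fixed character positions shift whenever the format of the uptime output changes, one alternative is to cut on the surrounding words instead.  This is a sketch, not the original method: the function name uptime_span is hypothetical, and it assumes the uptime line has the common layout in which the elapsed time sits between the word "up" and the user count:

```shell
# uptime_span: read one line of uptime output on stdin and print only
# the elapsed-time part.  ASSUMPTION: the line looks like
#   " 10:15:01 up 23 days,  4:12,  3 users,  load average: ..."
uptime_span() {
    sed -e 's/^.* up  *//' \
        -e 's/,  *[0-9][0-9]* users*,.*$//'
}

# Demonstration on a sample line rather than the live command.
printf '%s\n' ' 10:15:01 up 23 days,  4:12,  3 users,  load average: 0.00, 0.01, 0.05' | uptime_span
```

In the script this could be used as UP=`uptime | uptime_span`, and the word "hours" in the echo line could then be dropped, since the result already carries its own units.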

The next task is to determine the number of users on the system.  This is done by using the command who:

USERS=`who -u|wc -l`   
echo "\nCurrently $USERS users are using the system."

By piping the results of the who -u command to wc -l (word count with the line-count option) we can count the number of lines reported.  Since each login session is listed on its own line, the result of this command is the number of logged-in sessions, which serves as the number of users.  Then we simply echo the results to the screen.
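
Note that who prints one line per login session, so a user with several terminal windows is counted several times.  If you would rather count distinct login names, a sketch along these lines may help.  The function name count_unique_users is my own, and it assumes the login name is the first field of each line, as it is in standard who output:

```shell
# count_unique_users: read who output on stdin and print the number of
# distinct login names (first field of each line).
count_unique_users() {
    # tr strips the padding that some wc implementations add
    awk '{ print $1 }' | sort -u | wc -l | tr -d ' '
}

# Sample input (made up): bill is logged in twice, root once.
printf '%s\n' \
    'bill   pts/0   Aug 26 07:45' \
    'bill   pts/1   Aug 26 08:10' \
    'root   console Aug 26 06:00' | count_unique_users
```

In the script the call would be USERS=`who | count_unique_users`.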

Now we need to check CPU Utilization.  This is done by using the sar (system activity reporter) command:

echo "Checking CPU utilization . . . \n"
sar -M 1

This command may need to be adjusted for your system.  On an HP-UX system this command will list the CPU statistics for each CPU in the system.  The 1 is the sampling interval in seconds; because no sample count follows it, sar on HP-UX takes a single sample.  Normally sar is used to take periodic samples so that you can see trends develop.  Rather than storing the results in a variable, it is easier just to send the results of sar directly to the screen.
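
If you would rather raise a warning than read the sar table by eye, the idle percentage can be turned into a single number.  The following is only a sketch.  The function name cpu_busy is hypothetical, and it assumes that the last non-blank line of the output is the one of interest and that its last field is the idle percentage.  That matches the sar -u layouts I have seen, but it should be checked against the output on your system:

```shell
# cpu_busy: read sar-style output on stdin and print the busy percentage,
# computed as 100 minus the last field of the last non-blank line.
# ASSUMPTION: that field is %idle; verify against your own sar output.
cpu_busy() {
    awk 'NF > 0 { idle = $NF } END { printf "%d\n", 100 - idle }'
}

# Sample data in the shape of CPU utilization output (values are made up).
printf '%s\n' \
    '10:15:01    %usr    %sys    %wio   %idle' \
    '10:15:02       3       2       1      94' | cpu_busy
```

The result can be stored with back quotes and tested with -gt in an if statement, in the same way as the file system checks in this script.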

Now we are concerned about file systems that are full.  With file systems there are two situations.  First, the file system may be full - that means it is at 100% utilization.  But we are also concerned about file systems that are nearly full.  For example, you may decide that any file system that is 90% or more full is worth knowing about.  Here is how this is done:

FULL=`bdf | grep '9[0-9]%' | wc -l`
if [ "$FULL" -gt 0 ]
   then
      echo "Checking for filesystems that are too large . . ."
      echo "\n ***** WARNING ***** $FULL filesystems are at 90% or more capacity."
   fi

This part of the script uses a simple conditional if statement.  First the bdf command is run.  Since this script was taken from an HP-UX system, the bdf command was used.  The bdf command is not available on all systems; Red Hat Linux, for example, does not have it.  On those systems the df command can be used instead.  The results of bdf are piped to grep, which keeps only the lines that contain a percentage from 90% to 99%.  The results are then piped to wc -l, so that the number of filesystems in that range can be counted.  The if statement will only print out a warning if the value of $FULL is greater than 0.  If no filesystems are 90% or more full then the if statement will not print out anything.
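
On systems without bdf, df -P produces a similar table in the layout specified by POSIX, and comparing the percentage as a number avoids writing a pattern for each range.  A sketch under those assumptions follows.  The function name list_full and the sample device names are illustrative only, and mount points that contain spaces are not handled:

```shell
# list_full LIMIT: read "df -P" style output on stdin and print every
# mount point whose capacity is at or above LIMIT percent.
# Expected columns: Filesystem 1024-blocks Used Available Capacity Mounted-on
list_full() {
    awk -v limit="$1" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)                 # "95%" becomes "95"
        if (pct + 0 >= limit + 0) print $6 " (" $5 ")"
    }'
}

# Sample table (made up) instead of the live df -P command.
printf '%s\n' \
    'Filesystem      1024-blocks  Used Available Capacity Mounted on' \
    '/dev/vg00/lvol3        1000   950        50      95% /' \
    '/dev/vg00/lvol4        1000   100       900      10% /home' \
    '/dev/vg00/lvol5        1000  1000         0     100% /var' | list_full 90
```

With this function, FULL=`df -P | list_full 90 | wc -l` gives a count of file systems at 90% or more, and list_full 100 picks out only the ones that are completely full.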


Now we need to look for filesystems that are 100% full.  This would correctly be identified as a critical situation, not just a warning.  The logic is similar to the previous check.  Here is what it looks like:

FULL=`bdf | grep 100% | wc -l`
if [ $FULL -gt 0 ]
   then
      echo "Checking for filesystems that are full . . . "
      echo "\n ***** CRITICAL ***** These file systems are 100% full: \n"
      bdf | grep 100% | cut -c52-132
   fi

Rather than just count the filesystems that are 100% full, we actually print them out by executing the bdf command again and sending its output directly to the screen.  However, the bdf output has some information that we don't need to display, so the cut command is again used to remove extraneous information.

Now let's look for a particular process running on the computer.  Very often in my work I find myself in an environment that uses Oracle databases.  When Oracle is running under Unix, there will be a process called pmon running for each database.  So, if I wanted to know how many Oracle databases were up and running, here is how I would do it:

DB=`ps -ef | grep pmon | grep -v grep | wc -l`
echo "Checking to see if any databases are running . . . "
if [ $DB -gt 0 ]
   then
      echo "\n$DB databases are currently running on $HOST.\n"
      ps -ef | grep -v grep | grep pmon | cut -c58-132
   else
      echo "\n***** CRITICAL ***** No Databases are running on $HOST"
   fi

To check for processes that are running, we use the ps -ef command.  The results are piped to grep to keep only the lines that contain pmon.  The second grep command uses the -v option to remove the grep process itself from the list.  If this were not done there would be an extra process, since grep would count itself.  Then the results are piped to wc -l so that a count of the databases can be made.  Very often Oracle environments will be running several databases, each of which must be running.

If the count is greater than zero, that means that some Oracle databases are running.  Next we run the ps -ef command again.  This time the results are printed on the screen, with cut removing the leading columns.  The result is that not only do we know how many databases are running, but we can see a list of each database that is running.  This if statement uses the else option, so that if no databases are running the appropriate message is printed out.
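
A common shortcut for the grep -v grep step is to write the pattern as [p]mon.  The bracket expression still matches the text pmon, but the grep command's own entry in the process list contains the literal characters [p]mon, which the pattern does not match.  The sketch below uses that idea.  The function names are hypothetical, and list_sids assumes the usual Oracle process naming of ora_pmon_ followed by the database name:

```shell
# count_pmon: read ps -ef output on stdin, print the number of pmon lines.
# '[p]mon' matches "pmon" but not the text "[p]mon" in grep's own entry.
count_pmon() {
    grep -c '[p]mon'
}

# list_sids: print the database name from each ora_pmon_NAME process.
# ASSUMPTION: the process name is the last thing on each ps line.
list_sids() {
    sed -n 's/^.*ora_pmon_//p'
}

# Sample process list (made up) instead of the live ps -ef command.
SAMPLE_PS='oracle  1234     1  0 07:00:00 ?      0:01 ora_pmon_BILL
oracle  1250     1  0 07:00:05 ?      0:01 ora_pmon_TEST
bill    1300  1200  0 08:15:00 pts/0  0:00 grep [p]mon'

printf '%s\n' "$SAMPLE_PS" | count_pmon
printf '%s\n' "$SAMPLE_PS" | list_sids
```

In the script, DB=`ps -ef | grep -c '[p]mon'` would replace the three-command pipeline.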

Now, in some environments a specific process needs to be running.  The above logic can easily be used to test whether a specific process is running.  Here is how to test whether an Oracle database called BILL is running:

if [ $HOST = linux ]
   then
      echo "Checking to see if BILL is running . . . "
      BILLUP=`ps -ef | grep -v grep | grep pmon | grep BILL | wc -l`
      if [ $BILLUP -eq 0 ]
         then
            echo "\n***** CRITICAL ***** Database BILL is DOWN on $HOST ! ! ! !"
         else
            echo "\nDatabase BILL is RUNNING on $HOST."
         fi
   fi

This series of commands uses a conditional to verify the hostname.  Often a production environment will have certain databases running on each machine.  This conditional will allow you to set up the logic to check for the proper databases on the proper machine.  Now we further refine the grep logic to look not only for pmon, but for pmon with the name of the database that we want to see running, in this case BILL.  The next if statement will trigger an alarm message if the database is not running, and a normal status message if it is running.
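
When several machines each have their own list of databases, a case statement keeps the mapping in one place instead of one if block per host.  This is a sketch of that idea, not the author's script.  The host name hpux1 and the databases SALES and HR are invented for illustration, and the process naming ora_pmon_NAME is assumed:

```shell
# required_dbs HOST: print the databases expected on HOST.
# The hosts and database names here are hypothetical examples.
required_dbs() {
    case "$1" in
        linux) echo "BILL" ;;
        hpux1) echo "SALES HR" ;;
        *)     : ;;                      # no databases expected
    esac
}

# report_dbs HOST: read ps -ef output on stdin and report on each
# database that HOST is expected to run.
report_dbs() {
    ps_out=$(cat)
    for db in $(required_dbs "$1"); do
        # The trailing $ anchors the name, so BILL does not match BILL2.
        if printf '%s\n' "$ps_out" | grep "ora_pmon_${db}\$" >/dev/null; then
            echo "Database $db is RUNNING on $1."
        else
            echo "***** CRITICAL ***** Database $db is DOWN on $1 ! ! ! !"
        fi
    done
}

# Sample process list (made up): only SALES is up on hpux1.
printf '%s\n' 'oracle  2001  1  0 07:00:00 ?  0:01 ora_pmon_SALES' | report_dbs hpux1
```

In the script the call would be ps -ef | report_dbs "$HOST".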

Our final task is to check for NFS mounts.  NFS is a useful method of accessing things on other machines.  It allows Unix systems to mount filesystems that reside on other Unix machines, on Linux machines and even on other systems like Novell and Windows.  However, NFS is not a perfect system, and it often causes problems.  One problem that we need to check for is whether the mount is present or not.  This can be done with the bdf command.

NFS=`bdf | grep : | wc -l`
echo "Checking for NFS mounts . . ."
if [ $NFS -gt 0 ]
   then
      echo "\nThere are $NFS NFS mounts on $HOST: \n"
      bdf | grep : | cut -d " " -f1
   else
      echo "\n****** CRITICAL ***** No NFS mounts currently exist on $HOST ! ! !"
   fi

This time we look for the : character in the listing.  This character appears on each line that describes an NFS mount, because the remote file system is shown as host:path.  Each NFS mount is counted by piping every line that contains : to wc -l.  The if statement then checks to see if the number of NFS mounts is greater than 0.  If not, then a critical message is printed.  Otherwise a count of the mounts is displayed along with a listing of each NFS mount.
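
A count greater than zero only shows that some NFS mount exists.  If particular mounts matter, each one can be tested by name.  The sketch below assumes df -P style output with the mount point in the last column.  The function name missing_mounts, the server name and the mount points are all made up for illustration:

```shell
# missing_mounts "MP1 MP2 ...": read df -P style output on stdin and
# print each expected mount point that is not present as an NFS mount
# (a line whose first field contains a colon, as in host:path).
missing_mounts() {
    df_out=$(cat)
    for mp in $1; do
        printf '%s\n' "$df_out" |
            awk -v mp="$mp" '$1 ~ /:/ && $NF == mp { found = 1 }
                             END { exit !found }' || echo "$mp"
    done
}

# Sample table (made up): /home is mounted over NFS, /data is not.
printf '%s\n' \
    'Filesystem           1024-blocks  Used Available Capacity Mounted on' \
    '/dev/vg00/lvol3             1000   500       500      50% /' \
    'server1:/export/home        2000  1000      1000      50% /home' | missing_mounts "/home /data"
```

Any name printed by the function can then be reported with the same CRITICAL message format used elsewhere in the script.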

A final refinement to this script is a method of separating the output of each command.  One way of doing this is to print a line across the screen after the results of each test.  Here is a simple shell function that you can put in the script and call to print a separator line:


p_line ()
   {
   echo "\n------------------------------------------------------------------------------------------------------\n"
   return
   }

Place this function at the beginning of the script, and it can be called any time you need a line by using the statement p_line.
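
One caution: whether echo turns \n into a newline depends on the shell.  The HP-UX POSIX shell does, but some other shells, bash among them by default, print the two characters literally.  The printf command handles \n in its format string consistently, so a more portable version of the function might look like this sketch (the 72-character width is an arbitrary choice of mine):

```shell
# p_line: print a blank line, a row of 72 dashes, and another blank line.
p_line() {
    printf '\n'
    # %72s pads an empty string to 72 spaces; tr turns them into dashes.
    printf '%72s\n' '' | tr ' ' '-'
    printf '\n'
}

p_line
```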

Summary

As you can see, it is not very difficult to write a shell script that goes out and looks for the conditions on your Unix system that are important.  It will be very easy for you to customize these ideas to check for exactly what you want on your systems.  Often I will put this script on each system that I am responsible for and run the copies remotely using a remsh or rsh command.  Then from one location I can get an overview of all of my systems.
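
Running the report on every machine can itself be scripted with a loop.  This is only a sketch of one way to do it.  The function name run_everywhere, the install path /usr/local/bin/sareport and the variables RSH and REPORT are assumptions of mine, and it presumes that remote logins without a password prompt are already set up:

```shell
# RSH is the remote shell command: remsh on HP-UX, rsh or ssh elsewhere.
RSH=${RSH:-ssh}
# REPORT is where the script is installed on each machine (hypothetical path).
REPORT=${REPORT:-/usr/local/bin/sareport}

# run_everywhere "HOST1 HOST2 ...": run the report on each host in turn.
run_everywhere() {
    for h in $1; do
        echo "===== $h ====="
        # A non-zero status from the remote shell command is reported
        # as a failure for that host.
        $RSH "$h" "$REPORT" || echo "***** CRITICAL ***** report failed on $h"
    done
}
```

A call such as run_everywhere "host1 host2" (with your own host names) then prints one report after another from a single location.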

A further refinement to this process is to use Perl rather than shell script.  Perl gives me a richer programming environment, so I can easily create more powerful scripts.  Using either Perl or shell scripts I can also employ CGI programming techniques so that these results can be available via a web interface.  Please contact me directly or check my web page at www.networktechnologist.com if you are interested in finding out more on how to use Perl or CGI programming.



Copyright Bill Etter 2002 all rights reserved
Last Revised August 26, 2002
For more information, contact billetter@NetworkTechnologist.com
http://www.networktechnologist.com/sysadmin/adminshell.htm