Tuesday, April 2, 2013

Concepts of Nagios


  • Purpose. Nagios monitors whether your servers and network devices are up and working properly. If something is not working, it can notify you (the administrator) so that you can fix the problem as soon as possible.
  • How it works (basically). Nagios runs on a central server so that you can view the status of everything in one web interface. It will check each host (server or network device) by pinging its hostname or IP, say, every 5 minutes ("check interval"). If a host is down, it will notify the administrator ("contact") or the administrators (the "contact group").
  • Notify on state change. Once a host is detected as down, the checking will continue as usual to detect if it is up again.
    • If it is still down, usually no further notification is sent (to avoid bombarding the admins). However, to keep the problem from being forgotten, you can configure Nagios to repeat the notification at a certain interval ("notification interval").
    • If it comes back up, another notification will be sent so that the admins know that it has recovered.
    • Therefore, a notification is sent when the state changes, not every time a check finds the host down.
  • Soft and hard state. Sometimes a host is not down, but just very busy, or the network is very slow, so some pings may time out. Therefore, when a ping check first fails, the state only changes to a "soft" down, meaning that the host is possibly down. No notification is sent yet. Nagios will re-check it ("retry") until the total number of checks reaches the "max check attempts". If it is still down in all of these checks, the state is changed to a "hard" down and a notification is sent. Recovery is not retried in the same way: when a host in a hard down state passes a check, it goes straight to a hard up and a recovery notification is sent, while a host that recovers from a soft down simply returns to up without any notification. In summary, a notification is sent only when the state changes into a different hard state.
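The soft/hard logic can be sketched as a small state machine. This is only an illustrative model, not Nagios code: the class name `HostTracker` and its fields are made up for the example, and the recovery handling assumes what the Nagios documentation on state types describes (a recovery from a hard down counts as hard right away).

```python
from dataclasses import dataclass, field


@dataclass
class HostTracker:
    """Tracks soft/hard state for one host (simplified, hypothetical model)."""
    max_check_attempts: int = 4
    state: str = "UP"          # result of the latest check
    state_type: str = "HARD"   # "SOFT" or "HARD"
    attempt: int = 1
    notifications: list = field(default_factory=list)

    def process(self, result: str) -> None:
        """Feed one check result ("UP" or "DOWN") into the tracker."""
        if result == "DOWN":
            if self.state == "DOWN" and self.state_type == "HARD":
                return  # already hard down: no state change, no new notification
            # The first failure counts as attempt 1; each retry adds one.
            self.attempt = 1 if self.state == "UP" else self.attempt + 1
            self.state = "DOWN"
            if self.attempt >= self.max_check_attempts:
                self.state_type = "HARD"
                self.notifications.append("DOWN")
            else:
                self.state_type = "SOFT"
        else:
            # Assumption: only a recovery from a hard down is notified.
            if self.state == "DOWN" and self.state_type == "HARD":
                self.notifications.append("RECOVERY")
            self.state, self.state_type, self.attempt = "UP", "HARD", 1


host = HostTracker(max_check_attempts=4)
for result in ["DOWN", "DOWN", "DOWN", "DOWN", "DOWN", "UP"]:
    host.process(result)
print(host.notifications)  # ['DOWN', 'RECOVERY']
```

A host that fails only two checks and then answers again never leaves the soft state, so the list of notifications stays empty.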
  • Unreachable state. In addition to up or down, there is a third possible state: unreachable. It happens when the host itself may be running fine, but a router between the Nagios server and the host is down. How can Nagios tell which case it is when ping fails? You create a host record for the router in Nagios and declare the router as the "parent" of that server host (that is, it is on the path from the Nagios server to that server host). This way, when ping to that server host fails, Nagios will check whether the router is up or down. If it is down, it will mark the server host as unreachable instead of down. Again, a single result only leads to a soft state.
    • Will the admins still get a notification if it is changed to a hard unreachable state? By default, yes, but this is configurable (see the notification options below).
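The down/unreachable decision can be sketched as follows. The function name and the data layout are assumptions made for this example; the rule it encodes is that a failed host counts as unreachable only when it has parents and none of them is up.

```python
def classify_failed_host(host, parents, is_up):
    """Return "DOWN" or "UNREACHABLE" for a host whose own check has failed.

    parents: dict mapping a host name to the list of its parent host names.
    is_up:   callable that reports whether a given host is currently up.
    """
    host_parents = parents.get(host, [])
    if not host_parents:
        return "DOWN"          # nothing in between that could be blamed
    if any(is_up(parent) for parent in host_parents):
        return "DOWN"          # a path to the host exists, so the host itself failed
    return "UNREACHABLE"       # every parent is down: the host cannot be judged


parents = {"host1": ["router1"]}
up_hosts = set()               # router1 is down in this example
print(classify_failed_host("host1", parents, lambda h: h in up_hosts))  # UNREACHABLE
```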
  • Command. In Nagios you can configure the actual command used for ping or for sending notifications. To allow you to refer to the command by a simple name without worrying about the actual path, each command is assigned a name for later use. Many such commands also take information from "macros" such as $HOSTADDRESS$, which Nagios replaces with the actual value before running the command. Below is the command typically used for ping (check_ping needs a warning and a critical threshold, each given as round-trip time in milliseconds and packet loss percentage):
    define command {
      command_name check-host-alive
      command_line /usr/lib/nagios/plugins/check_ping -H $HOSTADDRESS$ -w 5000,100% -c 5000,100%
    }
    
    Below is a simplified example of a command used for sending email notification (again, Nagios will provide the information such as email, host name, host state through the macros):
    define command {
      command_name notify-by-email
      command_line /bin/echo $HOSTNAME$ in $HOSTSTATE$ | /usr/bin/mail $CONTACTEMAIL$
    }
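To make the macro idea concrete, here is a minimal sketch of how a command line could be expanded before it is run. It is not how Nagios is implemented; `expand_macros` is a hypothetical helper, and leaving unknown macros untouched is a choice made for this example.

```python
import re


def expand_macros(command_line, macros):
    """Replace each $NAME$ in command_line with macros["NAME"].

    Unknown macros are left as they are (an assumption of this sketch).
    """
    def substitute(match):
        return macros.get(match.group(1), match.group(0))

    return re.sub(r"\$([A-Z0-9_]+)\$", substitute, command_line)


line = "/bin/echo $HOSTNAME$ in $HOSTSTATE$ | /usr/bin/mail $CONTACTEMAIL$"
values = {"HOSTNAME": "host1", "HOSTSTATE": "DOWN", "CONTACTEMAIL": "kent@foo.com"}
print(expand_macros(line, values))
# /bin/echo host1 in DOWN | /usr/bin/mail kent@foo.com
```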
    
    • Contact. A contact is simply a configuration item for sending a text message (through a command line) to a person. It specifies the email address, pager or mobile number, etc., so that the command knows where to send the message.
      • Notification options. A contact can also specify what kinds of notifications it will accept: down (from up to down), recovery (from down to up), unreachable (from up or down to unreachable), etc.
      • Notification period. A contact can also specify at what time of the day and on which days notifications can be sent. Usually it should be 24x7 for critical hosts, but if the host is unimportant and this contact receives SMS, you may configure it to send only during business hours. A notification that falls outside of the period is not necessarily lost though: if the problem still exists, Nagios can send it at the start of the next time slot (e.g., next morning).
      • Notification settings for a host. The notification options and notification period can also be specified for a host. Between them, the two levels let you express that some hosts are more important than others and that some admins are available only for certain issues in certain time periods.
    • Below is an example of a contact (using the notify-by-email command above):
    define contact {
      contact_name kent
      email kent@foo.com
      host_notification_options d,r,u
      host_notification_period 24x7
      host_notification_commands notify-by-email
    }
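The way a contact's notification options and notification period filter a notification can be sketched like this. The helper names and the dictionary layout are assumptions for illustration only; the option letters (d, r, u) and the time range format follow the examples in this post.

```python
from datetime import datetime

DAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]
EVENT_TO_OPTION = {"DOWN": "d", "RECOVERY": "r", "UNREACHABLE": "u"}


def to_minutes(hhmm):
    """Convert "HH:MM" (including "24:00") to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def in_period(period, when):
    """period maps a weekday name to a list of "HH:MM-HH:MM" ranges."""
    now = when.hour * 60 + when.minute
    for time_range in period.get(DAYS[when.weekday()], []):
        start, end = time_range.split("-")
        if to_minutes(start) <= now < to_minutes(end):
            return True
    return False


def contact_accepts(contact, event, when, periods):
    """True if the contact wants this kind of host notification right now."""
    options = [o.strip() for o in contact["host_notification_options"].split(",")]
    period = periods[contact["host_notification_period"]]
    return EVENT_TO_OPTION[event] in options and in_period(period, when)


periods = {
    "24x7": {day: ["00:00-24:00"] for day in DAYS},
    "workhours": {day: ["09:00-17:00"] for day in DAYS[:5]},  # hypothetical period
}
kent = {"host_notification_options": "d,r,u", "host_notification_period": "24x7"}
sms = {"host_notification_options": "d", "host_notification_period": "workhours"}

night = datetime(2013, 4, 2, 3, 0)  # a Tuesday, 03:00
print(contact_accepts(kent, "DOWN", night, periods))  # True
print(contact_accepts(sms, "DOWN", night, periods))   # False
```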
    
    • Contact group. A contact group is just a set of contacts (e.g., for all the administrators). You define the individual contacts first and then add them to a group. This way, you can configure the host checking to notify a contact group instead of individuals. Below is an example:
    define contactgroup {
      contactgroup_name admins
      members kent, paul
    }
    
    • Check period. Instead of checking at all times, Nagios requires that for each host you specify a period in which checking is performed. You can set it to 24x7 or to your business hours, depending on the expected up-time window of the host.
    • Host configuration example. The configuration of a host in Nagios includes the address (hostname or IP) and the various checking and notification settings mentioned above. Below is an example (using the check-host-alive command above):
    define host {
      host_name host1
      address 192.168.1.10
      parents router1
      check_interval 5
      check_period 24x7
      check_command check-host-alive
      max_check_attempts 4
      contact_groups admins
    }
    
    define host {
      host_name router1
      ...
    }
    
    • Time period. A time period used above, such as 24x7, is defined as follows:
    define timeperiod {
      timeperiod_name 24x7
      sunday 00:00-24:00
      monday 00:00-24:00
      tuesday 00:00-24:00
      ...
    }
    
    • Object and object inheritance. Each item above, such as a host, contact or command, is called an "object" in Nagios. To avoid repeating all those options in every object, Nagios allows you to define an object template that other objects can inherit from. For example, a template called generic-host comes with Nagios and is defined as below to set sensible values for many settings. The "name" is the name of the template, so that you can refer to it by name, and "register 0" tells Nagios not to treat it as a real host:
    define host {
      name generic-host
      register 0
      max_check_attempts 10
      check_command check-host-alive
      contact_groups admins
      ...
    }
    
    • To use ("inherit") the template, refer to it with the "use" directive. Settings given in the host itself take precedence over those in the template:
    define host {
      host_name host1
      address 192.168.1.10
      parents router1
      use generic-host
      check_period 24x7
    }
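Template inheritance can be pictured as merging dictionaries, with the object's own settings winning over the template's. The sketch below is a simplification under stated assumptions: `resolve` is a hypothetical helper, each object uses at most one template, and "name" and "register" are treated as belonging to the template only.

```python
def resolve(obj, templates):
    """Return the effective settings of obj after applying its template chain."""
    merged = {}
    parent = obj.get("use")
    if parent is not None:
        # Start from the template (which may itself use another template).
        merged.update(resolve(templates[parent], templates))
    # Local settings override inherited ones; bookkeeping keys are dropped.
    merged.update({k: v for k, v in obj.items() if k not in ("use", "name", "register")})
    return merged


templates = {
    "generic-host": {
        "name": "generic-host",
        "register": "0",
        "max_check_attempts": "10",
        "check_command": "check-host-alive",
        "contact_groups": "admins",
    }
}
host1 = {
    "host_name": "host1",
    "address": "192.168.1.10",
    "parents": "router1",
    "use": "generic-host",
    "check_period": "24x7",
}
effective = resolve(host1, templates)
print(effective["max_check_attempts"], effective["check_period"])  # 10 24x7
```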
    
    • Service monitoring. Just monitoring whether a host is up or down isn't enough. You also need to monitor whether services such as a website or email are running properly. This is configured with a service object. A service object specifies the command used for checking (so the actual meaning of "up" is for you to decide) and one or more hosts to provide the IP addresses. Nagios will perform the check on each such host independently and pass the IP to the command via the $HOSTADDRESS$ macro. This is useful if, say, you have Apache running on dozens of hosts, as you only need one service object. So, the service object is not really a single service, but just a way to apply a type of service check across multiple hosts. Other than this, a service check is very similar to a host check and supports similar settings such as max check attempts, check interval, contact, notification options, etc. Below is a sample service object (the check-http command will simply check if it can successfully retrieve a web page from the IP):
      define service {
        service_description website
        host_name host1, host2, host3
        max_check_attempts 4
        check_interval 3
        check_period 24x7
        check_command check-http
        notification_options c,w,r
        notification_period 24x7
        contact_groups admins
        ...
      }
    • Service state. Instead of up, down and unreachable, a service (running on a particular host) can be OK, critical (an obvious error such as no response, connection refused or a 404 error), warning (e.g., a slow response; the actual meaning is defined by the command) or unknown (usually meaning that the check command itself has failed). The corresponding notification options are r (recovery), c (critical), w (warning) and u (unknown).
    • Service notification settings in a contact. Probably because the notification options for service checking are different from those for host checking, Nagios requires that in a contact object you specify the notification settings separately for host notifications and for service notifications:
    define contact {
      contact_name kent
      email kent@foo.com
      host_notification_options d,r,u
      host_notification_period 24x7
      host_notification_commands notify-by-email
      service_notification_options c,w,r
      service_notification_period 24x7
      service_notification_commands notify-by-email
    }
    • Passing explicit arguments to a command. Just checking whether a web page can be retrieved may not be good enough. A better check is to verify that the web page contains the right content (e.g., a particular string). To do that, you can define your own command and pass the expected string (e.g., abc) to it. The argument is appended to the command name after an exclamation mark and is referred to as $ARG1$ in the command definition:
      define service {
        service_description website
        host_name host1, host2, host3
        max_check_attempts 4
        check_interval 3
        check_period 24x7
        check_command check-http-string!abc
        ...
      }
      
      define command {
        command_name check-http-string
        command_line /usr/lib/nagios/plugins/check_http -H $HOSTNAME$ -I $HOSTADDRESS$ -s $ARG1$
      }
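The following sketch shows how a check_command value such as check-http-string!abc could be turned into the final command line. It only illustrates the idea: `build_command_line` is a made-up helper, and it assumes that further arguments separated by "!" map to $ARG2$, $ARG3$ and so on.

```python
def build_command_line(check_command, commands, macros):
    """Split "name!arg1!arg2" and fill $ARGn$ and other macros into the command line.

    commands: dict mapping a command_name to its command_line.
    macros:   dict of host macros such as HOSTNAME and HOSTADDRESS.
    """
    name, *args = check_command.split("!")
    values = dict(macros)
    for position, arg in enumerate(args, start=1):
        values[f"ARG{position}"] = arg
    line = commands[name]
    for key, value in values.items():
        line = line.replace(f"${key}$", value)
    return line


commands = {
    "check-http-string":
        "/usr/lib/nagios/plugins/check_http -H $HOSTNAME$ -I $HOSTADDRESS$ -s $ARG1$",
}
macros = {"HOSTNAME": "host1", "HOSTADDRESS": "192.168.1.10"}
print(build_command_line("check-http-string!abc", commands, macros))
# /usr/lib/nagios/plugins/check_http -H host1 -I 192.168.1.10 -s abc
```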
      
    • Finding out the plugins. Many commands (both the defined command objects and the actual programs) come with Nagios. They are called plugins, and many third-party ones can be installed. To find out what is available and how to use each one, run:
    # list the command object definitions provided with the plugins
    $ ls /etc/nagios-plugins/config
      ...
    # learn about the options supported by each program
    $ /usr/lib/nagios/plugins/check_http --help
      ...
    
    • Inheritance for service. Just like a host object, you can use object inheritance for a service object too. Nagios comes with a service template called "generic-service" defining many defaults for you to use:
      define service {
        service_description website
        use generic-service
        check_command check-http-string!abc
        ...
      }
    • Host group. Instead of listing individual hosts in a service object, it may be better to define a "host group" named web-servers containing such hosts and let the service object refer to the host group. This is particularly useful if you need to perform two kinds of service checks on the same group (e.g., checking web and SSH).
      define service {
        service_description website
        hostgroup_name web-servers
        ...
      }
      
      define hostgroup {
        hostgroup_name web-servers
        members host1, host2, host3
      }
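One way to picture what a service object with several hosts or a host group means is to expand it into one check per host. The sketch below does just that; `expand_service` and the dictionary layout are assumptions for illustration, not part of Nagios.

```python
def split_list(value):
    """Split a comma-separated directive value into clean names."""
    return [item.strip() for item in value.split(",") if item.strip()]


def expand_service(service, hostgroups):
    """Return one (host, service_description) pair per host the service applies to."""
    hosts = split_list(service.get("host_name", ""))
    for group in split_list(service.get("hostgroup_name", "")):
        hosts.extend(hostgroups[group])
    unique_hosts = list(dict.fromkeys(hosts))  # drop duplicates, keep order
    return [(host, service["service_description"]) for host in unique_hosts]


hostgroups = {"web-servers": ["host1", "host2", "host3"]}
website = {"service_description": "website", "hostgroup_name": "web-servers"}
ssh = {"service_description": "ssh", "hostgroup_name": "web-servers"}  # second check, same group

print(expand_service(website, hostgroups))
# [('host1', 'website'), ('host2', 'website'), ('host3', 'website')]
print(len(expand_service(ssh, hostgroups)))  # 3
```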
    • Service dependency. If DNS or LDAP is down, many other services will not function either. So, you will receive lots of notifications, burying the real issue. If you don't want to receive such "bogus" notifications, you can declare that those services depend on the DNS or LDAP service. This way, when a dependent (upper level) service fails, Nagios will check the state of the (lower level) service it depends on. With the notification_failure_criteria option, you tell Nagios not to send notifications for the dependent service while the service it depends on is in one of the listed "failure" states (e.g., critical or warning). As a dependency is about a service on one host depending on a service on possibly another host, you must specify the hosts in addition to the services:
      define servicedependency {
        dependent_service_description website
        dependent_host_name host1
        service_description dns
        host_name host2
        notification_failure_criteria c,w
      }
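The effect of notification_failure_criteria can be sketched as a simple lookup. This is an illustrative model only: the function name and data layout are invented for the example, and the state letters (o, w, c, u) are assumed to match the ones Nagios uses for service states.

```python
STATE_TO_OPTION = {"OK": "o", "WARNING": "w", "CRITICAL": "c", "UNKNOWN": "u"}


def notification_suppressed(dependent, states, dependencies):
    """True if a dependency blocks notifications for the dependent service.

    dependent:    (host_name, service_description) of the service that failed.
    states:       dict mapping (host_name, service_description) to its state.
    dependencies: list of dicts shaped like a servicedependency object.
    """
    for dep in dependencies:
        if (dep["dependent_host_name"], dep["dependent_service_description"]) != dependent:
            continue
        master = (dep["host_name"], dep["service_description"])
        if STATE_TO_OPTION[states[master]] in dep["notification_failure_criteria"]:
            return True
    return False


dependencies = [{
    "dependent_service_description": "website",
    "dependent_host_name": "host1",
    "service_description": "dns",
    "host_name": "host2",
    "notification_failure_criteria": ["c", "w"],
}]
states = {("host2", "dns"): "CRITICAL", ("host1", "website"): "CRITICAL"}
print(notification_suppressed(("host1", "website"), states, dependencies))  # True
```

With the dns service back in the OK state, the same call returns False and the website notification goes out as usual.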
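    • Service-host dependency. In principle, the same kind of dependency exists between a service and its host: if a host is down, it is impossible for the services on it to run properly. Nagios has no object for declaring this, but it does not need one, because every service is implicitly tied to the host it runs on. When a service check fails, Nagios checks the host as well, and while the host is in a hard down or unreachable state, notifications for the services running on it are suppressed, so you only receive the host notification.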