Kent Tong's personal thoughts on information technology: April 2013

PPP

PPP is a protocol to negotiate network layer protocol parameters (e.g., for IP) and an encapsulation to carry IP packets for transmission over a point-to-point link (a layer 2 functionality). The former allows the server to assign an IP address, netmask, DNS server, etc. to the client. The latter allows the actual transmission of IP packets.
When a packet is sent to PPP's IP interface (ppp0 on Linux), PPP will put the IP packet into a PPP packet and deliver it to the point-to-point link. The other side will extract the IP packet from the PPP packet. This is very much like IP over Ethernet.
Before being allowed to connect, the client needs to authenticate with the server. PPP has built-in authentication protocols such as PAP and CHAP.
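
The encapsulation step described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it prepends the 2-byte PPP protocol field (0x0021 for IPv4) and leaves out framing, escaping and the checksum. The function names are made up for this example.

```python
import struct

PPP_PROTO_IPV4 = 0x0021  # PPP protocol number for IPv4 payloads


def ppp_encapsulate(ip_packet: bytes) -> bytes:
    """Prefix an IP packet with the 2-byte PPP protocol field."""
    return struct.pack("!H", PPP_PROTO_IPV4) + ip_packet


def ppp_decapsulate(ppp_packet: bytes) -> bytes:
    """Return the IP packet carried in a PPP packet, as the receiving side would."""
    (proto,) = struct.unpack("!H", ppp_packet[:2])
    if proto != PPP_PROTO_IPV4:
        raise ValueError(f"not an IPv4 payload: protocol 0x{proto:04x}")
    return ppp_packet[2:]
```

The sender would call ppp_encapsulate() before handing the bytes to the link, and the receiver would call ppp_decapsulate() to get the IP packet back.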

EAP

EAP was devised to run on top of PPP (without IP). Instead of requiring a particular authentication mechanism such as PAP or CHAP, it allows different authentication mechanisms to be plugged into it.
As EAP is not run on top of IP or TCP, it handles re-transmission (for packet loss) by itself. However, it does assume that packet ordering is preserved by the transport (typically ensured by the point-to-point link).
EAP is a peer-to-peer protocol. Either side can authenticate the other side. The side being authenticated (e.g., a remote access client) is called the supplicant or the peer, and the other side is called the authenticator (e.g., a network access server).
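
To make the exchange between authenticator and peer more concrete, here is a rough Python sketch of the EAP packet layout: a code, an identifier, a length covering the whole packet, and, for requests and responses, a type byte followed by data. The helper names are invented for this example, and only the Identity type is shown.

```python
import struct

EAP_REQUEST, EAP_RESPONSE = 1, 2  # EAP codes for request and response
EAP_TYPE_IDENTITY = 1             # EAP type used to ask for the peer's identity


def build_eap(code: int, identifier: int, eap_type: int, data: bytes) -> bytes:
    """Build an EAP request or response packet."""
    length = 5 + len(data)  # 4-byte header + 1-byte type + data
    return struct.pack("!BBHB", code, identifier, length, eap_type) + data


def parse_eap(packet: bytes):
    """Split an EAP request or response into (code, identifier, type, data)."""
    code, identifier, length, eap_type = struct.unpack("!BBHB", packet[:5])
    return code, identifier, eap_type, packet[5:length]


# The authenticator asks for the identity; the peer answers with the same identifier.
request = build_eap(EAP_REQUEST, 7, EAP_TYPE_IDENTITY, b"")
response = build_eap(EAP_RESPONSE, 7, EAP_TYPE_IDENTITY, b"kent")
```

The identifier is what lets a response be matched to its request, which is also what makes re-transmission by EAP itself workable.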
The authenticator can authenticate the peer itself or can pass the exchange through to another server (the authentication server). The latter approach is useful to centralize the user accounts. Typically, the authentication server is a RADIUS server.
EAP-TLS is an EAP authentication mechanism that uses the TLS handshake protocol (with the peer's certificate) for authentication. The last step of the TLS handshake verifies that both sides hold the same master secret. That secret is derived from random numbers generated by each side and from a pre-master secret which, with RSA key exchange, the client sends encrypted with the server's public key.
PEAP (protected EAP). Plain EAP has weaknesses with some mechanisms, such as sending the identity of the peer in the clear. To avoid the problem, PEAP first establishes a TLS session with the authenticator (or with the authentication server in pass-through mode) and then performs EAP inside it (so an actual inner mechanism, such as EAP-MSCHAPv2, is still needed). This way, everything is encrypted. PEAP is not an open standard; it was created by Cisco and MS.

RADIUS

RADIUS provides its own authentication (quite weak) using a central user database. It can also return additional information to the authenticator through attributes (e.g., the privilege level of the user).
Even though RADIUS has its own authentication protocol, it can support CHAP and EAP using its attributes.
A RADIUS attribute has a type (an integer) and a value; together they are called an attribute-value pair (AVP). There are standard attribute types as well as vendor-proprietary ones.
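
As a small illustration of what an AVP looks like on the wire, the sketch below encodes and decodes attributes as type, length and value, where the one-byte length counts the two header bytes too. The function names are made up for this example, and vendor-specific attributes are not handled.

```python
import struct

USER_NAME = 1       # standard attribute type for the user name
NAS_IP_ADDRESS = 4  # standard attribute type for the authenticator's IP address


def encode_avp(attr_type: int, value: bytes) -> bytes:
    """Encode one attribute as type (1 byte), length (1 byte), value."""
    if len(value) > 253:
        raise ValueError("attribute value is limited to 253 bytes")
    return struct.pack("!BB", attr_type, len(value) + 2) + value


def decode_avps(data: bytes):
    """Decode a run of attributes into a list of (type, value) tuples."""
    avps = []
    offset = 0
    while offset < len(data):
        attr_type, length = struct.unpack_from("!BB", data, offset)
        if length < 2 or offset + length > len(data):
            raise ValueError("malformed attribute")
        avps.append((attr_type, data[offset + 2:offset + length]))
        offset += length
    return avps
```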

802.1x

802.1x (EAP over LAN) is basically just EAP over 802 networks (Ethernet or 802.11 wireless LAN) so that a switch or an AP can authenticate a device being connected to it. This is like PPP deciding whether to let a client connect, except that PPP deals with a remote client while 802.1x deals with a LAN client. As a LAN already has its own layer 2 protocol, only the authentication part of PPP (EAP) is needed.
As the authentication is done in layer 2, there is no IP yet. So, just like normal EAP, EAP packets are encapsulated in layer 2 frames (here, in 802 frames).
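
A rough sketch of that encapsulation in Python: the EAP packet gets a small EAPOL header (version, packet type, body length) and is carried in an Ethernet frame with EtherType 0x888E. The function name and the MAC addresses in the usage note are placeholders for this example.

```python
import struct

ETHERTYPE_EAPOL = 0x888E   # EtherType used for EAP over LAN
EAPOL_TYPE_EAP_PACKET = 0  # EAPOL packet type meaning "the body is an EAP packet"


def build_eapol_frame(dst_mac: bytes, src_mac: bytes, eap_packet: bytes,
                      version: int = 1) -> bytes:
    """Wrap an EAP packet in an EAPOL header and an Ethernet header (no IP involved)."""
    eapol = struct.pack("!BBH", version, EAPOL_TYPE_EAP_PACKET, len(eap_packet))
    ethernet_header = dst_mac + src_mac + struct.pack("!H", ETHERTYPE_EAPOL)
    return ethernet_header + eapol + eap_packet
```

For example, build_eapol_frame(bytes.fromhex("0180c2000003"), bytes.fromhex("020000000001"), eap_packet) would address the frame to the group address commonly used by 802.1x authenticators.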
The switch or AP is called the authenticator as it needs to authenticate the peer.
In pass-through mode, the authenticator will use the RADIUS protocol with EAP support to talk to the authentication server (for central management).
A RADIUS server can return RADIUS attributes telling the switch which VLAN to put the client/port into.

L2TP

L2TP is used to simulate a point-to-point link over an IP network such as the Internet. Therefore, PPP can be run over L2TP. This way, a VPN tunnel can be built across the Internet. Essentially L2TP serves as a layer 2 protocol that runs on top of a layer 3 protocol (IP). When people say "L2TP VPN", it is actually an L2TP and PPP VPN, as address assignment and so on are done by PPP. The authentication is also the same (e.g., EAP).
L2TP itself provides no encryption, so typically it is run over IPSec to protect the traffic. So, it is PPP over L2TP over IPSec.

GRE

GRE is very much like L2TP except that it can tunnel any protocol over any protocol (i.e., the transport is not necessarily IP).
When used for a VPN, PPP is also run over GRE. In addition, for security, the PPP payload is typically encrypted; this combination is PPTP (a proprietary protocol from MS).

Purpose. Basically Nagios is used to monitor if your servers and network devices are up and working properly. If something is not working, it can notify you (the administrator) so that you can fix the problems ASAP.
How it works (basically). Nagios runs on a central server so that you can view the status of everything in one web interface. It will check each host (server or network device) by pinging its hostname or IP, say, every 5 minutes ("check interval"). If a host is down, it will notify the administrator ("contact") or the administrators (the "contact group").
Notify on state change. Once a host is detected as down, the checking will continue as usual to detect if it is up again.

If it is still down, usually no further notification is sent (to avoid bombarding the admins). However, to keep the problem from being forgotten, you can configure Nagios to repeat the notification at a certain interval (the "notification interval").
If it comes back up, another notification will be sent so that the admins know that it has recovered.
Therefore, a notification is sent when the state changes, not every time a host is found to be down.

Soft and hard state. Sometimes a host is not down but just very busy, or the network is very slow, so some pings may time out. Therefore, if a ping fails, the state is first changed to a "soft" down, meaning that the host is possibly down. No notification is sent yet. Nagios will check it again ("retry") a few more times, up to the "max check attempts" in total. If it is still down after these checks, the state is changed to a "hard" down and a notification is sent. If the host comes back up while it is still in the soft down state, that is only a soft recovery and no notification is sent. If it comes back up from a hard down, Nagios treats that as a hard recovery and sends the recovery notification. In summary, a notification is sent only when the state changes into a different hard state.
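
The soft/hard logic can be sketched as a small state machine. This is not how Nagios is implemented, just a simplified model: the class and method names are made up, only "UP" and "DOWN" are modelled, and recovery is simplified so that a successful check returns the host straight to a hard up.

```python
class HostState:
    """Simplified model of soft/hard host states and notify-on-hard-change."""

    def __init__(self, max_check_attempts: int = 4):
        self.max_check_attempts = max_check_attempts
        self.state = "UP"
        self.state_type = "HARD"
        self.last_hard_state = "UP"
        self.attempt = 0  # number of consecutive failed checks so far

    def process_check(self, is_up: bool) -> bool:
        """Feed one check result; return True if a notification should be sent."""
        if is_up:
            self.state, self.state_type, self.attempt = "UP", "HARD", 0
        else:
            self.attempt = min(self.attempt + 1, self.max_check_attempts)
            self.state = "DOWN"
            reached_limit = self.attempt >= self.max_check_attempts
            self.state_type = "HARD" if reached_limit else "SOFT"
        # Notify only when we arrive in a hard state different from the last one.
        if self.state_type == "HARD" and self.state != self.last_hard_state:
            self.last_hard_state = self.state
            return True
        return False
```

With max_check_attempts set to 3, three failed checks in a row produce one notification, further failures produce none, and the first successful check afterwards produces the recovery notification.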
Unreachable state. In addition to up or down, there is a third possible state: unreachable. It applies when the host itself may be running fine but a router between the Nagios server and the host is down. But how can Nagios distinguish which case it is when ping fails? You can create a host record for the router in Nagios and tell it that the router is the "parent" of that server host (on the path from the Nagios server to that server host). This way, when ping to that server host fails, Nagios will check if the router is up or down. If it is down, it will mark the server host as unreachable instead of down. Again, a single result only leads to a soft state.

Will the admins still get a notification if it is changed to a hard unreachable state? By default yes but you can configure it (below).
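
The down versus unreachable decision can be sketched as follows. This is a simplified model with made-up names: parents maps a host to its parent hosts, and is_up holds the latest check result of each parent.

```python
def classify_failed_host(host: str, parents: dict, is_up: dict) -> str:
    """Decide whether a host whose ping failed is DOWN or UNREACHABLE."""
    host_parents = parents.get(host, [])
    # If every parent on the path is down, we cannot tell anything about the host.
    if host_parents and not any(is_up[parent] for parent in host_parents):
        return "UNREACHABLE"
    return "DOWN"


parents = {"host1": ["router1"]}
state = classify_failed_host("host1", parents, {"router1": False})
```

Here state is "UNREACHABLE" because router1 is down. A host without any parent is always reported as down in this sketch.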

Command. In Nagios you can configure the actual command used for ping or for sending notifications. To allow you to refer to the command by a simple name without worrying about the actual path, each command is assigned a name for later use. Nagios fills in values such as the host address through "macros" like $HOSTADDRESS$, which it substitutes into the command line before running it (it can also expose them to the command as environment variables). Below is the command typically used for ping:

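# -w and -c set the warning and critical thresholds (round-trip time in ms, packet loss)
define command {
  command_name check-host-alive
  command_line /usr/lib/nagios/plugins/check_ping -H $HOSTADDRESS$ -w 5000,100% -c 5000,100%
}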

Below is a simplified example of a command used for sending email notification (again, Nagios will provide the information such as email, host name, host state through the macros):

define command {
  command_name notify-by-email
  command_line /bin/echo $HOSTNAME$ in $HOSTSTATE$ | /usr/bin/mail $CONTACTEMAIL$
}
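
To show what the macro substitution amounts to, here is a minimal Python sketch. It is an illustration only: the function name is made up, and leaving unknown macros untouched is a choice of this sketch, not necessarily what Nagios does.

```python
import re


def expand_macros(command_line: str, macros: dict) -> str:
    """Replace each $NAME$ in the command line with its value from macros."""
    def replace(match):
        name = match.group(1)
        # Unknown macros are left as they are (an assumption of this sketch).
        return macros.get(name, match.group(0))
    return re.sub(r"\$([A-Z0-9_]+)\$", replace, command_line)


line = expand_macros(
    "/bin/echo $HOSTNAME$ in $HOSTSTATE$ | /usr/bin/mail $CONTACTEMAIL$",
    {"HOSTNAME": "host1", "HOSTSTATE": "DOWN", "CONTACTEMAIL": "kent@foo.com"},
)
```

With these sample values, line becomes "/bin/echo host1 in DOWN | /usr/bin/mail kent@foo.com".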

Contact. A contact is simply a configuration item for sending a text message (via a command line) to a person. It specifies the email address, pager/mobile number, etc. so that the command knows where to send the message.

Notification options. A contact can also specify what kinds of notifications it will accept: down (from up to down), recovery (from down to up), unreachable (from up or down to unreachable), etc.
Notification period. A contact can also specify at what time of day and on which days notifications can be sent. Usually it should be 24x7 for critical hosts, but if the host is unimportant and the contact is notified by SMS, you may configure it to send only during business hours. A notification outside of the period is not lost though: Nagios will schedule it to be sent at the start of the next time slot (e.g., next morning).
Notification settings for a host. The notification options and notification period can also be specified for a host. Together with the contact settings, this covers the fact that some hosts are more important than others and that some admins are available only for certain issues in certain time periods.

Below is an example of a contact (using the notify-by-email command above):

define contact {
  contact_name kent
  email kent@foo.com
  host_notification_options d,r,u
  host_notification_period 24x7
  host_notification_commands notify-by-email
}

Contact group. A contact group is just a set of contacts (e.g., for all the administrators). You define the individual contacts first and then add them to a group. This way, you can configure the host checking to notify a contact group instead of individuals. Below is an example:

define contactgroup {
  contactgroup_name admins
  members kent, paul
}

Check period. Instead of performing checks at all times, Nagios requires that for each host you specify a period in which checking is performed. You can set it to 24x7 or to your business hours, depending on the supposed up-time window of the host.

Host configuration example. The configuration of a host in Nagios includes the address (hostname or IP) and the various checking and notification settings mentioned above. Below is an example (using the check-host-alive command above):

define host {
  host_name host1
  address 192.168.1.10
  parents router1
  check_interval 5
  check_period 24x7
  check_command check-host-alive
  max_check_attempts 4
  contact_groups admins
}

define host {
  host_name router1
  ...
}

Time period. The time period used above such as 24x7 is defined like below:

define timeperiod {
  timeperiod_name 24x7
  sunday 00:00-24:00
  monday 00:00-24:00
  tuesday 00:00-24:00
  ...
}
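
Whether a given moment falls inside such a time period can be worked out as in the sketch below. It is a simplified model: the time period is a plain dict from weekday name to ranges written as in the definition above, the function names are made up, and exceptions such as holidays are ignored.

```python
from datetime import datetime

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def to_minutes(hhmm: str) -> int:
    """Convert "HH:MM" into minutes since midnight ("24:00" gives 1440)."""
    hours, minutes = hhmm.strip().split(":")
    return int(hours) * 60 + int(minutes)


def in_timeperiod(timeperiod: dict, when: datetime) -> bool:
    """Return True if 'when' falls inside one of the ranges for its weekday."""
    ranges = timeperiod.get(WEEKDAYS[when.weekday()])
    if not ranges:
        return False  # no entry for this day means the period is closed
    minute_of_day = when.hour * 60 + when.minute
    for time_range in ranges.split(","):
        start, end = time_range.split("-")
        if to_minutes(start) <= minute_of_day < to_minutes(end):
            return True
    return False
```

For a business-hours period such as {"monday": "09:00-17:00"}, a check or notification on Monday at 10:00 is inside the period, while one at 18:00 or on a day without an entry is not.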

Object and object inheritance. Each item above such as host, contact, command is called an "object" in Nagios. To save on specifying all those options, Nagios allows you to define an object template that can be inherited by objects. For example, there is a template coming with Nagios called generic-host defined as below to set sensible values for many settings. The name is the name of the template so that you can refer to it by name and "register 0" tells Nagios not to treat it as a real host:

define host {
  name generic-host
  register 0
  max_check_attempts 10
  check_command check-host-alive
  contact_groups admins
  ...
}

To use ("inherit") the template, just do it like this:

define host {
  host_name host1
  address 192.168.1.10
  parents router1
  use generic-host
  check_period 24x7
}
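
Conceptually, inheritance is just a merge in which the object's own settings win over those of the template. The Python sketch below models objects as dicts; the function name is made up and only a single "use" value per object is handled.

```python
def resolve(obj: dict, templates: dict) -> dict:
    """Return the object's settings with those of its template chain filled in."""
    merged = {}
    parent_name = obj.get("use")
    if parent_name is not None:
        # Start from the template (which may itself use another template).
        merged.update(resolve(templates[parent_name], templates))
    merged.update(obj)  # the object's own settings override inherited ones
    for key in ("use", "name", "register"):
        merged.pop(key, None)  # template bookkeeping is not inherited
    return merged


templates = {
    "generic-host": {"name": "generic-host", "register": "0",
                     "max_check_attempts": "10",
                     "check_command": "check-host-alive",
                     "contact_groups": "admins"},
}
host1 = resolve({"host_name": "host1", "address": "192.168.1.10",
                 "parents": "router1", "use": "generic-host",
                 "check_period": "24x7"}, templates)
```

After this, host1 carries max_check_attempts, check_command and contact_groups from generic-host along with its own host_name, address, parents and check_period.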

Service monitoring. Just monitoring whether a host is up or down isn't enough. You also need to monitor whether services such as a website or email are running properly. This is configured with a service object. A service object specifies the command used for checking (so the actual meaning of "up" is up to you to decide) and one or more hosts to provide the IP addresses. Nagios will perform the check on each such host independently and pass the IP to the command via the $HOSTADDRESS$ macro. This is useful if, say, you have Apache running on dozens of hosts, as you only need one service object. So, the service object is not really a single service, but just a way to apply a type of service check across multiple hosts. Other than this, service checking is very similar to host checking and supports similar settings such as max check attempts, check interval, contacts, notification options, etc. Below is a sample service object (the check-http command will simply check if it can successfully retrieve a web page from the IP):

define service {
  service_description website
  host_name host1, host2, host3
  max_check_attempts 4
  check_interval 3
  check_period 24x7
  check_command check-http
  notification_options c,w,r
  notification_period 24x7
  contact_groups admins
  ...
}

Service state. Instead of up, down or unreachable, a service (running on a particular host) can be OK, critical (an obvious error such as no response, connection refused or a 404 error), warning (e.g., a slow response) or unknown (usually meaning that the check command itself has failed). The actual meaning of each state is defined by the command. The corresponding notification options are r (recovery), c (critical), w (warning) and u (unknown).
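
By convention a check command reports the state through its exit code: 0 for OK, 1 for warning, 2 for critical and 3 for unknown. The sketch below shows how a command might map a measured response time to one of them. The thresholds and function names are invented for this example.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # conventional plugin exit codes


def classify(elapsed, warn_after: float = 2.0, crit_after: float = 10.0) -> int:
    """Map a response time in seconds (None means no response) to a state."""
    if elapsed is None or elapsed >= crit_after:
        return CRITICAL
    if elapsed >= warn_after:
        return WARNING
    return OK


def run_check(measure) -> int:
    """Run a measurement function; if the check itself breaks, report UNKNOWN."""
    try:
        return classify(measure())
    except Exception:
        return UNKNOWN
```

A real command would end with sys.exit(run_check(...)) so that Nagios can read the state from the exit code.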
Service notification settings in a contact. Probably because the notification options for service checking are different from those for host checking, Nagios requires that in a contact object you specify the notification settings separately for host notifications and for service notifications:

define contact {
  contact_name kent
  email kent@foo.com
  host_notification_options d,r,u
  host_notification_period 24x7
  host_notification_commands notify-by-email
  service_notification_options c,w,r
  service_notification_period 24x7
  service_notification_commands notify-by-email
}

Passing explicit arguments to a command. Just checking if a web page can be retrieved may not be good enough. A better check is to check if the web page contains the right content (e.g., containing a particular string). To do that, you can define your own command and pass the expected string (e.g., abc) to it. This is done using an exclamation mark and it is retrieved by $ARG1$ in the command definition:

define service {
  service_description website
  host_name host1, host2, host3
  max_check_attempts 4
  check_interval 3
  check_period 24x7
  check_command check-http-string!abc
  ...
}

define command {
  command_name check-http-string
  command_line /usr/lib/nagios/plugins/check_http -H $HOSTNAME$ -I $HOSTADDRESS$ -s $ARG1$
}
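
The way the argument travels from the service to the command can be sketched like this. It is a simplified model with made-up names: commands maps a command name to its command line, and the values after each exclamation mark become ARG1, ARG2 and so on.

```python
def expand_check_command(check_command: str, commands: dict, macros: dict) -> str:
    """Turn "name!arg1!arg2" into the final command line."""
    name, *args = check_command.split("!")
    line = commands[name]
    values = dict(macros)
    for index, arg in enumerate(args, start=1):
        values[f"ARG{index}"] = arg  # first argument is $ARG1$, and so on
    for key, value in values.items():
        line = line.replace(f"${key}$", value)
    return line


commands = {
    "check-http-string":
        "/usr/lib/nagios/plugins/check_http -H $HOSTNAME$ -I $HOSTADDRESS$ -s $ARG1$",
}
line = expand_check_command(
    "check-http-string!abc", commands,
    {"HOSTNAME": "host1", "HOSTADDRESS": "192.168.1.10"},
)
```

For host1, line becomes "/usr/lib/nagios/plugins/check_http -H host1 -I 192.168.1.10 -s abc".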

Finding the plugins. Nagios comes with many commands (both the defined command objects and the actual programs). They are called plugins, and many 3rd party ones can be installed. To find out what is available and how to use each one, do something like this:

# find the command object definitions by the plugins
$ ls /etc/nagios-plugins/config
  ...
# learn about the options supported by each program
$ /usr/lib/nagios/plugins/check_http --help
  ...

Inheritance for service. Just like a host object, you can use object inheritance for a service object too. Nagios comes with a service template called "generic-service" defining many defaults for you to use:

define service {
  service_description website
  use generic-service
  check_command check-http-string!abc
  ...
}

Host group. Instead of listing individual hosts in a service object, it may be better to define a "host group" named web-servers containing those hosts and let the service object refer to the host group. This is particularly useful if you need to perform two kinds of service checking on the same group (e.g., checking web and SSH).

define service {
  service_description website
  hostgroup_name web-servers
  ...
}

define hostgroup {
  hostgroup_name web-servers
  members host1, host2, host3
}
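
In effect, a service that refers to a host group is expanded into one check per member host. A minimal Python sketch of that expansion, with made-up names and host groups modelled as a dict from group name to member list:

```python
def expand_service(service: dict, hostgroups: dict) -> list:
    """Return the (host, service_description) pairs a service object stands for."""
    hosts = [h.strip() for h in service.get("host_name", "").split(",") if h.strip()]
    group = service.get("hostgroup_name")
    if group:
        for member in hostgroups[group]:
            if member not in hosts:  # avoid checking the same host twice
                hosts.append(member)
    return [(host, service["service_description"]) for host in hosts]


hostgroups = {"web-servers": ["host1", "host2", "host3"]}
checks = expand_service(
    {"service_description": "website", "hostgroup_name": "web-servers"},
    hostgroups,
)
```

A second service object for SSH could refer to the same web-servers group and would expand to the same three hosts.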

Service dependency. If DNS or LDAP is down, many other services will not function either. You will then receive lots of notifications, burying the real issue. If you don't want to receive such "bogus" notifications, you can declare that those services depend on the DNS or LDAP service. This way, when an upper level (dependent) service fails, Nagios will check whether the lower level service has also failed. With the notification_failure_criteria option, you tell Nagios not to send notifications for the upper level service when the lower level service is in certain "failure" states (e.g., critical or warning). As this dependency is about a service on one host depending on a service on another host, you must specify the hosts in addition to the services:

define servicedependency {
  dependent_service_description website
  dependent_host_name host1
  service_description dns
  host_name host2
  notification_failure_criteria c,w
}
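
The effect of such a dependency on notifications can be modelled as below. This is only a sketch with made-up names: services are identified by (host, service) pairs, states use the same one-letter codes as the failure criteria ("o" for OK is an assumption of this model), and only the notification side is covered.

```python
def should_notify(service_key, states: dict, dependencies: list) -> bool:
    """Return False if a service this one depends on is in a listed failure state."""
    for dep in dependencies:
        if dep["dependent"] != service_key:
            continue
        if states[dep["master"]] in dep["notification_failure_criteria"]:
            return False  # the lower level service explains the failure
    return True


dependencies = [{
    "dependent": ("host1", "website"),
    "master": ("host2", "dns"),
    "notification_failure_criteria": {"c", "w"},
}]
notify = should_notify(("host1", "website"), {("host2", "dns"): "c"}, dependencies)
```

Here notify is False: the website on host1 has failed, but DNS on host2 is critical, so the website notification is held back.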

Service-host dependency. In principle, the same kind of dependency should hold between a service and its host: if a host is down, the services on it cannot continue to run properly. However, at present Nagios doesn't support this concept. So, if a host is down, you will still receive notifications for the services running on it.

Tuesday, April 16, 2013

Network access related protocols: PPP, EAP, RADIUS, 802.1x, L2TP, GRE

Tuesday, April 2, 2013

Concepts of Nagios