Service Checks
  • 18 Oct 2022

Description

A service check is an application-level monitoring check added to a managed device in Netreo that monitors the status of an individual process or resource (application, interface, etc.). The statuses of service checks are displayed in a variety of locations in Netreo, but by far the most common is in a Tactical Overview dashboard widget.

The Tactical Overview dashboard widget showing aggregated statuses of different checks for different device groups. The SERVICES column for each row shows the total number of service checks in each state for that group. The HOSTS column reflects the statuses of the host availability service checks for the devices in that group.

Each service check runs repeatedly according to a configurable schedule (default timing is 3 minutes between executions). Netreo staggers the execution of service checks to prevent large numbers of checks from all running simultaneously, which could produce prohibitive amounts of traffic on a network.
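As an illustration of staggering, the sketch below spreads the start times of a batch of checks evenly across one scheduling interval. This is a minimal sketch under assumptions: the even-spread strategy and the function name `staggered_offsets` are hypothetical and are not taken from Netreo. Only the 3-minute (180-second) default interval comes from the text above.

```python
from typing import List


def staggered_offsets(check_count: int, interval_seconds: float = 180.0) -> List[float]:
    """Return a start offset (in seconds) for each check, spread evenly over one interval.

    Hypothetical helper: Netreo's real staggering algorithm is not documented here.
    """
    if check_count <= 0:
        return []
    step = interval_seconds / check_count
    return [i * step for i in range(check_count)]


# Four checks on the default 3-minute schedule start 45 seconds apart.
print(staggered_offsets(4))  # [0.0, 45.0, 90.0, 135.0]
```

With offsets like these, each check still runs once per interval, but no two checks in the batch start at the same moment.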

When run, a service check queries its target for a response code. If the target returns a failure code (or no code) the check displays as failed in Netreo dashboards and (typically) generates an alarm. Additionally, a service check can be configured to run commands on its target when the check fails (such as reboot or restart-service commands). This provides the possibility of resolving some issues automatically, without the need to involve personnel.

For issues that cannot be resolved automatically, users have the ability to acknowledge failing service checks to indicate that they are currently working on the problem. Service checks that have been acknowledged by a user are visually indicated in dashboards.

By default, Netreo automatically adds several service checks to every managed device (through the "Default" and "Windows Default" device templates) to provide basic monitoring services. However, there are many more service checks that may be added to suit your specific monitoring needs.

Service checks are categorized into the following basic types.

  • Cloud Checks - For monitoring cloud-based resources.
  • Firewall Checks - For monitoring firewall resources.
  • Generic Passive Checks - For monitoring resources via an indirect source.
  • HP Insight Manager Agent Checks - For monitoring certain hardware systems.
  • Interface Checks - For monitoring interfaces.
  • Network Application Checks - For monitoring network application resources.
  • Network Connectivity Checks - For monitoring various types of connectivity.
  • Netreo Checks - For monitoring elements of Netreo itself.
  • System Checks - For monitoring core processes.
  • Web Checks - For monitoring web-based resources.

Details

(See Service Check Management for information about creating and managing service checks on hosts directly.)

(See Device Template Management for information about creating and managing service checks on hosts using device templates, which is the preferred method.)

Service Check States

Service checks always display one of the following states when viewed in dashboard widgets.

State         Description
OK            (Green) The check query has returned a success code.
WARNING       (Yellow) Very rare. The check query has returned a warning code. For all operations involving service checks, warning codes are treated as failure codes.
CRITICAL      (Red) The check query has returned a failure code or no code.
ACKNOWLEDGED  (Blue) Indicates a service check in a CRITICAL or WARNING state that has been acknowledged by a user. This state is technically an incident state, not a check state, but its display in the dashboards helps to distinguish between problems that are new and problems that are already being addressed.
UNKNOWN       (Orange) The check query has returned a value that the check cannot understand. This is likely due to a configuration error in the check.
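The mapping from a response code to a displayed state can be sketched as follows. The numeric code values (0, 1, 2) are an assumption made for illustration only, since the article does not state which codes Netreo uses. The names `CheckState`, `classify`, and `counts_as_failure` are likewise hypothetical. ACKNOWLEDGED is left out because it is an incident state rather than a check state.

```python
from enum import Enum
from typing import Optional


class CheckState(Enum):
    """Check states and their dashboard colors, as listed in the table above."""
    OK = "green"
    WARNING = "yellow"
    CRITICAL = "red"
    UNKNOWN = "orange"


# Assumed numeric codes, for illustration only (not confirmed by the article).
SUCCESS_CODE, WARNING_CODE, FAILURE_CODE = 0, 1, 2


def classify(code: Optional[int]) -> CheckState:
    """Map a response code (or None when no code was returned) to a state."""
    if code is None or code == FAILURE_CODE:
        return CheckState.CRITICAL      # failure code or no code
    if code == SUCCESS_CODE:
        return CheckState.OK
    if code == WARNING_CODE:
        return CheckState.WARNING
    return CheckState.UNKNOWN           # a value the check cannot understand


def counts_as_failure(state: CheckState) -> bool:
    """Warning codes are treated as failure codes for all check operations."""
    return state in (CheckState.WARNING, CheckState.CRITICAL)


print(classify(None).name)   # CRITICAL
print(classify(42).name)     # UNKNOWN
```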

The Tactical Overview dashboard widget is useful for displaying service check statuses for groups of devices.

Active and Passive Service Checks

Service checks are either active or passive. This is discussed in more detail below, but briefly:

  • Active service checks create their own processes in memory while they do their work and follow their own timing schedule for their query.
  • Passive service checks wait for some other process to update them (usually an active service check or some other active process). This means that passive service checks update according to the schedule of whatever process is updating them.

Whether a given service check is active or passive can be seen on the Service Check Administration page (Administration > Change Devices > Manage Service Checks from the main menu).

Active Service Checks

Active service checks create their own process in memory, and actively query their specific target process or resource for a response. The response code returned to the service check determines the state of that check.

If the response is a success code, the check remains in the OK state and continues to run its query according to its configured schedule.

If the response is a failure code, the service check enters what is called a soft CRITICAL state. While in this soft state, the service check retries its query several times (typically at a faster frequency). When it reaches a set number of failure responses (default is 3), the check then enters what is called a hard CRITICAL state and generates an alarm (new alarms open an incident in Netreo). Warning response codes follow this same pattern for WARNING states. (There is no visual distinction between hard and soft states in the dashboard indicators, but the Services tab of the Device Dashboard for the failing device shows the history of soft and hard states for the check, including its current state.)

A service check in a hard state continues to retry its query at the (typically increased) configured frequency. If, at any time, it again receives a success code it immediately recovers to the OK state, clears its alarm, and signals any opened incident that it has recovered.

Service checks are designed to retry their query several times before generating an alarm in order to prevent them from immediately alerting users at the slightest temporary glitch. The retry schedule and the number of failures required to generate an alarm are adjustable in the configuration options of each service check.
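The soft and hard state behavior of an active check can be pictured as a small state machine. The following is a minimal sketch, not Netreo's implementation: the class name `ActiveCheck`, its field names, and the 60-second retry interval are hypothetical. The failure threshold of 3 and the 180-second normal interval are the defaults mentioned in this article.

```python
from dataclasses import dataclass


@dataclass
class ActiveCheck:
    max_failures: int = 3           # failures needed to reach the hard state (article default)
    normal_interval: float = 180.0  # seconds between queries while OK (article default)
    retry_interval: float = 60.0    # hypothetical faster retry interval
    failures: int = 0
    state: str = "OK"               # "OK", "SOFT_CRITICAL", or "HARD_CRITICAL"
    alarm_open: bool = False

    def record(self, success: bool) -> str:
        """Apply one query result and return the resulting state."""
        if success:
            # Any success recovers the check immediately and clears its alarm.
            self.failures = 0
            self.state = "OK"
            self.alarm_open = False
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.state = "HARD_CRITICAL"
                self.alarm_open = True   # an alarm is generated only in the hard state
            else:
                self.state = "SOFT_CRITICAL"
        return self.state

    def next_interval(self) -> float:
        """Failing checks retry at the (typically faster) retry interval."""
        return self.normal_interval if self.state == "OK" else self.retry_interval


check = ActiveCheck()
history = [check.record(ok) for ok in (False, False, False, True)]
print(history)  # ['SOFT_CRITICAL', 'SOFT_CRITICAL', 'HARD_CRITICAL', 'OK']
```

In this sketch the third consecutive failure moves the check to the hard state and opens the alarm, and the following success clears it.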

Passive Service Checks

Passive service checks do not create their own process in memory. They do nothing until they receive a response code from the active process that updates them. This means that they always remain in their current state (whatever that state may be) until the active process updates them with a success or failure code.

If a passive service check is updated with a success code, the check remains in the OK state and waits for the next update.

If a passive service check is updated with a failure code, it increments its exception counter and enters the soft CRITICAL state (similar to active service checks). When a set number of exceptions occurs (default is 3), the check enters the hard CRITICAL state (again, much like active service checks) and generates an alarm (new alarms open an incident in Netreo). Warning response codes follow this same pattern for WARNING states. (There is no visual distinction between hard and soft states in the dashboard indicators, but the Services tab of the Device Dashboard for the failing device shows the history of soft and hard states for the check, including its current state.)

If the check is then updated with a success code, it immediately recovers to the OK state, clears its exception counter to zero, clears its alarm, and signals any opened incident that it has recovered. (Note: A passive service check can become stuck in its current state if it is never updated.)

Like active service checks, passive service checks are designed to require a number of exceptions before generating an alarm in order to prevent them from immediately alerting users at the slightest temporary glitch. The number of exceptions required to generate an alarm is adjustable in the configuration options of each passive check.
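The key difference for a passive check is that nothing happens between updates: its state changes only when another process submits a result. The sketch below illustrates that, along with the "stuck" case noted above. The class name `PassiveCheck`, the `submit` method, and the `is_stale` helper are hypothetical and are not Netreo features. Only the exception threshold of 3 comes from the article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PassiveCheck:
    max_exceptions: int = 3              # exceptions needed for the hard state (article default)
    exceptions: int = 0
    state: str = "OK"                    # "OK", "SOFT_CRITICAL", or "HARD_CRITICAL"
    last_update: Optional[float] = None  # time of the most recent submitted result

    def submit(self, success: bool, now: float) -> str:
        """Called by the updating process; the check never queries anything itself."""
        self.last_update = now
        if success:
            self.exceptions = 0
            self.state = "OK"
        else:
            self.exceptions += 1
            hard = self.exceptions >= self.max_exceptions
            self.state = "HARD_CRITICAL" if hard else "SOFT_CRITICAL"
        return self.state

    def is_stale(self, now: float, max_age: float) -> bool:
        """Hypothetical helper: flag a check that has not been updated recently."""
        return self.last_update is None or (now - self.last_update) > max_age


check = PassiveCheck()
check.submit(False, now=0.0)
# No further updates arrive: the state stays as it was, however much time passes.
print(check.state, check.is_stale(now=3600.0, max_age=600.0))  # SOFT_CRITICAL True
```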

Service Check Alarms

It is important to remember that although a service check may be showing as failed in a dashboard, an alarm is not generated (and thus an incident is not opened) until the check reaches a hard failed state (as explained in Active and Passive Service Checks above).

A newly generated alarm always attempts to open a new incident in Netreo. However, this may be prevented by the check's host checking logic (see below) or by Netreo's incident management system for housekeeping purposes (for example, when an incident already exists for the current issue).

Host Availability Check

In order for Netreo to monitor the network availability of any given managed device, a service check must be added to it specifically for that purpose.

The "Default" device template automatically adds a "Ping this host" network connectivity service check (named "PING") to all managed devices for the purpose of availability monitoring. Any time you see a reference to a "host availability check," it is referring to the specific service check added to a given resource to monitor its availability on the network.

Certain monitored resources, however, may not respond to the "Ping this host" service check and require you to add a different service check to monitor that resource's availability (such as "Check TCP Port," used when ICMP requests are not allowed). This can be easily accomplished by using a customized device template applied to the particular device type of the resource. (Note: The service check you add to your custom device template to act as the host availability check must be given the name "PING" in its description field so that it overrides the "Ping this host" check provided in the "Default" device template. See Device Templates for more information.)

Non-"pingable" resources
If the "Ping this host" service check won't work for a particular managed device, you will need to add a service check of an appropriate type for that resource to act as the host availability check (for example, a "Check TCP Port" service check). This should preferably be done using a device template.
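To show what a TCP-based availability test does in principle, the sketch below attempts a TCP connection and reports whether it succeeded. It is not the implementation of Netreo's "Check TCP Port" service check. The function name `tcp_port_open`, the 3-second timeout, and the example address are hypothetical. Only the Python standard library `socket.create_connection` call is used.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established within the timeout."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example call (hypothetical address): treat the host as "up" if port 443 accepts connections.
# tcp_port_open("192.0.2.10", 443)
```

A test like this is useful where ICMP is blocked but the resource is known to listen on a specific port.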

Whatever service check is ultimately used for a given resource as the host availability check, that is the service check that is executed during a "host check" (see Host Check below).

Host Check

(Note: A host check is not a type of service check. It refers to an internal system behavior related to service checks that is used to assist Netreo's incident management system.)

The term "host check" refers to the unscheduled execution of a normally scheduled host availability service check (see Host Availability Check above) already assigned to a particular resource for the purpose of immediately determining if that host is up or down. (Netreo is simply checking a host for its availability status at that moment, thus the term "host check.")

The host checking behavior is used by Netreo's incident management system for the purposes of alert-noise reduction and root-cause analysis, and relies heavily on proper device parenting.

A "host check" is triggered automatically when any service check enters a hard CRITICAL or WARNING state and has generated an alarm. The hosts to be checked include the owner of the failed service check and, if that host is down, all of that host's immediate parents in the network hierarchy.

If the host owning the failed service check is determined to be "up," that information is provided to incident management.

If the host owning the failed service check is determined to be "down," then that information is provided to incident management and the host availability service checks of any immediate parents of that host are also checked.

This process continues up the network hierarchy until Netreo discovers a parent that is up or a host that does not have any parents configured. The downed host that is the child of the available host is the root cause of the problem. That information (and the list of downed hosts) is then provided to incident management, which uses it to suppress unnecessary alert notifications.
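The upward walk described above can be sketched as a traversal of the parent hierarchy. This is a simplified illustration under stated assumptions, not Netreo's algorithm: the function name `host_check`, the `parents` mapping, and the `is_up` callback are hypothetical. The sketch also assumes that a downed host with no configured parents is treated as a root cause.

```python
from typing import Callable, Dict, List, Tuple


def host_check(owner: str,
               parents: Dict[str, List[str]],
               is_up: Callable[[str], bool]) -> Tuple[List[str], List[str]]:
    """Return (down_hosts, root_causes) for the host owning a failed service check."""
    status: Dict[str, bool] = {}

    def up(host: str) -> bool:
        # Run each host's availability check at most once.
        if host not in status:
            status[host] = is_up(host)
        return status[host]

    if up(owner):
        return [], []

    down_hosts: List[str] = []
    root_causes: List[str] = []
    queue, visited = [owner], {owner}
    while queue:
        host = queue.pop(0)
        down_hosts.append(host)
        host_parents = parents.get(host, [])
        # A downed host with an available parent (or, by assumption, no parents) is a root cause.
        if not host_parents or any(up(p) for p in host_parents):
            root_causes.append(host)
        # Keep walking up through any parents that are also down.
        for p in host_parents:
            if not up(p) and p not in visited:
                visited.add(p)
                queue.append(p)
    return down_hosts, root_causes


# Hypothetical topology: server -> switch -> router; the server and switch are down.
topology = {"server": ["switch"], "switch": ["router"], "router": []}
print(host_check("server", topology, lambda h: h == "router"))
# (['server', 'switch'], ['switch'])
```

Here the switch is reported as the root cause because it is the downed host whose parent (the router) is available, which is why correct device parenting matters for this behavior.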

Due to its nature, the host checking behavior is triggered by failing service checks only. No other type of monitoring check provokes the host checking behavior.

See Incident Management for more information on how host checking is used by the incident management system.

