HealthBot Concepts

 

HealthBot is a highly programmable telemetry-based analytics application. With it, you can diagnose and root cause network issues, detect network anomalies, predict potential network issues, and create real-time remedies for any issues that come up.

To accomplish this, network devices and HealthBot have to be configured to send and receive large amounts of data, respectively. Device configuration is covered throughout this and other sections of the guide.

Configuring HealthBot, or any application, to read and react to incoming telemetry data requires a language that describes several elements that are specific to the systems and data under analysis. This type of language is called a Domain Specific Language (DSL), i.e., a language that is specific to one domain. Any DSL is built to help answer questions. For HealthBot, these questions are:

  • Q: What components make up the systems that are sending data?

    A: Network devices are made up of memory, CPU, interfaces, protocols, and so on. In HealthBot, these are called HealthBot Topics.

  • Q: How do we gather, filter, process, and analyze all of this incoming telemetry data?

    A: HealthBot uses rules, described in HealthBot Rules - Basics, that consist of information blocks called sensors, fields, variables, triggers, and more.

  • Q: How do we determine what to look for?

    A: It depends on the problem you want to solve or the question you want to answer. HealthBot uses HealthBot Playbooks to create collections of specific rules and apply them to specific groups of devices in order to accomplish specific goals. For example, part of the system-kpis-playbook can alert a user when system memory usage crosses a user-defined threshold.

This section covers these key concepts and more, which you need to understand before using HealthBot.

HealthBot Data Collection Methods

In order to provide visibility into the state of your network devices, HealthBot first needs to collect their telemetry data and other status information. It does this using sensors.

HealthBot supports sensors that “push” data from the device to HealthBot and sensors that require HealthBot to “pull” data from the device using periodic polling.

Data Collection - ’Push’ Model

As the number of objects in the network, and the metrics they generate, have grown, gathering operational statistics for monitoring the health of a network has become an ever-increasing challenge. Traditional ’pull’ data-gathering models, like SNMP and the CLI, require additional processing to periodically poll the network element, and can directly limit scaling.

The ’push’ model overcomes these limits by delivering data asynchronously, which eliminates polling. With this model, the HealthBot server can make a single request to a network device to stream periodic updates. As a result, the ’push’ model is highly scalable and can support the monitoring of thousands of objects in a network. Junos devices support this model in the form of the Junos Telemetry Interface (JTI).

HealthBot currently supports four ‘push’ ingest types.

  • Native GPB

  • NetFlow

  • OpenConfig

  • Syslog

These push-model data collection—or ingest—methods are explained in detail in the HealthBot Data Ingest Guide.

Data Collection - ’Pull’ Model

While the ’push’ model is the preferred approach for its efficiency and scalability, there are still cases where the ’pull’ data collection model is appropriate. With the ’pull’ model, HealthBot requests data from network devices at periodic intervals.

HealthBot currently supports two ‘pull’ ingest types.

  • iAgent (CLI/NETCONF)

  • SNMP

These pull-model data collection—or ingest—methods are explained in detail in the HealthBot Data Ingest Guide.

HealthBot Topics

Network devices are made up of a number of components and systems from CPUs and memory to interfaces and protocol stacks and more. In HealthBot, a topic is the construct used to address those different device components. The Topic block is used to create name spaces that define what needs to be modeled. Each Topic block is made up of one or more Rule blocks which, in turn, consist of the Field blocks, Function blocks, Trigger blocks, etc. See HealthBot Rules - Deep Dive for details. Each rule created in HealthBot must be part of a topic. Juniper has curated a number of these system components into a list of Topics such as:

  • chassis

  • class-of-service

  • external

  • firewall

  • interfaces

  • kernel

  • linecard

  • logical-systems

  • protocol

  • routing-options

  • security

  • service

  • system

You can create sub-topics underneath any of the Juniper topic names by appending .<sub-topic> to the topic name. For example, kernel.tcpip or system.cpu.

Any pre-defined rules provided by Juniper fit within one of the Juniper topics, with the exception of external. The external topic is reserved for user-created rules. In the HealthBot web GUI, when you create a new rule, the Topics field is automatically populated with the external topic name.
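The topic.sub-topic naming described above can be sketched in a few lines of Python. This is only an illustration: the function name split_topic is hypothetical, and rejecting names outside the Juniper topic list is an assumption made for the example, not documented HealthBot behavior.

```python
from typing import Optional, Tuple

# Juniper-curated top-level topics, as listed in this section.
JUNIPER_TOPICS = {
    "chassis", "class-of-service", "external", "firewall", "interfaces",
    "kernel", "linecard", "logical-systems", "protocol", "routing-options",
    "security", "service", "system",
}


def split_topic(name: str) -> Tuple[str, Optional[str]]:
    """Split 'topic.sub-topic' into (topic, sub_topic).

    sub_topic is None when the name has no '.<sub-topic>' suffix.
    Rejecting unknown top-level topics is an assumption of this sketch.
    """
    topic, _, sub_topic = name.partition(".")
    if topic not in JUNIPER_TOPICS:
        raise ValueError(f"unknown topic: {topic!r}")
    return topic, (sub_topic or None)
```

For example, split_topic("kernel.tcpip") yields the topic kernel with the sub-topic tcpip, while split_topic("system") yields system with no sub-topic.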

HealthBot Rules - Basics

HealthBot’s primary function is collecting and reacting to telemetry data from network devices. Defining how to collect the data, and how to react to it, is the role of a rule.

HealthBot ships with a set of default rules, which can be seen on the Configuration > Rules page of the HealthBot GUI, as well as in GitHub in the healthbot-rules repository. You can also create your own rules.

The structure of a HealthBot rule looks like this:

To keep rules organized, HealthBot organizes them into topics. Topics can be very general, like system, or they can be more granular, like protocol.bgp. Each topic contains one or more rules.

As described above, a rule contains all the details and instructions to define how to collect and handle the data. Each rule contains the following required elements:

  • The sensor defines the parameters for collecting the data. This typically includes which data collection method to use (as discussed above in HealthBot Data Collection Methods), some guidance on which data to ingest, and how often to push or pull the data. In any given rule, a sensor can be defined directly within the rule or it can be referenced from another rule.

    • Example: Using the SNMP sensor, poll the network device every 60 seconds to collect all the device data in the Juniper SNMP MIB table jnxOperatingTable.

  • The sensor typically ingests a large set of data, so fields provide a way to filter or manipulate that data, allowing you to identify and isolate the specific pieces of information you care about. Fields can also act as placeholder values, like a static threshold value, to help the system perform data analysis.

    • Example: Extract, isolate, and store the jnxOperating15MinLoadAvg (CPU 15-minute average utilization) value from the SNMP table specified above in the sensor.

  • Triggers periodically bring together the fields with other elements to compare data and determine current device status. A trigger includes one or more ’when-then’ statements, which include the parameters that define how device status is visualized on the health pages.

    • Example: Every 90 seconds, check the CPU 15min average utilization value, and if it goes above a defined threshold, set the device’s status to red on the device health page and display a message showing the current value.

The rule can also contain the following optional elements:

  • Vectors allow you to leverage existing elements to avoid the need to repeatedly configure the same elements across multiple rules.

    • Examples: A rule with a configured sensor, plus a vector to a second sensor from another rule; a rule with no sensors, and vectors to fields from other rules

  • Variables can be used to provide additional supporting parameters needed by the required elements above.

    • Examples: The string “ge-0/0/0”, used within a field collecting status for all interfaces, to filter the data down to just the one interface; an integer, such as “80”, referenced in a field to use as a static threshold value

  • Functions allow you to provide instructions (in the form of a Python script) on how to further interact with data, and how to react to certain events.

    • Examples: A rule that monitors input and output packet counts, using a function to compare the count values; a rule that monitors system storage, invoking a function to cleanup temp and log files if storage utilization goes above a defined threshold
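To make the relationship between these elements concrete, the following Python sketch models a rule as plain data, reusing the SNMP example from the list above. It is not HealthBot's rule syntax: the class names, the rule name check-cpu-load, the field name cpu-15min-avg, and the variable cpu-threshold are all hypothetical, and the trigger term is kept as descriptive text rather than real DSL.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sensor:
    name: str
    ingest_type: str   # data collection method, e.g. "snmp"
    source: str        # what to collect, e.g. an SNMP table
    frequency_s: int   # push or poll interval, in seconds


@dataclass
class Field:
    name: str
    source: str        # where the value comes from (sensor column, constant, ...)


@dataclass
class Trigger:
    name: str
    frequency_s: int   # how often the trigger is evaluated, in seconds
    terms: List[str]   # 'when-then' statements, described in plain text here


@dataclass
class Rule:
    topic: str
    name: str
    sensors: List[Sensor] = field(default_factory=list)
    fields: List[Field] = field(default_factory=list)
    triggers: List[Trigger] = field(default_factory=list)
    variables: Dict[str, str] = field(default_factory=dict)  # optional element


# Hypothetical rule assembled from the examples in the list above.
rule = Rule(
    topic="system.cpu",
    name="check-cpu-load",
    sensors=[Sensor("cpu-sensor", "snmp", "jnxOperatingTable", 60)],
    fields=[Field("cpu-15min-avg", "jnxOperating15MinLoadAvg")],
    triggers=[Trigger(
        "cpu-high", 90,
        ["when cpu-15min-avg is above cpu-threshold, then set status red"],
    )],
    variables={"cpu-threshold": "80"},
)
```

The sensor polls every 60 seconds while the trigger evaluates every 90 seconds, which mirrors the point that collection and evaluation frequencies are set independently.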

Note

Rules, on their own, don’t actually do anything. To make use of rules you need to add them to HealthBot Playbooks.

HealthBot Rules - Deep Dive

A rule is a package of components, or blocks, needed to extract specific information from the network or from a Junos device. Rules conform to a specifically tailored domain specific language (DSL) for analytics applications. The DSL is designed to allow rules to capture:

  • The minimum set of input data that the rule needs to be able to operate

  • The minimum set of telemetry sensors that need to be configured on the device(s)

  • The fields of interest from the configured sensors

  • The reporting or polling frequency

  • The set of triggers that operate on the collected data

  • The conditions or evaluations needed for triggers to kick in

  • The actions or notifications that need to be performed when a trigger kicks in

The details around rules, topics and playbooks are presented in the following sections.

Rules

Rules are meant to be free of any hard coding. Consider threshold values: if a threshold is hard coded, there is no easy way to customize it for a different customer or device that has different requirements. Therefore, rules are defined using parameterization to set the default values. This allows the parameters to be left at default or be customized by the operator at the time of deployment. Customization can be done at the device group or individual device level while applying the HealthBot Playbooks in which the individual rules are contained.

Rules that are device-centric are called device rules. Device components such as chassis, system, linecards, and interfaces are all addressed as HealthBot Topics in the rule definition. Generally, device rules make use of sensors on the devices.

Rules that span multiple devices are called network rules. Network rules:

  • must have a rule-frequency configured

  • must not contain sensors

  • cannot be mixed with device rules in a playbook

To deploy either type of rule, include the rule in a playbook and then apply the playbook to a device group or network group.
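The three network-rule constraints listed above lend themselves to a simple check. The sketch below is a hypothetical validator, not part of HealthBot: RuleDef and both function names are invented for illustration, and a rule is reduced to just the attributes the constraints mention.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class RuleDef:
    """Hypothetical, simplified view of a rule for validation purposes."""
    name: str
    is_network_rule: bool
    sensors: List[str] = field(default_factory=list)
    rule_frequency: Optional[str] = None  # e.g. "60s"


def validate_rule(rule: RuleDef) -> List[str]:
    """Return the network-rule constraints that this rule violates."""
    errors = []
    if rule.is_network_rule:
        if rule.rule_frequency is None:
            errors.append(f"{rule.name}: network rules must have a rule-frequency")
        if rule.sensors:
            errors.append(f"{rule.name}: network rules must not contain sensors")
    return errors


def validate_playbook(rules: Sequence[RuleDef]) -> List[str]:
    """Check every rule, then check that rule types are not mixed."""
    errors = [e for r in rules for e in validate_rule(r)]
    if len({r.is_network_rule for r in rules}) > 1:
        errors.append("playbook: network rules cannot be mixed with device rules")
    return errors
```

An empty list from validate_playbook means none of the three constraints is violated; it says nothing about other rule requirements.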

Note

HealthBot comes with a set of pre-defined rules.

Not all of the blocks that make up a rule are required for every rule. Whether or not a specific block is required in a rule definition depends on what sort of information you are trying to get to. Additionally, some rule components are not valid for network rules. Table 1 lists the components of a rule and provides a brief description of each one.

Table 1: Rule Components

Sensors

  • What it does: The Sensors block is like the access method for getting at the data. There are multiple types of sensors available in HealthBot: OpenConfig, Native GPB, iAgent, SNMP, and syslog.

    It defines what sensors need to be active on the device in order to get to the data fields on which the triggers eventually operate. Sensor names are referenced by the Fields.

    OpenConfig and iAgent sensors require that a frequency be set for push interval or polling interval, respectively. SNMP sensors also require you to set a frequency.

  • Required in device rules? No–Rules can be created that only use a field reference from another rule or a vector with references from another rule. In these cases, rule-frequency must be explicitly defined.

  • Valid for network rules? No

Fields

  • What it does: The source for the Fields block can be a pointer to a sensor, a reference to a field defined in another rule, a constant, or a formula. The field can be a string, integer, or floating point. The default field type is string.

  • Required in device rules? Yes–Fields contain the data on which the triggers operate. Starting in HealthBot release 3.1.0, regular fields and key-fields can be added to rules based on conditional tagging profiles. See the Tagging section below.

  • Valid for network rules? Yes

Vectors

  • What it does: The Vectors block allows handling of lists, creating sets, and comparing elements amongst different sets. A vector is used to hold multiple values from one or more fields.

  • Required in device rules? No

  • Valid for network rules? Yes

Variables

  • What it does: The Variables block allows you to pass values into rules. Invariant rule definitions are achieved through mustache-style templating like {{<placeholder-variable>}}. The placeholder-variable value is set in the rule by default or can be user-defined at deployment time.

  • Required in device rules? No

  • Valid for network rules? No

Functions

  • What it does: The Functions block allows you to extend fields, triggers, and actions by creating prototype methods in external files written in languages like Python. The Functions block includes details on the file path, the method to be accessed, and any arguments, including each argument's description and whether it is mandatory.

  • Required in device rules? No

  • Valid for network rules? No

Triggers

  • What it does: The Triggers block operates on fields and is defined by one or more Terms. When the conditions of a Term are met, then the action defined in the Term is taken.

    By default, triggers are evaluated every 10 seconds, unless explicitly configured for a different frequency.

    By default, all triggers defined in a rule are evaluated in parallel.

  • Required in device rules? Yes–Triggers enable rules to take action.

  • Valid for network rules? Yes

Rule Properties

  • What it does: The Rule Properties block allows you to specify metadata for a HealthBot rule, such as hardware dependencies, software dependencies, and version history.

  • Required in device rules? No

  • Valid for network rules? Yes

Sensors

When defining a sensor, you must specify information such as sensor name, sensor type and data collection frequency. As mentioned in Table 1, sensors can be one of the following:

  • OpenConfig–For information on OpenConfig JTI sensors, see the Junos Telemetry Interface User Guide.
  • Native GPB–For information on Native GPB JTI sensors, see the Junos Telemetry Interface User Guide.
  • iAgent–The iAgent sensors use NETCONF and YAML-based PyEZ tables and views to fetch the necessary data. Both structured (XML) and unstructured (VTY commands and CLI output) data are supported. For information on Junos PyEZ, see the Junos PyEz Documentation.
  • SNMP–Simple Network Management Protocol.
  • syslog–System log.
  • BYOI–Bring your own ingest. Allows you to define your own ingest types.
  • Flow–NetFlow traffic flow analysis protocol.
  • sFlow–sFlow packet sampling protocol.

When different rules have the same sensor defined, only one subscription is made per sensor. A key is used to identify the associated rule: the sensor-path for OpenConfig and Native GPB sensors, and the tuple of file and table for iAgent sensors.

When multiple sensors with the same sensor-path key have different frequencies defined, the lowest frequency is chosen for the sensor subscription.
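The subscription behavior described above can be sketched as a small merge step. Two things here are assumptions rather than documented behavior: frequencies are represented as configured interval values in seconds, and "lowest frequency" is read as the smallest configured value. The function name and the sample sensor paths are illustrative only.

```python
from typing import Dict, Iterable, Tuple


def plan_subscriptions(
    requests: Iterable[Tuple[str, str, int]]
) -> Dict[str, int]:
    """Collapse per-rule sensor requests into one subscription per key.

    Each request is (rule_name, sensor_key, frequency_seconds).
    ASSUMPTION: the lowest configured frequency value wins for a shared key.
    """
    plan: Dict[str, int] = {}
    for _rule_name, key, frequency_s in requests:
        plan[key] = min(frequency_s, plan.get(key, frequency_s))
    return plan


# Three rules, two of which share the same sensor key.
subscriptions = plan_subscriptions([
    ("rule-a", "/interfaces", 30),
    ("rule-b", "/interfaces", 10),
    ("rule-c", "/components", 60),
])
```

Here the two rules that share /interfaces end up with a single subscription, which is the point of keying subscriptions by sensor rather than by rule.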

Fields

There are four types of field sources, as listed in Table 1. Table 2 describes the four field ingest types in more detail.

Table 2: Field Ingest Type Details

Field Type

Details

Sensor

Subscribing to a sensor typically provides access to multiple columns of data. For instance, subscribing to the OpenConfig interface sensor provides access to many values, including counter-related information such as:

/interfaces/counters/tx-bytes,

/interfaces/counters/rx-bytes,

/interfaces/counters/tx-packets,

/interfaces/counters/rx-packets,

/interfaces/counters/oper-state, etc.

Given the rather long names of paths in OpenConfig sensors, the Sensor definition within Fields allows for aliasing, and filtering. For single-sensor rules, the required set of Sensors for the Fields table are programmatically auto-imported from the raw table based on the triggers defined in the rule.

Reference

Triggers can only operate on Fields defined within that rule. In some cases, a Field might need to reference another Field or Trigger output defined in another Rule. This is achieved by referencing the other field or trigger and applying additional filters. The referenced field or trigger is treated as a stream notification to the referencing field. References aren’t supported within the same rule.

References can also take a time-range option which picks the value, if available, from the time-range provided. Field references must always be unambiguous, so proper attention must be given to filtering the result to get just one value. If a reference receives multiple data points, or values, only the latest one is used. For example, if you are referencing the values contained in a field over the last 3 minutes, you might end up with 6 values in that field over that time-range. HealthBot only uses the latest value in a situation like this.

Constant

A field defined as a constant is a fixed value which cannot be altered during the course of execution. HealthBot Constant types can be strings, integers, and doubles.

Formula

Raw sensor fields are the starting point for defining triggers. However, Triggers often work on derived fields defined through formulas by applying mathematical transformations.

Formulas can be pre-defined or user-defined (UDF). Pre-defined formulas include: Min, Max, Mean, Sum, Count, Rate of Change, Elapsed Time, Standard Deviation, Microburst, Dynamic Threshold, Anomaly Detection, Outlier Detection, and Predict.

Some pre-defined formulas can operate on time ranges in order to work with historical data. If a time range is not specified, then the formula works on current data, specified as now.
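As a rough illustration of the Reference and Formula behavior described in Table 2, the Python sketch below selects the data points that fall inside a time range, takes the latest one for a reference, and applies a few of the pre-defined formulas by name. The function names, the FORMULAS mapping, and the sample data are hypothetical; this is not how HealthBot implements these operations.

```python
import statistics
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]  # (timestamp in seconds, value)


def window(points: Sequence[Point], now: float, time_range_s: float) -> List[float]:
    """Values whose timestamps fall within the last time_range_s seconds."""
    return [v for t, v in sorted(points) if now - time_range_s <= t <= now]


def reference_value(
    points: Sequence[Point], now: float, time_range_s: float
) -> Optional[float]:
    """A reference uses only the latest value found in the time range."""
    values = window(points, now, time_range_s)
    return values[-1] if values else None


# A few of the pre-defined formulas, mapped to Python equivalents.
FORMULAS = {
    "min": min,
    "max": max,
    "mean": statistics.mean,
    "sum": sum,
    "count": len,
}

# Six samples, 30 seconds apart, covering the last 3 minutes.
samples = [10.0, 12.0, 11.0, 15.0, 14.0, 16.0]
points = [(30.0 * i, v) for i, v in enumerate(samples, start=1)]
now = 180.0
```

With this data, a reference over the last 3 minutes sees six values but resolves to the most recent one, whereas a formula such as mean operates on all six.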

Vectors

Vectors are useful in helping to gather multiple elements into a single rule. For example, using a vector you could gather all of the interface error fields. The syntax for Vector is:

$field-n can be a field of type reference.

The fields used in defining vectors can be direct references to fields defined in other rules:

This syntax allows for optional filtering through the <field-name>=<field-value> portion of the construct. Vectors can also take a time-range option that picks the values from the time-range provided. When multiple values are returned over the given time-range, they are all selected as an array.

The following pre-defined formulas are supported on vectors:

  • unique @vector1–Returns the unique set of elements from vector1

  • @vector1 and @vector2–Returns the intersection of unique elements in vector1 and vector2.

  • @vector1 or @vector2–Returns the total set of unique elements in the two vectors.

  • @vector1 unless @vector2–Returns the unique set of elements that are in vector1 but not in vector2.
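The four vector formulas above correspond to familiar set operations. The Python sketch below shows equivalent logic; the function names are hypothetical, and preserving first-seen order in the results is a choice made for readability here, not something the source specifies.

```python
from typing import Hashable, List, Sequence


def unique(v: Sequence[Hashable]) -> List[Hashable]:
    """unique @vector1: distinct elements, in first-seen order."""
    return list(dict.fromkeys(v))


def vec_and(a: Sequence[Hashable], b: Sequence[Hashable]) -> List[Hashable]:
    """@vector1 and @vector2: intersection of unique elements."""
    in_b = set(b)
    return [x for x in unique(a) if x in in_b]


def vec_or(a: Sequence[Hashable], b: Sequence[Hashable]) -> List[Hashable]:
    """@vector1 or @vector2: all unique elements from both vectors."""
    return unique(list(a) + list(b))


def vec_unless(a: Sequence[Hashable], b: Sequence[Hashable]) -> List[Hashable]:
    """@vector1 unless @vector2: unique elements of a that are not in b."""
    in_b = set(b)
    return [x for x in unique(a) if x not in in_b]


vector1 = ["ge-0/0/0", "ge-0/0/1", "ge-0/0/0", "ge-0/0/2"]
vector2 = ["ge-0/0/1", "ge-0/0/3"]
```

For these sample vectors, vec_and returns only ge-0/0/1, and vec_unless returns ge-0/0/0 and ge-0/0/2.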

Variables

Variables are defined during rule creation on the Variables page. This part of variable definition creates the default value that gets used if no specific value is set in the device group or on the device during deployment. For example, the check-interface-status rule has one variable called interface_name. The value set on the Variables page is a regular expression (regex), .*, that means all interfaces.

If applied as-is, the check-interface-status rule would provide interface status information about all the interfaces on all of the devices in the device group. While applying a playbook that contains this rule, you could override the default value at the device group or device level. This allows you flexibility when applying rules. The order of precedence is device value overrides device group value and device group value overrides the default value set in the rule.
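The order of precedence described above (device value, then device group value, then the rule default) can be sketched as a simple lookup. The function name and the override values are hypothetical; only interface_name and its default of .* come from the example in the text.

```python
from typing import Mapping, Optional


def resolve_variable(
    name: str,
    rule_defaults: Mapping[str, str],
    group_values: Optional[Mapping[str, str]] = None,
    device_values: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the effective value of a rule variable.

    Scopes are checked from most to least specific:
    device, then device group, then the default set in the rule.
    """
    for scope in (device_values, group_values, rule_defaults):
        if scope is not None and name in scope:
            return scope[name]
    return None


defaults = {"interface_name": ".*"}          # default from the rule
group = {"interface_name": "ge-.*"}          # hypothetical device group override
device = {"interface_name": "ge-0/0/0"}      # hypothetical device override
```

Calling resolve_variable("interface_name", defaults) returns the rule default, while passing the group and device mappings as well returns the device-level value.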

Best Practice

It is highly recommended to supply default values for variables defined in device rules. All Juniper-supplied rules follow this recommendation. Default values must not be set for variables defined in network rules.

Functions

Functions are defined during rule creation on the Functions tab. Defining a function here allows it to be used in Formulas associated with Fields and in the When and Then sections of Triggers. Functions used in the when clause of a trigger are known as user-defined functions. These must return true or false. Functions used in the then clause of a trigger are known as user-defined actions.

Triggers

Triggers play a pivotal role in HealthBot rule definitions. They are the part of the rule that determines if and when any action is taken based on changes in available sensor data. Triggers are constructed in a when-this, then-that manner. As mentioned earlier, trigger actions are based on Terms. A Term is built with when clauses that watch for updates in field values and then clauses that initiate some action based on what changed. Multiple Terms can be created within a single trigger.

Evaluation of the when clauses in the Terms starts at the top of the list of terms and proceeds to the bottom. If a term is evaluated and no match is made, then the next term is evaluated. By default, evaluation proceeds in this manner until either a match is made or the bottom of the list is reached without a match.

Pre-defined operators that can be used in the when clause include:

Note

For evaluated equations, the left-hand side and right-hand side of the equation are shortened to LHS and RHS, respectively, in this document.

  • greater-than–Used for checking if one value is greater than another.

    • Returns: True or False

    • Syntax: greater-than <LHS> <RHS> [time-range <range>]

    • Example: //Memory > 3000 MB in the last 5 minutes

      when greater-than $memory 3000 time-range 5m;

  • greater-than-or-equal-to–Same as greater-than but checks for greater than or equal to (>=)

  • less-than–Used for checking if one value is less than another.

    • Returns: True or False

    • Syntax: less-than <LHS> <RHS> [time-range <range>]

    • Example: //Memory < 6000 MB in the last 5 minutes

      when less-than $memory 6000 time-range 5m;

  • less-than-or-equal-to–Same as less-than but checks for less than or equal to (<=)

  • equal-to–Used for checking that one value is equal to another value.

    • Returns: True or False

    • Syntax: equal-to <LHS> <RHS> [time-range <range>]

    • Example: //Queue’s buffer utilization % == 0 

      when equal-to $buffer-utilization 0;

  • not-equal-to–Same as equal-to but checks for negative condition (!=)

  • exists–Used to check if some value exists without caring about the value itself; that is, some value should have been sent from the device.

    • Returns: True or False

    • Syntax: exists <$var> [time-range <range>]

    • Example: //Has the device configuration changed? 

      when exists $netconf-data-change;

  • matches-with (for strings & regex)–Used to check for matches on strings using Python regex operations. See Python Regular Expressions for details.

    Note

    LHS, or left hand side, is the string in which we are searching; RHS, or right hand side, is the match expression. Regular expressions can only be used in RHS.

    • Returns: True or False

    • Syntax: matches-with <LHS> <RHS> [time-range <range>]

    • Example: //Checks that ospf-neighbor-state has been UP for the past 10 minutes

      when matches-with $ospf-neighbor-state "^UP$" time-range 10m;

  • does-not-match-with (for strings & regex)–Same as matches-with but checks for negative condition

  • range–Checks whether a value, X, falls within a given range such as minimum and maximum (min <= X <= max)

    • Returns: True or False

    • Syntax: range <$var> min <minimum value> max <maximum value> [time-range <range>]

    • Example: //Checks whether memory usage has been between 3000 MB and 6000 MB in the last 5 minutes

      when range $mem min 3000 max 6000 time-range 5m;

  • increasing-at-least-by-value–Used to check whether values are increasing by at least the minimum acceptable rate compared to the previous value. An optional parameter that defines the minimum acceptable rate of increase can be provided. The minimum acceptable rate of increase defaults to 1 if not specified.

    • Returns: True or False

    • Syntax:

      increasing-at-least-by-value <$var> [increment <minimum value of increase between successive points>]

      increasing-at-least-by-value <$var> [increment <minimum value of increase between successive points>] time-range <range>

    • Example: //Checks that the ospf-tx-hello has been increasing steadily over the past 5 minutes

      when increasing-at-least-by-value $ospf-tx-hello increment 10 time-range 5m;

  • increasing-at-most-by-value–Used to check whether values are increasing by no more than the maximum acceptable rate compared to the previous value. An optional parameter that defines the maximum acceptable rate of increase can be provided. The maximum acceptable rate of increase defaults to 1 if not specified.

    • Returns: True or False

    • Syntax:

      increasing-at-most-by-value <$var> [increment <maximum value of increase between successive points>]

      increasing-at-most-by-value <$var> [increment <maximum value of increase between successive points>] time-range <range>

    • Example: //Checks that the error count has not increased by more than 5 between successive values over the past 5 minutes

      when increasing-at-most-by-value $error-count increment 5 time-range 5m;

  • increasing-at-least-by-rate–Used for checking that the rate of increase between successive values is at least the given rate. Mandatory parameters include the value and time-unit, which together signify the minimum acceptable rate of increase.

    • Returns: True or False

    • Syntax:

      This syntax compares current value against previous value ensuring that it increases at least by value rate.

      increasing-at-least-by-rate <$var> value <minimum value of increase between successive points> per <second|minute|hour|day|week|month|year> [time-range <range>]

      This syntax compares current value against previous value ensuring that it increases at least by percentage rate

      increasing-at-least-by-rate <$var> percentage <percentage> per <second|minute|hour|day|week|month|year> [time-range <range>]

    • Example: //Checks that the ospf-tx-hello has been increasing strictly over the past five minutes

      when increasing-at-least-by-rate $ospf-tx-hello value 1 per second time-range 5m;

  • increasing-at-most-by-rate–Similar to increasing-at-least-by-rate, except that this checks that the rate of increase between successive values is at most the given rate.

Using these operators in the when clause creates a function known as a user-defined condition. These functions should always return true or false.
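To give a feel for how some of these operators behave, the Python sketch below implements four of them over a list of values. It rests on an assumption that the source does not state explicitly: the values passed in are the data points already selected by the time-range option, and a condition is true only when it holds for every point (or every pair of successive points). The function names simply mirror the operator names and are not HealthBot APIs.

```python
import re
from typing import Sequence


def greater_than(values: Sequence[float], rhs: float) -> bool:
    """greater-than: every value in the range exceeds rhs."""
    return bool(values) and all(v > rhs for v in values)


def in_range(values: Sequence[float], minimum: float, maximum: float) -> bool:
    """range: every value satisfies min <= X <= max."""
    return bool(values) and all(minimum <= v <= maximum for v in values)


def matches_with(values: Sequence[str], pattern: str) -> bool:
    """matches-with: every string matches the regular expression (the RHS)."""
    return bool(values) and all(re.search(pattern, v) is not None for v in values)


def increasing_at_least_by_value(values: Sequence[float], increment: float = 1) -> bool:
    """increasing-at-least-by-value: each successive point grows by >= increment."""
    pairs = list(zip(values, values[1:]))
    return bool(pairs) and all(b - a >= increment for a, b in pairs)
```

Each function returns True or False, in line with the requirement that conditions in a when clause always evaluate to a boolean. Returning False for an empty input is a design choice of this sketch.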

If evaluation of a term results in a match, then the action specified in the Then clause is taken. By default, processing of terms stops at this point. You can alter this flow by enabling the Evaluate next term button at the bottom of the Then clause. This causes HealthBot to continue term processing to create more complex decision-making capabilities like when-this and this, then that.
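The term-processing flow described above (top-to-bottom evaluation, stopping at the first match unless Evaluate next term is enabled) can be sketched as a short loop. The Term class, the evaluate_trigger function, and the cpu thresholds are hypothetical; actions are reduced to status strings for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class Term:
    name: str
    when: Callable[[Dict[str, float]], bool]  # condition over field values
    then: str                                 # action, e.g. a status color
    evaluate_next: bool = False               # models "Evaluate next term"


def evaluate_trigger(terms: Sequence[Term], fields: Dict[str, float]) -> List[str]:
    """Evaluate terms in order and return the actions that were taken.

    Processing stops at the first matching term unless that term
    has evaluate_next enabled.
    """
    actions: List[str] = []
    for term in terms:
        if term.when(fields):
            actions.append(term.then)
            if not term.evaluate_next:
                break
    return actions


terms = [
    Term("high", lambda f: f["cpu"] > 80, "red"),
    Term("medium", lambda f: f["cpu"] > 60, "yellow"),
    Term("normal", lambda f: True, "green"),
]
```

With these terms, a cpu value of 90 produces only the red action because processing stops at the first match; setting evaluate_next on the first term would let the second term be evaluated as well.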

The following is a list of pre-defined actions available for use in the Then section:

  • next

  • status

Tagging

Starting with Release 3.1.0, HealthBot supports tagging. Tagging allows you to insert fields, values, and keys into a HealthBot rule when certain conditions are met. See HealthBot Tagging for details.

Rule Properties

The Rule Properties block allows you to specify metadata for a HealthBot rule, such as hardware dependencies, software dependencies, and version history. This data can be used for informational purposes or to verify whether or not a device is compatible with a HealthBot rule.

HealthBot Playbooks

In order to fully understand any given problem or situation on a network, it is often necessary to look at a number of different system components, topics, or key performance indicators (KPIs). HealthBot operates on playbooks, which are collections of rules for addressing a specific use case. Playbooks are the HealthBot element that gets applied, or run, on your device groups or network groups.

HealthBot comes with a set of pre-defined Playbooks. For example, the system-KPI playbook monitors the health of system parameters such as system-cpu-load-average, storage, system-memory, process-memory, etc. It then notifies the operator or takes corrective action in case any of the KPIs cross pre-set thresholds. Following is a list of Juniper-supplied Playbooks.

  • bgp-session-stats

  • route-summary-playbook

  • lldp-playbook

  • interface-kpis-playbook

  • system-kpis-playbook

  • linecard-kpis-playbook

  • chassis-kpis-playbook

You can create a playbook and include any rules you want in it. You apply these playbooks to device groups. By default, all rules contained in a Playbook are applied to all of the devices in the device group. There is currently no way to change this behavior.

If your playbook definition includes network rules, then the playbook becomes a network playbook and can only be applied to network groups.

Release History Table
Release    Description
3.1.0      Starting in HealthBot release 3.1.0, regular fields and key-fields can be added to rules based on conditional tagging profiles.
3.1.0      Starting with Release 3.1.0, HealthBot supports tagging.