Alarms

With Contrail Insights Alarms, you can configure an alarm to be generated when a condition is met in the infrastructure. Contrail Insights performs distributed analysis of metrics at the point of collection for efficient and responsive detection of events that match an alarm. Contrail Insights has two types of alarms:

Static—User-provided static threshold is used for comparison.
Dynamic—Dynamically-learned adaptive threshold is used for comparison.

Contrail Insights Alarms Overview

For both static and dynamic alarms, Contrail Insights Agent continuously collects measurements of metrics (see Metrics Collected by Contrail Insights) for different entities, such as hosts, instances, and network devices. Beyond simple collection, the agent also analyzes the stream of metrics at the time of collection to identify alarm rules that match. For a particular alarm, the agent aggregates the samples according to a user-specified function (average, standard deviation, min, max, sum) and produces a single measurement for each user-specified measurement interval. For a given measurement interval, the agent compares each measurement to a threshold. For an alarm with a static threshold, a measurement is compared to a fixed value using a user-specified comparison function (above, below, equal). For dynamic thresholds, a measurement is compared with a value learned by Contrail Insights over time.

You can further configure alarm parameters so that multiple intervals must match before an alarm is raised. This allows you to configure alarms that match sustained conditions while still measuring performance over small time periods. A maximum value over a wide time range can exaggerate conditions, whereas an average can dilute the information. A better balance is achieved by measuring over small intervals and watching for repeated matches across multiple intervals. For example, to monitor CPU usage over a three-minute period, an alarm can be configured to compare average CPU utilization over five-second intervals, yet only raise an alarm when 36 (or some subset of 36) intervals match the alarm condition. This provides better visibility into sustained performance conditions than a simple average or maximum over three minutes.
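The trade-off between a wide-range maximum, a wide-range average, and per-interval matching can be illustrated with a small sketch. The sample values, the one-sample-per-second rate, and the helper name interval_averages are assumptions chosen for illustration; they are not taken from Contrail Insights.

```python
from statistics import mean


def interval_averages(samples, interval_len):
    """Split samples into consecutive intervals and average each one."""
    return [
        mean(samples[i:i + interval_len])
        for i in range(0, len(samples), interval_len)
    ]


# Hypothetical data: one CPU sample per second for three minutes (180 samples).
# CPU sits at 10% except for a single five-second spike to 100%.
samples = [10.0] * 180
samples[60:65] = [100.0] * 5

window_max = max(samples)    # the short spike dominates the three-minute maximum
window_avg = mean(samples)   # the spike is diluted in the three-minute average

# Thirty-six five-second intervals, as in the CPU example above.
averages = interval_averages(samples, interval_len=5)
matching = sum(1 for a in averages if a > 80.0)

print(window_max, window_avg, len(averages), matching)
```

With this data only one of the 36 intervals matches, so a rule that requires several matching intervals would not fire on the isolated spike, while a rule on the three-minute maximum would.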

Dynamic thresholds enable outlier detection in resource consumption based on historical trends. Resource consumption may vary significantly at various hours of the day and days of the week. This makes it difficult to set a static threshold for a metric. For example, 70% CPU usage may be considered normal for Monday mornings between 10:00 AM and 12:00 PM, but the same amount of CPU usage may be considered abnormally high for Saturday nights between 9:00 PM and 10:00 PM.

With dynamic thresholds, Contrail Insights learns trends in metrics across all resources in scope to which an alarm applies. For example, if an alarm is configured for a host aggregate, Contrail Insights learns a baseline from metric values collected for hosts in that aggregate. Similarly, an alarm with a dynamic threshold configured for a project learns a baseline from metric values collected for instances in that project. Then, the agent generates an alarm when a measurement deviates from the baseline value learned for a particular time period.

When creating an alarm with a dynamic threshold, you select a metric, a period of time over which to establish a baseline, and the sensitivity to measurements that deviate from the baseline. The sensitivity can be configured as high, medium, or low. Higher sensitivity will report smaller deviations from the baseline and vice versa.

Contrail Insights Alarms Operation

Contrail Insights Agent performs distributed, real-time statistical analysis on a time-series data stream. The agent analyzes metrics over multiple measurement intervals using a configurable sliding window mechanism. An alarm is generated when the agent determines that metric data matches the alarm criteria over a configurable number of measurement intervals. The type of sample aggregation and the threshold for an alarm are configurable. Two types of alarms are supported: static and dynamic. The difference is how the threshold is determined and used to compare measured metric data. The following sections describe the overall sliding window analysis and explain the details of the static thresholds and dynamic baselines used by the analysis.

Sliding Window Analysis
Static Alarm
Dynamic Alarm

Sliding Window Analysis

Contrail Insights Agent evaluates alarms using sliding window analysis. The sliding window analysis compares a stream of metrics within a configurable measurement interval to a static threshold or dynamic baseline. The length of each measurement interval is configurable to one-second granularity. In each measurement interval, raw time-series data samples are combined using an aggregation function, such as average, max, and min. The aggregated value is compared against the static threshold or dynamic baseline using a configurable comparison function, such as above or below. Multiple measurement intervals comprise a sliding window. A configurable number of intervals in the sliding window must match the rule criteria for the agent to generate a notification for the alarm.

Figure 1: Alarm Generation Mechanics

Figure 1 shows an example in which the sliding window consists of six adjacent measurement intervals (i1 to i6), as specified by the Interval Count parameter. In measurement interval i1, the average of samples S1, S2, and S3 is computed as S_avg. Depending on the alarm type (static or dynamic), S_avg is then compared with the configured static threshold or the dynamically learned baseline using a user-specified comparison function such as above or below. The output of the comparison determines whether a specific measurement interval is marked as an interval with exception. This evaluation is repeated for each measurement interval within the sliding window (for example, i1 to i6).

In the example in Figure 1, the agent determines that two intervals, i2 and i5, are intervals with exception by comparing the aggregate value for the measurement interval with a static threshold or dynamic baseline, depending on the alarm type. Assuming interval i1 is the first interval for which the alarm is configured, the alarm becomes active at the end of interval i6, when Contrail Insights Agent determines that at least two out of the most recent six measurement intervals are marked as exceptions. When an alarm is configured using the Dashboard, Interval Count and Intervals with Exception are both set to 1 by default. As a result, the agent can generate an alarm after processing data for one measurement interval.
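The sliding window evaluation described above can be sketched as follows. This is a minimal illustration, not the agent's implementation: the class name, the fixed choice of average and above, the sample values, and the rule that the window must be full before the alarm can become active (inferred from the Figure 1 description) are all assumptions.

```python
from collections import deque


class SlidingWindowAlarm:
    """Hypothetical evaluator: active when enough recent intervals are exceptions."""

    def __init__(self, threshold, interval_count=1, intervals_with_exception=1):
        if intervals_with_exception > interval_count:
            raise ValueError("intervals_with_exception cannot exceed interval_count")
        self.threshold = threshold
        self.required = intervals_with_exception
        # Only the most recent interval_count comparison results are kept.
        self.window = deque(maxlen=interval_count)

    def add_interval(self, samples):
        """Aggregate one interval (average), compare (above), update the window."""
        s_avg = sum(samples) / len(samples)
        self.window.append(s_avg > self.threshold)
        window_full = len(self.window) == self.window.maxlen
        return window_full and sum(self.window) >= self.required


# Six intervals as in Figure 1: i2 and i5 are intervals with exception.
alarm = SlidingWindowAlarm(threshold=80.0, interval_count=6,
                           intervals_with_exception=2)
intervals = [
    [50, 55, 60],   # i1
    [85, 90, 95],   # i2: average above 80
    [40, 45, 50],   # i3
    [60, 65, 70],   # i4
    [90, 92, 94],   # i5: average above 80
    [30, 35, 40],   # i6
]
states = [alarm.add_interval(s) for s in intervals]
print(states)
```

With the default values of 1 for both parameters, the same class reports an active alarm after the first interval whose average exceeds the threshold.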

Static Alarm

A static alarm threshold is provided at the time of alarm definition. Figure 2 depicts an example of a static alarm definition, followed by the equivalent JSON used for API configuration of an alarm. The condition defined in the example is to evaluate an average of host.cpu.usage samples over a 60-second measurement interval. The measured value is compared against a static threshold of 80% to determine whether a given measurement interval matches the alarm rule. Figure 2 identifies the components in a static alarm definition.

Figure 2: Static Alarm Definition

The following example shows the JSON equivalent to the static alarm definition shown in Figure 2:

Dynamic Alarm

A dynamic alarm threshold is learned by Contrail Insights using historical data for the set of entities for which an alarm is configured. Figure 3 shows an example of a dynamic alarm definition, followed by the equivalent JSON used for API configuration of an alarm. Figure 3 identifies the components in a dynamic alarm definition.

Figure 3: Dynamic Alarm Definition

The following example shows the JSON equivalent to the dynamic alarm definition shown in Figure 3:

When using a dynamic threshold, you do not configure a static threshold value. Instead, you specify three parameters that control how the learning is performed. The learning algorithm produces a baseline across the entities. The baseline consists of a mean value and a standard deviation, and it is updated continuously as additional metric data is collected.

Following is a list of the three learning parameters and information about how they work:

BaselineAnalysisAlgorithm

Selects the machine learning algorithm used for determining the dynamic threshold. The following algorithms are available:

k-means

Contrail Insights employs a k-means algorithm to produce an expected operating range for a set of entities at a granularity of each hour of each day (up to one week). The learned baselines are computed using data from a configurable learning period duration. The baselines are updated continuously over time, based on the most recent data. The k-means Baseline Analysis Algorithm is useful for observing performance that is unexpected for a given time of day.

For example, a k-means algorithm may learn a dynamic baseline of 80% +/- 10% for 1:00 PM - 2:00 PM, whereas the baseline for 3:00 AM - 4:00 AM may be 20% +/- 5%. An alarm is raised if the measured metric value is 75% between 3:00 AM and 4:00 AM, but the same measurement is acceptable during the 1:00 PM - 2:00 PM time period.
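The time-of-day comparison in this example can be sketched as a simple lookup. The sketch does not implement k-means; it only shows how a per-hour baseline, once learned, changes the outcome for the same measurement. The dictionary layout and the function name deviates are hypothetical, and the real baselines also distinguish days of the week.

```python
# Hypothetical per-hour baselines as (mean, allowed deviation) in percent,
# using the values from the example above. Keys are hours of the day (0-23).
hourly_baseline = {
    13: (80.0, 10.0),   # 1:00 PM - 2:00 PM
    3: (20.0, 5.0),     # 3:00 AM - 4:00 AM
}


def deviates(hour, value):
    """Return True when value falls outside the baseline range for that hour."""
    mean, spread = hourly_baseline[hour]
    return not (mean - spread <= value <= mean + spread)


print(deviates(3, 75.0))    # measured at 3:00 AM
print(deviates(13, 75.0))   # measured at 1:00 PM
```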

ewma

The Exponentially Weighted Moving Average (EWMA) algorithm produces a single baseline that is updated hourly. The configurable Learning Period duration allows you to control the relative weight assigned to recent data versus older data. This algorithm is useful to create an alarm that can detect sudden changes in a metric.

For example, an EWMA algorithm can learn a dynamic baseline of 60% +/- 10% from data over the last 24 hours. This baseline is used for the next 1-hour interval to determine if real-time data deviates from the normal operating region. After every 1-hour interval, the EWMA baseline is updated and a new updated baseline is used for alarm generation in the future.
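A generic exponentially weighted update of a mean and a variance can be sketched as follows. This is a textbook EWMA recurrence, not the Contrail Insights implementation: the class name, the smoothing factor alpha, and the starting variance of zero are assumptions. The text above does not state how the Learning Period maps to weights; in this sketch a larger alpha plays the role of a shorter learning period by giving recent data more weight.

```python
import math


class EwmaBaseline:
    """Illustrative EWMA of mean and variance; alpha is an assumed smoothing factor."""

    def __init__(self, first_value, alpha=0.1):
        self.alpha = alpha
        self.mean = float(first_value)
        self.variance = 0.0

    def update(self, value):
        """Fold one new measurement into the baseline."""
        delta = value - self.mean
        self.mean += self.alpha * delta
        self.variance = (1 - self.alpha) * (self.variance + self.alpha * delta * delta)

    @property
    def std_dev(self):
        return math.sqrt(self.variance)


baseline = EwmaBaseline(60.0, alpha=0.5)
baseline.update(80.0)
print(baseline.mean, baseline.std_dev)
```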

LearningPeriodDuration

A dynamic baseline is determined using historical data. This parameter sets the length of the time period from which the most recent historical data is used to compute a dynamic baseline, for example, 1 hour, 1 day, or 1 week. At the time of rule configuration, Contrail Insights might not yet have enough historical data for a given entity. In this case, learning is performed as data becomes available. Alarm evaluation begins after one Learning Period of data is available and baselines are generated.

Sensitivity

The sensitivity of a dynamic alarm controls the allowable magnitude of deviation from the learned mean. The sensitivity parameter controls a multiplier of the learned standard deviation. You can select low, medium, or high as sensitivity. Contrail Insights Agent compares real-time measurements to the range defined by:

mean - sensitivity * std_dev < x < mean + sensitivity * std_dev
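The range check can be written out as a short sketch. The multipliers are the ones listed for the Sensitivity parameter in Table 5 (low 5, medium 3, high 2); the function and constant names are hypothetical.

```python
# Multipliers of the learned standard deviation, as listed in Table 5.
SENSITIVITY_MULTIPLIER = {"low": 5, "medium": 3, "high": 2}


def operating_range(mean, std_dev, sensitivity):
    """Return the (lower, upper) bounds of the normal operating region."""
    k = SENSITIVITY_MULTIPLIER[sensitivity]
    return mean - k * std_dev, mean + k * std_dev


def is_outlier(x, mean, std_dev, sensitivity):
    """True when x is outside mean +/- multiplier * std_dev."""
    lower, upper = operating_range(mean, std_dev, sensitivity)
    return not (lower < x < upper)


print(operating_range(20, 2, "low"))
print(is_outlier(25, 20, 2, "low"), is_outlier(25, 20, 2, "high"))
```

The same measurement of 25 is inside the band at low sensitivity and outside it at high sensitivity, which is why higher sensitivity reports smaller deviations.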

Alarm Definition

Figure 2 shows an example of a static alarm definition and is followed by the JSON for the same rule. Every alarm definition has the following components shown in Figure 4.

Figure 4: Static Alarm Rule Configuration Example

The listed components for alarm definition are numbered and described in the following text:

1. Name

A name identifies the alarm. Name is displayed in the Dashboard and is the user-facing identifier for external notification systems.

2. Module

When Alarms is selected, you can configure alarms for entities such as hosts, instances, and network devices. When Service Alarms is selected, then you are able to configure alarms for services such as RabbitMQ, MySQL, ScaleIO, and OpenStack services.

3. Alarm Rule Type

Determines the type of threshold that the alarm uses to decide whether an alarm should be generated. The following two types are supported:

Static—When an alarm is defined as static, the rule definition should include a predefined static threshold. For example, cpu.usage static threshold can be 80%.
Dynamic—When an alarm is defined as dynamic, the baseline is learned using historical data. Additional parameters are required such as baseline analysis algorithm, learning period duration, and sensitivity.

4. Event Rule Scope

The type of entity, such as host, instance, or network device, to which the alarm applies. For example, if the scope is Instance, then you can further choose to apply the rule to all instances present in the infrastructure, or to instances that are present in a specific project or aggregate.

5. Aggregate

Select the set of entities an alarm will monitor. If Scope is Instance, then you can configure an alarm for the set of instances present in a specific project, aggregate, or all instances in the infrastructure. If Scope is Host, then you can configure an alarm for a set of hosts present in a specific aggregate or all hosts in the infrastructure.

6. Alarm Mode

Mode can be configured as an alert or event.

Alert—An alarm with the mode set to Alert has state. Events are generated and recorded only for changes in the state of the alarm. Table 1 shows all possible states for an alarm with the mode configured as alert. Figure 5 shows an example of different state transitions for an alarm for the cpu.usage metric with a static threshold of 50%.
Event—An alarm with the mode set to Event is evaluated similarly to an alarm with the mode set to Alert. The key difference is that an alarm with the mode set to Event keeps generating notifications with a state of triggered for each interval in which the condition for the alarm is satisfied. When the conditions for an alarm are not satisfied, the agent stops generating notifications about the alarm. As shown in Figure 6, an alarm with the mode set to Event generates significantly more notifications than an alarm with the mode set to Alert.

Figure 5: Alarm State Transition with Mode as Alert for Cpu.usage Static Threshold = 50%

Table 1: States for Alarm Mode Defined as Alert
State	Description
Learning	This is the initial state of each alarm. In this state, the alarm is processing real-time data. The alarm stays in this state until sufficient data has been processed to decide whether an alarm should be generated. The duration of the learning period depends on the sliding window parameters. Figure 5 shows the learning state when the rule is configured in the system.
Active	The condition specified by an alarm is met. The alarm stays in this state as long as the alarm conditions are satisfied. Figure 5 shows the active state when CPU usage is detected as 76.05%.
Inactive	The condition specified by an alarm is not met. In Figure 5, after the learning state, the alarm transitions to the inactive state because CPU usage was 13.5% (below the 50% threshold). The alarm transitions from the active state to the inactive state when CPU usage drops to 15.65%.
Disabled	Agent is not actively analyzing data for this alarm. The alarm is either deleted or temporarily deactivated by the user.

Figure 6: Alarm State Transition with Mode as Event

Table 2: States for Alarm Mode Defined as Event
State	Description
Enabled	This is the initial state of an alarm with the mode set to Event when a rule is configured. The alarm stays in this state until conditions are met to generate an alarm. Figure 6 shows that the enabled state is logged when an alarm with the mode set to Event is configured.
Triggered	When the conditions for alarm generation are satisfied, an alarm is generated with a state of triggered. Alarm generation is logged at the end of each measurement interval as long as the conditions for the alarm continue to be met. In Figure 6, seven alarm events are generated for the duration when cpu.usage stays above 50%.
Disabled	Agent is not actively analyzing data for this alarm. The alarm is either deleted or has been temporarily deactivated by the user.
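The difference in notification volume between the two modes can be sketched as follows. The input list holds the per-interval outcome of the alarm evaluation (True when the alarm conditions are satisfied). The function names are hypothetical, and the sketch omits the initial learning or enabled log entry and the disabled state.

```python
def alert_notifications(matches):
    """Alert mode: emit a notification only when the state changes."""
    notifications, state = [], "learning"
    for matched in matches:
        new_state = "active" if matched else "inactive"
        if new_state != state:
            notifications.append(new_state)
            state = new_state
    return notifications


def event_notifications(matches):
    """Event mode: emit 'triggered' for every interval that satisfies the alarm."""
    return ["triggered" for matched in matches if matched]


# One quiet interval, seven intervals above the threshold, then quiet again.
matches = [False] + [True] * 7 + [False]
print(alert_notifications(matches))
print(len(event_notifications(matches)))
```

For the same seven matching intervals, Alert mode records three state changes while Event mode records seven triggered notifications.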

7. Metric Name

The metric to monitor, chosen from Metrics Collected by Contrail Insights. For example, host.cpu.usage or instance.cpu.usage.

8. Aggregation Function

Determines how data samples received in one measurement interval are processed to generate an aggregated value for comparison. The agent collects multiple samples of a metric during a measurement interval and combines them according to the aggregation function to produce a single value for comparison with the threshold (static or dynamic). Table 3 lists and describes the aggregation functions for alarm processing.

Table 3: Aggregation Functions for Alarm Processing
Aggregation Function	Description
Average	Statistical average of all data samples received within one measurement interval. Example: Generate Host Alert when Cpu-Usage Average during a 60 seconds interval is Above 80% of 2 of the last 3 measurement intervals. In this example, the measurement interval is 60 seconds. An alarm is generated if the average of the CPU usage samples exceeds 80% in any 2 measurement intervals out of 3 adjacent measurement intervals.
Sum	Sum of all data samples received within one measurement interval. Example: Generate Host Alert when Cpu-Usage Sum during a 60 seconds interval is Above 250% of 2 of the last 3 measurement intervals. In this example, the alarm is generated if the CPU usage sum is above 250% in any 2 measurement intervals out of 3 adjacent measurement intervals, where each measurement interval is 60 seconds in duration.
Max	Maximum sample value observed within one measurement interval. Example: Generate Host Alert when Cpu-Usage Max during a 60 seconds interval is Above 95% of 2 of the last 3 measurement intervals. In this example, the alarm is generated if the maximum CPU usage is above 95% in any 2 measurement intervals out of 3 adjacent measurement intervals, where each measurement interval is 60 seconds in duration.
Min	Minimum sample value observed within one measurement interval. Example: Generate Host Alert when Cpu-Usage Min during a 60 seconds interval is Below 5% of 2 of the last 3 measurement intervals. In this example, the alarm is generated if the minimum CPU usage is below 5% in any 2 measurement intervals out of 3 adjacent measurement intervals, where each measurement interval is 60 seconds in duration.
Std-Dev	Standard Deviation of the time-series data is determined based on the samples received until current measurement interval. Example: Generate Host Alert when Cpu-Usage std-dev during a 60 seconds interval is Above 2 sigma of 2 of the last 3 measurement intervals. In this example, the alarm is generated when the raw time series samples are above `mean + 2*sigma` in at least 2 measurement intervals out of the last 3 measurement intervals, where each measurement interval is a duration of 60 seconds.
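The per-interval aggregation in Table 3 can be sketched with a small dispatch table. The names AGGREGATIONS and aggregate and the sample values are hypothetical. Std-Dev is left out on purpose, because Table 3 describes it as computed over the samples received up to the current interval rather than over a single interval.

```python
from statistics import mean

# Map each aggregation function name to a callable over one interval's samples.
AGGREGATIONS = {
    "average": mean,
    "sum": sum,
    "max": max,
    "min": min,
}


def aggregate(samples, function):
    """Reduce the samples of one measurement interval to a single value."""
    return AGGREGATIONS[function](samples)


# Hypothetical CPU usage samples (percent) from one 60-second interval.
samples = [70, 85, 95]
for name in AGGREGATIONS:
    print(name, aggregate(samples, name))
```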

9. Comparison Function

Determines how to compare output of the Aggregation Function with the static or dynamic threshold. Table 4 shows different comparison functions supported for Contrail Insights alarms. Figure 7 and Figure 8 show examples of the Comparison Function, showing both increases and decreases at a minimum rate.

Figure 7: Comparison Function Showing Increasing-at-a-minimum-rate-of

Figure 8: Comparison Function Showing Decreasing-at-a-minimum-rate-of

Table 4: Comparison Functions for Alarm Processing
Comparison Operator	Description
Above	Determines whether the result of the aggregation function within a given measurement interval is above the threshold. Note: For a dynamic threshold, Above checks whether the result of the aggregation function is outside of the normal operating region (mean +/- sigma*sensitivity).
Below	Determines whether the result of the aggregation function for a given measurement interval is below the threshold. Note: For a dynamic threshold, Below checks whether the result of the aggregation function is within the normal operating region (mean +/- sigma*sensitivity).
Equal	Determines whether the result of the aggregation function is equal to the threshold.
Increasing-at-a-minimum-rate-of	This comparison function is useful when you are interested in tracking a sudden increase in the value of a given metric instead of its absolute value. For example, if ingress or egress network bandwidth starts increasing within short intervals, then you might want to raise an alarm. Figure 7 shows a sudden increase in the metric average between measurement intervals i1 and i2. Similarly, a sudden increase is observed in the metric average between measurement intervals i4 and i5. Example: Generate Host Alert when the host.network.ingress.bit_rate average during a 60 seconds interval is increasing-at-a-minimum-rate-of 25% of 2 of the last 3 measurement intervals. In the example, if the mean ingress bit rate increases by at least 25% in 2 measurement intervals out of 3, then an alarm is raised.
Decreasing-at-a-minimum-rate-of	This comparison function is useful when you are interested in tracking a sudden decrease in the value of a given metric instead of its absolute value. For example, if egress network bandwidth starts decreasing within short intervals, then you might want to raise an alarm to investigate the root cause. Figure 8 shows a sudden decrease in the metric average between measurement intervals i1 and i2. Similarly, a sudden decrease is observed in the metric average between measurement intervals i3 and i4. Example: Generate Host Alert when the host.network.egress.bit_rate average during a 60 seconds interval is decreasing-at-a-minimum-rate-of 25% of 2 of the last 3 measurement intervals. In the example, if the mean egress bit rate decreases by at least 25% in 2 measurement intervals out of 3, then an alarm is raised.
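A rate-based comparison can be sketched as follows. The exact definition of the rate is not given above; this sketch assumes it is the percent change of an interval's average relative to the previous interval's average. The function names and the bit-rate values are hypothetical.

```python
def percent_change(previous, current):
    """Percent change from previous to current; assumes previous is non-zero."""
    return (current - previous) / previous * 100.0


def increasing_at_min_rate(averages, min_rate):
    """Flag each interval whose average rose by at least min_rate percent
    relative to the previous interval."""
    return [
        percent_change(prev, cur) >= min_rate
        for prev, cur in zip(averages, averages[1:])
    ]


# Hypothetical per-interval averages for intervals i1 to i5.
averages = [100.0, 130.0, 135.0, 140.0, 180.0]
print(increasing_at_min_rate(averages, min_rate=25.0))
```

The first and last comparisons (i1 to i2 and i4 to i5) exceed 25%, so those intervals would count as intervals with exception. A decreasing check follows the same shape with the sign of the change reversed.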

10. Threshold

A numeric value to which measurements are compared. Contrail Insights supports two types of thresholds: static or dynamic.

Static Threshold—A fixed value that is specified when an alarm is configured. For example host.cpu.usage above 90%, where 90% is the static threshold.
Dynamic Threshold—The threshold is learned dynamically by the system. Unsupervised learning is used to learn about historical trends to determine the dynamic threshold. For example, if an event rule is defined for Host aggregate, then the dynamic baseline is determined for the aggregate by applying the baseline analysis algorithm to data received from all member hosts of the aggregate. Figure 9 shows the dynamic baseline determined using the most recent 24-hour time frame of historical data and k-means clustering algorithm. This baseline is used for the next 24 hours for alarm generation while considering the hour of the day and its corresponding baseline mean and standard deviation. For example, on Tuesday 8:00 AM - 9:00 AM, a baseline computed for Monday 8:00 AM - 9:00 AM is used as a reference threshold for alarm generation.

Figure 9 shows the dynamic baseline computed by 24 hours of data and the k-means clustering algorithm. For a given hour of the day, the blue dot is the mean; the green bar is the mean + std-dev; the purple bar is mean - std-dev.

Figure 9: Dynamic Baseline Determined by Last 24 Hours of Data and K-Means Clustering Algorithm

Figure 10 shows the dynamic baseline computed by 24 hours of historical data using the EWMA algorithm. This baseline is used for the next 1 hour for alarm generation until it is updated again using the most recent 24 hours of data.

Figure 10: Dynamic Baseline Determined by Last 24 Hours of Historical Data Using EWMA

Figure 11 shows the mandatory parameters that must be specified to configure a dynamic alarm.

Figure 11: Required Parameters for the Dynamic Threshold in the Alarm Definition

Table 5 describes the required parameters for a dynamic alarm and the supported options.

Table 5: Required Parameters for Dynamic Alarm
Required Parameters for Dynamic Threshold	Description	Supported Options
Baseline Analysis Algorithm	Baseline Analysis Algorithm is used to perform unsupervised learning on historical data. The baseline analysis is performed continuously as new data is received.	K-Means clustering; Exponentially Weighted Moving Average (EWMA)
Learning Period Duration	The Learning Period Duration specifies the amount of historical data used by the Baseline Analysis Algorithm to determine a baseline. The dynamic baseline is continuously updated using data from the most recent Learning Duration. When a dynamic alarm is configured, baseline analysis is performed using data from the most recent Learning Duration, if available. If there is not sufficient data available, Contrail Insights Agent evaluates metrics as soon as enough data is present to learn the first set of baselines. Example: When Learning Duration is 1 day, the agent compares metrics to per-hour baselines for the last 24 hours. Example: When Learning Duration is 1 week, the agent compares metrics to per-hour baselines for the last 7 x 24 hours.	1 week—Baseline is determined for each hour of last 1 week of data. Next 1 week of baselines are determined based on data of the last week. 1 month—Baseline is determined based on last 4 weeks of data. Baselines are learned for each hour of each day of week (7 x 24 baselines). Next 1 week of baselines are determined based on data of the last 4 weeks. For example, a baseline on Monday at 2:00 PM - 3:00 PM is learned using metric data from the last 4 Mondays at 2:00 PM - 3:00 PM.
Sensitivity	The dynamic baseline provides a normal operating region of a given metric for a given scope. As seen in Figure 9, the dynamic baseline is a tuple of the mean and std-dev applicable to a specific hour of the day. The sensitivity factor determines the allowable band of operation. Measurements outside of the band of operation cause an interval with exception. For example, if the baseline mean is 20 and std-dev is 2, then the normal operating region is between 18 and 22. When sensitivity is low, the normal operating region is treated as the range between 10 (mean - 5 * std-dev) and 30 (mean + 5 * std-dev). In this case, if the measured average of a metric is between 10 and 30, then no alarm is raised. In contrast, if the average is 5 or 35, then an alarm is raised.	Low—Any data point beyond `5 * std-dev` from the baseline mean is an outlier. Medium—Any data point beyond `3 * std-dev` from the baseline mean is an outlier. High—Any data point beyond `2 * std-dev` from the baseline mean is an outlier.

11. Alarm Severity

Indicates the seriousness of the alarm. Critical indicates a major alarm. Information indicates a minor alarm.

12. Notification

Methods of notification that alert you to conditions of operation.

13. Interval Duration

The duration of one measurement interval in seconds. Depending on the sampling frequency of the metric under observation, one or more raw samples might be received within an interval duration. All raw samples received within the Interval Duration are processed using aggregation functions such as average, sum, max, min, and std-dev.

14. Intervals with Exception

The minimum number of measurement intervals within the sliding window for which the alarm condition must be met to raise the alarm. In Figure 1, there are two Intervals with Exception: i2 and i5. When configuring an alarm in the Dashboard, Intervals with Exception is set to 1 by default. Intervals with Exception can be specified in the Dashboard by selecting Alarms > Add New Rule and then selecting Advanced to view the Advanced settings. Intervals with Exception cannot be greater than the Interval Count.

15. Interval Count

The maximum number of adjacent measurement intervals over which statistical analysis is performed before deciding whether an alarm is generated. In Figure 1, there are six measurement intervals (i1 to i6) in the sliding window. Each measurement interval has the duration specified by the Interval Duration parameter. When configuring an alarm in the Dashboard, Interval Count is set to 1 by default. The Interval Count can be specified in the Dashboard by selecting Alarms > Add New Rule and then selecting Advanced to view the Advanced settings.

16. Status

Used to set and verify the status of the alarm rule. Set the status to enabled or disabled.
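The relationship between Interval Duration, Intervals with Exception, and Interval Count can be captured in a small validation sketch. The class and field names are hypothetical and are not API field names. The defaults of 1 for the two interval parameters follow the Dashboard defaults described above, while the default of 60 seconds for the duration is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowSettings:
    """Hypothetical container for the three sliding-window parameters."""
    interval_duration_s: int = 60       # assumed default, not from the product
    intervals_with_exception: int = 1   # Dashboard default
    interval_count: int = 1             # Dashboard default

    def __post_init__(self):
        if self.interval_duration_s < 1:
            raise ValueError("interval duration must be at least one second")
        if not 1 <= self.intervals_with_exception <= self.interval_count:
            raise ValueError(
                "intervals_with_exception must be between 1 and interval_count"
            )

    @property
    def window_seconds(self):
        """Total time covered by the sliding window."""
        return self.interval_duration_s * self.interval_count


# Five-second intervals with a count of 36 cover a three-minute window.
cpu_rule = WindowSettings(interval_duration_s=5,
                          intervals_with_exception=36,
                          interval_count=36)
print(cpu_rule.window_seconds)
```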
