Explainable AI

SUMMARY Get familiar with the AI technology behind the Juniper Mist™ features.

AI Technology and Juniper Mist

Here's a quick introduction to the AI technology that powers Juniper Mist.

This series explores some of the key tools within our rich data science toolbox that powers the AI-driven enterprise. The tools are built into the Juniper Mist AI-driven platform that delivers an amazing experience to your employees, customers, and guests. As you learn more about the AI technology used by Juniper Mist, you'll see that the journey to an AI-driven network requires rich data, AI primitives, a well-stocked data science toolbox, and a virtual assistant.

All of these components are required as the network evolves to become self-driving. The data science tools range in algorithmic complexity and intelligence from regression to deep learning. Mutual information is used to understand the scope of impact of an issue.

Decision trees are a supervised learning technique used to determine network health by analyzing data, extracting feature information, and building models to predict failure or success of common networking problems. LSTM, or long short-term memory networks, are a special kind of recurrent neural network that use reasoning and previous network events to make informed decisions on current network issues. Reinforcement learning is used to realize a self-driving network that learns and optimizes wired and wireless settings for the best user experience.

To learn more about any of these tools and the data science toolbox, watch our AI Technical Whiteboard series.

Key concepts:

  • Mutual Information

  • Decision Tree

  • LSTM (Long Short-Term Memory) Networks

  • Reinforcement Learning

Natural Language Processing and the Marvis Virtual Network Assistant

Natural Language Processing (NLP) is used to help power your human language engagements with Marvis (the AI engine) when asking about network health, troubleshooting, or when taking corrective actions.

Today in the Tech Whiteboard series, we talk about NLP or Natural Language Processing and how it impacts networking and more specifically AIOps. Natural Language Processing gives machines the ability to derive meaning from human language. NLP is a combination of linguistics and AI, specifically machine learning.

Right here is where NLP lies. Let's take a look at a question you might ask Marvis, our virtual network assistant. NLP converts this question into more general meaning that our models know how to interpret to provide you with actionable insights about your network needs.

Let's take a look at what's really happening here. The first step in NLP is to clean up the text and convert the words into a form the computer can understand. First, stop words, or unimportant words like "and" and "the", are removed.

The remaining text is then split into smaller units like words and phrases, a process called tokenization. Next, featurization occurs, meaning each word is transformed into a vector. Vectors numerically capture the features or information about a word in a way that the computer can understand and process.
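The clean-up steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the Marvis pipeline: the stop-word list, the sample question, and the one-hot featurization are all assumptions chosen for brevity. A production system would use learned embeddings rather than one-hot vectors. The sketch also splits the text into words first so that stop words can be matched and dropped.

```python
import re

# Hypothetical stop-word list for illustration only; real lists are far longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "to", "of", "with", "what"}

def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [word for word in words if word not in STOP_WORDS]

def featurize(tokens):
    """Map each token to a one-hot vector over the vocabulary of this sentence."""
    vocabulary = sorted(set(tokens))
    return {
        token: [1 if token == entry else 0 for entry in vocabulary]
        for token in tokens
    }

# Hypothetical question, not taken from real customer data.
question = "What is wrong with the access point and the switch?"
tokens = tokenize(question)
vectors = featurize(tokens)

print(tokens)             # ['wrong', 'access', 'point', 'switch']
print(vectors["switch"])  # [0, 0, 1, 0]
```

One-hot vectors carry no notion of similarity between words, which is why the embedding models discussed later matter.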

Here's an example of vectorized words. More semantically similar words fall closer together. This is a crucial concept that allows NLP to be possible.

Vector representations of words can extend past 3D. Higher dimensional vectors can numerically capture more meaning about a word. While each word is represented by a vector, we need to come up with an encoded vector representation for the overall sentence.

Sentence encoded vectors are valuable because they capture information about the order of the words, which matters because words can have varying meanings depending on their context or position in a sentence. At this point in the process, embedding models are used. Embedding models map categorical data such as words or sentences into high dimensional vectors which capture semantic meaning about the text.

Embedding models are usually pre-trained on a large amount of data outside of your own, like Wikipedia, which harnesses the power of transfer learning or leveraging prior knowledge from one domain and task into a different domain and task. An example of a pre-trained embedding model is Word2Vec, which is trained on all the word data in Wikipedia, meaning it's able to embed extra meaning about the semantics of text into vectors because it's learned from so many examples what words can mean in certain contexts. The embeddings can now be fed into a machine learning model.

The machine learning model learns how to understand the meaning of unseen words by comparing the similarity between the input word vectors and the word vectors whose meanings are known from your training data. You can make sure that your model is able to recognize certain meanings by including them in your training data set. Words that are semantically similar will be closer in multidimensional space, which is how the model learns how to predict meaning of unseen words.

The closest vector with a known meaning in the vector space is the predicted meaning. As a result of decades of troubleshooting top-tier networks at Juniper, we've created a high-level structured set of training data born from decades of in-depth networking knowledge. We take real customer questions and annotate them to create our training data set.

Annotation includes flagging the tokens in the question as intents, intended actions like troubleshoot, count, list, or entities, information about the intent like device name or time frame. The training data questions are also made into vectors and sentence-encoded vectors with information about the annotation flags. So when unseen questions are asked, our model can predict the user's desired intents and entities based on how similar the unseen vectors are to the known vectors which have been trained.
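To make the "closest known vector" idea concrete, here is a small Python sketch that predicts an intent by cosine similarity. The three-dimensional vectors are invented for illustration, and a nearest-neighbor lookup is only one simple way to compare vectors. It is not a description of the actual Marvis model. The intent names troubleshoot, count, and list come from the text above.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical sentence vectors for annotated training questions.
KNOWN_INTENTS = {
    "troubleshoot": [0.9, 0.1, 0.0],
    "count": [0.1, 0.9, 0.1],
    "list": [0.0, 0.2, 0.9],
}

def predict_intent(vector, known=KNOWN_INTENTS):
    """Return the intent whose known vector is most similar to the input."""
    return max(known, key=lambda intent: cosine_similarity(vector, known[intent]))

# Hypothetical vector for an unseen question.
unseen_question = [0.8, 0.3, 0.1]
print(predict_intent(unseen_question))  # troubleshoot
```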

The benefits of NLP are clear. Resolving network issues in minutes, not days, just by asking a single question as opposed to poking around the network looking for clues, together with the versatility of using that same interface to perform tasks such as firmware upgrades, allows Marvis to become a virtual member of the IT operations team. In networking, we use NLP to allow our customers to interface with Marvis, pushing AIOps to the next level.

We hope this episode helped to uncover some of the magic and mystery behind our AI-driven network solutions.

Key concepts:

  • NLP

  • AIOps (AI for IT Operations)

  • Tokenization

  • Featurization

  • Sentence Encoded Vectors

  • Embedding Models

  • Transfer Learning

Mutual Information and Juniper Mist SLE Metrics

Mutual Information is used to figure out which network features are having the most impact on the failure or success of your SLE (Service Level Expectation) metrics and services.

Today we're talking about how the Juniper Mist AI-driven platform uses mutual information to help you understand which network features, such as mobile device type, client OS, or access point, have the most information for predicting failure or success in your SLE client metrics. Let's start with a definition of mutual information. Mutual information is defined in terms of the entropy between random variables.

Mathematically, mutual information is defined as the entropy of random variable Y minus the conditional entropy of Y given X, that is, I(X; Y) = H(Y) - H(Y | X). Mutual information is symmetric, so swapping X and Y gives the same value. Now, what does this mean? Let me give you an example. Let's say Y is one of our random variables that we want to predict and represents the SLE metric time to connect. And it can be one of two possible values, pass or fail.

Next, we have another random variable X that represents a network feature that can have a possible value of present or not present. An example of a network feature can be a device type, OS type, time interval, or even a user or an AP. Any possible feature of the network can be represented by the random variable.

Next, we'll look at what we mean by entropy. For most people, when they hear the term entropy, they think of the universe and entropy always increasing as the universe tends towards less order and more randomness or uncertainty. So, entropy represents the uncertainty of a random variable.

And the classic example is a coin toss. If I have a fair coin and I want to flip that coin, the entropy of that random variable is given by the negative sum, over each outcome x_i, of the probability of x_i times the log base two of the probability of x_i. For that fair coin, the probability is 50% heads and 50% tails, and the entropy is equal to one, the maximum entropy possible.

When you have maximum uncertainty, the random variable will have maximum entropy. If we take an example where we don't have a fair coin, we have some hustler out there and he's using a loaded coin. Let's say the probability of heads is 70% and the probability of tails is 30%.

Now, in this case, the entropy is going to be about 0.88, below the maximum of one. So, you can see that as the uncertainty goes down, your entropy will trend towards zero. If you were at zero entropy, that would mean no uncertainty and the coin flip would always be heads or always be tails. Now, let's go back and see how mutual information works with our SLE metrics.
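The coin-toss numbers above can be checked with a short Python sketch of Shannon entropy in bits. The function name is our own; the probabilities are the ones from the fair and loaded coin examples.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits: the sum of -p * log2(p) over all outcomes."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

print(round(entropy([0.5, 0.5]), 2))  # fair coin: 1.0
print(round(entropy([0.7, 0.3]), 2))  # loaded coin: 0.88
print(entropy([1.0]))                 # no uncertainty: 0.0
```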

Graphically, what does this equation look like? Let's say we look at how this circle here represents the entropy of my SLE metric Y. And this circle is the entropy of my feature random variable X. So, if you look at our equation, the conditional entropy of random variable Y, given the network feature X, is this area here. If I subtract the two, what we're looking for is this middle segment. This represents the mutual information of these two random variables.

And it gives you an indication of how well your network feature provides some information about your SLE metric random variable Y. If the network feature tells you everything about the SLE metric, then the mutual information is maximum. If it tells you nothing about the SLE metric, then the mutual information between X and Y is zero. Now, mutual information tells you how much information the network feature random variable X gives you about the SLE metric time to connect, but it doesn't tell you whether the network feature is better at predicting failure or success of the SLE metric.

For that, we need something called the Pearson correlation. If you look at the picture of the correlation, it tells us a couple of things. One is the amount of correlation with a range from negative one to one. The other is the sign, negative and positive, which is a predictor of pass or fail. So now we have these two things. First is the magnitude indicating how correlated the two random variables are.

Second is the sign, which indicates failure or success. If the correlation is negative, the network feature is good at predicting failure. If it's positive, it's good at predicting pass.
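As a rough illustration of how the sign is read, the Python sketch below computes the Pearson correlation between a feature and an SLE outcome. The encoding (1 for feature present, 1 for pass, 0 otherwise) and the eight observations are assumptions made up for this example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)

# Hypothetical observations: 1 = feature present, 0 = not present.
feature_present = [1, 1, 1, 1, 0, 0, 0, 0]
# Hypothetical SLE outcomes for the same clients: 1 = pass, 0 = fail.
sle_passed = [0, 0, 0, 1, 1, 1, 1, 0]

r = pearson(feature_present, sle_passed)
print(r)  # -0.5
```

With this encoding, the negative value means the feature's presence tends to go with failure.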

If the Pearson correlation is zero, it means there is no linear correlation between the variables, but there could be mutual information between the two. But the Pearson correlation does not tell us the importance of the network feature, or whether there is enough data to make an inference between the network feature random variable and the SLE metric random variable. For that, let's go back to our graphic of the circles.

There may be one case where I have very high entropy for both variables, but there may be another case where I have much smaller entropy on one of those variables. Both of these examples may be highly correlated with a high Pearson's value, but the mutual information will be much higher in the first case, which means this random variable has much more importance in predicting success or failure of the SLE metric. I hope this gives a little more insight into the AI we've created at Mist.

And if you look at the Mist dashboard, the result of this process is demonstrated by our virtual assistant.
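The contrast between correlation and mutual information can be sketched in Python. Mutual information is computed as H(Y) - H(Y | X) from observed pairs, and the two data sets are invented: in both, the feature perfectly predicts failure, but in the second the feature is rare, so the entropy is much smaller. This is an illustration under those assumptions, not the Mist implementation.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def mutual_information(pairs):
    """I(X; Y) = H(Y) - H(Y | X) for a list of (x, y) observations."""
    h_y = entropy(Counter(y for _, y in pairs).values())
    h_y_given_x = 0.0
    for x_value, x_count in Counter(x for x, _ in pairs).items():
        y_counts = Counter(y for x, y in pairs if x == x_value)
        h_y_given_x += (x_count / len(pairs)) * entropy(y_counts.values())
    return h_y - h_y_given_x

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)

# Hypothetical (feature present, SLE passed) pairs encoded as 1 or 0.
balanced = [(1, 0)] * 8 + [(0, 1)] * 8    # feature seen on half the clients
rare = [(1, 0)] * 1 + [(0, 1)] * 15       # feature seen on one client only

for name, pairs in (("balanced", balanced), ("rare", rare)):
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    print(name, round(pearson(xs, ys), 2), round(mutual_information(pairs), 2))
# balanced -1.0 1.0
# rare -1.0 0.34
```

Both cases have the same Pearson value, but the balanced case carries about three times as much mutual information.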

Key concepts:

  • Mutual Information
  • Pearson Correlation
  • Entropy

Reinforcement Learning and Juniper Mist Radio Resource Management

Reinforcement Learning is used to intelligently and dynamically optimize RF (Radio Frequency) in real time for the best Wi-Fi coverage, capacity, and connectivity possible. This is a far superior approach to the use of manual settings or traditional fixed algorithms and is totally custom on a per-site basis.

Radio frequency environments are inherently complex and therefore challenging to control and optimize for the efficient transmission of data. Since the inception of radio frequency, or RF, radio resource management, also known as RRM, has been a long-standing technique used to optimize the RF radio waves that transmit network traffic in wireless LANs. However, multiple interference sources like walls, buildings, and people, combined with the air serving as the transmission medium, make RRM a challenging technique to master.

Traditionally, site surveys have been used to determine the optimal placement of Wi-Fi access points and settings for transmit power, channels, and bandwidth. However, these manual approaches can't account for the dynamic nature of the environment when the wireless network is in use, with people and devices entering or leaving and moving about. Additionally, this challenge is compounded by random RF interference from sources like microwave ovens, radios, and aircraft radar, to name a few.

But what if the wireless network itself could perform RRM on its own? What if it could detect and respond to both interference sources, as well as the movement of people and devices, and adjust the radio settings in real time to provide the best possible wireless service? That's exactly what Juniper has done with the AI-driven Mist wireless solution, using advanced machine learning techniques. Specifically, Mist uses reinforcement learning to perform RRM. In a nutshell, a reinforcement learning machine, or agent, learns through an iterative trial and error process in an effort to achieve the correct result.

It's rewarded for actions that lead to the correct result, while receiving penalties for actions leading to an incorrect result. The machine learns by favoring actions that result in rewards. With Mist wireless, the reinforcement learning machine's value function is based on three main factors that lead to a good user experience: coverage, capacity, and connectivity.

A value function can be thought of as an expected return based on the actions taken. The machine can execute five different actions to optimize the value function.

These are adjusting the band setting between the two wireless bands of 2.4 GHz and 5 GHz, increasing or decreasing the transmit power of the AP's radios, switching to a different channel within the band, adjusting a channel's bandwidth, and switching the BSS color, which is a new knob available to 802.11ax access points. RRM will select actions with maximum future rewards for a site. Future rewards are evaluated by a value function.

The various actions taken by the learning machine, such as the increase of transmit power or switching the band from 2.4 GHz to 5 GHz, together represent a policy, which is a map the machine builds based on multiple trial and error cycles as it collects rewards, modeling actions that maximize the value function. Again, keep in mind that the value function represents good wireless user experience. As time goes on, even if random changes occur in the environment, the machine learns as it strives to maximize the value function.
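The trial-and-error loop described above can be illustrated with a deliberately simplified Python sketch. It uses an epsilon-greedy bandit, which is a much simpler technique than a full reinforcement learning system and is not the algorithm Mist uses. The five action names mirror the actions listed earlier, while the reward numbers, the noise level, and every function name are hypothetical.

```python
import random

ACTIONS = [
    "switch_band",
    "change_tx_power",
    "switch_channel",
    "change_channel_width",
    "switch_bss_color",
]

# Hypothetical average reward of each action at one simulated site.
TRUE_REWARD = {
    "switch_band": 0.2,
    "change_tx_power": 0.5,
    "switch_channel": 0.8,
    "change_channel_width": 0.4,
    "switch_bss_color": 0.3,
}

def simulate_reward(action, rng):
    """Stand-in for a measured user-experience score after taking an action."""
    return TRUE_REWARD[action] + rng.gauss(0, 0.1)

def learn_values(steps=2000, epsilon=0.1, seed=0):
    """Estimate the value of each action by repeated trial and error."""
    rng = random.Random(seed)
    value = {action: 0.0 for action in ACTIONS}
    count = {action: 0 for action in ACTIONS}
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)          # explore a random action
        else:
            action = max(ACTIONS, key=value.get)  # exploit the best estimate
        reward = simulate_reward(action, rng)
        count[action] += 1
        # Incremental running average of the rewards seen for this action.
        value[action] += (reward - value[action]) / count[action]
    return value

values = learn_values()
best_action = max(values, key=values.get)
print(best_action)  # switch_channel
```

Because the estimates keep updating, the same loop would shift toward a different action if the simulated rewards changed, which is the behavior the text describes for a changing RF environment.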

The benefits of using reinforcement learning are obvious. A Mist wireless network customizes the RRM policy per site, creating a unique wireless coverage environment akin to a well-tailored suit. While large organizations with multiple sites replicate their many locations as copy exact, these sites will naturally experience variances despite best efforts.

Reinforcement learning easily fixes this, delivering real-time, actively adjusting, custom wireless environments. We hope this episode helped to uncover some of the magic and mystery behind our AI-driven network solutions.

Key concepts:

  • Reinforcement Learning

  • Value Function

  • Future Rewards

Decision Trees and Issue Detection

Decision Trees are used to identify common network issues like faulty cables, access point and switch health, and wireless coverage. This is a form of supervised learning and can be used to isolate faults.

Today we'll be looking at decision trees and how they identify common network issues like faulty network cables, AP or switch health, and wireless coverage. The algorithms used include simple decision trees, random forest, gradient boosting, and XGBoost. Our example will be a simple decision tree algorithm.

The essential steps in any machine learning project involve collecting data, training a model, then deploying that model. Decision trees are no different. Let's use a real world example.

In networking, decision trees can be applied to incomplete negotiation detection and MTU mismatch, among others. Our example today applies to the fundamental problem of determining cable faults. A cable is neither 100% good nor bad, but somewhere in between.

To isolate a fault, we can create a decision tree using features of a bad cable such as frame errors and one-way traffic. We begin a decision tree by asking a question which will have the greatest impact on label separation. We use two metrics to determine that question, Gini impurity and information gain.

Gini impurity, which is similar to entropy, determines how much uncertainty there is in a node, and information gain lets us calculate how much a question reduces that uncertainty. A decision tree is based on a data set from known results. Each row is an example.

The first two columns are features that describe the label in the final column, a good or bad cable. The data set can be modified to add additional features and the structure will remain the same. A decision tree starts with a single root node which is given this full training set.

The node will then ask a yes-no question about one of the features and split the data into two subsets, each of which becomes the input of a child node. If the labels are still mixed, good and bad, then another yes-no question will be asked. The goal of a decision tree is to sort the labels until a high level of certainty is reached without overfitting a tree to the training data.

Larger trees may be more accurate and tightly fit the training data, but once in production they may be inaccurate in predicting real events. We use metrics to ask the best questions at each point until there are no more questions to ask, then prune branches starting at the leaves to address overfitting the tree. This produces an accurate model with leaves illustrating the final prediction.

Gini impurity is a metric that ranges from zero to one where lower values indicate less uncertainty or mixing at a node. It gives us our chance of being incorrect. Let's look at a mail carrier as an example.

In a town with only one person and one letter to deliver, the Gini impurity would be equal to zero since there's no chance of being incorrect. However, if there are 10 people in the town with one letter for each, the impurity would be 0.9 because now there's only a 1 in 10 chance of placing the randomly selected mail into the right mailbox, and so a 9 in 10 chance of being incorrect. Information gain helps us find the question that reduces our uncertainty the most.

It's just a number that tells us how much a question helps to unmix the labels at a node. We begin by calculating the uncertainty of our starting set. Impurity equals 0.48. Then for each question, we segment the data and calculate the uncertainty of the child nodes.

We take a weighted average of their uncertainty because we care more about a large set with low uncertainty than a small set with high uncertainty. Then we subtract that from our starting uncertainty and that's our information gain. As we continue, we'll keep track of the question with the most gain and that will be the best one to ask at this node.
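The Gini impurity and information gain calculations described above can be sketched in Python. The five-row data set is hypothetical, chosen so that the starting impurity is 0.48. The feature names frame errors and one-way traffic come from the cable example, but the values are invented.

```python
from collections import Counter

def gini_impurity(labels):
    """Chance of mislabeling a random item if labels are assigned at random."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def information_gain(rows, feature_index):
    """Starting impurity minus the weighted impurity of the two child nodes."""
    labels = [row[-1] for row in rows]
    yes = [row[-1] for row in rows if row[feature_index]]
    no = [row[-1] for row in rows if not row[feature_index]]
    if not yes or not no:
        return 0.0  # the question does not split the data at all
    weighted = (len(yes) * gini_impurity(yes) + len(no) * gini_impurity(no)) / len(rows)
    return gini_impurity(labels) - weighted

# Mail carrier example: one resident, then ten residents.
print(gini_impurity(["resident_0"]))                                  # 0.0
print(round(gini_impurity([f"resident_{i}" for i in range(10)]), 2))  # 0.9

# Hypothetical rows: (frame_errors, one_way_traffic, label).
ROWS = [
    (True, True, "bad"),
    (True, False, "bad"),
    (False, False, "good"),
    (False, False, "good"),
    (False, True, "good"),
]

print(round(gini_impurity([row[-1] for row in ROWS]), 2))  # 0.48
print(round(information_gain(ROWS, 0), 2))  # frame_errors: 0.48
print(round(information_gain(ROWS, 1), 3))  # one_way_traffic: 0.013
```

In this made-up data, asking about frame errors removes all of the uncertainty, so it would be the best question to ask at the root node.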

Now we're ready to build the tree. We start at the root node of the tree, which is given the entire data set. Then we need the best question to ask at this node.

We find this by iterating over each of these values. We split the data and calculate the information gain for each one. As we go, we keep track of the question that produces the most gain.

Once we find the most useful question to ask, we split the data using this question. We then add a child node to this branch and it uses a subset of data after the split. Again, we calculate the information gain to discover the best question for this data.

Rinse and repeat until there are no further questions to ask and this node becomes a leaf which provides the final prediction. We hope this gives you insight into the AI we've created. If you'd like to see more from a practical perspective, visit our solutions page on
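The tree-building loop described above can be sketched as a short recursive function in Python. This is a minimal illustration under stated assumptions: features are yes/no values, the data set is hypothetical, and pruning is omitted. The helper names are our own and do not correspond to any Juniper API.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def best_question(rows, n_features):
    """Return (gain, feature_index) for the yes/no question with the most gain."""
    labels = [row[-1] for row in rows]
    best = (0.0, None)
    for index in range(n_features):
        yes = [row[-1] for row in rows if row[index]]
        no = [row[-1] for row in rows if not row[index]]
        if not yes or not no:
            continue  # this question does not separate anything
        weighted = (len(yes) * gini(yes) + len(no) * gini(no)) / len(rows)
        gain = gini(labels) - weighted
        if gain > best[0]:
            best = (gain, index)
    return best

def build_tree(rows, n_features):
    """Split recursively until no question adds information, then make a leaf."""
    _, feature = best_question(rows, n_features)
    if feature is None:
        # Leaf node: predict the most common label among the remaining rows.
        return Counter(row[-1] for row in rows).most_common(1)[0][0]
    return {
        "feature": feature,
        "yes": build_tree([row for row in rows if row[feature]], n_features),
        "no": build_tree([row for row in rows if not row[feature]], n_features),
    }

def predict(tree, features):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(tree, dict):
        tree = tree["yes"] if features[tree["feature"]] else tree["no"]
    return tree

# Hypothetical rows: (frame_errors, one_way_traffic, label).
ROWS = [
    (True, True, "bad"),
    (True, False, "bad"),
    (False, False, "good"),
    (False, False, "good"),
    (False, True, "good"),
]

tree = build_tree(ROWS, n_features=2)
print(tree)                          # {'feature': 0, 'yes': 'bad', 'no': 'good'}
print(predict(tree, (True, False)))  # bad
```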

Key concepts:

  • Decision Trees

  • Random Forest

  • Gradient Boosting

  • XGBoost

  • Gini Impurity

  • Information Gain

Keep Learning

To learn more, go to this resource on What is explainable AI, or XAI?