Scaling, Troubleshooting, and Monitoring Considerations

This chapter discusses monitoring route server scale, Junos BGP components, and client-specific BGP sessions. Monitoring a route server is not unlike normal BGP session management, which is covered so widely in other publications that we assume the reader is well versed in it.

Unlike ordinary BGP speakers, Junos route servers have two special considerations:

Configuration database size
rpd memory utilization for route copies between routing instances

Monitoring the Configuration Database Size

To support large configurations (for example, a thousand route server clients resulting in more than two million lines of output), the default configuration database size must be extended and compression enabled. The following configuration stanza enables the extended configuration database size, but it requires a Junos reboot:

Initial Junos Configuration Database

The Junos configuration database can be monitored with the show system configuration database usage command, here showing the size with a basic configuration:

Now let’s show 1,000 route server clients, each with its own routing instance, import policy, instance-export policy, and instance-import policy:

You can see the significant database usage.

Monitoring Route Table Size

The next sample output shows summary statistics about the entries in the routing table (show route summary command) and the memory usage breakdown for rpd (show task memory detail command). Together, the two commands provide a comprehensive picture of the memory utilization of the routing protocol process.

The show route summary command shows the number of routes in the various routing tables for each route server client. Within each routing table, all of the active, holddown, and hidden destinations and routes are summarized. Routes are in the holddown state prior to being declared inactive, and hidden routes are not used because of routing policy. Note that routes in the holddown and hidden states are still using memory because they appear in the routing table:
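As a rough illustration of how these counters relate, the following Python sketch parses per-table summary lines (table name, destinations, routes, and the active/holddown/hidden breakdown) and reports how many routes in each table are in the holddown or hidden state. The sample lines, table names, and counts are invented for illustration, and the regular expression assumes that line layout; adjust it if your release formats the line differently.

```python
import re

# Assumed layout of a per-table line in "show route summary" output.
LINE_RE = re.compile(
    r"^(?P<table>\S+): (?P<destinations>\d+) destinations, "
    r"(?P<routes>\d+) routes \((?P<active>\d+) active, "
    r"(?P<holddown>\d+) holddown, (?P<hidden>\d+) hidden\)"
)


def parse_summary(text):
    """Return {table_name: counters} for every line that matches LINE_RE."""
    tables = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            fields = match.groupdict()
            name = fields.pop("table")
            tables[name] = {key: int(value) for key, value in fields.items()}
    return tables


def holddown_and_hidden(counters):
    """Routes that are not usable but still occupy memory in the table."""
    return counters["holddown"] + counters["hidden"]


# Hypothetical sample input; table names and numbers are made up.
SAMPLE = """\
C1.inet.0: 12 destinations, 14 routes (12 active, 1 holddown, 1 hidden)
C3.inet.0: 8 destinations, 8 routes (8 active, 0 holddown, 0 hidden)
"""

if __name__ == "__main__":
    for name, counters in parse_summary(SAMPLE).items():
        print(name, "routes:", counters["routes"],
              "holddown+hidden:", holddown_and_hidden(counters))
```

Tracking the holddown and hidden totals per client table over time is one way to spot a client whose routes are consuming memory without ever becoming usable.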

Monitoring RPD Memory Utilization

The show task memory detail command lists the data structures within the tasks run by rpd. Tasks are enabled depending on the router’s configuration. The Alloc Bytes field indicates the highest amount of memory used by the data structure. The maximum allocated blocks and bytes are high water marks for a data structure. The example below looks to be output from a very healthy route server as very little memory is being allocated:
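To make the high-water-mark idea concrete, here is a minimal Python sketch of a counter that tracks both its current allocation and the largest value it has ever reached. The class and method names are hypothetical, and this is not how rpd is implemented; it only illustrates why a maximum-allocated figure can remain high after memory has been released.

```python
class AllocCounter:
    """Toy allocation counter with a high-water mark (illustrative only)."""

    def __init__(self):
        self.current_bytes = 0  # bytes allocated right now
        self.max_bytes = 0      # highest value current_bytes has reached

    def alloc(self, nbytes):
        self.current_bytes += nbytes
        # The high-water mark only ever moves upward.
        self.max_bytes = max(self.max_bytes, self.current_bytes)

    def free(self, nbytes):
        # Freeing lowers the current figure but never the high-water mark.
        self.current_bytes = max(0, self.current_bytes - nbytes)


if __name__ == "__main__":
    counter = AllocCounter()
    counter.alloc(4096)
    counter.alloc(1024)
    counter.free(4096)
    print("current:", counter.current_bytes, "high-water mark:", counter.max_bytes)
```

A large gap between the current figure and the high-water mark suggests a past burst, such as many clients re-advertising at once, rather than steady-state usage.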

Monitoring Client EBGP Sessions

Individual route server client EBGP sessions can be viewed either as a summarized list or individually. The following uses show bgp summary on the sample topology:

Monitoring Route Distribution

The following show command displays the total routes present in route server client C3’s RIB, along with the RIBs into which they are imported:

To see specific routes in C3’s RIB that will be exported to C1, based on the IXP global policy, view the source RIB and filter by using the target community:

A slightly different view, or rather a validation of the previous command, is to look at the RIB contents of all client RIBs from the perspective of what routes have been received from a specific route server client. In the next example, 192.0.2.3 is the BGP peer associated with the routing-instance C3:

Specific prefixes can also be searched for to aid in troubleshooting:

Routes may also be searched by community value or name. The search returns every client RIB that has a match, so route propagation between client RIBs can be tracked:

Monitoring Tools: HealthBot

HealthBot is a highly automated and programmable device-level diagnostics and network analytics tool that provides consistent and coherent operational intelligence across network deployments. Integrated with multiple data collection methods (such as Junos Telemetry Interface, NETCONF, and SNMP), HealthBot aggregates and correlates large volumes of time-sensitive telemetry data, providing a multidimensional and predictive view of the network. Additionally, HealthBot translates troubleshooting, maintenance, and real-time analytics into an intuitive user experience to give network operators actionable insights into the health of an individual device and the overall network.

HealthBot BGP KPIs, located at https://github.com/Juniper/healthbot-rules/tree/master/juniper_official/Protocols/Bgp, contain readily consumable HealthBot playbooks and rules, which are specific to BGP neighbor key performance indicators (KPIs).

HealthBot monitoring dashboard displaying system and network health. Left panel shows color-coded status tiles for metrics like system.memory and protocol.bgp. Right panel lists metrics with status messages such as Routing Engine memory buffer utilization is normal.

BGP KPI rules collect statistics from network devices, then analyze the data and act. A BGP KPI playbook is a set of rules, where each rule is defined with a set of KPIs. Playbooks contain rules for BGP session state, neighbor flap detection, received routes with a static threshold, and received routes with a dynamic threshold.

Rules are defined with default variable values that can be changed when deploying playbooks.
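The difference between a static and a dynamic threshold on received routes can be sketched in a few lines of Python. This is a conceptual illustration only, not HealthBot's rule logic or rule syntax: the function names, the default values, and the use of a mean plus or minus k standard deviations as the dynamic baseline are all assumptions made for the example. The keyword defaults mirror the idea of rule variables that ship with default values and can be overridden at deployment.

```python
from statistics import mean, pstdev


def static_breach(received_routes, max_routes=1000):
    """Static threshold: alert when the count exceeds a fixed limit."""
    return received_routes > max_routes


def dynamic_breach(history, current, k=3.0):
    """Dynamic threshold: alert when the current count deviates from the
    recent baseline by more than k standard deviations (assumed method)."""
    if len(history) < 2:
        return False  # not enough samples to form a baseline
    baseline = mean(history)
    spread = pstdev(history)
    return abs(current - baseline) > k * spread


if __name__ == "__main__":
    # Hypothetical received-route counts from one client over recent polls.
    recent = [100, 102, 98, 101, 99]
    print(static_breach(1200))                   # default limit
    print(static_breach(1200, max_routes=2000))  # overridden limit
    print(dynamic_breach(recent, 150))           # far from the baseline
    print(dynamic_breach(recent, 101))           # within normal variation
```

A static limit suits clients with a known maximum prefix count, while a baseline-relative check adapts to each client's normal volume and can catch a sudden drop as well as a spike.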

HealthBot RIB KPIs, located at https://github.com/Juniper/healthbot-rules/tree/master/juniper_official/Protocols/Rib, contain readily consumable HealthBot playbooks and rules that are specific to RIB route summary KPIs. RIB route summary KPI rules collect statistics from network devices, then analyze the data and act appropriately. The RIB route summary KPI playbook is a set of rules, where each rule is defined with a set of KPIs. Playbooks contain route table summary rules and protocol route summary rules with dynamic thresholds. Rules are defined with default variable values that can be changed while deploying the playbook.

HealthBot Systems KPIs, located at https://github.com/Juniper/healthbot-rules/tree/master/juniper_official/System, contain readily consumable HealthBot playbooks and rules that are specific to system KPIs. System KPI rules collect the statistics from network devices, then analyze the data and act. The system KPI playbook is a set of rules, where each rule is defined with a set of KPIs. Playbooks contain routing engine CPU, routing engine memory, Junos processes CPU, memory leak detection, and system storage rules.

Rules are defined with default variable values, which can be changed while deploying the playbook.
