NorthStar Controller Troubleshooting Guide

 

This document includes strategies for identifying whether an apparent problem stems from the NorthStar Controller or from the router, and provides troubleshooting techniques for those problems that are identified as stemming from the NorthStar Controller.

Before you begin any troubleshooting investigation, confirm that all system processes are up and running. A sample list of processes is shown below. Your actual list of processes could be different.

[root@node-1 ~]# supervisorctl status

Restart any processes that display as STOPPED instead of RUNNING.

Note

To stop, start, or restart all processes, use the service northstar stop, service northstar start, and service northstar restart commands.
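The process check above can also be scripted. The following Python sketch scans captured supervisorctl status output and reports every process that is not in the RUNNING state. It assumes the usual supervisorctl layout of a process name followed by a state; the sample text, including the PID and uptime, is hypothetical and is not output from a real NorthStar server.

```python
def find_not_running(status_output):
    """Return (process, state) pairs for every process that is not RUNNING."""
    problems = []
    for line in status_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, state = fields[0], fields[1]
        if state != "RUNNING":
            problems.append((name, state))
    return problems


# Hypothetical sample; the PID and uptime values are illustrative only.
SAMPLE_STATUS = """\
northstar:pceserver              RUNNING   pid 4083, uptime 1:12:45
northstar:toposerver             STOPPED   Not started
"""

if __name__ == "__main__":
    for name, state in find_not_running(SAMPLE_STATUS):
        print(f"{name} is {state}")
```

On a live server, the input could come from subprocess.run(["supervisorctl", "status"], capture_output=True, text=True).stdout instead of the sample string.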

To access system process status information from the NorthStar Controller Web UI, navigate to More Options > Administration and select System Health.

The current CPU %, memory usage, virtual memory usage, and other statistics for each system process are displayed. Figure 1 shows an example.

Note

Only processes that are running are included in this display.

Figure 1: Process Status Display

Table 1 describes each field displayed in the Process Status table.

Table 1: Descriptions of Process Status Fields

Field            Description
Process          The name of the NorthStar Controller process.
PID              The process ID number.
User             The NorthStar Controller user permissions required to access information about this process.
Group            The NorthStar Controller user group permissions required to access information about this process.
CPU%             The percentage of CPU currently in use by this process.
Memory           The percentage of memory currently in use by this process.
Virtual Memory   The virtual memory currently in use by this process.
CPU Time         The amount of time the CPU has spent processing instructions for this process.
CMD              The specific command options for the system process.

The troubleshooting information is presented in the following sections:

NorthStar Controller Log Files

Throughout your troubleshooting efforts, it can be helpful to view various NorthStar Controller log files. To access log files:

  1. Log in to the NorthStar Controller Web UI.
  2. Navigate to More Options > Administration and select Logs.

    A list of NorthStar system log and message files is displayed, a truncated example of which is shown in Figure 2.

    Figure 2: Sample of System Log and Message Files
  3. Click the log file or message file that you want to view.

    The log file contents are displayed in a pop-up window.

  4. To open the file in a separate browser window or tab, click View Raw Log in the pop-up window.
  5. To close the pop-up window and return to the list of log and message files, click X in the upper right corner of the pop-up window.

Table 2 lists the NorthStar Controller log files most commonly used to identify and troubleshoot issues with the PCS and PCE.

Table 2: Top NorthStar Controller Troubleshooting Log Files

Log File: pcep_server.log
Location: /var/log/jnc
Description: Log entries related to the PCEP server. The PCEP server maintains the PCEP session. The log contains information about communication between the PCC and the PCE in both directions.

To configure verbose PCEP server logging:

  1. From the NorthStar Controller CLI, run pcep_cli.
  2. Type set log-level all.
  3. Press CTRL-C to exit.

Log File: pcs.log
Location: /opt/northstar/logs
Description: Log entries related to the PCS. The PCS is responsible for path computation. This log includes events received by the PCS from the Toposerver, including provisioning orders. It also contains notification of communication errors and issues that prevent the PCS from starting up properly.

Log File: toposerver.log
Location: /opt/northstar/logs
Description: Log entries related to the topology server. The topology server is responsible for maintaining the topology. These logs contain the record of the events between the PCS and the Toposerver, the Toposerver and NTAD, and the Toposerver and the PCE server.

Table 3 lists additional log files that can also be helpful for troubleshooting. All of the log files in Table 3 are located under the /opt/northstar/logs directory.

Table 3: Additional Log Files for Troubleshooting NorthStar Controller

Log File          Description
cassandra.msg     Log events related to the Cassandra database.
ha_agent.msg      HA coordinator log.
mlAdaptor.log     Interface to transport controller log.
net_setup.log     Configuration script log.
nodejs.msg        Log events related to Node.js.
pcep_server.log   Log entries related to communication between the PCC and the PCE in both directions.
pcs.log           Log entries related to the PCS, including any event that the PCS receives from the Toposerver, such as provisioning orders. This log also records communication errors and any issues that prevent the PCS from starting up properly.
rest_api.log      Log of REST API requests.
toposerver.log    Log entries related to the topology server. Contains the record of the events between the PCS and the topology server, the topology server and NTAD, and the topology server and the PCE server.

Note: Any message forwarded to the pcshandler.log file is also forwarded to the pcs.log file.

To see logs related to the Junos VM, you must establish a telnet session to the router. The default IP address for the Junos VM is 172.16.16.2. The Junos VM is responsible for maintaining the necessary BGP, ISIS, or OSPF sessions.

Empty Topology

Figure 3 illustrates the flow of information from the router to the Toposerver that results in the topology display in the NorthStar Controller UI. When the topology display is empty, it is likely this flow has been interrupted. Finding out where the flow was interrupted can guide your problem resolution process.

Figure 3: Topology Information Flow

The topology originates at the routers. For NorthStar Controller to receive the topology, there must be a BGP-LS, ISIS, or OSPF session from one of the routers in the network to the Junos VM. There must also be an established Network Topology Abstractor Daemon (NTAD) session between the Junos VM and the Toposerver.

To check these connections:

  1. Using the NorthStar Controller CLI, verify that the NTAD connection between the Toposerver and the Junos VM was successfully established as shown in this example:
    [root@northstar ~]# netstat -na | grep :450
    Note

    Port 450 is the port used for Junos VM to Toposerver connections.

    In the following example, the NTAD connection has not been established:

    [root@northstar ~]# netstat -na | grep :450
  2. Log in to the Junos VM to confirm whether NTAD is configured to enable topology export. The grep command below gives you the IP address of the Junos VM.
    [root@northstar ~]# grep "ntad_host" /opt/northstar/data/northstar.cfg
    [root@northstar ~]# telnet 172.16.16.2
    login: northstar
    Password:
    northstar@northstar_junosvm> show configuration protocols | display set

    If the topology-export statement is missing, the Junos VM cannot export data to the Toposerver.

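  3. Use Junos OS show commands to confirm that the BGP, ISIS, or OSPF relationship between the Junos VM and the router is up (for BGP, this means a session in the Established state; a BGP state of Active means the session has not yet been established). If the session is not up, the topology information cannot be sent to the Junos VM.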
  4. On the Junos VM, verify whether the lsdist.0 routing table has any entries:
    northstar@northstar_junosvm> show route table lsdist.0 terse | match lsdist.0

    If you see only zeros in the lsdist.0 routing table, there is no topology that can be sent. Review the NorthStar Controller Getting Started Guide sections on configuring topology acquisition.

  5. Ensure that there is at least one link in the lsdist.0 routing table. The Toposerver can only generate an initial topology if it receives at least one NTAD link event. A network that consists of a single node with no IGP adjacency with other nodes (as is possible in a lab environment, for example) does not enable the Toposerver to generate a topology. Figure 4 illustrates the Toposerver’s logic process for creating the initial topology.
    Figure 4: Logic Process for Initial Topology Creation

    If an initial topology cannot be created for this reason, the toposerver.log generates an entry similar to the following example:

NTAD Version

If you see that SR LSPs have not been provisioned and the pcs.log shows messages similar to this example:

It might be that the NTAD version is incorrect. See Installing the NorthStar Controller for information on NTAD versions.

Incorrect Topology

One important function of the Toposerver is to correlate the unidirectional link (interface) information from the routers into bidirectional links by matching source and destination IPv4 Link_Identifiers from NTAD link events. When the topology displayed in the NorthStar UI does not appear to be correct, it can be helpful to understand how the Toposerver handles the generation and maintenance of the bidirectional links.

Generation and maintenance of bidirectional links is a complex process, but here are some key points:

  • For the two nodes constituting each bidirectional link, the node whose Node ID was assigned first (and which therefore has the lower Node ID number) is given the Node A designation, and the other node is given the Node Z designation.

    Note

    The Node ID is assigned when the Toposerver first receives the Node event from NTAD.

  • Whenever a Node ID is cleared and reassigned (such as during a Toposerver restart or network model reset), the Node IDs, and therefore the A and Z designations, can change.

  • The Toposerver receives a Link Update message when a link in the network is added or modified.

  • The Toposerver receives a Link Withdraw message when a link is removed from the network.

  • The Link Update and Link Withdraw messages affect the operational status of the nodes.

  • The node operational status, together with the protocol (IGP versus IGP plus MPLS), determines whether a link can be used to route LSPs. For a link to be used to route LSPs, it must have both an operational status of UP and the MPLS protocol active.
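To make the key points above concrete, here is a minimal Python sketch of pairing unidirectional link reports into bidirectional links and choosing the A and Z ends by Node ID. It is an illustration only, not the Toposerver implementation: the LinkEvent shape, the function names, and the sample addresses are all hypothetical, and the real NTAD event format is not shown in this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkEvent:
    """One unidirectional link report (hypothetical shape, not the NTAD format)."""
    node: str       # name of the reporting node
    local_ip: str   # IPv4 identifier on the reporting side
    remote_ip: str  # IPv4 identifier on the far side


def assign_node_ids(node_events):
    """Assign increasing Node IDs in the order nodes are first seen."""
    ids = {}
    for name in node_events:
        if name not in ids:
            ids[name] = len(ids) + 1
    return ids


def pair_links(link_events, node_ids):
    """Match each report with its reverse direction; return (node_a, node_z) tuples."""
    by_addresses = {(e.local_ip, e.remote_ip): e for e in link_events}
    pairs, seen = [], set()
    for event in link_events:
        reverse = by_addresses.get((event.remote_ip, event.local_ip))
        if reverse is None:
            continue  # only one direction has been reported so far
        key = frozenset((event.local_ip, event.remote_ip))
        if key in seen:
            continue  # this bidirectional link was already emitted
        seen.add(key)
        # The node with the lower (earlier assigned) Node ID becomes Node A.
        node_a, node_z = sorted((event.node, reverse.node), key=node_ids.__getitem__)
        pairs.append((node_a, node_z))
    return pairs
```

Calling assign_node_ids with the nodes in a different arrival order, as could happen after a Toposerver restart, swaps the A and Z designations for the same pair of reports.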

Missing LSPs

When your topology is displaying correctly, but you have missing LSPs, take a look at the flow of information from the PCC to the Toposerver that results in tunnels being added to the NorthStar Controller UI, as illustrated in Figure 5. The flow begins with the configuration at the PCC, from which an LSP Update message is passed to the PCEP server by way of a PCEP session and then to the Toposerver by way of an Advanced Message Queuing Protocol (AMQP) connection.

Figure 5: LSP Information Flow

To check these connections:

  1. Look at the toposerver.log. The log prints a message every 15 seconds when it detects that its connection with the PCEP server has been lost or was never successfully established. Note that in the following example, the connection between the Toposerver and the PCEP server is marked as down.
  2. Using the NorthStar Controller CLI, verify that the PCEP session between the PCC and the PCEP server was successfully established as shown in this example:
    [root@northstar ~]# netstat -na | grep :4189
    Note

    Port 4189 is the port used for PCC to PCEP server connections.

    Knowing that the session has been established is useful, but it does not necessarily mean that any data was transferred.

  3. Verify whether the PCEP server learned about any LSPs from the PCC.
    [root@user-PCS ~]# pcep_cli
    # show lsp all list

    In the far right column of the output, you see the number of LSPs that were learned. If this number is 0, no LSP information was sent to the PCEP server. In that case, check the configuration on the PCC side, as described in the NorthStar Controller Getting Started Guide.
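The netstat check in step 2 (and the equivalent port 450 check in the Empty Topology section) can be automated. The Python sketch below filters captured netstat -na output for established TCP connections on a given port. It assumes the common Linux column layout (protocol, receive queue, send queue, local address, foreign address, state); the sample addresses are hypothetical.

```python
def established_on_port(netstat_output, port):
    """Return (local, foreign) address pairs of ESTABLISHED TCP rows that use `port`."""
    suffix = f":{port}"
    matches = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected TCP row: proto, recv-q, send-q, local, foreign, state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, foreign, state = fields[3], fields[4], fields[5]
        if state == "ESTABLISHED" and (local.endswith(suffix) or foreign.endswith(suffix)):
            matches.append((local, foreign))
    return matches


# Hypothetical sample output; all addresses are illustrative only.
SAMPLE_NETSTAT = """\
tcp        0      0 0.0.0.0:4189        0.0.0.0:*           LISTEN
tcp        0      0 172.16.16.1:4189    10.0.0.11:52100     ESTABLISHED
tcp        0      0 172.16.16.1:33456   172.16.16.2:450     ESTABLISHED
tcp        0      0 172.16.16.1:4500    10.0.0.12:40000     ESTABLISHED
"""

if __name__ == "__main__":
    for port in (450, 4189):
        state = "established" if established_on_port(SAMPLE_NETSTAT, port) else "NOT established"
        print(f"port {port}: {state}")
```

Unlike a plain grep for :450, the exact suffix match does not count a connection on port 4500. As the text notes, an established session still does not prove that any data was transferred.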

PCC That is Not PCEP-Enabled

The Toposerver associates the PCEP sessions with the nodes in the topology from the TED in order to make a node PCEP-enabled. This Toposerver function is hindered if the IP address used by the PCC to establish the PCEP session was not the one automatically learned by the Toposerver from the TED. For example, if a PCEP session is established using the management IP address, the Toposerver will not receive that IP address from the TED.

When the PCC successfully establishes a PCEP session, it sends a PCC_SYNC_COMPLETE message to the Toposerver. This message indicates to NorthStar that synchronization is complete. The following is a sample of the corresponding toposerver log entries, showing both the PCC_SYNC_COMPLETE message and the PCEP IP address that NorthStar might or might not recognize:

Some options for correcting the problem of an unrecognized IP address are:

  • Manually input the unrecognized IP address in the device profile in the NorthStar Web UI by navigating to More Options > Administration > Device Profile.

  • Ensure that there is at least one LSP originating on the router, which allows the Toposerver to associate the PCEP session with the node in the TED.

Once the IP address problem is resolved, and the Toposerver is able to successfully associate the PCEP session with the node in the topology, it adds the PCEP IP address to the node attributes as can be seen in the PCS log:

LSP Stuck in PENDING or PCC_PENDING State

Once nodes are correctly established as PCEP-enabled, you can start provisioning LSPs. During provisioning, the LSP controller status can show PENDING or PCC_PENDING in the Tunnels tab of the Web UI network information table (Controller Status column). This section explains how to interpret those statuses.

When an LSP is being provisioned, the PCS computes a path that satisfies all the requirements for the LSP, and then sends a provisioning order to the PCEP server. Log messages similar to the following example appear in the PCS log while this process is taking place:

The LSP controller status is PENDING at this point, meaning that the provisioning order has been sent to the PCEP server, but an acknowledgement has not yet been received. If an LSP is stuck at PENDING, it suggests that the problem lies with the PCEP server. You can log in to the PCEP server and configure verbose log messages, which can provide additional troubleshooting information:

pcep_cli
set log-level all

There are also a variety of show commands on the PCEP server that can display useful information. Just as with Junos OS syntax, you can enter show ? to see the show command options.

If the PCEP server successfully receives the provisioning order, it performs two actions:

  • It forwards the order to the PCC.

  • It sends an acknowledgement back to the PCS.

The PCEP server log would show an entry similar to the following example:

The LSP controller status changes to PCC_PENDING, indicating that the PCEP server received the provisioning order and forwarded it on to the PCC, but the PCC has not yet responded. If an LSP is stuck at PCC_PENDING, it suggests that the problem lies with the PCC.

If the PCC receives the provisioning order successfully, it sends a response to the PCEP server, which, in turn, forwards the response to the PCS. When the PCS receives this response, it clears the LSP controller status completely, indicating that the LSP is fully provisioned and is not waiting for action from the PCEP server or PCC. The operational status (Op Status column) then becomes the indicator for the condition of the tunnel.
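The status progression described in this section can be summarized as a small lookup. The Python sketch below maps a controller status at which an LSP is stuck to the component that this section suggests investigating. The function and table names are hypothetical, and statuses not discussed in this section are reported as unknown rather than guessed.

```python
# Component to investigate when an LSP stays in a given controller status,
# as described in this section.
SUSPECT_BY_STATUS = {
    "PENDING": "PCEP server",  # order sent by the PCS, no acknowledgement yet
    "PCC_PENDING": "PCC",      # order forwarded to the PCC, no response yet
}


def suspect_for(controller_status):
    """Return the component to check, or None when the status has been cleared."""
    status = (controller_status or "").strip().upper()
    if not status:
        return None  # cleared status: use the Op Status column instead
    return SUSPECT_BY_STATUS.get(status, "unknown")


if __name__ == "__main__":
    for value in ("PENDING", "PCC_PENDING", ""):
        print(repr(value), "->", suspect_for(value))
```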

The PCS log would show an entry similar to the following example:

LSP That is Not Active

If an LSP provisioning order is successfully sent and acknowledged, and the controller status is cleared, it is still possible that the LSP is not up and running. If the operational status of the LSP is DOWN, the PCC cannot signal the LSP. This section explores some of the possible reasons for the LSP operational status to be DOWN.

Utilization is a key concept related to LSPs that are stuck in DOWN. There are two types of utilization, and they can be different from each other at any specific time:

  • Live utilization—This type is used by the routers in the network to signal an LSP path. This type of utilization is learned from the TED by way of NTAD. You might see PCS log entries such as those in the following example. In particular, note the reservable bandwidth (reservable_bw) entries that advertise the RSVP utilization on the link:

  • Planned utilization—This type is used within NorthStar Controller for path computation. This utilization is learned from PCEP when the router advertises the LSP and communicates to NorthStar the LSP bandwidth and the path the LSP is to use. You might see PCS log entries such as those in the following example. In particular, note the bandwidth (bw) and record route object (RRO) entries that advertise the RSVP utilization on the link:

It is possible for the two utilizations to be different enough from each other that it causes interference with successful computation or signalling of the path. For example, if the planned utilization is higher than the live utilization, a path computation issue could arise in which the PCS cannot compute the path because it thinks there is no room for it. But because the planned utilization is higher than the actual live utilization, there may very well be room.

It’s also possible for the planned utilization to be lower than the live utilization. In that case, the PCC does not signal the path because it thinks there is no room for it.
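The two mismatch cases can be illustrated with a deliberately simplified model of a single link. The Python sketch below assumes that each side only checks whether the bandwidth it believes is in use, plus the bandwidth of the new LSP, fits within the reservable capacity of the link. The function name, parameters, and result labels are hypothetical, and real path computation considers far more than one link.

```python
def classify_mismatch(capacity, planned_used, live_used, lsp_bw):
    """Classify the outcome of placing an LSP on one link (all values in the same unit).

    planned_used: bandwidth NorthStar believes is reserved (planned utilization)
    live_used:    bandwidth the routers report as reserved (live utilization)
    """
    pcs_has_room = planned_used + lsp_bw <= capacity  # view used for path computation
    pcc_has_room = live_used + lsp_bw <= capacity     # view used for signaling
    if pcs_has_room and pcc_has_room:
        return "ok"
    if not pcs_has_room and pcc_has_room:
        return "pcs-cannot-compute"  # planned is higher than live
    if pcs_has_room and not pcc_has_room:
        return "pcc-cannot-signal"   # planned is lower than live
    return "link-full"


if __name__ == "__main__":
    # Planned utilization is higher than live: the PCS sees no room.
    print(classify_mismatch(capacity=1000, planned_used=900, live_used=300, lsp_bw=200))
    # Planned utilization is lower than live: the PCC sees no room.
    print(classify_mismatch(capacity=1000, planned_used=300, live_used=900, lsp_bw=200))
```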

To view utilization in the Web UI topology map, navigate to Options in the left pane of the Topology view. If you select RSVP Live Utilization, the topology map reflects the live utilization that comes from the routers. If you select RSVP Utilization, the topology map reflects the planned utilization which is computed by the NorthStar Controller based on planned properties.

A better troubleshooting tool in the Web UI is the Network Model Audit widget in the Dashboard view. The Link RSVP Utilization line item reflects whether there are any mismatches between the live and the planned utilizations. If there are, you can try executing Sync Network Model from the Web UI by navigating to Administration > System Settings, and then clicking Advanced Settings in the upper right corner of the resulting window.

Note

The upper right corner button toggles between General Settings and Advanced Settings.

PCS Out of Sync with Toposerver

If the PCS becomes out of sync with Toposerver such that they do not agree on the state of LSPs, you must deactivate and reactivate the PCEP protocol in order to restore synchronization. Perform the following steps on the NorthStar server.

Caution

Be aware that following this procedure:

  • Kills the PCEP sessions for all PCCs, not just the one with which there is a problem.

  • Results in the loss of all user data which then needs to be repopulated.

  • Has an impact on a production system due to the resynchronization.

  1. Stop the PCE server and wait 10 seconds to allow the PCC to remove all lingering LSPs.
    supervisorctl stop northstar:pceserver
  2. Restart the PCE server.
    supervisorctl start northstar:pceserver
  3. Restart Toposerver.
    supervisorctl restart northstar:toposerver
    Note

    An alternative way to restart Toposerver is to perform a Reset Network Model from the NorthStar Controller web UI (Administration > System Settings, Advanced). See the Disappearing Changes section for more information about the Sync Network Model and Reset Network Model operations.
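The three steps above can be wrapped in a small script so that the 10-second wait is never skipped. The following Python sketch runs the same supervisorctl commands in order. The function name and the injectable run and sleep parameters are hypothetical conveniences added for dry runs; the caution above applies in full, because calling the function with its defaults on a NorthStar server drops the PCEP sessions for all PCCs.

```python
import subprocess
import time

# (command, seconds to wait afterwards), in the order given in the procedure above.
RESYNC_STEPS = [
    (["supervisorctl", "stop", "northstar:pceserver"], 10),
    (["supervisorctl", "start", "northstar:pceserver"], 0),
    (["supervisorctl", "restart", "northstar:toposerver"], 0),
]


def resync_pcs_and_toposerver(run=subprocess.check_call, sleep=time.sleep):
    """Run the resynchronization steps in order, stopping at the first failure."""
    for command, pause in RESYNC_STEPS:
        run(command)  # check_call raises CalledProcessError on a non-zero exit
        if pause:
            sleep(pause)  # give the PCC time to remove lingering LSPs


def dry_run():
    """Return the commands that would run, without executing anything."""
    planned = []
    resync_pcs_and_toposerver(run=lambda cmd: planned.append(" ".join(cmd)),
                              sleep=lambda seconds: None)
    return planned
```

Calling dry_run() returns the command list for review before the real procedure is started during a maintenance window.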

Disappearing Changes

Two options are available in the Web UI for synchronizing the topology with the live network. These options are only available to the system administrator, and can be accessed by first navigating to Administration > System Settings, and then clicking Advanced Settings in the upper right corner of the resulting window.

Note

The upper right corner button toggles between General Settings and Advanced Settings.

Figure 6 shows the two options that are displayed.

Figure 6: Synchronization Operations

Be aware that if you execute Reset Network Model in the Web UI, you lose the changes that you have made to the database. In a multi-user environment, one user might reset the network model without the knowledge of the other users. When a reset is requested, the request goes from the PCS to the Toposerver, and the PCS log reflects:

The Toposerver log then reflects that database elements are being removed:

The Toposerver then requests a synchronization both with the Junos VM, to retrieve the topology nodes and links, and with the PCEP server, to retrieve the LSPs. In this way, the Toposerver relearns the topology, but any user updates are missing. Figure 7 illustrates the flow from the topology reset request to the request for synchronization with the Junos VM and the PCEP server.

Figure 7: Reset Model Request

Upon receipt of the synchronization requests, Junos VM and the PCEP server return topology updates that reflect the current live network. The PCS log shows this information being added to the database:

Figure 8 illustrates the return of topology updates from the Junos VM and the PCEP Server to the Toposerver and the PCS.

Figure 8: Model Updates Using Reset Network Model

Use Reset Network Model when you want to start over from scratch with your topology. If you do not want to lose user planning data when synchronizing with the live network, execute the Sync Network Model operation instead. With this operation, the PCS still requests a topology synchronization, but the Toposerver does not delete the existing elements. Figure 9 illustrates the flow from the PCS to the Junos VM and PCEP server, and the updates coming back to the Toposerver.

Figure 9: Synchronization Request and Model Updates Using Sync Network Model

Investigating Client Side Issues

If you are looking for the source of a problem and cannot find it on the server side of the system, there is a debugging flag that can help you find it on the client side. The flag enables detailed messages on the web browser console about what has been exchanged between the server and the client. For example, you might notice that an update is not reflected in the Web UI. Using these detailed messages, you can identify possible miscommunication between the server and the client, such as the server not actually sending the update.

To enable this debug flag, modify the URL you use to launch the Web UI as follows:

Note

If you are already in the Web UI, it is not necessary to log out; simply add ?debug=true to the URL and press Enter. The UI reloads.
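If you build the Web UI address in a script or bookmark generator, the flag can be added programmatically. The Python sketch below appends debug=true to a URL while preserving any existing query parameters and fragment. The function name and the host name in the example are hypothetical; only the debug=true parameter comes from this guide.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_debug(url):
    """Return `url` with the debug=true query parameter set exactly once."""
    parts = urlsplit(url)
    # Keep every existing parameter except a previous debug value.
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "debug"]
    query.append(("debug", "true"))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    # Hypothetical host name, used for illustration only.
    print(with_debug("https://northstar.example.net/"))
```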

Figure 10 shows an example of the web browser console with detailed debugging messages.

Figure 10: Web Browser Console with Debugging Messages

The way you access the console varies by browser. Figure 11 shows how to access the console in Google Chrome.

Figure 11: Accessing the Google Chrome Console

Incomplete Results of the Bandwidth Sizing Scheduled Task

If execution of the bandwidth sizing scheduled task does not result in publishing statistics for all the bandwidth sizing-enabled LSPs, check whether traffic statistics are being collected for all of those LSPs for the scheduled duration. If traffic statistics are not available for an LSP, the bandwidth for that LSP cannot be resized.

You can use the NorthStar Collector web UI to determine whether traffic statistics are being collected:

  1. Open the Tunnel tab in the network information table.
  2. Select the LSPs that have not been resized.
  3. Right-click and select View LSP Traffic.
  4. Click custom in the upper left corner, provide the schedule duration, and click Submit.

Troubleshooting NorthStar Integration with HealthBot

If update device to HealthBot is failing in NorthStar, first check to see if there are errors in the NorthStar web application server logs:

The HealthBot API server logs might also provide helpful information if update device to HealthBot is failing:

To determine if RPM probe data and LDP demands statistics collection is working, access the IAgent container logs. IAgent is used for RPM data (link latency) and LDP demands statistics collection.

To determine if JTI LSP and interface statistics data collection is working, access the fluentd container logs. Native GPB is used for JTI data collection.

To determine if statistics data is being notified from the HealthBot server to the PCS, access the PCS logs to see live statistics notification information:

Collecting NorthStar Controller Debug Files

If you are unable to resolve a problem with the NorthStar Controller, we recommend that you forward the debug files generated by the NorthStar Controller debugging utility to JTAC for evaluation. Currently all debug files are located in subdirectories under the /u/wandl/tmp directory.

To collect debug files, log in to the NorthStar Controller CLI, and execute the command /u/wandl/bin/system-diagnostic.sh filename.

The output is generated and is available from the /tmp directory in the filename.tbz2 debug file.