HealthBot Time Series Database (TSDB)


HealthBot collects a large amount of data through its various ingest methods, and all of that data is time sensitive in some context. This is why HealthBot uses a time-series database (TSDB) to store and manage the information received from the various network devices. This topic provides an overview of the TSDB as well as some guidance on managing it.

Historical Context

In HealthBot releases prior to 3.0.0, there was one TSDB instance regardless of whether you ran HealthBot as a single node or as a multi-node (Docker-compose) installation. Figure 1 shows a high-level view of what this looked like.

Figure 1: Single TSDB Instance - Prior to HealthBot Release 3.0.0

This arrangement left no room for scaling or redundancy of the TSDB. Without redundancy, there is no high availability (HA); a failure leaves you with no way to continue operation or to restore missing data. Adding more Docker-compose nodes to this topology would only provide more HealthBot processing capability at the expense of TSDB performance.

TSDB Improvements

To address these issues and provide TSDB high availability (HA), HealthBot Release 3.0.0 introduces three new TSDB elements, along with clusters of HealthBot nodes* for the other HealthBot microservices:

Note

*HealthBot uses Kubernetes for clustering its Docker-based microservices across multiple physical or virtual servers (nodes). Kubernetes clusters consist of a primary node and multiple worker nodes. During the healthbot setup portion of HealthBot multi-node installations, the installer asks for the IP addresses (or hostnames) of the Kubernetes primary node and worker nodes. You can add as many worker nodes to your setup as you need, based on the required replication factor for the TSDB databases. The number of nodes you deploy should be at least the same as the replication factor. (See the following sections for details.)

For the purposes of this discussion, we refer to the Kubernetes worker nodes as HealthBot nodes. The primary node is not considered in this discussion.

Database Sharding

Database sharding refers to selectively storing data on certain nodes. This method of writing data to selected nodes distributes the data among available TSDB nodes and permits greater scaling since each TSDB instance then handles only a portion of the time series data from the devices.

To achieve sharding, HealthBot creates one database per device group/device pair and writes the resulting database to a specific (system-determined) instance of TSDB hosted on one (or more) of the HealthBot nodes.

For example, say we have two devices, D1 and D2, and two device groups, G1 and G2. If D1 resides in groups G1 and G2, and D2 resides only in group G2, then we end up with three databases: G1:D1, G2:D1, and G2:D2. Each database is stored on its own TSDB instance on a separate HealthBot node, as shown in Figure 2 below. When a new device is on-boarded and placed within a device group, HealthBot chooses a TSDB database instance on which to store that device/device-group data.
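As a rough illustration, the pairing described above can be sketched in Python. This is not HealthBot source code; the instance names are hypothetical, and the round-robin placement merely stands in for HealthBot's internal, system-determined choice of TSDB instance.

# Illustrative sketch only (not HealthBot code): one database is created per
# device-group/device pair, and each database is placed on a TSDB instance.
from itertools import cycle

device_groups = {"G1": ["D1"], "G2": ["D1", "D2"]}   # group -> member devices
tsdb_instances = ["tsdb-1", "tsdb-2", "tsdb-3"]      # hypothetical instance names

# One database per (group, device) pair: G1:D1, G2:D1, G2:D2
databases = [f"{group}:{device}"
             for group, devices in device_groups.items()
             for device in devices]

# Round-robin placement stands in for HealthBot's internal placement logic.
placement = dict(zip(databases, cycle(tsdb_instances)))
print(placement)   # {'G1:D1': 'tsdb-1', 'G2:D1': 'tsdb-2', 'G2:D2': 'tsdb-3'}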

Figure 2: Distributed TSDB

Figure 2, above, shows 3 HealthBot nodes, each with a TSDB instance and other HealthBot services running.

Note
  • A maximum of 1 TSDB instance is allowed on any given HealthBot node. Therefore, a HealthBot node can have 0 or 1 TSDB instances at any time.

  • A HealthBot node can be dedicated to running only TSDB functions. When this is done, no other HealthBot functions run on that node. This prevents other HealthBot functions from starving the TSDB instance of resources.

  • We recommend that you dedicate nodes to TSDB to provide the best performance.

  • HealthBot and TSDB nodes can be added to a running system using the HealthBot CLI.

Database Replication

As with any other database system, replication refers to storing the data in multiple instances on multiple nodes. In HealthBot, we establish a replication factor to determine how many copies of the database are needed.

A replication factor of 1 creates only one copy of the data and therefore provides no HA. When multiple HealthBot nodes are available and the replication factor is set to 1, only sharding is achieved.

The replication factor determines the minimum number of HealthBot nodes needed. A replication factor of 3 creates three copies of data, requires at least 3 HealthBot nodes, and provides HA. The higher the replication factor, the stronger the HA and the higher the resource requirements in terms of HealthBot nodes. If you want to scale your system further, add HealthBot nodes in exact multiples of the replication factor; with a replication factor of 3, for example, that means 3, 6, 9, and so on.

Consider an example where, based on the device/device-group pairing mentioned earlier, HealthBot has created 20 databases. The HealthBot system in question has a replication factor of 2 and has 4 nodes running TSDB. Based on this, two TSDB replication groups are created; in our example, they are TSDB Group 1 and TSDB Group 2. In Figure 3 below, the data from databases 1-10 is written to TSDB instances 1 and 2 in TSDB Group 1. Data from databases 11-20 is written to TSDB instances 3 and 4 in TSDB Group 2. The outline around the TSDB instances represents a TSDB replication group. The size of the replication group is determined by the replication factor.
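The grouping in this example can be sketched as follows. This is illustrative Python only, not HealthBot code, and the node and database names are hypothetical.

# Illustrative sketch only (not HealthBot code): forming replication groups
# from TSDB nodes and spreading databases across those groups.
replication_factor = 2
tsdb_nodes = ["tsdb-1", "tsdb-2", "tsdb-3", "tsdb-4"]   # hypothetical names
databases = [f"db-{i}" for i in range(1, 21)]           # 20 databases

# Consecutive nodes form one replication group; group size = replication factor.
groups = [tsdb_nodes[i:i + replication_factor]
          for i in range(0, len(tsdb_nodes), replication_factor)]
# groups -> [['tsdb-1', 'tsdb-2'], ['tsdb-3', 'tsdb-4']]

# Databases 1-10 land in group 1 and databases 11-20 in group 2; every write
# for a database goes to all instances in its group.
per_group = len(databases) // len(groups)
assignment = {db: groups[min(i // per_group, len(groups) - 1)]
              for i, db in enumerate(databases)}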

Figure 3: TSDB Databases

Database Reads and Writes

As shown in Figure 2, HealthBot can make use of a distributed messaging queue. In cases of performance problems or errors within a given TSDB instance, this allows database writes to be performed sequentially, ensuring that all data is written in the proper time sequence.

All HealthBot microservices use standardized database query (read) and write functions that can be used even if the underlying database system is changed at some point in the future. This allows for flexibility in growth and future changes. Other read and write features of the database system include:

  • In normal operation, database writes are sent to all TSDB instances within a TSDB group.

  • Database writes can be buffered up to 1GB per TSDB instance so that failed writes can be retried until successful.

  • If problems persist and the buffer fills up, the oldest data is dropped in favor of new data.

  • When buffering is active, database writes are performed sequentially so that new data cannot be written until the previous write attempts are successful.

  • Database queries (reads) are sent to the TSDB instance that has reported the fewest write errors in the last 5 minutes. If all instances are performing equally, the query is sent to a random TSDB instance in the required group, as sketched after this list.
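The query-routing rule in the last bullet can be sketched as follows. This is illustrative Python only; the function and instance names are hypothetical and not part of HealthBot.

# Illustrative sketch only (not HealthBot code): route a read to the TSDB
# instance in the group with the fewest write errors in the last 5 minutes;
# break ties with a random choice.
import random

def pick_read_instance(write_errors_last_5m):
    """write_errors_last_5m: dict of instance name -> recent write-error count."""
    fewest = min(write_errors_last_5m.values())
    candidates = [inst for inst, errors in write_errors_last_5m.items()
                  if errors == fewest]
    return random.choice(candidates)

print(pick_read_instance({"tsdb-1": 0, "tsdb-2": 3}))   # -> tsdb-1
print(pick_read_instance({"tsdb-1": 0, "tsdb-2": 0}))   # -> random pick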

Manage TSDB Options in the HealthBot GUI

TSDB options can be managed from within the HealthBot GUI. To do this, navigate to the Settings > System page from the left-nav bar, then select the TSDB tab from the left side of the page.

Best Practice

Adding, deleting, or dedicating TSDB nodes should be done during a maintenance window since some services will be restarted and the HealthBot GUI will likely be unresponsive while the TSDB work is performed.

Figure 4 below shows the current TSDB nodes and replication factor.

Figure 4: TSDB System Settings

Working with TSDB Nodes

  • In Figure 4 above, the field TSDB Nodes shows the currently defined TSDB instances in our HealthBot installation.
  • The pull-down list of nodes under TSDB Nodes displays all currently available HealthBot nodes.
  • Any operation you perform affects all the TSDB instances shown in the field; in this case, hosts 10.102.70.82 and 10.102.70.200.
  • You can delete a node from your HealthBot installation by clicking on the X next to the node IP or hostname that you want to remove. When you click SAVE & DEPLOY, that TSDB node is removed from HealthBot entirely.
Note

You cannot add HealthBot or TSDB nodes from the GUI in the current release of HealthBot. Refer to Add a TSDB Node to HealthBot to see the CLI procedure.

Change Replication Factor

  • Increase or decrease the replication factor as needed.
  • Click the SAVE & DEPLOY button to commit the change.

Dedicate a Node to TSDB

  • Use the pull-down menu to select the node that you want to dedicate.
  • Ensure that it is the only one in the TSDB Nodes list.
  • Click the dedicate slider so that it activates (turns blue).
  • Click the SAVE & DEPLOY button to commit the change.

Force Option

Normally, a failure in a TSDB instance can be overcome using the buffering methods described above.

In the event of a catastrophic failure of the server or storage hosting a TSDB instance, you can rebuild the server or the damaged component. However, if the replication factor is 1, the TSDB data for that instance is lost. In that case, you need to remove the failed TSDB node from HealthBot.

  • Select the X next to the damaged node from the TSDB Nodes field.
  • Click SAVE & DEPLOY.
  • If a problem occurs and the removal is not successful, click the force slider so that it activates (turns blue).

    This tells the system to ignore any errors encountered while adjusting the TSDB settings.

  • Click the SAVE & DEPLOY button to commit the change.

HealthBot CLI Configuration Options

The HealthBot CLI provides a means to add and delete TSDB nodes from the system and to change the replication factor as needed.

Add a TSDB Node to HealthBot

# set healthbot system time-series-database nodes <IP address or hostname> dedicate <true or false>

or

# set healthbot system time-series-database nodes [ space-separated list of IP addresses or hostnames ] dedicate <true or false>
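
For example, to add two nodes as dedicated TSDB nodes (the IP addresses shown here are placeholders):

# set healthbot system time-series-database nodes [ 10.1.1.1 10.1.1.2 ] dedicate true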

Manage the Replication Factor

# set healthbot system time-series-database replication-factor <replication-factor>

Set the replication factor in line with the number of TSDB nodes present in the system; the node count should be an exact multiple of the replication factor. With two TSDB nodes, for example, you can set the replication factor to 1 (sharding only) or 2 (one replication group).
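
For example, to set a replication factor of 2 (the value shown is illustrative):

# set healthbot system time-series-database replication-factor 2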

Usage Notes

  • HealthBot performs a ping to determine whether the new node(s) are reachable. A warning is shown if the ping fails.

  • The dedicate option specifies whether or not the TSDB nodes perform only TSDB functions.
