What is an NRE?

A network reliability engineer (NRE) is an IT operations role that applies an engineering approach to measuring and automating the reliability of the network to align with service-level objectives, agreements, and goals of the IT organization and business. The NRE’s practice is called network reliability engineering.

What does a Network Reliability Engineer do?

The proliferation of network automation technology is opening the eyes of network operators; however, the focus on tooling has mostly led to incremental automation of existing workflows. The NRE’s focus on engineering contrasts sharply with the traditional approach. It combines the tasks of a software engineer (building, testing, deploying, and operating) with those of a site reliability engineer (SRE) (implementing DevOps). NREs implement DevNetOps principles and behaviors to build a network pipeline.

While some of an NRE’s work involves operations tasks, such as performing upgrades, audits, and change requests and handling incidents, the main focus is on:

  • Building and deploying the network on a DevNetOps pipeline
  • Automating the handling of the network’s dynamics
  • Integrating systems
  • Automating workflows
  • Eliminating toil
  • Automating troubleshooting with proactive testing
  • Engineering reliability through automated response
  • Aligning error budgets and service-level objectives
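As a rough illustration of the last item, an error budget can be derived directly from a service-level objective. The sketch below is a minimal example under stated assumptions: the function names, the 30-day window, and the availability-style SLO are hypothetical choices for illustration, not part of any particular NRE toolchain.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) that the SLO tolerates over the window.

    slo_target is an availability fraction such as 0.999 (assumed SLO style).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # A 99.9% objective over 30 days tolerates roughly 43 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    print(round(budget_remaining(0.999, downtime_minutes=21.6), 2))
```

A team might use the remaining fraction to decide whether to keep shipping network changes or to pause and invest in reliability work.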

Network Reliability Engineering Behaviors

Behavior: Description

Codify: Beginning with the acquisition of network software and hardware systems (commonly called day 0), an NRE codifies the network software artifacts, secrets, and configuration into source-code repositories, much as a software developer would.

Automate: Using a DevNetOps pipeline, an NRE automates the integration of testing and reproducible, versioned deployments. Beyond the first deployment and update, an NRE also uses this pipeline to engineer in-production reliability, scale, efficiency optimizations, dynamic provisioning of networking resources for its consumers, and systems integration.

Test: Through automation, staging, stress testing, and chaos engineering, an NRE ensures that the deliveries are reliable enough to meet service-level objectives and agreements.

Monitor: An NRE monitors service-level indicators both manually and automatically, using analytics that trigger automated responses and alerts for anomalous and statistically meaningful events. Logs and telemetry are collected and analyzed to derive efficiency insights, plan capacity needs, and automate capacity on elastic cloud network infrastructure.

Measure: Finally, the NRE culture values truth and transparency, and uses indicators such as mean time between failures (MTBF) and mean time to repair (MTTR) to measure its effectiveness in meeting reliability goals.
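The two indicators named under Measure can be computed from a simple incident log. The following is a minimal sketch, assuming each incident is recorded with a start and end time; the `Incident` type and both function names are hypothetical, and MTBF is taken here as total uptime in the observation window divided by the number of failures.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """One outage, from detection (start) to restoration (end)."""
    start: datetime
    end: datetime


def mean_time_to_repair(incidents: list[Incident]) -> timedelta:
    """Average outage duration across the recorded incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((i.end - i.start for i in incidents), timedelta())
    return total / len(incidents)


def mean_time_between_failures(incidents: list[Incident],
                               window_start: datetime,
                               window_end: datetime) -> timedelta:
    """Uptime within the window divided by the number of failures."""
    if not incidents:
        raise ValueError("at least one incident is required")
    downtime = sum((i.end - i.start for i in incidents), timedelta())
    uptime = (window_end - window_start) - downtime
    return uptime / len(incidents)
```

Tracking both values over successive windows shows whether failures are becoming rarer (rising MTBF) and whether recovery is getting faster (falling MTTR).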

Benefits of Network Reliability Engineering

Reliability is the foremost value of NRE. While the speed of technology advancement and the speed of business are important, they are useless without a reliable foundation. Because DevNetOps principles value evolution and speed through small incremental changes, speed and the agility of an evolutionary architecture are often welcome by-products.

NREs gain a thorough understanding of how the network degrades and breaks under pressure, which provides opportunities to automate and document incident response. This encourages a proactive approach towards preventing production outages.

At the individual level, NREs report lower deployment anxiety and higher job satisfaction.

Overall, NREs establish simplicity in operations and management. In network operations, there are many variables to control, secure, and audit, resulting in massive complexity. NREs solve this complexity with a well-codified source of truth and automatic response leading toward a self-driving network.
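One way to picture a codified source of truth is as the reference against which the running network is audited. The sketch below is illustrative only: it assumes the intended and running configurations have already been flattened into key/value dictionaries, and the `config_drift` function and the example keys are hypothetical rather than taken from any real device API.

```python
def config_drift(intended: dict, running: dict) -> dict:
    """Return {key: (intended_value, running_value)} for every mismatch.

    A value of None on either side means the key is absent there.
    """
    drift = {}
    for key in sorted(intended.keys() | running.keys()):
        want = intended.get(key)
        have = running.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


if __name__ == "__main__":
    # Hypothetical flattened settings; real data would come from the
    # repository (intended) and from the device or controller (running).
    intended = {"ge-0/0/0.mtu": 9000, "bgp.asn": 65001}
    running = {"ge-0/0/0.mtu": 1500, "bgp.asn": 65001}
    for key, (want, have) in config_drift(intended, running).items():
        print(f"{key}: intended={want} running={have}")
```

An automated response could then re-apply the intended value or open an incident, depending on how much risk the change carries.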

What is the Relationship between NRE, SDN, and NFV?

You can apply network reliability engineering equally to networking hardware and software systems. Purely software-defined networking (SDN), whether Network Functions Virtualization (NFV) or SDN on the cloud, is easier to simulate and test because no network hardware lab or virtual lab is required. SDN control of hardware is also easier for NREs to implement: SDN systems automate and abstract the control and configuration of entire network architectures in their given domain, so NREs can avoid ‘reinventing the wheel’ where an SDN system already exists.

Because no bug-free systems exist, an NRE’s job is never done. Similar to the SRE and DevOps culture, the NRE and DevNetOps culture values an allowance for failure, resulting in quick fixes and lessons learned. Continuous improvement, or kaizen (a Japanese word for “good change”), isn’t about being in balance; it’s about recovering balance. And one cannot recover if one isn’t allowed to fail in the first place. [1] Research across many fields shows that this approach leads to better outcomes. As such, NREs aim for evolution, not perfection. In splitting time between engineering and operating, NREs are well apprised of failures, record lessons, and continually incorporate improvement into tooling and automated processes.