Senior Design Team sdmay21-36 • Self-healing in Networks Using ORBIT Testbed

Cellular network operators currently spend a large portion of their budget on repairing and resolving network problems and outages. The current method relies on human expertise to identify, diagnose, and resolve any issues with the network. This process is not only very costly but also slow to produce resolutions. As 5G networks roll out, their complexity and overall cell density will be too much for the current process to handle. To keep next-generation networks running, a new solution for resolving network issues must be created.

Our system uses an algorithm to detect a full or partial outage, diagnose the cause, and compensate for the outage. Automating these steps minimizes user downtime during an outage while keeping the cost of maintaining the network much lower.

To simulate a wireless network, we chose a radio testbed called ORBIT, specifically the sandbox9 testbed. This testbed has seven nodes that we are able to configure. Each node in the testbed is connected to every other node, but its interfaces can be configured so that a simulated network can be constructed. The network setup used for testing consisted of one central controller node, three server nodes, and three client nodes.

There are two main components of our project: the network topology and the self-healing algorithm.

Network Topology:

Our designed network consists of three clients, three servers, and one central controller. The clients represent everyday network users, the servers represent network base stations, and the central controller represents a monitoring service that can assess network state. We assume that the central controller is located away from the local nodes and can therefore be treated as part of the broader network. While this topology doesn’t exactly match the topology of a real-world wireless network, we believe that it is a model that can be used to easily test our self-healing algorithms. This topology represents the following real-world network:

In this network, there are users with three nearby base stations. In case of a failure, our algorithm attempts to reconnect the users to the internet efficiently. Although our topology is wired, we believe that it represents the wireless model above.

Self-Healing:

To simulate network traffic, we use UDP sockets to send traffic from the clients to the broader network and vice versa. These UDP sockets are configured on each port of the server so we know which client is connected to the broader network. Our algorithm has three parts: Detection, Diagnosis, and Compensation. The central controller constantly monitors the UDP connections so that the network state can be assessed. Once a client or server misses a check-in, we begin the self-healing process. First, the servers are polled and the current bandwidth capabilities of the available servers are calculated. If there is available bandwidth, the clients are redistributed to the nearest network node; if not, the clients are assigned to the server with the largest available bandwidth and are throttled. If there are no available servers, then we have a total network failure and no self-healing is possible.
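The detection and compensation steps described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual implementation: the names `Server`, `missed_check_in`, and `compensate`, the per-client `demand` table, the `distance` metric used for "nearest", and the timeout value are all assumptions made for this example. It also assumes that only the reassigned client is marked as throttled, which the description above leaves open.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    capacity: float                 # total bandwidth (hypothetical units)
    distance: dict                  # client name -> hypothetical distance metric
    clients: list = field(default_factory=list)
    alive: bool = True

    def available(self, demand):
        """Bandwidth left after serving the currently attached clients."""
        return self.capacity - sum(demand[c] for c in self.clients)


def missed_check_in(last_seen, now, timeout):
    """Detection: names of nodes whose last check-in is older than `timeout`."""
    return sorted(n for n, t in last_seen.items() if now - t > timeout)


def compensate(orphans, servers, demand):
    """Compensation: reassign orphaned clients to the remaining servers.

    Returns (assignment, throttled), where `assignment` maps each orphaned
    client to a server name and `throttled` is the set of clients placed on
    a server that lacks enough spare bandwidth for them.
    """
    live = [s for s in servers if s.alive]
    if not live:
        # No servers left: total network failure, nothing to heal.
        raise RuntimeError("total network failure: no servers available")

    assignment, throttled = {}, set()
    for client in orphans:
        fits = [s for s in live if s.available(demand) >= demand[client]]
        if fits:
            # Enough bandwidth somewhere: pick the nearest such server.
            target = min(fits, key=lambda s: s.distance[client])
        else:
            # Otherwise use the server with the most spare bandwidth
            # and mark the client as throttled.
            target = max(live, key=lambda s: s.available(demand))
            throttled.add(client)
        target.clients.append(client)
        assignment[client] = target.name
    return assignment, throttled


# Example: servers 2 and 6 have failed, leaving only server 3.
demand = {"c1": 10, "c2": 10, "c3": 10}
servers = [
    Server("server2", 25, {}, alive=False),
    Server("server3", 25, {"c1": 1, "c2": 2, "c3": 3}, clients=["c1"]),
    Server("server6", 25, {}, alive=False),
]
assignment, throttled = compensate(["c2", "c3"], servers, demand)
print(assignment)         # {'c2': 'server3', 'c3': 'server3'}
print(sorted(throttled))  # ['c3']
```

In this example, server 3 has room for one more client at full rate, so the second reassigned client is throttled. A real controller would also need to apply the new routes on the nodes, which this sketch does not attempt.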

To test the functionality of our project, we use two main scenarios. The first scenario, shown in the figure below, represents a failure of the connection between a server and the internet. In this case, we create an error in our system by removing the specified route with a Linux command issued from the central controller. The system responds by routing traffic from server 2 to server 3 and on to the internet. This represents a downstream connection failure of the server to the broader network.

The second scenario, shown in the figure below, removes the functionality of an entire node or nodes. This is done either by removing all connections to the node or simply by turning the node off. As in the previous test, the central controller reroutes any traffic through the available nodes. In the picture below, server nodes 2 and 6 are non-functioning and traffic from all client nodes is routed through server 3. This test represents a complete failure of a server or base station, which can result from power loss, antenna damage, etc.