Predicting Outages in Radio Networks with Alarm Data
04 June 2018
Modern cellular networks are complex systems that offer a wide range of services, which makes anomalous events difficult to detect when they occur. The networks are engineered for high reliability; hence, the data they produce is predominantly normal, with only a small proportion being anomalous. From an operations perspective, it is important to detect these anomalies in a timely manner in order to mitigate them and prevent major failure events. In this context, statistical and machine learning techniques play an important role in predicting upcoming events and prescribing solutions to them. Our goal in this paper is to anticipate upcoming outages that have a large impact on network service days before they occur. We propose a statistically rigorous methodology for detecting anomalies in alarm generation frequencies. Our approach is based on the assumption that an anomalous increase in alarm frequency serves as a precursor to a failure. This novel methodology can be applied to any network element that generates alarms, as well as to systems beyond networks. The anomaly detection tool (ADT) builds a statistical model of the alarm generation process for each alarm type of each element, using a Poisson process. The Poisson model makes it possible to profile alarm generation rates during normal, healthy periods. Real-time deviations from the learned rates are flagged as anomalies; of these, only those corresponding to an increased alarm frequency are of interest. ADT ranks each element's propensity to go into outage with an anomaly scoring system, which makes it possible to identify the network elements with the highest propensity to fail. We apply the tool to a test case of alarms generated by Base Station Controllers (BSCs) in a 2G network. A BSC controls hundreds of Base Transceiver Station (BTS) elements, to which thousands of mobile devices are connected.
Our evaluation demonstrates that ADT detects outages at least one day before they occur, allowing operations teams to take proactive steps to resolve problems and prevent outages. We discuss the merits of our approach and offer a nuanced view of the network's behavior in cases where detection fails.
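The Poisson profiling and scoring idea summarized above can be illustrated with a minimal sketch. It is not ADT's actual implementation: the function names, the fixed-length counting intervals, the upper-tail p-value test, the score defined as the negative base-10 logarithm of that p-value, and the threshold `alpha` are all assumptions made for illustration.

```python
import math
import sys


def estimate_rate(healthy_counts):
    """Maximum-likelihood Poisson rate: mean alarm count per interval
    observed during a healthy period (interval length is assumed fixed)."""
    if not healthy_counts:
        raise ValueError("need at least one healthy interval")
    return sum(healthy_counts) / len(healthy_counts)


def _log_pmf(i, lam):
    """Log of the Poisson probability mass at i for rate lam > 0."""
    return -lam + i * math.log(lam) - math.lgamma(i + 1)


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam): chance of seeing k or more alarms."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    if k <= lam:
        # Tail is large here, so 1 - P(X < k) is numerically safe.
        lower = sum(math.exp(_log_pmf(i, lam)) for i in range(k))
        return max(0.0, 1.0 - lower)
    # For k > lam the terms shrink monotonically; sum until negligible.
    total, i = 0.0, k
    while True:
        term = math.exp(_log_pmf(i, lam))
        total += term
        if term <= 1e-17 * total:
            return total
        i += 1


def rank_elements(healthy_history, current_counts, alpha=1e-3):
    """Score each (hypothetical) element key by -log10 of its upper-tail
    p-value and return (key, score, flagged) sorted by descending score.
    Only increases in alarm frequency can produce a small p-value."""
    ranked = []
    for key, history in healthy_history.items():
        lam = estimate_rate(history)
        p = poisson_upper_tail(current_counts.get(key, 0), lam)
        score = -math.log10(max(p, sys.float_info.min))
        ranked.append((key, score, p < alpha))
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```

For example, `rank_elements({"BSC-A": [2, 3, 1, 2]}, {"BSC-A": 10})` would flag the hypothetical element `BSC-A`, since ten alarms in one interval is very unlikely under a learned rate of two per interval. A real deployment would also need to handle seasonality and overdispersion, which this sketch ignores.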