Can data science and machine learning unlock the value of big data?
This blog post is by Kimmo Hätönen at Nokia Networks. Twitter: @nokianetworks
In his recent blog post “When big data rubber meets the telecom road”, Claudio Frascoli discussed the hurdles that operators face today when they want to make use of big data. The barriers that prevent them from proceeding with their plans are not only legal in nature but also technological. Consider Wikipedia’s definition: “Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.” This points to the importance of data analytics: the ways we extract information from the data, process it, and present it for decision-making.
Data analytics methodologies derive from various research areas such as mathematics, statistics, signal processing, computer science, artificial intelligence, knowledge discovery, and data engineering and warehousing. These methodologies have recently been combined under the "data science" umbrella. By coming together, researchers can benefit from innovations made in other areas that study the same data-related problems. Data science develops methods and technologies that can be used to build data analytics and machine learning tools for network analysis and management tasks. These tools can help build a bridge from operator data assets to profitable use cases.
Model-based approximation, recognition, and estimation technologies form the basis for adaptive functions that can be used to build machine learning applications. These are at the heart of automated management functions. Machine learning applications can be designed either to learn directly from the data or to support an expert in analyzing the data and integrating his or her own knowledge, thereby enhancing the system’s knowledge base.
Teaching old machines new tricks
Machine learning technologies come in many flavors, of which supervised and unsupervised learning methods are worth mentioning. Supervised methods are the right choice if one knows exactly what the computer should learn to recognize, where in the data to find multiple samples of it, and which variables carry the information about it. Let's take an old problem as an example: locating sleeping cells, that is, cells that neither show any activity nor send alarms. A neural network can be taught to identify them if we know when and where in the network history cells were in such a condition and how to find the data collected from them during those periods. We can then attach labels to these data samples and have the machine learn the characteristics that separate them from the other samples, which represent the remaining network history.
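To make the supervised idea concrete, here is a minimal sketch in Python. It is not a production detector and not a full neural network: it trains a single logistic neuron, the simplest possible stand-in, on a handful of labeled samples. The two features (traffic relative to the cell's usual level, and alarms per hour), the sample values, and all function names are assumptions made up for illustration.

```python
import math

def predict_proba(weights, bias, features):
    """Probability that a sample is a sleeping cell, from one logistic neuron."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    z = max(-60.0, min(60.0, z))  # clamp to keep math.exp well within range
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, labels, epochs=2000, learning_rate=0.5):
    """Fit the neuron's weights by batch gradient descent on the log loss."""
    n_samples, n_features = len(samples), len(samples[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for features, label in zip(samples, labels):
            error = predict_proba(weights, bias, features) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n_samples
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias

def is_sleeping(weights, bias, features, threshold=0.5):
    """Classify one sample with the trained neuron."""
    return predict_proba(weights, bias, features) >= threshold

# Hypothetical labeled history, one row per cell and hour:
# [traffic relative to the cell's usual level, alarms per hour]
history = [
    [0.00, 0.0], [0.02, 0.0], [0.01, 0.0], [0.03, 0.0],   # known sleeping periods
    [0.90, 0.0], [1.10, 0.1], [0.70, 0.0],                # normal operation
    [1.00, 0.2], [0.50, 0.3], [0.80, 0.0],
]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # 1 = sleeping, 0 = normal

weights, bias = train(history, labels)
print(is_sleeping(weights, bias, [0.01, 0.0]))  # silent cell, no alarms
print(is_sleeping(weights, bias, [0.95, 0.0]))  # cell carrying usual traffic
```

The labels are the crucial input here: without knowing which historical periods really were sleeping cells, there is nothing for the neuron to learn from. A real deployment would need far more samples, more features, and a model validated on data it has not seen.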
Unsupervised methods, on the other hand, are good when we want to find out what is going on in the network. They can reveal trends, correlations, clusters and other kinds of information that can be used to parse and interpret phenomena in the network. In my doctoral thesis*, for example, I used an unsupervised algorithm to identify repeating event combinations in telco logs. When the most common event combinations are removed, log files often shrink by 70-90%. In other studies, we have shown how it is possible to identify network element, fault condition, and IP connection profiles that cover the majority of the studied population.
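The log-shrinking idea can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm from the thesis: it counts event combinations by brute force over small windows and then drops the events covered by a frequent combination. The event names, the windowing, and the parameters `min_support` and `max_size` are assumptions chosen for the example.

```python
from collections import Counter
from itertools import combinations

def frequent_combinations(windows, min_support, max_size=3):
    """Return event combinations that occur in at least min_support windows."""
    counts = Counter()
    for events in windows:
        unique = sorted(set(events))  # count each combination once per window
        for size in range(2, max_size + 1):
            counts.update(combinations(unique, size))
    return {combo for combo, n in counts.items() if n >= min_support}

def compress(windows, frequent):
    """Drop events covered by a frequent combination, keeping the rare ones."""
    compressed = []
    for events in windows:
        present = set(events)
        covered = set()
        for combo in frequent:
            if present.issuperset(combo):
                covered.update(combo)
        compressed.append([e for e in events if e not in covered])
    return compressed

# Hypothetical log, already grouped into time windows of event types.
windows = [
    ["link_down", "link_up", "sync_lost"],
    ["link_down", "link_up"],
    ["link_down", "link_up", "auth_fail"],
    ["link_down", "link_up"],
    ["disk_full"],
]

frequent = frequent_combinations(windows, min_support=3)
compressed = compress(windows, frequent)

before = sum(len(w) for w in windows)
after = sum(len(w) for w in compressed)
print(f"frequent combinations: {sorted(frequent)}")
print(f"entries before: {before}, after: {after}")
```

No labels are needed: the recurring link_down/link_up pair is found purely from how often it appears, and what remains are the unusual events an operator would want to look at. The frequent combinations themselves should be kept as a summary, so no information is lost. The toy numbers say nothing about real compression ratios, and brute-force counting would not scale to real logs.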
Major innovations in these methodologies occurred at the beginning of this millennium. In those days there were many enthusiastic young people applying new ideas to telecommunications tools and operations. But the ideas never took off. Problems with data collection and storage required all the available resources, there wasn’t enough computing power in servers, and no one was willing to pay for more powerful machines.
How to unlock the value of big data
But, oh, how times have changed! The big data boom promises great value in the detailed information hidden in operator data sets. However, only a select few big data analyst firms understand telco networks and the characteristics of network data. Even fewer are capable of interpreting the data, which mirrors the rhythm of society with strong daily, weekly and yearly cycles. Each cycle involves hundreds of causally dependent variables. Two similarly behaving network elements with the same configuration in equal surroundings can very seldom be found, and the data is collected from a network that evolves dynamically from day to day, week to week and year to year.
All networks are unique. Their dynamic nature calls for more dynamic methods to model the network data. In many cases, one can build an expressive action model for one network, but the great difficulties begin when that model is transferred to another network or the original network is upgraded. Such dynamic models can be developed only if there are large data sets over long periods of time with simultaneous configuration history and a maintenance log, documenting at least the major problems and upgrades that occurred in the network. Without the history of what has happened, it is impossible for a data scientist or domain expert to know if the findings make any sense at all.
One of the earliest discoveries, back when extensive research on knowledge discovery and data mining began in the mid-1990s, was that several competencies are required in order to properly develop an analysis use case for a certain domain. In banking, for example, the competencies needed include data mining and knowledge discovery, banking and banking business, data engineering and data warehousing, software development and, finally, a competent business analyst. This result still holds today: a set of multiple competencies is a prerequisite for any successful big data analysis project.
* Kimmo Hätönen. Data mining for telecommunications network log analysis. Ph.D. thesis, Dept. of Computer Science, P.O. Box 26, FIN-00014 University of Helsinki, Finland, January 2009.