Import2vec: Learning embeddings for Software Libraries

01 January 2019

We consider the problem of developing suitable machine learning representations for software libraries. Other fields that rely heavily on machine learning have shown that such representations are key to the performance of downstream learning tasks. For instance, in natural language processing (NLP), word embeddings ("word vectors") make it easier for machine learning algorithms to perform classification and transduction tasks on sentences. We apply techniques from NLP to train embeddings for software libraries ("library vectors"), where libraries are identified by their import statements in source code. Experimental results obtained from training such embeddings on three large open source software corpora reveal that library vectors capture semantically meaningful relationships among software libraries. Examples include the relationship between frameworks and their plug-ins, and libraries commonly used together within ecosystems such as big data infrastructure projects (in Java), front-end and back-end web development frameworks (in JavaScript), and data science toolkits (in Python). We demonstrate that the trained library embeddings are useful for downstream tasks such as building a contextual search engine.
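
As a rough illustration of the idea of identifying libraries by their import statements and comparing them by the company they keep, the sketch below extracts imports from Python source files and builds simple co-occurrence count vectors. This is a simplified stand-in, not the training procedure of the paper: it uses raw counts rather than learned embeddings, and the function names (`extract_imports`, `cooccurrence_vectors`, `cosine`) and the four-file toy corpus are hypothetical, introduced only for this example.

```python
import ast
import math
from collections import defaultdict
from itertools import permutations


def extract_imports(source):
    """Return the set of top-level library names imported by a Python source string."""
    libs = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import os.path" counts as the library "os"
            libs.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute "from x.y import z" counts as "x"; relative imports are skipped
            libs.add(node.module.split(".")[0])
    return libs


def cooccurrence_vectors(files):
    """Map each library to a sparse vector counting libraries imported in the same file."""
    vectors = defaultdict(lambda: defaultdict(int))
    for source in files:
        for a, b in permutations(extract_imports(source), 2):
            vectors[a][b] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(count * v.get(lib, 0) for lib, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


# Hypothetical toy corpus: each string stands in for one source file.
files = [
    "import numpy\nimport pandas\nimport sklearn\n",
    "import numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\n",
    "import flask\nimport sqlalchemy\n",
    "from flask import Flask\nimport sqlalchemy\nimport jinja2\n",
]

vectors = cooccurrence_vectors(files)

# Libraries that appear in similar import contexts score higher than unrelated ones.
print(cosine(vectors["sklearn"], vectors["matplotlib"]))
print(cosine(vectors["sklearn"], vectors["flask"]))
```

On this toy corpus, `sklearn` and `matplotlib` never appear in the same file but share the same neighbours (`numpy` and `pandas`), so their vectors are close, while `sklearn` and `flask` share none. A learned embedding, for example a skip-gram model trained on the same per-file import sets, would replace the raw count vectors with dense low-dimensional ones.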