Engineering a Set Intersection Counting Algorithm for Text Similarity Metrics
18 January 2017
In real-time text analytics systems, it is often the case that a list of semantically important phrases must be quickly extracted from incoming documents: e.g., emails. A bottleneck in determining this list is often the computation of mutual information scores for each pair of phrases in the document. This requires, for each pair of phrases, the computation of the number of documents in a reference corpus that contain both phrases. Sketching techniques have been proposed in the literature for approximating such set intersection counting queries, but these techniques introduce additional noise that can negatively impact the quality of the returned list of important phrases. In this paper we provide experimental evidence showing that real-time computation of exact mutual information scores is possible. To show this, we analyse three large data sets that could potentially be used as a reference corpus in a real-time system. We then develop a series of benchmarks for comparing set intersection counting data structures on these data sets, and compare several different exact algorithms from the literature. We find that by designing algorithms to exploit the distribution of phrases among documents in a typical reference corpus, we can achieve a speedups factor of between 10 and 50 for batch query processing over these naive algorithms.