Taxonomy-based Supervised Topic Labeling

08 August 2016

The rise of AI-assisted applications, bot-based customer support, and personal assistants is imposing ever-higher accuracy requirements on interpreting and contextualizing text data. A fundamental building block in the automated understanding of text is assigning a topic label to a document at an appropriate level of granularity. The topic label should generalize the entities in the document without being too generic. State-of-the-art solutions to this problem use unsupervised methods that either do not leverage the taxonomy structure or model the taxonomy as an undirected graph. Undirected paths mix hypernym and hyponym edges arbitrarily and are often a poor indicator of semantic relatedness. As a result, the labels generated by these approaches struggle to achieve high accuracy. We propose novel directional traversal measures based on modeling the taxonomy as a directed acyclic graph. In addition, we leverage information-theoretic measures based on mutual information. We combine our novel graph-theoretic and information-theoretic measures with existing measures (e.g., content-based) by using them as features in a supervised learning approach. Our evaluation on Amazon Mechanical Turk shows that, on a range of document corpora, the topic labels generated by our supervised method are significantly more accurate than those of baseline state-of-the-art approaches from the literature.
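
To make the contrast between undirected and directed traversal concrete, here is a minimal Python sketch on a toy taxonomy. It is an illustration only and does not reproduce the measures proposed in the paper: the taxonomy `PARENTS`, the function names, and the "total upward hops" score are all assumptions invented for this example.

```python
from collections import deque

# Hypothetical toy taxonomy (not from the paper): child -> hypernyms (parents).
PARENTS = {
    "poodle": ["dog"],
    "beagle": ["dog"],
    "dog": ["mammal", "pet"],
    "cat": ["mammal", "pet"],
    "goldfish": ["fish", "pet"],
    "mammal": ["animal"],
    "pet": ["animal"],
    "fish": ["animal"],
    "animal": [],
}


def up_distances(node):
    """BFS along hypernym edges only; returns {ancestor: hops}, with node at 0."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        cur = queue.popleft()
        for parent in PARENTS.get(cur, []):
            if parent not in dist:
                dist[parent] = dist[cur] + 1
                queue.append(parent)
    return dist


def undirected_distance(a, b):
    """Shortest path when edge direction is ignored (the undirected-graph view)."""
    neighbors = {}
    for child, parents in PARENTS.items():
        for parent in parents:
            neighbors.setdefault(child, set()).add(parent)
            neighbors.setdefault(parent, set()).add(child)
    dist = {a: 0}
    queue = deque([a])
    while queue:
        cur = queue.popleft()
        if cur == b:
            return dist[cur]
        for nxt in neighbors.get(cur, ()):
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return None


def candidate_labels(entities):
    """Rank categories reachable upward from every entity.

    The score is the total number of upward hops (an assumed, simplistic
    stand-in for a directional measure); lower means more specific.
    """
    per_entity = [up_distances(e) for e in entities]
    common = set(per_entity[0])
    for dist in per_entity[1:]:
        common &= set(dist)
    return sorted((sum(d[c] for d in per_entity), c) for c in common)


# "mammal" and "pet" are only 2 undirected hops apart (via "dog"),
# yet neither generalizes the other when direction is respected.
print(undirected_distance("mammal", "pet"))   # 2
print("pet" in up_distances("mammal"))        # False

# The most specific shared generalization ranks first; the root ranks last.
print(candidate_labels(["poodle", "beagle"]))
# [(2, 'dog'), (4, 'mammal'), (4, 'pet'), (6, 'animal')]
```

In this toy setting, the undirected path between "mammal" and "pet" descends to a shared hyponym and climbs back up, which is the kind of mixed hypernym/hyponym path the abstract describes as a poor indicator of relatedness. The upward-only ranking also reflects the granularity goal: "dog" covers both entities and scores better than the overly generic "animal". In a supervised setup like the one described, a score of this kind would be just one feature among several.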