Multi-character Multi-pattern Pattern Matching for High Speed Applications

01 January 2006

New Image

Pattern Matching Algorithms are widely used in information retrieval and content inspection applications. Efficient single and multiple pattern matching has been a well-studied problem and the Aho-Corasick algorithm was one of the earliest multi- pattern algorithms proposed that achieves a linear time search. As applications increasingly run on custom hardware for higher speeds, it is essential that the algorithms have high throughput and also reduced memory requirements. When viewed specifically, the Virus/Worm detection application using signature matching would currently require detection capabilities of around 10Gbps in metro and enterprise networks and of around 40Gbps when deployed in the core of the network. Most of the algorithms proposed in literature cannot achieve these data line rates as they can process only one character per transition in the data structure, or equivalently one character per clock cycle. In this paper, we propose a novel multi-pattern matching algorithm based on the Aho-Corasick Automaton that can process multiple characters in every transition and hence can give us significantly higher average throughput. We also propose an efficient TCAM- based hardware architecture for its implementation. We also propose some novel state and transition optimizations schemes that can achieve significant memory reductions when implemented in hardware. As pattern matching is increasingly being applied to match regular expressions, we also provide an extension to the algorithm that can match patterns having fixed-length and variable-length wildcards. We prove the theoretical correctness of our algorithm and also present extensive simulation results for virus signature detection using real virus databases to study the effectiveness of our scheme.