Principle of multimode string matching algorithm and Java implementation code
Multimode string matching algorithm here refers to the problem of finding multiple pattern character strings in a string. Generally speaking, given a long string and many short pattern strings, how to find out which pattern strings appear in the long string is what we need to think about. The algorithm is widely used in keyword filtering, intrusion detection, virus detection, word segmentation and so on. Multimode problems generally include trie tree, AC algorithm, WM algorithm and so on.
background
In practical work, the simplest and most commonly used natural language processing method is keyword matching. For example, we want to filter n texts, which itself is a filtered vocabulary. The usual filtering codes are as follows
If the number of text is n and the number of filter words is k, the complexity is O (NK); If the number of keywords is large, the efficiency of sub branches is very low.
In computer science, the ahoccorasick algorithm was developed by Alfred v Aho and Margaret J The string search algorithm invented by Corasick is used to match substrings in a limited set of "dictionaries" in an input string. The difference between it and ordinary string matching is that it matches all dictionary strings at the same time. The algorithm has approximately linear time complexity in the case of equal sharing, which is about the length of the string plus the number of all matches. However, since all matching numbers need to be found, if each substring matches each other (for example, the dictionary is a, AA, AAA, AAAA, and the input string is AAAA), the time complexity of the algorithm will be similar to the matching quadratic function.
principle
In general, keyword matching is carried out for a text, and each keyword should be calculated one by one in the matching process. In other words, every time a keyword is matched, the document must be scanned again from the beginning to the end. The idea of AC automata is to cache the following three situations through the thesaurus at the beginning:
Jump according to character transfer success (success table)
Jump according to character transfer failure (fail table)
Match successful output table (output table)
Therefore, in the process of matching, there is no need to match from the beginning of the document, but jump directly through the cache, so as to achieve approximately linear time complexity.
structure
The construction process is divided into three steps: success table, fail table and output table. The output table is supplemented in the construction of sucess and fail tables. The fail table is one-to-one and the output table is one to many.
Jump according to character transfer success (success table)
The sucess table is actually a trie tree, which is constructed in the same way as the trie tree. I won't repeat it here.
Jump according to character transfer failure (fail table)
Let the letter on this node be C and walk along his father's failure pointer until he reaches a node. His son also has a node with the letter C. Then point the failure pointer of the current node to the son whose letter is also C. If you can't find it all the way to root, point the failure pointer to root. Using breadth first search BFS, hierarchical traversal nodes are used to deal with the failure path of each node.
Match successful output table (output table)
matching
For example, add the keywords he, she,, his, hers in order. During matching ushers. First build three tables, as shown in the figure below. The solid line is the sucess table, the dotted line is the fail table, and the word behind the node is the output table.
code
summary
The above is all about the principle of multimode string matching algorithm and Java implementation code in this paper. I hope it will be helpful to you. Interested friends can continue to refer to this website:
Detailed example of Java Monte Carlo algorithm for approximate value of PI
Complete code example of red black tree implemented by Java algorithm
Java implementation of a variety of sorting algorithm code examples