From trie tree to double array trie tree
Trie tree
principle
Also known as word lookup tree, trie tree is a tree structure and a variant of hash tree. Its advantages are: using the common prefix of the string to reduce the query time, minimize the unnecessary string comparison, and realize the insertion and query operation in the constant time o (len). It is a data structure that exchanges space for time. It is widely used in the field of word frequency statistics and input statistics.
Let's see what trie tree looks like. Let's find a picture from Baidu:
When searching the dictionary tree, first check whether the first word is in the dictionary tree. If it continues down, if it is not, it does not exist in the dictionary. Therefore, for a string with length len, you can complete the query in O (len) time.
Implement trie tree
How to implement trie tree, The key of trie tree is that a node should be in o (1) Time jumps to the next level node, so the linked list method is not advisable. It is best to use an array to store the next level node. The problem is that if it is pure English letters, an array with a length of 26 can be solved. For the number of N nodes, n arrays with a length of 26 are required. However, if it contains Chinese characters, n arrays with a length of 65535 are required, which especially takes up storage space. When However, you can consider using a map to store subordinate nodes.
Define a node, including the character word of the node, as well as the values of the child node nexts and possible attachments of the node:
public static class Node<T> {
Character word;
List<T> values;
Map<Character,Node> nexts = new HashMap<>(24);
public Node() {
}
public Node(Character word) {
this.word = word;
}
public Character getWord() {
return word;
}
public void setWord(Character word) {
this.word = word;
}
public void addValue(T value){
if(values == null){
values = new ArrayList<>();
}
values.add(value);
}
public List<T> getValues() {
return values;
}
public Map<Character,Node> getNexts() {
return nexts;
}
/**
* @param node
*/
public void addNext(Node node) {
this.nexts.put(node.getWord(),node);
}
public Node getNext(Character word) {
return this.nexts.get(word);
}
}
To see how to build a dictionary tree, first define a tree that contains the root node
public static class Trie<T> {
Node<T> rootNode;
public Trie() {
this.rootNode = new Node<T>();
}
public Node<T> getRootNode() {
return rootNode;
}
}
Build the tree, split it into words, and then build the tree level by level.
public static class TrieBuilder {
public static Trie<String> buildTrie(String... values){
Trie<String> trie = new Trie<String>();
for(String sentence : values){
// 根节点
Node<String> currentNode = trie.getRootNode();
for (int i = 0; i < sentence.length(); i++) {
Character character = sentence.charAt(i);
// 寻找首个节点
Node<String> node = currentNode.getNext(character);
if(node == null){
// 不存在,创建节点
node = new Node<String>(character);
currentNode.addNext(node);
}
currentNode = node;
}
// 添加数据
currentNode.addValue(sentence);
}
return trie;
}
Trie tree application
For example, it is very simple to judge whether a word is in the dictionary tree, matching level by level, and finally judge whether the last node contains data:
public boolean isContains(String word) {
if (word == null || word.length() == 0) {
return false;
}
Node<T> currentState = rootNode;
for (int i = 0; i < word.length(); i++) {
currentState = currentState.getNext(word.charAt(i));
if (currentState == null) {
return false;
}
}
return currentState.getValues()!=null;
}
Test code:
public static void main(String[] args) {
Trie trie = TrieBuilder.buildTrie("刘德华","刘三姐","刘德刚","江姐");
System.out.println(trie.isContains("刘德华"));
System.out.println(trie.isContains("刘德"));
System.out.println(trie.isContains("刘大大"));
}
result:
true
false
false
double-array trie
During the implementation of trie number, we found that each node needs an array to store the next node, which takes up a lot of storage space and has a large space complexity. The double array trie tree is the solution to this problem. Double array trie tree is a kind of trie tree with low spatial complexity, which is applied to the field of word segmentation in languages with large character range (such as Chinese, Japanese, etc.).
principle
The principle of double array is that the trie tree that needs multiple arrays to be represented can be stored with two data, which can greatly reduce the space complexity. Specifically:
Two arrays base and check are used to maintain the trie tree. Base is responsible for recording the state, and check is responsible for checking whether each string is transferred from the same state. When check [i] is negative, it indicates that this state is the end of the string.
The above is a bit abstract. For example, suppose the values of the two words TA, TB, base and check meet the following conditions: base [t] + a.code = base [TA] base [t] + b.code = base [TB] Check [TA] = check [TB]
The two arrays will be modified during the insertion of each node. Specifically:
1. Initialize root node base [0] = 1; check[0] = 0;
2. For each group of sibling nodes, find a begin value so that check [begin + A1... An] = = 0, that is, n free spaces are found, and A1... An is the code corresponding to N nodes in siblings.
3. Then set the check of this group of sibling nodes as check [begin + A1... An] = begin
4. Then, for each sibling node, if it has no children, make its base negative; Otherwise, it is the insertion position of the child node of the node (that is, the begin value), and the child node is inserted at the same time (the iteration jumps to step 2).
码表:
胶 名 动 知 下 成 举 一 能 天 万
33014 21517 21160 30693 19979 25104 20030 19968 33021 22825 19975
DoubleArrayTrie{
char = × 一 万 × 举 × 动 × 下 名 × 知 × × 能 一 天 成 胶
i = 0 19970 19977 20032 20033 21162 21164 21519 21520 21522 30695 30699 33023 33024 33028 40001 44345 45137 66038
base = 1 2 6 -1 20032 -2 21162 -3 5 21519 -4 30695 -5 -6 33023 3 1540 4 33024
check= 0 1 1 20032 2 21162 3 21519 1540 4 30695 5 33023 33024 6 20032 21519 20032 33023
size=66039,allocSize=2097152,key=[一举,一举一动,一举成名,一举成名天下知,万能,万能胶],keySize=6,progress=6,nextCheckPos=33024,error_=0}
First floor: I [19968], 10000 [19975] base [i] = base [0] + 19968-19968 = 1 base [10000] = base [0] + 19975-19968=
realization
Refer to the open source project of Java implementation of double array trie tree: https://github.com/komiya-atsushi/darts-java
Double array trie + AC automata
See: http://www.hankcs.com/program/algorithm/aho-corasick-double-array-trie.html
Combined with AC automata + double array trie tree: AC automata can complete multi-mode matching at high speed, but whether the specific implementation is smart or not determines the final performance. Most implementations are done with a map < character, state >. Whether it is the logarithmic complexity of treemap, the huge spatial complexity of HashMap and the performance consumption of hash function, the overall performance will be reduced.
The dual array trie tree can complete single string matching at high speed o (n), and the memory consumption is controllable. However, the weakness lies in multi pattern matching. If you want to match multiple pattern strings, you must first implement prefix query, and then frequently intercept text suffixes before multiple matching. In this way, a text needs to be back scanned many times, and the performance is very low.
If we can express AC automata with double array trie tree, we can combine the advantages of the two and get an almost perfect data structure. In my java implementation, I call it aho corasickdouble array trie, which supports generics and persistence. I like it very much.