From edit distance, BK tree to text error correction
There is a very important topic in search engines, which is text error correction. There are two main methods: one is to correct errors from the dictionary, and the other is to analyze user search logs. Today, we discuss the use of dictionary based error correction. The core idea is to use BK tree based on editing distance. Let's discuss it one by one:
Edit distance
In 1965, Vladimir Levenshtein, a Russian scientist, made a clear definition of string similarity called Levenshtein Distance, which we usually call "editing distance".
The editing distance from string a to B refers to the minimum number of steps required to change a into B with only three operations: insert, delete and replace. For example, from fame to gate requires two steps (two substitutions), and from game to ACM requires three steps (delete g and E and add C). Levenshtein gives a general solution to the editing distance, which is a classical dynamic programming problem that everyone is very familiar with.
class LevenshteinDistanceFunction {
private final boolean isCaseSensitive;
public LevenshteinDistanceFunction(boolean isCaseSensitive) {
this.isCaseSensitive = isCaseSensitive;
}
public int distance(CharSequence left,CharSequence right) {
int leftLength = left.length(),rightLength = right.length();
// special cases.
if (leftLength == 0)
return rightLength;
if (rightLength == 0)
return leftLength;
// Use the iterative matrix method.
int[] currentRow = new int[rightLength + 1];
int[] nextRow = new int[rightLength + 1];
// Fill first row with all edit counts.
for (int i = 0; i <= rightLength; i++)
currentRow[i] = i;
for (int i = 1; i <= leftLength; i++) {
nextRow[0] = i;
for(int j = 1; j <= rightLength; j++) {
int subDistance = currentRow[j - 1]; // Distance without insertions or deletions.
if (!charEquals(left.charAt(i - 1),right.charAt(j - 1),isCaseSensitive))
subDistance++; // Add one edit if letters are different.
nextRow[j] = Math.min(Math.min(nextRow[j - 1],currentRow[j]) + 1,subDistance);
}
// Swap rows,use last row for next row.
int[] t = currentRow;
currentRow = nextRow;
nextRow = t;
}
return currentRow[rightLength];
}
}
BK tree
The classic application of editing distance is for spelling error detection. If the words entered by the user are not in the dictionary, automatically find the words with editing distance less than a certain number n from the dictionary, and let the user choose the correct one. N is usually taken as 2 or 3.
The difficulty of this problem is how to quickly find the most similar words in the dictionary? You can use Bayes to do English spell checking (c#). You can automatically modify a word to check whether it is in the dictionary. This is suspected of violent cracking. Is there a more elegant scheme?
In 1973, Burkhard and Keller proposed BK tree, which effectively solved this problem. The core idea of BK tree is:
令d(x,y)表示字符串x到y的Levenshtein距离,那么显然:
d(x,y) = 0 当且仅当 x=y (Levenshtein距离为0 <==> 字符串相等)
d(x,y) = d(y,x) (从x变到y的最少步数就是从y变到x的最少步数)
d(x,y) + d(y,z) >= d(x,z) (从x变到z所需的步数不会超过x先变成y再变成z的步数)
This last property is called triangular inequality. Just like a triangle, the sum of the two sides must be greater than the third side.
BK Jianshu
First, let's find a random word as the root (for example, game). When inserting a word in the future, first calculate the Levenshtein Distance between the word and the root: if the distance value is the first occurrence at the node, create a new son node; otherwise, recurse along the corresponding edge. For example, when we insert the word fame, its distance from the game is 1, so we create a new son with an edge labeled 1; the next insertion Gain, the distance between it and game is 2, so it is placed under the side numbered 2. Next time we insert gate, the distance between it and game is 1, so go down along the edge numbered 1 and recursively insert it into the subtree where fame is located; The distance between gate and fame is 2, so put gate under the fame node, and the number of edges is 2.
BK query
If we need to return a word whose distance from the wrong word is no more than N, and the distance between the wrong word and the tree root is D, then we only need to recursively consider the subtree connected by the edges numbered in the range of D-N to D + n. Because n is usually small, many subtrees can be excluded every time compared with a node.
It can be understood from the following figure (from super cool algorithm (1): BK tree (and personal understanding):
BK implementation
Knowing the principle and implementation is simple. Here, find a piece of code from GitHub
Achievements:
public boolean add(T t) {
if (t == null)
throw new NullPointerException();
if (rootNode == null) {
rootNode = new Node<>(t);
length = 1;
modCount++; // Modified tree by adding root.
return true;
}
Node<T> parentNode = rootNode;
Integer distance;
while ((distance = distanceFunction.distance(parentNode.item,t)) != 0
|| !t.equals(parentNode.item)) {
Node<T> childNode = parentNode.children.get(distance);
if (childNode == null) {
parentNode.children.put(distance,new Node<>(t));
length++;
modCount++; // Modified tree by adding a child.
return true;
}
parentNode = childNode;
}
return false;
}
Find:
public List<SearchResult<T>> search(T t,int radius) {
if (t == null)
return Collections.emptyList();
ArrayList<SearchResult<T>> searchResults = new ArrayList<>();
ArrayDeque<Node<T>> nextNodes = new ArrayDeque<>();
if (rootNode != null)
nextNodes.add(rootNode);
while(!nextNodes.isEmpty()) {
Node<T> nextNode = nextNodes.poll();
int distance = distanceFunction.distance(nextNode.item,t);
if (distance <= radius)
searchResults.add(new SearchResult<>(distance,nextNode.item));
int lowBound = Math.max(0,distance - radius),highBound = distance + radius;
for (Integer i = lowBound; i <= highBound; i++) {
if (nextNode.children.containsKey(i))
nextNodes.add(nextNode.children.get(i));
}
}
searchResults.trimToSize();
Collections.sort(searchResults);
return Collections.unmodifiableList(searchResults);
}
Text error correction using BK tree
Prepare a dictionary of 180000 film and television titles:
Test code:
static void outputSearchResult( List<SearchResult<CharSequence>> results){
for(SearchResult<CharSequence> item : results){
System.out.println(item.item);
}
}
static void test(BKTree<CharSequence> tree,String word){
System.out.println(word+"的最相近结果:");
outputSearchResult(tree.search(word,Math.max(1,word.length()/4)));
}
public static void main(String[] args) {
BKTree<CharSequence> tree = new BKTree(DistanceFunctions.levenshteinDistance());
List<String> testStrings = FileUtil.readLine("./src/main/resources/act/name.txt");
System.out.println("词典条数:"+testStrings.size());
long startTime = System.currentTimeMillis();
for(String testStr: testStrings){
tree.add(testStr.replace(".",""));
}
System.out.println("建树耗时:"+(System.currentTimeMillis()-startTime)+"ms");
startTime = System.currentTimeMillis();
String[] testWords = new String[]{
"湄公河凶案","葫芦丝兄弟","少林足球"
};
for (String testWord: testWords){
test(tree,testWord);
}
System.out.println("测试耗时:"+(System.currentTimeMillis()-startTime)+"ms");
}
result:
词典条数:18513
建树耗时:421ms
湄公河凶案的最相近结果:
湄公河大案
葫芦丝兄弟的最相近结果:
葫芦兄弟
少林足球的最相近结果:
少林足球
笑林足球
测试耗时:20ms
reference resources: http://blog.csdn.net/tradymeky/article/details/40581547 https://github.com/sk-scd91/BKTree https://www.cnblogs.com/data2value/p/5707973.html