Levenshtein Distance Algorithm and usage scenario

premise

It has been a long time since I deeply studied something related to the algorithm. After all, it is rarely used in daily life. Even if I memorize it by rote, there is no implementation scenario, which is easy to forget. Recently, an algorithm called Levenshtein Distance algorithm was used to meet the requirements of desensitization data and plaintext data matching. This paper makes a simple analysis of the principle of this algorithm, and uses this algorithm to solve several common scenarios.

What is Levenshtein Distance

Levenshtein Distance, It is generally called edit distance (Levenshtein Distance is only one of the edit distances) or levinstein distance. The algorithm concept is Vladimir levinstein, a Russian scientist (Levenshtein · Vladimir I) was proposed in 1965. The concept of this algorithm is very simple: Levenshtein Distance refers to the minimum number of editing operations required to convert one string to another between two strings. The allowed editing operations include:

Levenshtein Distance formula definition

The final value of this mathematical formula is the value of LD. for instance:

The LD value of converting the word kitten into sitting is 3:

Levenshtein Distance dynamic programming method

The dynamic programming method can be used to measure the value of LD. The steps are roughly as follows:

It is not intended to prove the above conclusion of dynamic programming (that is, the default result of this dynamic programming is correct), but directly give two examples to illustrate this problem:

Example 1:

Initialize LD matrix (3,3):

Calculate the value of the position of [0] [0], because's' ='s', the value of [0] [0] is = min (1 + 1,1 + 1,0 + 0) = 0.

Calculate the values of other positions according to this rule. The filled LD matrix is as follows:

Then the LD values of son and sun are 1.

Example 2:

Initialize LD matrix (4,3):

Then fill the matrix:

Then the LD values of Doge and dog are 1.

Implementation of Levenshtein Distance Algorithm

According to the dynamic programming method mentioned above, the algorithm of LD can be realized relatively simply. Here, Java language is selected for implementation:

public enum LevenshteinDistance {

    // 单例
    X;

    /**
     * 计算Levenshtein Distance
     */
    public int ld(String source,String target) {
        Optional.ofNullable(source).orElseThrow(() -> new IllegalArgumentException("source"));
        Optional.ofNullable(target).orElseThrow(() -> new IllegalArgumentException("target"));
        int sl = source.length();
        int tl = target.length();
        // 定义矩阵,行列都要加1
        int[][] matrix = new int[sl + 1][tl + 1];
        // 首行首列赋值
        for (int k = 0; k <= sl; k++) {
            matrix[k][0] = k;
        }
        for (int k = 0; k <= tl; k++) {
            matrix[0][k] = k;
        }
        // 定义临时的编辑消耗
        int cost;
        for (int i = 1; i <= sl; i++) {
            for (int j = 1; j <= tl; j++) {
                if (source.charAt(i - 1) == target.charAt(j - 1)) {
                    cost = 0;
                } else {
                    cost = 1;
                }
                matrix[i][j] = min(
                        // 左上
                        matrix[i - 1][j - 1] + cost,// 右上
                        matrix[i][j - 1] + 1,// 左边
                        matrix[i - 1][j] + 1
                );
            }
        }
        return matrix[sl][tl];
    }

    private int min(int x,int y,int z) {
        return Math.min(x,Math.min(y,z));
    }

    /**
     * 计算匹配度match rate
     */
    public BigDecimal mr(String source,String target) {
        int ld = ld(source,target);
        // 1 - ld / max(len1,len2)
        return BigDecimal.ONE.subtract(BigDecimal.valueOf(ld)
                .divide(BigDecimal.valueOf(Math.max(source.length(),target.length())),2,BigDecimal.ROUND_HALF_UP));
    }
}

The complexity of the algorithm is O (n * m), where N and m are the lengths of two input strings respectively. The algorithm implementation here completely refers to the inference process of the previous dynamic programming method. In fact, it is not necessary to define two-dimensional arrays (matrices). You can use two one-dimensional arrays. Please refer to the implementation of Levenshtein algorithm in Java string similarity. Run the following example:

public static void main(String[] args) throws Exception {
    String s = "doge";
    String t = "dog";
    System.out.println("Levenshtein Distance:" +LevenshteinDistance.X.ld(s,t));
    System.out.println("Match Rate:" +LevenshteinDistance.X.mr(s,t));
}
// 输出
Levenshtein Distance:1
Match Rate:0.75

Some usage scenarios of Levenshtein Distance Algorithm

The main application scenarios of LD algorithm are:

In fact, it is mainly the "string" matching scenario. Here is an example based on the actual scenario.

Desensitization data and plaintext data match

Recently, desensitized data and plaintext data have been matched in some scenes. Sometimes the files exported by a third party are desensitized files, and the format is as follows:

We have written data as follows:

To match the two pieces of data, it is concluded that the above two pieces of data correspond to the data of the same person. The principle is: if and only if the LD value of the mobile phone number is 4, the LD value of the ID card is 8 and the LD value of the name is 1, the two pieces of data match completely.

Use the algorithm previously written:

public static void main(String[] args) throws Exception {
    String sourceName = "张*狗";
    String sourcePhone = "123****8910";
    String sourceIdentityNo = "123456****8765****";
    String targetName = "张大狗";
    String targetPhone = "12345678910";
    String targetIdentityNo = "123456789987654321";
    boolean match = LevenshteinDistance.X.ld(sourceName,targetName) == 1 &&
            LevenshteinDistance.X.ld(sourcePhone,targetPhone) == 4 &&
            LevenshteinDistance.X.ld(sourceIdentityNo,targetIdentityNo) == 8;
    System.out.println("是否匹配:" + match);
    targetName = "张大doge";
    match = LevenshteinDistance.X.ld(sourceName,targetIdentityNo) == 8;
    System.out.println("是否匹配:" + match);
}
// 输出结果
是否匹配:true
是否匹配:false

Spell check

This scenario seems closer to life, that is, the spelling prompt of dictionary application. For example, throwab can be prompted when throwab is entered. The author thinks that a simple implementation is to traverse the word library beginning with T and find words with high matching degree (low LD value) for prompt (in fact, it may not be implemented in this way to meet efficiency). For example:

public static void main(String[] args) throws Exception {
    String target = "throwab";
    // 模拟一个单词库
    List<String> words = Lists.newArrayList();
    words.add("throwable");
    words.add("their");
    words.add("the");
    Map<String,BigDecimal> result = Maps.newHashMap();
    words.forEach(x -> result.put(x,LevenshteinDistance.X.mr(x,target)));
    System.out.println("输入值为:" + target);
    result.forEach((k,v) -> System.out.println(String.format("候选值:%s,匹配度:%s",k,v)));
}
// 输出结果
输入值为:throwab
候选值:the,匹配度:0.29
候选值:throwable,匹配度:0.78
候选值:their,匹配度:0.29

In this way, the child can select the throwable with the highest matching degree based on the input throwab.

Plagiarism detection

The essence of plagiarism detection is also string matching. It can be simply considered that if the matching degree is higher than a certain threshold, it is plagiarism. For example, the lyrics of "I am a little bird" are:

Suppose the author wrote a lyrics:

We can try to find out the matching degree of two sentences:

System.out.println(LevenshteinDistance.X.mr("我是一只小小小小鸟,想要飞呀飞却飞也飞不高","我是一条小小小小狗,想要睡呀睡却睡也睡不够"));
// 输出如下
0.67

It can be considered that the lyrics created by the author are completely copied. Of course, for large text plagiarism detection (such as paper duplicate checking, etc.), we need to consider the problem of execution efficiency. The solution should be similar, but we need to consider various problems such as word segmentation, case and so on.

Summary

This paper only makes a superficial analysis of Levenshtein Distance and lists some simple scenarios. In fact, this algorithm is very common in daily life. The author guesses that the word spelling check applied in the dictionary Duplicate checking (plagiarism discrimination) may be related to this algorithm. Although the learning curve of the algorithm is steep, it is indeed a sharp blade to solve the problem.

reference material:

The official account of Technology (Throwable Digest), which is not regularly pushed to the original technical article (never copied or copied):

The content of this article comes from the network collection of netizens. It is used as a learning reference. The copyright belongs to the original author.
THE END
分享
二维码
< <上一篇
下一篇>>