Java: text similarity function for strictly similar documents

I am writing a Java program that must make a final decision on whether two UTF-8 encoded documents are similar.

The two documents are likely to be the same or to differ only slightly, because they share many features such as date, location, and creator, but it is their text that decides whether they really are.

I expect the texts of the two documents to be either very similar or not similar at all, so I can set the similarity threshold quite strictly. For example, I could say that two documents are similar only if they have 90% of their words in common, but I would like something more robust that works for both short and long texts.

To sum up, I have:

> two documents that are either very similar or completely different
> the two documents are more likely to be similar than not
> the documents can be long (several paragraphs) or short (a few sentences)

I have tried SimMetrics, which offers a large set of string matching functions, but I am mostly interested in suggestions for suitable algorithms.

My possible candidates are:

> Levenshtein: its output is more meaningful for short texts
> overlap coefficient: maybe, but how will it discriminate between documents of different lengths?
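
To make the length concern about the overlap coefficient concrete, here is a minimal sketch that computes it over word sets as |A ∩ B| / min(|A|, |B|). The class name `OverlapCoefficient`, the whitespace tokenization, the lower-casing, and the choice to return 0.0 for an empty text are assumptions made for illustration; they are not taken from SimMetrics.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class OverlapCoefficient {

    // Split a text into a set of lower-cased words (assumption: whitespace-separated tokens).
    public static Set<String> words(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        Set<String> result = new HashSet<>();
        if (!trimmed.isEmpty()) {
            result.addAll(Arrays.asList(trimmed.split("\\s+")));
        }
        return result;
    }

    // Overlap coefficient: shared words divided by the size of the smaller word set.
    public static double overlap(String x, String y) {
        Set<String> a = words(x);
        Set<String> b = words(y);
        int smaller = Math.min(a.size(), b.size());
        if (smaller == 0) {
            return 0.0; // convention chosen for this sketch when a text has no words
        }
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return (double) common.size() / smaller;
    }

    public static void main(String[] args) {
        String shortDoc = "budget meeting monday";
        String longDoc = "budget meeting monday moved to the large room because of the audit";
        // Every word of the short text occurs in the long one, so this prints 1.0.
        System.out.println(overlap(shortDoc, longDoc));
    }
}
```

Because the denominator is the smaller set, a short text fully contained in a much longer one scores 1.0, which is exactly the case where the measure fails to discriminate between documents of different lengths.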

Note also that treating two texts as similar only when they are identical will not work, because I want documents that differ by just a few words to pass the similarity test.

Solution

Levenshtein distance is the standard measure for a reason: it is easy to compute and its meaning is easy to grasp. If you are wary of the number of characters in a long document, you can compute it over words, sentences, or even paragraphs instead of characters. Since you expect similar pairs to be very similar, that should still work well.
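
As a rough illustration of computing the distance over words rather than characters, here is a minimal sketch. The class name `WordLevenshtein`, the whitespace tokenization, the lower-casing, and the normalization by the longer word count are assumptions made for this example, not part of the original answer.

```java
import java.util.Locale;

public class WordLevenshtein {

    // Split a text into lower-cased words (assumption: whitespace-separated tokens).
    public static String[] tokenize(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Classic dynamic-programming Levenshtein distance, with words as the edit unit.
    public static int distance(String[] a, String[] b) {
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j; // turning an empty sequence into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i; // turning a[0..i) into an empty sequence takes i deletions
            for (int j = 1; j <= b.length; j++) {
                int cost = a[i - 1].equals(b[j - 1]) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length];
    }

    // Similarity in [0, 1]: 1 minus the distance divided by the longer word count.
    public static double similarity(String x, String y) {
        String[] a = tokenize(x);
        String[] b = tokenize(y);
        int longest = Math.max(a.length, b.length);
        if (longest == 0) {
            return 1.0; // two empty texts are treated as identical in this sketch
        }
        return 1.0 - (double) distance(a, b) / longest;
    }

    public static void main(String[] args) {
        String first = "the meeting was held in Rome on Monday";
        String second = "the meeting was held in Paris on Monday";
        // One substituted word out of eight: prints 0.875.
        System.out.println(similarity(first, second));
    }
}
```

Note that with very short texts a single changed word moves the normalized score a lot (0.875 here), so a fixed 90% threshold may need to be paired with an absolute cap on the number of differing words.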
