Detailed explanation of the maximum matching word segmentation algorithm implemented in Java

This article describes the maximum matching word segmentation algorithm and how it can be implemented in Java. It is shared here for your reference.

Full-text retrieval involves two important steps:

1 Word segmentation

2 Building an inverted index

Let's look at word segmentation first.

At present, there are two main approaches to Chinese word segmentation. The first is probabilistic: if two characters appear next to each other with high frequency, we can assume that they form a word. This can be measured with the formula m(a, b) = P(ab) / (P(a) P(b)), where a and b each represent a character, P(ab) is the probability that a and b occur adjacently, P(a) is the frequency of a in the text, and P(b) is the frequency of b in the text. The advantage of probabilistic segmentation is that it does not need a dictionary. The disadvantages are that the algorithm is more complicated, less efficient, and has a certain error rate.
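As a rough illustration of the measure m(a, b), the sketch below estimates it from a single text by counting characters and adjacent pairs. The class name CharCohesion and the method name cohesion are hypothetical, and estimating P(ab) over the n - 1 adjacent positions is an assumption; the article does not say how the probabilities are estimated. The sample text is written with Unicode escapes and spells 坏人不是好人 ("a bad person is not a good person").

```java
public class CharCohesion {

    /**
     * Estimates m(a, b) = P(ab) / (P(a) * P(b)) from one text.
     * P(a) and P(b) are relative character frequencies; P(ab) is the
     * relative frequency of the pair among all adjacent positions
     * (an assumption made for this sketch).
     */
    public static double cohesion(String text, char a, char b) {
        int n = text.length();
        if (n < 2) {
            return 0.0;
        }
        int countA = 0;
        int countB = 0;
        int countAB = 0;
        for (int i = 0; i < n; i++) {
            char c = text.charAt(i);
            if (c == a) {
                countA++;
            }
            if (c == b) {
                countB++;
            }
            // Count positions where a is immediately followed by b.
            if (c == a && i + 1 < n && text.charAt(i + 1) == b) {
                countAB++;
            }
        }
        if (countA == 0 || countB == 0) {
            return 0.0; // the measure is undefined; report no cohesion
        }
        double pA = (double) countA / n;
        double pB = (double) countB / n;
        double pAB = (double) countAB / (n - 1);
        return pAB / (pA * pB);
    }

    public static void main(String[] args) {
        // "bad person is not good person", one char per Chinese character
        String text = "\u574f\u4eba\u4e0d\u662f\u597d\u4eba";
        // "bad" followed by "person": occurs adjacently in the text
        System.out.println(cohesion(text, '\u574f', '\u4eba'));
        // "person" followed by "is": never adjacent in the text
        System.out.println(cohesion(text, '\u4eba', '\u662f'));
    }
}
```

A pair that never occurs adjacently scores 0, while a pair that usually occurs together scores well above it. In practice a threshold on this score would have to be tuned on a much larger corpus.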

The other approach is dictionary-based segmentation: a dictionary is prepared for the program in advance, and the text is segmented by looking up candidate strings in it. At present, the most popular methods are the forward maximum matching algorithm and the reverse maximum matching algorithm. Of the two, reverse maximum matching is generally the more accurate.

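Take the sentence 我是一个坏人 ("I am a bad person") as an example, with a maximum word length of 3 characters. The dictionary contains 我 (I), 是 (am), 一 (one), 一个 (a), 坏人 (bad person) and 大坏人 (big bad person).

Forward algorithm

Scanning from the left, take the longest candidate first and drop its last character until a dictionary word is found:

我是一, 我是, 我 ==> get a word: 我
是一个, 是一, 是 ==> get a word: 是
一个坏, 一个 ==> get a word: 一个
坏人 ==> get a word: 坏人

The result is: 我 / 是 / 一个 / 坏人

Reverse algorithm

Scanning from the right, take the longest candidate first and drop its first character until a dictionary word is found:

个坏人, 坏人 ==> get a word: 坏人
是一个, 一个 ==> get a word: 一个
我是, 是 ==> get a word: 是
我 ==> get a word: 我

The result is: 我 / 是 / 一个 / 坏人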

The Java code is as follows
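The listing is not reproduced in this copy of the article, so the following is only a minimal sketch of both matching directions, not the author's code. The class name MaxMatchSegmenter, its method names, and the choice to emit an unmatched single character as a word of its own are assumptions made for illustration. The strings in main are written as Unicode escapes so that the file compiles regardless of source encoding; they spell the sentence 我是一个坏人 ("I am a bad person") and the dictionary 我, 是, 一, 一个, 坏人, 大坏人.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MaxMatchSegmenter {
    private final Set<String> dictionary;
    private final int maxWordLength;

    public MaxMatchSegmenter(Set<String> dictionary, int maxWordLength) {
        this.dictionary = dictionary;
        this.maxWordLength = maxWordLength;
    }

    /** Forward maximum matching: scan left to right, longest candidate first. */
    public List<String> forward(String text) {
        List<String> words = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + maxWordLength, text.length());
            // Drop the last character until the candidate is a dictionary
            // word or only a single character remains.
            while (end - start > 1 && !dictionary.contains(text.substring(start, end))) {
                end--;
            }
            // A single character is accepted even if it is not in the
            // dictionary (assumption: unknown characters stand alone).
            words.add(text.substring(start, end));
            start = end;
        }
        return words;
    }

    /** Reverse maximum matching: scan right to left, longest candidate first. */
    public List<String> reverse(String text) {
        List<String> words = new ArrayList<>();
        int end = text.length();
        while (end > 0) {
            int start = Math.max(end - maxWordLength, 0);
            // Drop the first character until the candidate is a dictionary
            // word or only a single character remains.
            while (end - start > 1 && !dictionary.contains(text.substring(start, end))) {
                start++;
            }
            words.add(text.substring(start, end));
            end = start;
        }
        // Words were collected from right to left; restore reading order.
        Collections.reverse(words);
        return words;
    }

    public static void main(String[] args) {
        // Dictionary: I, am, one, a, bad person, big bad person
        Set<String> dict = new HashSet<>(Arrays.asList(
                "\u6211", "\u662f", "\u4e00", "\u4e00\u4e2a",
                "\u574f\u4eba", "\u5927\u574f\u4eba"));
        MaxMatchSegmenter segmenter = new MaxMatchSegmenter(dict, 3);
        // Sentence: "I am a bad person"
        String text = "\u6211\u662f\u4e00\u4e2a\u574f\u4eba";
        System.out.println(String.join(" / ", segmenter.forward(text)));
        System.out.println(String.join(" / ", segmenter.reverse(text)));
    }
}
```

With this dictionary and a maximum word length of 3, both calls should yield 我 / 是 / 一个 / 坏人. The two directions can disagree on other input: with the hypothetical dictionary {ab, abc, cd, d} and the text abcd, the forward scan gives abc / d while the reverse scan gives ab / cd.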

