Java implementation of sensitive word filtering example

Sensitive words and text filtering is an essential function of a website. It is very necessary to design a good and efficient filtering algorithm. Some time ago, a friend of mine (graduated right away, not long after he came into contact with programming) he asked me to help him look at a text filtering thing. It said that the retrieval efficiency is very slow. I took the program and looked at it. The whole process is as follows: read the sensitive thesaurus, if in the HashSet collection, get the page, upload the text, and then match. I think this process must be very slow. For a person who has no contact with him, I I can only think of this. A higher level is regular expression. Unfortunately, both methods are not feasible. Of course, I didn't realize that the algorithm could solve the problem, but Google knows!

Introduction to DFA

Among the algorithms of text filtering, DFA is the only better algorithm. DFA is deterministic finite automaton, that is, deterministic finite automata. It obtains the next state through event and current state, that is, event + state = nextstate. The following figure shows the transition of its state

a b b

S -----> U S -----> V U -----> V

In the algorithm of sensitive word filtering, we must reduce the operation, and DFA has almost no calculation in DFA algorithm, only the transformation of state.

Implementing sensitive word filtering with DFA algorithm in Java

The key to implement sensitive word filtering in Java is the implementation of DFA algorithm. First, we analyze the above figure. In this process, we think the following structure will be clearer.

At the same time, there is no state transition and no action, but only query. We can think that through s query u and V, through u query V and P, and through V query u P. through such a transition, we can transform the state transition into a search using java collection.

Admittedly, there are several sensitive words in our sensitive Thesaurus: Japanese, Japanese devils and Mao Ze East. So what kind of structure do I need to build?

First: query day -- > {Book}, query book -- > {man, devil}, query man -- > {null}, query ghost -- > {son}. The structure is as follows:

Let's expand this figure:

In this way, we build our sensitive thesaurus into a tree similar to one by one, so that when we judge whether a word is a sensitive word, we greatly reduce the matching range of retrieval. For example, if we want to judge the Japanese, according to the first word, we can confirm that the tree needs to be searched, and then search in this tree.

But how to judge that a sensitive word is over? Use the identification bit to judge.

So the key to this is how to build such sensitive word trees. Next, I have taken HashMap in Java as an example to implement DFA algorithm. The specific process is as follows:

Japanese, Japanese devils, for example

1. Query "day" in HashMap to see if it exists in HashMap. If it does not exist, it proves that the sensitive word beginning with "day" does not exist, so we can directly build such a tree. Skip to 3.

2. If it is found in HashMap, it indicates that there is a sensitive word beginning with "day", set HashMap = HashMap Get ("day"), skip to 1 and match "Ben" and "person" in turn.

3. Judge whether the word is the last word in the word. If it indicates the end of the sensitive word, set the flag bit Isend = 1; otherwise, set the flag bit Isend = 0;

The program is implemented as follows:

The HashMap structure obtained by running is as follows:

{five = {star = {red = {Isend = 0, flag = {Isend = 1}, Isend = 0}, middle = {Isend = 0, country = {Isend = 0, person = {Isend = 1}, male = {Isend = 0, person = {Isend = 1}

We have implemented a simple method for sensitive thesaurus, so how to realize retrieval? The retrieval process is nothing more than the get implementation of HashMap. Finding it proves that the word is a sensitive word, otherwise it is not a sensitive word. The process is as follows: if we match "long live the Chinese people".

1. The first word "medium" can be found in HashMap. Get a new map = HashMap get("")。

2. If map = = null, it is not a sensitive word. Otherwise, skip to 3

3. Get the Isend in the map, and judge whether the word is the last by whether Isend is equal to 1. If Isend = = 1, it means that the word is sensitive, otherwise skip to 1.

Through this step, we can judge that "Chinese people" is a sensitive word, but if we enter "Chinese woman", it is not a sensitive word.

At the end of the article, I provide a file download for sensitive word filtering using Java. The following is a test class to prove the efficiency and reliability of this algorithm.

Operation results:

From the above results, it can be seen that there are 771 sensitive words in the thesaurus, the detection sentence length is 184 characters, and 6 sensitive words are found. It takes a total of 1 ms. It can be seen that the speed is still very considerable.

Two document downloads are available:

Desktop. rar( http://xiazai.jb51.net/201611/yuanma/Desktop_jb51.rar )It contains two java files, one is to read the sensitive wordinit, and the other is the sensitive wordfilter, which includes judging whether there are sensitive words (iscontaintsensitiveword (string TXT, int matchtype)), obtaining sensitive words (getsensitive word (string TXT, int matchtype)) Three methods of sensitive word substitution (replacesensitive word (string TXT, int matchtype, string replacechar)).

Sensitive Thesaurus: click download

The content of this article comes from the network collection of netizens. It is used as a learning reference. The copyright belongs to the original author.
THE END
分享
二维码
< <上一篇
下一篇>>