Example of filtering sensitive words and advertising words by DFA algorithm in Java

1、 Foreword

The development often needs to deal with the submission of some words from users, so it involves the function of sensitive word filtering. The implementation of DFA finite state machine algorithm in resources and the creation of directed graph. The filtering of sensitive words and advertising words has been completed, and the efficiency is good, so share it.

Specific implementation:

1. Matching case filtering 2. Matching full width and half width filtering 3. Matching filtering pause word filtering. 4. Sensitive word repetition filtering.

For example:

The following types of filter detection are supported:

Fuck all lowercase

Fuck case

Fuck full width

f!!! U & C ###k pause

Fffuuuccckkk repetition

There are many ways to filter sensitive words. I will briefly describe the following:

① Query the sensitive words in the database, cycle each sensitive word, and then search the input text from beginning to end to see if there is such a sensitive word. If there is one, make a comparison

This way is to find one to deal with one.

Advantages: so easy. There is basically no difficulty in implementing it in Java code.

Disadvantages: this efficiency makes me run through 100000 grass mud horses in my heart, and does it match some egg pain? If it is in English, you will find a very speechless thing, such as English

A is a sensitive word. If I am an English document, how many times does the program have to deal with sensitive words? Who can tell me?

② The legendary DFA algorithm (finite automata) is exactly what I want to share with you. After all, it feels more general. I hope you can check the principle of the algorithm on the Internet by yourself

Information, not detailed here.

Advantages: at least higher than the sb efficiency above.

Disadvantages: it should not be difficult for those who have learned the algorithm, nor for those who have not learned the algorithm. It is just a little GG painful to understand, the matching efficiency is not high, and it consumes more memory,

The more sensitive words, the more memory they occupy.

③ The third is to write an algorithm or optimize it based on the existing algorithm, which is what little Alan pursues

One of the high realms. If any adulterous brother has his own ideas, he must not forget little Alan. He can teach little Alan two tricks by adding little Alan's QQ: 810104041.

2、 Code implementation

The directory structure is as follows:

In the resources directory:

stopwd. Txt: pause words, matching time, direct filtering.

wd. Txt: sensitive thesaurus.

1. Wordfilter sensitive word filtering class

Of which:

Iscontins: whether sensitive words are included dofilter: filter sensitive words

2. Wordnode sensitive word node

3、 Test results

The project includes sensitive thesaurus, source code, pause thesaurus, etc. you can run it directly by running Maven and jar package.

Project source code: sensitive WD filter_ jb51. rar

The content of this article comes from the network collection of netizens. It is used as a learning reference. The copyright belongs to the original author.
THE END
分享
二维码
< <上一篇
下一篇>>