Example of filtering sensitive words and advertising words by DFA algorithm in Java
1、 Foreword
The development often needs to deal with the submission of some words from users, so it involves the function of sensitive word filtering. The implementation of DFA finite state machine algorithm in resources and the creation of directed graph. The filtering of sensitive words and advertising words has been completed, and the efficiency is good, so share it.
Specific implementation:
1. Matching case filtering 2. Matching full width and half width filtering 3. Matching filtering pause word filtering. 4. Sensitive word repetition filtering.
For example:
The following types of filter detection are supported:
Fuck all lowercase
Fuck case
Fuck full width
f!!! U & C ###k pause
Fffuuuccckkk repetition
There are many ways to filter sensitive words. I will briefly describe the following:
① Query the sensitive words in the database, cycle each sensitive word, and then search the input text from beginning to end to see if there is such a sensitive word. If there is one, make a comparison
This way is to find one to deal with one.
Advantages: so easy. There is basically no difficulty in implementing it in Java code.
Disadvantages: this efficiency makes me run through 100000 grass mud horses in my heart, and does it match some egg pain? If it is in English, you will find a very speechless thing, such as English
A is a sensitive word. If I am an English document, how many times does the program have to deal with sensitive words? Who can tell me?
② The legendary DFA algorithm (finite automata) is exactly what I want to share with you. After all, it feels more general. I hope you can check the principle of the algorithm on the Internet by yourself
Information, not detailed here.
Advantages: at least higher than the sb efficiency above.
Disadvantages: it should not be difficult for those who have learned the algorithm, nor for those who have not learned the algorithm. It is just a little GG painful to understand, the matching efficiency is not high, and it consumes more memory,
The more sensitive words, the more memory they occupy.
③ The third is to write an algorithm or optimize it based on the existing algorithm, which is what little Alan pursues
One of the high realms. If any adulterous brother has his own ideas, he must not forget little Alan. He can teach little Alan two tricks by adding little Alan's QQ: 810104041.
2、 Code implementation
The directory structure is as follows:
In the resources directory:
stopwd. Txt: pause words, matching time, direct filtering.
wd. Txt: sensitive thesaurus.
1. Wordfilter sensitive word filtering class
Of which:
Iscontins: whether sensitive words are included dofilter: filter sensitive words
2. Wordnode sensitive word node
3、 Test results
The project includes sensitive thesaurus, source code, pause thesaurus, etc. you can run it directly by running Maven and jar package.
Project source code: sensitive WD filter_ jb51. rar