Implementing the TF-IDF algorithm in Java: code sharing
Algorithm Introduction
Concept
TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and text mining. It is a statistical method for evaluating how important a word is to one document within a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but is offset by how frequently the word appears across the corpus. Search engines often use variants of TF-IDF weighting as a measure or ranking of the relevance between a document and a user query. In addition to TF-IDF, Internet search engines also use ranking methods based on link analysis to determine the order in which documents appear in search results.
Principle
In a given document, term frequency (TF) is the number of times a given word appears in that document. This count is usually normalized, typically by dividing it by the total number of words in the document (so the numerator is never larger than the denominator, unlike the quotient used in IDF). Normalization prevents a bias towards long documents: the same word is likely to have a higher raw count in a long document than in a short one, regardless of whether the word is important.
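As a minimal sketch of the normalized term frequency described above, assuming the document has already been segmented into a list of words (the class and method names here are hypothetical, chosen only for illustration):

```java
import java.util.Arrays;
import java.util.List;

class TermFrequencySketch {

    // Normalized TF: occurrences of the term divided by the total number of words.
    static double termFrequency(List<String> words, String term) {
        if (words.isEmpty()) {
            return 0.0; // avoid dividing by zero for an empty document
        }
        long count = words.stream().filter(term::equals).count();
        return (double) count / words.size();
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("pension", "insurance", "pension", "policy");
        System.out.println(termFrequency(doc, "pension")); // 2 of 4 words, prints 0.5
    }
}
```

Dividing by the document length is only one common normalization; other variants exist, and the original code may have used a different one.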
Inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a specific word is obtained by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of the quotient.
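The IDF definition above can be sketched as follows. This assumes the natural logarithm and no smoothing; the text does not state which base the original uses, and the names are hypothetical:

```java
class InverseDocumentFrequencySketch {

    // IDF = log(totalDocs / docsWithTerm), using the natural logarithm (an assumption).
    static double idf(int totalDocs, int docsWithTerm) {
        if (totalDocs <= 0 || docsWithTerm <= 0) {
            throw new IllegalArgumentException("both counts must be positive");
        }
        return Math.log((double) totalDocs / docsWithTerm);
    }

    public static void main(String[] args) {
        System.out.println(idf(10000, 10));    // rare word: about 6.91
        System.out.println(idf(10000, 10000)); // word in every document: 0.0
    }
}
```

A word that appears in no document would make the divisor zero, which this sketch rejects. A common workaround is to add 1 to the divisor, but whether the original code does so is not stated.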
A high term frequency within a specific document, combined with a low document frequency of the word across the whole document set, produces a high TF-IDF weight. TF-IDF therefore tends to filter out common words and retain important ones.
The main idea of TF-IDF is that if a word or phrase appears frequently in one article and rarely in others, it is considered to have good discriminating power and to be suitable for classification. TF-IDF is simply TF * IDF, where TF is the term frequency and IDF is the inverse document frequency. TF is the frequency of term t in document d. The main idea of IDF is that the fewer documents contain term t (that is, the smaller n is), the larger the IDF, and the better t distinguishes between categories.
Suppose the number of documents containing t in a certain class C is m, and the number of documents containing t in all other classes is k. The total number of documents containing t is then n = m + k. When m is large, n is also large, so the IDF formula yields a small value, suggesting that t does not distinguish categories well. In fact, however, if a term appears frequently in the documents of one class, it represents the characteristics of that class well. Such a term should be given a high weight and selected as a feature word that distinguishes this class from the others. This is a shortcoming of IDF.
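The shortcoming can be made concrete with a small numeric sketch. The corpus size and the values of m and k below are invented purely for illustration, and the natural logarithm is an assumption:

```java
class IdfShortcomingSketch {

    // n = m + k documents contain the term; IDF = log(totalDocs / n).
    static double idf(int totalDocs, int m, int k) {
        int n = m + k;
        return Math.log((double) totalDocs / n);
    }

    public static void main(String[] args) {
        int totalDocs = 1000;
        // Term A: concentrated in class C (m = 90 documents in C, k = 10 elsewhere).
        System.out.println(idf(totalDocs, 90, 10)); // about 2.30
        // Term B: rare everywhere (m = 5, k = 5) and not specific to C.
        System.out.println(idf(totalDocs, 5, 5));   // about 4.61
    }
}
```

Term A is far more characteristic of class C, yet it receives the lower IDF because IDF looks only at n and ignores how the occurrences are distributed across classes.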
Recently I have been working on extracting domain concepts, and TF-IDF, being a classic algorithm, can serve as one of the steps.
The calculation is simple: the TF-IDF weight of a word is its TF multiplied by its IDF, as defined above.
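Written out with the definitions from the previous section, the formula can be stated as below. This assumes TF is normalized by document length and IDF is unsmoothed; the symbols f and N are notation chosen here, while n is the number of documents containing t, as in the text above.

```latex
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{w} f_{w,d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{n}, \qquad
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)
```

Here f_{t,d} is the number of times term t occurs in document d, the sum runs over all words w in d, and N is the total number of documents in the corpus.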
Preprocessing
Since there are more than 30,000 candidate words to process and more than 10,000 corpus documents, directly traversing the texts one by one is time-consuming: each word takes more than a minute to process.
To shorten the time, word segmentation is carried out first, and the output is written one word per line to make counting easy. HanLP is used as the word segmentation tool.
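Once the segmented output is available as one word per line, all counts can be built in a single pass, so each candidate word becomes a map lookup instead of a full traversal. The sketch below assumes the lines have already been loaded into memory (for example with `Files.readAllLines`). The segmentation step with HanLP is not shown, and the names are hypothetical:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordCountSketch {

    // Builds a word -> occurrence count map from lines holding one word each.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            String word = line.trim();
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum); // insert 1 or add 1 to the existing count
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("retirees", "pension", "retirees", "", "fund");
        Map<String, Integer> counts = countWords(lines);
        System.out.println(counts.get("retirees")); // prints 2
    }
}
```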
Then the documents of each domain are merged into a single file, separated by the "$$" identifier so that the number of documents can be recorded easily.
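A merged file of this kind might be processed as sketched below. The sketch assumes that "$$" sits on its own line between documents and that every other non-blank line holds one segmented word. The exact layout of the original files is not shown, so this is an assumption, and the names are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DomainCorpusSketch {

    static final String DOC_SEPARATOR = "$$";

    // Groups the lines into documents; each document is kept as its set of distinct words.
    static List<Set<String>> splitDocuments(List<String> lines) {
        List<Set<String>> docs = new ArrayList<>();
        Set<String> current = new HashSet<>();
        for (String line : lines) {
            String word = line.trim();
            if (word.equals(DOC_SEPARATOR)) {
                if (!current.isEmpty()) {
                    docs.add(current);
                }
                current = new HashSet<>();
            } else if (!word.isEmpty()) {
                current.add(word);
            }
        }
        if (!current.isEmpty()) {
            docs.add(current); // last document, which has no trailing separator
        }
        return docs;
    }

    // Number of documents that contain the term (n in the IDF formula).
    static int documentFrequency(List<Set<String>> docs, String term) {
        int n = 0;
        for (Set<String> doc : docs) {
            if (doc.contains(term)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "retirees", "pension", "$$", "pension", "fund", "$$", "retirees");
        List<Set<String>> docs = splitDocuments(lines);
        System.out.println(docs.size());                          // prints 3
        System.out.println(documentFrequency(docs, "retirees")); // prints 2
    }
}
```

The document count and the per-term document frequency obtained this way are the two inputs that the IDF calculation needs.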
The following is the selected domain corpus (under the path directory):
Code implementation
Results
The test term is "retirees", and the intermediate results are as follows:
Final result:
Conclusion
It can be seen that the TF-IDF value of "retirees" is relatively high in the fields of endowment insurance and social security, which can serve as a basis for judging whether it is a domain concept.
Of course, although the TF-IDF algorithm is a classic, it still has many shortcomings, and its results cannot be relied on alone.