Understanding TF-IDF, with a Java implementation example

TF-IDF

Preface

Some time ago I went back over the TF-IDF notes I had put together earlier and posted them on my blog. Knowledge needs to be reviewed regularly, otherwise it gets rusty.

Understanding TF-IDF

TF-IDF (term frequency-inverse document frequency) is a commonly used weighting technique in information retrieval and text mining. The main idea is that if a word or phrase appears frequently in one article and rarely in other articles, it discriminates well between categories and is suitable for classification. TF-IDF is simply TF * IDF, where TF is the term frequency and IDF is the inverse document frequency. TF is the frequency with which a term t appears in a document d. The main idea of IDF is that the fewer documents contain the term t, that is, the smaller n is, the larger the IDF, which indicates that t distinguishes categories well.

Suppose the number of documents containing t within a class C is m, and the total number of documents containing t in the other classes is k. The number of documents containing t is then n = m + k. When m is large, n is also large, and the IDF formula yields a small value, suggesting that t distinguishes categories poorly. In fact, however, a term that appears frequently in the documents of one class represents the text of that class well. Such terms should be given a high weight and selected as feature words that distinguish the class from the others. This is the weakness of IDF.
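The weakness described above can be seen numerically. The following is a minimal sketch, assuming the common form idf = ln(|D| / n), where |D| is the corpus size; the class name IdfDeficiencyDemo, the corpus size of 1000, and the values of m and k are all hypothetical.

```java
import java.util.Locale;

class IdfDeficiencyDemo {

    // Assumed form: idf = ln(|D| / n), with n = number of documents containing the term.
    static double idf(int totalDocs, int docsWithTerm) {
        return Math.log((double) totalDocs / docsWithTerm);
    }

    public static void main(String[] args) {
        int totalDocs = 1000; // hypothetical corpus size |D|
        int k = 5;            // hypothetical: documents in other classes containing t
        // m = documents in class C containing t; n = m + k
        for (int m : new int[] {5, 50, 500}) {
            int n = m + k;
            System.out.printf(Locale.ROOT, "m = %3d  n = %3d  idf = %.4f%n", m, n, idf(totalDocs, n));
        }
    }
}
```

With k fixed, raising m from 5 to 500 lowers the IDF from about 4.61 to about 0.68, even though the term is becoming more characteristic of class C.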

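TF formula:

$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{s} n_{s,j}}$$

In the above formula, $n_{i,j}$ is the number of times term $t_i$ appears in document $d_j$, and the denominator is the total number of occurrences of all terms in $d_j$.

IDF formula:

$$\mathrm{idf}_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$

|D|: total number of documents in the corpus

$|\{j : t_i \in d_j\}|$: number of documents containing term $t_i$ (the n used earlier)

then

$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$$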
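As a worked example with hypothetical numbers: suppose a document contains 100 terms and the term of interest appears 3 times, while the corpus holds 10000 documents, 100 of which contain the term. Taking the logarithm as the natural logarithm (an assumption; another base only rescales every weight by a constant):

```latex
\begin{aligned}
\mathrm{tf} &= \frac{3}{100} = 0.03 \\
\mathrm{idf} &= \ln\frac{10000}{100} = \ln 100 \approx 4.605 \\
\mathrm{tf\text{-}idf} &= 0.03 \times 4.605 \approx 0.138
\end{aligned}
```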

TF-IDF implementation (Java)

The external library ikanalyzer-2012.jar is used here for word segmentation.

The code is as follows:

The results are as follows:
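For readers who want something to run, here is a minimal sketch of the computation. It is not the listing referred to above: it assumes each document has already been segmented into a list of words, so the ikanalyzer segmentation step is left out, and it uses the natural logarithm for IDF. The class name TfIdfSketch, its method names, and the sample corpus are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

class TfIdfSketch {

    // tf(t, d) = (occurrences of t in d) / (total number of terms in d)
    static Map<String, Double> termFrequency(List<String> doc) {
        Map<String, Double> tf = new HashMap<>();
        for (String term : doc) {
            tf.merge(term, 1.0, Double::sum);
        }
        for (Map.Entry<String, Double> e : tf.entrySet()) {
            e.setValue(e.getValue() / doc.size());
        }
        return tf;
    }

    // idf(t) = ln(|D| / number of documents containing t)
    static Map<String, Double> inverseDocumentFrequency(List<List<String>> corpus) {
        Map<String, Integer> docCount = new HashMap<>();
        for (List<String> doc : corpus) {
            // A set ensures each term is counted once per document.
            for (String term : new HashSet<>(doc)) {
                docCount.merge(term, 1, Integer::sum);
            }
        }
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : docCount.entrySet()) {
            idf.put(e.getKey(), Math.log((double) corpus.size() / e.getValue()));
        }
        return idf;
    }

    // tf-idf(t, d) = tf(t, d) * idf(t); returns one weight map per document
    static List<Map<String, Double>> tfIdf(List<List<String>> corpus) {
        Map<String, Double> idf = inverseDocumentFrequency(corpus);
        List<Map<String, Double>> result = new ArrayList<>();
        for (List<String> doc : corpus) {
            Map<String, Double> weights = new HashMap<>();
            for (Map.Entry<String, Double> e : termFrequency(doc).entrySet()) {
                weights.put(e.getKey(), e.getValue() * idf.get(e.getKey()));
            }
            result.add(weights);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical documents, already segmented into words
        List<List<String>> corpus = Arrays.asList(
            Arrays.asList("java", "tfidf", "java"),
            Arrays.asList("java", "lucene"),
            Arrays.asList("lucene", "index", "search"));
        List<Map<String, Double>> weights = tfIdf(corpus);
        for (int i = 0; i < weights.size(); i++) {
            System.out.println("doc " + i + ": " + weights.get(i));
        }
    }
}
```

In this sample, "java" occurs in two of the three documents, so its IDF is ln(3/2), while "tfidf" occurs in only one and gets the larger IDF ln(3). To use real text, the output of a word segmenter would replace the hand-written word lists.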

Common problems

The Lucene JAR has not been added to the project

The Lucene package and JE package versions are incompatible

Summary

That is all for this article on understanding TF-IDF and its Java implementation example. I hope it is helpful to you.
