Java – natural language processing – features of text classification
I am trying to use Weka's SVM to classify text. So far, the feature vectors I use to train the SVM consist of TF-IDF statistics for the unigrams and bigrams in the training texts. However, the results I get when testing the trained SVM model are not accurate at all, so can someone give me feedback on my procedure? I classify text as follows:
1. Construct a dictionary composed of the unigrams and bigrams extracted from the training texts.
2. Count the number of occurrences of each unigram / bigram in each training text, and the number of training texts in which each unigram / bigram appears.
3. Calculate the TF-IDF of each unigram / bigram using the data from step 2.
4. For each document, construct a feature vector whose length is the size of the dictionary, and store the corresponding TF-IDF statistic in each element of the vector (for example, the first element in the feature vector of document 1 corresponds to the TF-IDF of the first dictionary entry relative to document 1).
5. Attach a class label to each feature vector to indicate which author the text belongs to.
6. Train the SVM with these feature vectors.
7. Construct the feature vectors of the test texts in the same way as for the training texts, and classify them with the SVM.
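A minimal sketch of steps 1 to 4 and step 7 in plain Java, to make the procedure concrete. This is not Weka code: the class `TfIdfSketch` and its method names are hypothetical, the tokenizer (lowercase, split on non-letters) is an assumption, and so is the particular TF-IDF variant (term count divided by the number of n-grams in the document, times ln(N / df)). The point it illustrates is that the dictionary and document frequencies come from the training texts only, and a test text is vectorized against them.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Weka.
final class TfIdfSketch {

    // Lowercase, split on non-letters, return unigrams followed by bigrams.
    static List<String> ngrams(String text) {
        List<String> words = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) words.add(t);
        }
        List<String> grams = new ArrayList<>(words);
        for (int i = 0; i + 1 < words.size(); i++) {
            grams.add(words.get(i) + " " + words.get(i + 1));
        }
        return grams;
    }

    // Step 1: map each n-gram seen in training to a fixed vector index.
    static Map<String, Integer> buildDictionary(List<String> trainingDocs) {
        Map<String, Integer> dict = new LinkedHashMap<>();
        for (String doc : trainingDocs) {
            for (String g : ngrams(doc)) {
                dict.putIfAbsent(g, dict.size());
            }
        }
        return dict;
    }

    // Step 2: number of training documents containing each n-gram.
    static int[] documentFrequencies(List<String> trainingDocs,
                                     Map<String, Integer> dict) {
        int[] df = new int[dict.size()];
        for (String doc : trainingDocs) {
            for (String g : new HashSet<>(ngrams(doc))) {
                df[dict.get(g)]++;
            }
        }
        return df;
    }

    // Steps 3-4 (and 7): TF-IDF vector for a training or test document.
    // N-grams that are not in the training dictionary are ignored.
    static double[] vectorize(String doc, Map<String, Integer> dict,
                              int[] df, int numTrainingDocs) {
        double[] vec = new double[dict.size()];
        List<String> grams = ngrams(doc);
        for (String g : grams) {
            Integer idx = dict.get(g);
            if (idx != null) vec[idx] += 1.0; // raw count for now
        }
        for (int i = 0; i < vec.length; i++) {
            if (vec[i] > 0) {
                double tf = vec[i] / grams.size();
                double idf = Math.log((double) numTrainingDocs / df[i]);
                vec[i] = tf * idf;
            }
        }
        return vec;
    }
}
```

Note that with this IDF variant, an n-gram that occurs in every training text gets weight 0 in every vector.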
Also, do I need to train the SVM with more features? If so, which features are most effective in this case? Any help would be much appreciated, thank you.
Solution
Natural language documents usually contain many words that appear only once, also known as hapax legomena. For example, 44% of the distinct words in Moby Dick appear only once, and 17% appear twice.
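To check how sparse your own corpus is, you can measure this directly. The following is a small sketch; the class name `HapaxSketch` and the tokenization (lowercase, split on non-letters) are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration.
final class HapaxSketch {

    // Fraction of distinct words that occur exactly once in the text.
    static double hapaxFraction(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : text.toLowerCase().split("[^a-z]+")) {
            if (!w.isEmpty()) counts.merge(w, 1, Integer::sum);
        }
        if (counts.isEmpty()) return 0.0;
        long once = counts.values().stream().filter(c -> c == 1).count();
        return (double) once / counts.size();
    }
}
```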
Therefore, including every word from the corpus usually leads to far too many features. To reduce the size of this feature space, NLP systems usually use one or more of the following:
- Removing stop words – for author classification, these are usually short, common words such as is, the, at, which, etc.
- Stemming – popular stemmers (such as the Porter stemmer) use a set of rules to normalize the inflected forms of a word. For example, walks, walking and walked are all mapped to the stem walk.
- Correlation / significance threshold – calculate the Pearson correlation coefficient or the p-value of each feature with respect to the class label. Then set a threshold and remove all features that score below it.
- Coverage threshold – similar to the threshold above, remove all features that do not appear in at least t documents, where t is very small (< 0.05%) relative to the size of the whole corpus.
- Filtering by part of speech – for example, consider only verbs, or remove nouns.
- Filtering by system type – for example, an NLP system for clinical text may only consider words found in a medical dictionary.
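As a rough illustration of two of these filters, here is a sketch that combines stop-word removal with a coverage threshold. The class `FeatureFilterSketch`, the four-word stop list, and the `minDocs` parameter (an absolute document count rather than a percentage) are all assumptions for the example; a real system would use a much longer stop list and would typically stem the words first.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration.
final class FeatureFilterSketch {

    // Tiny illustrative stop list; real lists are much longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("is", "the", "at", "which"));

    // Keep words that are not stop words and occur in at least minDocs documents.
    static Set<String> selectFeatures(List<String> docs, int minDocs) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            // A set, so each word is counted at most once per document.
            Set<String> seen = new HashSet<>(
                    Arrays.asList(doc.toLowerCase().split("[^a-z]+")));
            for (String w : seen) {
                if (w.isEmpty() || STOP_WORDS.contains(w)) continue;
                docFreq.merge(w, 1, Integer::sum);
            }
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            if (e.getValue() >= minDocs) kept.add(e.getKey());
        }
        return kept;
    }
}
```

The returned set can then serve as the dictionary, so that rare words and stop words never become vector dimensions.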
For stemming, removing stop words, indexing the corpus, and calculating TF-IDF or document similarity, I recommend Lucene. Google "Lucene in 5 minutes" to find some quick and easy tutorials on using Lucene.