Java fast string matching (associating text with categories)

Suppose I have a post like this:

>Title: "Web: 2011 SEO"
>Description: "A meeting to talk about SEO on the Internet in 2011"

I also have a list of categories, each with associated keywords:

>"IT" (category) -> "web design", "search engine optimization", "development", "web development" (keywords)

I have several categories (IT, art, medicine, literature, machinery, etc.).

I need to use Java to automatically tag my posts with these categories and keywords, to improve future searches.

In the example above, the post should match "SEO" and "Web", so the main_category field should be filled with "IT" and the subfield_category field with "SEO" or "Web" (or both, which would be great).

My problem is that the only solution I can come up with is brute force (testing every word of the post against every category's keyword list), which will hurt performance.

Is there a better way to do this search? I can also change my category -> keyword structure if that helps (I just don't know how yet...).

Thank you in advance!

Edit: as mentioned in the comments, accuracy is not that important. I don't need 100% accurate tags, because I know I can get a reasonable number of correct ones from plain string matching.

Also, I think the logic would be: look at the post's title and description, search for any matching keyword, mark its category, then search for more keywords from that category and save 3 to 5 matching keywords.
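One way to avoid the brute-force scan is to invert the category -> keyword structure into a single keyword -> category hash map, so each word or short phrase of the post costs one lookup. The following is only a minimal sketch of that idea, not a definitive implementation: the class `KeywordTagger`, its methods, and the cap passed as `maxKeywordsPerCategory` are hypothetical names, and it collects all matches in one pass rather than doing a second category-specific search.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: maps each keyword to its category for O(1) lookups.
class KeywordTagger {
    // Normalized keyword (lowercase, single spaces) -> category name.
    private final Map<String, String> keywordToCategory = new HashMap<>();
    // Length, in words, of the longest registered keyword.
    private int maxKeywordWords = 1;

    void addCategory(String category, String... keywords) {
        for (String keyword : keywords) {
            List<String> words = tokenize(keyword);
            if (words.isEmpty()) continue;
            keywordToCategory.put(String.join(" ", words), category);
            maxKeywordWords = Math.max(maxKeywordWords, words.size());
        }
    }

    // Returns category -> matched keywords, keeping at most
    // maxKeywordsPerCategory keywords for each category.
    Map<String, Set<String>> tag(String text, int maxKeywordsPerCategory) {
        List<String> tokens = tokenize(text);
        Map<String, Set<String>> result = new LinkedHashMap<>();
        for (int start = 0; start < tokens.size(); start++) {
            int maxEnd = Math.min(tokens.size(), start + maxKeywordWords);
            // Try every phrase of 1..maxKeywordWords words starting here.
            for (int end = start + 1; end <= maxEnd; end++) {
                String phrase = String.join(" ", tokens.subList(start, end));
                String category = keywordToCategory.get(phrase);
                if (category == null) continue;
                Set<String> matched =
                        result.computeIfAbsent(category, c -> new LinkedHashSet<>());
                if (matched.size() < maxKeywordsPerCategory) matched.add(phrase);
            }
        }
        return result;
    }

    // Lowercases the text and splits it on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }
}
```

If "seo" and "web" are registered as keywords of "IT" (an assumption, since the list above only has the longer phrases), calling `tag` on the example title and description with a cap of 5 gives "IT" with the keywords "web" and "seo". The work is proportional to the number of words in the post times the longest keyword length, regardless of how many categories exist.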

Solution

You may want to try a different approach, using machine learning.

Algorithm description: first, create a learning sample [how you label these documents is up to you; you can label a sample manually and use it as the input to the algorithm]. Then build a bag of words from these samples, using K words [you will need to benchmark the quality to determine which K works best; more on that below].

Each word is a "feature". Next, for each new document, find its nearest neighbor among the documents in the learning sample [that is, the one that shares the most bag-of-words "words" with it], and give the new document the same label as that nearest neighbor.
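As a rough illustration of the steps above, here is a minimal sketch in Java. It rests on assumptions that the description leaves open: the K words are taken to be the K most frequent words across the learning sample, "nearest" means the largest number of shared bag-of-words words, and `NearestNeighborTagger` with its methods is a hypothetical name. A real setup would probably also filter out stop words.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical 1-nearest-neighbor tagger over a bag of K words.
class NearestNeighborTagger {
    private final Set<String> vocabulary = new HashSet<>();
    private final List<Set<String>> sampleFeatures = new ArrayList<>();
    private final List<String> sampleLabels = new ArrayList<>();

    NearestNeighborTagger(List<String> texts, List<String> labels, int k) {
        // Count in how many sample documents each word appears.
        Map<String, Integer> counts = new HashMap<>();
        for (String text : texts) {
            for (String word : new HashSet<>(tokenize(text))) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        // Assumption: keep the K most frequent words (ties broken alphabetically).
        List<String> words = new ArrayList<>(counts.keySet());
        words.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        vocabulary.addAll(words.subList(0, Math.min(k, words.size())));

        // Each sample is reduced to the vocabulary words ("features") it contains.
        for (int i = 0; i < texts.size(); i++) {
            sampleFeatures.add(features(texts.get(i)));
            sampleLabels.add(labels.get(i));
        }
    }

    // Returns the label of the sample sharing the most features with the text,
    // or null when no sample shares any feature with it.
    String classify(String text) {
        Set<String> query = features(text);
        String bestLabel = null;
        int bestOverlap = 0;
        for (int i = 0; i < sampleFeatures.size(); i++) {
            Set<String> shared = new HashSet<>(sampleFeatures.get(i));
            shared.retainAll(query);
            if (shared.size() > bestOverlap) {
                bestOverlap = shared.size();
                bestLabel = sampleLabels.get(i);
            }
        }
        return bestLabel;
    }

    private Set<String> features(String text) {
        Set<String> result = new HashSet<>(tokenize(text));
        result.retainAll(vocabulary);
        return result;
    }

    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }
}
```

This compares the new document against every sample, which is fine for a small learning sample; with a large one, an index from word to the samples containing it would cut down the comparisons.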

How to evaluate quality? Set aside 10% of the documents in the learning sample and learn from the remaining 90% only. After learning, you can estimate the accuracy of the algorithm by checking how many of the held-out 10% it labels correctly. Note that you may need to do this several times to find the best K [the bag-of-words size], as described above.
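The evaluation loop could look roughly like the sketch below. It is deliberately independent of any particular classifier: `Trainer` is a hypothetical interface standing in for whatever learning step you use, and `HoldOutEvaluation`, `accuracy` and `bestK` are illustrative names. The random split and the averaging over several rounds are assumptions about how "do this several times" might be carried out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.Function;

// Hypothetical hold-out evaluation: learn on 90%, check accuracy on 10%.
class HoldOutEvaluation {
    // Stand-in for the learning step: returns a function from text to label.
    interface Trainer {
        Function<String, String> train(List<String> texts, List<String> labels, int k);
    }

    // Accuracy on a random 10% hold-out for one value of K.
    static double accuracy(List<String> texts, List<String> labels, int k,
                           Trainer trainer, long seed) {
        if (texts.size() != labels.size() || texts.size() < 2) {
            throw new IllegalArgumentException("need at least two labeled samples");
        }
        // Shuffle the sample indices so the hold-out set is random.
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < texts.size(); i++) order.add(i);
        Collections.shuffle(order, new Random(seed));

        // The first 10% of the shuffled indices are held out for testing.
        int testSize = Math.max(1, texts.size() / 10);
        List<String> trainTexts = new ArrayList<>();
        List<String> trainLabels = new ArrayList<>();
        for (int i = testSize; i < order.size(); i++) {
            trainTexts.add(texts.get(order.get(i)));
            trainLabels.add(labels.get(order.get(i)));
        }

        Function<String, String> classifier = trainer.train(trainTexts, trainLabels, k);
        int correct = 0;
        for (int i = 0; i < testSize; i++) {
            int sample = order.get(i);
            if (labels.get(sample).equals(classifier.apply(texts.get(sample)))) correct++;
        }
        return (double) correct / testSize;
    }

    // Tries each candidate K over several random splits and returns the K
    // with the highest mean accuracy (the first one wins on ties).
    static int bestK(List<String> texts, List<String> labels, int[] candidates,
                     int rounds, Trainer trainer) {
        int best = candidates[0];
        double bestScore = -1;
        for (int k : candidates) {
            double total = 0;
            for (int round = 0; round < rounds; round++) {
                total += accuracy(texts, labels, k, trainer, round);
            }
            double mean = total / rounds;
            if (mean > bestScore) {
                bestScore = mean;
                best = k;
            }
        }
        return best;
    }
}
```

A `Trainer` can be supplied as a lambda that builds your classifier from the 90% and returns its labeling function. With a small learning sample the 10% hold-out is only a handful of documents, so averaging over several rounds gives a steadier estimate than a single split.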
