Java – word boundary detection from text

I have a problem with word boundary recognition. I have removed all the markup from a Wikipedia document, and now I want to get a list of entities (meaningful terms). I was planning to take the bigrams and trigrams of the file and check whether each one exists in a dictionary (WordNet). Is there a better way to achieve this?

The following is a sample text. I want to identify the entities (shown in double quotes):

The Vulcans are a humanoid species in the fictional "Star Trek" universe. They evolved on the planet Vulcan and are known for their attempt to live by reason and logic, without interference from emotion. They were the first alien species to make formal contact with humans and later became one of the founding members of the United Federation of Planets.

Thank you, Barra

Solution

I think what you are describing is still an open research topic rather than a simple matter of applying well-established algorithms.

I can't give you a simple "do this" answer, but here are some pointers:

- I think using WordNet could work (I'm not sure where bigrams/trigrams come into it), but you should treat WordNet lookups as one part of a hybrid system, not as the be-all and end-all of spotting named entities.
- Start by applying some simple, common-sense criteria: sequences of capitalized words (try to accommodate frequent lowercase function words such as "of" inside them), and sequences consisting of a known title followed by capitalized words.
- Look for sequences of words that you would not statistically expect to appear next to one another by chance, and treat them as candidate entities.
- Can you build in a dynamic web lookup? For example, your system spots the capitalized sequence "IBM" and checks whether it can find a Wikipedia entry with the text pattern "IBM is... [organization | company |...]".
- See whether anything here, and in the "information extraction" literature in general, gives you some ideas: http://www-nlpir.nist.gov/related_projects/muc/proceedings/muc_7_toc.html
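To make the capitalized-sequence idea a bit more concrete, here is a minimal sketch in Java. It is only one possible reading of that heuristic: the class name `CapitalizedSequences`, the tiny `LINKERS` list of function words, and the rule that punctuation or quotes end a candidate are all assumptions made for illustration, not an established algorithm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class CapitalizedSequences {
    // Assumed, deliberately small list of lowercase function words allowed inside a name.
    private static final Set<String> LINKERS = Set.of("of", "the", "for", "de");

    /** Returns maximal runs of capitalized words, optionally joined by linker words. */
    static List<String> candidates(String text) {
        List<String> found = new ArrayList<>();
        List<String> current = new ArrayList<>();
        String[] tokens = text.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            String raw = tokens[i];
            if (raw.isEmpty()) continue;
            if (opensPhrase(raw)) flush(current, found);   // an opening quote starts a new candidate
            String word = strip(raw);
            if (isCapitalized(word)) {
                current.add(word);
            } else if (!current.isEmpty() && LINKERS.contains(word)
                    && !closesPhrase(raw) && leadsToCapitalized(tokens, i)) {
                current.add(word);                          // e.g. the "of" in "Federation of Planets"
            } else {
                flush(current, found);
            }
            if (closesPhrase(raw)) flush(current, found);  // trailing punctuation ends the candidate
        }
        flush(current, found);
        return found;
    }

    // True if only linker words stand between position i and the next capitalized word.
    private static boolean leadsToCapitalized(String[] tokens, int i) {
        for (int j = i + 1; j < tokens.length; j++) {
            if (opensPhrase(tokens[j])) return false;
            String next = strip(tokens[j]);
            if (isCapitalized(next)) return true;
            if (!LINKERS.contains(next) || closesPhrase(tokens[j])) return false;
        }
        return false;
    }

    private static boolean isCapitalized(String w) {
        return !w.isEmpty() && Character.isUpperCase(w.charAt(0));
    }

    // Removes leading and trailing characters that are neither letters nor digits.
    private static String strip(String raw) {
        return raw.replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "");
    }

    private static boolean opensPhrase(String raw) {
        return "\"(".indexOf(raw.charAt(0)) >= 0;
    }

    private static boolean closesPhrase(String raw) {
        return ".,;:!?)\"".indexOf(raw.charAt(raw.length() - 1)) >= 0;
    }

    private static void flush(List<String> current, List<String> found) {
        if (!current.isEmpty()) {
            found.add(String.join(" ", current));
            current.clear();
        }
    }

    public static void main(String[] args) {
        String text = "The Vulcans appear in \"Star Trek\" and helped found "
                + "the United Federation of Planets.";
        // Prints: [The Vulcans, Star Trek, United Federation of Planets]
        System.out.println(candidates(text));
    }
}
```

Note that sentence-initial words such as "The" or "They" are picked up as well, because they are capitalized for a different reason. This is one place where a dictionary or stop-word lookup could act as the second half of a hybrid system.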

The fact is, when you look at the literature there, people don't seem to be using terribly sophisticated, well-established algorithms. So I think there is a lot of room to look at your data, explore, and see what you can come up with. Good luck!
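As one example of that kind of exploration, the earlier suggestion about word sequences that appear together more often than chance can be tried with a simple association score. The sketch below uses pointwise mutual information (PMI) over adjacent word pairs. PMI is swapped in here as one common choice and is not something the answer prescribes; the class name `BigramPmi`, the `minCount` cutoff, and the lowercase tokenizer are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class BigramPmi {
    /** Splits text into lowercase word tokens (a crude, assumed tokenizer). */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /**
     * Scores each adjacent word pair seen at least minCount times with
     * PMI = log2( P(x,y) / (P(x) * P(y)) ). Higher means the pair co-occurs
     * more often than independent words would.
     */
    static Map<String, Double> score(List<String> tokens, int minCount) {
        Map<String, Double> scores = new HashMap<>();
        if (tokens.size() < 2) return scores;

        Map<String, Integer> unigrams = new HashMap<>();
        Map<String, Integer> bigrams = new HashMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            unigrams.merge(tokens.get(i), 1, Integer::sum);
            if (i + 1 < tokens.size()) {
                bigrams.merge(tokens.get(i) + " " + tokens.get(i + 1), 1, Integer::sum);
            }
        }

        double wordTotal = tokens.size();
        double pairTotal = tokens.size() - 1;
        for (Map.Entry<String, Integer> e : bigrams.entrySet()) {
            if (e.getValue() < minCount) continue;   // rare pairs give unreliable PMI
            String[] pair = e.getKey().split(" ");
            double pXY = e.getValue() / pairTotal;
            double pX = unigrams.get(pair[0]) / wordTotal;
            double pY = unigrams.get(pair[1]) / wordTotal;
            scores.put(e.getKey(), Math.log(pXY / (pX * pY)) / Math.log(2));
        }
        return scores;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize(
                "Star Trek is fun and Star Trek is old and fun is fun");
        // Print candidate pairs, highest PMI first.
        score(tokens, 2).entrySet().stream()
                .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
                .forEach(e -> System.out.printf("%-12s %.2f%n", e.getKey(), e.getValue()));
    }
}
```

On a real corpus the `minCount` cutoff matters a lot, since PMI tends to overrate pairs that were seen only once or twice. High-scoring pairs are only candidates and would still need to be checked against the capitalization or dictionary criteria.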
