IKAnalyzer with Lucene for Chinese word segmentation (a worked example)
1. Basic introduction
As word segmentation is applied ever more widely in information retrieval, the technique is no longer unfamiliar to most developers. Segmenting English is comparatively simple: splitting on whitespace, removing stop words, and stemming is usually enough. Chinese, by contrast, is written without word boundaries and its semantics are more complex, so it cannot be segmented that simply and is generally handled with a dedicated tool. Paoding and IKAnalyzer are two commonly used ones; here we cover the basic use of IKAnalyzer through a simple demo. IKAnalyzer is an open-source word segmentation toolkit written in Java. It is independent of the Lucene project but provides a default implementation of Lucene's Analyzer interface.
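To make the difference concrete, the sketch below implements a toy forward-maximum-matching (FMM) segmenter in plain Java. This is not how IKAnalyzer works internally (IK combines dictionary matching with ambiguity resolution); it is only a minimal illustration, under stated assumptions, of why Chinese segmentation needs a dictionary at all, since there are no spaces to split on. The dictionary words and class name are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy forward-maximum-matching segmenter: at each position, greedily take the
// longest dictionary word; fall back to a single character if nothing matches.
public class FmmDemo {
    static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(i + maxLen, text.length());
            String match = null;
            // Try the longest candidate first, shrinking the window.
            for (int j = end; j > i; j--) {
                String cand = text.substring(i, j);
                if (dict.contains(cand)) { match = cand; break; }
            }
            if (match == null) {
                match = text.substring(i, i + 1); // unknown single character
            }
            tokens.add(match);
            i += match.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("中文", "分词", "工具"));
        // "中文分词工具" (Chinese word segmentation tool) has no spaces;
        // the dictionary is what recovers the word boundaries.
        System.out.println(segment("中文分词工具", dict, 4));
        // prints [中文, 分词, 工具]
    }
}
```

Real tools such as IKAnalyzer go well beyond this greedy strategy, but the core reliance on a (customizable) dictionary is the same, which is why IK exposes the extension dictionaries configured below.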
2. Using IKAnalyzer with Lucene for simple Chinese word segmentation
Let's walk through a basic demo. The steps are as follows:
Step 1: prepare the jar dependencies, lucene-core-5.1.0.jar and ik.jar, then create a new project and add them to it. The project structure is as follows:
IkDemo
  -src
    -com.funnyboy.ik
  -IKAnalyzer.cfg.xml
  -stopword.dic
  -ext.dic
  -Referenced Libraries
    -lucene-core-5.1.0.jar
    -ik.jar
IKAnalyzer.cfg.xml: configures the extension dictionary and the stop-word dictionary, as follows:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
  <comment>IK Analyzer extension configuration</comment>
  <entry key="ext_dict">ext.dic;</entry>
  <entry key="ext_stopwords">stopword.dic;</entry>
</properties>
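Note that this file is in the standard java.util.Properties XML format (it references the stock properties.dtd), so you can sanity-check a config with the JDK alone before wiring it into IK. The sketch below is only an illustrative check, not part of IK's own loading code; the sample string mirrors the config above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Properties;

// Load an IKAnalyzer.cfg.xml-style document with the JDK's built-in
// Properties XML parser to verify the entries are well-formed.
public class CfgCheck {
    // Sample content matching the config shown in the article.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
        "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n" +
        "<properties>\n" +
        "  <comment>IK Analyzer extension configuration</comment>\n" +
        "  <entry key=\"ext_dict\">ext.dic;</entry>\n" +
        "  <entry key=\"ext_stopwords\">stopword.dic;</entry>\n" +
        "</properties>\n";

    static Properties load(String xml) throws IOException {
        Properties props = new Properties();
        props.loadFromXML(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        return props;
    }

    public static void main(String[] args) throws IOException {
        Properties props = load(SAMPLE);
        // IK treats the value as a ';'-separated list of dictionary files.
        System.out.println(props.getProperty("ext_dict"));       // ext.dic;
        System.out.println(props.getProperty("ext_stopwords"));  // stopword.dic;
    }
}
```

A common pitfall the parser will catch is the DOCTYPE keyword: it must read `SYSTEM`, not a misspelling, or loading fails.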