IKAnalyzer combined with Lucene for Chinese word segmentation (with example)

1. Basic introduction

As word segmentation becomes more and more widely used in information retrieval, the technique is no longer unfamiliar. Segmenting English is relatively simple: it can basically be done by splitting text into words, removing stop words, and extracting word stems. Chinese word segmentation, because of the complexity of the semantics, is not that simple and is usually handled by a dedicated segmentation tool; Paoding and IKAnalyzer are the ones commonly used at present. Here we cover the basic use of IKAnalyzer through a simple demo. IKAnalyzer is an open-source word segmentation toolkit written in Java; it is developed independently of the Lucene project but provides a default, optimized implementation for Lucene.
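For the English pipeline just described (split into words, drop stop words, extract stems), Lucene's stock analyzers already cover every step. Below is a minimal sketch, assuming the lucene-analyzers-common jar (which is not among the two jars used in the demo that follows) is also on the classpath; the field name and sample sentence are arbitrary.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class EnglishTokenizeDemo {
    public static void main(String[] args) throws Exception {
        // EnglishAnalyzer splits on word boundaries, removes English stop words
        // ("the", "are", ...) and applies the Porter stemmer.
        try (Analyzer analyzer = new EnglishAnalyzer()) {
            TokenStream ts = analyzer.tokenStream("content",
                    "The quick brown foxes are jumping over the lazy dogs");
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // e.g. quick, brown, fox, jump, ...
            }
            ts.end();
            ts.close();
        }
    }
}

The printed tokens are the stemmed content words, while stop words such as "the" and "are" are discarded.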

2. Ikanalyzer combined with Lucene to realize simple Chinese word segmentation

Let's explain it through a basic demo. The steps are as follows:

Step 1: prepare the required jar dependencies, lucene-core-5.1.0.jar and ik.jar, then create a new project and add the two jars to it. The project structure is as follows:

IkDemo
    -src
        -com.funnyboy.ik
        -IKAnalyzer.cfg.xml
        -stopword.dic
        -ext.dic
    -Referenced Libraries
        -lucene-core-5.1.0.jar
        -ik.jar

IKAnalyzer.cfg.xml: configure the extension dictionary and the stop-word dictionary as follows:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <entry key="ext_dict">ext.dic;</entry>
    <entry key="ext_stopwords">stopword.dic;</entry>
</properties>
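With the configuration in place, the remaining step is the segmentation code itself. The following is a minimal sketch of IKAnalyzer driving Lucene's TokenStream API, assuming the ik.jar used here exposes the usual org.wltea.analyzer.lucene.IKAnalyzer class and is a build compatible with the Lucene 5.1.0 Analyzer API (stock IK Analyzer 2012 releases target Lucene 4.x, so a patched build may be required); the package, field name, and sample sentence are only for illustration.

package com.funnyboy.ik;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class IkDemo {
    public static void main(String[] args) throws Exception {
        // true = "smart" mode (coarser-grained); false = finest-grained segmentation.
        Analyzer analyzer = new IKAnalyzer(true);
        TokenStream ts = analyzer.tokenStream("content", "我们是中国人，我爱我的祖国");
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            // Entries in ext.dic are kept as whole tokens;
            // entries in stopword.dic are filtered out.
            System.out.println(term.toString());
        }
        ts.end();
        ts.close();
        analyzer.close();
    }
}

Note that IKAnalyzer.cfg.xml, ext.dic and stopword.dic must sit on the classpath root (the src directory in the structure above), since IKAnalyzer loads its configuration from the classpath at startup.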