Lucene indexing and querying explained with examples
0 Introduction
With the growth of the World Wide Web and the arrival of the big data era, a huge amount of digital information is produced, stored, transmitted and transformed every day. Finding the information that meets a particular need in such a mass of data, organizing it and putting it to use has become a hard problem. Full-text retrieval is the most common way of querying information. In daily life we use web search engines and the search functions of blogs and forums; the core principle behind these searches is the full-text retrieval technology implemented in this paper. Now that documents are stored digitally, the effective storage and the timely, accurate extraction of information is a basic need of every company and organization.
There are many mature theories and methods for English full-text retrieval. Lucene is an open-source full-text retrieval engine library. It began as a sub-project of the Jakarta project of the Apache Software Foundation and is now a top-level Apache project. Its purpose is to give software developers a simple, easy-to-use toolkit for adding full-text retrieval to a target system. Lucene's default analysis does not segment Chinese text into words, but many open-source Chinese word segmenters can be used with it to index Chinese content. Based on a study of Lucene's core principles, this paper implements the crawling and retrieval of Chinese and English web pages respectively.
1 Lucene introduction
1.1 Introduction to Lucene
Lucene is a full-text search engine toolkit written in Java. It is built around two core functions, indexing and searching, which are independent of each other, so developers can extend either one easily. Lucene provides a rich API for interacting with the information stored in the index. Note that Lucene is not a complete full-text retrieval application; it provides indexing and search functions for applications to use. In other words, to put Lucene to work you need to do some development of your own on top of it.
Lucene's structural design resembles that of a database, but a Lucene index is very different from a database index. Both are built to make lookups faster, but a database builds indexes only on selected fields, and the data must first be converted into a fixed format before it is saved. Full-text retrieval, by contrast, indexes all of the content in a defined way. The differences and similarities between the two kinds of search are shown in Table 1-1.
Table 1-1: Comparison between database retrieval and Lucene retrieval
1.2 Overall structure of Lucene
Lucene is released as a jar file. New versions appear frequently, and the differences between versions are large. This paper uses version 5.3.1. See Table 1-2 for the main sub-packages used.
Table 1-2: Sub-packages and their functions
1.3 Lucene architecture design
Lucene is very powerful, but fundamentally it consists of two parts: the first segments the text content into words, then indexes and stores them; the second returns results according to the query criteria. In short, it builds the index and queries it.
As shown in Figure 1-1, this paper sets aside the external interfaces and information sources and focuses on indexing and querying the text content crawled from web pages.
Figure 1-1: Lucene architecture design
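The two parts described in this section, building the index and querying it, can be illustrated with a small inverted index written in plain Java. This is only a sketch of the general idea and is not Lucene's implementation or API: the class name ToyInvertedIndex, its methods, and the simple tokenizer (lower-casing and splitting on non-letter, non-digit characters) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Toy inverted index (hypothetical, not Lucene): shows the index step and the query step. */
public class ToyInvertedIndex {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    // original texts; a document id is its position in this list
    private final List<String> documents = new ArrayList<>();

    /** Simplistic word segmentation (assumption): lower-case, split on non-letter/digit runs. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Part 1, indexing: segment the text and record term -> document id. */
    public int addDocument(String text) {
        int docId = documents.size();
        documents.add(text);
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    /** Part 2, querying: ids of the documents that contain every query term (AND semantics). */
    public List<Integer> search(String query) {
        TreeSet<Integer> result = null;
        for (String term : tokenize(query)) {
            TreeSet<Integer> docs = postings.get(term);
            if (docs == null) {
                return new ArrayList<>(); // an unknown term means no document can match
            }
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs); // intersect the posting lists
            }
        }
        if (result == null) {
            return new ArrayList<>(); // the query contained no terms
        }
        return new ArrayList<>(result);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("Lucene is a full-text search toolkit");       // id 0
        index.addDocument("A database builds an index on some fields");  // id 1
        index.addDocument("Full-text search needs an index");            // id 2
        System.out.println(index.search("index"));            // prints [1, 2]
        System.out.println(index.search("full-text search")); // prints [0, 2]
    }
}
```

The query never scans the document texts; it only looks up each term's posting list and intersects them, which is the property that distinguishes an inverted index from a field-by-field scan. A real engine adds much more on top of this idea, such as relevance scoring, on-disk storage and language-specific analyzers.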
2 JDK installation and environment variable configuration
1. JDK Download:
Download the package that matches your operating system from Oracle's official website at the address below. Run the installer and follow the prompts. During installation you will be asked whether to install the JRE as well; choose Yes.
http://www.oracle.com/technetwork/java/javase/downloads/index.html