Java – finding the location of search hits in Lucene

Using Lucene, what is the recommended way to find the position of a match within the search results?

More specifically, assume that each indexed document has a field "fulltext", which stores the plain-text content of the document. Furthermore, suppose that for one of these documents the content is "fast brown fox jumps over lazy dog". Next, a search is performed for "fox dog". Obviously, this document would be quite a hit.

In this case, can Lucene be used to retrieve the matched regions within the found document? For this example, I would like to produce something like:

[{match: "fox", startIndex: 10, length: 3}, {match: "dog", startIndex: 34, length: 3}]

I suspect it could be done with what the org.apache.lucene.search.highlight package provides, but I'm not sure about the overall approach.
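For reference, the desired records for this toy example can be produced with a plain string scan; Lucene's term vectors yield the same information without rescanning the stored text. This is a minimal stdlib-only sketch, and `MatchRegions`/`find` are hypothetical names, not part of Lucene:

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: compute {match, startIndex, length} records for a set
// of query terms by scanning the raw document text.
public class MatchRegions {
    static List<String> find(String text, String... terms) {
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            int from = 0, idx;
            while ((idx = text.indexOf(term, from)) >= 0) {
                out.add("{match: \"" + term + "\", startIndex: " + idx
                        + ", length: " + term.length() + "}");
                from = idx + term.length();
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String doc = "fast brown fox jumps over lazy dog";
        System.out.println(find(doc, "fox", "dog"));
    }
}
```

This only handles exact substring matches; the point of using Lucene's term vectors instead is that they give you the analyzed terms' offsets directly from the index.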

Solution

I used TermFreqVector. This is a working demonstration that prints the term positions and the start and end offsets of each term:

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.index.TermVectorOffsetInfo;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Search {
    public static void main(String[] args) throws IOException, ParseException {
        Search s = new Search();
        s.doSearch(args[0], args[1]);
    }
    }  

    Search() {
    }  

    public void doSearch(String db, String querystr) throws IOException, ParseException {
        // 1. Specify the analyzer for tokenizing text.  
        //    The same analyzer should be used as was used for indexing  
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);  

        Directory index = FSDirectory.open(new File(db));  

        // 2. query  
        Query q = new QueryParser(Version.LUCENE_CURRENT, "contents", analyzer).parse(querystr);

        // 3. search  
        int hitsPerPage = 10;  
        IndexSearcher searcher = new IndexSearcher(index, true);
        IndexReader reader = IndexReader.open(index, true);
        searcher.setDefaultFieldSortScoring(true, false);
        TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;

        // 4. display term positions and term offsets
        System.out.println("Found " + hits.length + " hits.");
        for (int i = 0; i < hits.length; ++i) {

            int docId = hits[i].doc;
            // Requires the "contents" field to have been indexed with
            // term vectors that include positions and offsets.
            TermFreqVector tfvector = reader.getTermFreqVector(docId, "contents");
            TermPositionVector tpvector = (TermPositionVector) tfvector;
            // This part works only if there is one term in the query string;
            // otherwise you will have to iterate this section over the query terms.
            int termidx = tfvector.indexOf(querystr);
            int[] termposx = tpvector.getTermPositions(termidx);
            TermVectorOffsetInfo[] tvoffsetinfo = tpvector.getOffsets(termidx);

            for (int j = 0; j < termposx.length; j++) {
                System.out.println("termpos : " + termposx[j]);
            }
            for (int j = 0; j < tvoffsetinfo.length; j++) {
                int offsetStart = tvoffsetinfo[j].getStartOffset();
                int offsetEnd = tvoffsetinfo[j].getEndOffset();
                System.out.println("offsets : " + offsetStart + " " + offsetEnd);
            }

            // print some info about where the hit was found...  
            Document d = searcher.doc(docId);  
            System.out.println((i + 1) + ". " + d.get("path"));  
        }  

        // The searcher can only be closed once there is
        // no further need to access the documents.
        searcher.close();  
    }      
}
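Note that for getTermFreqVector to return something castable to TermPositionVector, the field must have been indexed with term vectors that include positions and offsets. With the same Lucene 3.x API as above, the indexing side would look roughly like this sketch, where `text` is the document content and `writer` is an open IndexWriter (both assumed, not shown in the original):

```java
// Sketch (Lucene 3.x): store positions and offsets at index time so that
// getTermFreqVector(docId, "contents") yields a TermPositionVector.
Document doc = new Document();
doc.add(new Field("contents", text,
        Field.Store.YES,
        Field.Index.ANALYZED,
        Field.TermVector.WITH_POSITIONS_OFFSETS));
writer.addDocument(doc);
```

Without WITH_POSITIONS_OFFSETS (or at least positions and offsets enabled), getOffsets will not have the data needed to report start and end indexes.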