Java – find the location of the search hit from Lucene
Using Lucene, what is the recommended approach for locating the matches within search results?
More specifically, assume the indexed documents have a field "fulltext", which stores the plain-text content of each document. In addition, suppose that for one of these documents the content is "fast brown fox jumps over lazy dog". Next, a search is run for "fox dog". Obviously, this document would be a hit.
In this case, can Lucene be used to report the matching regions within the found document? For this example, I would like to produce something like:
[{match: "fox", startIndex: 11, length: 3}, {match: "dog", startIndex: 31, length: 3}]
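To make the desired output concrete, here is a minimal plain-Java sketch that computes such match records directly from the raw text, without Lucene. The class `MatchLocator`, the nested `Match` type, and the simple "letters and digits" tokenizer are all assumptions made for illustration. A real Lucene analyzer may tokenize differently (stemming, stop words), so this only shows the shape of the result, not how Lucene derives it.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: locates query terms in raw text.
final class MatchLocator {

    /** One match: the matched term, its zero-based start offset, and its length. */
    static final class Match {
        final String match;
        final int startIndex;
        final int length;

        Match(String match, int startIndex, int length) {
            this.match = match;
            this.startIndex = startIndex;
            this.length = length;
        }

        @Override
        public String toString() {
            return "{match: \"" + match + "\", startIndex: " + startIndex
                    + ", length: " + length + "}";
        }
    }

    // Assumed tokenizer: a token is a run of letters or digits.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    /** Returns every token of text that equals (case-insensitively) a query term. */
    static List<Match> locate(String text, String query) {
        // Collect the lower-cased query terms.
        Set<String> terms = new HashSet<>();
        Matcher q = WORD.matcher(query);
        while (q.find()) {
            terms.add(q.group().toLowerCase(Locale.ROOT));
        }

        // Scan the text token by token and record character offsets of hits.
        List<Match> matches = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            String token = m.group().toLowerCase(Locale.ROOT);
            if (terms.contains(token)) {
                matches.add(new Match(token, m.start(), m.end() - m.start()));
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(locate("fast brown fox jumps over lazy dog", "fox dog"));
    }
}
```

For the example sentence, "fox" starts at zero-based character offset 11 and "dog" at 31, so `main` prints `[{match: "fox", startIndex: 11, length: 3}, {match: "dog", startIndex: 31, length: 3}]`.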
I suspect this can be achieved with what the org.apache.lucene.search.highlight package provides, but I'm not sure about the overall approach.
Solution
I used TermFreqVector. This is a working demonstration that prints the term positions and the start and end character offsets of each occurrence:
import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.index.TermVectorOffsetInfo;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Search {

    public static void main(String[] args) throws IOException, ParseException {
        Search s = new Search();
        s.doSearch(args[0], args[1]);
    }

    Search() {
    }

    public void doSearch(String db, String querystr) throws IOException, ParseException {
        // 1. Specify the analyzer for tokenizing text.
        //    The same analyzer should be used as was used for indexing.
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
        Directory index = FSDirectory.open(new File(db));

        // 2. query
        Query q = new QueryParser(Version.LUCENE_CURRENT, "contents", analyzer).parse(querystr);

        // 3. search
        int hitsPerPage = 10;
        IndexSearcher searcher = new IndexSearcher(index, true);
        IndexReader reader = IndexReader.open(index, true);
        searcher.setDefaultFieldSortScoring(true, false);
        TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;

        // 4. display term positions, and term indexes
        System.out.println("Found " + hits.length + " hits.");
        for (int i = 0; i < hits.length; ++i) {
            int docId = hits[i].doc;
            // The cast below requires that the "contents" field was indexed
            // with term vectors that store positions and offsets.
            TermFreqVector tfvector = reader.getTermFreqVector(docId, "contents");
            TermPositionVector tpvector = (TermPositionVector) tfvector;

            // This part works only if there is one term in the query string;
            // otherwise you will have to iterate this section over the query terms.
            // Note that indexOf returns -1 if the term is not in the vector.
            int termidx = tfvector.indexOf(querystr);
            int[] termposx = tpvector.getTermPositions(termidx);
            TermVectorOffsetInfo[] tvoffsetinfo = tpvector.getOffsets(termidx);

            for (int j = 0; j < termposx.length; j++) {
                System.out.println("termpos : " + termposx[j]);
            }
            for (int j = 0; j < tvoffsetinfo.length; j++) {
                int offsetStart = tvoffsetinfo[j].getStartOffset();
                int offsetEnd = tvoffsetinfo[j].getEndOffset();
                System.out.println("offsets : " + offsetStart + " " + offsetEnd);
            }

            // print some info about where the hit was found...
            Document d = searcher.doc(docId);
            System.out.println((i + 1) + ". " + d.get("path"));
        }

        // The searcher and reader can only be closed when there
        // is no need to access the documents any more.
        searcher.close();
        reader.close();
    }
}