Java – Lucene index problem with the "-" character

2019-04-30 • Java

I ran into a problem with a Lucene index that has indexed words containing "-" characters.

Searching works for some words that contain "-" but not for all of them, and I can't find the reason why.

The field I am searching is analyzed and contains versions of the words with and without the "-" character.

I'm using the analyzer org.apache.lucene.analysis.standard.StandardAnalyzer.

Here is an example:

If I search for "GSX-*", I get a result. The index field contains "Suzuki gsx-r 1000 gsx-r1000 gsxr".

But if I search for "V-*", I get no results. The index field of the expected result contains: "Suzuki DL 1000 v-strom dl1000v-stromvstrom V strom"

If I search for "v-strom" without the "*", it works, but if I search for just "v-str", for example, I get no results (there should be a result, because this is a live search for an online store).

So what is the difference between the two expected results? Why does it work for "GSX-" but not for "V-"?

Solution

StandardAnalyzer treats the hyphen as whitespace, I believe. So it turns your query "GSX-*" into "GSX*" and "V-*" into nothing, because it also eliminates single-letter tokens. The field content you see in the search results is the stored value of the field, which is completely independent of the terms that were indexed for that field.

So what you want is for "v-strom" as a whole to become an indexed term. StandardAnalyzer is not suited to this kind of text. Maybe try WhitespaceAnalyzer or SimpleAnalyzer. If that still doesn't cut it, you can put together your own analyzer, or start from one of those two and compose it with further TokenFilters. The Lucene analysis package Javadoc gives a very good explanation.
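To illustrate why keeping "v-strom" whole matters, here is a minimal plain-Java sketch. It is not Lucene code: the class WhitespaceTermsSketch and its methods are hypothetical stand-ins for a whitespace-only analyzer plus a trailing-wildcard query, and the sample text is a shortened version of the field content from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for an analyzer; this is not Lucene API.
class WhitespaceTermsSketch {

    // Split only on whitespace and lowercase, so "v-strom" stays one term.
    static List<String> whitespaceTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                terms.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // Prefix match, standing in for a trailing-wildcard query such as "v-str*".
    static boolean anyTermStartsWith(List<String> terms, String prefix) {
        String wanted = prefix.toLowerCase(Locale.ROOT);
        for (String term : terms) {
            if (term.startsWith(wanted)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> terms = whitespaceTerms("Suzuki DL 1000 v-strom");
        System.out.println(terms);
        System.out.println(anyTermStartsWith(terms, "V-str"));
    }
}
```

Because the hyphen is never used as a split point, the single letter "v" never becomes a token of its own, and the prefix "v-str" has a whole term to match against.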

BTW, it is not necessary to enter all the variants in the index, such as v-strom, vstrom, etc. The idea is for the same analyzer to normalize all these variants to the same string, both in the index and when parsing queries.
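The normalization idea can be sketched in plain Java as well. The rule used here (lowercase, then drop hyphens) is only an assumption chosen for illustration, and NormalizeSketch is a hypothetical name, not part of Lucene. The point is that the same function is applied to the indexed text and to the query.

```java
import java.util.Locale;

// Hypothetical helper; the normalization rule is an assumption for illustration.
class NormalizeSketch {

    // Lowercase and drop hyphens so spelling variants collapse to one string.
    static String normalize(String term) {
        return term.toLowerCase(Locale.ROOT).replace("-", "");
    }

    public static void main(String[] args) {
        String indexed = normalize("v-strom"); // applied when indexing
        String query = normalize("V-Str");     // applied when parsing the query
        System.out.println(indexed + " " + indexed.startsWith(query));
    }
}
```

In a real setup the same effect would come from using one analyzer for both the IndexWriter and the query parser, so that only one spelling needs to be stored per product name.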
