Java - Lucene index problem with "-" character
I'm having trouble with a Lucene index that contains words with "-" characters.
Searching works for some words that contain "-", but not for all of them, and I can't find the reason why.
The field I am searching in is analyzed and contains versions of the words both with and without the "-" character.
I'm using this analyzer: org.apache.lucene.analysis.standard.StandardAnalyzer
Here is an example:
If I search for "GSX-*", I get a result. The indexed field contains "Suzuki gsx-r 1000 gsx-r1000 gsxr".
But if I search for "V-*", I get no results. The indexed field of the expected result contains "Suzuki DL 1000 v-strom dl1000v-stromvstrom V strom".
If I search for "v-strom" without the "*", it works, but if I search for just "v-str", for example, I get no results. (There should be a result, because this is for the live search of an online shop.)
So what is the difference between the two expected results? Why does it work for "GSX-" but not for "V-"?
Solution
I believe StandardAnalyzer treats hyphens as whitespace. So it turns your query "GSX-*" into "GSX*" and "V-*" into nothing, because it also eliminates single-letter tokens. What you see as the field content in the search results is the stored value of the field, which is completely independent of the terms that were indexed for that field.
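To make that concrete, here is a deliberately simplified model of the behaviour described above, written in plain Java. It is not Lucene's tokenizer: the class `HyphenSplitModel` and its `terms` method are hypothetical, and the two rules it applies (split on whitespace and hyphens, drop single-character tokens) are simply the assumptions from this answer written out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified model only; this is NOT Lucene's StandardAnalyzer.
class HyphenSplitModel {

    // Lower-case the text, split it on whitespace and hyphens,
    // and drop tokens that are a single character long (assumed rule).
    static List<String> terms(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[\\s-]+")) {
            if (token.length() > 1) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Terms the model derives from the field value.
        System.out.println(terms("Suzuki DL 1000 v-strom"));
        // What is left of the two query prefixes.
        System.out.println(terms("gsx-"));
        System.out.println(terms("v-"));
    }
}
```

Under these assumptions "v-strom" never becomes a term of its own, "gsx-" still leaves the usable prefix "gsx", and "v-" leaves nothing to search for.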
So what you want is for "v-strom" as a whole to become an index term. StandardAnalyzer is not suited to this kind of text. You could try WhitespaceAnalyzer or SimpleAnalyzer. If that still doesn't cut it, you also have the option of putting together your own analyzer, or of starting from one of those two and composing it with further TokenFilters. The Lucene analysis package Javadoc gives a very good explanation.
BTW, there is no need to enter all the variants in the index, such as v-strom, V-Strom, etc. The idea is for the same analyzer to normalize all these variants to the same string, both in the index and when parsing the query.
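A minimal sketch of that idea in plain Java, without Lucene, so the names here (`SharedNormalizer`, `normalize`, `indexTerms`, `prefixSearch`) are all hypothetical. The normalization rule itself, lower-casing and removing hyphens, is only an assumption chosen for illustration; the point is that the same function runs at index time and at query time.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative stand-in for "one analyzer for both indexing and querying".
class SharedNormalizer {

    // Assumed rule: lower-case and remove hyphens, so that
    // "V-Strom", "v-strom" and "vstrom" all become "vstrom".
    static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT).replace("-", "");
    }

    // Index side: split on whitespace only, then normalize each token.
    static Set<String> indexTerms(String fieldText) {
        Set<String> terms = new LinkedHashSet<>();
        for (String raw : fieldText.trim().split("\\s+")) {
            String term = normalize(raw);
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    // Query side: normalize the user's input the same way, then prefix-match.
    static boolean prefixSearch(Set<String> terms, String userInput) {
        String prefix = normalize(userInput.trim());
        if (prefix.isEmpty()) {
            return false;
        }
        for (String term : terms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> terms = indexTerms("Suzuki DL 1000 V-Strom");
        System.out.println(terms);
        System.out.println(prefixSearch(terms, "v-str"));
        System.out.println(prefixSearch(terms, "vstr"));
    }
}
```

With this rule only one spelling needs to be stored in the field. A variant typed with a space, such as "V strom", is not covered by this sketch and would need an extra step.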