Java – add token to Lucene tokenstream
I wrote a TokenFilter that adds tokens to the stream.
Tests show that it works, but I don't fully understand why.
I would appreciate it if someone could clarify the semantics. In particular, at (*), doesn't restoring the state mean that we overwrite either the current token or the token created before capturing the state?
This is roughly what I did:
    private final LinkedList<String> extraTokens = new LinkedList<String>();
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private State savedState;

    @Override
    public boolean incrementToken() throws IOException {
        if (!extraTokens.isEmpty()) {
            // Do we not lose/overwrite the current termAtt token here? (*)
            restoreState(savedState);
            termAtt.setEmpty().append(extraTokens.remove());
            return true;
        }
        if (input.incrementToken()) {
            if (/* condition */) {
                extraTokens.add("fo");
                savedState = captureState();
            }
            return true;
        }
        return false;
    }
Does this mean that, for an input stream produced by whitespace-tokenizing the string "a b c"

    (a) -> (b) -> (c) -> ...

where bb is a new synonym of b, the graph will be constructed as follows, as long as restoreState is used?

        (a)
       /   \
    (b)     (bb)
       \   /
        (c)
         |
        ...
Attributes
Given the text foo bar baz, where fo is the stem of foo and qux is a multi-term synonym of bar baz, have I built the correct attribute table?

    +------+-------------+-----------+--------------+-----------+
    | Term | startOffset | endOffset | posIncrement | posLength |
    +------+-------------+-----------+--------------+-----------+
    | foo  | 0           | 3         | 1            | 1         |
    | fo   | 0           | 3         | 0            | 1         |
    | qux  | 4           | 11        | 0            | 2         |
    | bar  | 4           | 7         | 1            | 1         |
    | baz  | 8           | 11        | 1            | 1         |
    +------+-------------+-----------+--------------+-----------+
Solution
1.
The attribute-based API works like this: every TokenStream in the analyzer chain modifies the state of some attributes each time incrementToken() is called. The last element in your chain produces the final tokens.
Whenever the client of the analyzer chain calls incrementToken(), the last TokenStream sets the state of some attributes to whatever is needed to represent the next token. To do so, it may call incrementToken() on its input, so that the previous TokenStream does its work. This continues until the last TokenStream returns false, indicating that no more tokens are available.
captureState copies the state of all attributes of the calling TokenStream into a State object, and restoreState overwrites the state of every attribute with whatever was captured earlier (the State passed as the argument).
Your token filter works by calling input.incrementToken(), so that the previous TokenStream sets the attributes' state to the next token. Then, if the condition you defined holds (say, termAtt is "b"), it adds "bb" to a queue, saves the state somewhere, and returns true so that the client can consume the token. On the next call to incrementToken(), it does not call input.incrementToken(). Whatever the current state is, it represents the previous token, which has already been consumed. The filter then restores the saved state, so that everything is exactly as it was before, sets "bb" as the current term, and returns true so that the client can consume that token. Only on the following call does it (again) pull the next token from the previous filter.
This does not actually produce the graph you drew. It inserts "bb" after "b", so the result is really
(a) -> (b) -> (bb) -> (c)
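The call sequence described above can be traced with a toy model that does not use Lucene at all. Everything in it is hypothetical and only stands in for the real thing: `Attrs` plays the role of the shared attribute set, `copy()` and `restore()` play the roles of captureState() and restoreState(), and the filter's condition is hard-coded to the term "b". It is a sketch of the control flow, not of Lucene's implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

/** Toy model (not Lucene): every stage mutates one shared attribute object. */
final class ToyTokenChain {

    /** Stand-in for the attribute set; the same instance is reused for every token. */
    static final class Attrs {
        String term = "";
        int startOffset = 0;

        Attrs copy() {               // plays the role of captureState()
            Attrs c = new Attrs();
            c.term = term;
            c.startOffset = startOffset;
            return c;
        }

        void restore(Attrs saved) {  // plays the role of restoreState(saved)
            term = saved.term;
            startOffset = saved.startOffset;
        }
    }

    /** Source stage: overwrites the shared attributes with the next word. */
    static final class Source {
        private final Iterator<String> words;
        private final Attrs attrs;
        private int nextOffset = 0;

        Source(List<String> words, Attrs attrs) {
            this.words = words.iterator();
            this.attrs = attrs;
        }

        boolean incrementToken() {
            if (!words.hasNext()) {
                return false;
            }
            attrs.term = words.next();
            attrs.startOffset = nextOffset;
            nextOffset += attrs.term.length() + 1; // words separated by one space
            return true;
        }
    }

    /** Filter stage: after the token "b" it emits one extra token "bb". */
    static final class Filter {
        private final Source input;
        private final Attrs attrs;
        private final Deque<String> extraTokens = new ArrayDeque<>();
        private Attrs savedState;

        Filter(Source input, Attrs attrs) {
            this.input = input;
            this.attrs = attrs;
        }

        boolean incrementToken() {
            if (!extraTokens.isEmpty()) {
                // The input is NOT advanced here; the shared attributes describe
                // a token the client has already consumed, so nothing is lost.
                attrs.restore(savedState);
                attrs.term = extraTokens.remove();
                return true;
            }
            if (!input.incrementToken()) {
                return false;
            }
            if (attrs.term.equals("b")) {   // hypothetical condition
                extraTokens.add("bb");
                savedState = attrs.copy();
            }
            return true;
        }
    }

    /** Drives the chain like a client would and records term@startOffset. */
    static List<String> run(List<String> words) {
        Attrs attrs = new Attrs();
        Filter filter = new Filter(new Source(words, attrs), attrs);
        List<String> out = new ArrayList<>();
        while (filter.incrementToken()) {
            out.add(attrs.term + "@" + attrs.startOffset);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a", "b", "c"))); // [a@0, b@2, bb@2, c@4]
    }
}
```

In this model "bb" comes out right after "b" and inherits its start offset from the restored state, while "c" is only pulled from the source on the following call.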
So why save the state in the first place? When producing tokens, you want to make sure that, for example, phrase queries or highlighting work properly. When your text is "a b c" and "bb" is a synonym of "b", you expect the phrase query "b c" to work, as well as "bb c". You have to tell the index that "b" and "bb" are in the same position. Lucene uses a position increment for this, and by default the position increment is 1, meaning that every new token (read: every call to incrementToken()) comes one position after the previous one. So, with the final positions, the produced stream is

    (a:1) -> (b:2) -> (bb:3) -> (c:4)

while what you really want is

    (a:1) --> (b:2)  --> (c:3)
          \             /
           -> (bb:2) ->

So, for your filter to produce this graph, you must set the position increment of the inserted "bb" to 0:

    private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);

    // later, in incrementToken()
    restoreState(savedState);
    posIncAtt.setPositionIncrement(0);
    termAtt.setEmpty().append(extraTokens.remove());
restoreState makes sure that the other attributes, such as offsets, token type, and so on, are preserved, so you only have to change the attributes your use case requires. And yes, you are overwriting whatever state was there before restoreState, so it is your responsibility to call it in the right place. As long as you don't call input.incrementToken(), you don't advance the input stream, so you can do whatever you want with the state.
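The effect of the position increment on the final positions is just a running sum, which a few lines of plain Java can illustrate. The class and method names here are made up for the example, and positions are numbered from 1 only to match the diagrams in this answer.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: absolute positions as a running sum of position increments. */
final class PositionDemo {

    static List<String> positions(String[] terms, int[] posIncs) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        for (int i = 0; i < terms.length; i++) {
            pos += posIncs[i];            // an increment of 0 keeps the previous position
            out.add(terms[i] + ":" + pos);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] terms = {"a", "b", "bb", "c"};
        // Default increment of 1 everywhere: "bb" gets its own position.
        System.out.println(positions(terms, new int[] {1, 1, 1, 1})); // [a:1, b:2, bb:3, c:4]
        // Increment 0 for the inserted token: "bb" is stacked on "b".
        System.out.println(positions(terms, new int[] {1, 1, 0, 1})); // [a:1, b:2, bb:2, c:3]
    }
}
```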
2.
A stemmer only changes the token. It typically neither produces new tokens nor changes the position increment or offsets. Also, since the position increment says how many positions after the previous token the current term sits, qux should have an increment of 1, because it is the next token after fo, and bar should have an increment of 0, because it is in the same position as qux. The table should rather look like this:

    +------+-------------+-----------+--------------+-----------+
    | Term | startOffset | endOffset | posIncrement | posLength |
    +------+-------------+-----------+--------------+-----------+
    | fo   | 0           | 3         | 1            | 1         |
    | qux  | 4           | 11        | 1            | 2         |
    | bar  | 4           | 7         | 0            | 1         |
    | baz  | 8           | 11        | 1            | 1         |
    +------+-------------+-----------+--------------+-----------+
As a basic rule, for multi-term synonyms, where "ABC" is a synonym of "a b c", you should see that:
- positionIncrement("ABC") > 0 (the increment of the first token)
- positionIncrement(*) >= 0 (positions must not go backwards)
- startOffset("ABC") == startOffset("a") and endOffset("ABC") == endOffset("c")
- in fact, tokens at the same (start|end) position must have the same (start|end) offset
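These rules can be checked mechanically against an attribute table. The following is a rough sketch in plain Java, with no Lucene involved; `Tok` and `check` are hypothetical names, and the end position of a token is assumed to be its position plus its position length.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: checks an attribute table against the rules listed above. */
final class TokenTableCheck {

    /** One table row: term, startOffset, endOffset, posIncrement, posLength. */
    static final class Tok {
        final String term;
        final int start, end, posInc, posLen;

        Tok(String term, int start, int end, int posInc, int posLen) {
            this.term = term;
            this.start = start;
            this.end = end;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    static List<String> check(List<Tok> toks) {
        List<String> problems = new ArrayList<>();
        Map<Integer, Integer> startOffsetAt = new HashMap<>(); // start position -> start offset
        Map<Integer, Integer> endOffsetAt = new HashMap<>();   // end position -> end offset
        int pos = 0;
        for (int i = 0; i < toks.size(); i++) {
            Tok t = toks.get(i);
            if (i == 0 && t.posInc <= 0) {
                problems.add(t.term + ": increment of the first token must be > 0");
            }
            if (t.posInc < 0) {
                problems.add(t.term + ": positions must not go backwards");
            }
            pos += t.posInc;
            Integer seenStart = startOffsetAt.putIfAbsent(pos, t.start);
            if (seenStart != null && seenStart != t.start) {
                problems.add(t.term + ": start offset differs at start position " + pos);
            }
            int endPos = pos + t.posLen;
            Integer seenEnd = endOffsetAt.putIfAbsent(endPos, t.end);
            if (seenEnd != null && seenEnd != t.end) {
                problems.add(t.term + ": end offset differs at end position " + endPos);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Table from the question.
        List<Tok> asked = Arrays.asList(
                new Tok("foo", 0, 3, 1, 1), new Tok("fo", 0, 3, 0, 1),
                new Tok("qux", 4, 11, 0, 2), new Tok("bar", 4, 7, 1, 1),
                new Tok("baz", 8, 11, 1, 1));
        // Table from this answer.
        List<Tok> fixed = Arrays.asList(
                new Tok("fo", 0, 3, 1, 1), new Tok("qux", 4, 11, 1, 2),
                new Tok("bar", 4, 7, 0, 1), new Tok("baz", 8, 11, 1, 1));
        System.out.println(check(asked)); // reports qux and bar
        System.out.println(check(fixed)); // []
    }
}
```

With the table from the question, qux lands in the same start position as foo but with a different start offset, which is exactly what the last rule forbids.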
I hope this helps shed some light.