Java – ws4j returns infinity for the similarity measure that should return 1
I have a very simple piece of code adapted from this example. I use the Lin, Path and Wu-Palmer similarity measures to calculate the similarity between two words. My code is as follows:
    import edu.cmu.lti.lexical_db.ILexicalDatabase;
    import edu.cmu.lti.lexical_db.NictWordNet;
    import edu.cmu.lti.ws4j.RelatednessCalculator;
    import edu.cmu.lti.ws4j.impl.Lin;
    import edu.cmu.lti.ws4j.impl.Path;
    import edu.cmu.lti.ws4j.impl.WuPalmer;

    public class Test {
        private static ILexicalDatabase db = new NictWordNet();
        private static RelatednessCalculator lin = new Lin(db);
        private static RelatednessCalculator wup = new WuPalmer(db);
        private static RelatednessCalculator path = new Path(db);

        public static void main(String[] args) {
            String w1 = "walk";
            String w2 = "trot";
            System.out.println(lin.calcRelatednessOfWords(w1, w2));
            System.out.println(wup.calcRelatednessOfWords(w1, w2));
            System.out.println(path.calcRelatednessOfWords(w1, w2));
        }
    }
The scores are as expected, except when the two words are identical. If the two words are the same (for example, w1 = "walk"; w2 = "walk";), all three measures should return 1.0. Instead, they return 1.7976931348623157E308.
I have used ws4j before (in fact, the same version), but I have never seen this behavior. Searching online did not turn up any clues. What could be going wrong here?
P.S. That the Lin, Wu-Palmer and Path measures should return 1 can also be verified with the online demo provided by ws4j.
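For reference, the number reported above is exactly Double.MAX_VALUE, the largest finite double, not an actual infinity. The sketch below checks this without ws4j. The helper names isSentinel and capAt are made up for illustration, and capping the value at the measure's expected maximum is only an assumed workaround, not something ws4j documents:

```java
final class SentinelCheck {
    /** True when a score equals the largest finite double (hypothetical helper). */
    static boolean isSentinel(double score) {
        return score == Double.MAX_VALUE;
    }

    /** Assumed workaround: replace the sentinel with the caller's expected maximum. */
    static double capAt(double score, double expectedMax) {
        return isSentinel(score) ? expectedMax : score;
    }

    public static void main(String[] args) {
        double reported = 1.7976931348623157e308;
        System.out.println(reported == Double.MAX_VALUE); // same value as the constant
        System.out.println(Double.isInfinite(reported));  // finite, so not infinity
        System.out.println(capAt(reported, 1.0));
    }
}
```

Whether capping is appropriate depends on why the library produced this value in the first place, so treat it as a diagnostic aid rather than a fix.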
Solution
I had a similar problem. Here is what is going on; I hope others who run into this issue find this answer helpful.
If you look closely, the online demo lets you select a word sense by specifying the word in the following format: word#pos_tag#word_sense. For example, the noun gender with its first word sense is gender#n#1.
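To make the word#pos_tag#word_sense format concrete, here is a minimal parser sketch. The class SenseKey and its fields are hypothetical and not part of ws4j; it assumes the POS tag and sense number are both optional, as in a plain query such as gender.

```java
/** Parsed form of a demo-style query such as "gender#n#1" (hypothetical helper). */
final class SenseKey {
    final String word;
    final String pos;  // null when the query gives no POS tag
    final int sense;   // 0 when the query gives no sense number

    private SenseKey(String word, String pos, int sense) {
        this.word = word;
        this.pos = pos;
        this.sense = sense;
    }

    static SenseKey parse(String query) {
        String[] parts = query.trim().split("#");
        if (parts.length == 0 || parts.length > 3 || parts[0].isEmpty()) {
            throw new IllegalArgumentException("Expected word[#pos[#sense]]: " + query);
        }
        String pos = parts.length >= 2 ? parts[1] : null;
        if (pos != null && pos.isEmpty()) {
            throw new IllegalArgumentException("Empty POS tag: " + query);
        }
        // Integer.parseInt rejects a non-numeric sense with NumberFormatException.
        int sense = parts.length == 3 ? Integer.parseInt(parts[2]) : 0;
        return new SenseKey(parts[0], pos, sense);
    }

    /** True only when both the POS tag and the sense number were given. */
    boolean isFullySpecified() {
        return pos != null && sense > 0;
    }

    public static void main(String[] args) {
        SenseKey key = SenseKey.parse("gender#n#1");
        System.out.println(key.word + " / " + key.pos + " / " + key.sense);
    }
}
```

A query that is not fully specified is the case where some default has to be chosen for the missing parts, which is where the two tools can disagree.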
Your code snippet uses the first word sense by default. When I compute the Wu-Palmer similarity between gender and sex, it returns 0.26, whereas the online demo returns 1.0. But if I query the demo with gender#n#1 and sex#n#1, it also returns 0.26, so there is no real discrepancy: the online demo reports the maximum score over all POS tag / word sense pairs. Here is a code fragment that does the same and works around the problem:
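    import java.util.List;

    import edu.cmu.lti.jawjaw.pobj.POS;
    import edu.cmu.lti.lexical_db.ILexicalDatabase;
    import edu.cmu.lti.lexical_db.NictWordNet;
    import edu.cmu.lti.lexical_db.data.Concept;
    import edu.cmu.lti.ws4j.Relatedness;
    import edu.cmu.lti.ws4j.RelatednessCalculator;
    import edu.cmu.lti.ws4j.impl.Lin;
    import edu.cmu.lti.ws4j.util.WS4JConfiguration;

    // The statements below belong inside a method, e.g. main.
    ILexicalDatabase db = new NictWordNet();
    WS4JConfiguration.getInstance().setMFS(true);
    RelatednessCalculator rc = new Lin(db);
    String word1 = "gender";
    String word2 = "sex";

    // Try every POS pair the measure supports, and every sense pair within it.
    List<POS[]> posPairs = rc.getPOSPairs();
    double maxScore = -1D;
    for (POS[] posPair : posPairs) {
        List<Concept> synsets1 = (List<Concept>) db.getAllConcepts(word1, posPair[0].toString());
        List<Concept> synsets2 = (List<Concept>) db.getAllConcepts(word2, posPair[1].toString());
        for (Concept synset1 : synsets1) {
            for (Concept synset2 : synsets2) {
                Relatedness relatedness = rc.calcRelatednessOfSynset(synset1, synset2);
                double score = relatedness.getScore();
                if (score > maxScore) {
                    maxScore = score;
                }
            }
        }
    }

    // No sense pair was found at all: report 0.0 instead of the -1 marker.
    if (maxScore == -1D) {
        maxScore = 0.0;
    }
    System.out.println("sim('" + word1 + "', '" + word2 + "') = " + maxScore);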
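The difference between scoring only the first sense and taking the maximum over all sense pairs can be seen without ws4j at all. The sketch below uses a made-up 2x2 score table: 0.26 and 1.0 echo the numbers quoted above, while the table layout and the other two values are invented purely for illustration.

```java
final class SenseAggregation {
    /** Score of the first sense of each word only; 0.0 if either word has no senses. */
    static double firstSense(double[][] scores) {
        if (scores.length == 0 || scores[0].length == 0) {
            return 0.0;
        }
        return scores[0][0];
    }

    /** Maximum score over every sense pair; 0.0 if there are no pairs. */
    static double maxOverSenses(double[][] scores) {
        double max = 0.0;
        for (double[] row : scores) {
            for (double score : row) {
                max = Math.max(max, score);
            }
        }
        return max;
    }

    public static void main(String[] args) {
        // Rows: senses of word 1, columns: senses of word 2 (toy values).
        double[][] toy = {
            {0.26, 0.40},
            {0.10, 1.00},
        };
        System.out.println("first sense only: " + firstSense(toy));
        System.out.println("max over senses:  " + maxOverSenses(toy));
    }
}
```

Both numbers come from the same table, so neither is wrong; they simply answer different questions about the two words.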
Note also that this returns a similarity of 0.0 for non-stemmed word forms, for example 'genders' and 'sex'. You can use the Porter stemmer included in ws4j to make sure words are stemmed before they are matched, if needed.
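One way to apply the stemming advice is to try the word as typed first and fall back to its stem only if the lexicon does not know it. The sketch below shows that control flow in isolation. The lexicon is a toy set, and stripPlural is a naive stand-in that only removes a trailing "s"; it is not the Porter algorithm and not the stemmer shipped with ws4j, so swap in the real one in practice.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.UnaryOperator;

final class StemFallback {
    /** Returns the form to look up: the word itself if known, else its stem if known, else null. */
    static String lookupForm(String word, Set<String> lexicon, UnaryOperator<String> stemmer) {
        String lower = word.toLowerCase();
        if (lexicon.contains(lower)) {
            return lower;
        }
        String stem = stemmer.apply(lower);
        return lexicon.contains(stem) ? stem : null;
    }

    /** Naive stand-in stemmer: strips one trailing "s". NOT the Porter algorithm. */
    static String stripPlural(String word) {
        boolean plural = word.length() > 3 && word.endsWith("s") && !word.endsWith("ss");
        return plural ? word.substring(0, word.length() - 1) : word;
    }

    public static void main(String[] args) {
        Set<String> lexicon = new HashSet<>(Arrays.asList("gender", "sex")); // toy lexicon
        System.out.println(lookupForm("genders", lexicon, StemFallback::stripPlural));
        System.out.println(lookupForm("sex", lexicon, StemFallback::stripPlural));
    }
}
```

Trying the unstemmed form first avoids altering words that the lexicon already knows.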
I hope this helps!