Java – how to efficiently calculate cosine similarity between millions of strings
I need to calculate the cosine similarity between the strings in a list. For example, I have a list of more than 10 million strings, and I need the similarity of each string to every other string in the list. What is the best algorithm for doing this quickly and efficiently? Is a divide-and-conquer algorithm applicable?
edit
I want to determine which strings are most similar to a given string and obtain a similarity measure/score for them. I think what I want to do amounts to clustering. The number of clusters is not known in advance.
Solution
Use a transposed matrix. This is what Mahout does on Hadoop to complete this task quickly (or just use Mahout).
In essence, the naive way of calculating cosine similarity is bad, because you end up computing a lot of products in which one factor is 0. Instead, you are better off working by columns and leaving out all the 0s.
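To make the column-wise idea concrete, here is a minimal single-machine sketch using an inverted index (term -> documents containing it), so only non-zero entries are ever multiplied. This is not Mahout's implementation. The class `SparseCosine` and its methods are hypothetical names, and whitespace tokenization with raw term counts is an assumption; at the 10-million scale you would likely need TF-IDF weights, pruning of very frequent terms, and a distributed run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Mahout or any library.
final class SparseCosine {

    // One non-zero cell of the term-by-document matrix.
    static final class Posting {
        final int doc;
        final double weight;
        Posting(int doc, double weight) { this.doc = doc; this.weight = weight; }
    }

    // Unit-length term-frequency vector; whitespace tokenization is an assumption.
    static Map<String, Double> unitVector(String text) {
        Map<String, Double> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) tf.merge(token, 1.0, Double::sum);
        }
        double sumOfSquares = 0.0;
        for (double w : tf.values()) sumOfSquares += w * w;
        final double norm = Math.sqrt(sumOfSquares);
        if (norm > 0.0) tf.replaceAll((term, w) -> w / norm);
        return tf;
    }

    // Transposed (column-wise) layout: term -> documents in which it occurs.
    static Map<String, List<Posting>> buildIndex(List<String> docs) {
        Map<String, List<Posting>> index = new HashMap<>();
        for (int d = 0; d < docs.size(); d++) {
            final int doc = d;
            unitVector(docs.get(d)).forEach((term, w) ->
                index.computeIfAbsent(term, k -> new ArrayList<>()).add(new Posting(doc, w)));
        }
        return index;
    }

    // Cosine of the query against every document sharing at least one term with it.
    // Documents with no shared term have cosine 0 and are never touched.
    static Map<Integer, Double> scoresFor(String query, Map<String, List<Posting>> index) {
        Map<Integer, Double> scores = new HashMap<>();
        for (Map.Entry<String, Double> e : unitVector(query).entrySet()) {
            List<Posting> postings = index.get(e.getKey());
            if (postings == null) continue;
            for (Posting p : postings) {
                scores.merge(p.doc, e.getValue() * p.weight, Double::sum);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("apple banana", "banana cherry", "dog cat");
        Map<String, List<Posting>> index = buildIndex(docs);
        scoresFor("apple banana", index)
            .forEach((doc, score) -> System.out.println(doc + " -> " + score));
    }
}
```

Because the vectors are normalized up front, each accumulated dot product is already the cosine. Sorting the returned map by value gives the strings most similar to a given string, which is what the question's edit asks for.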