Java – how to efficiently calculate cosine similarity between millions of strings

2020-04-02 • Java

I need to calculate the cosine similarity between the strings in a list. For example, I have a list of more than 10 million strings, and for each string I need to determine its similarity to every other string in the list. What is the best algorithm to accomplish this efficiently and quickly? Is a divide-and-conquer algorithm applicable?

Edit

I want to determine which strings are most similar to a given string and to obtain a measure or score of that similarity. I think what I want to do is in line with clustering. The number of clusters is not known in advance.

Solution

Use a transposed matrix. This is what Mahout does on Hadoop to complete this kind of task quickly (or just use Mahout).

In essence, the naive way of calculating cosine similarity performs poorly, because you end up computing a lot of products of the form 0 * something. Instead, you are better off working column by column and leaving all the 0.0 entries out.
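The column-wise idea can be sketched on a single machine with an inverted index: each term (a column of the document-term matrix) maps to the documents that actually contain it, so only non-zero products are ever multiplied. This is a minimal illustration, not Mahout's actual implementation. The class name `SparseCosine`, the helpers `toVector`, `pairKey`, and `allPairs`, and the choice of whitespace tokens with raw term counts as weights are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SparseCosine {

    /** One non-zero cell of a column: which document, and its weight for the term. */
    static final class Posting {
        final int docId;
        final double weight;
        Posting(int docId, double weight) { this.docId = docId; this.weight = weight; }
    }

    /** Assumed vectorization: lowercased whitespace tokens with raw term counts. */
    public static Map<String, Double> toVector(String text) {
        Map<String, Double> tf = new HashMap<>();
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1.0, Double::sum);
            }
        }
        return tf;
    }

    /** Packs a pair (i, j) of non-negative document ids into one map key. */
    public static long pairKey(int i, int j) {
        return ((long) i << 32) | j;
    }

    /**
     * Returns the cosine similarity for every pair (i, j) with i < j that shares
     * at least one term. Pairs with no shared term have similarity 0 and are
     * simply absent from the result.
     */
    public static Map<Long, Double> allPairs(List<String> docs) {
        double[] norms = new double[docs.size()];
        Map<String, List<Posting>> index = new HashMap<>();

        // Build the transposed (term -> documents) view and the vector norms.
        for (int d = 0; d < docs.size(); d++) {
            double sumSquares = 0.0;
            for (Map.Entry<String, Double> e : toVector(docs.get(d)).entrySet()) {
                index.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                     .add(new Posting(d, e.getValue()));
                sumSquares += e.getValue() * e.getValue();
            }
            norms[d] = Math.sqrt(sumSquares);
        }

        // Accumulate dot products one column at a time; zero cells never appear.
        Map<Long, Double> scores = new HashMap<>();
        for (List<Posting> column : index.values()) {
            for (int a = 0; a < column.size(); a++) {
                for (int b = a + 1; b < column.size(); b++) {
                    Posting p = column.get(a);
                    Posting q = column.get(b);
                    // Postings are appended in document order, so p.docId < q.docId.
                    scores.merge(pairKey(p.docId, q.docId), p.weight * q.weight, Double::sum);
                }
            }
        }

        // Normalize each accumulated dot product by the two vector lengths.
        for (Map.Entry<Long, Double> e : scores.entrySet()) {
            int i = (int) (e.getKey() >>> 32);
            int j = (int) (e.getKey() & 0xFFFFFFFFL);
            e.setValue(e.getValue() / (norms[i] * norms[j]));
        }
        return scores;
    }
}
```

Calling `SparseCosine.allPairs(docs)` and looking up `pairKey(i, j)` gives the score for a pair. The work per column still grows quadratically with the number of documents containing that term, so at the scale of 10 million strings one would likely also need to drop very frequent terms and distribute the columns across machines, which is the part a framework such as Mahout handles.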
