Matching two very large vectors with tolerance (fast, but memory-sparing)

Suppose I have two vectors: a reference vector (list) that contains all values of interest, and a sample vector that can contain any possible value. I want to find the matches of my samples in the reference list within a certain tolerance. The tolerance is not fixed; it depends on the compared value in the sample vector:

matches: abs(((referencelist - sample[i])/sample[i])*10^6) < 0.5

Rounding both vectors is not an option!

For example, consider:

referencelist <- read.table(header=TRUE,text="value  name
154.00312  A
154.07685  B
154.21452  C
154.49545  D
156.77310  E
156.83991  F
159.02992  G
159.65553  H
159.93843  I")

sample <- c(154.00315,159.02991,154.07688,156.77312)

The result I want is:

  name     value reference
1    A 154.00315 154.00312
2    G 159.02991 159.02992
3    B 154.07688 154.07685
4    E 156.77312 156.77310

All I can come up with is the outer() function, like this:

# compare the numeric column, not the whole data frame
myDist <- outer(referencelist$value, sample, FUN = function(x, y) abs(((x - y) / y) * 10^6))
matches <- which(myDist < 0.5, arr.ind = TRUE)
data.frame(name = referencelist$name[matches[, 1]], value = sample[matches[, 2]])

or a for() loop.

My specific problem, however, is that the reference vector has about 1 * 10^12 entries and the sample vector about 1 * 10^7. With outer() I easily exceed any memory limit, and with a for() loop or nested for() loops the job would take days or weeks to finish.

Does anyone know how to do this fast in R while staying accurate, on a computer with at most 64 GB of RAM?
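To put the sizes above into numbers, here is a rough back-of-the-envelope calculation. It assumes every value is stored as an 8-byte double, which is R's default for numeric vectors; the variable names are only illustrative.

```r
n_ref    <- 1e12   # entries in the reference vector (from the question)
n_sample <- 1e7    # entries in the sample vector (from the question)
bytes_per_double <- 8

# full distance matrix as built by outer()
outer_bytes <- n_ref * n_sample * bytes_per_double
# the reference vector on its own
ref_bytes <- n_ref * bytes_per_double

outer_bytes / 1e9  # size of the outer() matrix in GB
ref_bytes / 1e9    # size of the reference vector in GB
```

Under that assumption the reference vector alone is far larger than 64 GB, so any approach would have to read it in pieces rather than hold it in memory at once.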

Thanks for your help!

Best wishes

Solution

Your match condition

abs(((referencelist - sample[i])/sample[i])*10^6) < 0.5

can be rewritten (for positive sample values) as

sample[i] * (1 - eps) < referencelist < sample[i] * (1 + eps)

with eps = 0.5E-6.
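As a quick sanity check, the two forms of the condition can be compared on the example data from the question. This is only a sketch: it assumes positive sample values, and the names ref, cond_ppm and cond_interval are made up for the illustration.

```r
eps <- 0.5e-6
ref <- c(154.00312, 154.07685, 154.21452, 154.49545, 156.77310,
         156.83991, 159.02992, 159.65553, 159.93843)
sample <- c(154.00315, 159.02991, 154.07688, 156.77312)

for (s in sample) {
  # original form: relative deviation in ppm below 0.5
  cond_ppm <- abs((ref - s) / s * 1e6) < 0.5
  # rewritten form: reference value inside an open interval around s
  cond_interval <- ref > s * (1 - eps) & ref < s * (1 + eps)
  stopifnot(identical(cond_ppm, cond_interval))
}
```

The interval form is the useful one because the bounds depend only on the sample, so they can be computed once per sample and then looked up in the reference list.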

Using this, we can use a non-equi join to find all matches (not only the nearest!) in referencelist for each sample:

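library(data.table)
options(digits = 10)
eps <- 0.5E-6 # relative tolerance: 0.5 ppm = 0.5 / 1E6
setDT(referencelist)
# the join refers to the reference values as ref, so rename the value column
setnames(referencelist, "value", "ref")
referencelist[.(value = sample, lower = sample * (1 - eps), upper = sample * (1 + eps)),
              on = .(ref > lower, ref < upper), .(name, value, reference = x.ref)]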

This reproduces the expected result given in the question.

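In response to the OP's comment: suppose we have a modified referencelist2 in which F = 154.00320. Then this additional match is caught as well:

setDT(referencelist2)[.(value = sample, lower = sample * (1 - eps), upper = sample * (1 + eps)),
                      on = .(ref > lower, ref < upper), .(name, value, reference = x.ref)]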
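For comparison, the same interval lookup can be sketched in base R with findInterval() on a sorted reference vector. This is an alternative technique, not the data.table join above; match_tol and its arguments are hypothetical names, and the sketch assumes positive sample values.

```r
# Hypothetical helper: all (sample, reference) pairs within relative tolerance eps.
match_tol <- function(ref_value, ref_name, sample, eps = 0.5e-6) {
  ord <- order(ref_value)            # findInterval() needs a sorted vector
  ref_value <- ref_value[ord]
  ref_name <- ref_name[ord]
  # index of the first reference value strictly above the lower bound
  first <- findInterval(sample * (1 - eps), ref_value) + 1L
  # index of the last reference value strictly below the upper bound
  last <- findInterval(sample * (1 + eps), ref_value, left.open = TRUE)
  n <- pmax(last - first + 1L, 0L)   # number of matches per sample
  ref_idx <- sequence(n) + rep(first, n) - 1L
  sample_idx <- rep(seq_along(sample), n)
  data.frame(name = ref_name[ref_idx],
             value = sample[sample_idx],
             reference = ref_value[ref_idx])
}

ref_value <- c(154.00312, 154.07685, 154.21452, 154.49545, 156.77310,
               156.83991, 159.02992, 159.65553, 159.93843)
ref_name <- c("A", "B", "C", "D", "E", "F", "G", "H", "I")
sample <- c(154.00315, 159.02991, 154.07688, 156.77312)

res <- match_tol(ref_value, ref_name, sample)
res
```

Each sample costs two binary searches, and the extra memory grows with the number of matches rather than with the product of the two vector lengths.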