Matching two very large vectors with tolerance (fast, but memory-sparing)
Consider that I have two vectors. One is a reference vector/list that contains all values of interest, and the other is a sample vector that can contain any possible value. Now I want to find matches of my samples in the reference list, within a certain tolerance. The tolerance is not fixed; it depends on the compared value in the sample vector:
    matches: abs(((referencelist - sample[i]) / sample[i]) * 10^6) < 0.5
Rounding the two vectors is not an option!
For example, consider:
    referencelist <- read.table(header = TRUE, text = "
      value     name
      154.00312 A
      154.07685 B
      154.21452 C
      154.49545 D
      156.77310 E
      156.83991 F
      159.02992 G
      159.65553 H
      159.93843 I")
    sample <- c(154.00315, 159.02991, 154.07688, 156.77312)
The result I want is:
      name     value reference
    1    A 154.00315 154.00312
    2    G 159.02991 159.02992
    3    B 154.07688 154.07685
    4    E 156.77312 156.77310
One thing I can do is use the outer() function, for example:
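    # relative deviation in ppm for every reference/sample pair
    myDist <- outer(referencelist$value, sample,
                    FUN = function(x, y) abs(((x - y) / y) * 10^6))
    # row index = reference entry, column index = sample entry
    matches <- which(myDist < 0.5, arr.ind = TRUE)
    data.frame(name = referencelist$name[matches[, 1]],
               value = sample[matches[, 2]])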
Or I could use a for() loop.
But my particular problem is that the reference vector has about 1 * 10^12 entries and my sample vector about 1 * 10^7. So with outer() I easily exceed all workspace limits, and with a for() loop or chained for() loops this would take days or weeks to finish.
Does anyone know how to do this in R in a way that is fast but still accurate, and that works on a computer with at most 64 GB of RAM?
Thanks for your help!
Best wishes
Solution
Your match condition
    abs(((referencelist - sample[i]) / sample[i]) * 10^6) < 0.5
can be rewritten as
    sample[i] * (1 - eps) < referencelist < sample[i] * (1 + eps)
with eps = 0.5E-6 (assuming sample[i] > 0).
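The equivalence of the two formulations can be checked on the example data from the question. This is only a minimal sketch in base R, assuming all values are positive; the names ref, ppm_match and range_match are made up for illustration:

```r
# Example data from the question
ref    <- c(154.00312, 154.07685, 154.21452, 154.49545, 156.77310,
            156.83991, 159.02992, 159.65553, 159.93843)
sample <- c(154.00315, 159.02991, 154.07688, 156.77312)
eps    <- 0.5e-6  # 0.5 ppm expressed as a fraction

# Original condition: relative deviation in ppm below 0.5
ppm_match <- outer(ref, sample,
                   function(x, y) abs((x - y) / y * 1e6) < 0.5)

# Rewritten condition: reference lies inside the sample's tolerance window
range_match <- outer(ref, sample,
                     function(x, y) x > y * (1 - eps) & x < y * (1 + eps))

# Compare the two logical matrices and list the matching pairs
identical(ppm_match, range_match)
which(range_match, arr.ind = TRUE)
```

The window form is what makes a range join possible: the bounds depend only on the sample, so they can be computed once per sample value.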
Using this, we can use a non-equi join to find all matches (not only the nearest one!) in the reference list for each sample:
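    library(data.table)
    options(digits = 10)
    eps <- 0.5E-6 # tolerance of 0.5 ppm, i.e. 0.5 / 1E6
    setDT(referencelist)
    # the join below expects the reference column to be called ref
    setnames(referencelist, "value", "ref")
    referencelist[.(value = sample,
                    lower = sample * (1 - eps),
                    upper = sample * (1 + eps)),
                  on = .(ref > lower, ref < upper),
                  .(name, value, reference = x.ref)]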
This reproduces the expected result from the question.
In response to the OP's comment: suppose we have a modified referencelist2 in which F = 154.00320; then this value is matched as well:
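    # same non-equi join, applied to referencelist2
    setDT(referencelist2)[.(value = sample,
                            lower = sample * (1 - eps),
                            upper = sample * (1 + eps)),
                          on = .(ref > lower, ref < upper),
                          .(name, value, reference = x.ref)]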
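If data.table is not an option, the same window search can be sketched in base R with findInterval() on the sorted reference values. This is a different technique from the join above, not a restatement of it. The helper name match_with_tolerance is hypothetical, the column names follow the question, and all values are assumed to be positive:

```r
# Example data from the question
referencelist <- data.frame(
  value = c(154.00312, 154.07685, 154.21452, 154.49545, 156.77310,
            156.83991, 159.02992, 159.65553, 159.93843),
  name  = LETTERS[1:9]
)
sample <- c(154.00315, 159.02991, 154.07688, 156.77312)
eps    <- 0.5e-6

# Hypothetical helper: all matches of `sample` in `ref` within the
# relative tolerance eps, found by binary search on the sorted reference.
match_with_tolerance <- function(ref, sample, eps) {
  ref   <- ref[order(ref$value), ]
  lower <- sample * (1 - eps)
  upper <- sample * (1 + eps)
  # index of the first reference value strictly greater than lower
  first <- findInterval(lower, ref$value) + 1L
  # index of the last reference value strictly smaller than upper
  last  <- findInterval(upper, ref$value, left.open = TRUE)
  n     <- pmax(last - first + 1L, 0L)  # number of matches per sample
  hit   <- which(n > 0L)
  ref_idx    <- unlist(lapply(hit, function(i) first[i]:last[i]))
  sample_idx <- rep(hit, times = n[hit])
  data.frame(name      = ref$name[ref_idx],
             value     = sample[sample_idx],
             reference = ref$value[ref_idx])
}

result <- match_with_tolerance(referencelist, sample, eps)
result
```

Memory use here grows with the number of matches rather than with the product of the two vector lengths, so the samples could in principle be passed in chunks and the results written out chunk by chunk.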