Java – an algorithm for counting shared group memberships over big data

I need to write a program that counts how many times two users are in the same group. Users are listed by user name and group ID:

john 32
john 21
jim 21
jim 32
bob 32

I want these results:

john-jim 2 
john-bob 1
jim-bob 1

It sounds trivial. But the problem is: I have 18 million groups and 300,000 users, plus many memberships (I expect an average of at least 50 per user, maybe more). That means a lot of data and processing.

I have written five different programs, none of which can cope with the amount of data. A PostgreSQL query was just as slow. Counting pairs in a Java map in working memory ran out of memory (first heap space errors, then, after some tuning, the rarer "GC overhead limit exceeded"). Continuously writing results to the database from Java was too slow (even when optimized with batch inserts). Getting more desperate, I tried stranger things, such as writing all pairs to an array, sorting it (O(n log n)), and then counting them peu à peu. But there was still too much data to keep in memory.
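For reference, a minimal sketch of the in-memory counting approach described above (names and structure are my own illustration, not the asker's actual code). Sorting each group's members gives every pair a canonical key, so "a-b" and "b-a" are never counted separately. It works on the sample data, but the `HashMap` of pair keys is exactly what exhausts the heap at 18 million groups:

```java
import java.util.*;

public class PairCounter {
    // For each group, tick a counter for every unordered pair of its members.
    // The pair map grows with the total number of distinct pairs, which is
    // what blows up the heap at scale.
    static Map<String, Integer> countPairs(Map<Integer, List<String>> groups) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> members : groups.values()) {
            List<String> sorted = new ArrayList<>(members);
            Collections.sort(sorted); // canonical order: key is always "a-b", never "b-a"
            for (int i = 0; i < sorted.size(); i++) {
                for (int j = i + 1; j < sorted.size(); j++) {
                    counts.merge(sorted.get(i) + "-" + sorted.get(j), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The sample data from the question: group 32 = {john, jim, bob}, group 21 = {john, jim}
        Map<Integer, List<String>> groups = new HashMap<>();
        groups.put(32, Arrays.asList("john", "jim", "bob"));
        groups.put(21, Arrays.asList("john", "jim"));
        // Alphabetical keys, so the pair appears as "jim-john" rather than "john-jim"
        Map<String, Integer> c = countPairs(groups);
        System.out.println("jim-john " + c.get("jim-john"));
        System.out.println("bob-john " + c.get("bob-john"));
        System.out.println("bob-jim " + c.get("bob-jim"));
    }
}
```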

Any ideas for an algorithm? Or is it impossible?

Solution

An RDBMS is specialized in sorting and operations like this. Doing it outside the database will hardly come close in performance. Do it with SQL!

This will do the job (simplified):

SELECT t1.usr || '-' || t2.usr,count(*) AS ct
FROM   usr_grp t1
JOIN   usr_grp t2 USING (grp_id) 
WHERE  t2.usr > t1.usr   -- prevent dupes and get sorted pair
GROUP  BY t1.usr,t2.usr;

As you said, depending on how many overlaps you have, this can produce a huge number of rows, so it will never be fast.
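To see why the result set balloons: each group of k members contributes k·(k−1)/2 pairs, so output size grows quadratically with group size, not linearly with the number of membership rows. A tiny sketch of the arithmetic:

```java
public class PairExplosion {
    // A group with k members emits "k choose 2" = k*(k-1)/2 unordered pairs.
    static long pairsPerGroup(long k) {
        return k * (k - 1) / 2;
    }

    public static void main(String[] args) {
        System.out.println(pairsPerGroup(3));      // the 3-member sample group: 3 pairs
        System.out.println(pairsPerGroup(10_000)); // one large group alone: 49,995,000 pairs
    }
}
```

So a handful of very popular groups can dominate the total output row count on their own.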

Ask yourself: what is the purpose of producing millions of rows that nobody can process? Are you sure the operation makes sense to begin with?

To make it faster, you can:

>Upgrade! PostgreSQL 8.4 is rather outdated by now. In particular, PostgreSQL 9.2 has had its focus on big data; you can expect better performance for jobs like this. Nobody should be running 8.4.0: for security reasons alone, you are missing a lot of bug fixes. The current point release is 8.4.17.

>Use an integer as surrogate key for users, so you handle only integers in usr_grp. That makes the table and its indexes smaller and processing faster. If the n:m table (usr_grp) has a much bigger cardinality than the table usr, this should be faster, even though it means additional joins:

SELECT u1.usr  || '-' || u2.usr,count(*) AS ct
FROM   usr_grp t1
JOIN   usr_grp t2 USING (grp_id) 
JOIN   usr u1 ON t1.usr_id = u1.usr_id
JOIN   usr u2 ON t2.usr_id = u2.usr_id
WHERE  t2.usr_id > t1.usr_id
GROUP  BY u1.usr_id,u2.usr_id;
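The same surrogate-key idea applies if any part of the pipeline stays in Java: intern each user name to a small dense integer once, and let all downstream structures hold ints instead of strings. A minimal sketch (class and method names are my own, for illustration):

```java
import java.util.*;

public class UserInterner {
    // Maps each user name to a dense integer ID; ids and names stay in sync,
    // so lookups work in both directions.
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> names = new ArrayList<>();

    int idFor(String name) {
        return ids.computeIfAbsent(name, n -> {
            names.add(n);              // new name: append and hand out the next ID
            return names.size() - 1;
        });
    }

    String nameFor(int id) {
        return names.get(id);
    }

    public static void main(String[] args) {
        UserInterner interner = new UserInterner();
        System.out.println(interner.idFor("john")); // first name seen -> 0
        System.out.println(interner.idFor("jim"));  // second name -> 1
        System.out.println(interner.idFor("john")); // repeated name -> same ID, 0
    }
}
```

Two ints per membership row are far cheaper to store, compare, and hash than two strings, which is the same reason the narrower usr_grp table above is faster in the database.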

>Create a multicolumn index (if you don't have it yet), with grp_id first. Why does that matter?

CREATE INDEX usr_grp_gu_idx ON usr_grp(grp_id,usr_id);

>Put plenty of RAM into the machine and increase the settings for work_mem and shared_buffers.

Test case

I took the numbers @OldCurmudgeon reported for his test case and created a comparable test case in PostgreSQL.

-> SQLfiddle demo.

About 250 milliseconds in this public test database. The result is unsorted (no ORDER BY) since that wasn't specified. As opposed to the 2.5 minutes reported below: a factor of 600.
