Java – an algorithm for calculating common group membership on big data
I need to write a program that calculates, for every pair of users, how many groups they are both in. Memberships are given as user name and group ID:

john 32
john 21
jim 21
jim 32
bob 32
I want as result:

john-jim 2
john-bob 1
jim-bob 1
It sounds trivial. The problem is scale: I have 18 million groups and 300,000 users, and many memberships (I expect an average of at least 50 per user, maybe more). That means a lot of data and a lot of processing.
I have written five different programs, none of which manages to cut down the amount of data: a plain PostgreSQL query is just as slow; holding everything in a Java map in working memory runs out of memory (first heap space, and after some optimization the rarer "GC overhead limit exceeded"); continuously writing to the database from Java is too slow (even when optimized with batch queries). Getting more desperate, I tried stranger things, such as writing all pairs into an array, sorting it (O(n log n)), and then counting them bit by bit. But there is still far too much data to hold in memory.
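For reference, the naive in-memory counting (the approach that exhausts the heap at 18 million groups) can be sketched as follows; class and method names are my own, not from the original program:

```java
import java.util.*;

public class PairCounter {
    // Count, for every pair of users, how many groups they share.
    // Builds group -> members, then emits each member pair once per group.
    // Memory use grows with the number of distinct pairs, which is what
    // blows the heap at large scale.
    public static Map<String, Integer> countSharedGroups(List<String[]> memberships) {
        Map<Integer, SortedSet<String>> groups = new HashMap<>();
        for (String[] m : memberships) {
            groups.computeIfAbsent(Integer.valueOf(m[1]), k -> new TreeSet<>()).add(m[0]);
        }
        Map<String, Integer> counts = new HashMap<>();
        for (SortedSet<String> members : groups.values()) {
            List<String> list = new ArrayList<>(members);
            for (int i = 0; i < list.size(); i++) {
                for (int j = i + 1; j < list.size(); j++) {
                    // Members are sorted, so each pair gets one canonical key.
                    counts.merge(list.get(i) + "-" + list.get(j), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> data = Arrays.asList(
            new String[]{"john", "32"}, new String[]{"john", "21"},
            new String[]{"jim", "21"}, new String[]{"jim", "32"},
            new String[]{"bob", "32"});
        // Contains jim-john=2, bob-jim=1, bob-john=1 (keys sorted alphabetically).
        System.out.println(countSharedGroups(data));
    }
}
```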
Any ideas for an algorithm? Or is it impossible?
Solution
An RDBMS is specialized for operations like sorting and joining. Performing this operation outside the database will hardly come close in performance. Do it with SQL!
This does the job (simplified, updated):
SELECT t1.usr || '-' || t2.usr, count(*) AS ct
FROM   usr_grp t1
JOIN   usr_grp t2 USING (grp_id)
WHERE  t2.usr > t1.usr  -- prevent dupes and get sorted pairs
GROUP  BY t1.usr, t2.usr;
As you stated, depending on how many overlaps you have, this can produce a huge number of rows. So it will never be fast.
A counter-question: what is the purpose of producing millions of rows that nobody can process? Are you sure the operation makes sense to begin with?
To make it faster, you can:
>Upgrade! PostgreSQL 8.4 is rather outdated by now. In particular, PostgreSQL 9.2 had its focus on big data; you can expect better performance for a job like this. And nobody should be running 8.4.0, for security reasons alone: you are missing a lot of bug fixes. The current point release is 8.4.17.
>Use an integer as surrogate key for users, so you deal with nothing but integers in usr_grp. That makes the table and its indexes smaller and processing faster. Since the n:m table usr_grp has a much bigger cardinality than the table usr, this should be faster, even though it means additional joins:
SELECT u1.usr || '-' || u2.usr, count(*) AS ct
FROM   usr_grp t1
JOIN   usr_grp t2 USING (grp_id)
JOIN   usr u1 ON t1.usr_id = u1.usr_id
JOIN   usr u2 ON t2.usr_id = u2.usr_id
WHERE  t2.usr_id > t1.usr_id
GROUP  BY u1.usr_id, u2.usr_id;
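The same surrogate-key idea also helps on the Java side: intern each user name to a dense int once, then store memberships as pairs of ints instead of strings. A minimal sketch (class and method names are mine):

```java
import java.util.*;

public class UserIds {
    // Assigns each distinct user name a small, dense int id on first sight,
    // so downstream code can work with ints instead of String objects.
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> names = new ArrayList<>();

    public int idOf(String user) {
        Integer id = ids.get(user);
        if (id == null) {
            id = names.size();      // next free dense id
            ids.put(user, id);
            names.add(user);
        }
        return id;
    }

    // Reverse lookup for producing the final "name-name count" output.
    public String nameOf(int id) {
        return names.get(id);
    }
}
```

With 300,000 users the ids fit comfortably in an int, and pair keys can then be packed into a single long (`(long) a << 32 | b`) if needed.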
>Create a multicolumn index (if you don't have it yet); grp_id must come first. Why does this matter?
CREATE INDEX usr_grp_gu_idx ON usr_grp(grp_id,usr_id);
>Put plenty of RAM into the machine and increase the settings for work_mem and shared_buffers.
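For work_mem this can even be done per session just before the big query; the value below is a placeholder to be sized for the actual machine, not a recommendation (shared_buffers, by contrast, lives in postgresql.conf and needs a server restart):

```
-- session-level example; '256MB' is a placeholder value
SET work_mem = '256MB';
```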
Test case
I took the numbers @OldCurmudgeon reported for his test case and created a comparable test case in PostgreSQL.
-> SQLfiddle demo.
About 250 milliseconds on this public test database. The result is not sorted (no ORDER BY) since that was not specified. Compared to the 2.5 minutes reported below, that is a factor of 600.