Java – efficiently finding duplicates in a constrained many-to-many dataset?

I have to write a bulk-operation version of an assignment task that our webapp already lets users perform through the UI.

The workflow of this task is:

1) Using the browser, the user uploads a file of the following form:

# ObjectID,CategoryID
Oid1,Cid1
Oid2,Cid1
Oid3,Cid2
Oid4,Cid2
[etc.]

The file will most likely have tens to hundreds of lines, but it could definitely have thousands of lines.

In an ideal world, a given object ID would only appear in the file once (reflecting the fact that an object can only be assigned to one category). However, since the file is created outside our control, there is no guarantee that this is actually the case, and the processing has to deal with that possibility.

2) The server will receive the file, parse it, pre-process it and display a page something like this:

723 objects to be assigned to 126 categories
142 objects not found
 42 categories not found

Do you want to continue?

[Yes]     [No]

3) If the user clicks the Yes button, the server will actually do the work.

Since I don't want to parse the file in both step (2) and step (3), as part of (2) I need to build a container that lives across requests and holds a useful representation of the data: one that lets me easily provide the numbers for the "preview" page and also lets me do the actual work efficiently. (We obviously have sessions, but we normally keep very little in-memory session state.)

There is an existing

assignObjectsToCategory(Set<ObjectId> objectIds, CategoryId categoryId)

API that is used to do the assignment through the UI. It is highly desirable that the bulk operation also use this API, because in addition to the simple assignment it runs a bunch of other business logic, and we need that same business logic to run when the bulk assignment is done.

Initially it was going to be acceptable that, if the file "illegally" specified multiple categories for a given object, the object would simply be assigned to one of the categories the file associated it with, chosen arbitrarily.

So I was initially thinking that in step (2), as I went through the file, I would build up and put into the cross-request container a Map<CategoryId,Set<ObjectId>> (specifically a HashMap, for fast lookup and insertion), and then when it was time to do the work I could just iterate over the map, pull out the associated Set<ObjectId> for each CategoryId, and pass them to assignObjectsToCategory().
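A minimal sketch of that first approach, assuming a simple line-by-line parse of the uploaded file (uploadedLines and the ObjectId/CategoryId constructors are placeholders, and the header line and "not found" handling are omitted):

Map<CategoryId,Set<ObjectId>> objectsByCategory = new HashMap<>();

for (String line : uploadedLines) {                  // one "ObjectID,CategoryID" row per line
    String[] fields = line.split(",");
    ObjectId objectId = new ObjectId(fields[0].trim());       // however the real IDs
    CategoryId categoryId = new CategoryId(fields[1].trim());  // are actually constructed
    objectsByCategory
        .computeIfAbsent(categoryId, k -> new HashSet<>())     // create the set on first use
        .add(objectId);
}
// Step (3) would then iterate the map and hand each Set<ObjectId> to assignObjectsToCategory().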

However, the requirement for how to handle duplicate ObjectIds has changed. They now have to be handled as follows:

> If an ObjectId appears more than once in the file and is always associated with the same CategoryId, assign the object to that category.
> If an ObjectId appears more than once in the file and is associated with different CategoryIds, consider that an error and mention it on the "preview" page.

This seems to break my Map<CategoryId,Set<ObjectId>> strategy, because it doesn't provide a good way to detect that the ObjectId I have just read out of the file has already been associated with a CategoryId.

So my question is: how do I most efficiently detect and track these duplicate ObjectIds?

What came to mind is using both a "forward" and a "reverse" map:

public class CrossRequestContainer
{
    ...

    Map<CategoryId,Set<ObjectId>> objectsByCategory;  // HashMap
    Map<ObjectId,List<CategoryId>> categoriesByObject; // HashMap
    Set<ObjectId> illegalDuplicates;

    ...
}

Then, as each (ObjectId, CategoryId) pair is read in, it gets put into both maps.
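A minimal sketch of that insertion step, assuming both maps have already been created as HashMaps in the container:

objectsByCategory
    .computeIfAbsent(categoryId, k -> new HashSet<>())
    .add(objectId);
categoriesByObject
    .computeIfAbsent(objectId, k -> new ArrayList<>())
    .add(categoryId);

Once the file has been completely read in, I can do: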

for (Map.Entry<ObjectId,List<CategoryId>> entry : categoriesByObject.entrySet()) {
    List<CategoryId> categories = entry.getValue();
    if (categories.size() > 1) {
        ObjectId object = entry.getKey();
        if (!all_categories_are_equal(categories)) {
            illegalDuplicates.add(object);
            // Since this is an "illegal" duplicate I need to remove it
            // from every category that it appeared with in the file.
            for (CategoryId category : categories) {
                objectsByCategory.get(category).remove(object);
            }
        }
    }
}

When this loop completes, objectsByCategory will no longer contain any "illegal" duplicates, and illegalDuplicates will contain all the "illegal" duplicates to be reported back as needed. I can then iterate over objectsByCategory, get the Set<ObjectId> for each category, and call assignObjectsToCategory() to do the assignments.
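A minimal sketch of that final pass (assignObjectsToCategory() is the existing API from above; the empty-set check just skips categories whose only objects were removed as illegal duplicates):

for (Map.Entry<CategoryId,Set<ObjectId>> entry : objectsByCategory.entrySet()) {
    Set<ObjectId> objects = entry.getValue();
    if (!objects.isEmpty()) {
        assignObjectsToCategory(objects, entry.getKey());
    }
}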

But while I think this will work, I'm worried about storing the data twice, especially when the input file is huge. I'm also worried that I'm missing something with regard to efficiency and that this will run very slowly.

Is there a way to do this that doesn't use double the memory but can still run fast? And am I missing something that, even with the double memory, would make this run a lot slower than I'm expecting?

Solution

Given the constraints you've described, I don't think there's a way to do this using much less memory.

One possible optimization, though, is to maintain a collection of categories only for those objects that are listed under multiple categories, and otherwise map each object to a single category, i.e.:

Map<CategoryId,Set<ObjectId>> objectsByCategory;  // HashMap
Map<ObjectId,CategoryId> categoryByObject; // HashMap
Map<ObjectId,Set<CategoryId>> illegalDuplicates;  // HashMap

Yes, this adds yet another container, but it will (hopefully) hold only a few entries; also, the memory requirement of the categoryByObject map is reduced (it drops one List overhead per entry).

The logic is a little more complicated, of course. When a duplicate is first discovered, the object should be removed from the categoryByObject map and added to the illegalDuplicates map. Before adding any object to the categoryByObject map, you need to check the illegalDuplicates map first.

Finally, building the objectsByCategory map in a separate loop, after the other two maps have been built, probably won't hurt performance, and it will simplify the code a bit.
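A minimal sketch of that logic, using the field names above (recordPair() and buildObjectsByCategory() are just illustrative method names, not part of any existing API):

void recordPair(ObjectId objectId, CategoryId categoryId) {
    // Already known to be an illegal duplicate: just collect the extra category.
    Set<CategoryId> illegal = illegalDuplicates.get(objectId);
    if (illegal != null) {
        illegal.add(categoryId);
        return;
    }
    CategoryId existing = categoryByObject.get(objectId);
    if (existing == null) {
        categoryByObject.put(objectId, categoryId);   // first time this object is seen
    } else if (!existing.equals(categoryId)) {
        // Conflicting category: move the object into the illegal-duplicates map.
        categoryByObject.remove(objectId);
        Set<CategoryId> categories = new HashSet<>();
        categories.add(existing);
        categories.add(categoryId);
        illegalDuplicates.put(objectId, categories);
    }
    // A repeat of the same (objectId, categoryId) pair is simply ignored.
}

// After the whole file has been read, build objectsByCategory in one pass.
void buildObjectsByCategory() {
    for (Map.Entry<ObjectId,CategoryId> entry : categoryByObject.entrySet()) {
        objectsByCategory
            .computeIfAbsent(entry.getValue(), k -> new HashSet<>())
            .add(entry.getKey());
    }
}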
