Solutions for List and Stream deduplication in Java

Problem

Internet technology is mature now and keeps moving towards decentralized, distributed and stream-based computing, so a lot of work that used to be done on the database side is being pushed to the Java side. Today someone asked: if the database column has no index, how should records be deduplicated by that column? Everyone agreed it should be done in Java, but how?

Answer

This suddenly reminded me of an article I had written before, so I dug it up. The method there is to override the hashCode and equals methods of the objects in the list, throw them into a HashSet, and take them back out. That is the dictionary-style answer we all recited when first learning Java. In interviews I have met people claiming three years of Java experience who could state the difference between Set and HashMap but could not explain how it is implemented. In other words, beginners only recite the characteristics; when you actually rely on the behaviour in a project, you have to verify it, because reciting is useless and only the result can be trusted. You need to know how HashSet actually removes the duplicates for you. From another angle: could I deduplicate without HashSet? The simplest and most direct way is to compare each element with the data already seen and append it to the end only if it is different. HashSet merely accelerates that process.

First, the object we want to deduplicate

The goal is to keep only users with distinct IDs. To avoid wrangling over details, here is the rule: just keep one record per unique ID, and don't worry about which record wins when IDs collide.
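The original listing is not preserved here, so below is a minimal sketch of what the object might look like; the class name User and the fields id and name are assumptions made only for illustration.

// A minimal sketch of the object to deduplicate (names are assumed for illustration).
public class User {
    private Long id;      // the deduplication key
    private String name;

    public User(Long id, String name) {
        this.id = id;
        this.name = name;
    }

    public Long getId() { return id; }
    public String getName() { return name; }

    @Override
    public String toString() {
        return "User{id=" + id + ", name='" + name + "'}";
    }
}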

The most intuitive way

This method uses an empty result list: while traversing the original data, an element is appended only when its ID has not appeared in the result yet.
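A minimal sketch of that idea, assuming the User class and getId() shown above:

import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Deduplicate by comparing each element against the results collected so far.
static List<User> dedupByLoop(List<User> users) {
    List<User> result = new ArrayList<>();
    for (User user : users) {
        boolean seen = false;
        for (User kept : result) {
            if (Objects.equals(kept.getId(), user.getId())) {
                seen = true;
                break;
            }
        }
        if (!seen) {
            result.add(user);   // the first occurrence of an ID wins
        }
    }
    return result;
}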

Using HashSet

Everyone who has memorized the feature list knows that HashSet deduplicates. But how does it deduplicate? One level deeper: according to the hashCode and equals methods. And how exactly does it use those two? Anyone who has never looked at the source code gets stuck here, and the interview is over.

In fact, HashSet is implemented on top of HashMap (before reading the source I always intuitively assumed the opposite, that HashMap's key set was built on HashSet). We won't expand on it here; looking at HashSet's constructor and its add method is enough to understand it.
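A slightly abridged excerpt of the JDK source (comments added) showing why add() boils down to HashMap.put():

// Abridged from java.util.HashSet.
public class HashSet<E> extends AbstractSet<E>
        implements Set<E>, Cloneable, java.io.Serializable {

    // The backing map: every element of the set is stored as a key.
    private transient HashMap<E, Object> map;

    // Dummy value shared by all keys in the backing map.
    private static final Object PRESENT = new Object();

    public HashSet() {
        map = new HashMap<>();
    }

    // add() succeeds only if the key was not already present, and
    // "already present" is decided by HashMap, i.e. by hashCode/equals.
    public boolean add(E e) {
        return map.put(e, PRESENT) == null;
    }
}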

It is now clear that HashSet's deduplication is built on HashMap, and HashMap's behaviour depends entirely on the hashCode and equals methods. So everything is out in the open: if you want to use HashSet, you have to take care of these two methods.

In this problem we deduplicate by ID, so ID is our comparison basis. Amend the class as follows:
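A sketch of the amended User, assuming ID is the only deduplication key (fields and getters as above, omitted here):

import java.util.Objects;

public class User {
    private Long id;
    private String name;

    // Two users are considered the same as long as their IDs are equal.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        User user = (User) o;
        return Objects.equals(id, user.id);
    }

    // Must be consistent with equals: equal IDs produce equal hash codes.
    @Override
    public int hashCode() {
        return Objects.hash(id);
    }
}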

Here Objects.hash(...) delegates to Arrays.hashCode(...), as shown above. Multiplying by 31 is equivalent to (x << 5) - x.
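For reference, a slightly abridged excerpt of what Objects.hash(...) ends up running:

// Abridged from java.util.Objects and java.util.Arrays.
public static int hash(Object... values) {
    return Arrays.hashCode(values);
}

public static int hashCode(Object a[]) {
    if (a == null)
        return 0;

    int result = 1;
    for (Object element : a)
        // 31 * result can be compiled down to (result << 5) - result
        result = 31 * result + (element == null ? 0 : element.hashCode());

    return result;
}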

The final implementation looks like this:
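With equals and hashCode in place, a minimal sketch of the HashSet-based deduplication could be as short as this:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

// Throw everything into a HashSet, then take it back out as a List.
static List<User> dedupByHashSet(List<User> users) {
    return new ArrayList<>(new HashSet<>(users));
}

Note that a plain HashSet does not preserve the original order; swap in a LinkedHashSet if order matters (more on that in the summary).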

Using Java Stream to remove duplicates

Back to the original question. The reason it was raised is that, once the work is brought from the database side back to the Java side, the amount of data can be large, say 100,000 rows. For a large data set the Stream API is the simplest tool, and Stream already provides a distinct() operation. So how should it be used?

distinct() does not take a lambda parameter, which means no user-defined criterion can be supplied. Fortunately the Javadoc spells out the deduplication criterion:
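Paraphrasing the JDK documentation: distinct() returns a stream consisting of the distinct elements according to Object.equals(Object), and for ordered streams the element appearing first in the encounter order is the one preserved.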

We also all recited this rule: when equals returns true, the two hashCode values must be equal. Merely reciting it feels confusing, but it stops feeling awkward once you understand HashMap's implementation: HashMap first locates the bucket by hashCode and then compares with equals.

Therefore, to deduplicate with distinct() you must override hashCode and equals, unless the default identity-based versions are exactly what you want.
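With the overridden equals/hashCode from above, a minimal sketch of the Stream version:

import java.util.List;
import java.util.stream.Collectors;

// distinct() relies on equals()/hashCode() of User, which we overrode to compare IDs.
static List<User> dedupByDistinct(List<User> users) {
    return users.stream()
                .distinct()
                .collect(Collectors.toList());
}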

So why does it behave this way? Let's click into the implementation.

Internally it is implemented with a reduce-style operation. Once you think of reduce, an idea immediately follows: implement a distinctByKey yourself. The accumulation step takes each element of the stream and checks it against a "seen" map that I keep myself: skip the element if its key is already there, keep it otherwise. The idea is really the same as the most straightforward method at the beginning.
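A commonly seen sketch of this distinctByKey idea, here written with filter() and a ConcurrentHashMap-backed "seen" map rather than a literal reduce; the helper name and usage are illustrative, not from the original article:

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Keeps only the first element whose key has not been seen before.
static <T> Predicate<T> distinctByKey(Function<? super T, ?> keyExtractor) {
    Map<Object, Boolean> seen = new ConcurrentHashMap<>();
    return t -> seen.putIfAbsent(keyExtractor.apply(t), Boolean.TRUE) == null;
}

// Usage: no need to touch equals()/hashCode() of User at all.
List<User> unique = users.stream()
                         .filter(distinctByKey(User::getId))
                         .collect(Collectors.toList());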

Of course, with a parallel stream the element that survives is not necessarily the first one; which duplicate is kept is effectively arbitrary.

The approach above is the best, non-invasive method I have found so far. But if you insist on using distinct(), then hashCode and equals have to be overridden, just as with the HashSet method.

Summary

Whether you can actually use these things only becomes clear once you practise them yourself; otherwise it is hard to produce them the moment you really need them, or you end up using them at your own risk. To use them confidently, you also need to understand the rules and the implementation principles, for example how LinkedHashSet differs from HashSet in its implementation.

A short excerpt of the LinkedHashSet source is attached:
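The original excerpt is not preserved, so here is a slightly abridged version from the JDK (comments added):

// Abridged from java.util.LinkedHashSet: it only adds constructors.
// The third constructor argument tells HashSet to back itself with a
// LinkedHashMap, which is what preserves insertion order.
public class LinkedHashSet<E> extends HashSet<E>
        implements Set<E>, Cloneable, java.io.Serializable {

    public LinkedHashSet() {
        super(16, .75f, true);
    }
}

// The package-private HashSet constructor it calls (abridged):
HashSet(int initialCapacity, float loadFactor, boolean dummy) {
    map = new LinkedHashMap<>(initialCapacity, loadFactor);
}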

Supplement:

Ways of removing duplicate data from a List collection in Java

1. Loop through all elements in the list and delete the duplicates (sketches of all four methods follow this list)

2. Kick out duplicate elements with a HashSet

3. Remove duplicate elements from an ArrayList while keeping the original order

4. Traverse the objects in the list and use list.contains(); if an element is not present yet, put it into another list
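The original snippets for these four methods are not preserved, so here are hedged sketches, assuming a simple List<String> for illustration:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;

public class ListDedupDemo {

    // 1. Loop through the list and remove later duplicates in place.
    static void removeDuplicatesInPlace(List<String> list) {
        for (int i = 0; i < list.size(); i++) {
            for (int j = list.size() - 1; j > i; j--) {
                if (list.get(j).equals(list.get(i))) {
                    list.remove(j);
                }
            }
        }
    }

    // 2. Kick out duplicates via HashSet (order is not guaranteed).
    static List<String> dedupByHashSet(List<String> list) {
        return new ArrayList<>(new HashSet<>(list));
    }

    // 3. Remove duplicates while keeping insertion order, via LinkedHashSet.
    static List<String> dedupKeepOrder(List<String> list) {
        return new ArrayList<>(new LinkedHashSet<>(list));
    }

    // 4. Traverse the list and use contains() to copy only unseen elements.
    static List<String> dedupByContains(List<String> list) {
        List<String> result = new ArrayList<>();
        for (String s : list) {
            if (!result.contains(s)) {
                result.add(s);
            }
        }
        return result;
    }
}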
