Java – how do I cross two arrays of sorted integers without duplicates?
This is an interview question. I use it as a programming exercise
Input: two sorted integer arrays a and B are allocated in ascending order and different sizes N and m respectively
Output: a sorted integer array C containing the elements appearing in a and B in ascending order
Limit: duplicate is not allowed in C
Example: for inputs a = {3,6,8,9} and B = {4,5,9,10,11}, the output should be C = {6,9}
Thank you for your answer, all! In short, there are two main approaches to this problem:
My original solution was to keep two pointers, one for each array, swap the scan array from left to right, and select the matching elements at the same time Therefore, when the element of our current array is larger than the second array, we continue to increase the pointer of the second array until we find the current first array element or hypercube (find one) I keep all matches in a single array, and once we reach the end of any input array, it returns
Another way we can do this is to scan one of the linear arrays and use binary search to find a match in the second array This would mean o (n * log (m)) time if we scan a and each of its n elements for a binary search on B (O (log (m)) time)
I have implemented two methods and conducted an experiment to see the two comparisons (details can be found here) When n has 1 million elements and M is about 70 times that of N, the binary search method seems to win
Solution
This problem is essentially reduced to a join operation, followed by a filter operation (remove duplicates and keep only internal matches)
Because the inputs have been sorted, the connection can be effectively realized through the merge join of O (O (size (a) size (b))
The filter operation will be o (n), because the connected output is sorted, and to remove duplicates, all you need to do is check whether each element is the same as the previous element Filtering only internal matches is trivial. You just discard any mismatched elements (external connections)
Parallel mechanisms (in connections and filters) have the opportunity to achieve better performance For example, the Apache pig framework on Hadoop provides a merge connection for parallel implementation
There is a clear tradeoff between performance and complexity (and thus maintainability) So I would say that a good answer to the interview question really needs to consider the performance requirements
>Setting comparison – o (nlogn) – if there is no performance problem, it is relatively slow and very simple Simple victory. > Merge connection filter – o (n) – is fast and prone to coding errors. Using if performance is a problem Ideally, try to do this with an existing library, or even use a database if appropriate. > Parallel implementation – o (n / P) – is very fast and requires other infrastructure in place. If the volume is very large, it is expected to grow, which is a major performance bottleneck
(also note that the function in intersectsortedarrays is essentially a modified merge connection, where the filter is completed during the connection, and you can filter later, although the memory capacity increases slightly)
Last thought
In fact, I suspect that most modern commercial RDBMS provide thread parallelism when implementing joins, so the Hadoop version provides machine level parallelism (distribution) From a design point of view, perhaps a good and simple solution is to put the data on the database, index on a and B (effectively sort the data), and use SQL internal connection