File based merge sorting of large data sets based on Java
Given a large dataset that is not suitable for memory, are there any libraries or APIs that perform sorting in Java?
Solution
Java provides a general sorting routine as part of a larger solution A common way to sort data is that it is too large to fit all memory. This is:
1) Read data that matches the main memory, assuming it is 1 GB
2) 1 GB quicksort (here is where Java built-in sorting is used from the Collections Framework)
3) Write sorted 1 GB disks to "chunk-1"
4) Repeat steps 1-3 until all data is complete and save each data block in a separate file Therefore, if your original data is 9 GB, there will now be 9 batches of data marked "chunk-1" through "chunk-9"
5) You now only need a final merge sort, merging the 9 sorted blocks into a fully sorted dataset Merge sorting will be very effective for these pre - sorted blocks It will basically open nine file readers (one per block), plus one file writer (for output) Then compare the first data element in each read file, select the minimum value and write it to the output file The reader enters the next data element from the selected value, repeats the 9-way comparison process of finding the minimum value, and writes the answer to the output file again This process is repeated until all data is read from all block files
6) Once you have read all the completed data in step 5, your output file now contains a fully sorted data set
Using this method, you can easily write a general "megasort" utility, which uses a file name and maxmemory parameter, and effectively sorts files by using temporary files I bet you can find at least a few implementations here, but if not, you can scroll your own as described above